A Conversation with Zhilin Yang: Focus on Productivity, Build the One Product That Is Kimi | Z Talk

ZhenFund · July 30, 2024

Resist organizational inertia.

Z Talk is ZhenFund's column for sharing perspectives.

Here we discuss the latest industry observations, cutting-edge startup dynamics, and reunite with old friends from ZhenFund. We believe in continuous learning and evolution, and that the deepest insights come from practice.

In 2023, ZhenFund invested in Moonshot AI at the angel round. Founder Zhilin Yang is a top-tier AI researcher in China. He previously worked at Meta and Google Brain, and was the first author of influential papers including Transformer-XL and XLNet.

Moonshot AI is an AIGC company focused on AGI. The company has assembled top AI talent who previously worked on Google Bard, Gemini, Pangu, and Wudao, among other large models. It has released the "super-long context" Kimi intelligent assistant and the Kimi Open Platform.

After gaining early traction and buzz, what is the summit that Moonshot AI ultimately aims to reach? Perhaps we can glimpse the answer through Yang's definition of the product, his choices on technical direction, and his thinking on organizational form.

The following is a conversation between Zhang Peng, founder and president of GeekPark, and Zhilin Yang, founder and CEO of Moonshot AI.

For a long time, Moonshot AI will not launch a second product beyond Kimi.

Because the ultimate form of an AGI product is quite certain: like a person, it can solve problems for users and accompany them.

"Entertainment needs and productivity needs may not have a clear boundary between them. They will eventually converge in one product, though everyone's path will differ," Yang said. "We want to focus on making one product and pushing it to the extreme."

His definition of the product, his choices on technical direction, and his thinking on organizational form all lead toward that ultimate goal.

"Is your entrepreneurial state more like climbing a mountain, or sailing the seas?"

"More like climbing a mountain," Yang answered. "We keep saying internally that we need to climb the stairs, not just look at the scenery."

01

Revisiting Kimi:

Greater Focus, Resisting Organizational Inertia

Q: Since starting the company, how do you evaluate Kimi's development? What score would you give your team, company, and product?

Zhilin Yang: Yes, time has indeed flown by. I'd estimate about 60 points.

From our perspective, this industry as a whole is definitely still a marathon. The technology itself may develop relatively fast, but if you look at its broad adoption, including productization and commercialization, I think we need to view this on a 10-to-20-year timescale.

Over the past year, we've basically been doing one thing: exploring early PMF (Product-Market Fit) while iterating on the model so that it can better serve user needs.

But there are still many challenges. The most critical is that the Scaling Law is still at a very early stage. We still need to see how we can scale to the next-generation model and the generation after that, and through this form a stronger PMF in the product, truly achieving higher penetration in everyone's work and life. That still involves many challenges; it's probably a longer-term effort.

Q: You entered relatively late, but from the outside it looks like you accelerated very quickly and now have strong momentum. What do you think you've done relatively well? Which choices do you think were right in getting you to where you are now?

Zhilin Yang: Good question. Strictly speaking, we can't claim to be very successful yet; at most, we've made a small start.

What we've focused on, and are still optimizing, is solving problems from the fundamentals and staying true to first principles. Just as the foundational assumption of the personal computer industry was Moore's Law, I think the foundational assumption of the AI industry is the Scaling Law.

If we look at this problem from a 10-year perspective, the question is how we can keep improving the technology and model quality, and form a stronger PMF along the way. So we plan most of our actions from this starting point. We hope our biggest investment, along with our time and energy, goes into iterating toward better models and thereby unlocking more scenarios. This is probably what we care about most.

Of course, this may also mean focus. On both technology and product, we hope to be more focused. For example, we now focus more on serving knowledge workers — such as academic researchers, internet practitioners, content creators, financial analysts, legal professionals, and others — in productivity scenarios. We've already become an assistant for tens of millions of knowledge workers to search for materials, analyze documents, and create content. We won't do too much outside productivity scenarios for now, because if you do everything, it may be hard to do anything well in the end.

Q: This is also a trade-off; you have to make some choices.

Zhilin Yang: Yes, I think startups need relatively clear priorities. For example, we focus on optimizing productivity scenarios to the extreme. In many cases the product looks like just a box, and it seems like nothing has changed, but much of the experience behind it has been heavily optimized. Of course, there is still a great deal of room for improvement.

Making choices often means cutting things. Not everything needs to be done, yet organizational inertia always pulls toward doing more and more. We need to resist that gravity and do fewer things, but do them to the extreme.

This applies to technology as well. The AI space is very large, because intelligence itself is very heterogeneous. An accountant's intelligence is completely different from a painter's or a mathematician's. So we also look at which kind of intelligence should be the current priority for our core user group and what foundational capabilities it requires, and then focus on that.

Q: Perhaps one important reason you've achieved today's results is that you were quite focused in choosing productivity. In making these trade-offs, you must have looked at the chat and companionship product direction, and your team probably discussed it. What was the logic behind ultimately choosing not to do it?

Zhilin Yang: Yes, we did discuss this question. There were a few main considerations. One was what we ultimately want to make, which is general intelligence.

Ultimately, casual chat scenarios and productivity scenarios will probably converge in the same product; it's just that the path choices will differ. The reason we do productivity first is that productivity improves intelligence quotient (IQ) faster. Today if you make a product similar to Character.ai, the vast majority of energy is not spent on optimizing IQ, because optimizing IQ probably doesn't help much with improving product retention. But if you do productivity, after optimizing IQ, you can see significant improvement in retention.

The company's mission and the product's roadmap should be tightly aligned. This is a very important reason.

Of course there may be other reasons. For example, we also observed different choices and development situations in the US market. The US market as a whole is about one to two years ahead of the Chinese market, so we can look at the development of different companies. Companies that do productivity particularly well today — whether from business scale, or from fundraising and talent attraction — are actually doing better.

Another very important reason: I think the baseline for entertainment scenarios today is very high. The development of mobile internet over the past decade has given birth to many very good entertainment products. But in the productivity dimension, there is still a great deal of foundational experience and value left to be tapped. Even the best productivity products today, I think, have not yet deeply penetrated real workflows, and this is a huge new variable that AI can bring. So this is probably an important reason why we made this choice.

02

Kimi's Goal Is

Your Ultimate Companion

Q: How do you define the Kimi product? What problem is it solving? In the long term, is it an AI workbench or what?

Zhilin Yang: I think this divides into short-term and long-term. In the short term, we hope to provide more and more intelligence in productivity scenarios. For people's most important tasks today, such as acquiring information, analyzing it, and creating content, we hope it can deliver greater value.

In the long-term, ideal case, we're essentially discussing what the ultimate form or definition of an AGI product is. This has been debated, and I think there are currently two different views.

One is that it's another me in the world. It has all your inputs, its ideas are basically the same as yours — it's essentially a replica of yourself. This "other self" can do many things in the digital world, even the physical world.

Another definition is that it may be a companion of yours: a long-term, even near-lifelong companion. This companion can also help you do many things, but it may be different from you and may offer you new perspectives. It's not completely replicating you, but maybe something like...

Q: It understands me, rather than replicates me.

Zhilin Yang: It understands you very well. I now think the second is more likely, so this is probably what we want to make.

I think it will have several relatively important characteristics.

The first: it has to be useful, able to do increasingly complex things. What it can do today is still not enough, which is why our current focus is still on further improving model capabilities. Only by improving model capabilities can we let it do more things.

Compared with a person, it is still missing many dimensions. It may have no memory and no way to do long-term planning. In my imagination, if this product is done well, it won't just complete tasks that take 10 or 20 seconds. Rather, you could let it set a quarter's OKRs, and it could complete them on its own.

Then the second important characteristic: I think very long-term trust and connection can be established between AI and humans. But the prerequisite is still the first step: it has to be useful enough. If the AI hallucinates every day and gives you many wrong conclusions, trust is hard to build. I think only by providing authenticity and accuracy in increasingly complex, long-context tasks can trust be built.

So we consider the product from these aspects.

Q: From that perspective, what we're seeing today is probably just an early, rudimentary form of the product. That also explains why we keep seeing new features being added: fundamentally, you're fighting to win a more useful position in the user's life. That position will likely expand from handling a subset of tasks to more and more over time. That seems like the path forward.

Zhilin Yang: I think there's an important milestone here: the day you realize AI is doing more than you are at work — when the AI share crosses 50% — that could be a really significant milestone.

The next milestone after that might be robots outnumbering humans, but that's in the physical world. I think the digital world will get there much earlier.

Q: So I'm curious — from Moonshot AI's perspective, you have a successful product that everyone's using today, and there may be new features down the line that solve more concrete user problems. Would you lean toward solving everything within the Kimi platform, or could there be room for other, more specialized apps to emerge?

Zhilin Yang: We'll definitely stay focused on just one app, Kimi. Because there's an important principle here: future intelligent products should serve universal needs. As I mentioned earlier — entertainment needs, productivity needs, and maybe the boundary between these two isn't even that clearly defined.

I think they should live within the same product; it's just a matter of different paths to get there. That's what I find most interesting about general intelligence — it's not limited to just one thing. But of course you can't do everything from day one. You'll still start with key scenarios and core user groups, then gradually generalize outward.

So we want to stay focused on one product and push it to the extreme.

Q: It sounds like for Kimi to reach that ultimate goal you described — a companion that truly understands the user — the starting point is helping the user do one or two things well, then gradually more and more, and through that the partnership and trust gets built. That's a worldview and roadmap.

Zhilin Yang: Yes. I think an important marker in this process is AI evolving from handling single, concrete tasks to completing tasks that would take a human several weeks — in other words, comprehensive Super Intelligence.

Today's AI can already achieve partial Super Intelligence, like reading long texts. Humans can't read millions of words at once and directly find answers to questions, because many problems can't be solved through "searching." This was a pitfall we fell into early on — we hired people to annotate long-text data, thinking once annotated, humans would learn. But you realize humans simply can't annotate it, or the efficiency is extremely low.

So today AI already surpasses humans in some capabilities, and what we need is to gradually expand that scope.

03

Long-Context Is Fundamentally

"Long Inference"

Q: Speaking of which, long-context capability was one of the first things that made Kimi stand out, and at the time Moonshot AI was somewhat contrarian in insisting this mattered. We've seen many models supporting longer context lengths since then — is this gradually becoming consensus? And is long-context the most important path toward solving that ultimate problem you described?

Zhilin Yang: Great question. Many people are indeed working on long-context now, but if we're talking about full consensus, I'm not entirely sure. There are differing views — for example, some believe there's still much to mine in short-context scenarios, and long-context may not need to be rushed.

But I think that's normal; every company has its own judgment. For us, we recognized relatively early that long-context could be critical, for several reasons:

First, if we want AI to evolve from completing tasks that take a minute or two to handling long-horizon tasks, it necessarily needs to operate within a very long context to push further.

But this is a necessary, not sufficient condition. Having long-context capability may also require strong reasoning ability, which then enables even longer context... I think it's somewhat like a spiral iteration process.

A more precise term might be long inference — the ability to perform good reasoning within a very long window. I think that's what ultimately creates massive value.

Q: So ultimately it's about the conversion ratio between the data input, instruction input, and final service delivery to the user — the higher that conversion ratio, the greater the value.

Zhilin Yang: Right, that's quite interesting. Even looking just at long-context technology, its real-world deployment has been a process of gradually improving that ratio.

In the earliest days, long-context was used for reading tasks. Reading was the first to land because it's a process of turning lots of information into very little, which is relatively easier.

For example, having AI read 10 articles and produce a summary — that's definitely less difficult than giving AI a simple instruction and asking it to work for a month while satisfying user needs.

It really is a ratio question, the ratio of input to output. That may be the more fundamental thing.

Q: Understood. Do you think long-context costs will drop rapidly? Because right now, running 2 million tokens through really is quite expensive, and this ties to which scenarios long-context can be applied to and whether it can solve high-value problems — these two questions are bound together. What's your view?

Zhilin Yang: I think a continuous decline in costs is inevitable. Recently we've developed some new technologies. On one hand, there are extreme engineering optimizations like context caching. On the other hand, we've done a lot of architectural optimization. This can reduce costs by more than an order of magnitude from where they are now.

So for a 2-million-character window that most people can use affordably, I think that's a goal we can likely achieve this year.
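Editor's illustration: the context caching mentioned above can be sketched in a few lines. This is a toy model of the general idea only, not Moonshot AI's implementation. The names `PrefixCache` and `toy_encode` are hypothetical, and the "expensive" step is replaced by a trivial word count. The point it shows is that when many questions share one long document, the costly pass over that document runs once and is reused.

```python
import hashlib


class PrefixCache:
    """Toy cache mapping a hash of a long shared prefix to its precomputed state."""

    def __init__(self, encode):
        self._encode = encode  # hypothetical expensive function: text -> state
        self._store = {}
        self.misses = 0        # how many times the expensive function actually ran

    def get_state(self, prefix):
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(prefix)
        return self._store[key]


def toy_encode(text):
    # Stand-in for the costly pass over a long document; here just a word count.
    return len(text.split())


cache = PrefixCache(toy_encode)
document = "word " * 1000  # pretend this is a very long document

answers = []
for question in ["Summarize.", "List the key claims.", "Who is the author?"]:
    state = cache.get_state(document)  # computed on the first call, reused afterwards
    answers.append((question, state))
```

In this sketch, three questions against the same document trigger only one call to `toy_encode`; a different document would produce a new cache entry.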

Q: This year?

Zhilin Yang: Yes, I think that's our goal. For a considerable period going forward, costs will keep dropping — and faster than short-context costs.

There's still much untapped potential. For example, when a human handles a very long memory or long-horizon task, they don't actually need to remember everything, right? It's a dynamic computation process — you choose what to keep and what to discard. There's massive optimization space there.

And since AI's efficiency far exceeds human efficiency today, its optimization space is huge, so the overall cost reduction speed will outpace short-context.

Q: Let's imagine from this angle — how will people use this capability in application scenarios? Before, we'd throw a book in and ask for a summary, which is very straightforward. Following your reasoning, what user scenarios will become more viable next?

Zhilin Yang: This connects to what we were just discussing. It's a process of adjusting the input-output ratio. Initially it's reading, which is currently one of the most essential, must-have scenarios. After that, it may evolve into the model's ability to reason and plan within a long window and execute multi-step tasks.

For example, if you want to research a topic today, or even just give AI a clear goal, it can execute multi-step planning, call different tools, and even have intermediate thinking and analysis processes. I think it will gradually evolve in this direction.

Of course, it's also important for multimodal. For instance, if you want to generate a long continuous video today, that likely also requires good context technology behind it.

Q: So now I understand why you said long-context may increasingly resemble long inference. Fundamentally it's not the traditional sense of "I give it this much text, it processes this much text," but rather "how long is its inference capability, how much information can it reason and create upon" — that becomes more important.

Zhilin Yang: Right, because if you only have long text but not enough brainpower (insufficient reasoning ability), there isn't much value. You definitely need both to be strong simultaneously.

Q: It will transform from a product feature into a backend capability, and that capability will produce more powerful features — that's the progression, right?

Zhilin Yang: Right, and this is also a process of exploring together with users. For example, many of today's long-context usage scenarios weren't anticipated on day one.

Even for reading, we hadn't expected it could be used for quickly getting started in a new domain, or that across different industries there might be different uses. Some users might use this feature for analysis, but if you don't provide context, the analysis isn't as good. When you do provide it, the analysis becomes more structured, more like a McKinsey-style analysis.

So I think this is a co-creation process with users — you'll constantly discover new application scenarios.

Q: Right, that's the incremental value that intelligence brings.

04

Multimodal Unification Is

General Intelligence

Q: Looking at recent technological shifts in the industry — Sora and GPT-4o that we see today — how do you view Sora's video generation capability? Will this be a capability that Kimi particularly values in the future?

Zhilin Yang: This is definitely important. Because for general intelligence, it must be multimodal — it's hard to imagine a unimodal general intelligence. So I think ultimately different modalities will definitely converge into a unified model. Of course, we can see technological development along two different dimensions right now.

The first dimension is continuously rising intelligence. Looking at Sora and GPT-4o, their intelligence improvements exist but aren't dramatically significant. If you ask them to do IQ tests or more complex tasks, they still can't do them. So this direction definitely needs continued investment — I think it's the most important direction.

The other dimension is continuously expanding modalities. For example, video modality and voice modality now, and potentially perception data, action data, even robot modalities in the future. The value here is that (the model) can complete more scenarios, provide richer interaction methods, and help products cross the chasm — making the technology truly easy to use and adopted by more and more people.

These are two different dimensions, but ultimately they will unify.

Q: Multimodal capabilities like GPT-4o are definitely being researched by all model companies. But for video generation technology like Sora, does it sit on the intelligence growth line? Or is it more about delivering service to users? What makes this line important?

Zhilin Yang: Actually this question was already debated in the era of pure language models.

I remember that between 2019 and 2020, there was an important debate: should language models focus on understanding or generation? At first you had models like BERT, and later the GPT series. GPT might have been better at generation, but BERT was always more efficient with the same compute — meaning greater benchmark gains per unit of compute.

During that period, everyone was focused on BERT, thinking that understanding alone was sufficient, that most industrial value lay in understanding. But this overlooked something crucial: if you want to do understanding really well, you actually need to do generation really well. These two problems are ultimately one problem.

It's the same for video. Today we want to do great video generation partly because video generation itself has high value, especially for content creators and users. But I think what's more important is this: if you can optimize the generation objective function extremely well, it will ultimately lead to better understanding as well.

I think text has already taught us a major lesson. Over the past few years, there was lots of debate at first, but it basically became consensus: understanding and generation cannot be separated. It's hard to train a purely understanding-focused model. In the end, these two will probably just be one model.

Q: Lately we've seen academic discussions suggesting that Scaling Law and the Transformer architecture may lead to the future, yet some leading scientists express less confidence, arguing that fundamental changes are still needed. This leaves us outsiders somewhat puzzled. Zhilin, as someone who went from young scholar to founder today — with scholarly and entrepreneurial worldviews that differ — how do you reconcile these perspectives? How do you view the judgments coming from academia?

Zhilin Yang: I think it works like this: academia's job is to find the correct first principles. Industry builds on first principles to execute as well as possible. But execution doesn't mean pure execution — it may require significant innovation, just at different levels.

First principles represent innovation at the most fundamental level. So academic debates center on questions like: is Scaling Law correct? Is Next Token Prediction correct? I think these questions are all meaningful, deserve discussion, deserve challenge, deserve fresh perspectives. Everyone will have different views. After all, neural networks weren't getting much attention three or four decades ago, or even two or three decades ago — people didn't think it was a promising technical direction.

I think this is academia's greatest value. Industry's value, or what it needs to do, is to solve the most important problems within a given technical direction or first principle. For example, even if Scaling Law is a first principle, many problems remain unsolved along the way: how to generate data, how to build multimodal models, how to create data flywheels? These all need solutions, but this is never about inventing a new first principle — they work and innovate at different levels. That's my understanding. Academia probably needs more debate, needs people to propose new challenges and ideas. But industry is about how to solve, faster and better, the major technical challenges at the second level, building upon first principles.

But I don't think the overall conflict is that large right now. For example, Yann LeCun has been talking about world models, and today's large language models are actually a special case of world models. So I don't think there's that much conflict. For us, it's about exploring the limits of intelligence within the Scaling Law framework. But as human technology advances, there will always be new technical directions proposed — though I think this belongs more to the mission of pure academic research, a different level entirely.

05

For Startups,

Dynamic Reaction Matters More Than Long-Term Prediction

Q: When we talked last year, you mentioned that in the large model era, startups need not just product and technical innovation but possibly organizational innovation too, because building products today involves far more system variables than before — models, data, users, and so on. A year later, do you have any concrete results on the organizational innovation front?

Zhilin Yang: I think this remains an ongoing process, because organizations themselves need time to grow.

Often when we see American companies moving faster than us, it's partly because their overall AI capabilities are stronger, but also because they've invested enormous time in building their organizations. Not just hiring the best people, but developing mechanisms that let these people innovate within this paradigm. Chinese companies often started a bit later. There are two different types of companies here: ones that pivoted from other businesses to this new one, where the new business may need a different organizational approach; and others that are truly from 0 to 1, which may carry less organizational debt but still need to explore what works. So overall I think we still need more time. We've made some progress, but there's definitely still huge room for improvement.

Q: This is definitely important, but it needs longer-term study.

Zhilin Yang: Right, because when we think about technology, we're fundamentally looking at how technology gets produced. It's produced by people, people working with the means of production.

Q: So people are the first principle of technology, or rather, human organization is the first principle behind technology.

Zhilin Yang: Yes, I think so. So we pay close attention to how we can recruit the best talent, especially on the technical side. This is foundational to doing technology well.

Q: So does recruiting take up a significant portion of your time right now?

Zhilin Yang: Yes, the overall proportion is quite high, because this remains our core fuel for growth.

Q: What kind of people do you personally spend the most time recruiting?

Zhilin Yang: Right now the main focus is still technical talent. This relates to company priorities — for us, the most important thing is still doing technology well, because only with better technology can we unlock more product scenarios, achieve better retention, better commercialization. Everything is built on doing technology better. So while we already have some good people in this area, we definitely need to keep strengthening and attracting even better people to join us.

Q: Over the past six months or so, or since you started the company more than a year ago, what have you gotten right and what have you gotten wrong?

Zhilin Yang: I think prediction is generally very hard. So what's more important is rapid adjustment. Because AI develops so quickly, it's often difficult to predict. For example, what will models be capable of next year? This is extremely hard to answer. You may have some sense, some judgment, but what matters most is reacting to new variables. These variables may come from the market, from new iterations after extensive experimentation, or from user feedback. In any case, reacting very quickly to new variables is probably the most important thing.

If I had to name predictions, I think a few trends have played out roughly as we initially expected. For example, context length has kept increasing, and video generation capabilities — being able to generate minute-level video — this trend has matched our expectations.

But some timing judgments weren't quite as accurate. For instance, Sora launched earlier than we expected. Though perhaps not actually earlier, since achieving true Product-Market Fit may still take some time.

Q: Because we haven't gotten to use it yet.

Zhilin Yang: Right, because we haven't used it. Truly reaching PMF may still take time. Most of the current intelligence gains probably come from this generation of models, like GPT-4, with better post-training. But GPT-5 seems to be on a later timeline than originally predicted. So I think accurate timing predictions are very difficult.

Q: What I'm hearing is that you don't focus much on distant predictions. You'd rather develop continuous, rapid, effective reasoning — taking the next action as soon as each variable emerges, rather than betting on something far out.

Zhilin Yang: Dynamic, rapid reaction — this is also something we as a small startup can do better.

Q: Rather than placing a big bet on something far away.

Zhilin Yang: You certainly need some directional judgment, achieving long-term certainty and walking firmly toward a goal — this affects whether your execution can be done well. But for planning at the month-level scale, I think allowing some flexibility reduces your probability of making mistakes.

06

Climbing a Mountain, or Sailing the Seas?

Q: I'll end with a slightly whimsical question — I'd like your intuitive answer first, don't reason through it immediately. Use system one first, then system two. Do you feel your work today, your entrepreneurial state, is more like climbing a mountain or sailing the seas?

Zhilin Yang: Probably more like climbing a mountain.

Q: Okay, now let's reason through it — why more like climbing a mountain?

Zhilin Yang: My first reaction was mountain climbing because we've always felt we're climbing stairs, not sightseeing. This is something we constantly say internally, so it was my first reaction. But thinking with system two now — I've never actually sailed, but I imagine sailing as being on an ocean where even after hundreds of kilometers, what you see is basically the same. Your goal is clear, but what you see...

Q: Progress isn't clear.

Zhilin Yang: Yes, without good positioning technology.

Q: Your frame of reference isn't clear.

Zhilin Yang: Right, you can't clearly see different things. Nothing around you changes, and that loneliness would be stronger. But with mountain climbing, every step you feel yourself improving. You can feel that the model's capabilities are genuinely a bit better than a few months ago, your retention is a bit higher than a few months ago. Your view is different. When you measure your distance to the goal, you have a better sense of it. So I think it's something with clearer progress.

That's one side of it. On the other hand, I think AI development is also a gradual, step-by-step process. You might go from 10^24 to 10^25, 10^26, 10^27. Even assuming everyone is at 10^25, your training efficiency can keep improving — you can get more intelligence out of every unit of compute. It's like taking a few more steps up the mountain. So overall, it feels more like climbing.

Source: Founder Park
