Moonshot AI Founder Zhilin Yang's Latest Take: Deep Reflections on OpenAI's o1 Paradigm Shift | Z Talk

ZhenFund · September 19, 2024

The Next Phase of Foundation Models: A New Paradigm?

Z Talk is ZhenFund's column for sharing ideas and perspectives.

In 2023, ZhenFund led the angel round in Moonshot AI. Founder Zhilin Yang is one of China's top AI researchers. He previously worked at Meta and Google Brain, and was first author on landmark papers including Transformer-XL and XLNet.

Moonshot AI is an AIGC company focused on AGI. It has assembled top AI talent who previously worked on Google Bard, Gemini, Pangu, and Wudao, among other large models. The company has released the "ultra-long context" Kimi intelligent assistant and the Kimi Open Platform.

When OpenAI released o1, it reignited industry-wide discussion about how large models can break through data and compute bottlenecks. At a recent sharing session at Tianjin University's Xuanhuai College, Zhilin Yang offered his answers and reflections.

Here are the key takeaways from that session:

After scaling laws, the next paradigm for large model development is reinforcement learning.
OpenAI's o1 model uses reinforcement learning to try to break through the data wall, and compute is visibly shifting toward inference.
What determines the ceiling of this generation of AI technology is fundamentally the ceiling of text model capabilities.
AI product capabilities are determined by model capabilities — this is fundamentally different from the internet era. If the model isn't strong, the product experience won't be good.
The super-app of the AI era will most likely be an AI assistant.

By Guo Xiaojing, Tencent Tech

Edited by Zheng Kejun

OpenAI's release of o1 has once again sparked industry discussion about new paradigms for large model evolution.

The focus of debate centers on two widely acknowledged bottlenecks: the data bottleneck — we're running out of data; and the compute bottleneck — 32,000 GPUs is currently the ceiling.

But the o1 model seems to have found a new path. It employs reinforcement learning, attempting to overcome these limitations through deeper thinking and reasoning, improving data quality and computational efficiency.

On whether this new paradigm can push large model competition into a new phase, Moonshot AI founder Zhilin Yang has some fresh, in-depth thoughts.

Last week, Yang gave a talk at Tianjin University's Xuanhuai College. Tencent Tech, as a media partner, was first to synthesize his sharing.

Yet no one can precisely predict how the industry will evolve. On the path of innovation, what's more often needed is the boldness to experiment and the courage to repeatedly face failure.

At the end of his talk, Yang quoted Thinking, Fast and Slow author Daniel Kahneman:

"Often you're willing to work on something you don't know, because you don't know how much you don't know — that's what gives you the courage to do it. When you do it, you'll discover many new problems, and perhaps that is the meaning of innovation."

Below is the transcript of the sharing (edited for length):

Today I'll mainly share some thoughts on the AI industry's development.

Artificial intelligence has developed for over seventy years, passing through many stages. From 2000 to 2020, AI was mainly concentrated in vertical domains — giving rise to many companies in facial recognition, autonomous driving, and so on. At their core, these companies were working on vertical tasks, built for specific purposes.

These were labor-intensive, customized systems. That was the core paradigm of AI before: "You reap what you sow. Want a watermelon, plant a watermelon. You'll never get beans from melon seeds."

This paradigm has changed dramatically in recent years. Instead of training highly specific AI models, we're now training general intelligence.

What's the benefit of general intelligence? The same model can be applied across different industries and different tasks, with greatly improved generalization, so the potential space is enormous.

If it eventually reaches human-level performance across many domains, it could create leverage for societal GDP, because everyone's productivity would increase. What originally produced one unit of productivity could now, with general AI helping you do all kinds of tasks, multiply by 1.x, or even two times, ten times — depending on how far general intelligence develops.

01 Three Factors Behind General Models

Why have general models suddenly emerged in recent years? I think it's both inevitable and contingent. Inevitable in that human technological development would eventually reach this point.

But contingent because it happened to satisfy three factors:

First, over twenty years of internet development provided massive training data for AI. The internet essentially digitized the world and human thought — turning what every person produced and every idea in every person's mind into data.

This is quite coincidental. I imagine when people started building internet products like search engines or portals around 2000, they never imagined this data would one day contribute to the next generation of human civilization. In the tech tree of development, the internet was a prerequisite node for AI.

Second, many technologies in computing were also prerequisite nodes for AI. For instance, to get a sufficiently intelligent model requires on the order of 10^25 FLOPs of computation.

But to complete that many floating-point operations simultaneously in a single cluster, within a controllable timeframe — that was impossible ten years ago.

This depended on chip technology development, networking technology development. Not just chips computing faster, but connecting chips together, with sufficient bandwidth, sufficient memory — all these technologies stacked together to compute to 10^25 in two or three months.

If it took two or three years to compute 10^25, we might not be able to train current models, because the iteration cycle would be so long. Each training failure might mean waiting several more years, so you'd only be able to train models one or two orders of magnitude smaller. But one or two orders of magnitude fewer floating-point operations wouldn't produce the intelligence we have now — this is what the so-called scaling law dictates.
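The arithmetic behind this point can be sketched in a few lines. This is a back-of-the-envelope illustration only: the 10^25 FLOPs budget and the "two or three months" versus "two or three years" figures come from the talk, while the specific day counts (75 and 912.5) and the function names are assumptions chosen for the example.

```python
SECONDS_PER_DAY = 86_400

def sustained_flops_needed(total_flops: float, days: float) -> float:
    """Sustained cluster throughput (FLOP/s) needed to finish in `days`."""
    return total_flops / (days * SECONDS_PER_DAY)

def days_to_train(total_flops: float, cluster_flops_per_sec: float) -> float:
    """Wall-clock days a cluster needs for a given compute budget."""
    return total_flops / cluster_flops_per_sec / SECONDS_PER_DAY

# Assumed durations: about 2.5 months versus about 2.5 years for 1e25 FLOPs.
fast_cluster = sustained_flops_needed(1e25, days=75)
slow_cluster = sustained_flops_needed(1e25, days=912.5)

print(f"fast cluster: {fast_cluster:.2e} FLOP/s sustained")
print(f"slow cluster: {slow_cluster:.2e} FLOP/s sustained")

# On the slow cluster, a run one order of magnitude smaller (1e24 FLOPs)
# is what fits into roughly three months of iteration time.
print(f"1e24 FLOPs on slow cluster: {days_to_train(1e24, slow_cluster):.1f} days")
```

Under these assumptions, a cluster that would need years for 10^25 FLOPs can only iterate at a practical pace on runs about ten times smaller, which is the trade-off described above.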

Third, algorithmic improvements. The Transformer architecture was invented in 2017. Initially it was a translation model, somewhat specialized. Later many people expanded it toward more general concepts, and eventually everyone discovered Transformer is a highly general architecture.

No matter what kind of data, no matter what you're trying to learn, as long as it can be expressed digitally, Transformer can learn it. And this generality manifests in excellent scaling properties.

With a more traditional architecture — say recurrent neural networks or convolutional neural networks — you might find that at 1 billion parameters or more, adding parameters or compute stops helping. But for Transformer, as long as you keep adding, it keeps improving, with almost no visible ceiling. This architecture made general learning possible. Just keep feeding data into the model and defining your learning objective.

These three things combined produced the general models we see today — and all three were indispensable.

We find it remarkable: human technological development always stands on the shoulders of predecessors.

There's a book called The Nature of Technology — highly recommended! Technology basically evolves through combinatorial processes. Each generation of technology can be seen as a combination of several previous generations. But some combinations produce far greater power than others — the three I just mentioned form an extraordinarily powerful combination that produces general models. Yet before OpenAI, perhaps no one imagined these three things combined could generate such immense power.

02 Three Layers of AGI Challenges

Given these three elements, I think for AGI there are three layers:

The bottom layer is scaling laws — the first level of innovation opportunity, discovered and pushed to the extreme by OpenAI.

The second level of innovation opportunity: within the scaling law framework, there are unsolved problems — for instance, how to put all modalities into unified representations within the same model? This is the second-level challenge.

Meanwhile, although the internet has developed for over twenty years, data is ultimately finite. The total accumulated data still isn't enough. Now everyone faces a problem: the data wall. There's no more data to train on.

Let me give an example. Suppose we want to build an AI with strong mathematical ability. The question we should ask is: what data would help me learn mathematical ability? Existing digitized math problems are quite few, and most data on the internet isn't related to math.

Now the good data has been largely exhausted. It's hard for any individual or company to say, today I can find ten times more data than the internet to train with. So we hit the data wall. Solving second-level problems would yield second-level opportunities, or returns.

Third-level problems: for example, being able to handle longer contexts, having stronger reasoning or instruction-following capabilities — these are third-level problems.

The bottom layer is first principles. With first principles, it's a 0 versus 1 essential difference. Above first principles, there are many second-level problems where core technologies need to be solved. Many people are now working on second-level core technologies. Doing this well can make technology go from merely feasible to highly usable, and at large scale.

If you look at steam engine development, it was the same. First the theorem was discovered, first principles established. But in the process of deploying steam engines, initially the power wasn't good enough, or costs were too high. Basically all new technologies face these two problems when they emerge.

We just mentioned an important problem: the data wall. In this situation, according to first principles, we need to keep training larger models, keep adding more data — so there's a conflict here.

Natural data has been exhausted. How can we add more data? How can we keep scaling? This involves a paradigm shift.

What we did before was simple: just predict the next token. This itself contains a lot of reasoning, knowledge.

For example, take the sentence "The municipality closest to Beijing is Tianjin." The language model takes the preceding text as input and predicts whether the final word is Tianjin or Chongqing, etc. It makes predictions. With enough predictions, it knows it's Tianjin. Through this prediction, knowledge is absorbed into the model — it learns knowledge.
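As a rough illustration of how repeated prediction turns into stored knowledge, here is a toy count-based predictor. It is a minimal sketch, not how a Transformer works: the three-sentence corpus, the fixed three-word context, and the function names are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: the fact appears twice, a wrong variant once.
corpus = [
    "the municipality closest to beijing is tianjin",
    "the municipality closest to beijing is tianjin",
    "the municipality closest to beijing is chongqing",
]

def train(sentences, context_size=3):
    """Count which word follows each context of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            counts[context][tokens[i]] += 1
    return counts

def predict(counts, prefix, context_size=3):
    """Return the most frequent next word for the prefix, or None if unseen."""
    context = tuple(prefix.split()[-context_size:])
    followers = counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = train(corpus)
print(predict(counts, "the municipality closest to beijing is"))
```

The more often the correct continuation appears in the data, the more firmly the prediction settles on it, which is the sense in which prediction absorbs knowledge into the model.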

Another type of task: say you read a detective novel, read the first nine chapters, and in the final chapter need to predict who the murderer is. Correctly naming the murderer is still next-word prediction as described above. Suppose the final chapter contains a sentence that, after a chain of reasoning, reveals who the murderer is; if the model predicts that name correctly, it has learned to reason.

With enough such data, it learns reasoning. It can learn reasoning, learn knowledge, learn many other tasks. If you take all searchable data and have it keep predicting the next word, its intelligence keeps rising, reasoning ability keeps strengthening, knowledge keeps growing.

Here we can distinguish three different types of things that can be learned:

First, very low-entropy content: factual knowledge has almost no entropy, so it is simply memorized.

Second, reasoning processes. Like the reasoning process in a detective novel has medium entropy — there may be multiple reasoning paths that arrive at the same result.

Third, say some creative tasks. If you want to write a novel now, it's not deterministic; its entropy is very high.

These different things can all be learned within the same framework through the single objective of predicting the next word. Just doing this one thing enables learning all of them. Everything is learned in the same way, without having to distinguish whether you're learning from Xiaohongshu or Wikipedia, so it's very general. This is the foundation of general intelligence.
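The three entropy levels can be made concrete with Shannon entropy over next-word distributions. The distributions below are invented for illustration and are not measurements from any model; only the low, medium, and high ordering reflects the talk.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-word distributions, chosen only to show the ordering.
fact = [0.98, 0.01, 0.01]         # a fact: one continuation dominates
reasoning = [0.4, 0.3, 0.2, 0.1]  # several valid reasoning paths
creative = [1 / 50] * 50          # fiction: many continuations equally plausible

for name, dist in [("fact", fact), ("reasoning", reasoning), ("creative", creative)]:
    print(f"{name}: {entropy_bits(dist):.2f} bits")
```

The same next-word objective applies to all three cases; what differs is how spread out the target distribution is.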

03 OpenAI's Release of o1 Marks the Emergence of a New Paradigm

The next paradigm is doing this through reinforcement learning. Why reinforcement learning? Because as I said, natural data is no longer sufficient. Recently OpenAI released o1, marking the shift from the earlier paradigm (next-token prediction on natural data) to the new one (reinforcement learning), because the earlier paradigm's data is insufficient. As I said, there are only so many math problems in the world. If we want to improve at math, what can we do?

We can keep generating more problems, then solve them ourselves, some correctly, some incorrectly, then learn which were correct and which were wrong, and you can keep improving. This is essentially the reinforcement learning process.
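The generate, solve, and check loop described here can be sketched with a toy example. This is not o1's training procedure, which the talk does not detail; it only illustrates the idea of keeping verified attempts as new data. The addition problems, the 30% error rate, and all function names are assumptions.

```python
import random

def make_problem(rng):
    """Generate a new problem: here, a simple addition."""
    return rng.randint(1, 99), rng.randint(1, 99)

def noisy_solver(problem, rng, error_rate=0.3):
    """Stand-in for a model: usually right, sometimes off by one."""
    a, b = problem
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer

def verify(problem, answer):
    """Automatic check that tells correct attempts from wrong ones."""
    a, b = problem
    return answer == a + b

def collect_verified(num_problems, attempts_per_problem, seed=0):
    """Keep correct attempts as training data and count the rejected ones."""
    rng = random.Random(seed)
    kept, rejected = [], 0
    for _ in range(num_problems):
        problem = make_problem(rng)
        for _ in range(attempts_per_problem):
            answer = noisy_solver(problem, rng)
            if verify(problem, answer):
                kept.append((problem, answer))
            else:
                rejected += 1
    return kept, rejected

kept, rejected = collect_verified(num_problems=100, attempts_per_problem=4)
print(f"kept {len(kept)} verified examples, rejected {rejected}")
```

In a real system the kept examples would feed back into training and the solver would improve over rounds; this sketch stops at the data-collection step.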

Its paradigm is somewhat different from what I described before. Before, we were finding natural data to predict the next word. Now, after step one we have a reasonably good base model, so it can keep playing with itself, generating lots of data, then learning the good and discarding the bad. Through this method, we create a lot of data.

For example, if you look at o1, it generates a lot of thinking in the middle. What's the core purpose of this thinking? It's essentially a data generation process. Because this data doesn't naturally exist in the world. For instance, when a brilliant mathematician proves a new theorem, or solves some math problem, or competes in a math competition — they only write out the answer, not the thinking process. So such data doesn't naturally exist.

Now we want AI to generate the thinking process that originally existed only in the human brain, and then learn from that thinking process to generalize better. For example, if you give a student a very difficult problem and they just learn the solution directly, they don't actually understand what's happening. What they need is someone to explain each step and why the approach works; that is the thinking process. If they can learn the thinking process, next time they encounter a somewhat different problem, they can still solve it.

But if they only learned the solution, they can only do the exact same problem type each time. They can only say, today I'll solve a quadratic equation, using the same method every time, memorizing this problem type.

If they can learn the thinking process, it's like having a brilliant teacher constantly showing you what the thinking process looks like. You learn the thinking process, and your generalization ability improves. And through this process, you generate more data that doesn't naturally exist — it's itself an excellent supplement. After generating data, this scaling can continue.

And this scaling has also changed somewhat. Originally most scaling happened during training — finding a bunch of data and training on it. But now most computation, or increasingly more computation, will shift to the inference phase, because now there's thinking involved. So the thinking process itself also requires compute, and is itself something that can be scaled — gradually adding more compute to the inference side.
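One simple way to see why extra inference compute can help is repeated sampling with a checker. This is a deliberately simplified model, not a description of how o1 spends its thinking time: it assumes independent attempts, a perfect verifier, and a hypothetical per-attempt success rate.

```python
def solve_probability(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

# Hypothetical hard problem: each attempt succeeds 5% of the time.
for attempts in (1, 4, 16, 64):
    print(attempts, round(solve_probability(0.05, attempts), 3))
```

Under these assumptions, the success rate rises steadily as more attempts, and therefore more inference compute, are spent on the same problem.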

This also makes sense. For example, if today you want a person to complete a more complex task, they definitely need more time. You can't expect them to prove the Riemann hypothesis in one or two seconds. To prove the Riemann hypothesis, they might need to think for several years.

An important next step is how to define increasingly complex tasks. In these more complex tasks, the way models interact with humans may change somewhat — possibly from the current fully synchronous form, to some degree becoming asynchronous, allowing it to spend some time looking up materials, then analyzing and thinking, and finally giving you a report, rather than immediately giving you an answer. This allows it to complete more complex tasks, essentially combining inference-phase scaling laws with reinforcement learning.

04 The Ceiling of This Generation of AI Technology Is Fundamentally the Ceiling of Text Model Capabilities

I think what determines the ceiling of this generation of AI technology is still fundamentally the ceiling of text model capabilities. If text models can continuously improve in intelligence, they will be able to do increasingly complex tasks. It's somewhat like a learning process: starting with elementary school problems, gradually doing middle school, university, and now some have PhD-level knowledge and reasoning capabilities.

As text models keep improving, the ceiling of this generation of AI will be very high. I think text models determine the value ceiling of this generation of AI technology, so continuously improving text model capabilities is very important. And as long as scaling laws can continue, this can likely keep improving.

The horizontal axis is adding more modalities; "multimodal models" are widely discussed now. This covers visual input, visual output, audio input and output, and even arbitrary conversions between these modalities. For example, you express product requirements through a drawing, the requirements directly become code, and the code automatically incorporates generated video as a landing page. This task spans multiple modalities; today's AI still can't fully do this. Probably in one or two years modalities can be combined.

Ultimately how well these modalities combine depends on the brain — how strong the text model is. Because the middle requires very complex planning: planning what to do first, and in step two discovering the result differs from what was expected, being able to adjust on the fly, in step three deciding not to do it this way and switching to another approach.

This actually requires very strong thinking and planning capabilities, needing to maintain consistency over very long windows, instruction-following, reasoning ability — these are all determined by the text model ceiling.

But these two things are horizontal and vertical. Multimodal capability is more horizontal development — being able to do more and more things. Text models are more vertical development, determining how intelligent this AI is. Only when it's intelligent can AI do many things.

But if it's very intelligent but has no eyes, the things it can do are also limited. These are two different dimensions. Of course both dimensions will improve simultaneously in the coming period. In the next two to three years, I think there's very high probability both aspects will improve in parallel, so this can wrap everything together. If everything is wrapped together, that's what we call AGI.

I just mentioned that every new technology faces two problems after emerging: the effect isn't very good, and costs are too high. For AI it's the same, but the good news is that efficiency improvement has been quite astonishing.

First it appears in the training phase. For example, today to train a GPT-4 level model, the training cost is just a fraction of what it was two years ago. Even if done well, possibly 1/10 the cost can train a model with equivalent intelligence.

Meanwhile, inference costs keep falling. This year compared to last year, the cost per unit of intelligence produced at inference has basically dropped by an order of magnitude, and next year will likely drop by another order of magnitude. This makes AI business models more viable — the cost of obtaining intelligence keeps getting lower, while the intelligence produced keeps getting higher. For users, ROI keeps increasing, so more and more people will use AI. I think this is a very important trend.

These two important trends combined: on one hand, more and more intelligence is obtained in training; on the other hand, intelligence can be used by people more and more cheaply, so larger-scale deployment becomes possible. Of course models will continue developing. I think if we look at OpenAI o1, an important improvement is that now it can complete some tasks that humans would need to think about for a long time — it's not answering a simple question, but thinking for 20 seconds.

Of course this 20 seconds is because computers think faster; if a human thought the same content, they might need one or two hours. Computers can compress very long processes, being able to complete tasks of increasingly long duration. I think this is an important trend.

05 Three Core Capabilities of Next-Generation Models

Going forward, you'll see AI perhaps able to do tasks at the minute level or even hour level, while switching between different modalities, with reasoning ability also getting stronger. I think these are important trends in AI's coming development.

We hope to combine product and technology. Now product logic has changed greatly from internet product logic. Today's products are largely determined by model capabilities. If model capabilities can't achieve something, there's no way to manifest it in the product experience.

Now there's more of a concept: model as product.

When we build Kimi, we also very much hope to think more tightly about combining product and model. For example, if we want a feature in the product, it needs corresponding model capability support. I think there will be a relatively certain demand: AI assistant. I think in the AI era, the super-app will most likely be an assistant. I think demand for intelligence is very universal — it's just that today capabilities are still at an early stage. Meanwhile, the market is a process of adapting to and embracing new technology; as effects keep improving and costs keep falling, this leads to increasingly strong market adaptability.

I think in the next 5 to 10 years, there will very likely be large-scale market application opportunities. Because I think it's actually addressing universal intelligence demand. Put simply, all the software and apps we use now were developed by hundreds or thousands of engineers, so the intelligence behind them is fixed.

But when human intelligence is encoded through some code (essentially rules), the intelligence is fixed there. It doesn't change.

But for AI products it's somewhat different, because behind them is a model. You can think of the model as having millions of people, and these millions of people are very capable, able to help you complete different tasks. I think its ceiling is very high.

One very important thing here is: if you want to do increasingly complex tasks, you must be able to support increasingly long contexts. So we focused heavily early on improving capabilities in this area, using context length to address reasoning ability. Going forward we'll also focus heavily on productivity scenarios.

I think the biggest variable for this generation of AI is still on the productivity side. Every unit of productivity in society now probably has opportunity for tenfold improvement, so we hope to focus on these productivity scenarios and continuously optimize effects. Of course better effects correspond to model capability improvements.

Meanwhile, I think AI's biggest variable now is viewing data itself as a variable. When you optimize a system, data shouldn't be viewed as a constant or as something static. This is also different from previous AI research paradigms. For example, five or seven years ago, and even now, many people's approach to AI research is to fix the data as one fixed dataset, then study different methods, neural network structures, and optimizers, improving results only under fixed data conditions.

I think now data increasingly becomes a variable — how to use data, or obtain user feedback, increasingly becomes very important here. For example, an important technology is RLHF (Reinforcement Learning from Human Feedback). The core is how to learn from human feedback. Even if AI has very strong intelligence, if it isn't aligned with human values, or doesn't produce what humans want, the user experience may not be very good.

I think the path to AGI is more a process of co-creation, not pure technology — it should be better integration of technology and product. It's like treating product as an environment, where the model interacts with users in this environment and continuously learns from interacting with users, thus continuously improving.

From 2018 onward, when Transformer began emerging, we also did a lot of Transformer-based research and exploration. Of course at the beginning, we really didn't imagine the final effects could reach today's level. But effects will continue improving going forward, because as long as scaling laws continue to exist, or continue to hold, model intelligence will keep rising.

For me, the entire exploration process has been an enormous journey, stemming from profound curiosity. In this process, uncertainty is everywhere. Yet we tend to be more optimistic than warranted, because we don't know what we don't know. For example, when we first started this project, although we anticipated many difficulties, we ultimately discovered that no matter how many challenges we predicted, reality was always harder than we imagined.

Though first principles may be clear, unknown factors are too numerous. As Daniel Kahneman, author of Thinking, Fast and Slow, said: Often we're willing to try things we don't know, precisely because we don't know how much we don't know — this ignorance gives us courage. When you begin trying, you discover many new problems, and perhaps this is the essence of innovation.

Most of the time your attempts may fail, but occasionally you'll find some solution suddenly works. This happens frequently in our office — you'll see someone suddenly cheer, you might think something's wrong with them, but actually they just discovered some method works, just like that.

I think much of the time, observing what works and what doesn't is the simple process of exploring truth. This exploration isn't limited to technology — whether product or business model, finding what works and what doesn't, or simply exploring the answers themselves, is all very valuable.

Thanks to Tianjin University Xuanhuai College for their contribution to this article
