30,000-Word Transcript: A Google DeepMind Researcher on Deconstructing OpenAI o1 and the LLM+RL Paradigm | Z Talk

ZhenFund · December 4, 2024

The most hardcore, no-fluff technical breakdown of o1.

Z Talk is ZhenFund's column for sharing insights.

When it comes to the most closely watched event this year, the release of OpenAI's o1 model is impossible to ignore — OpenAI CEO Sam Altman himself called it the beginning of a new paradigm. Through reinforcement learning combined with Chain-of-Thought (CoT) technology, o1 handles complex problems in physics, mathematics, and programming at a level comparable to PhD students in those fields.

In the latest episode of her podcast OnBoard!, Monica, investment vice president at ZhenFund, sat down for over three hours with researchers who have hands-on experience training large language models (LLMs) at leading AI labs to dissect and interpret OpenAI's o1.

How does reinforcement learning unlock new logical reasoning capabilities in LLMs? Where does this ability come from, how is it implemented, and what is its future potential? What impact will o1's "new paradigm" have on the industry? Over 160 minutes, five voices at the cutting edge sparked countless insights.

This transcript was compiled by "Zhang Wuchang" and edited by ZhenFund. Here is the full 35,000-word conversation:

Key Takeaways

On Building Agent Systems:

A foundation model's reasoning capability is the bedrock of agent development, but building truly effective agent systems also requires solving how multiple AIs collaborate, compete, and divide complex tasks.
Tool use only becomes meaningful with sufficiently broad coverage. The key is improving the model's understanding of tool functions in prompts and its ability to call them. With strong prompt comprehension and reasoning capabilities, plus well-documented instructions, models can correctly invoke these tools.
Building powerful agent capabilities requires four elements: a strong base model with reasoning ability, high-quality tools, well-crafted prompts, and learning how to better use tools through datasets.
The ideal way to collect agent-related datasets is to embed data annotation naturally into users' daily workflows without making them aware they're labeling data — this ensures data quality. Tesla represents the ideal case.

On Chain-of-Thought and Reinforcement Learning:

When solving problems, models perform better when given more detailed steps rather than just the final answer. This is the core idea behind Chain-of-Thought.
Chain-of-Thought splits into two schools: the explicit school, which uses distinct tokens to show the thinking process; and the implicit school, which resembles human intuitive thinking — answers can suddenly appear in a flash, difficult to fully explain with logic.
Traditional language models' biggest weakness is the inability to backtrack. Once an incorrect token is generated, it cannot be corrected. But if models are allowed to reflect and fix errors, their performance on reasoning tasks improves dramatically.
The core of reinforcement learning is learning through an agent's interaction with its environment, guided by rewards — these three elements form reinforcement learning's basic framework.
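
The agent, environment, and reward loop described above can be sketched with a toy two-armed bandit. Everything here (the class names, the payout probabilities, the epsilon-greedy rule) is an illustrative assumption, not something taken from the conversation.

```python
import random


class BanditEnvironment:
    """Toy environment: each action pays reward 1.0 with a fixed probability."""

    def __init__(self, win_probs, rng):
        self.win_probs = win_probs
        self.rng = rng

    def step(self, action):
        return 1.0 if self.rng.random() < self.win_probs[action] else 0.0


class EpsilonGreedyAgent:
    """Keeps a running average reward per action and mostly picks the best one."""

    def __init__(self, n_actions, epsilon, rng):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon
        self.rng = rng

    def act(self):
        if self.rng.random() < self.epsilon:  # explore
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])  # exploit

    def learn(self, action, reward):
        self.counts[action] += 1
        # Incremental mean update of the estimated value of this action.
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def run(steps=5000, seed=0):
    rng = random.Random(seed)
    env = BanditEnvironment([0.2, 0.8], rng)  # assumed payout probabilities
    agent = EpsilonGreedyAgent(n_actions=2, epsilon=0.1, rng=rng)
    for _ in range(steps):
        action = agent.act()          # agent acts
        reward = env.step(action)     # environment responds with a reward
        agent.learn(action, reward)   # reward guides the next decision
    return agent


agent = run()
```

After training, `agent.values` holds the agent's reward estimate for each action, and the higher-paying action ends up with the higher estimate.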

On AI Feedback Systems and Human in the Loop:

AI can not only rapidly process and comprehend massive amounts of text but also summarize it, giving it unique advantages in handling complex evaluation tasks.
In domains requiring substantial time and effort to evaluate (such as reading two long novels and summarizing their themes), AI can provide more efficient feedback than humans — an underappreciated value of AI feedback.
The most scalable solution going forward is "Human in the loop paired with AI feedback" — AI distills complex problems into forms humans can understand, with humans making the final judgment.

On Multi-Agent and Role Classification:

In language models, so-called Multi-Agent is essentially more like Multi-Task — the model must switch between generating content and evaluating results.
By prompting different personas, we can separate the generator and critic roles, letting each focus on generation or evaluation. This is the main direction for Multi-Agent applications in current language models.
Models face attention-switching challenges when juggling multiple tasks, which is why role separation is needed for more effective Multi-Agent systems.
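
A minimal sketch of the generator and critic separation described above, assuming the same underlying model is called with two different persona prompts. The persona strings, the `call_model` parameter, and the `toy_model` stub are all hypothetical; a real system would replace the stub with an actual LLM call.

```python
GENERATOR_PERSONA = "You are a solver. Propose an answer to the task."
CRITIC_PERSONA = "You are a strict reviewer. Reply ACCEPT or REJECT with a reason."


def generate_with_critic(task, call_model, max_rounds=3):
    """Alternate between the generator persona and the critic persona.

    call_model(persona, message) -> str is a stand-in for one model call.
    Returns the last draft and the number of rounds used.
    """
    feedback = ""
    draft = ""
    for round_index in range(1, max_rounds + 1):
        draft = call_model(GENERATOR_PERSONA, f"Task: {task}\nFeedback: {feedback}")
        verdict = call_model(CRITIC_PERSONA, f"Task: {task}\nDraft: {draft}")
        if verdict.startswith("ACCEPT"):
            return draft, round_index
        feedback = verdict  # the critic's objection goes back to the generator
    return draft, max_rounds


def toy_model(persona, message):
    """Hypothetical stub: answers wrongly first, then corrects after a rejection."""
    if persona == GENERATOR_PERSONA:
        return "4" if "REJECT" in message else "5"
    draft = message.rsplit("Draft: ", 1)[1]
    return "ACCEPT" if draft == "4" else "REJECT: 2 + 2 is not " + draft


answer, rounds = generate_with_critic("What is 2 + 2?", toy_model)
```

Keeping the two personas in separate calls means each call only has to do one job, which is the point of the role separation.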

On Single-Agent vs. Multi-Agent:

Before Single-Agent reaches superhuman levels, Multi-Agent will necessarily perform better because it provides diverse perspectives and approaches — just as human society requires division of labor and collaboration to achieve major breakthroughs.
Multi-Agent is likely a transitional phase. From AGI's ultimate goal, the future should be a single model handling all tasks, not multiple AGI models collaborating.
Multi-Agent is currently used mainly to address model instability in corner cases and reasoning processes. As base model capabilities improve, this need will gradually diminish.
Many problems currently solved through Multi-Agent — such as accurate tool understanding and invocation — may be replaced by more powerful single models (like o1).

Guest Introductions

Kimi Kong: Currently a research engineer at Google DeepMind. First encountered reinforcement learning while studying at Stanford. Research spans robotics to large language models, with a systematic understanding of reinforcement learning theory and its evolution.

Eric Li: PhD from Caltech, currently a research scientist at Google Cloud. Many have speculated that o1 applies Monte Carlo Tree Search (MCTS) to LLMs, considering it one of the important methods for enhancing logical reasoning. Eric has published multiple papers combining LLMs with MCTS and is an undisputed expert in this domain.

Su Hui: Former WeChat AI researcher, currently head of large models at a leading domestic internet company.

Cage, co-host of this episode: Former data scientist at ByteDance, currently a researcher at Shixiang Tech, contributor to the newsletter "Overseas Unicorn."

Monica, host of OnBoard!: Investment vice president at ZhenFund, formerly with AWS's Silicon Valley team and an AI startup.

Note: This episode was recorded on September 27, 2024.

01 Guest Self-Introductions

Monica:

The most hardcore, most substantive technical breakdown of OpenAI's o1 model you've been waiting for is here! The most notable event recently has of course been the September 12th release of OpenAI's o1 model. Anticipation has been building for some time, and OpenAI CEO Sam Altman has called it the beginning of a new paradigm.

By combining reinforcement learning with Chain-of-Thought reasoning, o1 handles highly complex problems in physics, math, and programming at a level comparable to PhD students in those fields.

This time, I've invited several heavyweight guests for an over-three-hour interpretation. They all have frontline experience training large models. Two of them come from Google, the absolute stronghold of reinforcement learning — also the birthplace of world-leading RL work like AlphaGo and AlphaFold.

First, I'll ask each guest to introduce themselves — briefly walk us through your background and how you got into LM (language models) or reinforcement learning. And per our usual format, besides o1, if you've come across an interesting project or paper recently, feel free to share. Let's start with today's returning guest, Eric.

Eric Li:

Hi everyone, I'm Eric. I currently do LLM-related research at Google, mainly focusing on LLM post-training reasoning and Multi-Agent. I started working on LLMs about two years ago, when the concept of Instruction Tuning had just emerged. We were working on some FLAN (fine-tuned language net) related models, primarily scaling up Instruction Tuning data and studying its effects on models.

I got into reinforcement learning mainly starting last year, doing RL-related research and work internally at Google on PaLM 2 and Gemini. Recently, I've found a series of papers combining LLMs with MCTS quite interesting — integrating planning into LLM reasoning is a very promising direction.

Monica:

Perfect — MCTS is also a topic we'll discuss later. For those not yet familiar with this term, Eric can give a brief introduction here.

Eric Li:

MCTS is Monte Carlo Tree Search, a classic search algorithm. It is best known for its use in Google's Go AI project, which is how it came to public attention.

In LLM reasoning, Monte Carlo Tree Search is mainly used in two ways: first, generating higher-quality synthetic reasoning data, and second, incorporating planning into reasoning steps at inference time (during the reasoning phase, the model uses MCTS to plan multiple reasoning paths, helping it select the optimal reasoning result). MCTS can be used to optimize reward and reasoning paths. I think both are very interesting directions.
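
For readers new to the algorithm, here is a compact sketch of the four MCTS phases (selection, expansion, simulation, backpropagation) on a toy problem. The "reasoning path" is just a sequence of binary choices, and `TARGET`, the reward function, and the exploration constant are assumptions made for illustration. It is not how any particular lab applies MCTS to LLMs.

```python
import math
import random

TARGET = (1, 0, 1, 1)  # hypothetical "correct" sequence of reasoning choices
ACTIONS = (0, 1)


def reward(path):
    """Terminal reward: fraction of choices that match the target path."""
    return sum(a == b for a, b in zip(path, TARGET)) / len(TARGET)


class Node:
    def __init__(self, path, parent=None):
        self.path = path
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.total = 0.0

    def is_terminal(self):
        return len(self.path) == len(TARGET)

    def fully_expanded(self):
        return len(self.children) == len(ACTIONS)


def uct_child(node, c=1.4):
    """Pick the child with the best mean value plus exploration bonus."""
    log_parent = math.log(node.visits)
    return max(
        node.children.values(),
        key=lambda ch: ch.total / ch.visits + c * math.sqrt(log_parent / ch.visits),
    )


def mcts(iterations, rng):
    root = Node(())
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down while every action has already been tried.
        while not node.is_terminal() and node.fully_expanded():
            node = uct_child(node)
        # 2. Expansion: add one untried action.
        if not node.is_terminal():
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(node.path + (action,), parent=node)
            node = node.children[action]
        # 3. Simulation: finish the path with random choices.
        path = node.path
        while len(path) < len(TARGET):
            path += (rng.choice(ACTIONS),)
        value = reward(path)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += value
            node = node.parent
    return root


def best_path(root):
    """Follow the most-visited child at each level."""
    node, path = root, ()
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        path += (action,)
    return path


root = mcts(3000, random.Random(0))
found = best_path(root)
```

In an LLM setting, each action would be a candidate reasoning step and the random simulation would be replaced by model-generated continuations, but the bookkeeping stays the same.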

We recently had a paper that uses MCTS to help annotate process supervision data. When large models do reasoning, certain reasoning steps may go wrong, but having humans annotate the correctness of each reasoning step is extremely resource-intensive. We use MCTS plus some Monte Carlo estimation methods to optimize this process, proposing a method that requires no human involvement at all — relying solely on AI to obtain feedback annotations.
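
The general idea of labeling reasoning steps by Monte Carlo estimation can be sketched as follows: from each prefix of a solution, sample several completions and record how often they reach the correct final answer. This is a simplified illustration on a toy addition task, not the method from the paper mentioned above. The `rollout` function is a hypothetical stand-in for sampling completions from a model.

```python
import random

NUMBERS = [3, 5, 2, 7]          # toy problem: add these numbers one by one
CORRECT_ANSWER = sum(NUMBERS)   # 17


def rollout(prefix, rng, slip_prob=0.2):
    """Stand-in for sampling one completion from a model.

    prefix holds the running totals written so far. The remaining additions
    are carried out with an occasional off-by-one slip.
    """
    total = prefix[-1] if prefix else 0
    for number in NUMBERS[len(prefix):]:
        total += number
        if rng.random() < slip_prob:
            total += 1
    return total


def step_values(solution, n_rollouts, rng):
    """For each prefix, estimate the chance that a completion ends correctly."""
    values = []
    for k in range(1, len(solution) + 1):
        prefix = solution[:k]
        hits = sum(rollout(prefix, rng) == CORRECT_ANSWER for _ in range(n_rollouts))
        values.append(hits / n_rollouts)
    return values


def first_bad_step(values):
    """Index of the first step from which no sampled completion succeeded."""
    for index, value in enumerate(values):
        if value == 0.0:
            return index
    return None


rng = random.Random(0)
solution = [3, 8, 11, 18]  # the third running total is wrong: 8 + 2 should be 10
values = step_values(solution, n_rollouts=200, rng=rng)
bad_step = first_bad_step(values)
```

Steps before the mistake keep a nonzero success estimate, while the estimate drops to zero from the faulty step onward, which yields a step-level label without a human annotator.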

Monica:

Links to the papers are in the show notes. Let me ask a follow-up: everyone says that to keep improving reasoning capability, we need to add multi-step data. Is this mainly in the pre-training or post-training stage?

Eric Li:

It mainly works in post-training. For example, in the reinforcement learning process, if you only use classic RLHF (Reinforcement Learning from Human Feedback), you might only learn whether an answer is right or wrong at the very end, and the model has to figure out on its own which steps in the reasoning process went wrong and which were correct.

But with this process supervision data, you can enable the model to better learn its value function, and more accurately know which reasoning step is wrong and which is right during reinforcement learning, which improves the efficiency of RL training.

Monica:

Indeed, whether MCTS is used in LM training, including whether it's used in reinforcement learning, is a topic people often discuss. We'll ask Eric to discuss this together later. Alright, next up is Kimi.

Kimi Kong:

Thank you very much, Monica, for the invitation today. I'm Kimi, and my Chinese name is Kong Lingjie. I have dual Master's degrees in Mechanical Engineering and Computer Science from Stanford, though I still haven't claimed my CS degree so I can stick around Stanford to do a part-time Business School program.

I was originally trained in robotics and control theory, mainly working on state-space models — but not the model-based state-space models people talk about now. Rather, pure control theory state-space models, the classical control theory lineage of model-based approaches.

My entry into AI/ML was actually quite accidental. In 2016, when I was finishing up my mechanical engineering degree at Stanford, I met Stefano Ermon. I was taking his classes on probabilistic graphical models and deep generative models.

One day it was raining extremely hard, nobody came to class, and I was the only person left in the classroom. That's how I got to know Stefano. He strongly encouraged me to explore using learning-based approaches to solve robotic control problems.

Later I said, if Stefano writes me a recommendation letter, I'll apply for the CS degree. Luckily, Stefano wrote me the letter, and I got into Stanford again.

Before that, I interned at Microsoft in 2016, and after graduation I went to AWS where I was colleagues with Monica. At AWS, I led two projects:

One was a distributed simulation project, helping Amazon robots use distributed methods to do more searching and data collection, improving reinforcement learning training speed.

The other was a medical imaging computer vision project. After that, in early 2023, the week before Google's massive layoffs, I joined DeepMind. At Google, I initially helped them use AI for some forecasting tasks, and later, as LMs developed, I mainly worked on Gemini's auto eval: using LMs to evaluate whether new models perform well when they come out, which is a scalable solution. Recently I've mainly been working on the agent direction.

Monica:

So everyone, don't skip class lightly — every class might hold a surprise.

Kimi Kong:

Although that was a video-recorded class, I very clearly remember I was two minutes late that day. When I walked into the classroom, I saw Stefano looking a bit lost, because with nobody there he thought he'd have to teach the online class by himself. Seeing someone come in, he was very happy.

I helped Google's Search Department use agent methods to improve their ad click-through rates. Speaking of papers and projects, I've been very inspired recently by a relatively early paper about the scaling law of reward model over-optimization, published by OpenAI in 2021 or 2022. I've read many related papers, with particular focus on the reward model piece.

Actually, when doing reinforcement learning, the reward model is a very mysterious component, because to this day, nobody truly knows how to define and design a good reward model.

I got a lot of inspiration from reading that paper. Recently I've become quite obsessed with the tool Cursor — I use it every day after getting off work at Google. With Cursor, I can accomplish in three hours at home what would take a week of coding at Google. It's really mind-blowing.

Monica:

As a senior programmer, do you think Cursor will replace Copilot for you?

Kimi Kong:

I think Cursor has one great feature that Copilot lacks, called composer. Cursor is essentially a fork of VSCode, since Microsoft's VSCode is an open-source project. Cursor's underlying layer connects to various different large models, including Claude 3.5, and recently also GPT-4o.

Cursor's advantage over Copilot is that Copilot behind the scenes may only connect to some smaller Microsoft OpenAI models. Although it later also connected to GPT-4o, because costs are very high, it never really brought out the best models. Cursor can easily connect to the best models, like Claude 3.5 and various others.

I've already deleted VSCode. Cursor has done a lot of interface optimization for AI programming. I particularly like the composer feature, which helps quickly scaffold a project. It's especially useful for machine learning engineers, because my frontend skills are quite rusty, and I haven't done backend in many years either, but I can quickly build Chrome extensions — something that would have been impossible before.

Monica:

Friends following the AI space have probably sensed Cursor breaking out recently. Cursor is a company founded in 2022 or 2023 that received early investment from OpenAI. After using new models, Cursor saw tremendous improvement in language understanding and programming capability.

They recently received a new round of funding from a16z, at a valuation of about $400 million. What's interesting is that both of Cursor's founders are MIT graduates born in the 2000s, and they've proven that the IDE (Integrated Development Environment) is still a domain that can be reimagined. From an investor's perspective, I'm quite moved that young people can use AI to create such AI-native products. Thank you very much, Kimi, for sharing. Su Hui, you can also introduce yourself to everyone.

Su Hui:

Hello everyone, my name is Su Hui. In the years before ChatGPT came out, I worked on dialogue system research in WeChat's AI team, including research work from earlier eras. That period went through the research transition from traditional language models to LLM research. After ChatGPT came out, I joined the entrepreneurial tide.

After a period of entrepreneurship, I'm now at a major tech company responsible for the large model direction, mainly in charge of model training, as well as some cutting-edge research studies and exploration of innovative applications.

I've been following AI development since early on, witnessing various design changes, shifts in training paradigms, and iterations of various architectures. Now I'm mainly doing large-scale exploration in application scenarios, researching ways to deploy reinforcement learning, and finding effective paths from user feedback to model iteration. Regarding the Cursor project, I'm a heavy user — basically at the point where I can't live without it, though the previous guests have already discussed it.

I really like Zeyuan Allen-Zhu's Physics of Language Models series, which he has been working on from last year through to recently. It isn't tied most closely to reasoning, but it does include quite a few solid experiments and conclusions in this area. Although the experimental scale is small, the work is very rigorous. I think a lot of research papers should learn from this controlled experimental paradigm.

I think following his work to study reasoning — including its relationship with Chain-of-Thought and how to improve through reasoning — is a very good starting point. I'd also like to recommend this work here to researchers who are just getting into the LM or reasoning direction.

Monica:

Why do you think this is a research methodology worth learning from?

Su Hui:

Because some approaches to research are based on specific versions of models or certain model families. These research conclusions sometimes lack a rigorous foundation, since you're constrained by these models' data formats or data composition. For you, this is a very black-box environment, and your test data likely has unknown couplings with its pre-training process. So many conclusions aren't solid enough.

He designed a completely controllable environment, where everything from data to architecture is under his own control, and the training data is entirely synthetic. The difficulty and logic of the data are therefore fully controllable, and the final experimental results depend only on that data, which removes interference from unknown data when doing research. He also runs his scaling experiments quite rigorously, observing how behavior changes across model sizes and deriving some solid conclusions. Although he couldn't go to particularly large scales because of limited compute, teams with more compute can scale up to larger sizes to verify his findings and propose their own theories and experimental designs.

Monica:

Next, let's have today's co-host Cage introduce himself.

Cage:

Hello, thanks to Monica for the invitation. I'm currently doing AI technology investment research at Shixiang Tech, where we mainly research overseas AI unicorns.

Before o1's release, we wrote an article titled The Paradigm Shift in LLMs: RL Brings New Scaling Laws, which did quite a bit of analysis and prediction on reinforcement learning strategies and technical approaches. The release of o1 confirmed those analytical expectations.

Before joining Shixiang, I worked as a data scientist at ByteDance and at CMU's (Carnegie Mellon University) NLP research lab. That was right when GPT-2 was at its peak — I worked on text analysis combining BERT and VAE.

Speaking of fun facts, recently while researching papers on LM combined with MCTS, I came across a very interesting cognitive science article in Nature that's highly relevant to o1's capability ceiling.

The article is titled Language is primarily a tool for communication rather than thought. Its main argument is that language may not be what directly gives humans their thinking and reasoning abilities; language only reflects thought to a certain extent and enables cultural transmission. For example, aphasic patients retain complete logical reasoning capabilities.

This has an important implication for the o1 reinforcement learning route we're discussing today: the extent to which language can reflect and compress our thinking and reasoning processes may determine the future capability ceiling of LLMs under the reinforcement learning technical approach.

Monica:

A very interesting article. If this hypothesis is correct — that we can do reasoning beyond language — what impact do you think this would have on model training methods and the data required?

Cage:

Yes, I think it's quite possible that human language isn't the best form for reasoning. Although we currently see o1's Chain-of-Thought expressed in English, going forward AI might invent a more efficient formalized logical language for Chain-of-Thought, which could enable more efficient communication between AIs.

Monica:

Excellent. The self-introductions were full of surprises. Beyond our planned agenda, they give everyone a sense of how closely these excellent guests follow the cutting edge of the industry every day.

02 How to Scale Annotation and Filtering of High-Quality Data Is the Most Fundamental Question

Monica:

Getting back to our main topic — today's theme is OpenAI's o1 release. As senior researchers who have been working in this field, I'd like to ask what your first impressions were after seeing o1's release and trying it out personally? What impressed you?

Eric Li:

After experiencing o1 myself, I mainly had these impressions:

First, at the research level, I find its overall broad approach very interesting. They truly proposed and implemented a solution for scaling up inference time, which could potentially bring better performance improvements for reasoning.

In practical use, one thing that surprised me is that for any reasoning problem, its thinking process spontaneously exhibits different thinking and reasoning patterns. For example, it considers on its own whether it should think step by step or evaluate errors in its previous thinking. This ability to autonomously decide how to think next, I find very interesting. This is a characteristic I hadn't seen in previous models like GPT-4.

Monica:

But actually, the logical reasoning processes shown by o1 are still relatively limited. What do you think they're hiding that you wish could be shown to everyone?

Eric Li:

Actually, this is quite similar to what one of the previous guests mentioned — I'm not entirely sure whether the thinking processes the model hides are human-readable.

For example, previous research on Chain-of-Thought found that the longer the chain of thought, the better the model's performance. There have also been studies trying to add special think tokens, finding that this indeed makes the model think more and improves performance, but these think tokens are very difficult for humans to understand.

If this thinking process were readable, I believe the model would show more content — not just reasoning patterns about what to do next, but deeper-level thinking including why to choose certain steps, self-reflection, or why to decompose problems into specific sub-problems.

Monica:

What do you think wasn't done very well?

Eric Li:

I did try some tests myself, for example the classic case of counting how many r's are in "strawberry." I found that o1 still can't achieve very high accuracy on this. But I think this is acceptable if we treat it as just a large language model rather than a full system. Some things really don't need to be done by a language model, such as calculator-style computation.

I'm more concerned about whether its internal reasoning patterns can have some interesting manifestations.

Monica:

Eric mentioned testing how many r's are in "strawberry." Some listeners might be curious why people always like to use this question to test language models?

Eric Li:

I personally don't think we need to insist on language models being able to do this, because it depends on internal implementation details of the model, including how tokenization is done and other technical specifics. Tasks like this are more naturally handled with tool use.
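
To illustrate why this is a tool-use problem, the sketch below counts letters directly in code. The subword split shown is hypothetical and only meant to suggest that a model sees chunks rather than individual characters; real tokenizers split text differently.

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter, ignoring case."""
    return word.lower().count(letter.lower())


# Hypothetical subword split, for illustration only; real tokenizers differ.
pieces = ["str", "aw", "berry"]

total_r = count_letter("strawberry", "r")
per_piece = [count_letter(piece, "r") for piece in pieces]
```

A model that can call a function like `count_letter` never has to reason about characters hidden inside its tokens.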

For humans, giving one or two examples is enough to do well, but giving a language model two or three examples doesn't guarantee it'll do well. This is a relatively simple test method, used to check whether the model can understand the mapping from input to output.

From a more scientific perspective, tests in mathematics, programming, or some harder domains like quantum physics might better reflect the model's reasoning performance.

Monica:

What about you, Kimi?

Kimi Kong:

I'd like to quote UCLA mathematics professor Terence Tao, who said that using o1 feels roughly like advising "a mediocre, but not completely incompetent, graduate student."

I think in some ways, o1 is indeed very impressive to me. For example, when I previously used Cursor with Claude 3.5 Sonnet, it would often write buggy code. I'd run it, paste the error message back, and it would say "oh I'm sorry" and fix the previous error, eventually getting the code to run.

With o1, it can write code for me very smoothly. This involves behind-the-scenes questions about how they do self-correction when code errors occur. This makes me think about the reasoning token question: is it explicit or implicit?

When looking at o1 preview, what interested me most was the math problem examples. I think math and programming are quite similar overall. When solving math problems, it keeps thinking: let's consider this approach, actually, let's consider another approach, showing an ability to continuously self-refine its thinking process. This way I don't need to intervene to correct many errors in the middle — this is the good aspect of o1.

As for the not-so-good aspects, it's like what Terence Tao said about the mediocre graduate student to some degree. Someone online had it answer how to install CUDA, and after thinking for 27 hours it said "I don't know." This shows that while it's quite impressive in certain areas of expertise, it still has many limitations in other areas. I'm very much looking forward to their future work addressing these issues.

Monica:

What other limitations do you hope to see improved in possibly the next version?

Kimi Kong:

A few aspects. First, how to get more data coverage. Second, how to make data evaluation methods more scalable.

One thing from OpenAI that I find very fascinating is their PRM (Process Reward Model) work from many years ago. I think OpenAI must have spent an enormous amount of time researching how to work with data.

For Google and other companies alike, the most fundamental question is how to create large volumes of high-quality data, and how to filter for high-quality data in a scalable way.

When filtering for high-quality data, you need a scalable approach to labeling reward signals — not just sparse rewards. For instance, unlike math problems where you only look at whether the final answer is right or wrong.

For many problems, there simply isn't a closed-form solution. It's extremely difficult to evaluate whether something is good or bad. So how do you define a systematic approach to annotating high-quality data at scale? I find this a deeply fascinating question. If this problem can be solved, I expect these reasoning tasks could achieve another qualitative leap forward.

Monica:

You mentioned that OpenAI has released a lot of work related to data. So what kind of data acquisition and processing methods are needed to train a model like o1? How is this different from traditional LM training?

Kimi Kong:

That's a great question. When OpenAI first released InstructGPT, Google was still focused on producing high-quality SFT data (annotated datasets used during supervised fine-tuning). InstructGPT took a different path, choosing to work with preference data (particularly data expressing user or annotator preferences during reinforcement learning stages like RLHF).

Whether you're doing SFT or RLHF preference data, you need very good data. But what's interesting is that high-quality preference data is actually easier to obtain than high-quality SFT data. That was the first thing that really impressed me about their approach.

This preference data is sparse, meaning you can only evaluate the quality of the entire conversation after it ends. If there are many intermediate step reasoning processes in the middle, you can't score each individual step.

To solve this problem, they released the PRM800K dataset, the step-level dataset behind their "Let's Verify Step by Step" work. This research direction has continued all the way through to o1's development today. Fundamentally, what we're trying to solve is how to annotate high-quality data in a scalable way.

These high-quality data don't necessarily have to be SFT data — they can be preference data, or perhaps someday we'll discover an approach even easier than annotating preference data. If the scaling law in data could achieve another 10X or 100X improvement, models might reach new breakthroughs in knowledge.

Cage:

Kimi just mentioned scalability, which makes me want to bring up InstructGPT, Anthropic's Constitutional AI paper, and the Reinforcement Learning from AI Feedback (RLAIF) approach. I've been thinking about a question: if we're preparing high-quality reasoning-token data, what should the ratio be between high-quality human annotation and annotation that AI could assist with in the future?

Kimi Kong:

There are several ways to use human annotation. The most direct is Direct Preference Optimization (DPO). Many people have found that when doing RLHF, training a reward model is too complex, and training requires PPO (Proximal Policy Optimization) — you not only need to keep the current model in memory but also the previous model.

This complexity has pushed us toward DPO. The advantage of DPO is that it doesn't require machine-generated data; human-annotated data can be used directly for training. This is the most straightforward approach.
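
As a rough illustration of what DPO optimizes, the sketch below computes the loss for a single preference pair from summed token log-probabilities. The numbers are made up and `beta` is a tunable hyperparameter. In a real implementation the log-probabilities would come from the policy being trained and from a frozen reference model, which DPO still needs even though it drops the separate reward model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability a model assigns to a response.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference has been learned yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the human-preferred response.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)

# Policy has shifted toward the rejected response instead.
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss falls as the policy raises the preferred response relative to the reference and rises when it moves the other way, so the human-labeled pairs act directly as the training signal.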

But there's a classic "chicken and egg" problem here: you need a good model to create high-quality data, but before that you need to train a high-quality model. So the typical approach is to first use human-annotated data to train a reward model, then use that reward model to annotate other data that doesn't have preference labels, as if it were human.
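
The two-stage recipe described here (fit a reward model on human preference pairs, then let it label unlabeled pairs) can be sketched with a tiny linear reward model trained with a pairwise logistic loss. The two features per response, the training pairs, and the learning rate are invented for illustration; real reward models are neural networks that score text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(weights, features):
    """Linear reward: weighted sum of the response features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """pairs: list of (chosen_features, rejected_features) from human annotators."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            # Gradient of -log(sigmoid(margin)) with respect to the margin.
            grad_scale = sigmoid(margin) - 1.0
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


def label_pair(weights, response_a, response_b):
    """Use the trained reward model as a stand-in annotator."""
    return "a" if score(weights, response_a) >= score(weights, response_b) else "b"


# Hypothetical features per response: [correctness, verbosity].
human_pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([1.0, 0.8], [0.0, 0.1]),
    ([1.0, 0.5], [0.0, 0.5]),
]
weights = train_reward_model(human_pairs, n_features=2)

# An unlabeled pair: a verbose wrong answer versus a terse correct one.
machine_label = label_pair(weights, [0.0, 1.0], [1.0, 0.0])
```

Because the learned weights only capture what the human pairs happened to reward, any gap between them and true human preference is exactly where reward hacking can creep in.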

This RLAIF approach has a potential reward hacking problem (when optimizing against a reward model's scores as rewards, if the reward model doesn't fully represent human preferences, reward hacking can occur). As humans, we can systematically analyze the quality of different responses, but problems can arise in practice.

For example, when faced with unsafe questions, the model might simply choose not to respond, and the reward model might consider this good — which is a terrible outcome. The model should respond, but the language model might behave abnormally as a result, creating a back door in the model.

Overall, this is a very interesting but thorny topic. We need to invest more time in researching how to train reward models; this is foundational work for scaling RLHF or RLAIF training.

Su Hui:

Let me share my testing experience with o1. Beyond testing LeetCode Weekly Contest problems, I paid particular attention to travel planning in complex scenarios.

By complex scenarios, I mean things like family international travel, where I provide flight times and attractions in the prompt. When I previously tested GPT-4, the itineraries it produced looked decent on the surface, but examining the details revealed problems — like failing to reasonably account for travel time, resulting in days where most time was spent in transit with very little actual sightseeing.

This time, testing o1 was very impressive, particularly in how it accounted for time zone differences. Since I often choose Beijing and New York as test locations — two cities the model has learned about the most — it would convert time zones properly, judge arrival times, and suggest resting before scheduling activities. And like a thoughtful local guide, it would consider regional differences, such as varying museum opening and closing times between China and the US.

If we only talk about LeetCode Weekly Contest problems, that mainly reflects the model's capabilities in code and mathematical reasoning, where rewards are relatively easy to define in reinforcement learning. But generalizing to travel planning scenarios like this — if not for generalization capabilities, I think it would be very difficult to achieve.

I think there might be two explanations: One is that they've found a good way to define rewards for general tasks, giving reasoning effective feedback; the other is that training on strong reasoning directions like code and math can also generalize to these kinds of scenarios. From the results, it has indeed achieved very good generalization.

Monica:

As you mentioned, travel planning is a relatively complex task in daily life. How is the reasoning required for this different from the reasoning done for coding or math problems?

For instance, an excellent private secretary, a great travel agency, or a good EA (Executive Assistant) doing this work doesn't need to be an IOI gold medalist and doesn't need to know coding. So how should we understand the relationship between these two types of capabilities?

Su Hui:

I think this is a question of how we define reasoning. For example, when you do coding or math reasoning, you're solving a well-defined problem with intermediate reasoning steps that tend to be logically rigorous and symbol-based. But there's also a large amount of reasoning that's based on your commonsense understanding of the world.

Let me give an example: if it's raining now, then selling umbrellas might be a good business. This is actually a reasoning process — you need some general knowledge about the world, and you need to generalize to new scenarios. If no one has sold umbrellas in the rain before, you might generalize from other business scenarios and infer that selling umbrellas in the rain would work better.

Travel scenarios are closer to this commonsense-based reasoning I just described. Because the considerations involve logical sequential relationships — for instance, in a large family, if elderly members have limited stamina, you need to think about what kind of itinerary to arrange.

Previously, this often required a fairly complex agent pipeline, with extensive business understanding, custom rules, and careful prompt design. But now it can understand very well that "comfortable" means not spending excessive time in transit — this is commonsense-based reasoning.

03 What Drives OpenAI o1's Capability Improvements?

Monica:

I'd like to ask: what are the main sources of o1's improvements in reasoning? If we were to break it down, what important components were added to the traditional LLM training paradigm to give it these capabilities?

Kimi Kong:

I'll hazard a few guesses here, since I don't actually know how they specifically trained it. If I had to guess, I think the most critical factor is data (it's all about data). Coding and Q&A are foundational capabilities that large language models already do very well.

Why? Because this data is extremely easy to obtain. Stack Overflow, for instance, is a mapping from questions to code. Wikipedia is Q&A-formatted data.

Not only is this data easy to obtain, its quality is high. You can see how many times a Wikipedia page has been clicked, how many times a Stack Overflow answer has been upvoted — it's very easy to judge data quality. So it's quite natural that models perform well in this area.

When we talk about reasoning, we first need to consider how to define reasoning, and more critically, how to obtain reasoning data. If I asked Monica, what do you think is a good reasoning dataset, and where would you look for this data?

We know Wikipedia is a great source for Q&A, and Stack Overflow is an excellent Q&A platform for coders. But honestly, I'm not sure what truly counts as good reasoning data, or where to find it.

Monica:

There are also papers and other sources, like Reddit and Zhihu Q&A content.

Kimi Kong:

Right, but those are all pretty noisy. Zhihu does have some decent AI/ML popular science content that could serve as good reasoning data. But fundamentally, datasets that contain very long chains of logic are basically not publicly available.

Su Hui:

So they're valuable, right?

Kimi Kong:

So OpenAI actually took a different approach to generating this data. My personal bet is that much of it was generated through various synthetic methods, using different filtering techniques to keep the good stuff.

Take writing a math problem: 3X plus 5 equals 100, solve for X. When you know the correct answer is X equals 50, you can ask an LLM: help me reason through this step by step, forcing it to lay out the complete reasoning process. If the final result isn't correct, you say okay, this is bad reasoning, I don't want it. You run it a hundred times, then use heuristics or a reward model to filter out high-quality reasoning processes. If you have no idea what's right or wrong, you can use self-consistency to filter.
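The filtering Kimi describes can be sketched roughly as follows. This is only an illustration, not how OpenAI actually builds its data: the trace format (`steps`, `answer`), the function names, and the toy problem 2x + 4 = 10 are all assumptions made for the example. The first filter keeps traces whose final answer matches a known answer, and the second falls back to a majority vote (self-consistency) when no answer is known.

```python
from collections import Counter

def filter_by_gold(traces, gold_answer):
    """Keep only traces whose final answer matches the known answer."""
    return [t for t in traces if t["answer"] == gold_answer]

def filter_by_self_consistency(traces):
    """No known answer: keep traces that agree with the majority answer."""
    if not traces:
        return []
    majority, _ = Counter(t["answer"] for t in traces).most_common(1)[0]
    return [t for t in traces if t["answer"] == majority]

# Hypothetical sampled traces for the toy problem 2x + 4 = 10 (so x = 3).
samples = [
    {"steps": ["2x = 10 - 4", "2x = 6", "x = 3"], "answer": 3},
    {"steps": ["2x = 10 + 4", "2x = 14", "x = 7"], "answer": 7},
    {"steps": ["x + 2 = 5", "x = 3"], "answer": 3},
]

kept_gold = filter_by_gold(samples, gold_answer=3)
kept_vote = filter_by_self_consistency(samples)
```

In practice a heuristic or reward model would score the surviving traces further, since a correct final answer does not guarantee that every intermediate step is sound.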

I believe reasoning capability can be continuously distilled. Like when I'm writing my PhD thesis — you read lots of papers, think through them, reason about what ideas you might have, and finally arrive at your own thoughts.

It's a process of constant absorption and digestion. It's just that for LLMs, we have to force them: No, you must reason. Then tell me what you're thinking step by step. Make them show us their process of digesting knowledge, then feed that data back to train the LLM for better reasoning ability, rather than simply spitting out an answer.

Those are my personal thoughts, and I'd very much like to hear what other guests think.

Monica:

The form of this data is different from traditional one-shot learning. Do you think there are training methodology challenges?

Kimi Kong:

For language models, there are basically two training methods: pure SFT and RLHF. I think DPO (Direct Preference Optimization) has become so generalized that it's not that different from RLHF anymore. If you could be absolutely certain all your data is very good, I think SFT would work fine. But as I said at the start, it's very hard to generate really high-quality SFT data.

You might have two results, neither of which is exactly what you want, but A is slightly better than B. Then you can use A's trajectory and push the model in a better direction through reinforcement learning. Like, okay, I prefer A — when you see this kind of result, lean toward doing A more. Even if A isn't perfect, please don't lean toward doing B.
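As one concrete way to express "lean toward A, not B", here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes you already have summed log-probabilities of each response under the policy and under a frozen reference model; the argument names and the `beta` default are illustrative choices, and nothing here is claimed to be what o1 uses.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favors the chosen response than the reference does,
    # minus the same quantity for the rejected response.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the preferred response A.
prefers_a = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy raises the preferred response relative to the reference and lowers the rejected one, which matches the "A is slightly better than B" signal described above.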

Through this step, the model learns a better solution. Based on the previous base model, you now have a step-better model. You ask this model the same query again: okay, I know you do one step better now, based on this problem, please reason through it again. You get two new preference data points. Oh, this time B is better than A, and this B is not only better than A but also better than the previous A. This way you can push the model's frontier forward again. Through continuous iteration and reasoning, the model gradually develops stronger reasoning capability.

Fundamentally, this is a reinforcement learning approach. This reminds me of the self-play topic we'll discuss next.

Monica:

Recently everyone saw DeepMind's AlphaGeometry performing very well on specific math tests. I was wondering, if you had it solve various math problems to generate data, could that also be used for training models like o1?

Kimi Kong:

I don't actually know what AlphaGeometry's base model is specifically. But as a previous guest mentioned, you need a very strong base model to get better performance in a specific domain. If the base model isn't good enough, solving domain problems is basically very difficult.

Regarding what you just mentioned — solving problems in a specific domain is actually relatively simpler, because you can train with a more specific reward model. If you can train a domain-specific model and the data quality is good, you can absolutely use that data to feed back into a more general model. Those are some of my personal thoughts.

Monica:

Very illuminating. Eric, anything to add?

Eric Li:

I think there are mainly two points: data and reinforcement learning. Based on o1's strong reasoning performance, I think we need a lot of reasoning preference data, which is quite similar to the reward model process Kimi just described.

If you want to train a good o1 model, I think at the data level you should make its reasoning steps make more sense, be more efficient, even more optimal. So designing a reward model to judge the quality of reasoning steps is the most important thing.

Once you have a reward model, the synthetic data piece becomes easier to solve. Including the MCTS we mentioned earlier — you can generate better synthetic data based on the reward model. These methods combined can produce higher-quality reasoning data. I believe model-generated reasoning data is far better than human-generated data, because in practice, most human-generated content lacks logical structure, while models actually follow certain logic. So synthetic data is likely a major factor in training o1.

Additionally, I think the importance of reinforcement learning has become even more apparent. I recently saw an OpenAI researcher share a presentation titled "Don't teach. Incentivize." This is different from Google's emphasis on SFT (Supervised Fine-tuning) and instruction tuning two years ago. Because LMs are now so powerful, directly teaching them how to do reasoning is actually very difficult, and probably not optimal, since human reasoning isn't necessarily optimal either.

I think we should use a reinforcement learning approach, letting the model explore how to reason on its own — we just tell it whether the result is good or bad and give rewards or penalties. This way the model might find better reasoning approaches than humans. o1 gives me the feeling that the importance of reinforcement learning has been amplified — it's no longer just the tool for alignment or safety in traditional InstructGPT.

Monica:

Is this somewhat like AlphaGo? During play, it could create moves that even top human players hadn't thought of.

Eric Li:

I think current LLMs do have this capability. For example, when we do RLHF, we often encounter a very headache-inducing problem: reward hacking. The essence of this problem is that the model is so capable, it can find imperfections in the reward model and exploit them to increase its reward score.

But this doesn't mean it actually found a better solution — it just exploited a loophole in the reward model. If we can have a good reasoning-related reward model, I believe LLMs can find better reasoning paths on their own and achieve self-optimization. This also reflects a very common phenomenon in AI: AI can replace many human-designed model architectures or workflows and automatically optimize them.

Monica:

Then let me follow up on one last thing. If I don't need the model to learn relationships between steps, does that mean with a really good reward model, you wouldn't actually need so much multi-step data?

Eric Li:

Right, these are interconnected. Multi-step data can only work if your judgment of each reasoning step, the reward scores you give it, are all very reliable. If you have that, then this kind of dense reward is very useful for training.
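The contrast between a sparse outcome reward and a dense per-step reward can be sketched like this. The step scores are assumed to come from some step-level reward model, which is hypothetical here, and the threshold of 0.5 and the +1/-1 reward values are arbitrary choices for illustration.

```python
def outcome_rewards(num_steps, final_correct):
    """Sparse signal: only the last step carries a reward."""
    return [0.0] * (num_steps - 1) + [1.0 if final_correct else 0.0]

def process_rewards(step_scores, threshold=0.5):
    """Dense signal: every step is rewarded or penalized by its own score."""
    return [1.0 if s >= threshold else -1.0 for s in step_scores]

def first_bad_step(step_scores, threshold=0.5):
    """Index of the first step scored below the threshold, or None."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

# Hypothetical scores for a three-step solution whose second step is wrong.
scores = [0.9, 0.2, 0.8]
sparse = outcome_rewards(len(scores), final_correct=False)
dense = process_rewards(scores)
bad = first_bad_step(scores)
```

The dense version pinpoints where the reasoning went wrong, but as Eric notes, it only helps if the per-step scores themselves are reliable.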

But from what o1 suggests to me, when doing reasoning we don't need to use SFT to tell the model what to do. For example, with the problem a guest mentioned earlier — 3X plus 5 equals 100 — you don't need to first calculate 100 minus 5 equals 3X. The model might directly use formulas or other better methods to solve it. The key is not using human reasoning steps to teach it how to reason, but rather evaluating each reasoning step or the overall reasoning path, simply rewarding its reasoning.

Monica:

Let's hear your thoughts.

Su Hui:

I think there's one particularly important approach that addresses a problem many people ran into when combining MCTS and reinforcement learning with language models. It's about the granularity of reinforcement learning — whether you operate at the token level, or provide feedback at the sentence or step level.

I've looked at quite a few examples, especially the complete examples OpenAI published on their website, and noticed some very interesting characteristics. Some don't have obvious delimiters, but contain filler words that resemble the pauses we make in human conversation. Like when we're solving a problem and thinking, "Could I draw a line here? Doesn't seem like it would work," then pausing, adding an "um" — these features of the thought process are preserved in the complete Chain-of-Thought.

I suspect this contains traces of human annotation. They likely obtained a batch of high-quality Chain-of-Thought data, segmented it by step, and had the model learn this mode of thinking. After each step, a reward model provides feedback, determining whether to backtrack or perform reflection. This approach has been proven viable, giving many people confidence that continuing to explore in this direction is worthwhile.

04 Model Reasoning Capability Is Merely the Foundation for Agents; Building an Agent System Presents Far Greater Challenges

Monica:

We mentioned earlier that we don't need large models to solve particularly simple math problems. When asked simple math questions, the model uses extremely complex methods to solve them, employing the highest level of inference. Since the model has strong capabilities and knows this is just a simple comparison, addition, or subtraction problem, or simple reasoning, why doesn't it choose to use a calculator instead? Is this a model capability issue, or an engineering problem with tool use?

Su Hui:

When o1 was released, my first reaction was confusion about why it appeared in this form. OpenAI itself demonstrated that on certain tasks like text writing, o1 might perform slightly worse than GPT-4o, but it completely dominates in strong reasoning scenarios. Many people try to use o1 to solve what I consider fairly basic problems, which really isn't necessary.

If you want to provide a good product, you should implement a router LLM strategy: route tasks requiring strong reasoning to o1, and handle tasks not requiring strong reasoning with GPT-4o or GPT-4o-mini. This would make far more sense from a user interface perspective.

I shouldn't need to care which model is being called — I just want my problem solved. Hard problems go to o1, easy problems go to 4o-mini.
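A routing layer of the kind described here might look something like the sketch below. The keyword list, the word-count threshold, and the `route` function are all made up for illustration, and the returned strings are used only as labels for the three models named in the conversation. A real router would more likely use a small classifier model than keyword matching.

```python
# Hypothetical cues suggesting that a query needs multi-step reasoning.
REASONING_HINTS = ("prove", "debug", "optimize", "step by step", "itinerary", "derive")

def route(query, long_query_words=60):
    """Pick a model label using a crude, hypothetical difficulty heuristic."""
    text = query.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return "o1"            # hard problems go to the reasoning model
    if len(text.split()) > long_query_words:
        return "gpt-4o"        # long but ordinary requests
    return "gpt-4o-mini"       # short, easy questions

easy = route("What is the capital of France?")
hard = route("Prove that the square root of 2 is irrational.")
```

The user never sees which model was chosen, which is the experience Su Hui is asking for.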

This would be very simple for OpenAI to do, but they didn't. Because OpenAI operates on different logic from those building pipelines or products — it's purely a model service. Every product delivery means delivering a new model. So regardless of whether a query is suitable for o1, it all gets processed with the same logic.

And o1 was trained in a strong reasoning environment. Even when encountering very simple problems, it still has to go through a complex Chain-of-Thought. Although o1 is also a multimodal model, this wasn't particularly emphasized, nor well reflected in the user interface. Actually, things including tool use could all be integrated — the complete version of 4o after the whole release performs similarly to o1, but at this stage they just wanted to show off what this strong reasoning model, o1, is capable of.

Cage:

I strongly agree with what Su Hui said, because I've used it to answer some very simple questions, and it would think for 42 seconds before giving me a very simple answer. So I feel like OpenAI's research and products are somewhat disconnected.

We were talking about Cursor earlier — I feel like if Cursor were doing this, they might first frame the question, then @ it, and when you @ it, it would automatically determine whether to @o1 or @4o, finding the more appropriate model to handle the question. I think this kind of model routing is definitely something OpenAI will move toward next, which would make for a much better user experience.

Monica:

Since people started talking about the agent concept last year, tool use has been mentioned, but to this day we haven't seen general-purpose agents doing very well. People believed the core issue was the foundation model's reasoning capability. The second step is that it needs to understand which tools are available, and what those tools' capabilities and limitations are.

Do you think that if the reasoning capability demonstrated by o1 is strong enough, implementing task execution functionality would be relatively easy, or are there gaps we can't see in this process?

Su Hui:

I think OpenAI is somewhat conflicted about integrating tools, because tools need sufficiently broad coverage to be meaningful. If it's just APIs like Calculator or weather lookup, the effort is large but the product coverage isn't comprehensive enough. Their focus is on improving the model's understanding and calling capability for functions described in prompts.

Research validation shows this works very well in real production environments. As long as there's very strong prompt understanding and reasoning capability, with well-documented descriptions, the model can correctly call these tools at appropriate times and return good results.
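To make "functions described in prompts" concrete, here is a minimal sketch of a tool registry, the documentation string that would go into a prompt, and a dispatcher for a JSON tool call. The registry layout, the JSON call format, and the helper names are assumptions for this example and do not reflect any particular vendor's API. The model output is a hand-written stand-in.

```python
import json
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

# Hypothetical tool registry: a description for the prompt plus a callable.
TOOLS = {
    "calculator": {
        "description": "Apply one arithmetic operation to two numbers.",
        "parameters": {"a": "number", "b": "number", "op": "one of + - * /"},
        "fn": lambda a, b, op: OPS[op](a, b),
    },
}

def tool_prompt(tools):
    """Render the tool documentation that would be placed in the model prompt."""
    lines = []
    for name, spec in tools.items():
        params = json.dumps(spec["parameters"])
        lines.append(f"{name}: {spec['description']} Parameters: {params}")
    return "\n".join(lines)

def dispatch(model_output, tools):
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    spec = tools.get(call["tool"])
    if spec is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    return spec["fn"](**call["arguments"])

# Stand-in for what a model might emit after reading tool_prompt(TOOLS).
result = dispatch('{"tool": "calculator", "arguments": {"a": 6, "b": 7, "op": "*"}}', TOOLS)
```

The quality of the `description` and `parameters` text is what the model actually reads, which is why well-documented tools matter as much as the reasoning ability.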

Eric Li:

I think a single LLM with powerful reasoning capability is a very foundational basis for building agents. When OpenAI defines different levels of AGI, level 1 is chatbots, level 2 is reasoners, and agents only come after that.

Regarding action and decision-making, I believe it can determine how to handle complex tasks. Reasoning is more like the boundary OpenAI is still pushing at the foundational model level. I believe agents will be the next-level technology, but this doesn't mean that once every foundation model is good enough, agents will naturally work well.

Agents involve collaboration among multiple LLMs, multiple AI agents — including competitive relationships, and how to cooperate and divide labor to solve complex systems and tasks. LLMs are just one component. System architecture design, division of labor approaches — these are the challenges we'll face in moving from reasoning to agent systems.

Monica:

Indeed, from the venture investment perspective, we've seen major changes in the agent space this year. Especially in Agent Ops and Agent Infra, many new companies have emerged.

These companies focus primarily on engineering implementation and tooling. This shows agent technology is gradually entering actual production environments. People are thinking about how to manage it as a product, as Eric just mentioned, establishing systematic management methodologies. I think this is an important trend we've seen this year.

Kimi, you mentioned earlier that you're working on agent-related work — how will this o1 improvement affect your work?

Kimi Kong:

I want to make two points. First, regarding why OpenAI doesn't do routing. I think OpenAI's foundational belief is that "search and learning will solve everything," and that any over-engineered solution will eventually get washed away along the way. So for them, it's not that they're unwilling to do this; it fundamentally contradicts their core philosophy.

Regarding agent development, I strongly agree with Su Hui and Eric's earlier points. If we want to develop models with stronger agent capabilities, I believe four things are needed:

First, you need a very strong base model and reasoning capability. Improving the base model is an excellent approach.

Second, you need very good tools. You can't give me noisy and bad results — they must be clean and accurate.

Third, you need very good prompting. Currently, agent development is still an over-prompting process. When I use some open-source agent tools like AutoGen, HuggingGPT, LangChain, I find a very tricky problem: casually running an agent workflow using GPT-4o (now roughly $15 per million tokens) might burn through a million tokens, and you might not even know what happened.

Finally, learning — including how to incentivize the model to better use tools, when to use tools, why tool A should be used rather than tool B. This requires us to curate extensive agent datasets and solve these problems through two-dimensional approaches.

Monica:

This agent dataset is even harder to obtain than what we discussed earlier. Without this data, could we first implement agents through some engineering approaches, collect data, and then see which parts can be automated or directly handled by AI?

Kimi Kong:

I think there are two points. The first is still the same as before: how to teach tool use through this kind of data. The Toolformer paper Meta published discussed how to create data to teach models how to use tools. Another approach, to put it bluntly: every day I work at Google, I'm actually helping Google label data. For example, when colleagues ask me to write a feature and I write code for them, it's equivalent to me helping them build a question-to-code dataset.

These can be used to train their internal models. When I'm writing code using prompts and calling tools, I'm essentially helping them build an agent dataset. This is no longer a scientific question — it's a product question. Tesla is a great example, and what's even better is that we help them label data every day without even noticing it while we drive.

But you can't have users labeling data unhappily, because the data quality will be poor unless you pay them a lot. I heard OpenAI hired a bunch of math PhDs at several hundred dollars an hour to label reasoning datasets — just a rumor, don't quote me on that. The key is embedding the data labeling work into the workflow so users complete it naturally. That's what makes a perfect product.

05 Three Core Elements of Reinforcement Learning: Agent, Environment, Reward

Monica:

Let's talk about Chain-of-Thought. For those who have only heard of it or don't know much about it, could you explain what Chain-of-Thought actually is? This method isn't new; it was proposed a couple of years ago. I'd like to ask: when o1 uses Chain-of-Thought, how is it different from previous applications? Su Hui, why don't you start?

Su Hui:

Chain-of-Thought was first proposed around 2022, originally from a paper by Jason Wei, who is now at OpenAI. His research found that when solving problems, models perform better if you provide more detailed steps in the answer rather than giving the result directly.

Around the same time or two to three months later, another paper proposed the concept of "let's think step by step," where during generation, the model naturally produces output in a Chain-of-Thought manner. These two papers essentially laid the foundation for Chain-of-Thought.

After that, much work built upon and improved this approach, and Chain-of-Thought was quickly applied to tasks including math reasoning, commonsense reasoning, and logical reasoning.

I noticed that after using this technique to benchmark, the improvement was quite significant. This area generated many papers, and researchers also applied Chain-of-Thought to reasoning and visual language models.

Currently there are mainly two schools: The first is the explicit school, using explicit tokens to represent the thinking process. There's a lot of room to explore here — your Chain-of-Thought itself can be a chain structure, a tree structure, or even a graph structure. The generation isn't limited to a linear Chain-of-Thought; you can also do verification and refinement. We can introduce critic models or reward models to improve Chain-of-Thought generation. Some work decomposes the problem itself to make the Chain-of-Thought more structured, all of which can improve results. These explicit methods require more inference tokens, which connects to the current discussion around scaling inference and compute.

The other school does this implicitly. Recently some researchers have been trying to integrate system two into system one. Although this is a difficult task, we believe transformers are quite powerful. This actually resembles human thinking — when we think, not every process needs explicit verbal expression. In reasoning, even while you're thinking, sometimes the answer suddenly appears at a certain moment. This process is more like intuition and is difficult to explain through logic.

Recently I've noticed some interesting phenomena. If we view reasoning as related to traditional tasks, Zeyuan Allen-Zhu mentioned in Physics of Language Models that although we found total parameter count correlates with loss and model performance when doing scaling work, on reasoning tasks, depth (the number of model layers) matters more than width (the number of neurons per layer): deeper models perform better. Interested researchers can experiment to verify this.

We've also seen much work confirming this. For example, the recent MiniCPM3, though a small model, uses over sixty layers. The industry is converging on this conclusion: even with fixed parameter counts, we'd rather sacrifice inference cost.

A deeper model costs more at inference, and in optimization, wide models are easier to train than deep ones. But we'd rather increase layer count to improve reasoning ability, because generating each token requires computation through all layers. If we consider the relationship between total generated tokens and total layers, more tokens and more layers per token may both improve reasoning. This means we are not only increasing the number of generated tokens: once model depth also increases, the product of the two significantly increases computation during inference.

Actually, when both layers and tokens increase, each token passes through more layers, and the number of tokens itself also increases. After multiplying these, computational cost becomes higher.
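The multiplication being described can be shown with a back-of-the-envelope count. This sketch only counts layer passes and deliberately ignores attention cost growing with context length, layer width, and batching effects. The layer and token numbers are invented for illustration.

```python
def layer_passes(num_layers, num_generated_tokens):
    """Each generated token runs through every layer once."""
    return num_layers * num_generated_tokens

# Hypothetical baseline: a 32-layer model producing a 500-token answer.
base = layer_passes(num_layers=32, num_generated_tokens=500)

# Hypothetical reasoning setup: twice the depth, four times the tokens.
deeper_longer = layer_passes(num_layers=64, num_generated_tokens=2000)

ratio = deeper_longer / base  # 2x layers times 4x tokens
```

Doubling depth and quadrupling the Chain-of-Thought length gives eight times the layer passes, which is why both levers show up directly in inference cost.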

At this level, we do find this improves reasoning performance. Including operations like reflection — this was actually being done by many people in previous LLMs.

Because the biggest problem with traditional LMs is they can't backtrack. If they generate an incorrect token, there's no way to correct previous mistakes — they can only continue generating along the wrong path, which causes many hallucination problems. But if we explicitly learn this pattern, allowing the model to reflect on previous issues, admit it might have problems, and then get a chance to backtrack, adding this data pattern to training significantly improves performance on reasoning tasks. In a sense this also increases the number of generated tokens, since reflection introduces additional tokens. But ultimately we see one conclusion: whether by increasing layer count or directly increasing the number of generated tokens, reasoning performance can be improved.

Cage:

I'd like to ask a question. We discussed CoT earlier, and also talked about MCTS. Could the guests explain their relationship in the o1 framework? Because CoT's subsequent evolution also has depth in layers, and developed into Tree-of-Thought, which already sounds quite close to MCTS's thinking. So I'd like to ask how coupled everyone thinks these two are?

Su Hui:

Technological developments influence each other. You'll find that work in different directions eventually shows some similarities. These works were actually initially developed independently — on one hand researching how to improve model performance through parameters, on the other hand improving model performance at the algorithm level. But ultimately they all converged to similar approaches like MCTS.

Monica:

Do you think the training methods o1 uses might differ from how we previously trained LMs?

Su Hui:

Actually there's a big change. There was an embarrassing incident before with that Reflection model, maybe two months ago on Twitter, around the same time as Llama 3.1; it felt a bit like a prop. It actually just used a small portion of reflection data for SFT and then claimed to be a very strong model, but finally everyone found it wasn't that good. In a sense it was somewhat dishonest behavior.

But this pattern is actually worth verifying. During SFT, if we use some higher-quality reflection data, it's different from traditional Chain-of-Thought. Traditional methods solve problems step by step with no backtracking — I won't reflect on where previous problems occurred, it's completely sequential execution, where the next conclusion must be derived from the previous step. But with reflection, there's much more room for backtracking.

The model likely already knows how to solve the problem before generating Chain-of-Thought, but if it makes mistakes during generation there's no chance to go back. This is quite painful, but if given the chance to reflect, as long as it can eventually determine the solution, it can ultimately get it right. This is the biggest difference between what o1 demonstrates and what we did before. Of course, some previous Chain-of-Thought work also contained this kind of naive thinking.

But if learned only through SFT, or just through an external verification model to achieve backtracking, without a sufficiently strong reward model providing policy learning, the effect will be much weaker. The model may only learn a superficial behavior — that it can backtrack, and maybe even backtrack when it's already correct. It just learned a pattern without truly understanding what it's doing.

Monica:

Regarding the question from earlier, I'd like to hear Eric's thoughts.

Eric Li:

I think these two are related, as another guest just mentioned — somewhat converging to the same destination.

On the Chain-of-Thought side, we've seen many derivative studies. If Chain-of-Thought is a chain, then there might be Tree-of-Thought, Graph-of-Thought, and so on. These all explore when your reasoning structure has multiple different choices, which one to choose.

MCTS, as a more traditional planning or search method, estimates in traditional reinforcement learning which action among multiple possible actions can obtain greater reward, greater value.

The development path of MCTS comes more from AlphaZero — it developed in the relatively specific domain of Go. But Chain-of-Thought, Tree-of-Thought, Graph-of-Thought and this series evolved more from natural language processing, from ideas that developed within language models themselves.

At their core, I think both are exploring how to plan and reason, so in that sense, they're actually quite closely related.

Monica:

Everyone's been speculating whether o1 uses MCTS internally. I'm curious what your guess is?

Eric Li:

I'm not entirely sure myself, but I think if they were to use MCTS, there would probably be two approaches.

The first is using it at inference time, which would require a very good reward model. During the thinking process, the system would constantly try different paths, like playing Go. For example, when we're halfway through a game and need to decide the next move, if there are five different options, I'd estimate the potential reward for each choice and select the direction that maximizes reward. I read the Zhihu article you shared, and from a reverse-engineering perspective, if the token cost we're seeing now is linear, then MCTS probably isn't in the inference phase.

I think the second approach is more likely: using MCTS in the data processing stage. For instance, when processing training data, using MCTS strategies to find the optimal reasoning data to train the model, or integrating search strategies during reinforcement learning to help the policy model find the best reasoning approach. So if I had to guess, I'd say the probability of MCTS being used at the data level or during the RL process is higher than using it at inference time.
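Whichever stage it is used in, the core of MCTS is the selection step Eric describes: estimating which of several candidate moves is most promising. A common rule for that step is UCT (UCB1 applied to trees), sketched below. The child-node dictionary layout and the exploration constant are assumptions for this example, and nothing here is a claim about o1's internals.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCB1 applied to trees: mean value plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, c=1.4):
    """children: dicts with 'name', 'value' (summed reward), and 'visits'."""
    parent_visits = sum(ch["visits"] for ch in children) or 1
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c),
    )

# Hypothetical statistics for three candidate next steps.
candidates = [
    {"name": "a", "value": 9.0, "visits": 10},
    {"name": "b", "value": 1.0, "visits": 10},
    {"name": "c", "value": 0.0, "visits": 0},
]
chosen = select_child(candidates)["name"]
```

For language models, the `value` estimates would have to come from a reward or value model, which is why both approaches Eric mentions depend on having a good one.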

Monica:

Let's come back to Kimi. We've discussed a lot about the possibility of o1 using reinforcement learning. Is there any angle you feel we haven't covered?

Kimi Kong:

Let me take a step back and explain what reinforcement learning actually is. This will help everyone better understand why RL can be effective across different industries. Reinforcement learning requires several basic components.

First, you need an agent — that is, a model. In the language domain, it's an LM. In robotics, whether it's physical robots, simulations, Atari games, or Google's AlphaGo, you need an agent.

With an agent, you need an environment for it to interact with. For example, a physical robot needs to interact with the physical world around it, but the physical world is hard to model — that's why we haven't seen real robots widely deployed yet. Though I believe this field has tremendous potential, and maybe very soon we'll see a GPT-3.5 moment in the robotics domain.

More generalized environments include Atari games and Go. RL developed faster in these areas because they're well-controlled environments. In these environments, sample data is free. Running an LM to sample is expensive, but in simulation, you can do infinite sampling at any speed and frequency — even twice as fast as real time — which makes simulation a perfect reinforcement learning environment.

Finally, you need a reward to tell the model how good or bad each action is. For example, in Atari games, winning or losing is a very deterministic reward; in AlphaGo, the final outcome is also a deterministic reward. These well-controlled environments created favorable conditions for early RL research papers.
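The three components described above (agent, environment, reward) can be sketched as a very small interaction loop. This is only a toy under stated assumptions: `CorridorEnv` and `run_episode` are made-up names, the environment is a five-cell corridor, and the reward is a deterministic 1.0 for reaching the end, in the spirit of the win/lose signals mentioned for Atari and Go.

```python
class CorridorEnv:
    """Toy controlled environment: walk from cell 0 to cell `length`."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot go below cell 0.
        self.pos = max(0, self.pos + action)
        done = self.pos == self.length
        reward = 1.0 if done else 0.0  # deterministic reward: only reaching the goal counts
        return self.pos, reward, done

def run_episode(env, policy, max_steps=50):
    """Let the agent (a policy function) interact with the environment once."""
    state = env.reset()
    total_reward = 0.0
    for step_idx in range(1, max_steps + 1):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            return total_reward, step_idx
    return total_reward, max_steps

always_right = lambda state: +1
always_left = lambda state: -1
print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), always_left))
```

Because the simulator is cheap, episodes like this can be sampled as often as needed, which is the property that made games such convenient early RL testbeds.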

There was substantial progress in reinforcement learning. The first was the DQN paper, and then various evolutions of DQN followed, like Double DQN and Dueling DQN. People weren't just working on value functions anymore — they started doing policy networks too, like REINFORCE. Then people realized you don't just need a policy network, you also need a value network, and you need to combine the two into an actor-critic approach.

This could then evolve into on-policy, off-policy, or deterministic, stochastic directions. For instance, DeepMind's DDPG, and the work on TRPO and PPO by an RL researcher I greatly admire who was originally at OpenAI and later went to Anthropic.

At the end of the day, RL hasn't seen much algorithmic development in many years. The most SOTA was probably the SAC paper from Sergey Levine's lab, around 2018 or 2019. Since then, there haven't been more significant advances at the algorithmic level of reinforcement learning.

Now people mainly focus on RL applications in specific domains, especially in the LM field where it's quite hot.

If you look back at this question, AlphaGo is actually very similar to LMs. AlphaGo also had two steps. There was a pre-training step, which at the time was called imitation learning: learning from expert games. With this good base model, the question became how to do better than humans. This goes back to what Eric Li said earlier: we can let the model improve itself through self-play.

After AlphaGo, they wondered if they could remove this pre-training step entirely and train purely with reinforcement learning, which led to AlphaGo Zero. Then they wondered if they could make it play more than one type of game, which led to AlphaZero, capable of playing shogi, Go, and chess.

Their ultimate solution was MuZero — not only learning how to win games, but also learning a simulation network, meaning given an environment state and an action to take, the model can predict what the next state will be.

You might then wonder: could LMs also completely skip pre-training and be trained with reinforcement learning, like AlphaZero, purely through self-play?

But this is actually very difficult, because reinforcement learning requires a deterministic reward function, and LMs are very hard to equip with such a reward function.

Second, you need a controlled environment. For Atari games or Go, I have a perfect controlled environment, but for LMs, humans are the environment — I can't keep answering questions for an LM all the time. Although there are some tricks to do self-play, like having two LMs question and answer each other, because these two conditions are missing, RL can currently only do alignment work for LMs, and can't completely solve LM problems through pure self-play reinforcement learning.

That's the evolution of reinforcement learning and its application to LMs.

06 The Scalable Pattern for Reward Models Might Be

"Human in the Loop Combined with AI Feedback"

Monica:

Could you expand on the application of reinforcement learning in robotics? Drawing from your previous experience working on LLMs for robotics, what insights could be borrowed?

Kimi Kong:

That's a good question. I think reinforcement learning is fundamentally a general technique, not limited to robotic RL. It's about achieving goals through an agent, environment, and reward function within your defined environment.

Honestly, I really miss the days when I was doing RL research — the environment was so pure. It was simple: you just needed to win the game, the reward was completely deterministic, you didn't even need to think about reward function design. Now it's become more complex, but this complexity also brings greater potential, because the reward model is no longer deterministic, not limited to playing games, but can potentially generalize to other domains.

In robotics RL, there are mainly two research directions.

The first is locomotion, like the work by Stanford's Tony and his group. This direction actually has little to do with language models — it's mainly through imitation learning plus RL approaches, requiring human demonstrations to teach robots how to manipulate and move.

The other direction is planning, like Google DeepMind's early SayCan project. This type of work requires explicit task description rather than demonstration. In planning tasks, LLMs are a very popular approach, from early work like SayCan and Code as Policies, to later PaLM-E, and then the RT-1, RT-2, RT-X series.

In practical applications, because robot data is limited, we don't want to use purely robot data and cause model performance to degrade, so we do co-fine tuning with robot data and vision data along with VQA tasks, then collect reinforcement learning data to refine the model.

At its core, there isn't really a fundamental difference — it's mainly about application scenarios and data formats. This data might not be traditional tokens, but rather robot motor force, torque, or sensor data. But the backbone is all transformer architecture, and they all use reinforcement learning training techniques to help the model converge better in a specific domain.

Monica:

You also mentioned self-play just now. What's its research history and industry application in the reinforcement learning field? Do you think o1 uses self-play technology?

Kimi Kong:

That's hard to say for certain. But if I were doing this, I would definitely use self-play, because it allows you to continuously scale and refine the process. The biggest advantage of reinforcement learning is that it enables incremental improvement at every step. This is different from SFT, where you finish training one epoch and it's done. Instead of getting there in one go, we can do better — the data and queries are still there, you can run the model on that query again, annotate it again, and do countless rounds of self-play on that same query.

I think self-play is a scalable reinforcement learning training technique, and it's a very good technique in the language model domain.

Monica:

So what's the relationship between that and the CoT and reflection we discussed earlier?

Kimi Kong:

That's a good question. I think when discussing CoT, it's more as a prompting technique — meaning I want to prompt this model to help me do something. You can use CoT to solve problems, or you can use CoT to generate synthetic data to train models.

But self-play is more of a training technique, used when training reinforcement learning models to continuously push forward the reinforcement learning steps. I think these are two relatively independent topics. Feel free to correct me if I'm wrong.

Monica:

Regarding self-play, I'd like to hear your views on its relationship with CoT, and its role in o1 or in improving model capabilities going forward. I'm thinking of the recent paper from Denny Zhou at DeepMind, "Chain of Thought Empowers Transformers to Solve Inherently Serial Problems." It was very eye-catching on Twitter, where he wrote: "What is the performance limit when scaling LLM inference? Sky's the limit." This paper is essentially about how CoT enables transformers to improve their capabilities. What's its relationship with the self-play that Kimi just mentioned?

Eric Li:

I think CoT and self-play are two relatively independent approaches. CoT is more about using chain-of-thought, increasing inference-time computation, to let the model solve problems that are inherently difficult. Self-play is more like AlphaZero, where through self-play you can keep improving incrementally, for example in Go playing strength.

Regarding o1, I'm not sure whether they used self-play, but looking at the MCTS lineage, in the direction of LM plus reinforcement learning, people still tend to draw on the successful experiences of the previous generation of reinforcement learning. MCTS became very popular when DeepMind did AlphaZero. I believe self-play, even if OpenAI hasn't used it in o1 yet, is a very promising approach. There are probably already many people researching it, and I'm quite optimistic about its future — it can serve as a strategy for model self-improvement.

I haven't fully read this paper myself, just looked at the abstract, but I think it's a theoretically very interesting article. It's what the entire AI academic community needs: theoretical work that reveals where the upper limits of our existing model capabilities lie. For me, this is a very insightful article. It can at least answer one thing: the transformer plus CoT architecture has very strong expressive power.

Of course, I've also seen people saying this might be similar to the situation with deep neural networks back then. But I think this paper tells us mathematically where the upper limit is, which is equivalent to motivating us to design better CoT and better transformer architectures next. This shifts the question from whether it can be solved to how to solve it better.

From the perspective of computational irreducibility, for many problems, if you want to get an answer, there may be a minimal computation cost requirement. For example, if you want to simulate a fluid dynamics state, while guaranteeing certain precision, there's a non-zero lower bound on the required computational cost. This also has corresponding manifestations in the CoT domain: for complex problems, you do need some additional computation to get relatively accurate solutions. This is my understanding of why CoT is considered a form of adaptive computation.

Su Hui:

Let me first discuss the "sky's the limit" paper. This paper generated a lot of discussion on Twitter, including pushback from researchers like Tian Yuandong. They believe this paper's claim is essentially similar to the statement that "a two-layer neural network can fit any function": it's just a construction that fits a specific target function.

But whether this solution can theoretically be reached, or whether a better path can be found, is not guaranteed. Although you can solve the answer through exhaustive methods, this approach isn't realistic. What we really need is the ability to accurately and directly give the answer. I somewhat agree with this view — that having an answer and being able to correctly solve it through existing methods are two different things. You can't say that some random probability appearing equals achieving this function.

Regarding the use of self-play, I noticed that searching for "self-play" on OpenAI's official website turns up work going back to 2017-2018 and continuing until 2022. Although OpenAI hasn't officially acknowledged using self-play since then, this relates to the background of newer-generation researchers like Noam Brown. They previously worked mainly on poker AI and related zero-sum game research, and these researchers' research tastes and paths won't change significantly in the short term.

In a recent YouTube speech, Noam Brown mentioned an important conclusion about LLMs at the end: He believes you need to ensure both the generator and verifier are sufficiently powerful to achieve the goal. Looking at the timeline, we have now reached the preconditions he previously proposed, so applying this method in LLMs is entirely reasonable.

Cage:

Reward models will indeed be a major research direction in the future. Earlier Monica asked what everyone thinks of GPT-4o's performance, and both guests' answers were related to mathematical reasoning and coding.

Rewards for reasoning and math are relatively easy to define: these domains inherently have verifiers that can directly say whether a result is right or wrong. But in other domains it's hard to have such clear reward models. I'm wondering what the guests think about whether reward models can generalize and achieve scalability across domains in the future?

Su Hui:

This kind of process reward model has definitely been practiced at scale. From the math work through later work like CriticGPT, these are all part of the same lineage. Our base models, say GPT-4, are already strong generator models, and the verifier models are also trained based on GPT-4 level models. Although its reward model still gives discrete signals, the process is more believable, because it may build stronger confidence through embedded reasoning before ultimately giving the signal.

This in some sense breaks away from the previous RLHF training paradigm. Previously RLHF was built on the binary Bradley-Terry statistical model — you had to collect some preference data, at least ranking between two options. But if you go this route, where a strong model repeatedly reasons and gives results, it may not need this training pattern.

I'm using a general-purpose model, mainly for the purpose of scoring, and this scoring is likely based on my own relatively strong set of rules, and I should be giving results through my own generated chain-of-thought. I think this may be a somewhat different approach.
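For readers unfamiliar with the Bradley-Terry setup mentioned above, here is a minimal Python sketch of how pairwise preference data turns into a training signal. Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is the sigmoid of the difference between their scalar rewards, and the reward model is trained to minimize the negative log of that probability. The function names are illustrative, and the reward values are made-up numbers rather than outputs of any real model.

```python
import math

def bt_preference_prob(r_chosen, r_rejected):
    """P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the observed human preference."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))

# Hypothetical reward-model scores for two answers to the same prompt.
print(bt_preference_prob(0.0, 0.0))   # equal scores: no preference either way
print(bt_loss(2.0, 0.0))              # model agrees with the label: small loss
print(bt_loss(0.0, 2.0))              # model disagrees with the label: large loss
```

The contrast Su Hui draws is that a strong model that reasons before scoring would not necessarily need this pairwise-ranking formulation at all.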

Eric Li:

I very much agree — reward model is an underestimated problem, especially compared to those cases like verifiable math problems or coding that are relatively easy to verify.

Now many people are researching AI feedback, because we hope that in certain domains, AI can indeed give more effective feedback than humans. For example, consider a scenario where I want to write two science fiction novels, two versions, and need to judge which is better. For humans, reading several million words is quite difficult and very time-consuming.

But for an LLM, it can help you quickly process data, understand text content, and produce summaries. I think a scalable approach going forward is "human-in-the-loop combined with AI feedback" — in situations where humans would need to spend a very long time, or where ordinary people can't easily discern preferences, we can use AI to reduce the difficulty to a level that humans can detect and comprehend, and then have humans provide their own preferences. I think this will be a more scalable approach for certain domains.

Multi-Agent Is a Transitional State Before Supermodels Emerge

Cage:

You've all helped us gradually piece together several discrete technical points into something like a panoramic view. Building on that, I'd like to ask about something else that's been discussed quite a bit recently — there's even been debate on Twitter about it. Do people think o1 is a single model, or could it be a Multi-Agent multi-system?

Because on one hand, we saw OpenAI's AMA hour where they said it's just one model. But at the same time, Noam Brown, the young researcher Su Hui mentioned earlier, recently posted a job opening for Multi-Agent research.

And when you think about the AlphaGo and AlphaZero systems, their network wasn't single-objective either. It had both a policy network and a value network, simultaneously handling task execution and evaluation.

So I'd like to ask the panel: if we're trying to reproduce o1, is it likely a system combining multiple models, or could it be one neural network solving everything?

Monica:

Pure speculation — I'm not responsible for the guess. I saw a similar speculative post on Zhihu where the author said "this is pure speculation, and if you train your company into bankruptcy following this, I'm not responsible." We're all just here to hear how each other thinks through this.

Kimi Kong:

I mostly agree with what Eric said earlier about the five different levels of AI development. The first level, conversation, is already complete. We're now in the second level, the reasoner stage. Based on that roadmap, my personal view is that it's more likely a single large model, while the next stage might be Multi-Agent — or at least a single-agent model.

Monica:

Is your guess coming more from the performance side and OpenAI's technical aesthetic path?

Kimi Kong:

Yes, more from a strategic angle. I think you have to do one thing at a time — first build a very good base chatbot model, then use it to prompt out reasoning data. Once you have powerful reasoning, you can do better tool use and function calls, which is probably what the next model version will tackle.

I don't think OpenAI's research direction is an over-engineered solution. The industry hasn't yet found the best way to train Multi-Agent LLMs. I'm more inclined to solve the low-hanging fruit first — get a powerful reasoning model as the foundation, then step by step execute the roadmap, eventually reaching level 5.

Monica:

I Googled it — AI development is divided into Level 2 Reasoners, Level 3 Agents, Level 4 Innovators, Level 5 Organizations. So we're still in the reasoner and agents stages.

Kimi Kong:

Right, probably somewhere between 2.1 and 2.5.

Monica:

Actually at the application layer, when we use Multi-Agent architectures, we encounter some pushback. People say Multi-Agent just adds system complexity, and communication between agents can create a lot of waste. The fundamental reason is that your individual agent itself isn't good enough — if you had a really capable agent, you wouldn't need Multi-Agent in many scenarios. It's like Tanu Robotics' autonomous vehicles, where model replacement requires considering these system architecture trade-offs.

Kimi Kong:

I think there are roughly several questions to address. First, we can go through the history of the Multi-Agent field.

Multi-Agent is an important topic in classical reinforcement learning. The most famous paper is probably by David Silver, a paper I deeply admire: MADDPG (Multi-Agent Deep Deterministic Policy Gradient). Compared to DDPG (Deep Deterministic Policy Gradient), which trains a single agent to do one thing in a single environment, MADDPG can train many agents to complete a non-zero-sum cooperative task. To make this problem tractable, it made a lot of simplifying assumptions; otherwise it would be computationally infeasible.

Regarding Multi-Agent, I know some background. After MADDPG, a lot of Multi-Agent research emerged, but I didn't continue following that direction. When it comes to Multi-Agent applications in language models, essentially you can prompt a model to do one thing, right?

First you have it do Step 1, putting your generative model hat on to generate content. After completing step one, through CoT you enter step two, telling it now put your critic hat on to evaluate the result. This version of the model needs to think carefully — if it thinks everything is correct, it gives the final result; otherwise it goes back to step one and tries again.

Actually in this process, the model is doing many things, right? Rather than Multi-Agent, it's more like Multi-Task. The problem is that when the model does Multi-Task, it may not easily shift attention from generation to criticism. What people are doing now with Multi-Agent in the language model field is mainly through prompting different personas, separating generator and critic: the generator's job is to generate content, while the critic focuses on evaluating results. I think this is a very interesting direction, especially for developing next-generation agents, though I may not have fully kept up with the latest Multi-Agent research on language models.
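The "generator hat, then critic hat" loop described above can be sketched in a few lines of Python. This is a sketch under assumptions, not a description of any production system: `generate_with_critic` is a made-up helper, `call_model` is a placeholder for whatever LLM call you use, and `fake_model` is a hard-coded stand-in so the example runs without any API.

```python
def generate_with_critic(task, call_model, max_rounds=3):
    """Alternate generator and critic prompts until the critic accepts or rounds run out."""
    feedback = "none"
    draft = ""
    for _ in range(max_rounds):
        # Step 1: generator persona produces a draft, seeing earlier criticism.
        draft = call_model(
            f"You are the generator. Task: {task}\nPrevious feedback: {feedback}"
        )
        # Step 2: critic persona evaluates the draft.
        verdict = call_model(
            f"You are the critic. Task: {task}\nDraft: {draft}\n"
            "Reply 'OK' if correct, otherwise explain the problem."
        )
        if verdict.strip().upper().startswith("OK"):
            return draft, True
        feedback = verdict  # go back to step 1 with the critic's objection
    return draft, False

def fake_model(prompt):
    """Hard-coded stand-in for an LLM (assumption): wrong at first, fixed after feedback."""
    if prompt.startswith("You are the generator"):
        return "4" if "arithmetic error" in prompt else "5"
    return "OK" if "Draft: 4" in prompt else "arithmetic error"

answer, accepted = generate_with_critic("What is 2 + 2?", fake_model)
print(answer, accepted)
```

Whether the two personas are one model prompted twice or two separately trained models is exactly the Multi-Task versus Multi-Agent distinction Kimi raises.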

Translator's note: A fascinating point — separating writing and editing. Kevin Kelly has said almost exactly the same thing.

I actually lean more toward seeing Single-Agent breakthroughs in the near term, as with traditional RL. RL's initial breakthroughs all emerged in the Single-Agent domain. Once you have a very strong agent, it becomes possible to easily generalize to Multi-Agent systems using similar training methods.

Monica:

Is OpenAI o1 end-to-end or Multi-Agent?

Eric Li:

My guess is relatively conservative — I think it's probably single or two agents, but unlikely to be a more complex Multi-Agent system.

OpenAI previously did a lot of work on reasoning and verification, such as framework setups where two agents solve math or coding problems. I think o1 is likely just a single agent, but at inference time it may incorporate its critic or light supervision for verification. As for why people challenge the Multi-Agent notion, it depends on the capability level of the single agent.

I believe that in the present and foreseeable future, Multi-Agent will outperform Single-Agent capabilities. Even humans need cooperation and division of labor to produce better results. Someone at Einstein's level still makes mistakes. Since I did a physics PhD, I know that doing quantum physics in the last century required a whole team of people cooperating and dividing labor to truly build up complete physical theory. So before our Single-Agent reaches Einstein-level intelligence, I believe Multi-Agent performance will definitely be better, because it can provide different perspectives and lines of thinking.

Of course, if a superhuman-level Single-Agent emerges in the future, the ultimate evolutionary form might regress back to Single-Agent — that's a more philosophical-level thought.

Su Hui:

In my view, there's no need to doubt this — they're fundamentally one model. Including previous end-to-end models, increasingly more evidence supports this point. I personally tend to believe they must be a single model, although multiple models can indeed improve performance on many tasks at this stage. Setting up various roles in formal workflows to cooperate and solve problems together — I see this as a product of the transitional phase.

If the goal is the stars, if it's AGI, then the final model shouldn't be multiple AGI models working together — it should be a single model handling everything, omniscient and omnipotent.

Right now, people use Multi-Agent or other methods mainly to address corner cases or instability in intermediate reasoning steps, but these are all transitional approaches.

For instance, in tool use, a model might struggle to properly understand and invoke functions because it only grasps the basic functional descriptions of function calls or tool use. Much agent optimization work follows human usage patterns, continuously summarizing user behavior and feedback, then adding this information to prompts to refine function descriptions and invocation probabilities.

But after o1's release, many such cases will be obviated, because the model is capable enough to invoke tools with perfect accuracy.

07 Why Is Gaming Ability

Worth Watching for LLMs?

Monica:

Recently there's been a project using o1-preview to play Black Myth: Wukong, though combining games with LLMs isn't exactly new. With more recent LLMs that have stronger reasoning capabilities being used for gaming, has anything struck you as particularly impressive? Also, for using games to generate training data, how might this help with further improvements now that we have new paradigms like o1-preview?

Su Hui:

I saw the news and looked up the related paper — it turns out they used GPT-4o. The implementation works by taking game screenshots as input, using a vision model for scene understanding, then generating actions as Python code to control the game. Doing this with GPT-4o would indeed be quite expensive.

AI has actually been good at games for a long time — starting with Dota, then StarCraft. Previously everyone thought you needed massive amounts of gameplay for reinforcement learning, but that's changed. Earlier approaches didn't use language models at all; you had to define the game's state space yourself and use pure reinforcement learning.

This Black Myth case is special because it directly used a pre-trained vision model and language model with no additional training. What's most surprising is how powerful models' visual and textual understanding has become. I think the next step — using stronger models to play games humans enjoy — will likely surpass human-level performance. And crucially, it won't require specialized training on specific games. This represents a new watershed moment.

Monica:

I know everyone mentioned needing more new types of multi-step data earlier, so I'm curious — in fully simulated game environments, is it relatively easier to collect this kind of step-by-step data?

Su Hui:

Yes, data collection would definitely be easier. This reminds me of AlphaGo's evolution — early AlphaGo couldn't do without human game records, but by the AlphaZero era it needed no human games at all. Open-world games are similar: if you take the AlphaGo route, you need human operation records to learn from.

But with the AlphaZero approach, you only need to define the action space and let the AI explore autonomously from scratch in an open world. These are two completely different approaches.

Eric Li:

On the topic of using large models to play games, I think there are two very impressive points. First, as Su Hui just mentioned, it didn't specially train a model with reinforcement learning to play the game, which is completely different from Google DeepMind's approach with Dota. It relies entirely on in-context learning capability to solve sequential decision-making problems.

This demonstrates a very impressive capability of Foundation models, showing its planning ability. It can plan which action to take first when fighting a monster, then which action next, to ultimately win. This showcases not just image understanding, but more importantly strong decision-making capability.

Regarding using gameplay data to get more training data, Jason Wei previously did a paper studying how to learn real-world physics knowledge. They used physics simulator engines to get signal. More broadly, for a simulated AI system or single agent, when it interacts with an open world, the collected data is particularly interesting. This feedback can effectively produce reasoning data, because whether it's gameplay or open-world problems, it's relatively easy to verify the final result's correctness. This differs from human feedback that only tells you pairwise which is better — like with gaming, coding, and math, you know whether you ultimately won or lost. This clear signal can help us better synthetically generate reasoning and planning data.
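One simple way to picture how a clear win/lose signal turns into synthetic reasoning data is a sample-then-verify filter: generate several attempts per problem and keep only those a verifier accepts. The sketch below is an assumption-laden toy, not anyone's actual pipeline. `collect_verified_samples`, `noisy_adder`, and `check_sum` are invented names, and an addition task stands in for a game outcome or unit test.

```python
import random

def collect_verified_samples(problems, sample_fn, verify_fn, attempts_per_problem=4, seed=0):
    """Keep only the sampled attempts that the verifier marks as correct."""
    rng = random.Random(seed)
    kept = []
    for problem in problems:
        for _ in range(attempts_per_problem):
            attempt = sample_fn(problem, rng)
            if verify_fn(problem, attempt):
                kept.append({"problem": problem, "attempt": attempt})
    return kept

# Toy stand-ins (assumptions): a "model" that adds two numbers but is sometimes
# off by one, and a verifier that knows the right answer.
def noisy_adder(problem, rng):
    a, b = problem
    return a + b + rng.choice([-1, 0, 1])

def check_sum(problem, attempt):
    a, b = problem
    return attempt == a + b

dataset = collect_verified_samples([(1, 2), (10, 5), (7, 7)], noisy_adder, check_sum)
print(len(dataset), "verified samples kept")
```

The filter only works because the verifier is cheap and unambiguous, which is the contrast Eric draws with pairwise human feedback.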

Monica:

In current large model training, is gameplay data used much?

Eric Li:

I haven't seen many people using this area yet. I don't know what the situation is at OpenAI or other companies. It feels like Google, because it values its existing product lines so much, prioritizes improving those products. But I think this is a fairly interesting direction worth trying.

Monica:

Everyone mentioned that large model companies have started using synthetic data, so I assumed a considerable portion might come from gameplay data.

Eric Li:

Currently synthetic data is used more for activating a model's generation capabilities. Simulation data is still relatively rare, but projects like the Multi-Agent work and Stanford Town we mentioned earlier show that, in the future, data could be generated by simulating societies. These can all be done through Multi-Agent simulation, using simulators and game engines combined with physics engines.

08 OpenAI o1's New RL Paradigm

Raises the Bar for Followers

Monica:

Today we've invited DeepMind guests with deep research in RL and MCTS. Some time ago people discussed that Google actually started research along similar lines to o1 relatively early. For example, Google DeepMind published the paper "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." I'm curious how the researchers here view this relationship. It seems Google started this research path early, so why was OpenAI the one to first deliver o1?

Kimi Kong:

(slightly hesitant) I'll summarize in one sentence and leave the rest to everyone's imagination: the transformer was invented by Google, but GPT was first trained by OpenAI. You can all imagine for yourselves why they released o1 first and not us.

Monica:

What was the attention and reception for this work before o1 came out? It sounds like it didn't receive much attention.

Kimi Kong:

(laughs, momentarily speechless) I may have heard similar research, like these small research projects people did. For example, these Google papers all demonstrated on specific domain datasets that reasoning helps.

But I didn't see any very large-scale attempt. Fundamentally, it's about whether you want to publish a paper proving it works on a clean dataset, or whether you truly want to solve nasty problems and do 10X or 100X scale-up — I think this requires a different mindset.

Eric Li:

I had seen some related research within Google about reducing inference cost, but they were relatively scattered, independent analyses. However, before o1 came out, I indeed hadn't paid attention to this particular paper. This paper gives a more systematic analysis and summarizes things very well.

From a research direction perspective, since o1's PR was so effective, Google will definitely improve its models' reasoning capabilities, striving to match or exceed o1. But strategies that scale up inference cost aren't suitable for some commercial scenarios, particularly those with strict latency requirements. By comparison, what people may pay more attention to is performance improvements in Gemini or within their own domains.

Monica:

So can this be understood as o1's emergence making this direction an industry consensus? Su Hui, anything to add?

Su Hui:

Latency is indeed a critical issue. If you can find an application where users accept waiting 10 minutes, 20 minutes, or longer, and ultimately complete a good task, or design products with some offline operations, this might create new product opportunities.

But for existing product forms like role-playing or general chatbots, this approach would be difficult to implement.

However, if you can migrate this training logic framework to improving the Pareto frontier, it would be quite valuable — for example, making trade-offs between safety and reasoning capability, using this training approach to raise the ceiling. In specific application scenarios, such as those needing to balance safety and role-playing capabilities, this approach is viable.

Cage:

The latency issue everyone discussed earlier — I completely agree. I experienced this firsthand after connecting Cursor to o1. The difference was stark compared to before. Previously, auto-completion and composer were both very fast; now it takes a long time to think, so you'd need massive performance improvements to make up for that time trade-off.

From the perspective of major tech companies and commercialization, catching up to GPT-3.5 and GPT-4 previously took roughly half a year to a year. Will the AI community catch up faster to o1's reinforcement learning approach for improving reasoning capabilities?

Monica:

What does this new paradigm mean for those playing catch-up?

Su Hui:

I'm inclined to say it's harder.

First, you need to build on top of a stronger base model. With a weak model, you won't have a strong reward model, so the returns from this approach will be minimal and generalization unlikely.

Second, if you're using strategies like MCTS, this is an extremely GPU-intensive training method with inference in the loop. Your MFU (model FLOPs utilization), or GPU utilization, will be extremely low. Compared to the relatively good GPU utilization already achieved when training dense or MoE models today, the compute overhead won't be lower than pre-training, and it might even be higher. For many companies this is a bigger challenge, because you might essentially be doubling your pre-training compute costs.

Cage:

On the GPU utilization being low yet consuming more resources — can you explain why o1's training approach leads to this?

Su Hui:

Because during sampling and decoding, GPU utilization is much lower than during the training phase. This process has to be integrated into training, which creates a lot of waiting time.
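The utilization point can be made concrete with a back-of-the-envelope calculation. The sketch below is in Python, and every number in it is an illustrative assumption, not a measurement from any real training system. `blended_utilization` is a hypothetical helper that takes a time-weighted average over the phases of one training step.

```python
def blended_utilization(phases):
    """Time-weighted average utilization over (seconds, utilization) phases."""
    total_time = sum(seconds for seconds, _ in phases)
    if total_time <= 0:
        raise ValueError("total phase time must be positive")
    busy_time = sum(seconds * util for seconds, util in phases)
    return busy_time / total_time


# Assumed numbers: in an RL step, sampling dominates wall-clock time
# and runs at much lower utilization than the gradient update.
rl_step = [
    (60.0, 0.10),  # rollout sampling and decoding
    (20.0, 0.40),  # gradient update on the sampled data
]
pretrain_step = [(20.0, 0.40)]  # gradient update only

rl_util = blended_utilization(rl_step)
pretrain_util = blended_utilization(pretrain_step)
print(f"RL step: {rl_util:.3f}, pre-training step: {pretrain_util:.3f}")
```

Under these assumed numbers, the same hardware is busy for less than half as large a fraction of the step once sampling is folded into training, which is the sense in which low utilization and high total compute cost can coexist.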

Monica:

The compute requirements are very high, and that normally calls for extremely powerful training chips plus very large clusters. You see OpenAI and Meta both building clusters on the scale of a hundred thousand GPUs. In the post-training stage, if the workload is more like inference compute, wouldn't the requirements for GPU performance and cluster scale be relatively lower?

Su Hui:

This is a major engineering challenge. We're not talking about inference deployment after training is complete; we're talking about an integrated training-inference process. For pure inference, you can use lower-performance GPUs and mainly need to handle communication. But at scale, because sampling is embedded within training, it's not realistic to engineer a solution where you generate text through inference and then process it on separate machines. So you still need the best GPUs for reinforcement learning training.

Kimi Kong:

I think any task comes down to a few major components: data, model, and training framework. Besides the training compute challenges, Su Hui also touched on how hard it is to get a strong base model when you depend on the latest open-source SOTA models. I'm wondering if the most SOTA open-source model right now is Llama 405B?

If you're at Google or OpenAI, you train your own largest model and never need to consider which base model to use. But right now there isn't a good open-source base model, which means you may have already gone down many wrong paths just in selecting your base model.

On the data side, you can see OpenAI purposely hid its reasoning content, only giving you summaries of the reasoning. I think they did this because if you had this reasoning data, training would be much easier — but without it, you have to research this problem from scratch.

Overall this is a very challenging undertaking. If all three of these points are challenging, as a follower it becomes even harder. Speaking of Su Hui's team being followers — for us, aren't we followers too right now?

Monica:

What does Eric think?

Eric Li:

I think o1's difficulty and GPT-4's difficulty are both very high, but the hard parts are different. When GPT-4 came out, only OpenAI had built a multimodal model. To achieve multimodal capability, whether in pre-training, post-training, SFT, or reinforcement learning, every training stage needed to be done.

For o1, the main difficulty is data, because obtaining the best reasoning data is more resource-intensive than collecting outcome-based human feedback. Another issue is the implementation method, which is unlike last year's relatively clear path from text-only to multimodal models. Back then everyone already knew how to do modality fusion and how to process those datasets, but now people are still guessing how o1 is actually implemented and what principles underlie it.

So I think the difficulties are mainly: first, building such a dataset; second, because there are many possible implementation paths, more research investment is needed to identify the optimal one.

For smaller companies there's another challenge: the importance of reinforcement learning. Previously many startups or resource-constrained companies didn't do reinforcement learning, instead using more off-policy methods like DPO. If reinforcement learning is now this important, having to do full RLHF rather than RL-free methods is a major challenge for small companies.
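For readers unfamiliar with the RL-free alternative mentioned above, the DPO objective for a single preference pair can be sketched in a few lines of Python. The sketch follows the published DPO loss, which is the negative log-sigmoid of beta times the difference between the chosen and rejected log-probability margins relative to a frozen reference model. The function name, argument names, and example log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(baseline, improved)
```

The loss is computed entirely from a fixed set of preference pairs. Nothing is sampled from the model during training, which is why this route is so much cheaper than an RL loop with generation inside it.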

Monica:

In the process of catching up to o1, what do you think are the most overestimated and underestimated aspects?

Eric Li:

I think the most underestimated aspect is data — specifically, data for judging reasoning quality. Previously with RLHF, some scenarios or startups could still obtain human feedback, but getting high-quality reasoning feedback data is much more difficult. As for overestimated aspects... nothing is overestimated. It's just hard.
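One way to see why reasoning feedback is costlier than outcome feedback is to compare the shape of the two kinds of labels. The Python sketch below assumes a simple labeling scheme in which outcome feedback is one judgment per response and reasoning feedback is one judgment per step. The class and function names are hypothetical and do not describe how any particular lab collects its data.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class OutcomeLabel:
    final_answer_correct: bool  # one judgment per response


@dataclass
class ProcessLabel:
    step_correct: List[bool]  # one judgment per reasoning step


def annotation_count(label: Union[OutcomeLabel, ProcessLabel]) -> int:
    """Number of individual judgments an annotator had to make."""
    if isinstance(label, OutcomeLabel):
        return 1
    return len(label.step_correct)


def first_error_step(label: ProcessLabel) -> Optional[int]:
    """Index of the first step judged incorrect, or None if all steps pass."""
    for i, ok in enumerate(label.step_correct):
        if not ok:
            return i
    return None


outcome = OutcomeLabel(final_answer_correct=False)
process = ProcessLabel(step_correct=[True, True, False, True])
print(annotation_count(outcome), annotation_count(process), first_error_step(process))
```

Under this scheme the per-step labels say where the reasoning went wrong, not only that it did, but each one requires an annotator who can actually follow the reasoning.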

Su Hui:

I've said this before: I think people underestimate the engineering challenges. As things stand, the training engineering challenges are enormous. You need to build on top of a GPT-4-level model, and you need to master the training process itself to keep advancing.

Kimi Kong:

I strongly agree with Eric and Su Hui. This is both a very hard science problem and a very hard engineering problem.

The science challenge lies in how to filter high-quality data, while the engineering challenge is that training needs to incorporate inference — meaning you must be a hexagonal warrior with no weak points to pull this off.

Hopes for Development in the Field Over the Next 1-3 Years

Monica:

We've discussed a lot of interpretations and speculations. Now let's talk about hopes for the future. After seeing o1 demonstrate these capabilities, what developments do you most hope to see in this field in the near term and over the next three years? What difficult problems do you most hope get solved?

Kimi Kong:

I think within one year, coding will likely become a commodity, a skill that everyone can possess. I was chatting with a PM on my team earlier, and he said: I can write code myself with Cursor, I don't need you guys to build prototypes for me. He was joking and talking about his personal projects, but I think this scenario might really materialize within a year.

Actually I'm a robotics researcher by training, and I'm very much looking forward to greater progress in combining large language models with robotics, particularly in the embodiment direction. But within one to three years, I think the hardest problem to solve remains domain data. Most of the recipes are already laid out on the table: big companies and the open-source community are using roughly similar ones.

You can choose your recipe, but recipes need raw ingredients to cook, and here the raw ingredient is data. When a domain lacks good data, or data is hard to collect, or data hasn't been digitized yet, that is the biggest challenge. Embodied robotics specifically, while challenging, isn't insurmountable. Robot data hasn't been well digitized yet, but that process has already begun.

This reminds me of the GPT-1, 2, 3 development stages, when everyone was also continuously improving data quality and expanding data quantity. So I'm very much looking forward to seeing my robotics colleagues develop an amazing emergent embodied intelligence model.

Monica:

I recently invested in a robotics company. Seeing robot data gradually being digitized makes me very relieved, because we talk every day about how hard robot data is.

Kimi Kong:

RT-X is indeed a good step forward. Some people from the RT-X team later left to start Physical Intelligence. On the Facebook Todos team, what impressed me was a Vietnamese-American member who was the initiator of RT-X.

Monica:

Can you explain in one sentence what RT-X specifically does?

Kimi Kong:

RT-X is open-source. Traditionally, robotics scientists needed to collect their own datasets — for example, Tony gathered a bunch of cooking, table-opening, and shaving datasets, then trained imitation models.

Just as Hugging Face does summarization and semantic understanding in NLP, they collaborated with 17 labs worldwide, integrating dozens of robot datasets to establish a unified standard robot dataset, totaling two million robotic trajectory demonstrations.

By comparison, PaLM-E spent 18 months collecting roughly 150K human demonstrations. But compared to language models (take Chinchilla's scaling law, with its trillions of tokens), robotics is still leagues behind. And that's precisely why it's exciting. The game is still wide open. Everybody can win.

Monica:

It's a bit like ImageNet for robotics.

Kimi Kong:

Exactly. Everyone's starting from the same line. The tech giants are on the same starting line as you. That's why I'm so excited — I look forward to seeing robots land and find real applications in the next three to five years, and I look forward to my colleagues producing even more stunning work.

Monica:

I'm very much looking forward to seeing when you might return to robotics research, your old stomping ground.

Kimi Kong:

I've been following the space closely. From a technology perspective, I don't think there's much difference: it's all AI applied to different industries. Robotics is fundamentally a multimodal problem. I don't see that much distinction between robot models and VQA or VLM models. At the end of the day, it's the same technology applied to different datasets.

For me, although robotics is my passion, what excites me more is the problem that remains once you abstract robotics away — my passion lies in reinforcement learning, in how to use RL to solve foundational state-action World State Topological Agent Problems.

Su Hui:

Within the next year, I very much hope to see breakthrough progress in multimodal reasoning. A lot of prior research has shown that introducing multimodal tokens doesn't actually improve language model capabilities, which has been somewhat disappointing to many people — you mix modalities, compute goes up, but capability in any single modality doesn't improve.

Within a year, our training data resources will scale up considerably. But we should note that human learning doesn't require anywhere near this much data. Current model training is flooded with meaningless data — press releases, random strings — all of which gets learned by the model, wasting enormous resources.

I very much hope to see major breakthroughs in data work this year, finding truly representative data that can achieve what massive datasets do today with a much smaller volume. Looking three years out, I'm fairly optimistic — I hope to see models approaching AGI, solving all problems so we don't have to work anymore.

Monica:

That's three years? Careful your boss doesn't make that your KPI.

Kimi Kong:

I'm very curious about the multimodal question — multimodal data still makes up a very small share of training right now. I'm also curious why no one is doing scaling research on vision encoders, given that vision encoder sizes are so much smaller than text encoders. That's a personal curiosity of mine.

Su Hui:

I don't actually know the specific reasons, but I think it's a very promising direction. I'm quite bullish on scaling up encoders.

Kimi Kong:

Right, because right now they're all in the range of a few hundred million parameters. I may not know Gemini particularly well, but those open-source models that are tens of billions of parameters — their vision encoders are basically in the few-hundred-million to 1B range.

Su Hui:

That is indeed quite surprising. I think one major reason is that vision encoders pose fairly significant engineering challenges.

Kimi Kong:

Interesting, good to know.

Monica:

Eric.

Eric Li:

Personally, within one year I'm quite bullish on multimodal reasoning. I've read a lot of papers and found that models reason very well on text, but once you add multimodality, the performance drops. This involves two problems simultaneously: alignment between modalities, and reasoning — mix them together and it gets much more complex. But with o1 as a shining example, I believe many people will consider how to apply related techniques more extensively to multimodal RLHF.

Another direction I'm optimistic about is Multi-Agents. Previous agents didn't work that well mainly because foundational capabilities — reasoning ability, for instance — weren't strong enough. I estimate that within this year, other competitors will also release o1-level models. For startups or other teams, having a more powerful Multi-Agent foundation should be more promising. I look forward to this unlocking some new application scenarios, or making breakthroughs on tasks requiring high accuracy that previously couldn't be achieved.

Over the next three years, I hope to see AGI playing a greater role as an innovator — autonomously discovering new things or conducting frontier research. I've noticed some relevant papers recently on using AI to help us do research, but it's still at a fairly rudimentary stage. Once reasoning and Multi-Agent system architectures mature more, AI scientists may bring us unexpected results.

Monica:

Do you think an AI scientist can be achieved just by improving reasoning capability? As a scientist capable of defining and solving problems, what other abilities are needed?

Eric Li:

The papers AI scientists write now are more of the hype-driven variety — simply combining A and B. To tackle thornier open questions, we need AI with deeper thinking ability and the capacity to tear things down and start over. Also the ability to ask better questions, not merely solve them. With better reasoning capability, AI can think longer and deeper, which would lead to a qualitative leap in its ability to pose questions and solutions.

Monica:

What are some harder problems you think will remain unsolved in one to three years?

Eric Li:

I think the innovator problem itself is extremely challenging. One particularly difficult aspect is getting AI to do more than just retrieve from its pre-training data — to actually question whether what it learned is correct or already outdated. I think this may be a very difficult point for AI to reach innovator level: getting AI to question, to vigorously challenge knowledge it acquired through SFT and pre-training. If this can be achieved, there should be major progress.

Monica:

Echo made an excellent point earlier. As an investor, in my discussions with entrepreneurs, I've found that o1 is more of a GPT moment than a ChatGPT moment. It solves scenarios requiring higher-level reasoning, which is quite different from the chatbot scenarios ChatGPT demonstrated. For these scenarios, product design can't just solve everything with a search bar like a chatbot. We need to consider how to incorporate human feedback into long inference-time reasoning chains.

These are all product problems from the GPT-to-ChatGPT journey that deserve discussion across the entire industry ecosystem. I think this is more suitable for startups than big tech, because the big players are all going all-in on the GPT models themselves. There are still many opportunities at the product level. Cage has been doing a lot of research in this area — I'd like to invite you to share your expectations for the future as well.

Cage:

I'll split this into coding and everything else. For coding, I'm quite convinced coding capability will continue to improve. People who can code probably make up less than 1% of the world, but people with product needs far exceed that proportion. Could there be new technological breakthroughs and products to bridge this gap? Take Cursor — novice users still can't really use it effectively. So there may be lower-barrier, more democratized products emerging, something like Canva.

Second, what I'm most looking forward to is whether reward models can generalize beyond math and code to other problems.

This generalization might happen in two ways: one through model-level improvements from OpenAI, Anthropic, Google and others; two through open APIs or other forms that let enterprise users jointly contribute high-quality reasoning data, leading to improvements in finance, law, and other domains. I hope to see some signals of breakthrough within a year.

This isn't yet an area that's clearly being pushed hard, but it's where I'd want to see progress. On a three-year horizon, what I'm most excited about is AI actually being able to complete high-value research tasks for me, perhaps lasting a day, a week, or a month. In the process, if the AI encounters any issues, it can proactively email me, I give a comment, and it continues completing the task. This ties back to what Echo and previous guests mentioned: there aren't yet products where users are willing to accept such high latency, but if AI can truly do high-value tasks, there may be breakthroughs in industry research or even in humanity's scientific problems.

I look forward to technological and product breakthroughs that enable asynchronous collaboration between humans and AI. This might give rise to a new AI Agent operating system or UI/UX design pattern — that's what I'm most excited about within three years.

Monica:

Everyone shared their expectations for the future from different angles. Today we planned for two hours and ended up talking for over three — thank you all so much. I found it enormously enlightening, and I hope it brought something to our listeners as well.

The more breakthroughs in new paradigms and advances in model capabilities, the more they ignite our imagination and anticipation — and the more they inspire us to build and innovate on top of them.

Recommended Reading