ZhenFund's Yusen Dai: A Long Conversation on AI Agents — Every Industry Will Face Its "Sedol Moment" (Part 1)

ZhenFund · March 31, 2025

In the Agent era, "attention is not all you need."

Last month, Yusen Dai, Managing Partner at ZhenFund, sat down with LatePost for a long conversation about AI and agents. We've transcribed the full interview, which we're publishing in two parts.

From last year to now, there have been two pivotal milestones: o1 and R1. Each has reshaped the entire AI industry in distinct ways:

First, o1 introduced reinforcement learning into large language models, opening up new scaling laws beyond pre-training, specifically in post-training and test-time compute (computation during the inference phase), and dramatically improving model reasoning capabilities.
Second, DeepSeek R1, also a reasoning model, was released as a powerful open-source alternative at extremely low cost, sparking massive public interest and forcing many to reassess the most critical question in the foundation model space: improving model capabilities. R1's open-source release, alongside the detailed technical report for another reasoning model, Kimi-k1.5, made clear to the entire field that certain approaches were dead ends: neither model relied on methods like Monte Carlo Tree Search (MCTS).

In this episode, Yusen and LatePost start from o1 and R1, discussing how their combined improvements in reasoning capability, cost reduction, and concurrent advances in programming and tool-use capabilities have opened up the application prospects for agents in 2025.

Yusen shares in detail his current observations on agent opportunities, as well as how large and small AI companies are adjusting their strategies amid the shifts brought by DeepSeek's open-source ecosystem.

01 Lessons from OpenAI's o Series and DeepSeek's R Series

Q: Over the past nearly six months, the two most important events in AI have been OpenAI's o1 release last September and the global frenzy following DeepSeek's R1. We can start with these two most significant developments. Could you first share your view on what o1 and R1 each represent?

Yusen Dai: I think o1 first showed everyone the intelligence gains that come from applying reinforcement learning to post-training. At the time, everyone was wondering what would come after GPT-4o. When o1 emerged, it demonstrated substantial improvements in reasoning and other intelligent behaviors. Later, o3 proved that following this technical path, model capabilities could continue to improve — the margins were still wide, the room for growth still large.

I've heard that o4 mini has already finished training. From this, we can see two things. First, reinforcement learning has established a scaling law for post-training. Second, as models spend more time reasoning, the quality of their answers improves; this is the test-time compute scaling law, also called the inference-time scaling law. These two new scaling laws, built on top of pre-training, allow AI models to improve further.

Previously, leading companies more or less knew that reinforcement learning was quite useful and could boost model performance. But only after o1 did everyone confirm that this path truly works. I believe the reasoning capability improvements brought by the o series are the key to unlocking the agent product form. If a model's thinking ability isn't strong enough, it can't autonomously use tools, make plans, or check whether it's completed its work — yet these are all essential for agent products. So it took the o series to first elevate the model's thinking ability before this new product form could be unlocked.

Q: What's the approximate difference between o4 and o3? Or what is it mainly optimizing and iterating on?

Yusen Dai: Recently there have been some rumors that o4 mini, for instance, might have inference times reaching several hours. This got me thinking: what's the difference between excellent humans and average ones? Why does a PhD thesis take five years? Because a doctoral student can produce better, higher-value work given five years. But give an ordinary person ten years, and they probably still couldn't write a PhD thesis. So first, the person's foundational quality must be good; second, they need sufficient time.

We often say training a model is essentially like cultivating a smarter person. But smarter people need more time to deliver better work; this is the inference-time scaling law. In the o series models, like o3 and o4, having models think longer to achieve better results is becoming an increasingly attainable goal.

Q: That covers o1. To briefly summarize: o1 proved that reinforcement learning has enormous potential in post-training and test-time compute scaling laws, and that this path can be pursued far. That's the value of the o series.

Next we can talk about R1. I think in some ways its impact has exceeded the o series, because R1 became a topic that everyone — truly everyone — was discussing.

Yusen Dai: I think the R series is genuinely world-class work, and it offers us many important insights. First, open source versus closed source. When they chose to open source, everyone could understand the model's training process. In the R1 and V3 training papers, we saw many things that OpenAI had long known but the public didn't. For example, DeepSeek-R1-Zero proved that without using SFT, purely applying reinforcement learning to the base model V3 could make the model output longer responses and achieve greater intelligence, realizing the reasoning Scaling Law. The "no SFT" point was an important innovation. Then there's GRPO — I heard OpenAI knew about it before, but it was DeepSeek's paper that made everyone realize GRPO was viable. Previously, when many people were discussing o1, they wondered whether it could be achieved through search methods like MCTS, or through step-by-step labeling for reinforcement like PRM. But DeepSeek generously shared that they had tried these methods and they didn't work. Actually, knowing that a path doesn't work is very important sometimes.

I recently learned a term: "one-bit information," meaning some critical information can be conveyed with just a single bit.

I think what's impressive about DeepSeek's paper is that it provided these "one-bit information" points. For example, MCTS is a dead end — at least DeepSeek tried it and it didn't work, so others don't need to waste effort on that path. This "one-bit information" reflects both DeepSeek's generous spirit of sharing and the gap between Silicon Valley and China — there may still be some "one-bit information" in Silicon Valley that we don't know. Based on what we learned last year, by mid-2024, the fact that the RL path worked was already consensus among top-tier labs in Silicon Valley, but this information probably didn't reach China until after o1 and R1 appeared. So much of the critical information in frontier exploration is hidden in these "one-bit information" points.

The spirit of sharing in open source has many benefits. On one hand, fellow model trainers have learned a tremendous amount. On the other hand, we've seen companies like WeChat and Baidu, which already had their own models, also integrate DeepSeek because of its open-source availability. This allows more people to access good models — for example, Monica, one of our portfolio companies, recently launched its domestic version using R1. Previously, many domestic application developers built products overseas because overseas had good models like GPT-4o and Claude 3.5, enabling them to make good products. Now that domestic developers have a good model like R1, their "arsenal" is richer. Moreover, open source can accelerate the entire industry's development; everyone can learn from each other and progress together.

I mentioned the first point: the victory of open source. The second point, I think, is the victory of reinforcement learning (RL). OpenAI didn't disclose the specific details of o1's training, but R1's release showed everyone that the reinforcement learning path can indeed go very far, pointing out a direction worth exploring deeply. So I see this as a major victory for RL.

Third, R1, V3, and DeepSeek as a whole have fully demonstrated the importance of team focus. When resources are constrained, people are forced to come up with more creative solutions. For example, using MoE is a way to save resources — with a traditional dense model, both inference and training costs would be much higher. Using MoE, and when facing chip "chokepoint" issues, innovating technically through approaches like MLA to enable training and inference to proceed smoothly within legal and compliant boundaries while achieving better results — this shows that resource constraints often become a driving force for innovation.

At the same time, DeepSeek is also a company that made many deliberate choices in its research direction. In 2023, many people were working on multimodal generation, AI virtual girlfriends, and other projects, with quite a few focusing on to-C product development, but DeepSeek didn't follow the crowd. They didn't launch their own app until after the R1 release. Although DeepSeek already had many GPUs, funding, and excellent people, they remained focused on improving intelligence and enhancing the model's foundational capabilities, concentrating their strength in one direction, ultimately achieving these results. This reflects not only their precise judgment of technological direction but also the good outcomes that come from firm choices and resolute commitment.

At the same time, this also shows us that young AI-native teams are capable of competing with much larger companies that have more resources and users. Previously, everyone assumed that big tech held absolute advantages in funding, talent, GPUs, and user base, and that small companies simply couldn't compete. DeepSeek isn't a small company in the ordinary sense, but relatively speaking, it's still a fairly young team, with many members being domestically trained Chinese graduate students and PhDs. This gives everyone confidence in China's talent system, which is also crucial.

Another point that matters a lot to me: DeepSeek proved that in the early stages of a technological revolution, if you can deliver a completely new, almost magical experience to users through technological progress, the results can be extraordinary. When people first used DeepSeek's R1 model — their first encounter with a reasoning model — and saw its outputs, they were genuinely impressed. That sparked organic word-of-mouth, generating massive natural traffic without spending a single dollar on advertising. The app reached tens of millions in daily active users. Meanwhile, its API was in such high demand that many were willing to pay for access, with some even proactively asking for a paid, stable version of R1. This shows how technological progress transforms product experience, which in turn drives organic user growth and natural traffic, while also giving rise to viable business models. So I believe that in the early stages of a tech revolution, the priority should be breakthroughs in technology and intelligence leadership, not polishing products and operations on top of existing capabilities.

Q: Do you think this is already consensus?

Yusen Dai: Quite a few people had raised this point before. Between 2023 and 2024, many researchers expressed that "intelligence matters; don't just gild the lily on what's already there." But I think people needed a concrete, vivid example. Before DeepSeek-R1 emerged, everyone in 2024 was overly fixated on internet-era metrics like DAU, user retention, and time spent. Take the then-trendy AI virtual girlfriends and AI calling features. Why did so many rush to build these? Because the data showed high retention rates and long interaction times. Naturally, talking to AI on the phone stretches out the duration. But does that really represent an advance in intelligence? At least personally, I see it more as fulfilling users' emotional needs rather than advancing intelligence. If you optimize for time spent and DAU, you won't build something like DeepSeek that pushes intelligence forward.

In China's internet sector, there's long been considerable debate. Everyone knows the enterprise services market lacks fertile soil, and users seem more willing to pay for killing time than saving it. So people habitually search for the next ByteDance. In my October 2024 LP report, I noted that we probably won't follow the ByteDance formula going forward, because ByteDance makes money by occupying user time — yet user time is finite. TikTok, Honor of Kings, and others already consume massive chunks of it. So the next innovative "killer app" might instead be something that saves users time, or creates value outside those 8 or 16 waking hours, rather than trying to steal time from TikTok. That's extremely hard to do — TikTok is formidable. In this context, DeepSeek became an excellent exemplar.

02 Agent Unlocks a Scaling Law for Capital-to-Productivity Conversion

Q: What industry and application changes will reasoning models like the o-series and R-series bring next? You've already touched on one point — the improvement in reasoning capability points toward Agent applications, which has been a frequent topic of discussion from late last year to now.

Yusen Dai: Following the framework we just discussed — technological progress unlocks new product forms. We can see how GPT gradually advanced to GPT-3, then aligned into the conversational InstructGPT, and finally GPT-3.5 unlocked the Chatbot product form. Models with strong coding capabilities, exemplified by Sonnet, unlocked products like Cursor, the programming assistant — a mutually enabling relationship; without Sonnet, Cursor couldn't have taken off. Starting from Sonnet 3.5, models began gaining certain reasoning capabilities, and the advances in o1 and subsequent o-series models made reasoning genuinely powerful. The corresponding product form this unlocks, I believe, is probably Agent.

What is an Agent? In English, "agency" carries the meaning of subjective initiative. Previously, only humans on Earth possessed subjective initiative — we know our own goals, can formulate plans, use tools, and evaluate outcomes. This is one reason humanity came to dominate the world. But now AI capabilities have gradually reached a tipping point where AI can play the role of Agent.

In my view, AI's ability to make this transition is unlocked by three technological advances:

First, reasoning. Reasoning capability is AI's foundational intelligence. Without sufficient reasoning, a cascade of problems follows: it cannot clarify its task objectives, struggles to formulate viable execution plans, and cannot judge whether it has completed a task.
Second, coding capability. In the digital world, understanding code, writing code, and completing various tasks are foundational skills — the "language" of the cyber world.
Third, tool use capability. In the digital world, humans have already built so many tools and software. For AI to fully realize its potential, it must first adapt to these human-made tools. For instance, AI needs to use human browsers and websites to access information.

Over the past 12 months, these three capabilities (reasoning, coding, and tool use) have undergone transformative change, entering a phase of exponential growth. To measure these capabilities, the industry has developed various benchmarks. For reasoning, we commonly use GPQA, a test pitched at the level of a PhD qualifying exam. On this test, ordinary people score in the 20s, while human PhDs reach about 60. In early 2024, the most cutting-edge AI models scored only in the teens. But now, frontier models like o3 have reached over 70 (if I recall correctly), so this has risen extremely fast.

For measuring AI programming capability, the industry commonly uses SWE-bench, which draws from real human programming tasks on GitHub. In early 2024, 4o scored in the single digits, essentially unusable. But now o3 has reached 70-80, meaning AI can solve 70-80% of the benchmark's tasks.

The rapid development of AI capabilities has now created a new problem: it's becoming difficult to find suitable tests for AI. Recently, Terence Tao proposed a test called Frontier Math, where even the simplest problems are at IMO (International Mathematical Olympiad) difficulty level. At the time, people thought these problems would stump AI for at least several years. But now the o3 model has already scored 25 on Frontier Math, with the o4 model performing even better.

Once reinforcement learning is applied to a domain, the AI's growth curve often becomes exponential. AlphaGo, using RL techniques, made a massive breakthrough in Go. Later, DeepMind's AlphaStar, also through RL, rapidly surpassed top human players in StarCraft. As for autonomous driving, technically speaking it's already many times safer than human driving; only regulatory factors are preventing large-scale deployment. I call these landmark moments when AI capability surpasses humans "the Lee Sedol moment." Everyone remembers when Lee Sedol played five games against AI and lost four. That was the moment people realized AI could defeat even the strongest human.

Q: Will humans soon lose the ability to evaluate AI capabilities?

Yusen Dai: I think we're already quite lacking in this regard. Like that "Humanity's Last Exam" that Alexandr Wang put together — AI is already at 20 now.

Q: Out of 100?

Yusen Dai: Right. Going from 20 to 80 could happen very quickly. The key is that coming up with hard problems is itself a major challenge for humans. But if AI can keep improving through more compute, RL, and stronger inference, the gap will be hard to close.

Q: Regarding this "Lee Sedol moment" you mentioned — the beginning is definitely AI surpassing humans, which is intuitive. I've spoken with some Go enthusiasts, like Tiancheng Lou, who said that when AlphaGo Zero appeared, it not only surpassed humans but operated at a level of intelligence that humans couldn't comprehend. He felt this way about both Go and autonomous driving — you can't really tell much from a test ride. Same with Go: human-accumulated joseki patterns built over millennia were casually shattered by AI.

Yusen Dai: I don't think interpretability or explainability necessarily exist.

Q: Because by first principles, humans currently have no way of grasping all truths and laws in the world.

Yusen Dai: For instance, we also can't understand how Einstein came up with his theories. And if you think further, cats and dogs certainly can't understand why humans do all sorts of things, right? With AI developing so fast now, we may soon face a situation analogous to an elementary school student testing a PhD. We may gradually be entering such a stage, where the student racks their brain to come up with what they think are super-hard problems, but to the PhD, they're not difficult at all.

This is a crucial issue for AI safety: we may become unable to evaluate AI. On many existing human tests, AI can already easily score 95 or above. It's like the saying at Tsinghua: some people score 100 because their ability ceiling is 100, while others score 100 because the test maxes out at 100; if it were out of 1000, they'd score 1000.

Q: Are we already at that stage? Where we can no longer evaluate AI capabilities?

Yusen Dai: I don't think we're unable to evaluate yet, but in the foreseeable future — perhaps within a few short years — it will become very difficult.

Q: What will that bring?

Yusen Dai: Actually, people are already seeing many related signs. For example, during the Spring Festival there was an article, supposedly Wenfeng Liang's response posted on Zhihu, that went viral — and later people discovered it was written by DeepSeek.

I've been using OpenAI's Deep Research a lot lately, and it's been incredibly helpful — and eye-opening. We were just talking about Agents, and actually, the first real use case for Agents is research assistance. You ask it a question, and it has to figure out how to answer, map out a research plan, find sources, synthesize and compare. We went from 4o with no reasoning capability, to o1. Then o1 got o1 pro for deeper thinking, then o3 mini high, and now Deep Research. The whole progression took maybe 3-6 months, but the improvement felt exponential.

Just yesterday I was thinking: if you grabbed ten random people off the street, at least nine of them probably couldn't outperform Deep Research. Because Deep Research can, in a matter of minutes, produce a research report on virtually any topic that I'd say reaches the level of a white-collar worker with one or two years at a decent company. And honestly, many people simply don't have the reasoning skills, information-gathering ability, or synthesis capability to match that, no matter how much time you give them. So I think AGI isn't a sci-fi concept anymore. If people were talking about AGI two years ago, it still felt distant. But now, for tasks like collecting and organizing information, AI has already surpassed most people.

Q: People like us — information workers, bit in, bit out.

Yusen Dai: A conversation like the one we're having today is something AI still can't do. This is proprietary information between us; it didn't exist before we started talking. But if the information already exists somewhere, if it's not proprietary, then AI would definitely do a far better job than the vast majority of people. I'm quite certain of that. The growth rate of AI is genuinely extraordinary. We've already seen its exponential growth, and we're going to witness many more "Lee Sedol moments" like the ones mentioned earlier.

Going back to where we started, I think unlocking Agent has profound significance. Historically, I'd sum up all internet product models with one famous phrase: "Attention is all you need."

Whether it's Tencent or ByteDance, the core metric is how much time users spend on your product. You can understand this through a simple formula: time × users × monetization rate. So everyone's trying to attract more users, get them to spend more time, and increase monetization. But there's clearly a ceiling — there are only so many people, everyone sleeps 8 hours, so they're awake maybe 16 hours max, and they still need to eat, work, do things where they can't look at their phones. It's hard to double screen time. So people try to push monetization rate — how to extract more value from the same hour — which leads to TikTok's video ads, livestreaming, but that path has limits too.
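As a rough illustration of the ceiling in that formula, here is a minimal Python sketch. The function name and every number in it (two hours a day, one million users, 5 cents per user-hour) are made-up assumptions for illustration, not figures from the interview.

```python
# Illustrative only: all figures below are hypothetical, not real company data.
WAKING_HOURS = 16  # the upper bound on daily attention mentioned in the interview


def attention_revenue_cents(hours_per_day, users, cents_per_user_hour):
    """Daily revenue under the formula: time x users x monetization rate."""
    return hours_per_day * users * cents_per_user_hour


# A hypothetical product that holds users for 2 hours a day.
today = attention_revenue_cents(2, 1_000_000, 5)

# The same product if it somehow captured every waking hour.
ceiling = attention_revenue_cents(WAKING_HOURS, 1_000_000, 5)

# Headroom from more screen time is bounded by a fixed multiple.
headroom = ceiling // today
print(f"today={today} cents, ceiling={ceiling} cents, headroom={headroom}x")
```

With users and monetization rate held fixed, the time term can grow by a bounded factor at most, which is the ceiling described above.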

Throughout human history, pretty much everything required human Attention. There's only one exception: automation. Traditional mechanical automation, like machine tools — once humans set up the automated system, it runs on its own, but it has no agency. Current AI advances create something new: first, it doesn't require human Attention, and second, it can autonomously execute tasks. It's no exaggeration to say this is the greatest advance since the dawn of humanity. If what distinguishes humans from other animals is tool use, then all previous human tools required Attention — until now, with Agent as a tool that doesn't need Attention. I toss a question to Deep Research, it researches for 5 minutes, and during that time I don't need to pay Attention. Last year when I used Devin, I'd give it a task and it would just go do it. I could interrupt with new requirements, check its progress, but if I left it alone, it would complete the task itself. So I want to propose a new phrase: In the Agent era, "Attention is not all you need."

It unlocks unlimited human potential. As I said, human Attention is finite. If human Attention no longer needs to be spent, the theoretical multiplier becomes infinite. It's like being a boss telling employees what to do — no Attention required. Previously, most people were executing the results of someone else's Attention; only a small minority were the bosses.

But now AI is getting so powerful that everyone can be an AI boss. What you have the AI do becomes a crucial question. Many people think assistants are clever, good for booking flights or ordering takeout, but don't know what else to have them do. I think this will have major implications for society, for education, but I believe once people adapt to this paradigm, they'll discover far more tasks to delegate to AI. Extending this further, I think we may see a kind of Scaling Law for work. Right now, work and productivity aren't easily scalable. Take a major tech company — even with 10 or 100 billion in capital, you can't directly convert that money into productivity. You still need to hire people, train them, and more people means more internal politics. So money doesn't automatically equal productivity. But if AI models keep getting stronger, and their reasoning capabilities keep improving, you'll find that money equals compute power, and more compute means more AI-generated productivity. This is the Scaling Law of capital converting to productivity.

Q: But does the world need this much productivity?

Yusen Dai: This is the same thinking people had before cars and airplanes were invented. Back then, people thought, why would I need to fly to the next village? I can just walk.

Q: You think it will create new demand?

Yusen Dai: I think history has repeatedly validated this with countless technologies.

Q: Relative to human history and long stretches of ancient history, the technological explosion has actually been quite brief — just four or five hundred years.

Yusen Dai: That's an even more interesting point. Originally, human technological explosions were measured in "generations." Then it became: how many technological explosions can one generation experience? Now the cycle has compressed to under a decade. From AlexNet to today is only 13 years; from ChatGPT's birth to now hasn't been long at all. Looking back at when ChatGPT first appeared, everyone thought its capabilities were amazing, but from today's perspective, there was still huge room for improvement. Technology is changing so fast that people may struggle to adapt in time, which will inevitably have major societal impacts.

Beyond that, exponential growth is actually the normal state of the world, but before the steep part of the curve, it looks a lot like linear growth. There's a saying: "gradually, then suddenly." Before the rapid upward phase, everything looks calm. This is why people concerned about AI safety are so worried — now that everyone agrees we've entered the exponential growth phase, it's no longer about preparing for rain; the thunder has started and it's about to pour. I think a massive increase in productivity is a crucial variable, if you believe productivity ultimately creates economic value.

Then the question becomes: what is productivity, and how do we make it create value for everyone? On one hand, as Sam Altman has said, one-person companies will become extremely powerful. If one person can effectively direct AI, or even direct Agents through AI, they could create enormous value. On the other hand, the reason startup founders could sometimes defeat big companies was that they could convert capital into productivity more efficiently — they had sharper vision, worked harder, faced no organizational drag. But if big companies pour massive resources into hiring elite entrepreneurial Agents, ordinary founders may struggle to compete. Perhaps only top-tier founders could still beat big companies; average founders might get eliminated by the AI that big companies can afford to hire. That's a real possibility. So some believe this will make the rich richer, because the wealthy can buy more productivity. In the past, even a wealthy person might lose to a smart young person, but the future may be different.

Q: These are two directions — one is the super-individual, the other is like a "sci-fi utopia" where resources gradually concentrate in more powerful companies.

Yusen Dai: So I think AI will bring enormous changes, both in terms of productivity and social structure. But to unlock these changes, the prerequisite is that model capabilities need to improve. I think finding the first PMF in the early stages of a technological revolution is sometimes a sweet trap — you could even call it a curse. Take mobile internet: BlackBerry was the first to find PMF. The technology was limited then — weak processors, slow networks — so they thought they could only do email, BlackBerry Messenger, push notifications. To nail this PMF, they made phones with keyboards and were proud of it. But then technology advanced: stronger processors, faster networks, bigger screens. Apple said forget the keyboard, made a full touchscreen phone. BlackBerry thought typing and emailing without a keyboard would definitely be worse. That's the curse of PMF — when technology upgraded, they were trapped by their own product-market fit.

The same happened with the internet. Yahoo was the first internet company to find PMF, with the portal model — listing information for users to browse. Then Google came along with search, which was a massive shock to Yahoo. Yahoo was complex, with tons of content you had to click into. Google was just one search box, type and go. Yahoo actually had a chance to acquire Google, but didn't offer enough, and eventually got disrupted.

So I want to say that chatbots might be a sweet trap too. Now there are so many chatbots, and people may just think about optimizing on top of that foundation. But I've always felt that chatbots might limit frontier AI model capabilities. When you chat with ChatGPT, Moonshot AI, or Doubao, don't you tend to have fragmented, short conversations like on WeChat? But if you want to give an agent an instruction, often you need to write a more substantial proposal — like applying for an NSF grant, fully explaining what you want to do, the goals, the conditions. It requires complete communication. But in a chatbot context, like WeChat, you can only communicate in fragments, and the model's intelligence may not even be able to show itself.

I was talking with people at OpenAI, and they said they found that more advanced models didn't significantly improve user satisfaction in chat. It's a bit like chatting with someone on WeChat — the difference between chatting with an average college student versus a scientist doesn't feel that big. But if you had them write a doctoral dissertation, it's a 0 versus 1 difference. So chatbot, as an early product form that's easy for people to accept, may not be the product form that makes it to the end.

If you optimize for short-term metrics on top of that — say, trying to keep people on the chatbot longer — you might end up adding a voice call feature. But does voice calling actually align with intelligence gains? Because making a good phone call depends on tone of voice, emotional intelligence, things that have nothing to do with intelligence or productivity. It makes me think of how this has played out historically: those who find the first PMF first, if they don't keep pushing deeper, often end up trapped by that very PMF.

Q: We just did a lot of forecasting around Agents. If we follow your logic of the Work Scaling Law, what forms will the first wave of Agents take in 2025?

Yusen Dai: For the first wave, I think the hottest thing right now is Deep Research. You see OpenAI put out Deep Research, though Google actually launched it first, then Perplexity came out with Deep Research, and I know a lot of startups are heading in this direction too. Why is everyone going this way? Because people discovered that having AI do deeper research — gathering more resources, then deciding what information to gather next based on what it found, forming this loop, and finally producing a research report — this is basically what we normally have analysts do. But people found that for roughly the same amount of time, or even slightly more, you get better results with this. We call this a "read-only Agent" — it only does read operations, no write operations. I think the PMF here is already pretty clear. The Deep Research I use actually performs better than my interns. So for us knowledge workers — people who need to research topics at our computers, browse a bunch of websites, and produce reports — the willingness to pay and the use case are both well-defined.
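The loop described here (read a source, decide what to read next based on what was found, repeat, then write a report) can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any real Deep Research product works: `TOY_CORPUS`, `read_source`, and `research` are hypothetical names, and an in-memory dictionary stands in for the web.

```python
# Toy sketch of a "read-only agent" loop. All names and data are hypothetical.

# Stand-in for the web: topic -> (snippet, follow-up topics worth reading next).
TOY_CORPUS = {
    "agents": ("Agents plan, use tools, and check their own work.",
               ["reasoning", "tool use"]),
    "reasoning": ("Reasoning models spend more compute at inference time.", []),
    "tool use": ("Tool use lets a model call browsers and other software.", []),
}


def read_source(topic):
    """Read-only lookup: returns (snippet, follow_ups), or None if nothing is found."""
    return TOY_CORPUS.get(topic)


def research(question, start_topic, max_steps=5):
    """Gather notes in a loop, letting each finding decide what to read next."""
    notes = []
    visited = set()
    queue = [start_topic]
    steps = 0
    while queue and steps < max_steps:
        topic = queue.pop(0)
        if topic in visited:
            continue
        visited.add(topic)
        steps += 1
        found = read_source(topic)  # read operation only; nothing is written externally
        if found is None:
            continue
        snippet, follow_ups = found
        notes.append((topic, snippet))
        # What was just read determines what gets read next.
        queue.extend(t for t in follow_ups if t not in visited)
    body = "\n".join(f"- {topic}: {snippet}" for topic, snippet in notes)
    return f"Report on: {question}\n{body}"


print(research("What unlocks agents?", "agents"))
```

The sketch never performs a write operation: it only reads and then returns a report, which is what makes it "read-only" in the sense used above.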

The second step is from reading to writing. OpenAI launched Operator, Anthropic launched MCP; these are all about AI using tools. This raises a lot of security concerns, since no one wants AI messing things up. But clearly, under controlled conditions, giving AI the ability to perform write operations and publish information externally is a crucial capability. Monica, one of our portfolio companies, is building something like this, which everyone now knows as Manus. They shared something interesting with me yesterday. One test task was to get the subway schedule for an American city, say Phoenix. The model first checks the official website, finds the link is down, then opens the email client on its own and drafts an email to the Phoenix city government asking about it, eventually reaching the step of confirming whether to send the email. It can do all of this autonomously.

Q: This is their product?

Yusen Dai: Yes, their product can invoke tools and call browsers, and there are many interesting characteristics here. For instance, the AI can proactively use tools, and it has its own "computer," which is quite interesting. Previously, many people assumed that applications like AutoGLM in China were about AI controlling our phones, like having AI order takeout on our phones. But think about it: does an assistant work on their own device or on yours? Definitely theirs. So it should be my AI assistant in the cloud, with its own phone or computer, using its own device to order takeout for me, not using my phone, since I still need to browse TikTok and chat on WeChat. This is essentially virtualization technology.

Q: But in terms of permissions, it's still under your account system, right?

Yusen Dai: Not necessarily — you might give the AI its own "computer." For example, if you subscribe to an expensive Bloomberg terminal, your AI assistant might say: "Boss, let me borrow your account." Then you input your credentials and let it use it. Or another scenario: you might also buy a LinkedIn Premium subscription for your assistant to use. All of these are possible.

Actually, you'll find that when AI can use tools, it can do a lot of things. After all, most software tools are used either by calling APIs or by operating the software interface itself. So the multimodal reasoning in Moonshot AI's k1.5 is very important, especially when using software interfaces — using a software interface requires understanding web pages. Everyone's talking about world models understanding the world, which is actually quite difficult. A simple example: when we look at things, we know objects have front and back, they have depth, but AI currently performs poorly at recognizing depth information. However, if it's just about operating computer and phone interfaces, AI can do a great many things.

Q: So this is the second type — can both read and write.

Yusen Dai: Let me give a random example of what being able to write makes possible. When AI encounters a problem, in theory it could post asking for help. It could even offer a bounty, since it's already bound to a payment provider: whoever helps solve its problem gets $100. This isn't science fiction; it's completely doable now. And we've found that powerful AI models can come up with solution approaches that humans wouldn't think of. For instance, where humans think a problem is unsolvable, AI might consider whether to reframe the problem, or whether to obtain permissions it didn't originally have.

But this is also something AI safety research needs to pay attention to, because AI might actually do harmful things in order to solve problems. I personally encountered a typical example: I used Windsurf to build a demo personal website, and to deploy it, it said two processes were occupying ports and needed to be killed. I agreed at the time, but later thought — what if the system crashed after killing them? It was just trying to deploy that demo website without considering the potential impact on me. Of course these issues can be aligned, but there are many potential risks.

So this kind of Agent with "write" capabilities, once well-developed, will be very powerful, but deployment will definitely be slower, because the potential consequences are also significant. It requires extensive monitoring, training, and alignment, plus preventing abuse. So I think "read" will come faster. For "write," Operator is an example — when you use it to book flights, you'll find it's not as fast as booking yourself, with confirmation needed at every step. But in AI, slowness always gets solved. Going from slow to fast, from expensive to cheap — this has always been happening in AI. Just imagine: if something that originally took an assistant 30 minutes, AI can now do in one second, how much more can get done each day? The freed-up time can be used for so many other things — the impact on people will be enormous.
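The read/write distinction and the "confirmation at every step" behavior can be illustrated with a small sketch. It is an assumption-laden toy, not Operator's or Manus's actual design: `ToolBox`, `build_toolbox`, `read_schedule`, and `send_email` are hypothetical names, and the email "send" only appends to an in-memory list.

```python
from typing import Any, Callable


class ToolBox:
    """Registry that lets read tools run freely and gates write tools."""

    def __init__(self, confirm: Callable[[str], bool]) -> None:
        self._tools: dict[str, tuple[Callable[..., Any], bool]] = {}
        self._confirm = confirm  # asks a human (or a policy) for approval
        self.log: list[str] = []

    def register(self, name: str, func: Callable[..., Any], writes: bool) -> None:
        self._tools[name] = (func, writes)

    def call(self, name: str, *args: Any) -> Any:
        func, writes = self._tools[name]
        # Only write operations need approval before they run.
        if writes and not self._confirm(f"{name}{args}"):
            self.log.append(f"blocked {name}")
            return None
        self.log.append(f"ran {name}")
        return func(*args)


outbox: list[tuple[str, str]] = []  # stands in for a real mail system


def read_schedule(city: str) -> str:
    return f"schedule page for {city}"


def send_email(to: str, body: str) -> str:
    outbox.append((to, body))
    return "sent"


def build_toolbox(confirm: Callable[[str], bool]) -> ToolBox:
    box = ToolBox(confirm)
    box.register("read_schedule", read_schedule, writes=False)
    box.register("send_email", send_email, writes=True)
    return box
```

Passing `confirm=lambda action: False` makes the agent effectively read-only, while a `confirm` that prompts the user reproduces the slower, step-by-step approval flow described above.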

Q: Is this progression what OpenAI previously defined as those five technical levels? After Agent comes Innovator, and after that Organization.

Yusen Dai: Yes, and this gives rise to several questions. The simplest one: right now humans direct Agents, but can we achieve Agent directing Agent? If every task can be completed in one second, human speed of asking questions can't even keep up.

Q: In the future, when making interview outlines, it might be Agent interfacing with Yusen's Agent, and they'll write the outline themselves.

Yusen Dai: I think this is completely possible. But there's an important issue: memory. Right now, if you use ChatGPT and I use ChatGPT to answer the same question, the results are pretty similar. But if it's an assistant who's worked with me for several years, besides the public knowledge part, their answers would definitely differ from yours. That way our Agents would actually have content to discuss, because we'd each have our own memory. But right now this memory mechanism is still very rudimentary.

I think memory is particularly important — everyone's working on it but no one's done it particularly well. Take ChatGPT: its so-called memory is basically forming a system prompt during conversation with you, like remembering "this person has a dog, this person is a college student" — very simple. But in reality, true memory is very long, and some of it is actively fed to it during conversation, while some might be obtained through other means. In any case, memory is definitely a crucial point.
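The "memory as a system prompt" idea described here can be pictured with a very small sketch. It only illustrates the speaker's description and makes no claim about how ChatGPT actually implements memory; `SimpleMemory` and its methods are hypothetical names.

```python
class SimpleMemory:
    """Keeps short facts about a user and renders them into a system prompt."""

    BASE_PROMPT = "You are a helpful assistant."

    def __init__(self) -> None:
        self.facts: list[str] = []

    def remember(self, fact: str) -> None:
        # Skip exact duplicates so the prompt does not grow needlessly.
        if fact not in self.facts:
            self.facts.append(fact)

    def system_prompt(self) -> str:
        if not self.facts:
            return self.BASE_PROMPT
        bullet_list = "\n".join(f"- {fact}" for fact in self.facts)
        return f"{self.BASE_PROMPT}\nKnown facts about the user:\n{bullet_list}"


memory = SimpleMemory()
memory.remember("has a dog")
memory.remember("is a college student")
memory.remember("has a dog")  # duplicate, ignored
print(memory.system_prompt())
```

A flat list of facts like this is exactly the "very simple" kind of memory the speaker contrasts with the long, multi-source memory a human assistant builds up over years.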

Also, online learning: this is very important too. Humans have a capability that AI currently lacks. AI models still need a new version release to update their weights, whereas humans in daily life, whether through reading or social interaction, can continuously learn and actively change the "weights" in their brains. This is a biological characteristic, while AI currently has to go through a training process for each update.

Additionally, there are many interesting frontier exploration topics right now. For example, currently Agents use human tools, but if they're ten times smarter and ten times faster than humans, why use human tools? It's like how we wouldn't eat with children's utensils — we'd use adult-appropriate utensils. So there might be a whole suite of tools designed specifically for AI. Tools designed for superhumans are definitely different from those for ordinary people. In this regard, AI-specific tools and how AI iterates on its own tools are worth researching. Perhaps eventually its tools will be ones we humans can't use, just as many people can't use EDA.

Q: And it's possible the AI can design these tools itself.

Yusen Dai: So thinking further down this path, the iteration speed here reaches science fiction territory. But now we're finding that many concepts previously considered pure science fiction are no longer out of reach — with a bit more model development, these things can be achieved. So I think intelligence advances will unlock new product forms. And these new product forms could be extremely powerful. If you only optimize and polish on top of the existing chatbot, you might get disrupted pretty quickly.

Q: Actually, when we talked about Agents two or three months ago, you still mentioned coding, but you didn't mention it just now.

Yusen Dai: You mean Agents for coding? Here's how I see the relationship between Agents and coding. The first step is Agents that do coding, like Cursor or Windsurf, which is currently an easier scenario for Agent deployment. But I think the next step is Agents that can code. For example, your assistant might be a liberal arts major; if you have them learn to write code, they could write a scraper to gather more information for you, so you'd know who to interview. It's as if your Agent has acquired coding as a new skill. I think this will be the next, bigger development paradigm.

Initially, Agents were mainly for writing code, but not that many people need to write code. Tools like Cursor, Windsurf, Devin are primarily aimed at programmers. But programmers are a limited percentage of the population. So for more non-programmer knowledge workers — ordinary white-collar workers — what role should their Agents play? I think coding ability is essential for their Agents, because only through coding can they navigate this cyber world with ease.

Q: The industry is evolving incredibly fast. Just a few months ago, when people talked about Agents, coding still felt like one possible direction, and plenty of founders were building in that space. But now the conversation has shifted — it's about Agents that can write code, and then using that capability to do much more.

Yusen Dai: Before, Agents were Coding Agents — purpose-built for writing code. Now, it's an Agent that can code.

Q: What other capabilities do you think are needed to build a great Agent?

Yusen Dai: Let me organize this. Right now, there are three core capabilities: reasoning, coding, and tool use. Then beyond that, memory and online learning. All of these are critically important and still unsolved problems.

Q: In 2025, do you think building Agents will fall more to application companies, or to model companies with exceptional capabilities — like OpenAI with Operator, or Anthropic with Computer Use?

Yusen Dai: At this stage, model companies do have genuine advantages. They can use RL to improve model capabilities, and use more powerful models to optimize their own models. But application companies have several strengths too. First, they can mix multiple models and leverage each one's strengths. Second, there's user mindshare. Take Perplexity — it started with AI search, captured that mental real estate, and as its underlying models kept upgrading, most users just think of it as the AI search product. Cursor is another great example. Initially, people dismissed it as a wrapper, but it and the model actually elevated each other. Without Sonnet 3.5, Cursor wouldn't have taken off — it couldn't have done predictive next-step coding. And without Cursor, Sonnet 3.5 might have lacked the vehicle to break through.

Q: You mentioned Monica, another company you invested in. They're exploring Agents based on other models or open-source models, right?

Yusen Dai: They don't train their own models. Barring delays, they'll release a very interesting Agent product next week (Manus launched its beta on March 6, 2025). We believe that when you can take models, have them use tools, and layer on clever product design, you can create a genuinely different experience.

Q: You mentioned that the chatbot format was a "honey trap" for whoever found PMF first. Are there similar "traps" in the Agent application form? I mean, what aspects might distract you or slow your march toward AGI?

Yusen Dai: I haven't thought through Agents clearly enough yet. It's still in an exploratory phase, so it's hard to say. But I have a feeling: if an AI product has massive user volume, serving all those users may force compromises on model size and capability. Simple example — lots of users, large model, and in China, charging is difficult. If you're providing a high inference-cost model free to millions, that's unsustainable. So you might need to make the model lighter. But does a lighter model conflict with pursuing AGI? I think when DeepSeek gained so many users, lots of people discussed whether to retain them — I believe that's also a "honey trap." Having tens of millions of DAU, with users worldwide in vastly different scenarios — serving them well demands enormous investment in compute, product design, and operations. I think that diverts resources from AGI exploration, and resources aren't infinite.

Q: Right now, it seems DeepSeek isn't intentionally trying to retain users.

Yusen Dai: I think that's the right call, and it's what enables the WeChat partnership. If DeepSeek tried to use this moment to build a super App, WeChat probably wouldn't work with them.

Q: I just thought of something — multimodality. Though for Agents, I think what's more relevant is multimodal understanding, not generation.

Yusen Dai: Multimodality is definitely important, but right now it doesn't boost intelligence that quickly. Language is an extremely condensed form of intelligence — improving intelligence through language is a faster path. Once language is sufficiently explored, next comes images. Images contain massive information; any random photo has lots of it. But images don't contain much intelligence. You'd need to watch enormous amounts of video to extract any intelligence from it. Whereas understanding Newton's laws takes just a few sentences — how much video would you need to watch to derive Newton's laws? So I think video matters more for concrete applications; for intelligence generation, its information compression rate isn't high enough yet.

Q: Then why was everyone training multimodal models for a while?

Yusen Dai: Two situations. First, the multimodal generation route like Sora — this has clear PMF. There are so many video ads worldwide. Things like the viral "cooking orange cat" videos — get them decent enough and they monetize. There's a business model there. Midjourney didn't even raise venture funding and already found preliminary PMF. With PMF and good results, of course people build there.

Q: How are Midjourney and Sora's DAU now? Declining?

Yusen Dai: Midjourney is doing okay. Its first batch of users is already on board, and it has been self-sustaining from the start. As for Sora, I think Kling AI and Hailuo AI achieved good results following its technical approach, and now Sora looks like it got up early but arrived late. Though Google's Veo 2, released yesterday, was quite impressive: at least for single shots, it's currently the best video generation model.

But the general consensus now is that video generation may not be the most important direction for advancing intelligence — the field is still "racing" toward reasoning. It's like walking: when there's a clear path ahead, many people will take it first. So in AI, we'll constantly alternate between exploration and sprinting. When you hit bottlenecks, you realize those seemingly aimless exploratory branches might yield breakthroughs. From a company perspective, on one hand you need to "sprint straight ahead" — it's a race. On the other hand, you need frontier exploration, because you never know what might happen short-term.

Q: So it still takes big companies to do this? In the US that's Google, in China that's ByteDance.

Yusen Dai: In the US there's also OpenAI.

Q: So startups simply don't have the resources.

Yusen Dai: I wouldn't put it that way. It depends on what stage we're in, and how long that stage lasts. If we're in a phase that requires innovation, startups might avoid competing head-on with big companies by having a different vision. But if it's a "straight sprint," then whoever has money and GPUs will pull ahead. Startups have always excelled at doing what big companies don't see. Once everyone's cards are face up, big companies have the advantage.

Q: When we discussed Agents potentially going mainstream in 2025, we didn't really touch on cost. Is cost reduction an important driver for Agent development?

Yusen Dai: Absolutely, and I believe cost reduction is inevitable. So my baseline assumption is: make it work first, then make it cheap. Cost will definitely come down, and Agent capabilities will keep strengthening, but hitting bottlenecks and roadblocks along the way is entirely possible. So I think: make it usable, then make it good, then make it cheap. If you can't even make it usable, cheapness is irrelevant.

I also think the difficulty of Agent adoption differs between China and the US. US labor costs are extremely high right now: you constantly see tight job markets and unfilled positions. Take Devin, priced at a few dollars per hour of work. We might find that expensive, but look at it from a US company's side: California's minimum wage is around $16 an hour. Even McDonald's pays $16 an hour, while an Agent costs $6-8. First, it's cheap. Second, in a year it'll be more capable, so the same price buys even more. In an environment accustomed to paying for enterprise services, this makes sense.

I'm a ChatGPT Pro subscriber myself at $200 a month, and I find it an incredible deal. It lets you do 100 Deep Research queries, which works out to $2 each. If I asked my intern, first, I couldn't demand a report in five minutes at 2 AM, and second, the quality wouldn't match ChatGPT Pro's. So I keep telling interns: if you're just collecting information to produce a vague report, it may genuinely be worse than a $2 service.
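The arithmetic behind the "$2 each" figure, and how it compares with the $16 hourly wage mentioned earlier, can be worked through in a few lines. The numbers are the ones quoted in the interview; the function names are hypothetical, and the comparison ignores everything except the headline prices.

```python
def cost_per_query(monthly_fee: float, queries_per_month: int) -> float:
    """Subscription price spread evenly over the monthly query allowance."""
    return monthly_fee / queries_per_month


def minutes_of_wage(cost: float, hourly_wage: float) -> float:
    """How many minutes of paid labor the same amount of money buys."""
    return cost / hourly_wage * 60


per_query = cost_per_query(200, 100)            # $200 / 100 queries
equivalent = minutes_of_wage(per_query, 16)     # against a $16 hourly wage

print(f"${per_query:.2f} per query, about {equivalent:.1f} minutes of $16/hour labor")
```

Under these figures, one research report costs roughly what seven and a half minutes of minimum-wage work would, which is the sense in which the speaker calls the subscription a bargain.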

William Gibson said: "The future is already here — it's just not evenly distributed." I think the gap between those already using frontier AI, or using it well, and those who've just tried chatbots or haven't used AI at all is enormous. So I truly believe that for paperwork, AI replacing humans isn't imagination anymore; it's happening now.

Q: So after RL, after unlocking Agents, what might be the next technological paradigm?

Yusen Dai: First, I think RL can go very far. Second, I believe discovering new knowledge is the crucial next step. Anthropic co-founder Dario Amodei wrote an essay called Machines of Loving Grace. He mentioned that for AI to advance further, it's about discovering new science and acquiring new knowledge, which also seems to align with OpenAI's five-level framework.

Q: Level four. Level four is innovator.

Yusen Dai: Because most scientific discoveries start with a hypothesis, then get validated through experiments. AI may already be quite good at the hypothesizing part. But the validation step, which sometimes requires observation and sometimes physical, chemical, or medical experiments, can hit certain constraints. If we could find a way to run experiments at massive scale and in parallel, to verify whether AI-generated hypotheses are correct (including things like mathematical theorems, where new knowledge can emerge purely through reasoning), then from that point AI might enter a kind of "left foot stepping on right foot" state, lifting itself by its own bootstraps: producing new knowledge and then using that new knowledge to improve itself, potentially creating a self-iterating, self-evolving process.

But by that time, the product forms that emerge might look quite different again. Many big-name folks have asked me: when will we invent the immortality pill? I think this might be the shared goal after people have made a lot of money. People may no longer just want Agents to do lots of work — they might want an immortality pill instead. And it could solve many major problems facing humanity, like what the actual cure for cancer is.

Q: Once AI gets smarter, it might find more efficient ways to use energy on its own, maybe even solving controlled nuclear fusion — the problem humanity hasn't cracked in 50 years — creating a closed loop.

Yusen Dai: AI can complete tasks humans can do, but it will soon encounter tasks that humans can't solve. This is like the "Move 37" that Lee Sedol faced: you don't know how that move was made, but as long as you can verify the result, even if you don't understand how it was produced, you discover that it actually works, that it's usable, and that can lead to lots of new progress.
