Are Models and "Shells" Both Undervalued? ZhenFund's Yusen Dai's Mid-2025 AI Review

ZhenFund · July 28, 2025

In 2025, the "Lee Sedol moment" for industries far and wide is only just beginning.

This episode's theme is a mid-year AI review and outlook for 2025.

The recording was split into two sessions. The first took place on July 18, when Yusen Dai, managing partner at ZhenFund, and Cheng Manqi, head of tech coverage at LatePost, sat down for a mid-year review covering Moonshot AI's newly released Kimi K2, emerging trends in AI application adoption, and the talent war that has intensified over recent months.

The second session was recorded on July 21, when we followed up on breaking developments: on July 18, OpenAI released ChatGPT Agent, and on July 19, OpenAI announced that an unreleased general-purpose large language model had achieved IMO (International Mathematical Olympiad) gold-medal-level performance for the first time. Shortly after the recording, on July 22, Google DeepMind announced that its Gemini Deep Think model had matched this feat. Previously, only Google DeepMind had reached silver, and that was with a math-specialized model.

Two and a half years after writing Sparks of AGI, Sébastien Bubeck, now a researcher at OpenAI, described this achievement, a general-purpose LLM winning IMO gold, as potentially "a moon-landing moment for AI."

This AI race, now running for over two years, has never slowed. Model capabilities and application innovation keep leapfrogging each other, and both may still be underestimated.

The global AI community remains in the Early Adopter phase. They're willing to experiment, willing to give feedback. When you put out a good product and treat users with openness — whether it's DeepSeek, Kimi, Manus, or Genspark — users from everywhere have proven they'll not just appreciate and support you, but actively help improve your product.

Looking back at the Kimi K2 story, you can also see how "betting on people" has been redefined in the AI era. Kimi was a team built on technical vision and technical capability from day one. In 2023, when AI winds shifted nearly every month, Zhilin Yang's team bet on long-context — and built the first version with search capabilities on top of it. It was a wager on the future.

AI is getting people to try things they never would have before. This IMO gold-medal moment makes the signal of approaching AGI even clearer. If in the past we only saw a distant train smoking on the horizon, now we can clearly hear its roar.

For 2025, "Lee Sedol moments" across industries may only just be beginning.

OpenAI Wins IMO Gold: Another Lee Sedol Moment

Q: What are the most important recent developments worth unpacking?

Yusen Dai: A lot happened this past weekend. The biggest one, I think, is that a new OpenAI model achieved gold-medal-level performance on the IMO 2025 problems — specifically, solving five out of six.

Why does this matter? According to OpenAI's description, this was a general-purpose large language model without internet access, without special optimization for math, and without any Code Interpreter or similar tools. It solved IMO proof problems, and OpenAI had three IMO gold medalists cross-verify that the solutions were correct.

Of course, the result has drawn some controversy. Some pointed out it hasn't been officially certified, so it may not count. Terence Tao also noted that IMO problems can have many different solution paths, leading to varying scores.

Note: On the evening of July 22, Google DeepMind CEO Demis Hassabis posted on X emphasizing that Google DeepMind's own gold-medal result had received official recognition from the Olympiad organizing committee.

But regardless, this is a watershed moment: a language model, without special math optimization, solving IMO-level proof problems in an offline environment. Previously, Google's AlphaProof and AlphaGeometry 2 were systems specifically designed for math, relying on formal verification methods without generalization capability.

Q: That was exactly a year ago. In July 2024, Google's AlphaProof and AlphaGeometry 2 reached IMO silver level, just missing gold. But that wasn't a general-purpose LLM, while OpenAI is claiming this is a general model.

Yusen Dai: Right, and this year OpenAI ran the model as soon as the IMO problems were released, so there's no possibility the model saw these problems during training.

Despite all the progress language models have made over the past year, tasks like mathematical proofs — especially IMO problems — fall into the "hard to verify" category. Verifying whether an answer is correct is itself extremely difficult.

These types of problems have long been considered beyond current language models' reach. And in reality, most truly important problems in the world don't come with standard answers or solutions. So when a language model can solve such difficult problems at human expert level without any special tuning, it means its reasoning capability has genuinely reached a new tier.

OpenAI also noted that this capability can be further improved by extending thinking time, which validates the inference scaling law.

We've discussed before that beyond pre-training, there are also post-training and inference scaling laws. This result demonstrates three things:

1. LLMs have strong generalization — they can solve problems we previously thought unsolvable;

2. The stronger the model, the broader the applicable scenarios, and the greater the value created;

3. IMO proof problems share formal logical similarities with certain real-world science problems — both are proofs. If LLMs can do the former, they may be close to the ability to discover new knowledge.

They certainly can't solve ultra-hard problems like Goldbach's conjecture yet, but discovering new scientific knowledge may already be within reach.

There's also some gossip: apparently the model OpenAI used shares the same base model as GPT-4o. In other words, this achievement didn't come from major improvements to the base model, but from optimizations in post-training and inference. The remaining optimization space leaves much to imagine for AI's future development.

Q: You heard this from technical sources?

Yusen Dai: Yes, did some quick asking around. This all happened within 24 hours, but my reaction has been very strong.

It reminds me of Microsoft's March 2023 paper Sparks of AGI, when they tested a pre-release version of GPT-4 and marveled at seeing sparks of AGI within it. That was just two and a half years ago, and now we've reached the point of solving IMO problems. Two and a half years is a blink in the history of technological progress — shorter than the journey from seed round to product launch for many startups.

One of that paper's authors later joined OpenAI, and upon seeing this IMO gold result, he called it a "moon-landing moment" for AI.

A language model that "just predicts the next token," with no tool assistance, completing creative mathematical proofs that only a tiny handful of genius-level humans can produce. This genuinely shows AI capability has reached a new height.

When we recorded "A Long Chat with Yusen Dai on AI Agents" earlier this year, we said 2025 would be the year many industries face their "Lee Sedol moment." A "Lee Sedol moment" means AI surpassing the strongest human level in a particular domain.

We've already seen this repeatedly in Go, programming, and mathematical reasoning. More such moments await us, solving problems we thought were still difficult and distant.

Q: I also saw that it wasn't just OpenAI. After OpenAI's announcement, a Google researcher posted on X (formerly Twitter) saying OpenAI beat them to the announcement.

Yusen Dai: We're watching closely. It seems Google DeepMind also achieved gold, but we don't know if it was with a general-purpose model. If it was, that means this capability isn't limited to one company. Once such technology diffuses, it will bring massive leaps in reasoning capability — every model vendor will benefit.

Note: On July 22, Google DeepMind announced that Gemini Deep Think achieved officially certified IMO gold and published its detailed solution process. This general-purpose model solved the problems using only natural language (English).

Q: Have you talked with domestic practitioners? Were they shocked, or was this within expectations?

Yusen Dai: I think everyone knew the direction: toward stronger reasoning capability. Everyone knew better reasoning means solving harder problems. But actually achieving it now is still shocking. I've spoken with some of China's top researchers, and they expressed genuine surprise. But it's like the atomic bomb: once the first one detonates and people know it's possible, others are not far from building one themselves.

Q: From a technical progress perspective, Go, programming, and math are three canonical "Lee Sedol moments." How do you see their different impacts?

Yusen Dai: Mathematical reasoning is actually harder than programming.

Programming is a well-verifiable problem. Reinforcement learning succeeds in programming largely because the reward is clear — code runs, passes test cases, and you know it's right.
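To make the point about clear rewards concrete, here is a minimal Python sketch of a test-based reward signal. It is an illustration only, not a description of how any lab trains its models; the function `unit_test_reward`, the toy squaring task, and the test cases are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def unit_test_reward(candidate: Callable[[int], int],
                     tests: Sequence[Tuple[int, int]]) -> float:
    """Return the fraction of test cases the candidate passes (0.0 to 1.0)."""
    if not tests:
        raise ValueError("at least one test case is required")
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails that test case.
            pass
    return passed / len(tests)


# Hypothetical task: "return the square of n".
TESTS = [(0, 0), (2, 4), (-3, 9)]


def correct_solution(n: int) -> int:
    return n * n


def buggy_solution(n: int) -> int:
    return n * 2  # happens to be right for 0 and 2, wrong for -3


print(unit_test_reward(correct_solution, TESTS))  # 1.0
print(unit_test_reward(buggy_solution, TESTS))    # 2 of 3 tests pass
```

The grader here is a few lines long and fully automatic. No comparably cheap checker exists for a natural-language proof, which is the contrast being drawn with IMO problems.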

But grading math proofs is extremely complex. IMO is the classic "hard to produce, hard to verify" problem.

Go is a two-player game with perfect information and clear win/loss outcomes — ideal for reinforcement learning. Programming is also structured, and much code is built from existing pieces rather than created from scratch.

But math is the foundation of science and engineering. Its reasoning underpins many disciplines. And unlike experiments in the natural world, it relies purely on logical thinking. So if AI can solve IMO-level proof problems, it could transform how knowledge is generated across science and engineering — potentially driving even greater scientific progress than programming.

Q: Do you think this application will have a bigger impact?

Yusen Dai: It might be bigger. Right now, programming is mostly replacing repetitive, entry-level work. Vibe Coding, for instance, is largely copy-pasting frontend code. But mathematical reasoning brings powerful thinking and the potential for genuine new knowledge discovery. Obviously, that's the more valuable part.

AI keeps taking over simpler tasks, while humans move toward more valuable, harder problems. But now AI is chasing humans into solving the most valuable, hardest problems.

That's why this IMO gold medal event gave me a clearer signal that AGI is coming. If before it was like seeing a smoking train in the distance, now you can hear its roar.

Q: Some people on X (formerly Twitter) said reinforcement learning can now handle domains where reward feedback isn't so direct. That might be one of the bigger breakthroughs behind this progress. Others mentioned "verification asymmetry" — that for some tasks, producing the solution might take less time than verifying it. IMO math problems fall into this category.

Yusen Dai: Right. Many tasks used to be "hard to produce, easy to verify," like writing code. But now we're seeing "hard to produce, hard to verify."

Q: Though some argue that in these high-verification-cost tasks, AI still can't fully replace humans, because ultimately humans have to make the judgment call.

Yusen Dai: Possibly. But simply producing the proof is already a qualitative leap. We don't know all the details yet, but we look forward to more public information, or other model companies replicating similar results. Given the current pace of AI development, once something has been done, it's no longer an unattainable problem.

ChatGPT Agent Launch: The "Shell's" Value Lies in Context

Q: In the early hours of July 18, OpenAI released ChatGPT Agent. But unlike Manus, this Agent left many people somewhat disappointed — not as impressive as expected.

Yusen Dai: I think this reflects that OpenAI, as the leader in AI and the largest AI application company, is also making Agent a major priority. It's the same loop we've been discussing since the beginning of the year: understanding goals, breaking them down into plans, writing code to use tools, and reviewing and reflecting on results. From the initial concept, to the first wave of products like Devin and Manus, to the ChatGPT Agent launch, Agent has gradually become the consensus for AI applications, the direction everyone's converging on.

Q: Some people said "OpenAI released a Manus." What do you think?

Yusen Dai: We wouldn't see it that way. I don't think you can underestimate OpenAI. They have the most people, the most GPUs, the most users, and they've done a lot of thinking about safety, adding many extra constraints. Actually, ChatGPT Agent's capabilities are heavily restricted — that's responsible behavior.

This product is also the first they've rated "high risk" in their AI safety evaluations, which shows they're genuinely concerned about risks like phishing sites or bioweapons information (see OpenAI's ChatGPT Agent System Card). When companies get big, they become more cautious — which conversely shows where startups have opportunities for speed and bold breakthroughs.

Q: I think the "released a Manus" comment isn't necessarily about poor quality, more about product form. It essentially combines Operator and Deep Research, and the form factor resembles Manus or Genspark.

Yusen Dai: Yes, Manus did pioneer a direction: making what AI is doing visible and understandable, giving people context. Otherwise, if you only see the final result, it's confusing. So we're seeing Manus, Genspark, Moonshot AI, MiniMax, and other Chinese teams comparing their already-launched Agents against ChatGPT Agent on the same tasks. I have to say, these companies' products perform better than ChatGPT Agent on many dimensions, such as making PPTs.

Q: ChatGPT Agent's PPTs are honestly pretty ugly.

Yusen Dai: But this gives me several insights:

First, Chinese teams are genuinely strong on product. The mobile internet era had plenty of examples — TikTok, Shein, CapCut. Chinese teams have made many great products.

Second, so-called "wrapper" products — applications calling APIs — won't necessarily get crushed by model-native products. Previously, people assumed that if OpenAI itself entered the space, with end-to-end trained models, it could completely replace third parties. But that's not the case. Especially with Agents, which need more context and tools, much depends on the shell and the environment the application itself provides.

Manus shared an article on Context Engineering, "How to Systematically Build Context Engineering for AI Agents?," which received great feedback, because this is a problem everyone's working on now and much of the practice takes time and experience to accumulate.

I understand Context Engineering evolved from Prompt Engineering. Prompt Engineering is giving AI a command, a task, and having it execute. This resembles traditional management: the boss assigns tasks, employees carry them out. But advanced companies like Netflix and ByteDance emphasize "Context, not control" — meaning you give employees more context and authority to complete tasks better. Context Engineering works similarly: what we provide to models is context that helps them complete tasks better.

The first level is within a single session: how we provide better context, better data, in formats more suitable for model operation.

The second level is multi-session or cross-session personalized memory: what did you do today, what corresponding things to do tomorrow, user preferences, habits, work experience. Can these accumulate? This could become a moat over time. With the same model, whichever product has better context will understand me better.

The third level is that product design itself can provide context models couldn't otherwise access. For example, take a hypothetical not-yet-built product: glasses that let the model see the world around you in real time. That kind of context is something models can't generate on their own. It requires good hardware and software design to achieve, which shows the value of the product layer.
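The three levels can be pictured as three inputs feeding one context block. The Python sketch below is a toy illustration under stated assumptions, not Manus's or anyone else's actual approach; `ContextSources`, `build_context`, and every field name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextSources:
    # Level 1: what has happened inside the current session.
    session_messages: List[str] = field(default_factory=list)
    # Level 2: memory accumulated across sessions (preferences, habits).
    user_memory: Dict[str, str] = field(default_factory=dict)
    # Level 3: signals only the product itself can observe.
    product_signals: Dict[str, str] = field(default_factory=dict)


def build_context(task: str, sources: ContextSources,
                  max_session_items: int = 3) -> str:
    """Assemble a plain-text context block from the three levels."""
    lines = [f"Task: {task}"]
    recent = (sources.session_messages[-max_session_items:]
              if max_session_items > 0 else [])
    if recent:
        lines.append("Current session:")
        lines.extend(f"- {message}" for message in recent)
    if sources.user_memory:
        lines.append("Known about this user:")
        lines.extend(f"- {key}: {value}"
                     for key, value in sorted(sources.user_memory.items()))
    if sources.product_signals:
        lines.append("From the product environment:")
        lines.extend(f"- {key}: {value}"
                     for key, value in sorted(sources.product_signals.items()))
    return "\n".join(lines)


sources = ContextSources(
    session_messages=[
        "User uploaded sales.csv",
        "User asked for a chart",
        "Chart approved",
        "User wants slides next",
    ],
    user_memory={"preferred_format": "PPT", "language": "English"},
    product_signals={"open_file": "sales.csv"},
)
context = build_context("Summarize Q2 sales", sources)
print(context)
```

With the same underlying model, two products differ only in what they can put into `sources`, which is where the accumulated memory and the product-specific signals would show up as an advantage.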

Q: The data from those glasses you mentioned is something no current internet giant has.

Yusen Dai: Right. So from the ChatGPT Agent launch, I see three things:

First, the Agent direction is gradually reaching consensus;

Second, startups still have flexibility, first-mover advantage, and competitiveness against core model giants;

Third, this further confirms two things we said were undervalued: model progress speed was undervalued, and the value of the product "shell" was also undervalued.

On the model front, OpenAI just won an IMO gold medal this week, showing its progress is still rapid. And OpenAI's own ChatGPT Agent still has huge room for improvement, showing the "shell's" value is also very important. So I think both models and applications were undervalued.

Q: What you said about context's value is excellent. It actually connects to management too: Netflix's book No Rules Rules starts from this logic. AI really is like a person: you teach it, give it context. Rather than giving specific instructions, this might be the better approach.

Yusen Dai: In the first phase, everyone was writing better prompts, like a boss writing a brief. Then people realized you need more examples, better context, and environments more suitable for model work. There are many specific techniques here, and Manus's article shared a lot of them. What I want to say is that richer context improves what the model can do, and it also reflects how much more deeply we're using AI and how much more complete products are becoming. Before, one prompt could get you running; now the product itself needs to take on more, and that's where product companies' value shows.

Q: When do you think this layer of application value gets absorbed by the model itself? Is it when models gain online learning capabilities, constantly absorbing new context?

Yusen Dai: It also depends on whether your product has user input. Without user input, no matter how much the model learns, it can't learn what's unique to the user.

Q: So does this bring back a logic that was previously dismissed? In the mobile internet era, more users meant more data feedback meant better recommendations — a data flywheel. But later people felt user input didn't help improve model intelligence in large models. Yet what you just said suggests user input does help with context.

Yusen Dai: These are two different issues. When people say the data flywheel is broken, they mean user chat logs can't improve model intelligence. I agree with that. Because current model intelligence already exceeds ordinary people. Chatting with it about everyday topics can't improve its capabilities.

Initially models learned human preferences through RLHF (reinforcement learning from human feedback). Now ordinary people's feedback seems less meaningful. If AI can solve IMO problems, why care which answer ordinary people prefer? So for tasks with correct answers, user input is increasingly worthless.

But for completing specific work — like how Agents can better achieve goals when doing human work — user input and preferences certainly matter.

Q: So user data actually helps product experience, but doesn't necessarily directly improve model capabilities?

Yusen Dai: Yes, especially for intelligence-based tasks or those with standard answers. Initially, large models could be seen as a compression of average human intelligence at scale. Wasn't it Ted Chiang who said that language models are essentially a fuzzy compression of the internet? But now, they've clearly surpassed ordinary human levels and reached something superhuman. At this stage, simple data may not be as useful.

Q: At this point in time, is it better to build Agents sooner rather than later? Because the more user context you accumulate, the more valuable it becomes. Previously people worried that new, stronger models would simply overwhelm existing products.

Yusen Dai: If you don't have context, don't have environment, and end up just fine-tuning models, then yes, you could indeed be replaced by new models.

AI Application Adoption: The Most Important, the Overrated, and the Underrated

Q: Last time we talked was February, and now it's been about five months. We're halfway through the year. Looking back, what do you think were the most important things that happened in AI in the first half of 2025?

Yusen Dai: Overall, AI has moved from being a research-oriented technology that seemed novel but had limited practical utility into the mainstream market. I think there were several major developments in the first half.

First, the breakthrough of AI in programming. Coding has become the top priority for AI applications. I even heard today that OpenAI now has three business lines: GPT, API, and Coding. Users have found AI coding products genuinely useful and are willing to pay for them. The growth rate of tools like Cursor represents this trend. Claude Code has also been called an L3 or Agent-level product — it can write faster and better than humans, with more elegant code, and handle larger codebases. So AI in programming has officially crossed the chasm into the mainstream market.

Second, the official release of o3 in April, accompanied by rapid ChatGPT user growth. This is the continued evolution of reasoning models. Since the second half of last year we've seen OpenAI launch o1, DeepSeek release R1, and now OpenAI launch o3 this year. It marks the transition of reasoning, Q&A, and problem-solving capabilities from the research realm into products that ordinary users can use. It has truly landed.

ChatGPT's user growth continues, and this wave benefited from o3's improved reasoning capabilities. We also saw breakthroughs from China earlier this year — for example, R1 was an important step forward in reasoning domestically, and Kimi Researcher was the first widely available deep research product with very positive user feedback. AI in this domain already outperforms the vast majority of people, and has also crossed the chasm.

Third, Agent applications have begun to proliferate. Devin was the first product to show people what an L3 Agent prototype could look like. Manus and Genspark both launched in March, and Claude Code has been continuously improving. We've seen that as models have strengthened in three key capabilities — reasoning, programming, and tool use — the first batch of products with complete Agent form has emerged: they can receive fuzzy goals, autonomously call tools, find solutions, evaluate task progress, and ultimately complete tasks. While they're not yet mainstream, they've entered the Early Adopter stage, and in some scenarios users are very willing to use them. Despite remaining issues, Agents have become genuinely useful, which is one of the most important developments in AI applications this half-year.
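The loop described above (take a goal, pick a tool, act, check progress, stop when done) can be sketched in a few lines of Python. This is a toy under stated assumptions: `toy_planner` is a hard-coded stand-in for a model call, and the tools `research` and `draft` are hypothetical, so nothing here reflects how Devin, Manus, Genspark, or Claude Code is actually built.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Tool = Callable[[State], str]


def research(state: State) -> str:
    """Hypothetical tool: gather notes for the goal."""
    state["notes"] = ["point A", "point B"]
    return "collected 2 notes"


def draft(state: State) -> str:
    """Hypothetical tool: turn the notes into a deliverable."""
    notes = state.get("notes", [])
    state["draft"] = "; ".join(notes)
    return f"drafted from {len(notes)} notes"


TOOLS: Dict[str, Tool] = {"research": research, "draft": draft}


def toy_planner(goal: str, state: State) -> str:
    """Stand-in for a model call: choose the next tool from what is missing."""
    return "research" if "notes" not in state else "draft"


def is_done(state: State) -> bool:
    """Progress check: the task is complete once a non-empty draft exists."""
    return bool(state.get("draft"))


def run_agent(goal: str,
              tools: Dict[str, Tool],
              planner: Callable[[str, State], str],
              done: Callable[[State], bool],
              max_steps: int = 5) -> Tuple[State, List[Tuple[str, str]]]:
    """Plan, act, and evaluate until the goal is met or steps run out."""
    state: State = {"goal": goal}
    trace: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        if done(state):
            break
        tool_name = planner(goal, state)
        observation = tools[tool_name](state)
        trace.append((tool_name, observation))
    return state, trace


final_state, trace = run_agent("write a short brief", TOOLS, toy_planner, is_done)
for tool_name, observation in trace:
    print(tool_name, "->", observation)
```

The `trace` list also hints at why showing intermediate steps matters in the product: it is the record a user would read to understand what the Agent did.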

Fourth, the rapid advancement of multimodal capabilities, especially image generation. From early toy-like tools, they've become genuine productivity instruments. For example, ChatGPT's image generation follows semantics very well and accurately understands user intent. Many people now use AI to draw comics, create flowcharts, and produce visual content. This improvement in generation capability has become highly practical.

Q: It can also support Agents in producing richer outputs, right?

Yusen Dai: Right, because its instruction-following capability is getting stronger and better satisfies user needs. Before it was like gacha-style uncertainty, but now it's improving steadily. So many livestream avatars have become AI-generated.

Veo 3 is also an impressive model. After it added voice dubbing, I posted on WeChat Moments exclaiming that the worlds it generates are approaching a virtual reality indistinguishable from truth. Veo 3 was the first time I felt a crossing of the uncanny valley: indistinguishable from real.

Fifth, the talent war. Whether it's Meta's large-scale poaching, startups raising funding frantically, or the recent drama around Windsurf's acquisition, all show that Silicon Valley's competition for talent and capital has entered a new phase. We're seeing similar heat domestically: rising funding amounts, hot projects, and the reappearance of multiple rounds in a single month. This is because people are genuinely seeing AI land — it's no longer just a concept, and many are already generating real revenue.

Q: Your main focus is still around AI application adoption, with technological change as the driving force, right?

Yusen Dai: We believe improvements in foundational model capabilities are the key to unlocking application scenarios. Model capabilities combined with good product design can truly release value. A genuinely valuable AI application must find some way to get users to pay, whether through subscriptions or per-job delivery. So we particularly focus on AI's value in productivity enhancement, especially in digital-world applications. You can see that many of our investments are in AI Agent or AI productivity projects, because these are currently the scenarios that can most genuinely help users solve problems.

Q: Besides the application thread you mentioned, AI hardware is also hot — things like robotics and embodied intelligence are also on the AI industry chain.

Yusen Dai: Yes, but I think one overrated direction in the first half was humanoid robots. Tesla recently lowered its production expectations for Optimus, which I find very representative.

I said last year that expectations for Optimus entering factories to tighten screws were too high. At the time some people said Tesla would have ten thousand robots working in factories by 2025, which completely underestimated the difficulty of manipulation. Now we're seeing some demos that are indeed getting better, like folding clothes, but actually having a robot make a cup of coffee remains extremely difficult.

Of course, I believe this field is still developing rapidly, and we may see a "ChatGPT moment" breakthrough in manipulation in the coming years. But if you're expecting large-scale deployment in 2025, I think that's greatly overestimated.

Technology development can't be forced. It has to pass through a sequence of stages: confirming the direction, gradual scaling, product formation, and then large-scale deployment. These stages can't be skipped. Robotics is clearly still in the early exploration phase.

Q: On the other hand, were there things, companies, or phenomena that you think were underrated this half-year?

Yusen Dai: I think first, the value of applications is still being underrated. A year ago, people were still saying model companies would disrupt application companies, that "applications are just wrappers" and this business model wouldn't work. At the time, whether Manus, Genspark, or many other companies, all faced considerable skepticism: "You're a wrapper company, do you have long-term value? Won't you be finished as soon as the model upgrades?"

Now this debate continues, but it's clearly not the case that application companies die when models upgrade. On the contrary, good application companies look forward to model upgrades, so users can enjoy more powerful experiences. The value of the "wrapper" is still underrated.

Second, the value of excellent teams is also being underrated. Whether Kimi, Manus, or Genspark, at the end of the day we're investing in people.

People previously might not have expected that Red (Xiao Hong, the founder of Manus) could build a world-class AI application. And Kimi K2, released just days ago on July 11, is arguably the world's strongest open-source large model at this point in time, bar none. Its performance in coding, Agent workflows, and Chinese writing is indeed superior to Claude. Of course, Claude was released at the beginning of the year, but in AI, six months is a long time.

OpenRouter usage data bears this out. K2 has only been online a few days: yesterday it ranked 13th in the programming category, and today it has risen to 10th, behind Claude, Gemini, and GPT. That rate of ascent is extremely fast and indicates very positive user feedback. People have become numb to benchmarks; we care more about actual user feedback.

For example, Perplexity's founder posted on X that their team has begun researching introducing K2 on Perplexity, and explicitly stated: "Kimi is doing very well."

Kimi is the most typical example, but not the only one. We've also discussed the later transformation of DeepSeek's team, and Moonshot AI, the company behind Kimi, went through considerable reflection before concentrating its efforts on the next-generation model. I think outsiders are too quick to draw conclusions. For instance, when DeepSeek emerged, people said the "six little dragons" (China's six leading large-model startups) were done. But in reality, if a team is sufficiently stable, with excellent talent, resources, and will, its agency and room for breakthrough are far greater than people think.

Third, I think the speed of model capability evolution is also being underrated. For example, there are already rumors that GPT-5 will release soon, and it may be a natively multimodal model with very strong reasoning capabilities and advanced Agent abilities.

Now when new products launch they often get criticized — people say they overpromised and the actual product experience isn't that good. But good application companies need to design for the model 6 to 12 months out. For example, when Cursor first launched, the models at the time couldn't realize its full vision — it wasn't until Claude 3.5 Sonnet came out that Cursor truly became a usable product.

When Manus was being designed, the best model available was Claude 3.5 Sonnet; at launch Claude 3.7 Sonnet had just come online, so Manus could complete some more complex tasks, and subsequent releases of Claude 4, Gemini 2.5 Pro, and other new models further improved Manus's performance. Perhaps one or two more major model iterations are needed before mainstream users can fully feel the productivity lift that Agents bring.

So we believe model capabilities will keep improving rapidly — there may well be another jaw-dropping release soon. When that happens, both model progress and application value could exceed market expectations, and the overall pace of AI development will accelerate once again.

Q: One unexpected development over the past six months is the fierce food delivery war. It involves several major platforms — Alibaba, Meituan, and JD.com — diverting significant executive attention and resources. How do you think this affects China's AI landscape, or what impact might it have on startups?

Yusen Dai: For now, these remain two separate battlefields. In the long run, it may have some impact on resource allocation.

That said, Alibaba Cloud's growth outlook still looks quite strong. Jensen Huang just announced that H20 sales to China can resume. We're seeing IDC and cloud services grow very fast in the US this year, because once applications land, inference demand surges.

I think China will follow a similar path. As companies like Moonshot AI, ByteDance, and DeepSeek release better models, more use cases unlock and inference compute demand will explode quickly. Knowledge workers in China and the US are actually quite similar — everyone uses Office, everyone searches, everyone uses deep research tools. The demand scenarios already validated in the US will inevitably take off in China as well.

Of course this has nothing directly to do with the delivery war. But Chinese cloud providers like Alibaba Cloud and Volcano Engine may well go through a growth cycle similar to what we've seen in the US.

Different Paths Forward from DeepSeek

Q: Let's dig into some specific directions. Why hasn't DeepSeek released R2 yet?

Yusen Dai: It's still pretty mysterious — we can only piece things together from peripheral signals. I've heard V4 is still in training. They released V3 first, then R1, so if V4 hasn't even dropped yet and is reportedly still training, R2 would likely come after V4. But I believe DeepSeek is working on genuinely interesting innovations. We've talked to some people there and know their innovative capacity is very strong. Though I do think they're constrained by compute resources. Total GPU supply is finite, and after releasing models, massive compute has to go toward serving inference.

Q: Right, I think they're also figuring out the future direction of models or intelligence itself, and basically still not doing multimodal.

Yusen Dai: Exactly. This reflects a very practical reality: DeepSeek doesn't have enough resources to match SOTA across every domain. So, like Anthropic, the company behind Claude, they have to make judgments about which directions matter most and can yield results at this stage, then concentrate resources for breakthroughs. For directions that remain unclear, they can wait until thinking crystallizes, then leverage their engineering capabilities and team effort to catch up.

ByteDance Seed, by contrast, may be trying to lead comprehensively: they have an Edge group doing frontier research, a Focus group chasing SOTA, and a Base group building service products and applications. That's a full-stack layout.

DeepSeek is more selective, prioritizing breakthroughs in model intelligence specifically. When resources are limited, tradeoffs are necessary. Seed's organizational structure, by contrast, makes the division of labor explicit, with Edge, Focus, and Base clearly separated.

Q: Yes, they do have this fairly clear division of labor now. Edge initially listed five directions, which may have expanded to a dozen-plus projects — quite impressive.

Yusen Dai: I think separating applications from research, then further subdividing research into SOTA and Frontier components, is the right approach. Previously people might just split into Frontier and Applied Research, but resources were insufficient and organizational responsibilities weren't clear enough. The common problem was this: if you have one model team simultaneously doing frontier research to chase or surpass SOTA while also meeting app deployment demands, these two goals easily conflict.

Moonshot AI has summarized considerable experience on this over the past six months. If you have a high-DAU app live, it demands enormous maintenance effort — handling corner cases, fixing bugs. But this work contributes little to pushing the next-generation model or challenging SOTA. This year they haven't invested much in iterating K1 or application features, instead focusing on the next-generation model to expand the boundaries of model intelligence.

Q: This is a question we used to repeatedly ask large model startup founders: how do you allocate energy between building models and building products?

Yusen Dai: At this point, I think you need to push one direction to the extreme first. If you're building applications, assume you'll use the best available models — whoever's strongest and most suitable, use them. But if you're building models, your goal is keeping your own model at SOTA level, making it the strongest in some specific domain.

The Value of Investing in People, and How K2 Turned the Tide

Q: I want to talk about Kimi K2. After DeepSeek triggered a frenzy early this year, Moonshot AI set a clear goal internally: pursue SOTA. Under this objective, K2 should be their first major achievement after making directional adjustments. How do you understand this process?

Yusen Dai: Let me start with a small story. A few days ago, Anyong organized a roundtable in Liangzhu, bringing together investors in Moonshot AI and MiniMax. I joked that it felt like a "sympathy gathering," as if we who invested in large model companies deserved pity. But I think what truly demonstrates a team's capability is how they respond to challenges, and whether they stick to their path to do valuable innovation.

There are ways to play when the wind is at your back, and ways to play when it's against you. Take MiniMax — they stay focused on their direction and are now advancing their IPO process. That's their way of facing challenges. Of course, some companies experience significant internal team changes in headwinds, even shifting business direction. But what's special about Moonshot AI is their team stability. If you look at their founding team or core members, there's been virtually no major turnover.

Q: This is actually my first point of curiosity. Many companies see changes at the co-founder or core business leader level now, but Moonshot AI has barely changed.

Yusen Dai: This probably relates to their team composition. Moonshot AI's founding team has always been centered on Zhilin Yang, with members who are old Tsinghua classmates — they've collaborated extensively, were even roommates, and played in a band together. This wasn't a team assembled ad hoc to start a large model company.

Q: When I looked into the records, I found that when Zhilin Yang was up for a special award, there was a photo of classmates from his department holding banners to support him. Several junior students from his department would say that Yang was already someone who struck them as remarkably charismatic at that time.

Yusen Dai: Exactly — this was one of the important reasons we decided to invest in them from the start. This team doesn't just have technical ability; they have long-standing mutual trust. Entrepreneurship involves many challenges, like stress tests. When facing challenges, team stability and directional focus become crucial.

The Moonshot AI team has been fundamentally built around technical DNA and technical vision from day one. Many people may have forgotten that in 2023, AI was changing so fast that nearly every month brought a new trend. At that time, Moonshot AI made an important judgment on the long-context direction, choosing to build a model with long-context capability and launching the first version of Kimi with search functionality based on this.

Back then, a large number of AI chatbots actually had no search function. Without search, a model's utility was severely limited — ask something as simple as "who is the current US President" and it might fail to answer.

The Moonshot AI team's judgment on the long-context technical direction demonstrated their very strong technical vision. By 2025, as AI increasingly emphasizes Agent capabilities, complex task execution, and handling larger codebases, people are only now truly recognizing long context's importance. If you really want AI to act as an Agent completing complex tasks, it can't just execute 100 steps and stop. Looking back, this validates the accuracy of Yang's judgment.

So I think this team has three especially important qualities:

First, team stability. This comes from members' long history together and mutual trust.

Second, persistence on technical direction. They don't chase whatever's hot, building companionship apps when companionship is trendy and pivoting to multimodal when multimodal is hot. Many directions are viable, but the teams that truly accumulate capability are those that persistently do one thing.

Third, they maintain technical sensitivity and insight. This matters especially at critical inflection points in technology evolution.

Of course R1's success did give the industry significant inspiration — no doubt about that. At the time, a popular interpretation was: "pre-training doesn't matter, post-training does." But I think R1's foundation was V3, which precisely shows that good base models matter — pre-training and overall architecture optimization can drive model capability improvements. K2 is still a non-reasoning model yet already demonstrates strong capabilities, which also shows pre-training remains important.

Open source matters too. The global AI community is currently in the Early Adopter phase — teams that provide value to the community through excellent open-source models and products readily receive enthusiastic response. We've seen this with DeepSeek, K2, and open-source projects we sponsor like vLLM and ControlNet: as long as you consistently deliver good products and maintain open communication with users, users worldwide will appreciate and support you, even proactively helping improve things. But open source alone isn't the essence — the essence is open-sourcing good things. Open source isn't automatically good; the community finding it good is what matters.

Q: Speaking of open source, since K2 is 1 trillion parameters, very few in the open-source community can actually deploy it.

Yusen Dai: K2's primary scenario isn't local machine deployment — in fact, running DeepSeek R1 at full scale locally isn't simple either. The core of open source isn't local deployment; it's giving people more autonomous control over the model.

Q: Can you summarize why they were able to produce a model at K2's level? What challenges did they encounter in practice? Did external sentiment putting them in a low point create significant pressure, such as talent attrition? I'm talking not just about the core layer, but frontline engineers as well.

Yusen Dai: There was definitely some attrition, but Moonshot AI had two advantages: first, the core leadership remained stable; second, many younger colleagues wanted to stay.

I think people stayed not just for the money, but because they could learn things and accomplish work they genuinely felt proud of. That fits the DNA of Moonshot AI's core team.

We discussed this during the user acquisition phase — they knew marketing wasn't their strength, so they focused on what the team did best. I think that matters a lot. When facing external market pressure, don't panic; focus on what you're best at, what you can execute best, and what matters most. Several Moonshot AI researchers even posted on Zhihu sharing their reflections on working on K2.

Technical Shifts: Reasoning, Coding, Tool Use

Q: In the technical domain, what changes are you paying closest attention to?

Yusen Dai: As we discussed before, I see three main threads unlocking AI productivity: reasoning, coding, and tool use.

On reasoning, we've seen the releases of o3, o4 mini, and later o3 pro. While these models haven't shown dramatic gains on some benchmarks, our own experience is that o3 represents a clear step up from o1, and o3 pro keeps improving on inference length and logical coherence. Many models are now making incremental advances in reasoning details — fewer hallucinations in specifics, more rigorous outputs.

We're also seeing smaller models develop strong reasoning capabilities. Metrics like GPQA or AIME that reflect model reasoning ability are now very high. K2 also performs well here.

Q: In our conversations with others, like Alibaba Cloud CTO Jingren Zhou, he didn't see the o-series updates as a major paradigm shift. He viewed them as natural extensions within the existing large-model methodological framework.

Yusen Dai: I agree with that. If these models are still Transformer architectures, they're evolving within the current paradigm. Everyone's waiting to see what comes after Transformer.

But sometimes a single paradigm can carry you very far. Highways last for decades — you don't need disruptive architectural changes every year to call it innovation. In fact, if there were disruptive breakthroughs annually, it would suggest the industry is still too unstable for real commercial deployment. The technologies that will actually matter for industry this year aren't 0-to-1 changes; they're more like 1-to-10, or even 5-to-8 evolutions. Reasoning improvements, for instance, are about going from very good to excellent.

In coding, Sonnet 3.5 was already quite capable, but its context length was insufficient and its self-correction was mediocre. Sonnet 3.7 and Sonnet 4 running in Claude Code work remarkably well. For complex, lengthy code segments, they often get it right in one shot. This isn't 0-to-1; it's a quality leap from 7 to 10.

Q: In the foundational model competition, Google has built strong momentum recently. When OpenAI burst onto the scene two years ago, Google seemed somewhat stunned. But now Gemini 2.5 has strong word-of-mouth and real-world feedback.

Yusen Dai: Indeed, Google has deep technical accumulation, high talent density, abundant capital, and plentiful compute. So we've clearly felt Google's marginal improvements this year.

At the model level, Gemini 2.5 performs very well. At the cloud services level, GCP actually delivers better performance for the same Claude API inference services, which owes much to TPU support. Google is very strong — one of the most competitive players among the top three in the model space right now.

But they also face a real problem: their core search business is under pressure. Concerns about AI's impact on search advertising have kept their stock volatile. This is a classic example: the legacy business is being damaged while the new business grows rapidly. How this ultimately plays out, I think, will take another year or two to see clearly.

Applications Growing, Making AI More Than Q&A

Q: This brings us back to a theme you raised earlier: the relationship between models and upper-layer applications, which continues to evolve.

Yusen Dai: An application's value first depends on the model itself, on the foundational capabilities baked into the weights. The stronger a model's reasoning and coding abilities, the more value applications can unlock. But once weights are fixed, the content is static while problems are dynamic, so context must be introduced. The current popularity of Context Engineering over Prompt Engineering shows that prompting models alone isn't enough; better context is needed.

I think context operates on three levels:

The first is general information — things like "what's the weather today?" that the model doesn't inherently know and must dynamically retrieve through search or similar means. While some models can now perform simple searches, this requires equipping them with appropriate tools.

The second is organizational: internal processes, documentation, accumulated knowledge. The model doesn't know these either; applications must collaborate with the model to guide people in accessing this information.

The third is personal: conversation history, preferences, background. The model lacks these too, so the application layer must provide them.

So this context layer is supplied by applications, and its quality makes an enormous difference in AI application performance.
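
As a rough illustration of the three context levels described above, here is a minimal Python sketch of an application assembling context around a user's question before it reaches the model. The `ContextBundle` class, its field names, and `build_prompt` are hypothetical names invented for this example, not any product's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBundle:
    """Hypothetical container for the context an application supplies."""
    general: list[str] = field(default_factory=list)         # e.g. search results, weather
    organizational: list[str] = field(default_factory=list)  # e.g. internal docs, processes
    personal: list[str] = field(default_factory=list)        # e.g. history, preferences


def build_prompt(question: str, ctx: ContextBundle) -> str:
    """Combine app-supplied context with the question into one prompt string."""
    sections = [
        ("General information", ctx.general),
        ("Organizational knowledge", ctx.organizational),
        ("Personal context", ctx.personal),
    ]
    parts = []
    for title, items in sections:
        if items:  # skip levels the application has nothing for
            bullet_lines = "\n".join(f"- {item}" for item in items)
            parts.append(f"## {title}\n{bullet_lines}")
    parts.append(f"## Question\n{question}")
    return "\n\n".join(parts)


# Example data is made up for illustration.
ctx = ContextBundle(
    general=["Weather today: light rain"],
    personal=["User prefers short answers"],
)
prompt = build_prompt("Should I bring an umbrella?", ctx)
print(prompt)
```

The model's weights stay the same across calls; only the assembled context changes, which is why the quality of what the application supplies matters so much.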

AI's goal isn't just to become a Q&A machine; ultimately it must actually help users get things done. Which tools it can call, what outcomes it can affect — these are also provided by application-layer companies. For example, what public or private MCP tools the product offers, or what environments the AI can deploy its outputs to.

The model is really just the bottom layer. When ChatGPT first launched, most of our use cases involved "asking" the model, so we could only extract answers from its compression of existing knowledge. Factual questions, for instance, do rely primarily on the model itself. But when tasks grow more complex and the model's intelligence needs to work with context, or even an environment, to be effective, that's where the "shell" becomes valuable.

Q: So you see this as a natural evolutionary path? There's no need to insist "we're a model company"?

Yusen Dai: Right, models are definitely important, but relying on models alone probably isn't enough to fully unlock value.

Q: How did people view Google a year ago? A laggard?

Yusen Dai: People definitely thought Google was somewhat behind, overshadowed by OpenAI, with talent leaving. But after Google co-founder Sergey Brin returned, many things changed. There were rumors, for instance, that Noam Shazeer — founder of acquired Character.ai — went back and personally fixed a bug that dramatically improved model performance. Hard to verify, but key talent can indeed solve critical problems.

Q: So Google's rapid catch-up may involve not just technology but also organizational approaches and intensity of investment?

Yusen Dai: Right, they're taking this very seriously. I hear the Gemini team is working intense hours too. Google used to be seen as a retirement home, but now they're grinding hard.

Q: Model competition has actually activated a lot of smart people; the sense of pursuing meaningful achievement has returned.

Yusen Dai: I think so. The founders of these companies all take AI extremely seriously. It's no longer "will AI materialize?" — it's must-win.

Whether Zuckerberg, Sergey Brin, or the teams at OpenAI and Anthropic, they all see AGI as imminent, recognize its importance, and are willing to spend and invest resources.

Y Combinator's recent batch also noted that when building any company now, you should operate under the assumption that "AGI arrives in two years." You need to ask yourself: If AGI happens in two years, what should your company do?

Of course there's still much debate about what AGI even means, but there's no doubt massive change is happening, and happening fast. Computer science students in Silicon Valley are already finding it harder to land jobs, because entry-level programming work has been significantly displaced by AI. These changes are very real.

Q: Let's return to your three main threads. We've covered reasoning and coding; now tool use. Recently both Kimi K2 and Grok have incorporated tool-use capabilities during the training phase itself. Is this a new trend?

Yusen Dai: There are currently two main approaches to AI tool use:

1. API-based interfaces like MCP;

2. Having AI operate existing software visually, simulating how a human would use it.

Both paths have their practitioners. The MCP ecosystem is already established, with increasingly more tools being built for AI. Meanwhile, products like Manus and OpenAI's Operator use browsers within sandboxed virtual machines, operating existing software through visual perception to simulate human usage processes — all aimed at enabling AI to better leverage existing software capabilities.
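
To make the contrast between the two paths concrete, here is a toy Python sketch. It is not the MCP wire protocol and not how Manus or Operator are implemented; `register_tool`, `call_api_tool`, `GuiAction`, and `describe_gui_plan` are all hypothetical names, and the weather tool is a stub.

```python
from dataclasses import dataclass
from typing import Callable

# Path 1 (API-style): tools are registered by name and called with structured arguments.
TOOLS: dict[str, Callable[..., str]] = {}


def register_tool(name: str):
    """Decorator that adds a function to the hypothetical tool registry."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("get_weather")
def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"  # a real tool would query a service


def call_api_tool(name: str, **kwargs) -> str:
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


# Path 2 (visual): the same task becomes a sequence of low-level UI actions.
@dataclass
class GuiAction:
    kind: str       # "click" or "type"
    target: str     # UI element the action applies to
    text: str = ""  # text to enter, for "type" actions


def describe_gui_plan(actions: list[GuiAction]) -> list[str]:
    """Render each UI action as a readable step (no real browser is driven here)."""
    steps = []
    for a in actions:
        if a.kind == "type":
            steps.append(f"type '{a.text}' into {a.target}")
        else:
            steps.append(f"{a.kind} {a.target}")
    return steps


api_result = call_api_tool("get_weather", city="Beijing")
gui_steps = describe_gui_plan([
    GuiAction("click", "search box"),
    GuiAction("type", "search box", "Beijing weather"),
    GuiAction("click", "search button"),
])
```

The API path is a single structured call but requires someone to have built the tool; the visual path needs several fragile steps but works with software that was only ever designed for humans.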

Being able to use human tools to complete tasks — I think this is crucial for making AI genuinely useful.

Agents Teaching Everyone to Be a Good Boss

Q: In OpenAI's original five-stage roadmap, the third stage after reasoning was actually Agent.

Yusen Dai: Yes, Zhang Xiangyu had a great podcast on this that I really agree with. The first stage is chatbot, corresponding to ChatGPT; the second is reasoning, corresponding to the o-series models; the third stage, Agent, corresponds to Agent-native models — though those haven't truly emerged yet.

By definition, an Agent would seek out its own goals, but currently goals are still given by humans. Today, Agent means: I give it a goal, and it predicts the sequence of tool usage, selecting which tools to use to complete the task. It may not yet be able to break down tasks and define goals on its own, the way an employee you delegate to would.
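
The pattern just described, a human-supplied goal followed by a predicted sequence of tool calls, can be sketched in a few lines of Python. This is a simplified illustration: `stub_planner` is a hypothetical stand-in for the model's prediction, and the `search` and `summarize` tools are stubs.

```python
from typing import Callable


def search(query: str) -> str:
    return f"results for '{query}'"  # stub: a real tool would call a search service


def summarize(text: str) -> str:
    return f"summary of [{text}]"  # stub: a real tool would call a model


TOOLS: dict[str, Callable[[str], str]] = {"search": search, "summarize": summarize}


def stub_planner(goal: str) -> list[str]:
    """Stand-in for the model: returns an ordered list of tool names for the goal.

    In a real agent this sequence would be predicted by the model, not hard-coded.
    """
    return ["search", "summarize"]


def run_agent(goal: str, max_steps: int = 10) -> str:
    """Execute the planned tools in order, feeding each result into the next tool."""
    result = goal  # the goal comes from the human, not from the agent itself
    for tool_name in stub_planner(goal)[:max_steps]:
        result = TOOLS[tool_name](result)
    return result


output = run_agent("Kimi K2 release notes")
print(output)
```

Note that the goal is still an input to `run_agent`; nothing in the loop decides what is worth doing, which is the gap the passage points to.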

AI Agent products are still in a very early stage. Manus only launched a few months ago, but I think within a year or even six months, as model capabilities improve, this category will get dramatically stronger.

What I want to say is that different companies, because of different resource endowments, will approach the Agent problem differently. We try not to make predictions or to believe we can know the future in advance. For example, Moonshot AI's view is Model as Agent: adding large amounts of end-to-end tool-use data during model training so the model itself gains strong tool-calling capabilities. Meanwhile, among products that call closed-source model APIs, Manus proposed "less structure, more intelligence," though sometimes structure can also improve work efficiency: Genspark, for example, built slide generation specifically for PPT scenarios, introducing a series of methods to optimize work output.

Q: Both angles are valid. For users, some scenarios have rough workflows, making outcomes more controllable and costs lower.

Yusen Dai: Because users want the final result, and different companies trying to achieve this result may take various paths. Some are flexible but costly, others fixed but cheap. So everyone solving the same problem with different methods — that's all reasonable.

Q: The biggest trend you're describing is undoubtedly still Agent?

Yusen Dai: It's AI's improvement of productivity. To genuinely raise productivity with AI, you have to let AI take on more work. Products like Claude Code, Manus, and other Agents — the core philosophy is humans don't work, AI works.

Some say this is like an L3 autonomous driving product: the person doesn't touch the steering wheel, the car drives itself. We found that engineers writing code initially liked Cursor because it still lets you write code in a familiar IDE. But Manus observed that product managers using Cursor to complete tasks barely looked at the code, only at the chat panel on the right — so they put the chat panel in the primary position, creating an Agent more suitable for non-programmers.

As model capabilities advance, Claude Code takes it further: users don't write code in it at all; they can only tell AI what they want done, and AI handles the rest. So L3, or Agent, means AI becomes the protagonist of execution, while users need to learn to be a good boss to AI.

Q: That's pretty hard for many people, a barrier. When AI's work is unsatisfactory, having it try multiple times still doesn't work.

Yusen Dai: I used to think this way when I was an entrepreneur too — I had to do everything myself. Later I realized this isn't good management. I should empower subordinates, let them know what I want, give them initiative.

In the future, humans directing AI may work the same way. This may be the first time in human history that everyone needs to train and develop a tool. Before, developing people was hard: most people are the ones being developed, and few have the ability or opportunity to develop a subordinate. But now everyone may need to learn how to give AI instructions and how to train AI to complete work better.

Q: You mentioned that general-purpose Agents like Manus and Genspark have fairly broad user bases. How do you view Agents in vertical scenarios?

Yusen Dai: They're general-purpose because current model capabilities lean general, but certain vertical scenarios will definitely emerge over time.

I think a good product ultimately needs clear positioning — it has to be absolutely number one in certain domains to have long-term value. Or rather, our goal isn't pursuing general-purpose, but starting from general-purpose and gradually converging on core scenarios.

In the early stages of a technological revolution, everyone is often experimenting, not knowing what the new technology is suited for, and finally seeing what works best. For example, when the steam engine was first invented, it was initially used to pump water from coal mines; later people discovered it was better for driving trains and textile machinery. The steam engine was also a "general-purpose technology," but its greatest value ultimately came from a few specific scenarios.

I think coding, office work like making PPTs, and deep research — these three are undeniably important directions that have already emerged.

Q: Here's an interesting topic. Domestically, when people discuss general-purpose products, they see it as a battleground for large companies. But talking with foreign investors, they're more interested in the possibility of a Super App, concerned with how to beat OpenAI and Google.

Yusen Dai: If you have the opportunity to challenge large companies, that's a good thing — at least you qualify for the Olympics, which is better than not participating.

Something very interesting: after Manus appeared, many people said it had no moat, that you could build it over a weekend with open-source frameworks. But now so many weekends have passed, and no similar application has actually been built well.

I think in the global market, people still respect true innovation — they won't directly copy an identical product. They may borrow interaction or presentation concepts, like the visual form of AI doing work, but won't do pixel-level replication. In global competition, first-mover advantage brings significant word-of-mouth and distribution advantages, which is also a reward for innovators.

$1,000 Monthly AI Product Subscriptions

Q: Have you calculated how much you spend on AI product subscriptions per month?

Yusen Dai: Roughly close to $1,000. Manus is $200, Genspark is $200, ChatGPT, Gemini, Grok — all roughly $200 each. I basically buy the premium plans.

I've always had a philosophy: try new products extensively. Often spending a bit to try things out isn't excessive. The revolutionary aspects of many AI products can't be seen just from reports — you have to use them personally. When you can see a future, you generate a lot of inspiration.

We observed in March that after Manus launched, inference usage exploded — Agent products' token consumption significantly exceeded chatbot usage. At the time, many in the secondary market were still questioning NVIDIA, thinking even if everyone in the world used chatbots, inference demand wouldn't be that large, wouldn't need that much compute.

But this is like the dial-up era — initially everyone was just chatting on QQ, not needing much bandwidth. But once broadband arrived, people wanted to stream 4K video online. The stronger model capabilities become, the more scenarios they unlock, the more tokens get used.

Q: In 2023, Jensen Huang said in an internal NVIDIA speech that their market cap target was $2 trillion. At the time NVIDIA had just broken $1 trillion. We were still debating whether he was being too boastful. This year it already broke $4 trillion.

Yusen Dai: He'll likely hit $5 trillion soon. Because the trend of tokens converting to productivity has only just begun.

It's like a train that's started moving — it won't suddenly stop. We're still constantly discovering new AI use cases. For example, an engineer who used to write 100 lines of code per day — now with Cursor, Claude Code, they might write 10x that amount, solving more problems they never thought to tackle before. Or with ChatGPT and Manus, the questions you'll ask also multiply.

Many questions you previously didn't know who to ask, now you can use AI to solve. The productivity gains for users make them more willing to pay.

Q: Currently token consumption in productivity scenarios is very high?

Yusen Dai: Productivity can grow 10x, 100x. No matter how much you casually chat with AI, there's only so much time in a day — this is what we used to mean by attention is all you need. If what you want is user attention, it's limited, and you're competing for it against TikTok, Xiaohongshu.

But in productivity scenarios, the ceiling for user demand is very high — it can go from asking one question to asking 100 questions, with compute needs increasing 100x.

Q: And the complexity of token consumption per unit time is also rapidly increasing — for example, the content and visual information I want to consume may also become more complex.

Yusen Dai: In the future you can ask AI extremely complex questions you hadn't thought of before. Let me give you a simple example. Friends in US equity secondary markets, during earnings season, might need to track five or six companies reporting in a single day. Waking at 4 a.m. to read earnings data, inputting into models for comparison, listening to earnings calls, analyzing the CEO's outlook — this is their daily routine.

Previously they couldn't simultaneously listen to multiple earnings calls, so they had to hire more people or be selective. But now with AI, though it can't yet fully automate the entire workflow, within 6-12 months it may be possible for one analyst to cover 50 stocks' earnings simultaneously.

AI can help them read earnings reports, listen to calls and take notes, answer pre-prepared questions, summarize CEO responses, write reports. These were things previously left off your work schedule because they were "impossible" — now AI can do them, so demand naturally grows.

Before airplanes existed, no one would say "I'm flying to the US for business today." But once airplanes existed, new demand emerged. AI is the same — it gets you doing things you otherwise wouldn't have considered.

Q: Time is limited. But the complexity of entertainment content, sensory stimulation per unit time could also increase dramatically — this was also previously hard to imagine.

Yusen Dai: Yes, truly hard to imagine. But what I want to say is: productivity value is directly measurable. If AI helps me earn $100, I'll pay it $1 or $10.

And we've observed a very interesting phenomenon: when AI charges by token usage, many people actually hope to use more. Because it's genuinely helping you complete work — like writing more code for you.

This was work you had to do yourself, spend time on, spend money hiring people for — now AI does it for you, so it has value.

The Acqui-Hire Talent War in Silicon Valley

Q: What's your take on the recent "talent war"?

Yusen Dai: A lot of people have indeed been poached recently, and plenty of others got the call but didn't go. Large numbers of top-tier people are being lured away by compensation packages big enough to be disruptive.

This kind of raiding is a huge shock both to the teams being hollowed out and to Meta's existing teams. Similar turmoil is happening at virtually every top company in Silicon Valley. The morale of teams that lost people inevitably wavers, and those who stay start wondering: should I be getting a raise too?

This high-price poaching certainly reflects the value of talent, but the more elite the talent, the more time and environment they need to truly gel and become a cohesive force. History is full of failed examples, so for these organizations, this is both an opportunity and a challenge.

Q: Do you think this competition for talent is... ethical?

Yusen Dai: I think it also reflects founder mentality — the willingness to do whatever it takes to secure talent. If money can solve it, spend the money. It shows that talent really matters.

Q: Is the pressure mainly in Silicon Valley? Though from another angle, this also gives startups decent exit opportunities.

Yusen Dai: But these exits may not be big enough. Some people think selling a company for a few hundred million is fine; others want to build hundred-billion-dollar companies. Startups also need more ammunition to compete with giants like Meta. Take Cursor: they raised a ton of money, and we once wondered what they needed it all for. Now we see that they have to subsidize users' token costs and pay even more for talent, so raising more makes complete sense. Competition has escalated on both subsidies and talent. Top talent has plenty of options, and for many startups, the barrier to entry and the waterline for joining the fray are both rising.

Q: Acqui-hires are very popular in Silicon Valley — mainly to circumvent antitrust restrictions, and it shows how intense competition has become.

Yusen Dai: Everyone wants to move faster because there's too much money. Several giants have massive cash on their balance sheets. Deploying that capital — if it can buy time and competitive advantage — is a simple equation for them.

Benchmark Stagnation and Pushing the Boundaries of Intelligence

Q: For this final section, I'd like to ask about your personal reflections. It's been two and a half years since ChatGPT launched. What are you still genuinely curious about?

Yusen Dai: I remain curious about many things. First: how to measure the boundaries of intelligence.

Think about it: when ChatGPT first launched, humans could still point out flaws in its answers. But now, whether in research depth or writing quality, ordinary people increasingly struggle to identify its shortcomings. As AI gradually approaches human intelligence, how do you measure something that may be smarter, think more deeply, and remember better than you?

On measuring intelligence, my good friend Shunyu Yao wrote in his essay The Second Half that AI benchmarks will become increasingly important in the future. Current benchmarks have stagnated — they can no longer accurately distinguish between models. Does scoring 85 versus 90 on a benchmark really reflect model differences? Moonshot AI's experience also shows that well-crafted internal benchmarks matter enormously. The crux of model training is how you measure results, and the quality of internal benchmarks often determines model quality.

So I believe measuring intelligence and exploring its boundaries remains critically important. For now we can still, just barely, "vibe test" models to sense capability differences, but in a few years, when the top five models may all be smarter than you, how do you evaluate which is better?

Q: Beyond how to measure the boundaries of intelligence, what else are you curious about?

Yusen Dai: The second thing I keep coming back to is the logic of productivity. When everyone possesses massive productivity, what does that mean for individuals, organizations, and the world at large?

For individuals, super-individuals can do more and more alone — from small apps like cat fill lights to developing games, even to Sam Altman's predicted "one-person unicorn." When Instagram was acquired, it had just 13 people; in the AI era, going from 13 to 3 is entirely plausible.

This means gaps between people will widen dramatically. When everyone has infinitely smart assistants and "cyber beasts of burden" like Manus working tirelessly for them 24/7, some will leverage them to create enormous value while others won't — so divergence in growth rates will accelerate.

For organizations, small ones can become incredibly powerful, while large ones can use advanced technology to manage bigger, more complex operations. Meituan manages millions of delivery riders across enormously complex operations — impossible without advanced internet communications and management technology. Add AI, and big companies' management scale, operational complexity, and depth all rise another level.

The world is driven by organizations, and expanding organizational capability boundaries has massive global impact. Going further, when overall productivity surges while gaps between individuals and organizations widen, how do we balance efficiency and fairness? AI started as tools made by the smartest people for top users — how does value created by these elites flow back to ordinary people? Even if an average person doesn't diligently study AI, how do we make AI products increasingly accessible so ordinary people benefit too?

Q: AI also brings privacy exposure and misinformation proliferation, blurring the boundaries of what's real.

Yusen Dai: Right — it's hard to distinguish what's a genuine article. I can still detect DeepSeek's flavor now, but in a year I probably won't be able to, or rather, there are probably already many AI-written articles I can't identify — I only catch the ones that aren't disguised well enough. Everyone knows AI cites false content, but what counts as false? The line between real and fake is increasingly blurred.

I keep wondering: what is humanity's greatest limitation? My answer is the brain's power budget, about 20 watts. That's the ceiling for human intelligence. AI can rapidly approach and even surpass this ceiling. As intelligence becomes more abundant, what we do with it and how humans and AI divide roles are questions we will soon have to confront and adapt to.

Sometimes I find it a bit frightening, because massive change has already arrived and people are only gradually feeling it. The impact on programmers is especially direct: ordinary junior programmers who don't use AI will find it very difficult to land jobs in two years. But that is too short a time for people to change. And many more professions may face massive disruption in very short order.

Talk Is Cheap, Show Me the Product

Q: What fatigues you?

Yusen Dai: First, excessive marketing. There's been a trend these past few years of products over-marketing with clickbait shock headlines. But good products like Manus spent almost nothing on marketing — though people mistakenly assumed they spent a lot.

Previously, AI progress was mainly model progress that ordinary users couldn't yet experience in products, so researchers and media held disproportionate sway over the narrative. But now models are beginning to translate into applications. As I posted on Jike at the time: "Talk is cheap, show me your product."

The core capabilities of many AI models must ultimately convert into products that users can actually use — that's when AI comes alive. Companies that just tell stories and hype things up should focus on building products instead. We're seeing that the AI companies thriving now have largely done this — delivering real value to customers.

Q: What are you trying to validate this year?

Yusen Dai: One thing is whether L3-level AI Agent applications can quickly reach the point of completing real work. We're all using Manus, and users are paying, but sometimes a task only scores 70 or 80, and humans still need to get it to 100. Claude Code, compared to previous coding agents, is gradually reaching the point of completing things in one shot, deployable without revision.

In the coming months through year-end, I believe Agent capabilities will improve dramatically. By then, you might give AI one instruction and it just cranks away, getting it right the first time.

Q: My experience with AI now is that I have to make a deliberate effort to use it more, because when I give it complex tasks, it doesn't complete them perfectly.

Yusen Dai: This is universal. Good AI products are designed for future models. Cursor launched two or three years ago, but only took off when Sonnet 3.5 came out, then exploded with 3.7.

Manus is the same — many tasks didn't work well at launch, but in 6 or 12 months, new model generations will make it perform better. So design for the future, not for the models available today.

Q: This may feel counterintuitive for ordinary users, but I understand — we're in a rapid development phase. When it reaches mainstream users, they'll still expect it to work out of the box.

Yusen Dai: Actually, not necessarily. In the recent conversation we hosted between Manus and YouTube co-founder Steve Chen, Steve said YouTube was designed for the broadband future. In 2005, America was just beginning to roll out broadband, so the initial experience wasn't great. The same was true for short-video platforms like TikTok and Kuaishou: they were designed for smartphones and 4G networks that wouldn't be ubiquitous for another year or two. AI is the same: you always need to stay one step ahead. As Jobs liked to say, quoting Wayne Gretzky, "Skate to where the puck is going."

Q: So what you're trying to validate is whether, by year-end, the product can complete tasks with high automation, no longer requiring human involvement?

Yusen Dai: Right now, for example, an agent's task completion rate might be 20%. Can we get it to 70-80%? That would dramatically change how frontier users define work and how they use AI.

I'm also curious how much memory will matter as people use AI products more. What's the long-term moat for AI applications? I think memory and personalization are crucial. Right now, memory and personalization have limited impact on results. But long-term, we want it to be like an employee or assistant — the more you use it, the more it understands you, until it becomes irreplaceable. That's the progress we want to see.

Q: This progress can't rely on models alone, right? Memory requires continuous interaction, giving the AI personal-level context.

Yusen Dai: On one hand, there's online learning — the model learns from use. On the other hand, you need to feed it more data, files, and context. Application design is critical. Models and applications need to work together.

I think we need to have greater expectations and tolerance for future innovation, and more confidence and support for Chinese teams' capacity to innovate and grow.

New Observations on the New Wave of Entrepreneurship

Q: What types of teams are you paying closest attention to now? Where else are you seeing new founders emerge?

Yusen Dai: This year we're seeing far more people who want to start companies. A year or two ago, people were still taking it on faith that applications would eventually land. Now they've seen Manus as a proof point. With that precedent set, people naturally think, "Maybe I can too." This is definitely a growing trend. We're seeing many researchers and young people at big companies getting restless.

Q: What interesting books have you been reading lately, or any works you'd like to share?

Yusen Dai: I'd like to recommend Clair Obscur: Expedition 33, a game developed by a French startup. The story is set in a fictional world where there's a god called the Paintress. Every year, she writes a number on a stone at the edge of the sky. This stone is called the Monolith.

She counts down from one hundred. With each number she writes, everyone who has reached that age dies and completely vanishes. First year: 100. Second year: 99. And so on, wave after wave of death. So humanity begins to resist, organizing an expedition each year made up of people who have one year left to live, trying to challenge and break this curse. But they never succeed. This year, the Paintress wrote 33. People who are 33 will die at this time next year. And so the 33rd Expedition sets out.

A few days ago, Manus founder Red turned 33, so I recommended he play this game. It's a story about someone who just turned 33 sailing out to challenge their fate. The development team also has 33 people, and it's a startup. The founder used to work at Ubisoft, got bored there, and left to make this game.

Black Myth: Wukong is a major Chinese IP, a premium product combining China's cultural depth with advanced technology. Clair Obscur: Expedition 33 is similarly a work of French romantic imagination, an outstanding product combining art and advanced technology. Its plot, visuals, music — everything is exceptionally well done. It's one of my favorite games this year, and a contender for Game of the Year.

The audio version of this episode is also available on the ZhenFund podcast This Is Serious — tune in!