Chen Yu in Conversation with X Variable Robot and DeepRoute: AI in the Physical World and How to Tackle the Toughest Nuts to Crack | Yunqi Capital Doers

Yunqi Capital · December 10, 2025

Bridging the Chasm from Virtual to Reality

In 2025, the narrative around physical AI is shifting. Embodied intelligence and assisted driving are no longer just "storytelling" industries — delivery volumes, production line deployment, revenue, and orders are becoming the focal points of conversation, with market discussions increasingly centered on "how fast can commercialization run?"

But beyond commercialization, a fundamental question still demands constant examination: As AI accelerates its deployment in the physical world, is the technical foundation beneath our feet fully solidified?

At the recently held Zero2IPO "25th Annual China Equity Investment Conference," Chen Yu, partner at Yunqi Capital, sat down for an in-depth conversation with the founders and CEOs of two portfolio companies — Wang Qian of X Variable Robot and Zhou Guang, co-founder of DeepRoute. This roundtable, titled "Innovation and Boundaries: AI Landing in the Physical World," brought the discussion back to more foundational technical questions:

How do we bridge the gap between digital models and physical action? What are the necessities and limitations of language, touch, and vision in improving physical AI's learning efficiency and ensuring safety?

Of course, the guests, all on the front lines of the industry, didn't shy away from what the market cares about most right now: What commercialization pace and scaling path will physical AI take? This edition of "Yunqi Doers" shares their new insights and practical takeaways.

Key points

Physical AI Foundation Models and Positioning

The physical world differs vastly from the virtual world, requiring a foundation model born entirely of and serving the physical world, existing in parallel to language models.
The total knowledge required for physical models is far less than that required for large language models (LLMs), so their parameter scale is expected to be smaller than that of language models.

Key Challenges of Landing in the Physical World (Sim-to-Real Gap)

The biggest hurdle is the "physical gap": running well in simulators doesn't mean a model works in reality. The core issue is that simulators don't accurately model physical laws.
The uniqueness of manipulation: Manipulation differs fundamentally from locomotion and navigation; even tiny errors (for example, in friction or collisions) lead to completely different outcomes, making it potentially the last piece of the AI puzzle to be solved.
Real-time requirements: Embodied intelligence is essentially a real-time system requiring instant feedback — it can't spend time deliberating over multiple paths like large language models.

Data Strategy, Pre-training, and Modalities

Language serves as critically important supervisory information during training, helping models learn and converge quickly (e.g., clarifying why a vehicle stopped); at deployment, it enhances system interpretability and user reassurance.
Future physical world foundation models need to become less dependent on language — language struggles to describe fine spatial relationships and short-duration action changes — though the language modality will persist.
Tactile information is "extremely important," but not absolutely necessary. Pure vision methods can extract substantial information, yet manipulation performance falls far short of what tactile-enabled systems achieve. Extensive tactile training is still needed to approach or surpass human-level performance.

Commercialization and Scaling Timeline

Autonomous driving at scale: Achieving end-to-end requires roughly 10,000 vehicles; achieving VLA requires approximately 100,000 vehicles.
Embodied intelligence commercialization milestone: 2026 will be a landmark year for embodied intelligence commercialization, with genuinely ROI-positive scenarios emerging at scale.

Chen Yu

I'm very glad to be here today with two entrepreneurs on the front lines of AI landing in the physical world, to discuss a topic of great importance right now — how AI moves from model capability to actual presence in the physical world. Over the past two years, we've witnessed the rapid development of generative AI. Models have gotten smarter very quickly, but getting AI to execute reliably in the real world requires an entirely different set of system capabilities.

Today we've invited two unicorn companies in the physical AI space — X Variable Robot and DeepRoute — to explore together the challenges AI faces when entering the physical world, and the possibilities that lie within. First, please introduce your respective companies and what you're working on.

Wang Qian

X Variable is a company building embodied intelligence foundation models and general-purpose robots. We work on robot brains, complete machines, and further upstream on dexterous hands. At our core, we're an AI company, a foundation model company.

Many people still think of embodied intelligence as a vertical application of AI, or an extension of large models into the physical world. But the physical world is fundamentally so different from the virtual world that what we need is a foundation physical model born entirely of and serving the physical world, existing in parallel to language models and multimodal models. X Variable positions itself first and foremost as a foundation model company and second as a humanoid robot company, ultimately delivering integrated hardware-software products directly to end consumers and clients.

Zhou Guang

DeepRoute has been in autonomous driving for a long time, witnessing the industry's ups and downs. Early autonomous driving was based on high-definition mapping technology and modular systems, then came end-to-end, and more recently the hot topic of VLA. Autonomous driving is the first robotics field capable of achieving volume and doing pre-training well with massive datasets. We've also found that the gap between the digital world and physical world is enormous — building a data model in the physical world presents tremendous challenges. Currently, we've achieved quite good results in this regard; our products have crossed the laboratory demo stage and are successfully serving a broad consumer base.

To date, a cumulative 200,000 vehicles equipped with DeepRoute's assisted driving system have entered the consumer market, with adoption expected to reach the million-vehicle level by 2026. Having worked with data at massive scale and gone through the journey from the difficulty of building models to the difficulty of acting on them, we have quite a few reflections.

Smart Models Don't Equal Capable Action:

The Core Bottleneck of AI Landing in the Physical World

Chen Yu

Let's dive deeper. Building models is already very difficult, but turning models into truly action-capable systems that can land in the physical world is even harder. Running well in a simulator doesn't mean the model actually works, because entering the physical world brings all kinds of problems.

So I'd like to ask both of you to share: Where is the biggest difficulty when AI lands in the physical world? What problems have you encountered that didn't exist in laboratory environments? How do you solve the bottleneck of getting digital models to work in the physical world?

Zhou Guang

We have over 200,000 vehicles on the road, continuously collecting first-person-view data. With the massive amount of data gathered, the primary task is to do pre-training well — and that's not simple. With end-to-end plus language models, because language itself encompasses rich semantic knowledge, the demand for data volume drops significantly, and the learning speed of pre-training improves noticeably.

In autonomous driving, neither Tesla nor we use simulators that much. We still believe in doing pre-training first, then using reinforcement learning to improve those final few key metrics — not going straight to reinforcement learning without pre-training. And reinforcement learning's supervisory signal is extremely sparse, usually only giving an overall reward at the very end. It's like when we're driving: braking 0.1 seconds too early or too late can make a person very uncomfortable, and these situations are hard to describe through a simulator.

We've observed that introducing reinforcement learning on the foundation of adequate pre-training can effectively improve those final few critical safety metrics, thereby enhancing system safety.

Wang Qian

Deploying models in the physical world mainly involves three broad directions: the first is Locomotion, which everyone has already done very well; the second is Navigation, which has also largely been figured out.

Zhou Guang

To a large extent this relies on standard-definition maps (SD maps). In complex environments like residential compounds and office buildings, operating without SD maps is still quite difficult. But now, with the support of VLA models, there's a certain success rate, because VLA has already demonstrated some autonomous navigation capability in unknown environments.

Wang Qian

Or rather, autonomous driving has basically covered Navigation in the general sense. But Manipulation — which is what we mainly work on — is fundamentally quite different in nature from those two domains.

First, data is harder to collect; there isn't that much training data. From the nature of the problem, there's another very serious issue: the physical processes involved are quite complex. We can imagine that autonomous driving doesn't involve that many physical processes; it's basically a perception problem, and the control involved doesn't have much randomness or many unpredictable processes.

Slightly harder is the Locomotion mentioned earlier: doing flips, running, dancing. But relatively speaking, Locomotion is also simple, because it's always contending with a continuous, constant gravitational field. Being slightly off in ground contact doesn't matter that much.

But hand manipulation, at this level, differs from all other domains in one crucial way: even very tiny errors can lead to vastly different final outcomes. These tiny errors are caused by physical processes that are ignored in everyday situations — slight friction, collision processes.

From this perspective, the point Chen Yu raised earlier is very important. Why does everything run so easily in simulators, basically without any problems, yet the so-called Sim-to-Real gap appears as soon as you get to the real world? There are certainly many perception gaps, where what you see differs from what's in the simulator, but the main issue is still the physical gap: the physical laws modeled in the simulator are wrong.

Even the idealized rigid-body problem hasn't been fully solved to this day; objects still clip through one another. Even at very high simulation frequencies, collision detection is imprecise, and very tiny imprecisions still have very large effects on the final result. This is what's so special about Manipulation, and why it's the last of the three domains mentioned to emerge, and also the last to emerge in the AI field.
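
As a toy illustration of how a small modeling error can flip a physical outcome (this sketch is not from the panel), consider a block resting on an incline: it stays put when the static friction coefficient is at least the tangent of the incline angle, and slides otherwise. The incline angle and both friction coefficients below are hypothetical values chosen to sit near that threshold, and `block_slides` is a name introduced only for this example.

```python
import math


def block_slides(mu_static: float, incline_deg: float) -> bool:
    """Return True if a block at rest on an incline starts to slide.

    Textbook static-friction condition: the block slides when
    tan(incline angle) exceeds the static friction coefficient.
    """
    return math.tan(math.radians(incline_deg)) > mu_static


incline_deg = 20.0  # hypothetical incline angle, tan(20 deg) is about 0.364
mu_real = 0.37      # hypothetical "true" friction coefficient
mu_sim = 0.36       # hypothetical simulator estimate, under 3% off

real_outcome = block_slides(mu_real, incline_deg)  # block holds in reality
sim_outcome = block_slides(mu_sim, incline_deg)    # block slides in simulation

relative_error = abs(mu_real - mu_sim) / mu_real
print(f"friction error: {relative_error:.1%}")
print(f"slides in reality: {real_outcome}, slides in simulation: {sim_outcome}")
```

With these assumed numbers, a friction error of less than 3% changes the qualitative result: the simulated block slides while the real one holds. Contact-rich manipulation chains many such threshold events together, which is one way to read the claim that tiny imprecisions have very large effects.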

Zhou Guang

For humans grasping things, tactile sensation is extremely important. This kind of tactile information is typically collected through three different types of neurons in the human hand. Current robots mainly simulate this function through pressure sensors, but they haven't reached human-level collection precision.

Wang Qian

Sensors are one aspect; actually, sufficient information can be obtained through pure vision methods — you don't necessarily need perfect tactile sensation to do it. But the prerequisite is that the process itself must be correct. If even with very good data in the simulator, the results are poor, that's because the physical process itself is wrong. Obtaining the correct physical process is extremely important.

Vision, Touch, and Language:

Why None Can Be Missing

Chen Yu

If everyone believes pure vision can solve the problem, video data is indeed relatively the easiest to obtain. But lacking tactile or other information, can models better learn physical laws?

Wang Qian

First, you definitely can't rely entirely on vision. Generally speaking, vision can obtain much more information than is commonly assumed. But experiments show that if a person is anesthetized and deprived of tactile sensation, they can generally still do everything, just very poorly, with success rates becoming very low. Through learning, success rates can recover, but there's still a very large gap compared to having tactile sensation. We believe tactile sensation is extremely important, but not strictly necessary.

Zhou Guang

For example, human pain sensation — people have pain sensation from childhood, and then form pain-related neural pathways. If a person loses pain sensation due to disease, they can still live a long time afterward. But someone born without pain sensation finds it very hard to live long.

Wang Qian

Video pre-training is very important — we can obtain massive amounts of information for pre-training. This is absolutely necessary, and it's free data. If you can't even make full use of free data, what makes you think you can afford that expensive (simulation) data? That definitely doesn't make sense.

Second, for robotic manipulation, there ultimately still needs to be extensive training with tactile data before it can approach or surpass human-level performance.

Chen Yu

Is the language modality necessary? Everyone's talking about the VLA concept, but there are also voices saying the language modality may not be necessary. What do you think?

Zhou Guang

Theoretically speaking, language helps you learn quickly. You can understand it this way: suppose you learn to drive just by watching your instructor every day, without knowing whether the car stopped at a given intersection because of a red light or because of a pedestrian. Language is very good supervisory information, telling you that this time you stopped because of a red light, next time because of a pedestrian. This information helps you converge quickly, which is why language is extremely important during training.

Additionally, during inference it's no longer a black box, which enhances users' sense of reassurance — this is very important. Although when driving you don't constantly need to interact through language, during training, the supervisory information carried by language is crucial. Without such semantic guidance, the model's learning will struggle to converge effectively. Therefore, language's core role lies in providing clear behavioral guidance during the training phase, and enhancing the system's understandability and trustworthiness during deployment.

Wang Qian

Embodied intelligence and autonomous driving are still quite different. With a car, you tap a location on a map and it drives there, and that's it. Humanoid robots need to converse with people, closely and in detail. I don't think there's any real controversy over whether the language modality is needed. Language is absolutely necessary.

Where there's more debate in humanoids right now is this: Do some single-point vertical scenarios need dedicated embodied intelligence models? My view is that it's probably more reasonable to have a unified embodied foundation model with language capabilities, then distill a smaller model for a specific scenario.

Since this generation of large models emerged, we haven't seen any specialized model surpass the capability ceiling of a generalist model. If we're pursuing ultimate performance, we definitely need to build the general model first, then extract the specialized parts from it. That's the right approach.

Second, why is language so important today? Because it's essentially the core of how we've been training multimodal models. When people first started working on multimodal models, there was debate: should language be the core, or vision? Now the results that have proven most effective all use language as the core.

Leveraging the "legacy" of existing multimodal models — we can't bypass this. We definitely need to build on language-centric multimodal models as the foundation, then explore how to apply them to embodied domains.

Looking ahead, language may not necessarily occupy the core position in a future unified physical-world foundation model. Speech generally doesn't match the temporal and spatial scales of physical action. Language struggles to describe very short-duration processes and very fine-grained spatial positional relationships.

For example, unscrewing a water bottle — it's hard to describe the trajectory in language, which direction to apply force, because the time is so short and the spatial difference is just a few degrees. You can't convey that through words.

From this perspective, future physical-world foundation models will need to move beyond language to some degree. But the language modality will always be here. Humans still need to talk to robots, not just for emotional needs but also for task execution — speech is the interface. So integrating language into the overall model architecture is a natural choice.

Chen Yu

You're convinced that a physical foundation model will ultimately be trained into existence. But this differs from large language models — collecting corpus for LLMs is relatively cheap, whereas a physical foundation model would need to exhaustively cover all possible actions, see all objects and their materials, and so on. The cost here would be extraordinarily high.

Wang Qian

Go back 15 years, when we first started working on AI. Or even further, say 100 years, when the concept of AI first emerged. Why did people believe AI could be built? Fundamentally, because there was already an intelligent system right in front of us — humans. If we accept that the world is materialist, that the human brain is essentially a somewhat larger neural network, there's no reason we couldn't train one.

Returning to whether we can train a physical-world model: on one hand, I think we already have a rough sense of the total data volume needed. This was kept confidential internally before, but now some peers, like Generalist, have published scaling law predictions for the embodied domain that are very close to our own estimates.

Based on these projections, it's feasible to collect sufficient data within a reasonable timeframe and resource investment to train a true foundation model. After that, you can collect data from robots already deployed in the real world. At minimum, the cold-start phase is entirely within a controllable range. We have ample confidence in this.

Human pre-training doesn't require that much data. From birth to age ten, it's basically tens of thousands to hundreds of thousands of hours. But you can't train a human-like system on hundreds of thousands of hours because the training mechanisms are different, and the data properties are different too.

Some might say you couldn't do it even by exhausting Earth's resources. But the information is definitely sufficient — we just need more clever methods to utilize it. Even if we don't have them now, we have decent estimates and relatively strong confidence that we'll reach a certain level at some point in the future. I believe every company has their own estimates, and this is why people are so committed to embodied intelligence — everyone has made their own informed judgment.

"Commercialization Has Begun"

But How Many Paths Can Actually Work?

Chen Yu

We've discussed a lot of technical topics. Now let's turn to another subject people are interested in — scale and commercialization. Autonomous driving has developed extremely rapidly over the past two years. A few years ago, you'd rarely see mass-produced vehicles with advanced assisted driving; now it's gradually becoming standard in new cars priced above 100,000 RMB. How do you view these industry changes? For DeepRoute, for example, going from a single lab vehicle to small-batch production, to 200,000 vehicles today, to million-scale next year — what's the core capability that underpins this entire mass production process? And how do you ensure this can be replicated at scale?

Zhou Guang

Early autonomous driving technology was primarily based on traditional approaches without scale — building high-definition maps and writing rules. That was relatively simple, and to this day many people still use very traditional methods.

In 2024, I made an estimate: you'd need about 10,000 vehicles to do end-to-end well, and 100,000 for VLA. That looks fairly accurate in retrospect. Technology deployment needs to be gradual. Initially we built end-to-end foundational capabilities, achieved mass production, and gradually formed a healthy closed commercial loop. This doesn't mean replacing humans overnight. As vehicle scale grew from 10,000 to 100,000, the system could introduce the language modality to enhance learning capabilities and further optimize model performance.

There's substantial work in between, especially at the engineering level. Managing tens of thousands of vehicles requires solving physical device management issues, a process involving a large amount of tedious but critical work, including data mining, sample selection, and quality verification.

At the current 200,000-vehicle scale, the daily data generation is massive. We need to carefully select based on model capacity rather than blindly pursuing learning capacity. Meanwhile, constraints around power consumption, training efficiency, and parameter scale require engineering to prioritize low-hanging fruit. We've clearly felt that at the 100,000-vehicle scale, data diversity is already sufficient to support basic perception and decision-making, but further performance improvements require introducing the language modality.

Chen Yu

The embodied intelligence industry right now somewhat resembles autonomous driving ten years ago — everyone's still at the demo stage, with neither technology nor use cases converged.

Zhou Guang

The technical path is different — we won't use the same approach from ten years ago.

Chen Yu

We've looked at many companies, and their paths aren't that unified. Autonomous driving went through many technical paradigm shifts too, from modular to end-to-end. The current embodied intelligence landscape is similarly diverse. On commercialization, different companies have different considerations. Some try to commercialize before their technology is fully mature, wanting to enter capital markets quickly, while X Variable focuses more on foundational model research. I'd like to ask Wang Qian: how do you view the commercialization timeline for the embodied intelligence industry? When do you think this industry will truly be ready for scalable commercialization?

Wang Qian

I've been thinking about this deeply lately. As the year ends, many people ask: "What were you doing two years ago?" Thinking back, almost all primary market investors and most entrepreneurs were rushing to pick a vertical scenario to deploy in, to get it running in that vertical, to have positive cash flow, a good cycle, and rapid growth. From the perspective of that moment in time, that thinking was understandable.

But now, many of those same investors have come back to me saying, "I finally understand you were right. Foundation models are what matter. Going after commercialization too early was a waste of time and resources." I hear this from investors all the time today. In a sense, we still need to pursue truth — to care about how things ought to work, or what their objective developmental patterns look like. Objective patterns can't be swayed by human will.

Two years ago, without foundation models, trying to build just a single vertical scenario — it simply didn't work. I'd said many times: if it were really doable, someone would have done it in the past eighty years. Professor Zhang's speech just now essentially summarized the pinnacle of what everyone had achieved before foundation models existed.

Coming back to today, foundation model development has reached a tipping point. 2026 should be a landmark year for embodied intelligence commercialization.

In 2023–2024, what companies could offer the market — consumers, clients — was basically only emotional value, or platform value, or resource-swap value. There wasn't a single scenario where an embodied intelligence company was delivering value at the level of actual utility. Is that commercialization? Sure, it's commercialization. It can generate enough revenue to go public. But whether it's sustainable, or can have truly significant lasting impact — personally, I'm still not convinced.

In 2026, I believe we'll start seeing batches of scenarios with genuinely positive ROI, embodied robots that deliver value beyond what robots have traditionally offered. We've been crying wolf for two years, but this time the wolf might actually be here.

Chen Yu

What scenarios do you think are most likely?

Wang Qian

There are actually two categories. One is what 1X, or something like Sunday, has demonstrated.

Zhou Guang

Right, 1X I think is interesting — they're exploiting global labor wage differentials to acquire data.

Wang Qian

Theoretically, what they're trying to do isn't simple labor arbitrage.

Zhou Guang

It's about completing the data logic loop, and solving privacy issues.

Wang Qian

Embodied intelligence won't instantly become something on the scale of phones or cars today. Early penetration will definitely start appearing next year — between the US and Mexico, or Japan and Southeast Asia, or Europe and Turkey. Labor gaps like that are enough to support real commercial value.

Zhou Guang

And it simultaneously solves the problem of authentic data sources for the technical model.

Wang Qian

I do want to add one caution: embodied intelligence isn't easy. This remains an extremely difficult problem.

Zhou Guang

Right, and not just on the technical level.

Wang Qian

It's hard on every dimension, and the technical side alone isn't simple. What people are generally doing today is applying reinforcement learning for post-training on simple tasks. Recently there have been some notable advances on the RL side — though honestly, these are things everyone was playing with ten years ago. Actually applying them to foundation models, training at this scale, still involves substantial engineering challenges that are gradually being resolved. Right now there are many things that are trivially easy for humans — one motion, two motions — but were completely impossible for previous generations of robots. For this batch of scenarios, we can at least achieve full autonomy, and it's not that far off — roughly 2026.

Physical AGI:

A 3–5 Year Window, or Further Out?

Chen Yu

We're looking forward to the commercial explosion of embodied intelligence in 2026. In the large model space, people discuss AGI constantly, but the "AGI moment" for physical AI is rarely talked about. What do people think the AGI moment for physical AI means? Can the current technical path actually get us there?

Zhou Guang

For embodied intelligence, I think mobility is relatively clear. With this wave of autonomous driving technology, the foundation model for mobility capability will converge fairly quickly. Right now there's a regulatory framework for hundred-kilometer autonomous driving tests. If you take that outside the specific scenario of driving on roads and apply it to robotics scenarios, it becomes commercially viable.

Wang Qian

Our estimate is within a 3–5 year cycle. There's a Scaling Law projection, and based on what this path can deliver, that's basically the timeline. Not particularly far — not 8–10 years.

Zhou Guang

Yes, autonomous driving and embodied intelligence won't need that long.

Wang Qian

At the end of the day, there's always an element of faith here — whether you believe in Scaling Law, or whether you believe Scaling Law applies to robotics and vehicles.

Zhou Guang

One more thing — I think chips are also an issue. The reason everyone's building their first-generation VLA today, including FSD V14 which uses a similar architecture, is that compute power is a problem you can't ignore. Going from 200-something TOPS to 1000 TOPS — I still don't think that's enough. A model with around 1B parameters still struggles to do complex work.

Chen Yu

Wang Qian just mentioned that we already have the human brain as a template for intelligent systems. But the human brain only consumes about 20 watts — it doesn't have this kind of massive compute power, and we can't just stack compute infinitely.

Zhou Guang

I think for routine tasks on edge devices, the compute is sufficient.

Wang Qian

Parameter scale has something to do with task difficulty, but only to a limited extent. Is driving really easier than solving IMO (International Mathematical Olympiad) problems? I don't think so. I can say with confidence that manipulation is not easier than IMO.

Zhou Guang

But biology took a billion years of evolution to get to where it is now.

Wang Qian

What really drives parameter count is the total amount of knowledge. Why are language models so large? They have to memorize everything from all of Wikipedia to all the knowledge on the internet. With that much information encoded, they need that many parameters. There's no way around it.

Whether it's driving or embodied manipulation, you don't need that much knowledge. Most of it is common sense, which compresses relatively well. So I definitely agree with Zhou Guang — it'll be much smaller than language models.

Zhou Guang

Whether you're driving or building an app, you need a lot of real-time tokens to describe the full behavior — that requires a certain amount of compute. It's not just a memory problem.

Chen Yu

Unlike large language models, embodied intelligence is fundamentally a real-time system. When an LLM solves a problem, if it fails once it can try 5 paths, 10 paths, even spend an hour thinking. But embodied intelligence needs immediate feedback.

Thank you both for a conversation that was as technically rigorous as it was imaginative. I hope everyone here has gained a real sense of the challenges and new possibilities in physical AI from their work. Thank you all!