Every company in the embodied intelligence space could be solid — the key is whether they've figured out who they want to become | Linear Voice

Linear Capital · February 2, 2026

The embodied data foundation needs to be at the scale of 10 million hours or more.

In November 2024, Chen Yilun, former chief scientist at Huawei's intelligent vehicle business unit, threw himself into embodied intelligence entrepreneurship. Soon after, Tashi Intelligence emerged — co-founded by Chen and Li Zhenyu, former head of Baidu's autonomous driving division, among others.

Less than six months after its founding, Tashi completed $120 million in angel funding and $122 million in angel+ funding, setting a record for angel-round financing in China's embodied intelligence sector. Linear Capital, as its earliest-stage investor, backed both rounds consecutively.

In a recent interview with LatePost, Chen shared many seemingly "heretical" strategic choices and the technical thinking behind them. On the embodied data that the market cares about most, he believes the required data foundation is 10 million hours or more. Below are highlights from Dr. Chen Yilun's conversation with LatePost.

Since its founding, Tashi Intelligence has attracted intense market attention.

Yet in his recent interview with LatePost, founder Chen Yilun revealed somewhat "heretical" technical thinking, sharing many counter-mainstream judgments. For example:

He describes the representation used by VLA (Vision-Language-Action) models, the mainstream architecture for embodied models today, as "retinal information." Tashi has instead developed AWE (AI World Engine), which aims to represent time, space, force, environmental interaction, and other physical quantities, or what he calls "world information."
The mainstream VLA approach goes from LLM (Large Language Model) to VLM (Vision-Language Model), then trains VLA on top of VLM. Chen strongly disagrees with this. He believes embodied intelligence will necessarily have its own independent model, rather than growing an action "head" on top of a VLM.
Although data is the core bottleneck for embodied intelligence today, Chen chose not to collect data through teleoperation when he started the company, and instead developed wearable data-collection devices in-house.

The courage to be different may come from his firsthand experience as one of the earliest to attempt end-to-end autonomous driving. Industry insiders have told LatePost that Chen, who kept a low public profile during his Huawei tenure, was the technical "soul" of Huawei's intelligent driving R&D team.

This raises a question: Will the technical evolution of intelligent driving rhyme with embodied intelligence? Both fall under the umbrella of physical AI, but large language models have brought an entirely new technical environment. There's no ready answer — it depends on different bets by different practitioners. Chen tells us his.

▍You worked on drones at DJI and autonomous vehicles at Huawei — both branches of robotics. When did you first become interested in robots?

Chen Yilun: It started during school. I was admitted to Tsinghua University through the physics olympiad, but majored in electronic engineering. When I went to the US for my machine learning PhD, I envied my roommate in mechanical engineering — the things they built could move. I've always been drawn to things that "move."

In 2007 in the US, I saw Boston Dynamics' hydraulically powered quadruped robot. It could maintain balance even after slipping on ice. It was stunning. After my PhD, I didn't follow the most mainstream path for AI people and join a big internet company. Instead, I went to a very well-known electromechanical systems company, where I learned how to build motors, servo controls, and hydraulic systems — because at the time I believed robots should be hydraulically driven. I also led a hydraulic servo control product line at my first company.

So throughout my career, I've held this belief: someday I'll build my ideal robot. But as someone from an algorithms background, I used to think the technology wasn't ready yet. I could only write simple programs. That wasn't the robot I wanted.

▍When did you realize the technical conditions had matured?

Chen Yilun: Around 2020–2021, when I first tried end-to-end systems at Huawei. By then I'd been leading R&D at Huawei for two or three years. Our autonomous driving system had at least 2 million lines of code. It worked — it could handle complex urban road conditions — but maintenance costs were extremely high.

In 2020, Ding Wenchao (note: Tashi's chief scientist, recruited into Huawei's autonomous driving division through its "Genius Youth" program) and other colleagues wanted to try something: Could we train a neural network to replace those 2 million lines of code? We eventually used 30,000 lines of code to train a network that directly planned the vehicle's trajectory. That was the earliest end-to-end autonomous driving, though what we did at the time was "two-stage" (note: perception was one end-to-end network, planning and control was another).

▍Did your 2020 end-to-end work draw inspiration from industry signals like Tesla's AI Day?

Chen Yilun: No. Tesla's 2020 AI Day hadn't discussed end-to-end yet. It covered how to recover 3D environments in perception, that is, BEV (Bird's-Eye-View). For us, perception was a solved problem. It's an open-loop problem: with data and labels, you can do it.

What gave me the most headaches was planning and control. This is closed-loop AI: every action you produce affects the next moment's environment. If you choose to "cut in," the other car might yield, or it might accelerate to block you. How do you train this closed-loop AI? Nobody was confident at the time. But with traditional rule-based methods describing corner cases one by one, code had ballooned to 2 million lines, and the speed of discovering problems had far outpaced the speed of solving them. So a new approach was necessary.

▍How did you specifically explore end-to-end?

Chen Yilun: We needed large-scale collection of human driving data — something nobody had done before. We mobilized a fleet of about 100 vehicles dedicated to this. Dr. Ding (Ding Wenchao) was on site every day teaching drivers how to drive, defining what constitutes "good driver" behavior.

At first there wasn't significant progress. But when data accumulated to several thousand hours, you could see the network was really learning things, and getting better and better. We chose an extremely difficult test scenario — a completely unstructured urban village with mixed pedestrian and vehicle traffic, nearly impossible to navigate with rule-based algorithms. We boldly tried a neural network, with the principle of "the less post-processing, the better." The vehicle navigated through smoothly. That moment was my "GPT Moment." I realized AI could do planning.

▍Why did you leave Huawei and join Tsinghua University's Institute for AI Industry Research (AIR) not long after? Intelligent driving was on the cusp of scaling up and qualitative change.

Chen Yilun: Because I'd always wanted to build robots, and the success of end-to-end showed me that the inflection point for accelerated robot development was approaching. But I still didn't know exactly how to do it, so I chose to return to academia first and give myself time to research.

▍From joining Tsinghua to starting Tashi in late 2024, what changes in general-purpose robotics made you feel the time was right to start a company?

Chen Yilun: I saw three dawns. First, the unlocking of locomotion. Around 2020, ETH Zurich (Swiss Federal Institute of Technology) pioneered a path: using reinforcement learning (RL) to solve quadruped robot control, replacing the previously used, extremely complex WBC (Whole-Body Control) that made robot movements stiff and mechanical.

This involved two core modules: first, high-concurrency simulators. The shift in computational infrastructure for simulation from CPU to GPU dramatically increased concurrency, enabling much more data. Second, narrowing the "Sim-to-Real Gap" — the gap between digital and physical worlds. Companies like Unitree that excel at hardware and motion control have core competence in narrowing this gap through various methods. That's why we now see robots moving and dancing fluidly.

The second dawn was GPT and large language models, which provided task planning capabilities that were previously the hardest problem in robotics. Autonomous driving task planning is relatively simple: point A to point B, with ready-made navigation data like maps. But robot tasks are far more complex, and entering homes or factories lacks data. GPT excels at task planning.

The third was end-to-end, which I had personally validated. Essentially, all robot task logic is: input sensor information and commands, output actions. But sensor data is extremely high-dimensional, while commands are extremely low-dimensional. The past way of bridging them was writing rules. Rules were already struggling to cover all corner cases in autonomous driving; they're impossible for robotics. So the insight that end-to-end can work is crucial.

▍Several concepts frequently appear together in physical AI fields like autonomous driving and embodied intelligence: end-to-end, VLA, world models. How do you understand and distinguish them?

Chen Yilun: The essence of end-to-end is using neural networks to solve as much as possible. Whether the underlying method is imitation learning or reinforcement learning, these are optional choices.

VLA (Vision-Language-Action model) is a neural network that takes visual and language information as input and outputs robot actions. What happens between input and output, and how the model is trained, is still understood differently by different people.

World model has even more definitions, but from an information theory perspective it's simple: input the current state, generate the next state. This state can be expressed through 3D information, video, or changes in physical interaction. So when people say "world model," some mean 3D generation, some mean video generation, some mean understanding physical interaction. The applications vary wildly too — some for metaverses or games, some for embodied intelligence and robotics.
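The "current state in, next state out" view can be written down as a very small interface. The sketch below is only an illustration under loud assumptions: the `State` fields, the `step` and `rollout` functions, and the point-mass physics are all hypothetical and are not taken from any system described in the interview.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    """Hypothetical 1-D state: position (m) and velocity (m/s) of a point mass."""
    position: float
    velocity: float


def step(state: State, force: float, mass: float = 1.0, dt: float = 0.1) -> State:
    """Toy 'world model': map (current state, action) to the next state.

    Uses semi-implicit Euler integration. A learned world model would
    replace this hand-written physics with a trained network.
    """
    acceleration = force / mass
    velocity = state.velocity + acceleration * dt
    position = state.position + velocity * dt
    return State(position=position, velocity=velocity)


def rollout(state: State, forces: List[float]) -> List[State]:
    """Predict a trajectory by feeding each predicted state back in."""
    trajectory = [state]
    for force in forces:
        state = step(state, force)
        trajectory.append(state)
    return trajectory
```

Whether the state is a 3D scene, a video frame, or a set of contact forces, the shape of the function stays the same; only the contents of the state change.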

▍While you believe some conditions have matured, overall embodied intelligence progress lags far behind large language models. What's the bottleneck?

Chen Yilun: I believe AI needs to cross three walls to solve a large, complex problem.

The first is the data wall. Only sufficient data volume can support sufficiently complex networks. Large language models are blessed in this regard — the internet already has massive corpora. Obtaining data for embodied intelligence is hard and expensive.

The second is the compute wall. Why not algorithms? Because more complex systems tend to have simpler algorithmic structures, which can withstand massive data. So once you enter the pre-training scaling phase, the difference isn't in algorithms but in compute competition.

Then, when the marginal returns from scaling compute diminish or compute itself becomes insufficient, you hit the third wall and enter post-training. At this stage you can no longer rely on throwing resources at the problem. You need to find elegant solutions for specific problems. This will be a highly creative phase.

Right now, large language models and autonomous driving have both moved past the second phase, while embodied intelligence is still stuck at the first wall: data. The core pain point for embodied intelligence today is how to obtain high-quality data at low cost and massive scale. Once the data problem is solved, the industry will reap enormous dividends, and intelligent capabilities will advance by leaps and bounds.

▍It sounds like you're not worried about how to design algorithms and models for embodied intelligence?

Chen Yilun: First, without data, you're powerless when it comes to algorithms. At the same time, neural network algorithms are fundamentally different from traditional ones. Traditional algorithms require careful design, but a neural network is essentially a function. What matters most is defining the inputs and outputs — much of the design lies outside the algorithm itself: how to maximize compute utilization, how to minimize data acquisition costs.

▍But in the development of large language models, massive internet data existed long before the Transformer architecture emerged, and the major turning point came only later, with the progression from BERT to GPT. (Note: Both BERT and GPT are large language models based on the Transformer architecture; BERT uses only the encoder, while GPT uses only the decoder.)

Chen Yilun: I believe the greatest thing about GPT isn't the architecture itself, but coming up with next-token-prediction as the training task.

In fact, quite early on, Andrej Karpathy — who has worked at both OpenAI and Tesla — wrote a famous technical blog post called The Unreasonable Effectiveness of Recurrent Neural Networks (published in 2015). He showed how a relatively small RNN model, continuously predicting the next character, could write poetry and code. My first reaction upon reading it was: could this logic be applied to autonomous driving? The idea of training complex capabilities through such a simple task is truly remarkable.
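The training task described here, repeatedly predicting the next character, can be shown with something far simpler than an RNN. The sketch below swaps in a bigram count table (not the RNN from Karpathy's post) purely to illustrate the objective; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every character, which characters follow it."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, char):
    """Return the most frequent follower of `char`, or None if unseen."""
    followers = counts.get(char)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, length):
    """Greedy generation: feed each prediction back in as the next input."""
    out = [start]
    for _ in range(length):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return "".join(out)
```

A count table can only memorize adjacent pairs, whereas a neural network trained on the same objective can capture much longer context. The objective itself, however, is identical.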

▍There wasn't even Transformer back then. (Note: First proposed in 2017.)

Chen Yilun: Right, so regarding model architecture, it's what I said earlier — the heavier the sword, the blunter its edge; the greater the skill, the simpler the craft. The more complex the task and the more massive the data, the simpler and more fundamental the network structure needs to be.

GPT is exactly like this. It shows no obvious advantage on small datasets, but with more data, everyone converges toward it.

▍If the success of large language models was defining the goal of "predict the next token," then what's a good training objective in the embodied intelligence domain?

Chen Yilun: That's an excellent question. Autonomous driving offers two very valuable insights for embodied intelligence. First, BEV (bird's-eye view), introduced at Tesla AI Day in 2020 — its essence is spatial reconstruction. Many people today are doing end-to-end through VLA, but no matter how much language you introduce, you can't escape spatial reconstruction.

Thinking from a more fundamental perspective, what representation is better? The most classical physical representation is the best. You can understand the world through images, where each pixel is a color value — looking at a physical entity from different angles yields many combinations, but it's still the same entity with spatiotemporal concepts, occupying certain time and space; when it moves, it has mechanics, and mechanics guide what state it becomes next. This physical representation is far more compact than RGB because it's more fundamental. If neural networks can learn these physical properties, many tasks become much easier.

This (spatial reconstruction) is unique to Physical AI — it has nothing to do with large language models.

The second training objective is interaction with the world. This is harder for robots than for autonomous driving, because autonomous driving is a non-collision system, while robotics is a contact system. It applies force to manipulated objects — operating flexible materials like cloth or wire harnesses is particularly difficult.

▍So many embodied intelligence companies demonstrate their technical capabilities by folding clothes, rolling socks, or organizing napkins — you showcased embroidery.

Chen Yilun: Yes. Simply grasping and placing rigid objects like metal parts neatly — that was solved long ago. The mission of this generation of robots is to accomplish tasks that previous technology couldn't do.

A Tashi robot embroidering the Tashi logo

▍To summarize, you believe the two important training objectives for embodied intelligence are spatial reconstruction and interaction with the world. If these are truly achieved, what kind of emergent intelligence might appear in embodied intelligence, similar to what happened with large language models?

Chen Yilun: The essence of emergence is interpolation. Large language models appear intelligent because, faced with a prompt, they trace back to similar fragments in massive data and generate new combinations — they don't "truly understand." Embodied intelligence is the same now, but can already demonstrate remarkable effects.

▍So the surface-level "emergence" isn't true generalization?

Chen Yilun: This methodology is generalizable. Although pre-training itself doesn't make a model "truly understand," you can enhance capabilities in a vertical domain by supplementing data. For example, one application direction for large language models is coding — you need to feed it various code data. FSD is another example: it runs well in the US, but can't immediately drive well in China, Japan, or other regions. However, you can improve performance by adding relatively small amounts of local data.

Robotics works the same way. As foundation models become more capable, you can adapt to diverse tasks by adding task-specific data. The amount of supplementary data needed during deployment doesn't have to be that large.

▍This approach might enable commercial applications in some scenarios, but it still can't learn new tasks as quickly as humans can.

Chen Yilun: You're right. The current approach is still relatively heavy: it essentially amounts to data generators and data simulators running flat out. Humans actively use prior knowledge to find the data they need efficiently, then absorb and learn from it. For example, Ilya Sutskever recently shared that humans rely on some mechanism, possibly mediated by emotion, to imagine results and get feedback before starting a task or midway through it. Before we start something, we often feel fear or excitement. Machine reinforcement learning doesn't work this way: it has to traverse all possible solutions and only gets a reward after completing a task. (Note: Ilya Sutskever is the former Chief Scientist at OpenAI and founder of Safe Superintelligence. He mentioned this idea in a November 2025 interview with Dwarkesh Patel.)

So if this problem could truly be solved — learning new tasks like humans do — it would be extraordinarily impactful, improving AI learning efficiency by orders of magnitude. But at the current stage, what everyone has found that shows powerful results is still this data generation and fitting methodology.

▍Let's talk about how Tashi specifically approaches data and models. Your "Human-centric" data engine consists of lightweight gloves plus first-person cameras as a capture device, worn by people while they work. How did you come up with this approach?

Chen Yilun: I figured out the data problem before starting the company. The first pitch deck in 2024 laid out this approach, but it faced heavy skepticism. At the time, Tesla Optimus and Physical Intelligence were using teleoperation — having humans control robots to collect full data. But it's expensive and slow, making it hard to reach the foundational scale of embodied data.

Tashi's self-developed data collection kit SenseHub, consisting of gloves (available in five-finger and two-finger versions) and first-person cameras

▍What is the foundational scale for embodied data?

Chen Yilun: 10 million hours or more. Autonomous driving systems need roughly 1 million hours of data to achieve continuous usability; embodied intelligence has higher complexity, requiring an order of magnitude more data.

▍Can't simulation or learning from video data also enable low-cost, large-scale data acquisition? Some companies like GalaxyBot and Hillbot emphasize simulation data.

Chen Yilun: These are pitfalls we've already fallen into.

Let's start with internet video data. When working on autonomous driving, we scraped a lot of driving videos from YouTube. But first, the volume isn't actually that large, and second, most of these videos show cars driving normally — they don't match the driving problems we needed to solve, and can't establish an "instruction-action" mapping. So many teams in this direction later abandoned it. It's the same with robotics.

Simulation can render images very realistically and solve perception, but it's not very useful for fine manipulation. The only exception is locomotion simulation, because it doesn't need to deal with complex environments.

▍So beyond data volume, figuring out what types of data are useful is also critical.

Chen Yilun: Right. Data is primary in the embodied domain — what algorithms to develop later must also match the data type.

Overall, embodied data comes from two sources: from humans, and from the world. Data from humans is more direct and faster. And data that records human behavior is essentially sensor data. So the question becomes: how should sensors be designed to naturally, cheaply, and at scale capture human behavioral data? And this data should be of humans performing real actions in real scenarios.

▍Teleoperation is expensive, but it's real robot data — isn't that also real scenarios and real actions?

Chen Yilun: Actually, teleoperation mostly can't achieve real scenarios, because teleoperated robots currently can't work as flexibly as humans, and they interfere with others' work, so it's very difficult for them to enter real factories, cafés, or homes.

Teleoperated motions aren't fully authentic either, because operators switch between task types — they can't work like the skilled professionals who actually inhabit these environments.

▍What do you think of companies building large data-collection factories, mass-producing robots, and using teleoperation to gather data?

Chen Yilun: Back in the day, autonomous driving had its version too — people spent fortunes building test tracks, simulating all kinds of road conditions, like miniature worlds. But neural networks trained by driving around those places couldn't simply be deployed on real roads.

It's the same with embodied AI: if a robot only operates in artificially designed environments, it will fail the moment it leaves them.

▍Does your data collection method have any drawbacks?

Chen Yilun: Our approach is more efficient, the data is more authentic, and it's easier to scale. We haven't found any flaws in the architecture or functional design. But it demands more from AI capabilities.

▍What are Tashi's actual collection volumes and growth rate right now?

Chen Yilun: Very fast. We started large-scale collection around August or September 2025, and we already have roughly 100,000 hours of data. We've used many methods to compress costs, so we can begin scaling now. Next year, our data volume will explode by many multiples.
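To put these figures side by side, the back-of-the-envelope calculation below compares the roughly 100,000 hours mentioned here with the 10-million-hour foundation Chen cites earlier. The hours per shift and working days per year are illustrative assumptions, not figures from the interview.

```python
TARGET_HOURS = 10_000_000      # foundation scale cited in the interview
COLLECTED_HOURS = 100_000      # approximate volume cited in the interview

# Assumptions for illustration only (not from the interview).
HOURS_PER_SHIFT = 8
WORKING_DAYS_PER_YEAR = 250


def growth_multiple(target, current):
    """How many times the current data volume must grow to hit the target."""
    return target / current


def wearer_years(hours, hours_per_shift=HOURS_PER_SHIFT,
                 days_per_year=WORKING_DAYS_PER_YEAR):
    """Person-years of glove-wearing work needed to record `hours` of data."""
    return hours / (hours_per_shift * days_per_year)


if __name__ == "__main__":
    remaining = TARGET_HOURS - COLLECTED_HOURS
    print(f"Growth needed: {growth_multiple(TARGET_HOURS, COLLECTED_HOURS):.0f}x")
    print(f"Wearer-years for the remaining hours: {wearer_years(remaining):,.0f}")
```

Under these assumed shift lengths, closing the gap would take on the order of 5,000 person-years of recorded work, which is one way to see why passive collection during ordinary shifts matters to the plan.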

▍How low are the costs? How much lower than teleoperation?

Chen Yilun: At least two orders of magnitude lower — that is, 1/100th. Teleoperation requires deploying a bunch of expensive robots that move slowly, have low success rates, and you still need to hire operators to collect data. Our approach lets us partner with scene providers: workers just put on gloves and work, without disrupting production. Our biggest costs are actually compute, and building the pipeline that turns raw data into training-ready data for neural networks.

Data collected while a supermarket worker wears Tashi's self-developed data-collection device to restock shelves

▍What dimensions of data can Tashi's wearable device capture?

Chen Yilun: It fully characterizes hand motion — including the hand's pose in space, meaning position and orientation; each finger's pose; and the force applied to objects during manipulation.
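As a reading aid, the quantities listed in this answer could be organized into a per-timestep record like the one below. This is a hypothetical schema: the field names, units, quaternion convention, and per-finger joint-angle layout are assumptions made for illustration, not SenseHub's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical finger naming; the real device's layout is not described.
FINGERS = ("thumb", "index", "middle", "ring", "little")


@dataclass
class HandSample:
    """One hypothetical glove reading at a single timestamp."""
    timestamp_s: float
    wrist_position_m: Tuple[float, float, float]               # hand position in space
    wrist_orientation_wxyz: Tuple[float, float, float, float]  # unit quaternion
    finger_joint_angles_rad: Dict[str, List[float]]            # per-finger pose
    fingertip_force_n: Dict[str, float]                        # force applied to the object

    def validate(self) -> None:
        """Raise ValueError if the sample is structurally inconsistent."""
        norm_sq = sum(c * c for c in self.wrist_orientation_wxyz)
        if abs(norm_sq - 1.0) > 1e-6:
            raise ValueError("orientation quaternion must be unit length")
        for name in FINGERS:
            if name not in self.finger_joint_angles_rad:
                raise ValueError(f"missing joint angles for {name}")
            if self.fingertip_force_n.get(name, 0.0) < 0.0:
                raise ValueError(f"negative force on {name}")

    def total_force_n(self) -> float:
        """Sum of fingertip force magnitudes, a crude grip-strength proxy."""
        return sum(self.fingertip_force_n.values())
```

The point of the sketch is only that pose and force sit in the same record at the same timestamp, so an "instruction-action" mapping can later be learned from them together.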

▍Don't you need additional sensors on the arm to capture arm movements?

Chen Yilun: No. We aim for maximally unobtrusive, passive collection. The glove needs to be light and wireless.

▍Is pose obtained through a first-person camera worn simultaneously?

Chen Yilun: It's not simply vision-based. For example, when someone folds a quilt, their hands are inside the quilt — you can't see where they are, yet you still complete the task. In short, we have a series of design innovations, which is why we build our own hardware.

▍Weren't there already glove-based capture devices available on the market?

Chen Yilun: Not in robotics. Other fields have things that look similar, but none were designed for embodied AI. VR controllers, for instance, mainly rely on cameras in the headset for positioning. But that data quality isn't sufficient for embodied applications — it lacks depth information and can't work in dim lighting. Motion-capture gloves for the film industry don't have enough precision.

▍Sunday Robotics released its skill capture glove in November 2025. How is their approach similar to or different from yours?

Chen Yilun: Whether to build a glove at all depends on your vision for the robot's end-effector. I'm a firm believer in dexterous hands — I believe the ultimate manipulation terminal will definitely be dexterous hands — so you need a matching sensor, namely a glove. And gloves generalize very easily: they can do various tasks and collect diverse data.

Under this approach, different teams are at different stages of implementation, because the industrial design of a glove is very difficult. We've built a five-finger glove that captures full information. Sunday built a three-finger glove, a lower-dimensional version with fewer degrees of freedom. An even simpler approach is to have people hold a gripper while performing tasks.

▍Some investors believe that while China has many embodied AI companies, it hasn't made truly leading contributions — for example, the VLA model was pioneered by Google's RT-2, and the wearable glove approach for data collection is seen as led by American teams like Sunday.

Chen Yilun: People should have more confidence in Chinese technology. I have many friends working in robotics in the US, and they're feeling real pressure seeing China's progress now.

Because embodied AI is a tight intertwining of hardware, data, and algorithms — if you want to build a great model, you need to know what data you need, what sensors to use, how to collect it, and what actuators it will eventually run on. China has powerful industrial manufacturing capabilities, plus so much AI talent and engineering capacity, allowing better integration and co-optimization of these elements. In the era of embodied AI, American entrepreneurs won't be a match for Chinese entrepreneurs.

▍Your model is called AWE, AI World Engine. The name suggests it's not the industry-mainstream VLA (Vision-Language-Action) model.

Chen Yilun: Right. AWE first pursues deep representation of the physical world. We devote the most compute to recording physical quantities like time, space, and force — what you might call "world information" — rather than doing "retina-style" representation like VLMs do. This world information also records how robots interact with objects — for example, when you squeeze an object, how it responds.

Second, why call it an engine? You could say it's a model. But "engine" emphasizes that it's dynamically evolving: when the robot's action changes, it can predict the world's next state and recommend what the robot should do next.

▍Why not build the more mainstream VLA?

Chen Yilun: Before starting up, I asked myself whether robotics deserves its own foundation model. If you think a robotics model is just growing an "action head" on top of a VLM multimodal model, then robotics is merely a downstream branch of other industries, and this field can't exist independently.

▍The current mainstream VLA approach, simply put, is: first get an LLM, then get a VLM (vision-language model), then build VLA on top of that VLM.

Chen Yilun: Yes, I strongly disagree with this. Current multimodal models are mostly supported by "look at the picture and talk" QA data. Merely looking at pictures and talking can never teach a robot how to do things in the world. Robotics will definitely need its own model architecture.

▍Many embodied AI practitioners now say that embodied intelligence hasn't found its "Scaling Law" yet. First, what's your view on this assessment? And when do you think it will arrive?

Chen Yilun: I think embodied AI is already scaling. People generally judge scaling by two criteria: whether performance has reached a certain state, and whether the growth trend is there. If you look at the trend, we're unquestionably in a scaling state now. But it will take more time for this to clearly show up in model performance.

Scaling Law starts slower because, as mentioned earlier, it needs to sequentially go through data walls, compute walls, and environmental interaction. In 2025, the embodied AI industry is seriously working on data. By 2027, or even 2026, there will definitely be results.

▍What early signals will there be?

Chen Yilun: The industry will shift from chasing viral video demos to solving specific problems in vertical domains. Overall industry confidence will keep rising. A small number of teams will genuinely create value in specific scenarios — for example, real procurement from major customers. On average, the industry will demonstrate stronger embodied intelligence capabilities.

▍What scenarios will Tashi focus on for deployment?

Chen Yilun: The consumer market still needs time. Our first wave will enter industrial manufacturing, such as wire harness assembly. Wherever there's electricity, there are wires — cars, home appliances, servers are full of them. Wire routing, plugging, and assembly are extremely difficult for traditional robots because wire harnesses are three-dimensional and flexible. These high-technical-barrier domains are exactly our opportunity.

▍Final question: with so many companies in embodied AI now, how do you judge who's credible?

Chen Yilun: Everyone might be credible. The key is whether they've clearly thought through who they want to become. We're very clear about what we want to do, so we'll keep running in the right way.

This article is republished with permission from LatePost AI. Author: Manqi Cheng. Editor: Wei Song.

Original title: "Dialogue with Tashi Intelligent Navigation's Chen Yilun: No VLA, No Simulation — An Embodied AI Company's Unconventional Bets"