FreeS Fund's Li Feng: For Embodied AI to Become Reality, It May Need to Copy These Three Playbooks

峰瑞资本峰瑞资本·May 9, 2026

Large language models, autonomous driving, and AlphaFold.

No peak heat, only more heat.

The embodied intelligence space never runs short of new stories.

Just in the past month alone, NVIDIA released Cosmos, a physics AI model; Alibaba unveiled HappyOyster, an open-world model; Tencent open-sourced Hunyuan 3D World Model 2.0; and World Labs, founded by "AI godmother" Fei-Fei Li, released its Marble 1.1 series in April, focused on large-scale 3D scene generation.

World models, UMI (Universal Manipulation Interface), physics simulation — concepts that once lived mainly in academic papers have become buzzwords in industry discourse.

On April 23, at the 20th ChinaVenture Annual Investment Summit, Li Feng, founding partner of FreeS Fund, put it bluntly: generating high-dimensional data, world models, and physics models are the three hottest new investment directions in embodied intelligence right now.

In his view, the root cause behind these new concepts and phenomena is a lack of data — "specifically, humanity has never accumulated large quantities of data containing these physical quantities and the laws governing physical-world interaction. We've never produced this kind of data at scale."

And the market's enthusiasm for generating high-dimensional data, world models, and physics models all boils down to solving the same problem — how to still get things done, how to accomplish various robotic manipulation tasks, even in the absence of such data.

Below is a transcript of Li Feng's remarks, edited by ChinaVenture with minor adjustments for republication —

First, congratulations to ChinaVenture on its 20th annual summit, and thank you for the invitation. Being an investor isn't easy these days — the market throws out new concepts by the day, and we have to keep updating our knowledge and learning continuously.

What I'm sharing today is just our own observations and thoughts. Recently there's been a flood of new developments and concepts, and I've picked out a few to discuss with you.

/ 01 / Three New Things in Embodied Intelligence: UMI, World Models, Physics Models

No robot body needed to collect data: UMI puts a camera on your chest

Since late last year, both in the US and China, we've seen data from body-less robots, called UMI (Universal Manipulation Interface) data.

The emergence of UMI data generation has created many new opportunities and seemingly promising startups offering various data collection devices. You may have seen people hanging cameras from their chests, paired with devices that may or may not include haptic feedback — whether it's teleoperation, mechanical gloves worn on the hand, or simply using bare hands to perform various actions, all fall into this category: generating high-dimensional data.

World models: the story almost every robotics company is telling now

The trendier concept now is world models — today virtually every robotics company mentions this term.

World models attempt to introduce new 3D data, including interaction data where objects are contacted and their states changed, to build a new model that can better understand how humans interact with objects and alter their states.

Currently, numerous emerging companies and well-known firms at home and abroad are entering this space. Abroad, development is already in full swing; domestically, it's just getting started. Look at any embodied intelligence project today, and basically everyone is telling a world model story.

Physics models: sunlight finally reaching the math and physics departments

There's also a slightly more specialized sub-branch within world models called physics models. The basic logic behind it: since robots need to interact with the physical world, why not draw lessons from the physical world's past experience? That past experience is what the industry used to call simulation, or physics simulation.

Physics simulation itself is our mathematical and physical modeling, induction, and computation of physical phenomena that exist in the real world. Today, this small sub-branch takes those previously existing simulation capabilities — whether CAE or CAD — and reintegrates them into models.

The principle is actually quite simple, and the ultimate goal is the same: to process and understand how humans interact with the physical world.

This differs from what people discuss today as large language models: LLMs focus on processing digitally signal-related problems, whether digitized text information, digitized pixels, or video information. The problems to be solved now are: what's the situation with the cup on the table, what happens when it tips over, how to pick it up, how to move it somewhere else — these all fall under what world models need to cover, including that small sub-branch within world models.

What these new stories represent: three directions

We梳理 these topics because each represents a different direction.

The first category is uncontroversial: generating high-dimensional data, mainly combining new data collection methods with new data processing methods.

The second is world models, currently led by people originally from computer science or large models, doing computer vision — whether applying computer vision to facial recognition, autonomous driving, or developing large language model architectures.

The third is physics models. In this small sub-branch, many people with math and physics backgrounds have appeared. One could say that the dawn of investment, or the stories and bubbles of early-stage investment, has finally shifted some sunlight from the computer science department over to fields like mathematics and physics.

/ 02 / Two Major Challenges Behind the Buzz

The buzz conceals challenges.

Language models can't predict the physical world

The first challenge is language models. Without discussing technical architecture, the core issue is this: language models' ability to predict and generate data about the physical world is no longer sufficient.

One major manifestation of this insufficiency is their inability to predict state changes of specific items and objects in the physical world.

One super-large model, or an ensemble of models working together?

Another challenge is the limitation of single models.

Just like with large language models, if the end goal is a single model, it would need to both understand human intent and semantics (knowing what something is), while also predicting and understanding changes in physical quantities — picking something up, pouring water out, knocking something over, judging whether something is heavy or light, what material it's made of, and that material's elasticity, hardness, friction coefficient, and so on.

If one model could handle all of this — understanding intent, recognizing objects, predicting these physical quantities, and how they change after actions — that model would ultimately be far larger and more complex than what we have today.

Because this is an extremely high-dimensional task. Original language models only needed to process digitized information and pixels; now we're trying to predict so many more dimensions. If we still want to use a single model for this, based on human imagination as it currently stands, this would be a super-large model. How much data would ultimately be needed to train it, how complex and computationally and energy-intensive it would be — we don't know today. This is a question without an answer.

Another possible answer is multi-model fusion: converting various physical quantities into simulation-related content and having it interact with some base model. When certain knowledge is needed, call that capability; when a certain physical quantity is needed, call the corresponding model. If so, this would involve extensive cross-referencing, calling, and fusion of models — and how to achieve this cross-referencing, calling, and fusion between models currently has no answer either.

These are the two challenges that have already emerged on top of the three developments mentioned earlier. Whichever development path we choose, we can't avoid them.

The root cause: we don't have this data accumulation

The source of this challenge is actually quite clear. All these phenomena, new startups, and new paradigm directions that everyone sees share the same root cause: no data.

Specifically, humanity has never accumulated large quantities of data containing these physical quantities and laws governing physical-world interaction. We've never produced this kind of data at scale.

So whether it's the first, second, or third entrepreneurial direction mentioned earlier, they all essentially aim to solve the same problem — how to still get things done, how to accomplish various robotic manipulation tasks, even in the absence of such data.

/ 03 / The Triangular Constraint of Embodied Intelligence

If we view the goal of embodied intelligence as a planar coordinate system, there are three different directions that I call the triangular constraint: complexity, success rate, and generalizability.

Specifically:

  1. Complexity: accomplishing particularly complex tasks — not necessarily complex for humans, but very complex for robots. And I'm not talking about locomotion-related tasks, but manipulation tasks, things related to hands.

  2. Generalizability: making one model work on robot type A, robot type B, robot type C, while also adapting to different application scenarios.

  3. Success rate: some scenarios are experimental, some are industrial operations, and then there are people-serving scenarios like haircuts or massages — obviously no one wants a rib broken during a massage or a bald patch from a haircut. This involves success rates across different scenarios.

The reason we梳理 these points is that most demos we see today still fall somewhat short of real-world application. These demos are all striving to prove that the area of this triangle can grow larger, expanding in all three dimensions — or in spatial coordinates, that the volume can increase, expanding in every direction.

Unfortunately, within the limited scope we can currently see, even at the demo level, most projects are only working within the planar coordinate system's triangle, trying to expand one corner or one and a half corners slightly. We haven't clearly seen any method that can pull all three corners outward simultaneously, dramatically increasing the triangle's area. This is roughly the current state of robotic manipulation.

To summarize all the phenomena just discussed: within the already booming entrepreneurial direction of embodied intelligence robots, there are these three new developments. In-depth discussion of these three new things is still limited, but I believe soon, people will begin discussing the two challenges we just mentioned.

What will these new models we discussed today ultimately look like? Will they become single models that are larger, more complex, and beyond our current imagination? Or will we see multiple models calling each other, with how to fuse multiple models still unknown? This is where the challenge lies. And the root cause of these challenges is the lack of sufficient data today. By "lack," I mean the absence of data related to physical-world interaction and physical quantities that's needed to solve these problems.

Furthermore, almost every demo we see today is trying to prove it can expand this triangle, but most companies, at the demo level (not the true application level), can only stretch one corner slightly, and perhaps another half-corner slightly.

/ 04 / Learning from History: Three Paths Already Taken

The most headache-inducing thing about investing is that beyond raising questions, we have to try to find solutions.

We don't have clear solutions, only some historical reference cases.

Large language models: consumed nearly 40 years of internet text

Let's start with the most familiar example: large language models.

From the deep learning boom beginning in 2012, algorithm evolution went through a series of iterations — though the string of algorithm structures or logic iterations starting from convolutional neural networks (CNNs) wasn't on the same path as today's large language models.

Then after 2014, generative adversarial network (GAN) technology emerged. As the technical paradigm evolved further, it eventually converged on the algorithm logic of large language models represented by Transformer. This is the algorithm iteration process. Algorithm iterations, including large models, have never been linear — it's not everyone climbing step by step in order, but climbing two or three steps, then switching angles to climb another two or three steps, then switching angles again.

Let's talk about data sources for large language models. The base models we can train today rely heavily on nearly 40 years of accumulated internet text data. We've used computers for about 30 years, smartphones for about 15 years. During these 40 years, our use of these smart devices has generated an enormous publicly available text database, and this data is the source that has enabled large language models to be trained and achieve their current results.

To add, this is only the text training portion. As mentioned earlier, the embodied models we're trying to train now need to cover more dimensions — 3D space, specific objects, physical quantities, interaction methods, with predictive capabilities desired. These requirements far exceed predicting the next "token"; they're vastly more complex than pure language prediction. Not to mention we haven't yet begun accumulating related data at scale, the way we did with internet text data.

Autonomous driving: first sell you a car, conveniently collect the data

Autonomous driving is somewhat special.

You often see online debates: different companies argue whether autonomous driving today needs to go through the L3 stage? Can it skip L3 and go straight to L4?

Why these debates? Including Tesla, most autonomous driving technology today remains between L3 and L4, and no company can truly claim to have surpassed L4 yet — I'm talking about open roads, not relatively closed environments like ports, mines, or campuses.

But autonomous driving development has also gone from rule-based systems to today's trendiest end-to-end approach (similar to large language model architecture). There's another special aspect: autonomous driving's algorithm iteration hasn't been linear either; it didn't follow one path step by step, but advanced back and forth across several different directions.

Its data source is even more special. Autonomous driving data is mainly obtained by itself.

Take Tesla as an example. Before last year, when most people bought new energy vehicles, whether pure electric or hybrid, they were buying the car itself. In the years before last year or the year before, most people weren't buying cars for autonomous driving, but to save money, for ease of use, quietness, fast acceleration. And when people bought cars, it just so happened that the car came with all sensors installed, because it's a consumer product.

To put it in perspective: when you use smartphones and computers, you're definitely not doing it so any internet giant can collect your images, text, and voice data. But because smartphones and computers are your consumer products, they just happen to come with rear high-definition cameras, microphone arrays, GPS chips — so in using these devices, you generate endless data that internet giants can use, and this data has become the source needed for various models today.

The special thing about autonomous driving is that it first made itself a desirable consumer product that people are willing to buy. When people buy the car, they also buy back all the sensors installed on it, and the data these sensors generate can then be brought back to help autonomous driving technology iterate at scale. Because of this, you'll notice that whoever has more data may advance their autonomous driving technology slightly faster.

But this data isn't purchased from users — it's obtained by selling people a consumer product they need, which just happens to have many sensors installed. These sensors convert driving data, environmental data, in-car driving habits, road conditions, and so on into data needed for training autonomous driving models.

This is historically rare — it's a field that accumulates data for itself, not because it's autonomous driving technology, but because it's first and foremost a car. Over the past decade, people bought cars not for autonomous driving features, but just to buy a car, and sensors came standard on the vehicle.

AlphaFold: when data is insufficient, prior knowledge fills the gap

Finally, let's look at AlphaFold (protein structure prediction). Its three model versions also went through different development processes. Of course, its current algorithm structure is also related to the large models we're discussing today, or in some sense is end-to-end.

In its early development, it needed to leverage large amounts of existing human data, or needed to incorporate some physics models. What are physics models? We're talking about thermodynamics, kinetics. So in the AlphaFold1 and AlphaFold2 stages, much prior human knowledge needed to be added — biological laws, and laws and algorithms related to chemistry and physics.

AlphaFold's data was initially relatively scarce in the AlphaFold1 stage, because it needed very specialized data — it was solving an extremely specific problem: how a protein sequence would ultimately fold, what this long chain would look like once stabilized.

Its data development also went through this process: at first there was only a small amount of protein structure data, so more physics and math models and prior knowledge needed to be added; later there was somewhat more data, so physics, chemistry, and math models and prior knowledge could be reduced slightly; with even more data, these models and knowledge could be reduced further. Of course, this also involved much experiment-related work.

The special aspect of AlphaFold's development path: it didn't accumulate data through consumers, but relied on extremely specialized scientific research data. Yet throughout much of its evolution to today's model, researchers incorporated human prior knowledge, physics models, math models, etc., to help it solve problems during development. Later, with continuous accumulation of new data, plus extensive experimental verification and calibration, it developed into today's AlphaFold3. Today it may need slightly less physics and math models and prior knowledge.

However, it happens to be a model for predicting a single deterministic-dimensional problem, mainly aimed at solving how proteins fold once stabilized. It doesn't need to solve as many dimensional problems as embodied intelligence — state changes, object changes, interactions, various physical quantities, and other complex issues.

In Ten Years, the Answer Likely Lies in One of These Three Paths

The large language models, autonomous driving, and AlphaFold discussed above are the three iterative processes I can currently think of as references.

Large language models used nearly 40 years of data accumulated by all humanity, plus non-linear algorithm iteration, to develop to today's capability of processing language-related logic.

Autonomous driving, from its investment peak in 2015, took ten years to develop to today's L3.5 stage, though it encountered some different challenges along the way. Its algorithm iteration wasn't linear either, and its data was self-obtained — but not because it asked people to help collect data, simply by selling everyone a car that happened to have these sensors installed, so it created data for itself.

AlphaFold solved the specialized problem of protein structure and folding, using large amounts of specialized data to solve a single-dimensional problem. Its algorithm also went through several different iterations, and for a considerable period in the middle, leveraged human prior knowledge, physics models, math models, etc., to help it solve problems during its development.

These are three different development paths; you can choose your own reference answer based on your situation.

The challenges emerging in embodied intelligence today will ultimately be resolved in ten years either by following one of these three cases as a blueprint, or by fusing the strengths of all three to form a cross-cutting solution. Which one it will be is an open question — we can only raise questions, not give definitive answers.

The above is for your reference and consideration. Thank you.

Billions of Years Ago, Biology Built the First "World Model"

The Hotter the Robotics Track Gets, the More We Need to Respect the Fundamentals | Li Feng Column

Has Embodied Intelligence Reached Its "ChatGPT Moment" in 2025? | FreeS Research

The Road to Embodied Intelligence | FreeS Report 37