Exclusive Interview with LimX Dynamics Founder Zhang Wei: Embodied AI "Isn't Competitive Enough," World Models, and Robot Brains

FreeS Fund · May 22, 2026

Demystifying the "World Model"; "You Can't Win If You're Afraid to Lose"

Two years ago, when Zhang Wei, founder and CEO of LimX Dynamics, first appeared on the "High Energy" podcast, humanoid robots were just beginning to heat up.

Two years later, sitting back in front of the microphone, he has already gone through several rounds of transformation from academia to commercialization. In those two years, embodied intelligence went from a relatively niche technical direction to one of the most crowded stories in capital markets. Funding races, soaring valuations, IPO queues — "world models" and "robot brains" became standard features in every pitch deck.

Yet many of Zhang's judgments run against consensus: this track isn't too crowded, but rather doesn't have enough people yet; humanoid robots shouldn't go into factories; models don't equal brains, and a brain is an Agentic operating system powered by multiple models.

In this industry observation episode, Michael Mao, founding partner of FreeS Fund, once again invites Zhang Wei to discuss what has actually happened in the embodied intelligence industry over these two years, and what has been the fastest dimension of growth for someone who went from researcher to CEO.

The main topics they cover include:

  • Why isn't the embodied intelligence track crowded, but rather "doesn't have enough people"?
  • Should humanoid robots actually go into factories?
  • Is the "world model" a technical breakthrough, or just an extension of VLA?
  • Does stuffing physical formulas into models actually work?
  • Will there be a unified general-purpose robot model?
  • If models aren't brains, then what exactly is a robot's brain?
  • Among the five transformations of a professor-turned-entrepreneur, which was the hardest?

Below is the full transcript. We hope it offers entrepreneurs and practitioners following embodied intelligence, robotics, and world models a new angle to consider. We also look forward to journeying with more innovators; pitch decks are welcome at bp@freesvc.com.

01 "Not Crowded — Not Enough People Yet"

Michael Mao: Over the past two years, robotics has only gotten hotter. Every time people thought it had peaked, something even hotter came along. That heat brings funding, attention, and enthusiasm from government, industry, and the public, and those are real benefits. Beyond the benefits, what have been your main anxieties and confusions over these two years?

Zhang Wei: Let me start with industry changes. Everyone in this track feels it's crowded, but I don't think so. I genuinely feel there aren't enough people yet.

Michael Mao: When you say it's not crowded, is it because there's no homogenization yet, or because people haven't converged on similar solutions to the same problems and started competing on efficiency?

Zhang Wei: Because people always view it through the lens of vertical internet tracks — looking for scale, believing you have to be number one to survive. With that perspective, they keep asking who will stay at the table and who will win.

But embodied intelligence, in my view, can't be loosely compared to a vertical company category. It should be compared to the entire internet. The internet can have Meituan, Alibaba, ByteDance, Tencent, and in verticals there can also be DiDi, Douyin, Kuaishou — so many. The opportunity here is enormous. As long as everyone does their job well, I think everyone can survive. So there's no crowding.

Michael Mao: If you put it that way, I think it's more like new energy vehicles. Starting in 2014, people first questioned whether new energy vehicles themselves would work, then whether the new automaking forces would work. Today, ten years on, at least for now, everyone in vehicle hardware has found a place to survive. Whether it's the new forces, traditional companies restarting their new energy path, joint ventures shifting to self-owned brands, or gasoline cars transitioning to new energy, each was once doubted and each has found its own spring. Of course, there's also a harsh reality: the popular new energy brands and products change every year. Do you think robotics will end up like this?

Zhang Wei: Embodied intelligence is vastly larger than new energy vehicles. New energy vehicles essentially just need to accomplish getting from A to B, and they're consumer-facing. There can be many players, including some traditional automakers, and vehicle forms can still evolve — business vehicles, commercial vehicles, trucks, transport vehicles. Looking at mobile chassis, there's also logistics, which broadly counts as vehicles.

But embodied intelligence is far broader than this. So the domain focused on new energy and satisfying consumer mobility is a relatively vertical track. Embodied intelligence has far more than this single user objective — it's like blood that can permeate every industry. Put embodied intelligence into a car, and people call autonomous driving the first embodied intelligence application to land. So I think the space for innovation here is enormous. We're far from converging on a specific objective, and the users under that specific objective are equally undefined. We're not yet at the point of competing on cost-performance and price, because the functions aren't even fully realized. Overall, the space is enormous, in my view.

Michael Mao: Let's switch to a specific, potentially troubling question. Today there's a funding race — raising above certain valuations, above certain amounts. You entered this industry relatively early. Does this funding race, this jostling, cause you anxiety and trouble?

Zhang Wei: It affects us, but I wouldn't call it anxiety or trouble, because we have our own principles and fundamentals on this matter. We're relatively conservative on funding PR, and we firmly believe that value and valuation should stay in proportion. That's healthy development.

Then why do we also raise funding? Fundamentally, the ability to raise capital is an important capability for future development in this industry. It's not optional. This track requires sustained investment, and enterprises will need substantial capital during future scaled delivery. In this sense, funding is necessary. But raising a large sum first without a clear idea of what to do, and figuring it out later, is not our style.

02 Humanoids Don't Enter Factories, Continuously Grow Apps, Head to Homes

Michael Mao: About a year ago, there was endless debate about why robots should be human-shaped, why they should be robots, and what they would eventually do. At that time, you mentioned that your company's technical direction was still to avoid sending them to work in factories. Has your perspective changed today, given technological and company developments?

Zhang Wei: Two aspects: some are personal choices, some are our industry judgments — these aren't the same thing.

Starting with personal choice. I threw out a slogan: humanoids don't enter factories. That doesn't mean no robots enter factories. There will definitely be robots serving industry, and embodied intelligence will definitely develop in industrial domains. But our company doesn't position two-legged humanoids as efficiency tools in factories.

We have a slogan: "Serve people, not process." Humanoid robots serve people, not production processes. They're fundamentally not the most cost-effective, efficiency-optimized products. Factories are designed for machines; our humanoids are designed for environments made for people, providing services to people — that's the positioning difference.

This isn't a right-or-wrong choice. Many robot forms, like robotic arms or new manipulation methods with tactile sensing, are suitable for factory deployment. Dull, complex, dangerous, and harmful jobs in factories need embodied intelligence to improve — we don't reject that. But two-legged humanoid robots, we feel there's no need for them to enter factories, and that's how we've chosen.

Full-size humanoid Oli ascending stairs based on real-time terrain perception

Michael Mao: Let's unpack this further. Human-shaped embodied intelligence can functionally be divided into integrated "arms plus legs," more arm-biased manipulation, and more leg-biased walking and running capabilities. The sports event that just concluded mainly demonstrated running-around capabilities. From today's perspective, considering public attention and the evolution of technological applications, are upper and lower limb capabilities more likely to iterate and be applied separately for a period, or together?

Zhang Wei: Embodied intelligence is essentially the terminal carrier that can host this intelligence. In form, there are humanoids, and there are half-humanoids — I call them wheelchair-type robots, with upper bodies on wheels. There are also only lower bodies, quadruped dogs, dual-wheel-legged, bipedal — various forms.

We make two categories. First is general-purpose humanoid: two legs, two arms — only this category do I call humanoid.

Why is this humanoid form needed? I have a formula: Maximize number of tasks over form factor — that is, maximize the variety of tasks completable without changing form. To serve the greatest variety of tasks humans need, humanoid is the only optimal solution. I've calculated that at roughly three to four task types, this form becomes necessary — there's no other solution.

For example, passing through building turnstiles, picking up packages or placing them on shelves — this configuration must be humanoid, there's no alternative. That's the optimal solution. And this optimal solution's objective function, cost function, is task variety.

Other forms, like dual-arm, single-arm, quadruped — I call these maximize over ROI, or maximize over efficiency, meaning maximizing efficiency under certain task constraints. Quadruped might be best in some cases, or single-arm, dual-arm — all possible. So we also have a product line called "TRON," a base that through SKU combinations can achieve virtually any form. Humanoid covers the most general single form; TRON covers all other specialized forms.
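The two objectives described above can be written as a toy selection problem. The following is a minimal sketch for illustration only: the form names, task names, and efficiency scores are invented assumptions, not LimX data or the company's actual calculation.

```python
# Hypothetical capability table: the tasks each form factor can complete,
# with an invented efficiency score per task (higher is better).
CAPABILITY = {
    "humanoid": {"turnstile": 0.6, "pick_package": 0.6, "shelve": 0.6, "stairs": 0.7},
    "single_arm": {"pick_package": 0.9, "shelve": 0.8},
    "quadruped": {"stairs": 0.9, "turnstile": 0.5},
    "wheeled_dual_arm": {"pick_package": 0.8, "shelve": 0.7},
}


def most_general_form(capability):
    """Maximize the number of distinct tasks covered without changing form."""
    return max(capability, key=lambda form: len(capability[form]))


def most_efficient_form(capability, required_tasks):
    """Maximize mean efficiency among forms that can do every required task."""
    feasible = {
        form: sum(scores[task] for task in required_tasks) / len(required_tasks)
        for form, scores in capability.items()
        if all(task in scores for task in required_tasks)
    }
    return max(feasible, key=feasible.get) if feasible else None


print(most_general_form(CAPABILITY))
print(most_efficient_form(CAPABILITY, ["pick_package", "shelve"]))
print(most_efficient_form(CAPABILITY, ["turnstile", "pick_package", "shelve"]))
```

With these made-up numbers, a specialized arm wins when only two tasks are required, while the humanoid becomes the only feasible form once the task list widens to three, which mirrors the argument that task variety is the objective the humanoid form optimizes.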

Michael Mao: As an investor, I'm curious. This industry is still in a very hot phase, but it ultimately has to land in applications themselves and generate commercial value. Of course there's already some commercial value today, like performances. In the foreseeable future, where do you think its commercial and business value will start to show and materialize?

Zhang Wei: The unfolding of humanoid commercial value is somewhat non-consensus — otherwise there'd be nothing for startups to do. I think it's a process of gradually growing apps.

Current performances are one application. Customers can make money with it, it's being used — not just sell-in, but sell-out. That matters.

The next stage I call "mouth moves, hands don't." Performances replace actors, and actors are valuable. For the next step, don't always think about having the robot do physical work and replace workers. Having it "move its mouth without moving its hands," replacing smart people, is actually more valuable. Business services, tour guides, shopping guides: these domains all work.

LimX Dynamics full-size humanoid robot Luna

It's a mobile consultant equipped with language capability, with certain emotional value and new experiential value. It doesn't change the physical world, only interacts.

I break it down into three stages: no interaction, weak interaction, and strong interaction. No interaction means it just moves on its own, like a performance: you press a button and it does a routine. Weak interaction is language and motion coordinated together: you talk, it acts, and it never needs to lift a finger. The first step is to figure out which roles can be replaced, such as jobs where someone stands all day doing little physical work but needs to speak. The essence of AI's transformation is replacing mental laborers, especially mediocre ones. Strong interaction depends on how the data for a particular skill evolves, including pre-training and post-training for that skill type, and on whether the overall data cost and the commercial value it creates can break even. The task is to find that direction. We're taking an open attitude toward this as well: you need to understand pre-training and post-training, but you also have to explore the commercial direction. As long as a single skill breaks even, that's enough.

Li Feng: Let me jump in here. Actually, many relatively new things don't land exactly how everyone predicts. Take performance as an example. From 2015 to 2017, the investment industry aggressively poured money into drone companies. DJI was on fire back then, with sky-high valuations. Later, people realized that while drones were hot, many scenarios just didn't work. The first scalable application turned out to be drone performances. Coincidentally, that was when fireworks started getting banned. Today, drone shows have become an accepted business.

Going further, drone performances later advanced several capabilities, such as flight control and coordination, scaling from dozens to thousands to tens of thousands of units, with costs driven down. Plus, the new energy vehicle supply chain reduced sensor costs another notch, and model evolution pushed control systems forward one more step. Only then could drones enter small-business and consumer-leaning markets. By the same logic, as world models and autonomous driving technology develop and costs fall further, drones can eventually reach consumers. This probably took about ten years longer than originally imagined, waiting for both software and hardware to iterate to this point. Robots may evolve following the same logic.

Zhang Wei: I think this era can't just be about calculating ROI for replacing humans. Adding an experience dimension is extremely valuable. But if it's only performance, I do think the ceiling is limited. If, as I said, it's a step before having numerous apps, then it makes sense. Anyway, for humanoids, we're not going into factories. Eventually heading to the home — I think that's the biggest, most exciting, and most suitable direction. Household tasks, you know, you don't do laundry that many times a week. It's essentially diverse tasks, not specialized tasks, so humanoid form works better. Then commercial spaces like malls, hotels, companies — those will also be the earliest landing spots.


03 Demystifying "World Models"

Li Feng: The hottest new robot story of the past six months has been world models. (Extended reading: FreeS Fund's Li Feng: For Embodied Intelligence to Land, It May Need to Copy These Three Homework Assignments)

Today, every robotics company has to include this word when pitching to investors. Even many industries not closely related to robots are stuffing it into their business plans — saying things like "I'm the emotional part, the affective part, the memory part of the future world model." Everyone's overusing the term. From the perspective of a robotics practitioner and startup CEO, what do you think of this word?

Zhang Wei: Overall, I think world model is a direction worth looking forward to — a new path for breaking through embodied data scaling laws — but it's still in a relatively early stage, with many definitions still rather fuzzy.

Let me start with my understanding of it. First, when we talk about the term "world model," the first thing to do is demystify it. A world model is essentially a "model"; "world" is just the modifier. In essence, all the physical and non-physical models we've encountered since childhood are world models in some sense. They differ only in the size of the world, its degree of openness, and the physical variables that can be observed.

Since it's a model, it basically takes the current state of the system — the state — and the potential actions we can take to influence this world — the action — to predict the world's state over some future period, and the output corresponding to that state, the observation.
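The state, action, and observation roles described above can be made concrete with a deliberately tiny example. This is a sketch under stated assumptions: the one-dimensional cart, the `ToyWorldModel` class, and its method names are hypothetical stand-ins, and a hand-written physics step takes the place of a learned video model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class State:
    position: float
    velocity: float


class ToyWorldModel:
    """Hypothetical 1-D world: a cart pushed by an acceleration command."""

    def __init__(self, dt: float = 0.1) -> None:
        self.dt = dt

    def step(self, state: State, action: float) -> State:
        # Transition: predict the next state from (state, action).
        velocity = state.velocity + action * self.dt
        position = state.position + velocity * self.dt
        return State(position, velocity)

    def observe(self, state: State) -> float:
        # Observation: only position is visible, like a camera that
        # cannot measure velocity directly.
        return state.position

    def rollout(self, state: State, actions: List[float]) -> List[Tuple[State, float]]:
        """Predict future states and observations for a planned action sequence."""
        trajectory = []
        for action in actions:
            state = self.step(state, action)
            trajectory.append((state, self.observe(state)))
        return trajectory


model = ToyWorldModel(dt=0.1)
future = model.rollout(State(0.0, 0.0), actions=[1.0, 1.0, 0.0])
for predicted_state, predicted_observation in future:
    print(predicted_state, predicted_observation)
```

In an embodied setting the observation would be video frames rather than a single number, but the interface is the same: current state plus candidate actions in, predicted future out.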

In embodied contexts, due to data constraints, observation is still primarily visual. Generally, it predicts the video information the robot observes while completing a task, so it's naturally related to video generation model technology.

But people found that merely predicting future video isn't enough, because tasks can't be completed. So now there's a new term: world action model. This model must simultaneously predict the future task-completion video and the action corresponding to that video. It's an upgrade from traditional VLA.

Why is it hot? On the technical level, I don't think it's any major breakthrough. Essentially, people saw the limitations of traditional VLA in data scaling, and world models showed new hope for data scaling. Technically, it's an extension of the VLA paradigm. Of course, personally I think VLA's definition could be broader — traditionally, VLA uses VLM as backbone, while world model replaces the backbone with video generation technology. This way, when training robot manipulation, you can leverage video data with temporal information. Compared to single-frame static visual information, temporal information better expresses the physical laws of the world. At the same time, as a potential source for scaling or generalization, video data is certainly easier to scale and collect than real-robot data.

So now you see many companies collecting human manipulation video data, especially ego-centric video data. Not only is it easy to collect, but we also know there's a lot of historical video data on the internet that people hope to utilize, among other reasons. Technically, it shouldn't count as any major breakthrough — otherwise, not so many people would be able to do it.

LimX started exploring using video data for manipulation model pre-training around mid-2024. In early 2025, we released something called VGM — video generation motion — a typical world action model. Last year we also had an interesting and decent conference paper, and a CoRL paper called GVF Tape, a very interesting world action model that doesn't require much data. I didn't use the term "world model" back then, because NVIDIA hadn't coined it yet.

Li Feng: One question related to world models is that it has many different branches. One category involves adding more mathematical formulas — or physical formulas — for predicting and expressing the physical world into the model. That is, using humanity's past understanding of various physical quantities (gravity, fluids, soft bodies, friction, temperature, electricity, mass, magnetism), incorporating some portion of them into or alongside the model. What do you think of this?

Zhang Wei: This question is currently unfalsifiable and easily contentious, so I'll only speak from my personal angle.

First, there needs to be a unifying principle: adding all these things is essentially adding data. Physical formulas, like the neural networks we train, are data compression. We consider Newton's laws to be a compression and representation of all motion data. Adding Newton's laws is essentially adding motion data in the most parsimonious way.
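The "formula as compressed data" view can be illustrated with free fall. The sketch below assumes noise-free synthetic samples and an arbitrary 100 m drop height; it is an illustration of the idea, not a description of how anyone trains a world model.

```python
# Free-fall "motion data": 1,000 (time, height) samples.
G = 9.81     # m/s^2, standard gravity
H0 = 100.0   # m, assumed drop height for this illustration

times = [i * 0.004 for i in range(1000)]
heights = [H0 - 0.5 * G * t * t for t in times]

# "Compress" the data set into the two parameters of h(t) = h0 - 0.5 * g * t^2
# with ordinary least squares on x = t^2 (closed form for a straight line).
xs = [t * t for t in times]
n = len(xs)
mean_x = sum(xs) / n
mean_h = sum(heights) / n
slope = sum((x - mean_x) * (h - mean_h) for x, h in zip(xs, heights)) / sum(
    (x - mean_x) ** 2 for x in xs
)
g_fit = -2.0 * slope
h0_fit = mean_h - slope * mean_x

# The two fitted numbers regenerate every sample.
max_error = max(
    abs((h0_fit - 0.5 * g_fit * t * t) - h) for t, h in zip(times, heights)
)
print(f"stored values: {2 * n} -> parameters: 2")
print(f"g = {g_fit:.3f} m/s^2, h0 = {h0_fit:.1f} m, max error = {max_error:.2e} m")
```

Two thousand stored numbers collapse into two parameters that reproduce the whole trajectory, which is the sense in which adding a physical law adds motion data in its most parsimonious form.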

Viewed through this unified data lens, things become clearer. There are two matters. First, does adding these physical law mathematical formulas bring new data increments — or more fundamentally, information increments? Does our understanding of this world's information increase? Second, it may have increased information — we wouldn't do this if it didn't — but can we effectively utilize this increase? This is essentially an alignment problem. Can these other data representations align well with the current world model's representations? If they don't align, it could backfire.

From first principles, adding these physical law mathematical formulas certainly brings new information, especially new modality information. The process of human abstraction of physical laws involved extensive non-visual measurements — electrical, magnetic, force measurements, and so on — so adding them is essentially adding representations of these modalities, an increment.

But the crucial problem is that these representations are very difficult to align with world model data. This is the hardest part. So most approaches using physical formulas essentially turn them into simulation, generating more data that can easily align with visual observation data, then using this data to train world models. But I want to say this alignment is still quite difficult — the entire Sim to Real field is doing exactly this alignment. Overall it's useful, but wanting to use it well, to align physical laws with world models, is very challenging, not as easy as imagined.

Li Feng: Today there's a very hot startup direction that shares the same root as this phenomenon. Today's data, especially data with physical quantities, has hardly been well accumulated historically. So data collection and processing startups have become incredibly hot. As an embodied company that needs data and iterative models, what do you think of this?

Zhang Wei: I believe producing embodied models is no different from manufacturing. Data is essentially raw material, training is the production line, and what comes out is the model. A model is essentially an expression distilled from data through collection, processing, and training. The production line starts with data, which goes through pre-processing and incoming inspection, then training, and finally yields a model. From this perspective, the richness of various modality information as raw material is very important.


04 The Model Is Not the Brain; the Brain Is the Operating System

Li Feng: Today there's another fashionable term: "robot brain." At least some high-valuation companies like yours have to say this is what they're building. But the definition of this term is extremely diverse. What do you think of it?

Zhang Wei: I now have a relatively clear understanding of it, though this understanding is still constantly iterating.

Two points where I differ from most people: First, I don't think the model is the brain. Second, the brain is not a model either; the brain is an operating system. This is my definition.

Why do we have COSA Agentic OS? The brain is an operating system. It doesn't just manage memory, storage, and thinking — it's an agent. I consider it an Agentic OS. It needs to call various models, including VLM, LLM, world models, and various VLA models, to complete a task. It also needs to call many tools. So we don't believe you can train a brain by stacking operation data. Essentially, it's an operating system built on top of model capabilities.

Li Feng: I think ordinary listeners might find this a bit confusing. Is the brain something above the large model layer?

Zhang Wei: No, no, no. COSA is a brain, but COSA is built on top of large language models, and its capabilities depend on which underlying model you use. This used to be very confusing to explain, and even drew criticism. Now that OpenClaw is out, it's a bit easier. Fu Sheng said something similar recently: the large model isn't the brain, OpenClaw is the brain. We agree with that. An Agentic OS is a brain. The brain is an operating system, while various models are the tools and skills the brain uses to think and complete tasks.

Many people have a relatively narrow understanding of "skills" — things like a robot picking up, placing, or twisting open a bottle cap, these atomic action skills. But the skills I'm talking about can be very broad. Autonomous driving as a whole — though not yet realized — if achieved, the corresponding model would be a skill. This model isn't fully equivalent to a human brain. A person can be very smart, very capable, but simply hasn't mastered the skill of driving. That's fine. So the essence of the brain is still an operating system layered above the models.

Li Feng: This might feel counterintuitive to most people. Most people think the model is the brain, and COSA is operational capability or an extension of it. From your perspective, it's the opposite.

Zhang Wei: I can't speak for others, but we have a very clear architectural understanding of the robot brain. It's become clearer recently — over the past year or so it has gradually converged and matured into a three-layer architecture, which is somewhat different from others.

LimX Dynamics' Three-Layer Technical Architecture

The bottom layer is the cerebellum foundation model. The cerebellum handles motion control — this is System 0. It's pure movement. You could think of it as a zombie: no brain, it just moves however you tell it to move, and it has to complete the action you want. For traditional robotic arms, this is relatively simple and can be done with conventional methods, but humanoids need some AI capability.

The next layer up is Humanoid VLA, the humanoid Vision-Language-Action model — this is System 1, a higher-order skill. Movement has to be related to environment and task. It needs to combine what the eyes see with the movements I can command, and get the thing done. This is the middle VLA skill layer.

The top layer is System 2, the Agentic OS. It's the entire Agentic system with large models as the engine — that's our COSA.

Why divide it this way? Consider this scenario: a person lies paralyzed in a hospital bed, very smart, can't move at all, but can think. Does he have a brain? Yes. Can he move? Not at all. If I give him a skill, unblock that meridian, he can reach for a water cup — that's giving him a VLA capability. The model gives this person a skill. Then what should his brain be? I think it's an OS.

Another example: I want to go downstairs to buy a coffee and bring it back. For this, I first need to make a decision: do I have time right now? Then I need to go out, invoke my skill for opening doors, and invoke positioning and navigation. After going out, I need to use GPS, and I can't possibly train a model with GPS encoded into it. So I believe this whole compositional thinking is the work the upper-layer Agentic OS needs to do. Its intelligence and maturity depend on the progress of large models. The core of the brain is still fundamentally the language model, because language is the essence of thinking: human thinking is conducted through language.
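The coffee errand can be sketched as an operating-system layer that makes a decision and then dispatches to learned skills and ordinary tools. Everything here is a hypothetical illustration: the `AgenticOS` class, its method names, and the fixed plan are assumptions made for this example and do not describe COSA's actual design or API.

```python
from typing import Callable, Dict, List


class AgenticOS:
    """Hypothetical top layer: decides, then dispatches to skills and tools."""

    def __init__(self) -> None:
        self.skills: Dict[str, Callable[[], str]] = {}  # learned models, e.g. VLA skills
        self.tools: Dict[str, Callable[[], str]] = {}   # non-learned services, e.g. GPS

    def register_skill(self, name: str, fn: Callable[[], str]) -> None:
        self.skills[name] = fn

    def register_tool(self, name: str, fn: Callable[[], str]) -> None:
        self.tools[name] = fn

    def plan(self, goal: str, has_time: bool) -> List[str]:
        # Stand-in for a language-model planner; a fixed plan keeps the
        # example deterministic.
        if goal != "buy coffee" or not has_time:
            return []
        return ["open_door", "gps_locate", "navigate", "pick_up_cup", "navigate"]

    def run(self, goal: str, has_time: bool = True) -> List[str]:
        log = []
        for step in self.plan(goal, has_time):
            handler = self.skills.get(step) or self.tools.get(step)
            if handler is None:
                log.append(f"{step}: no skill or tool registered")
                break
            log.append(f"{step}: {handler()}")
        return log


robot_os = AgenticOS()
robot_os.register_skill("open_door", lambda: "door opened")
robot_os.register_skill("navigate", lambda: "arrived")
robot_os.register_skill("pick_up_cup", lambda: "cup in hand")
robot_os.register_tool("gps_locate", lambda: "position fixed")

for line in robot_os.run("buy coffee"):
    print(line)
```

The point of the split is visible in the registry: GPS stays a tool that the OS calls rather than something trained into a model, and a missing skill stops the plan without implying the "brain" itself is absent.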

Now humanoid robots can dance, do backflips — these are skills, pre-recorded motion replays. But to make humanoid robots truly capable of working, the "brain commanding the body" part requires a foundation model. It can't be: I see this cup and want to grab it this way, then go back and train for a week before I can grab it that way. That won't work. This requires not pre-programmed motions, but rather: whatever motion you need, it can complete. We've invested a lot of time, effort, and data into this foundation model, and I think this is important for humanoid robots.

Li Feng: So this foundation model is more related to motion control?

Zhang Wei: Yes. The cerebellum's foundational capability is that it can execute whatever motion you ask it to perform. Whatever motion you want, it can execute for you.

This is similar to the foundation model for language. In the early days, basic dialogue was pre-arranged — you say "hello," I respond "I'm here" — that's pre-arranged, not a foundation model. Being able to produce desired responses to any input, that's what a foundation model is. The foundation model for movement is the same. You give me a reference trajectory, which can be thought of as a prompt, and I can complete that motion. This underlying capability is relatively important for building the upper-layer VLA and ultimately completing tasks.
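The contrast between pre-recorded replay and "reference trajectory as prompt" can be shown on a single joint. This is a minimal sketch under loud assumptions: a simple proportional controller stands in for the learned motion model, and the motion library, gain, and joint angles are invented for illustration.

```python
from typing import Dict, List

# Pre-recorded playback: only motions stored in advance can be executed.
MOTION_LIBRARY: Dict[str, List[float]] = {"wave": [0.0, 0.5, 1.0, 0.5, 0.0]}


def replay(name: str) -> List[float]:
    return MOTION_LIBRARY[name]  # raises KeyError for anything not recorded


def track(reference: List[float], gain: float = 0.8, start: float = 0.0) -> List[float]:
    """Follow an arbitrary reference trajectory, treating it like a prompt.

    A proportional controller stands in for the learned motion model here.
    """
    angle = start
    executed = []
    for target in reference:
        angle += gain * (target - angle)  # move part of the way toward the target
        executed.append(angle)
    return executed


new_motion = [0.2, 0.4, 0.6, 0.6, 0.6, 0.6]  # never recorded in the library
print(track(new_motion))
```

`replay` can only return what was stored, while `track` accepts any reference it is handed, which is the property the upper VLA layer relies on when it requests a motion nobody recorded in advance.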

Li Feng: Let's set aside the technical questions for now and talk about something else. These robot companies with high valuations are all preparing to enter capital markets in various forms, whether A-shares or Hong Kong stocks, and you're among them. But from an investor's perspective, a major boom often ends when its signature company goes public, is about to go public, or has just gone public. Look at Facebook, or Alibaba in 2007. If these intelligent robot companies list on capital markets around the end of this year or next year, and that becomes the peak of this boom, what do you do?

Zhang Wei: This industry may be somewhat different from others, and not fully comparable to new energy vehicles either. But the example you just gave is perhaps more similar to the new energy vehicle listing phase. When NIO, XPeng, and Li Auto (known collectively as "Wei Xiaoli") went public, the industry wasn't yet mature, and they actually dropped afterward. Once the technology improved a little and people could accept their presence in capital markets, companies that listed later actually benefited from a capital aggregation effect and could ride that wave.

Li Feng: Let me rephrase. Assume at some point, the industry's current heat drops by 50% or more. As a company in this industry, what do you do?

Zhang Wei: The answer is the same across all industries. But embodied intelligence has an advantage — I call it "higher ceiling than large models, higher floor too." The ceiling has more imagination. Building a truly general-purpose humanoid robot, the ceiling is no smaller than a model company. Second, it's not like model companies where if you miss one generation of models you're completely finished. It can always find vertical domains to apply. So in this space, companies that do their job well, follow true commercial logic, and carry some technological breakthroughs can all succeed. I'm relatively optimistic at this stage — high floor, high ceiling — because current investment levels haven't gotten that high yet.

TRON 2, a Multi-Form Robot Designed with Modularity in Mind


05 No Need to Wait for General Models to Deploy; Being Able to Peel an Egg Doesn't Mean You Can Drive

Li Feng: These different types of data — will they make robots more "vertical, constrained" over the next 1-2 years, meaning with boundary conditions and faster application deployment in specific scenarios? Or will they become less constrained, with faster model upgrades and iteration in generalized scenarios? Both could exist simultaneously, but as of today, which direction is more dominant?

Zhang Wei: Both progress simultaneously. But I have a somewhat different observation: in the deployment process of embodied models, whether VLA, VA, or world models, embodied deployment cannot follow the large model pattern of "general first, then specialized, then application, then deployment." At least that was my previous judgment, and it's what we're practicing now. Because across different skills and tasks, the relevance of data and data requirements are very different.

Li Feng: Can you give a concrete example of why data for A and B are very difficult to correlate and generalize?

Zhang Wei: Language data is a general modality. Whether you're writing a lawyer's letter or a document, it all helps the general model. But if I take autonomous driving data and egg-peeling data and train them together, we don't know what will happen yet — it might actually be problematic.

In the cross-skill data pipeline, expecting stacked data to produce emergence without understanding the relationships between data is, in my view, like "marking the boat to find the sword": rigidly applying a method that no longer fits the situation. So we believe general model capability is needed, but that capability grows gradually. I call it the "flywheel of general models and scenario data." On the basis of a certain general model capability, collect as much data as possible in vertical domains, and even deploy. Collect more data during deployment, which then feeds back into the general model. That's how general model capability grows, rather than first becoming extremely general and then deploying. So far, what I see is this flywheel of general models and scenario data, and it is basically the strategy we're practicing.

Li Feng: If a general model eventually becomes a more general robot model through this flywheel iteration, its capabilities sound far beyond predicting the next token, and far beyond what language models require. Should we expect such a model to be much larger than today's language models, which are probabilistic models with a few hundred billion parameters?

Zhang Wei: Good question. If the model is that general, it would be very large.

Li Feng: If it's that general and becomes that large, there may be challenges when using it on robots in the future. Looking at how people use large models today, the number of chips stacked on a robot, the power consumption, the energy consumed to do one thing, and the computing resources needed inside its body might be beyond imagination — because it needs to run such a large model, and in real-time. You can't peel an egg where it takes two seconds to peel the first piece, two more seconds for the second piece, and by the time you're done, I'm no longer hungry. So it needs to be fast, needs real-time — that's hard to compromise on.

Maybe chips will have advanced by then, maybe controllable nuclear fusion will have developed, assuming you can carry an infinitely small controllable fusion reactor. Of course that's science fiction. But it sounds like there would be various problems and obstacles. So if not this, what other solutions or possible directions might there be?

Zhang Wei: I've always thought it's about skills.

Li Feng: How is a skill defined?

Zhang Wei: Driving is a skill: I can have a brain and still not know how to drive. Peeling an egg is also a skill. Each skill requires its own training data. When I don't know how, I have to learn. It doesn't need a general model that can do everything. It needs various skills that help with scenario deployment.

What no one has seen yet is a skill whose commercial value covers the data cost required to create it. Or rather, no one has yet found a case in a vertical domain that breaks even. Everyone's looking.

Li Feng: From your perspective, autonomous driving is probably similar.

Zhang Wei: It's an embodied skill, not a brain. It's just one part of the human brain — a skill. Being able to peel an egg doesn't mean you can drive a car.

Li Feng: But today's large language models are essentially probabilistic models too. Their word prediction is a probabilistic process, so we can't really say it's a semantic understanding process. Do they truly understand this content?

Zhang Wei: What is understanding, anyway? Their understanding and what we call understanding — they might not be the same thing.

If we treat understanding as an abstract definition, then as long as it can predict something similar or consistent with what you do next, that can be called understanding. We can make that kind of abstract definition. So understanding is hard to pin down clearly. Our understanding of something depends on our judgments about what it entails, and the data behind our decisions.

Li Feng: I think the biggest difference here might be that the kind of understanding humans want achieves abstraction of things through extremely limited sensory or composite data. But today's language models, because they're probabilistic models, need massive amounts of data to complete their predictions.

Zhang Wei: I'm not sure if they need more data than we do. We've actually evolved over hundreds of thousands of years. The data your brain has used through evolution has been accumulating for that long. What we pass down generation after generation to our children is essentially a pretrained model — they learn after they're born. Fundamentally, the brain at birth is a pretrained model. How do we even compare the historical data volume of this neural network against large models? Because we chat like this every day, and that's a lot of data too.

It actually has more information than we do. Right — it's the union of all humanity.

Li Feng: From what I just heard, LimX's products are roughly divided into hardware and software layers. The hardware layer is general-purpose yet specialized — the same set of components that can serve as arms or legs, plus a humanoid form that carries integrated functionality. The software has three layers: one is more brain-like, the Agentic OS, which schedules and deploys models and skills; below that is the layer of skill models trained on new and existing data; and at the bottom is the motor control foundation model that can execute arbitrary actions in the physical world.

Zhang Wei: Right. We recently open-sourced something I haven't told you about, Uncle Feng — it's called FluxVLA Engine. We didn't open-source the VLA model itself. I think models aren't really usable in their current state; open-sourcing them helps with fundraising, but the model parameters themselves may have limited meaning. We decided to teach people to fish rather than give them fish — we open-sourced the base model for training large models and the entire fine-tuning architecture.

We open-sourced the "model production line." Because ultimately, the data and models that get deployed belong to whoever is deploying in the vertical scenario. What they need is our hardware platform and this open architecture — they have to collect their own data.
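
As a reading aid, the three software layers summarized above (an Agentic OS that schedules skills, separately trained skill models, and a motor control foundation at the bottom) can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: the class names `AgenticOS`, `Skill`, and `MotorControlFoundation`, the `peel_egg_plan` function, and the string-based "actions" are all hypothetical, and none of this reflects LimX Dynamics' actual software or the FluxVLA Engine API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class MotorControlFoundation:
    """Bottom layer (hypothetical stub): executes a named low-level action."""

    def execute(self, action: str) -> str:
        # A real system would drive actuators; here we only record the action.
        return f"executed:{action}"


@dataclass
class Skill:
    """Middle layer (hypothetical stub): one separately trained skill.

    The 'model' is replaced by a plain function that maps a target object
    to an ordered list of low-level actions.
    """

    name: str
    plan: Callable[[str], List[str]]


@dataclass
class AgenticOS:
    """Top layer (hypothetical stub): selects a skill and dispatches its actions."""

    motor: MotorControlFoundation
    skills: Dict[str, Skill] = field(default_factory=dict)

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def run(self, skill_name: str, target: str) -> List[str]:
        # A skill that was never trained or registered simply is not available.
        if skill_name not in self.skills:
            raise KeyError(f"skill not learned yet: {skill_name}")
        actions = self.skills[skill_name].plan(target)
        return [self.motor.execute(action) for action in actions]


def peel_egg_plan(target: str) -> List[str]:
    """Stand-in for a skill model trained on its own vertical data."""
    return [f"grasp {target}", f"crack {target}", f"peel {target}"]


robot = AgenticOS(motor=MotorControlFoundation())
robot.register(Skill("peel_egg", peel_egg_plan))
log = robot.run("peel_egg", "egg")
```

In this toy, asking the same robot to run a `"drive"` skill raises `KeyError`, which mirrors the point made earlier in the conversation: being able to peel an egg does not mean you can drive a car, and each skill has to be added with its own data.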

FluxVLA Engine Framework Diagram


"If You Can't Afford to Lose, You Can't Win"

Li Feng: You've been a researcher abroad for a considerable time, and also worked in domestic research for a period. Entering the entrepreneurship world, through the entire role transition and the process of growing a company from small to large, you've been through a lot. What are your feelings, experiences, and lessons?

Zhang Wei: There are so many — it's all pitfalls. I have a summary: a professor-turned-entrepreneur needs to go through five transformations — from academia to technology, from technology to engineering, from engineering to product, and from product to commercialization.

In the earliest stage, a professor is essentially proud of academic achievement. The essence of academia is taking pride in an idea, a concept; publishing a paper makes you happy. Then comes the technology stage, where you take pride in presenting that idea in a demo. From technology to engineering is about implementing a technology stably and reliably, and taking pride in that stability and reliability, which has nothing to do with demos anymore. From engineering to product is about moving from reliably implementing a technology to producing something that delivers user value. And then from product to commercialization: the more products you sell, the more money you might lose, so commercialization is ultimately what matters most.

At least for me personally, I've gone through several transformations where I had to negate myself, then negate myself again.

Li Feng: By negation, do you mean completely changing how you thought about things during transitions, or adding a very important new weight to your original thinking?

Zhang Wei: It's adding new dimensions — ascending to higher dimensions. The original dimension was too small. In entrepreneurship, academia isn't important, technology isn't that important either — ultimately, business is what matters. Centered on your business goals, you need to have the ability to determine what kind of technology you need.

The current stage is like what I mentioned earlier with VLA finding scenarios. You need to understand business design and have relatively accurate predictions about where technology is heading, so you can find the point in the future where the two intersect. Because this isn't a mature business with proven commercialization, you need to predict the trajectory of the technology, then make your choices at that intersection. This requires some technical strength.

Li Feng: Isn't this exactly what we often say in investing — one is carrying a hammer looking for nails, the other is finding a suitable nail and shaping the hammer to be the best for hitting it.

Zhang Wei: Right. The biggest hammer right now is AI. This AI paradigm will activate many possibilities. The essence is finding some possibilities, using this technology trend and technological power to solve for potential commercial value.

Going back to those transformations — my biggest pitfall in that process was my understanding of organization. Each change isn't as simple as self-negation. The people you choose, the people you use, the entire organization — they all have to iterate with you. When you're only focused on technology, the people you choose and how you organize are completely different from when you care about engineering, about mass production. So the change isn't just about yourself changing — you also need the ability to drive the entire organization through that change. The people you choose early on are different from later.

Li Feng: We've also seen you go through some organizational structure and personnel changes over the past four or five years — there must have been both experiences and lessons.

Zhang Wei: Actually we changed quite a few executives early on, three or four of them quite early. Later you realize that growth means not thinking too highly of yourself — personnel selection is a two-way street, and when people feel it's not a fit, it's better to speak up sooner rather than later and handle things well. So I'm quite at peace with this now. Someone recently mentioned Zhang Li's departure — that was last year. We had already agreed then that he might focus more on Beijing and gradually step back from company management, but we still respected each other and the transition was quite peaceful. He later served as an advisor for a period. So this actually happened a while ago and didn't have much impact on us. Rather, when these issues happened earlier on, that was truly painful.

Li Feng: The first time, the second time — those were more painful.

Zhang Wei: It's embarrassing to talk about. Especially for intellectuals — it's a matter of face. Later it became more matter-of-fact. Actually I don't think people are good or bad — it's about fit, and whether they match the company's strategy at that time. Those are the key factors.

Li Feng: Because there's mutual need and capability issues, but also the time span issue — fit usually doesn't mean forever; it might just be this year, these two years.

Let me ask something else. We've also invested in many professor-related startups. Professor entrepreneurs, in their initial stage — longer or shorter, six months, a year, possibly two or three years — tend to share a common phenomenon: they easily become idealistic, hoping everything, everyone, and every function can be solved in one step like solving an equation. For example, if they're missing someone with skill set A for a position, they hope to find a complete A to plug in, and from then on that matter is settled once and for all.

Zhang Wei: I've been through that too. I still occasionally have this wishful thinking in areas I don't understand. You hope to find someone and then you don't have to worry about it anymore. But actually you have to be the backstop.

Li Feng: There's another small challenge. Some professor entrepreneurs, when they start out, prefer to find a CEO right away — which is even harder.

Zhang Wei: This relates to the professor's own positioning. If they just want to be a technology contributor, an early contributor, finding a professional manager might work. But if they're the major shareholder and bring in a professional manager too early, before business has gone from 0 to 1 and while the entire business is still being explored — that's quite challenging.

Li Feng: Or in a sense, even if you want to find a CEO, you should first find a COO. At least there's room to observe, adjust, and coordinate — room to move up or down. CEO is the top; there's no other position above it. If there's even a little mismatch, there's no adjustment space.

Zhang Wei: In a rapidly developing track where the 0 to 1 exploration isn't complete, the largest shareholder is the CEO, whoever holds the title. Calling someone else CEO is essentially self-deception. If it's a mature business, say I'm opening a milk tea shop, then it doesn't matter: you can hire an experienced manager.

Li Feng: Phenomenologically speaking, in most cases, people who are both competent and effective emerge or are selected through a process. They gradually become that way; they don't become that way in one step from day one.

Zhang Wei: Phenomenologically speaking, yes, it happens gradually. But essentially it's a cognition issue for the founder or the number one person. If they're very clear that achieving this requires ABCDE, and can figure out what capabilities are needed for each piece, then there is a real chance of choosing the right people and getting things done.

What I've concluded is that people are most easily seduced in the areas they don't understand. Hoping for one-step solutions, or being swayed into a choice, essentially comes from missing many dimensions in their overall understanding of the business. There are also mature entrepreneurs who are particularly susceptible to seduction by technology and academia, who have a kind of tech worship, even when people who understand the field can see that a given person won't work out. So people are most easily deceived in the areas they're least good at.

Li Feng: You just said organizational capability is the most core thing. The embodied intelligence industry has been very hot in the past year or two, with young people flooding in. What are your thoughts?

Zhang Wei: We hope the organization becomes more vibrant, recruiting more young people, more AI-native people. We have our own understanding of what "young" means — it doesn't necessarily have to be age. Everyone's talking about how post-00s and post-90s are doing AI, and we think there's some truth to that. But there are also people like us — older in age but with a youthful heart and youthful energy.

Li Feng: Sounds like it's more about mindset and thinking.

Zhang Wei: One thing is certainly true: when people are young, many things come naturally, such as openness and a vast imagination. But later I discovered the most critical point. There are many people with open mindsets and many with passion. For someone relatively older, the biggest limitation is actually where they find a sense of achievement.

Li Feng: Possibly.

Zhang Wei: Ask someone who previously managed tens of billions in business to do something worth a million, and they can't find that sense of achievement. But young people haven't done this before, so they're very motivated. This is something unique to the young and very hard to find elsewhere. So we can be flexible on age. Our "young," in quotes, includes openness, includes experience, and includes a sense of achievement that can come from being happy about small things. I think that's our biggest realization about what "young" means.

Li Feng: To wrap up. As a CEO who's spent five years building a company and a researcher with over fifteen years of academic experience, you've lived through both the coldest funding winter and the hottest two years this industry has seen. Looking back on these five years, what's left the deepest impression, or what do you most want to capture about how it felt?

Zhang Wei: There's still so much I need to learn. Every time you level up, every new dimension you add — it's a rapid, painful process. The pain gives way to something enjoyable, but before you can even savor it, the next challenge arrives. You realize you're still not good enough. It's the fastest possible form of self-cultivation: recognizing your flaws, then rapidly learning, filling gaps, iterating. Every minute felt like pressure, like challenge. But looking back, the overall arc feels meaningful.

My biggest realization — and this is sometimes hard to explain to investors, but it's the truest thing I feel — was understanding that a startup can die. That was my biggest dimensional shift. A lot of people don't really get what I mean, but for me it was everything. If you can't afford to lose, you can't win. You have to be able to accept that something might fail.

Li Feng: This connects to what you said earlier about making yourself smaller, what we'd call ego. When ego is large, you can't accept failure. When ego is smaller, you can accept that the thing might fail. When you see yourself as too important, you can't accept your own failure — or the appearance of failure in others' eyes.

Zhang Wei: That need to win at all costs — when that takes over, it backfires more often than not. You can't make decisions with composure, you can't see the world clearly. Accepting that something can fail, that a company can even die — that was my single biggest personal growth, bar none. It's the most significant marker of my entrepreneurial journey. I don't remember exactly when, maybe year three or four, struggling through it all, I finally saw this, and it made me genuinely happy.

Li Feng: Or as we might put it: push the floor lower, and you can push the ceiling higher.

Zhang Wei: If you can't afford to lose, you can't win. That's how I see it.


Reader Giveaway:

Imagine you have a robot, but it's always a beat slow to react. What task would you ask it to do where its "sluggishness" would make you instantly lose your mind? Share your wildest ideas in the comments. By 17:00 on May 30, 2026, the three most creative entries will each receive a copy of Build a Large Language Model (From Scratch).

Build a Large Language Model (From Scratch), by Sebastian Raschka, translated by Qin Libo and Feng Xiaocheng (China Industry and Information Technology Publishing Group, Posts & Telecom Press)
