The Road to Embodied Intelligence | FreeS Report 37

FreeS Fund (峰瑞资本) · May 24, 2024

Understanding Embodied Intelligence: Beyond Humanoid Robots

Over the past year or so, while the large model wave has remained red-hot, the embodied AI track also seems to have entered the fast lane.

Multiple tech companies have launched landmark products: Tesla's robotics division released Optimus Gen2, capable of performing highly dexterous, flexible manipulation; Stanford's Mobile ALOHA robot can autonomously complete a shrimp-frying task; Google's RT-2 achieved end-to-end task understanding and operation through large models; and several startups like Figure have delivered milestone results of their own. All of this leaves us wondering: is embodied AI really almost here?

This is the fourth installment of FreeS Fund's Embodied AI Series. In the first three installments, we organized three in-depth conversations with Zhang Wei, founder of LimX Dynamics; Lian Wenzhao, a researcher at the Chinese Academy of Sciences' Institute of Automation; and two of FreeS Fund's tech investment colleagues, Yan Qianhang and Liu Pengqi. From the perspectives of entrepreneur, practitioner, and investor, they shared their views on embodied AI's technological development, its opportunities for industrial deployment, and the capital frenzy around it. See the links below for details:

▲ Li Feng in conversation with Zhang Wei, founder of LimX Dynamics: Humanoid? Robot?

▲ Li Feng in conversation with Lian Wenzhao: The Imagination and Bubble of Large Models, the "Impossible Triangle" of Robots, and the Future

▲ How Far Is Embodied AI from Reality: Boom, Bubble, Technological Frontier, Commercialization

The research report we're sharing today was presented by Liu Pengqi, Executive Director at FreeS Fund, at a FreeS Fund monthly meeting. We're treating this report as an interim summary of the Embodied AI Series. We will continue to track developments in the embodied AI space and release follow-up content, so stay tuned.

Returning to this report, it mainly addresses the following questions:

What exactly is embodied AI?
Which technological advances have driven embodied AI to its current level?
What have large models brought to embodied AI?
How far is the road to embodied AI? Are humanoid robots the endgame?
In the technological evolution path of embodied AI, what opportunities remain for entrepreneurs?

We hope this offers fresh perspectives. If you're an entrepreneur or practitioner in the embodied AI industry, you're welcome to reach out to the author Liu Pengqi at pengqi@freesvc.com.

Engagement Perks

What do you think about the present and future of embodied AI? Share your thoughts in the comments.

By 17:00 on May 31, the three most thoughtful commenters will each receive a FreeS research handbook and a copy of What Is ChatGPT Doing... and Why Does It Work?.

/ 01 / How Did Embodied AI Get Here?

Embodied AI is, literally, AI that has been given a body. According to the definition from the China Computer Federation, embodied AI refers to an intelligent system that perceives and acts through a physical body. Through interaction between the agent and its environment, it acquires information, understands problems, makes decisions, and executes actions, thereby producing intelligent behavior and adaptability.

Compared with AI in the usual sense, embodied AI adds a physical entity that can interact with the physical world. Compared with traditional robots, it has intelligent decision-making capabilities and generalizes beyond the narrow applications that traditional robots are limited to. From an early-stage investor's perspective, we see embodied AI as essentially equivalent to "general-purpose intelligent robots." Put differently, embodied AI is the intersection of physical entity and general-purpose intelligence.

To observe the technological evolution of embodied AI, we can draw an analogy with human evolution. Starting from single-celled organisms, life on Earth spent billions of years evolving a brain. Once the brain was sufficiently developed, humans evolved the unique capacity for imaginative fiction. Based on this, it took merely tens of thousands of years to construct a "second curve" of human civilization encompassing language, religion, and culture.

In embodied AI, we can see a similar evolution. The following figure gives us an intuitive view of how embodied AI has step-by-step evolved to where it is today.

From the perspective of driving forces: on one hand, industrialization has driven the rapid development of embodied AI's physical components, including hardware and sensors, which is analogous to the evolution of our physical bodies. On the other hand, industrialization has also propelled the semiconductor industry, and the resulting growth in computing power has evolved into today's computation-driven paradigm, which is analogous to the evolution of our brains and intelligence.

So we can first break embodied AI into two parts: physical entity and computational capability, corresponding to "real" and "virtual" respectively.

On the physical entity side, industrialization drove the development of industries such as home appliances, automobiles, and consumer electronics, which in turn accelerated rapid iteration and industrialization of upstream hardware. Along with declining hardware costs and improving sensor precision across categories, we can today see the emergence of numerous embodied AI forms, especially humanoid robot bodies. The bottlenecks of physical entities are being broken down bit by bit.

On the computation-driven side, the earliest automation was software running on PCs or industrial computers. Later, this software gained internet connectivity and online capabilities, bringing the flourishing development of the internet era. These internet applications could even develop rapidly on their own, decoupled from physical entities: the rise of search, information distribution, and social networking platforms are the best examples. These platforms themselves generate massive amounts of data, and it is precisely the accumulation of such massive data that made the scaling laws of large models possible.

Of course, these two lines — physical entity and computational capability — also intersect from time to time during their development. For example, in the automation era, we saw industries like automobiles and industrial robots. In the digitalization era, we saw the so-called first-generation intelligent products of the previous wave, such as assisted driving, AGVs, and flexible robots. Today, these two lines intersect once again, giving us embodied AI as we see it now.

/ 02 / The "Coffee Test" in AGI

Looking specifically at the technical architecture of embodied AI, industry consensus breaks it down into three layers. The bottom layer is the hardware body, which we can analogize to the human body — muscles, bones, sensory organs (sensors), hands and feet (actuators), and so on. The layer above that is motor control functionality, equivalent to the cerebellum. The top layer is the brain, responsible for thinking, planning, decision-making, and environmental understanding.

Among the tests proposed for artificial general intelligence (AGI) is the "coffee test" put forward by Apple co-founder Steve Wozniak, which measures whether a robot can perform a task as well as a human. In this test, without any task-specific pre-programming, the robot must enter an unfamiliar room, find the coffee machine, take out a cup, and brew a cup of coffee.

A task utterly ordinary for humans is extraordinarily difficult for robots. First, the robot needs the ability to understand its environment — to recognize which items in the environment are relevant to "brewing a cup of coffee," and to decompose the coffee-making task. These are brain capabilities. When the cerebellum receives these tasks, it needs to plan and control, including designing movement paths — such as what motions and postures to use to grasp the cup, and if the end effector is a dexterous hand or gripper, what configuration should be used to complete the execution.

Overall, the technical architecture of embodied AI consists of the "brain" layer for environment and task understanding, decision-making, and task decomposition; the "cerebellum" layer for planning and control; and the "body" layer for task execution. Of course, this process also requires sensor data to help the "brain" and "cerebellum" better understand and control the "body." We will analyze these three layers in more detail later.
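To make the three-layer split concrete, here is a minimal sketch of the coffee-test flow in Python. Everything in it is an illustrative assumption rather than any real robot stack: the function names (`brain_decompose`, `cerebellum_plan`, `body_execute`), the hard-coded subtasks, and the string "commands" are all hypothetical stand-ins for the brain, cerebellum, and body layers described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subtask:
    verb: str    # what to do, e.g. "grasp"
    target: str  # what to do it to, e.g. "cup"


def brain_decompose(instruction: str) -> List[Subtask]:
    """'Brain' layer: understand the task and split it into ordered subtasks.

    Hard-coded for illustration; a real system might query a large model here.
    """
    if instruction != "brew a cup of coffee":
        raise ValueError(f"unknown task: {instruction}")
    return [
        Subtask("locate", "coffee machine"),
        Subtask("grasp", "cup"),
        Subtask("place", "cup"),
        Subtask("press", "brew button"),
    ]


def cerebellum_plan(subtask: Subtask) -> List[str]:
    """'Cerebellum' layer: expand one subtask into low-level motion commands."""
    return [f"move_to({subtask.target})", f"{subtask.verb}({subtask.target})"]


def body_execute(command: str, log: List[str]) -> None:
    """'Body' layer: carry out a command (here we only record it)."""
    log.append(command)


def run_task(instruction: str) -> List[str]:
    log: List[str] = []
    for subtask in brain_decompose(instruction):
        for command in cerebellum_plan(subtask):
            body_execute(command, log)
    return log


executed = run_task("brew a cup of coffee")
```

The point of the sketch is only the direction of data flow: language-level intent at the top, motion commands in the middle, execution at the bottom. Sensor feedback, which the text notes is needed by both upper layers, is omitted for brevity.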

/ 03 / Future Embodied Robots Won't Necessarily All Be Humanoid

The reason we use the term "embodied AI" is that its form doesn't necessarily have to be humanoid, a point that's often misunderstood. Indeed, in the Chinese context, the "body" in "embodied" and the "human" in "humanoid robot" can both lead people to assume we're talking about human-shaped machines.

Let me give two examples of embodied AI that aren't humanoid. Take AI agents: in a sense, these can also be understood as a form of embodiment, except that they are entities existing in virtual worlds rather than the physical one. Or take autonomous driving: this too is a form of embodied AI, just one whose "body" happens to be a car.

Humanoid robots will certainly exist as one form in future society, but can they replace robots with other physical forms? Humans became what we are today through natural selection, as products of environmental adaptation. By analogy, in embodied AI, different scenarios call for different types, forms, and functions, and future embodied AI should take whatever form proves most efficient in its particular context.

We often say what distinguishes humans from animals is our ability to create and use tools. So why would we build replicas of ourselves to use the tools we've created, rather than putting intelligence directly into these "tools"? For instance, rather than building a humanoid robot to drive a car, why not simply build a car that can drive itself?

Of course, we must also acknowledge that most modern human living spaces are designed around human needs (stairs, door handles), so humanoid robots are undeniably the most universally adaptable form for our surroundings.

Why does Tesla's entry matter so much for humanoid robots?

Embodied AI isn't new. Boston Dynamics' robotic dogs became world-famous long ago. But investors ask: why didn't more companies try to follow suit back then? Even Boston Dynamics itself seemed to give little thought to product commercialization and industrialization. It wasn't until Tesla entered the picture that the humanoid robot market truly took off.

Elon Musk has always dared to pursue what others wouldn't dream of. He proposed putting humanoid robots to work in Tesla's production, research, and future services. Suddenly, embodied AI was no longer just about "replicating human capabilities"; it had a path to industrial deployment. In this sense, Tesla sparked a great voyage that drew many companies into embodied AI.

/ 04 / The Technical Architecture of Embodied AI

Now, let's break down these three layers of embodied AI in more detail.

The Physical Body: Iteration Constrained by the Physical World

The development of the physical body is constrained by the physical world itself. Its iteration speed is relatively slow, driven more by trial-and-error and accumulated experience. The upside is that it benefits from industrialization — once scale effects are large enough, costs can drop rapidly.

Sensors perceive the robot's own state and its surrounding environment. The scaled development of autonomous driving has already given embodied AI a significant boost, since vehicle intelligence depends on getting all kinds of sensors onto cars. But embodied AI demands even more state-perception capability, so force sensors and tactile sensors are particularly noteworthy directions right now.

For actuators, we divide capabilities into two broad categories: locomotion and manipulation.

On the locomotion side, technology is generally more mature and advancing faster, and we've already seen many products that have gone through multiple rounds of iteration. For example, LimX Dynamics' bipedal robot can already walk in the wild.

But on the manipulation side, while simple grippers, suction cups, and other end effectors are quite mature, they clearly can't meet the requirements of task generality. Human hands can labor, type, perform surgery, conduct experiments, and paint, but nothing on the market comes close to a hand this capable, dexterous, and human-like. The technology still needs breakthroughs. Our portfolio company, Yingshi Robotics, is working in this direction.

The "Cerebellum": Bridging Virtual and Physical Worlds

The cerebellum's work also divides into two parts: planning and control. The final execution of control is handed off to the physical body.

Planning, simply put, means outputting the optimal path for movement (or manipulation) based on task requirements. Simple tasks like moving or grasping can be solved for optimality using mathematical methods. But in the real world, the tasks agents face are often complex, so planning may be constrained by numerous conditions (robot dynamics, environmental factors, etc.) and involve optimizing for multiple simultaneous objectives (shortest, fastest, most energy-efficient, safest).
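As a rough illustration of planning under constraints with several objectives at once, the sketch below scores a few candidate paths with a weighted cost and a hard safety limit. The candidate paths, the weights, and the risk threshold are all made-up numbers chosen for the example; real planners search a continuous space and account for robot dynamics rather than picking from a short list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    length_m: float   # path length
    time_s: float     # time to traverse
    energy_j: float   # energy consumed
    risk: float       # 0 = safe, 1 = very risky (hypothetical score)


# Hypothetical weights that trade the objectives off against each other.
WEIGHTS = {"length_m": 1.0, "time_s": 0.5, "energy_j": 0.01, "risk": 20.0}
MAX_RISK = 0.5  # hard constraint: paths riskier than this are rejected outright


def cost(c: Candidate) -> float:
    """Weighted sum of all objectives; lower is better."""
    return sum(weight * getattr(c, field) for field, weight in WEIGHTS.items())


def choose_path(candidates: List[Candidate]) -> Candidate:
    feasible = [c for c in candidates if c.risk <= MAX_RISK]
    if not feasible:
        raise ValueError("no candidate satisfies the safety constraint")
    return min(feasible, key=cost)


candidates = [
    Candidate("direct", 4.0, 5.0, 120.0, 0.7),       # shortest, but too risky
    Candidate("detour", 6.0, 8.0, 150.0, 0.1),
    Candidate("slow_direct", 4.0, 12.0, 90.0, 0.3),
]
best = choose_path(candidates)
```

With these particular numbers the shortest path is excluded by the safety constraint, and the detour wins over the slow direct path because its lower risk and time outweigh its extra length. Changing the weights changes the answer, which is exactly why multi-objective planning is harder than a single shortest-path query.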

Once planning is complete, the control layer handles how to operate the body's various joints to execute the task along the planned path.

Currently, two mainstream control algorithms dominate the industry: MPC (Model Predictive Control) and WBC (Whole-Body Control).

MPC uses a predefined model to predict the body's next or future states from its current state, compares those predictions against the desired future states, and optimizes the control inputs to close the gap. This approach is limited by model accuracy and state-perception capability: the richer and more precise the sensor data, the more accurate the model's predictions.
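The predict-compare-optimize loop of MPC can be sketched on a toy problem. The example below assumes a one-dimensional cart whose velocity we command directly, a small discrete set of candidate commands, and a three-step horizon searched by brute force; all of these are simplifying assumptions for illustration. Real robot MPC uses far richer dynamics models and numerical optimizers.

```python
import itertools

DT = 0.1                                # control period in seconds
ACTIONS = (-1.0, -0.5, 0.0, 0.5, 1.0)   # candidate velocity commands (m/s)
HORIZON = 3                             # how many steps to look ahead


def model(x: float, u: float) -> float:
    """Assumed model: position advances by velocity * time."""
    return x + u * DT


def rollout_cost(x: float, seq, target: float) -> float:
    """Predict the future under one command sequence and score it."""
    total = 0.0
    for u in seq:
        x = model(x, u)
        total += (x - target) ** 2 + 0.001 * u ** 2  # tracking error + effort
    return total


def mpc_step(x: float, target: float) -> float:
    """Search all sequences over the horizon, return only the first command."""
    best = min(
        itertools.product(ACTIONS, repeat=HORIZON),
        key=lambda seq: rollout_cost(x, seq, target),
    )
    return best[0]


def simulate(x0: float, target: float, steps: int = 30):
    x, trajectory = x0, [x0]
    for _ in range(steps):
        u = mpc_step(x, target)   # re-plan from the latest measured state
        x = model(x, u)           # here the "real" plant equals the model
        trajectory.append(x)
    return trajectory


trajectory = simulate(x0=0.0, target=1.0)
```

Only the first command of each optimized sequence is executed before re-planning, which is what makes the horizon "receding." In this sketch the plant and the model are identical, so tracking is essentially perfect; the dependence on model accuracy noted above shows up as soon as `simulate` uses different dynamics from `model`.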

WBC optimizes control over all of the robot's joints as an integrated whole. This algorithm enables more natural and flexible movement, excelling at handling multi-DOF, nonlinear, high-dimensional robotic systems. However, it too is constrained by computational complexity (real-time performance) and model precision. Most quadruped and bipedal robot motion control today uses both MPC and WBC together.
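The idea of treating all joints as one optimization problem can be shown with a deliberately reduced example. The sketch below is a kinematic toy, not the dynamics-level, contact-aware optimization that production whole-body controllers solve: it assumes a planar arm with three joints and hypothetical link lengths, and at each step solves min ||J·dq − e||² + λ||dq||² for all joint increments together using damped least squares.

```python
import math

LINKS = (0.4, 0.3, 0.2)  # link lengths in meters (hypothetical)


def forward(q):
    """End-effector (x, y) of a planar 3-joint arm with joint angles q."""
    x = y = angle = 0.0
    for length, qi in zip(LINKS, q):
        angle += qi
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return x, y


def jacobian(q):
    """2x3 Jacobian of the end-effector position with respect to q."""
    cum = [q[0], q[0] + q[1], q[0] + q[1] + q[2]]
    J = [[0.0] * 3, [0.0] * 3]
    for j in range(3):
        # Joint j rotates every link from j onward.
        J[0][j] = -sum(LINKS[k] * math.sin(cum[k]) for k in range(j, 3))
        J[1][j] = sum(LINKS[k] * math.cos(cum[k]) for k in range(j, 3))
    return J


def whole_body_step(q, target, damping=0.01, gain=0.5):
    """One damped-least-squares step: dq = J^T (J J^T + damping*I)^-1 e."""
    x, y = forward(q)
    ex, ey = gain * (target[0] - x), gain * (target[1] - y)
    J = jacobian(q)
    a = sum(J[0][j] ** 2 for j in range(3)) + damping
    b = sum(J[0][j] * J[1][j] for j in range(3))
    d = sum(J[1][j] ** 2 for j in range(3)) + damping
    det = a * d - b * b  # positive because the damped matrix is positive definite
    vx = (d * ex - b * ey) / det
    vy = (-b * ex + a * ey) / det
    dq = [J[0][j] * vx + J[1][j] * vy for j in range(3)]
    return [qi + dqi for qi, dqi in zip(q, dq)]


def reach(q0, target, steps=100):
    q = list(q0)
    for _ in range(steps):
        q = whole_body_step(q, target)
    return q


q_final = reach([0.3, 0.3, 0.3], target=(0.5, 0.4))
```

Because the arm has three joints but the task has only two dimensions, the solver distributes the motion across all joints instead of commanding them one at a time. Real whole-body controllers extend the same idea with dynamics, contact forces, torque limits, and prioritized tasks, which is where the computational cost mentioned above comes from.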

So how can we find more accurate models? Currently, the main industry direction is reinforcement learning — essentially, autonomously learning control strategies through continuous trial-and-error and feedback.

Reinforcement learning's most widely known application is AlphaGo, which used deep reinforcement learning to defeat the strongest human Go players. But Go is, after all, a relatively closed scenario with very clear rules. In the real world, the physical environments agents face are certainly far more complex and diverse.

How to train good models through reinforcement learning in complex, uncertain, and dynamic environments (for example, learning to walk on unstable ground or to recognize objects under varying lighting conditions) remains a problem that both academia and industry are working hard to crack.

The success of reinforcement learning depends partly on designing appropriate reward functions, which requires experience. It also typically demands massive amounts of trials and data to learn effective strategies, which can consume enormous time and computational resources.
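A minimal sketch of trial-and-error learning with a hand-designed reward function: tabular Q-learning on a six-cell corridor, where the agent starts at the left end and must reach the right end. The environment, the reward values (+1 at the goal, -0.01 per step), and the hyperparameters are all assumptions chosen for illustration. Real robot tasks have continuous states and typically use deep reinforcement learning.

```python
import random

N_STATES, GOAL = 6, 5
LEFT, RIGHT = 0, 1


def step(state: int, action: int):
    """Toy environment: move one cell; reward shaping is a design choice."""
    nxt = max(0, state - 1) if action == LEFT else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else -0.01  # small penalty for wasted steps
    return nxt, reward, nxt == GOAL


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.choice((LEFT, RIGHT))          # explore
            else:
                a = RIGHT if q[s][RIGHT] >= q[s][LEFT] else LEFT  # exploit
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])      # Q-learning update
            s = s2
            if done:
                break
    return q


q_table = train()
greedy_policy = [RIGHT if row[RIGHT] > row[LEFT] else LEFT for row in q_table[:GOAL]]
```

Even this tiny problem needs hundreds of episodes to settle, which hints at why data and compute requirements balloon for real robots. The per-step penalty is an example of the reward-design judgment the text mentions: without it, the agent has no reason to prefer a short path over a long one.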

One relatively effective strategy is to experiment and learn in simulated environments, then transfer the learned policies to real-world settings (Sim2Real). But this approach remains limited in applicable scenarios, since building simulation environments is costly and the gap with real physics remains large.
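One common way to narrow the simulation-to-reality gap is domain randomization: instead of tuning in a single "nominal" simulator, tune across many simulated worlds whose physical parameters are randomly perturbed. The sketch below is a toy version under stated assumptions. The plant is a one-dimensional positioning task, the unknown parameter is an actuator "efficiency," the controller is a single proportional gain, and the real-world efficiency of 2.2 is an invented value the simulator designer did not anticipate.

```python
import random

GAINS = (2.0, 4.0, 6.0, 8.0, 10.0)  # candidate controller gains
REAL_EFFICIENCY = 2.2               # hypothetical real-world value, unknown at tuning time


def final_error(gain: float, efficiency: float, steps: int = 20, dt: float = 0.1) -> float:
    """Run the proportional controller and return the remaining error."""
    error = 1.0
    for _ in range(steps):
        command = gain * error
        error -= efficiency * command * dt  # the plant responds more or less than commanded
    return abs(error)


def tune_nominal() -> float:
    """Pick the gain that is best in a single simulator with efficiency = 1."""
    return min(GAINS, key=lambda g: final_error(g, 1.0))


def tune_randomized(n_worlds: int = 50, seed: int = 0) -> float:
    """Pick the gain with the best worst case over randomized simulators."""
    rng = random.Random(seed)
    worlds = [rng.uniform(0.5, 2.5) for _ in range(n_worlds)]
    return min(GAINS, key=lambda g: max(final_error(g, e) for e in worlds))


nominal_gain = tune_nominal()
robust_gain = tune_randomized()
nominal_on_real = final_error(nominal_gain, REAL_EFFICIENCY)
robust_on_real = final_error(robust_gain, REAL_EFFICIENCY)
```

The gain tuned in the nominal simulator is perfect there but overshoots and diverges on the "real" plant, while the gain tuned across randomized worlds is more conservative and still converges. The cost is that the randomized range must actually cover reality, which is one reason building good simulation environments remains expensive.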

Another important line of development is imitation learning (behavioral cloning being its most basic form), which lets robots learn new skills by observing and imitating the behavior of humans or other robots. The advantage is that robots can learn new tasks quickly while avoiding extensive trial and error, which is particularly useful for complex skills like collaboration or dexterous manipulation.

Stanford's Mobile ALOHA robot, which went viral earlier, was essentially a large-scale imitation learning platform. By learning from large numbers of human manipulation examples, the robot could autonomously complete certain specific tasks with some success rate. But imitation learning is still considered to lack generalization capability and to depend on massive amounts of data, and these are the main factors constraining its development. The current mainstream ways of collecting that data (teleoperation, motion capture, video, and simulation or synthetic data) each have their own problems. Perhaps the future lies in combining them.
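In its simplest form, imitation learning reduces to supervised learning on recorded (state, action) pairs. The sketch below assumes a hypothetical expert that steers a one-dimensional state toward a target with a proportional rule, records noisy demonstrations, and fits a linear policy by ordinary least squares. The expert, the noise level, and the linear policy class are all assumptions made for illustration. Real systems fit neural networks to images and joint trajectories.

```python
import random


def expert_policy(state: float, target: float = 1.0) -> float:
    """Hypothetical expert: push toward the target proportionally."""
    return 2.0 * (target - state)


def collect_demos(n: int = 200, seed: int = 0):
    """Record noisy (state, action) pairs, as teleoperation data might be."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        s = rng.uniform(-1.0, 3.0)
        a = expert_policy(s) + rng.gauss(0.0, 0.05)  # demonstration noise
        demos.append((s, a))
    return demos


def fit_linear_policy(demos):
    """Least-squares fit of action = w * state + b to the demonstrations."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    var = sum((s - mean_s) ** 2 for s, _ in demos)
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


w, b = fit_linear_policy(collect_demos())


def cloned_policy(state: float) -> float:
    return w * state + b
```

No reward function and no trial and error are needed, which is the appeal. The limitation is also visible: the cloned policy is only trustworthy for states resembling those in the demonstrations (here, roughly -1 to 3), which is a small-scale version of the generalization and data-volume problems noted above.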

To summarize: reinforcement learning is better suited to locomotion tasks, while imitation learning fits manipulation tasks better. But technical approaches are far from converging, and this generalization shouldn't be taken as absolute.

The "Brain" and Large Models

The brain's job is understanding the environment, understanding the task, then breaking the task down into different steps. What distinguishes embodied AI from previous automation equipment is its ability to handle multiple tasks and complex scenarios, plus achieving intelligent, generalized perception and task decision-making.

Take Google's PaLM-E large model, launched in March 2023, as an example. A person can give PaLM-E a task in natural language; the model perceives the environment through camera images and ultimately outputs task instructions in text form. But text-based instructions are all it can output: it can't directly control the robot. Control still has to be handled by the "cerebellum."

The demonstration video of OpenAI plus Figure 01 likewise showed the result of combining OpenAI's "brain" with Figure AI's "cerebellum" and physical body. What truly stunned the industry was probably Figure 01's execution capability: for instance, when picking up an apple, it could generate motion-planning trajectories at 200 Hz and control whole-body joint torques at 1000 Hz.
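The two rates quoted above imply a layered loop in which the fast controller runs several times for every planner update. The sketch below only illustrates that scheduling relationship. The loop structure and the placeholder setpoint are assumptions, not a description of how Figure's software is actually organized.

```python
CONTROL_HZ = 1000   # whole-body torque control rate quoted in the text
PLANNER_HZ = 200    # trajectory generation rate quoted in the text
TICKS_PER_PLAN = CONTROL_HZ // PLANNER_HZ  # control ticks per planner update


def run(duration_s: float = 1.0):
    """Count how often each layer runs when driven by the faster clock."""
    plan_updates = control_updates = 0
    setpoint = 0.0
    for tick in range(int(duration_s * CONTROL_HZ)):
        if tick % TICKS_PER_PLAN == 0:
            setpoint = tick / CONTROL_HZ  # placeholder for a new trajectory point
            plan_updates += 1
        # The controller tracks the most recent setpoint on every tick.
        control_updates += 1
    return plan_updates, control_updates


plan_count, control_count = run()
```

Each planner output is therefore held for five control ticks. The slower layer decides where to go, while the faster layer keeps the joints tracking that decision in between updates.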

This raises the question: can the brain's capabilities extend further downward to subsume some of the "cerebellum's" functions? Can large models do motion planning?

/ 05 / What Have Large Models Brought to Embodied AI?

Google has been one of the more advanced players when it comes to iterating the "brain" of embodied AI. Its RT-1 model is an end-to-end small model built on the Transformer architecture, trained through imitation learning. It takes natural language and images as input and outputs robot motion commands — specifically, the next coordinate position for the base, as well as the position and angle of the end effector. But RT-1's limitation lies in generalization capability. Google's subsequent RT-2 model significantly improved generalization, yet its real-time performance was relatively poor, achieving only 1-3 Hz inference — meaning it could generate one to three commands per second, which is insufficient for motion planning. Additionally, it can only be deployed on the cloud, making it costly, and for now remains at the demo stage.
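A quick calculation shows why 1-3 Hz is too slow for motion planning. The sketch converts a command rate into the time between commands and into the distance an end effector would travel "blind" between them. The 0.5 m/s hand speed is a hypothetical figure chosen for illustration, and the 200 Hz comparison point is the planning rate quoted earlier for Figure 01.

```python
def period_ms(rate_hz: float) -> float:
    """Time between consecutive commands, in milliseconds."""
    return 1000.0 / rate_hz


def travel_between_commands_m(speed_m_s: float, rate_hz: float) -> float:
    """Distance covered before the next command can correct the motion."""
    return speed_m_s / rate_hz


HAND_SPEED = 0.5  # m/s, hypothetical end-effector speed

for rate in (1, 3, 200):
    gap_ms = period_ms(rate)
    drift_cm = 100 * travel_between_commands_m(HAND_SPEED, rate)
    print(f"{rate:>4} Hz: one command every {gap_ms:7.1f} ms, "
          f"about {drift_cm:5.2f} cm travelled in between")
```

At 1-3 Hz the gap between commands is roughly a third of a second to a full second, during which a hand moving at the assumed speed covers tens of centimeters. At 200 Hz the gap is 5 ms and the uncorrected travel is a few millimeters.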

But these efforts do offer us one insight: large models may not yet be suitable for motion planning, and smaller models might still be needed.

We've also seen some companies working in this space. For instance, Covariant, an early FreeS Fund portfolio company, released an 8-billion-parameter small model in March 2024 that can perform end-to-end motion planning tasks with multimodal inputs including images and language. However, it only handles pick-and-place tasks in specific scenarios. The upside is that this addresses a relatively high-frequency need in industrial settings, and an 8-billion-parameter model is well-suited for local deployment.
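A back-of-the-envelope estimate helps explain why a model of this size is a plausible candidate for local deployment. The sketch computes the memory needed just to hold the weights at a few common numeric precisions. The precisions are assumptions, since the source does not say how the model is stored, and the estimate ignores activations and other runtime overhead.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9  # 8 billion parameters, as stated in the text

# Assumed storage precisions (not stated in the source).
footprints = {
    "16-bit": weight_memory_gb(N_PARAMS, 2.0),
    "8-bit": weight_memory_gb(N_PARAMS, 1.0),
    "4-bit": weight_memory_gb(N_PARAMS, 0.5),
}

for precision, gb in footprints.items():
    print(f"{precision:>6}: about {gb:.0f} GB of weights")
```

Under these assumptions the weights occupy roughly 16 GB, 8 GB, or 4 GB, which is within reach of a single on-site accelerator, whereas models that are an order of magnitude larger generally are not.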

So what can large models actually bring to embodied AI? Beyond enhancing the "brain's" capabilities (environmental perception and understanding, natural language task comprehension, and task decision-making), they can also improve certain "cerebellar" functions, such as end-to-end motion planning. But this is constrained by motion complexity (for example, dexterous hands), task generality (including language instructions), latency requirements, compute power, and model scale.

Consider an everyday example: when someone first learns to play tennis, the brain is heavily involved — thinking through how exactly to execute a movement, what posture makes for a more effective swing. After enough training, these movements become muscle memory, and the brain's involvement diminishes.

Of course, some scholars like Fei-Fei Li are attempting to use large models to build 3D world models for embodied AI, aiming to solve the ultimate problems of generalization and universality. This is also very much worth watching.

/ 06 / How Far Is Embodied AI from Us?

After walking through the evolution of embodied AI's three-layer technical architecture, we can see:

The physical body has reached relatively high overall maturity, driven by industrial robotics and autonomous vehicle industries. Still, two directions merit attention:

Higher-precision, lower-cost state perception sensors
End effectors capable of executing more complex tasks

Large models have already proven their capabilities in environmental perception, understanding, and task decision-making. Further improvement here depends on the continued evolution of multimodal large models; and of course, large models have also shown promise in the cerebellum's planning capabilities.

So the core bottleneck today lies in the "cerebellum" — it is simultaneously the intersection of mathematical optimization and data-driven methods, the intersection of software and hardware iteration, and the intersection of virtual and physical entities. This makes the "cerebellum" both the hardest part and the most promising opportunity.

The core of the cerebellum is planning + control. Whether through imitation learning or reinforcement learning, the key to good planning and control algorithms is data. Compared to large language models that can leverage massive internet data, the amount of training data available for embodied AI is far smaller.

The next challenge, then, is how to collect enough data to help agents improve their "cerebellum" capabilities. The main technical approaches currently being explored include teleoperation, simulation environments, learning from human observation, and synthetic data.

/ 07 / Investment Strategy Considerations for Embodied AI

Currently, the rapid iteration of (open-source) multimodal large models based on massive image-text data, combined with the maturation of robot hardware and sensors driven by industrialization, has somewhat lowered the barrier to realizing embodied AI. However, constrained by hardware costs, compute power, adoption speed, generality, success rates, and various other factors, the embodied AI industry as a whole remains in an early stage of development, dominated by demos and research showcases, with relatively little commercial deployment.

Take two familiar examples: autonomous driving and humanoid robots. Both concepts emerged as early as ten years ago, and commercial deployment is still being explored today. Autonomous driving doesn't pursue generality but demands extremely high success rates. Humanoid robots pursue generality, but achieving generalized success in the real world remains difficult.

Therefore, in the short to medium term, we need to identify deployment scenarios that can balance both success rate and generality, while also balancing hardware costs, compute power, response speed, and other factors. Looking further ahead, we believe that based on data accumulated in the near term, there will be opportunities to evolve new algorithmic architectures that can raise the value curve of embodied AI and unlock new scenarios.

At the technical level, advances in novel sensors, end effectors, and "cerebellum" capabilities could all lead to product leaps. The corresponding breakthroughs will be key steps toward a future where embodied AI serves humanity.

The diversity of physical-world task scenarios in embodied AI means this direction has room for numerous startups to participate. Here, large tech companies don't have such obvious advantages, because in the specific R&D process, data is core, and whether a product can achieve the closed loop of scenario — data — algorithm iteration is crucial.

For the Chinese market, the sustained cost-efficiency advantages of the supply chain and rapidly growing scenario demand across the secondary and tertiary industries may become persistent opportunities that catalyze the vigorous development of embodied AI. To some extent, based on China's industrial structure and supply chain foundation, the applicability of embodied AI in the Chinese market may even exceed that of large models.
