Yunqi Capital Research Report | Embodied AI Deep Dive: Core Pain Points and Path to Commercialization

Yunqi Capital · October 29, 2024

How Far Is the Path to General AI?

Since 2024, "embodied intelligence" has been a recurring headline in tech circles, with so many concept products popping up at industry conferences that attendees are practically getting whiplash.

But what separates these products under the hood? Which ones are closer to the kind of step-change embodied intelligence that actually matters? And how far are they from real commercial deployment? Long after the spotlights dim, research and practice around these critical questions have continued.

Embodied intelligence is a domain Yunqi Capital has been deeply immersed in for years. This edition of Yunqi Research Insights distills the essential findings from recent research by the Yunqi Capital team. We'll start from the technical approaches to embodied intelligence and explore its core pain points and path to commercialization.

Key Takeaways

- From intelligence to physical form, multiple links in the embodied intelligence value chain still need to be filled in, and achieving data closed loops across different algorithms and physical embodiments is a critical pain point.
- Different technical approaches directly shape commercialization scenarios and timelines, which in turn affect the formation of data closed loops.
- The key to embodied intelligence commercialization is ROI optimization and declining learning costs for new tasks.
- Deployment scenarios will likely progress from low generalization / low failure cost to high generalization / high failure cost.
- A general-purpose embodied intelligence solution won't emerge soon; rapidly identifying scenarios, establishing data closed loops, and effectively managing technology iteration risk are the critical paths.

As AI technology evolves, embodied intelligence has become the next physical vessel carrying the vision of intelligence beyond traditional robotics. Capable of perceiving environments, understanding tasks, and executing autonomous perception-planning-decision-action loops, it places greater emphasis on generalization than vision-centric traditional robots, and better aligns with human needs for interactivity, flexibility, and adaptability across scenarios. It has thus become one of the most closely watched sectors in tech venture capital in recent years.

Current exploration remains in its early stages, and industry technical approaches have yet to converge. Various players are feeling their way forward along multiple paths, iterating technology and probing commercial applications in limited but differentiated scenarios. How will this costly exploration converge toward genuine embodied intelligence? Every participant is trying to answer this.

We believe different technical approaches directly affect commercialization scenarios and timelines. This article therefore starts from the technical approaches to embodied intelligence, analyzes its core pain points and deployment path, and then returns to the present to examine the barriers startups must overcome given the technical routes they have chosen.

I. Technical Approaches to Embodied Intelligence

1. By Algorithm Architecture

Hierarchical Models & End-to-End Models

Hierarchical models decompose tasks into multiple neural networks at different levels, trained separately and connected in pipeline fashion. In current practice, most robotic manipulation uses hierarchical models. Take Figure as an example: the upper layer uses OpenAI's multimodal large model for visual reasoning and language understanding; the middle layer relies on neural network policies for motion control and action command generation; the bottom layer receives action commands for execution control.

Hierarchical model schematic. Source: Yunqi Capital team research
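The three-layer split described above can be sketched in a few lines of Python. This is a toy illustration only, not Figure's or any vendor's actual stack: the function names (`plan_subtasks`, `policy_step`, `execute`), the rule-based planner standing in for a multimodal model, and the command format are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JointCommand:
    """Low-level command consumed by the execution layer (hypothetical format)."""
    subtask: str
    joint_targets: List[float]


def plan_subtasks(instruction: str) -> List[str]:
    """Upper layer: stand-in for a multimodal model that turns an
    instruction into an ordered list of subtasks (rule-based here)."""
    marker = "hand me the "
    if marker in instruction:
        obj = instruction.split(marker, 1)[1]
        return [f"locate {obj}", f"grasp {obj}", f"hand over {obj}"]
    return [instruction]


def policy_step(subtask: str, dof: int = 6) -> JointCommand:
    """Middle layer: stand-in for a learned policy that maps a subtask
    to joint targets (here a fixed dummy pose per verb)."""
    verb = subtask.split()[0]
    offset = {"locate": 0.0, "grasp": 0.5, "hand": 1.0}.get(verb, 0.0)
    return JointCommand(subtask=subtask, joint_targets=[offset] * dof)


def execute(cmd: JointCommand) -> str:
    """Bottom layer: stand-in for the controller that tracks the targets."""
    return f"executed '{cmd.subtask}' with {len(cmd.joint_targets)} joint targets"


def run_pipeline(instruction: str) -> List[str]:
    """Chain the three separately built layers in pipeline fashion."""
    return [execute(policy_step(s)) for s in plan_subtasks(instruction)]


if __name__ == "__main__":
    for line in run_pipeline("hand me the apple"):
        print(line)
```

In this sketch each layer depends only on the output format of the layer above it, which is what allows the layers to be trained separately and then connected.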

End-to-end models handle the entire process, from task goal input to behavioral command output, in a single neural network. Traditional control algorithms struggle to handle complex scenarios and long-tail problems efficiently, which is why end-to-end models are seen as having greater upside.

But end-to-end models are constrained by currently available data volume, compute, and other resources. For instance, robot end effectors operate at very high control frequencies, yet current edge computing power cannot support real-time inference at those rates. Setting aside the regulatory impact on deployment, we believe it will take at least five more years for end-to-end model technology to converge to a maturity level comparable to L4 autonomous driving.

End-to-end model schematic. Source: Yunqi Capital team research
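The compute constraint on end-to-end models can be made concrete with a back-of-the-envelope latency budget. The control rate and inference latencies below are hypothetical placeholders chosen for illustration, not measurements from this report.

```python
def step_budget_ms(control_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    return 1000.0 / control_hz


def max_control_hz(inference_ms: float) -> float:
    """Highest control frequency a model can sustain if every step
    waits for one full forward pass (no batching or action chunking)."""
    return 1000.0 / inference_ms


def meets_realtime(control_hz: float, inference_ms: float) -> bool:
    """True if one inference fits inside one control period."""
    return inference_ms <= step_budget_ms(control_hz)


if __name__ == "__main__":
    control_hz = 500.0  # hypothetical end-effector control rate
    candidates = [("small policy", 1.5), ("large end-to-end model", 80.0)]
    for name, latency_ms in candidates:
        print(
            f"{name}: budget {step_budget_ms(control_hz):.1f} ms, "
            f"latency {latency_ms} ms, "
            f"sustains {max_control_hz(latency_ms):.1f} Hz, "
            f"real-time: {meets_realtime(control_hz, latency_ms)}"
        )
```

Under these assumed numbers, a model that needs 80 ms per forward pass could drive the loop at only 12.5 Hz, far below the assumed 500 Hz control rate.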

2. By Learning Method

Model training methods include supervised learning, unsupervised learning, reinforcement learning, imitation learning, transfer learning, and others. All typically require large amounts of training data, whether labeled examples, demonstrations, or interaction experience.

Among these, reinforcement learning is currently one of the most mainstream methods for training robot control models. The robot interacts continuously with its environment, repeatedly tries different actions, and gradually learns an optimal behavioral strategy from the rewards or penalties it receives.
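The trial-and-reward loop can be illustrated with the simplest reinforcement learning setting, a multi-armed bandit with epsilon-greedy exploration. This is a minimal sketch, not a robot control algorithm: the function name `train_bandit`, the three actions, and their fixed rewards are assumptions made for the example.

```python
import random
from typing import List, Sequence


def train_bandit(rewards: Sequence[float], steps: int = 500,
                 epsilon: float = 0.2, seed: int = 0) -> List[float]:
    """Learn action values by trial and error with epsilon-greedy exploration.

    rewards[a] is the (deterministic, hypothetical) reward or penalty
    the environment returns whenever action a is taken.
    """
    rng = random.Random(seed)
    n_actions = len(rewards)
    values = [0.0] * n_actions  # current estimate of each action's value
    counts = [0] * n_actions    # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: try a random action
        else:
            # exploit: pick the action with the highest current estimate
            action = max(range(n_actions), key=values.__getitem__)
        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        # Incremental mean: move the estimate toward the observed reward.
        values[action] += (reward - values[action]) / counts[action]
    return values


if __name__ == "__main__":
    # Action 0 is penalised, action 1 is best, action 2 is mediocre.
    learned = train_bandit([-1.0, 1.0, 0.2])
    best = max(range(len(learned)), key=learned.__getitem__)
    print("learned values:", learned, "best action:", best)
```

Real robot training replaces the three fixed actions with continuous motor commands and the fixed rewards with task outcomes, but the explore, act, observe, update cycle is the same.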

Transfer learning is another important training method adopted in practice. It applies knowledge and patterns learned in one task or domain to related but different tasks or domains. Google's RT-1 is an example: RT-1's own hardware had never learned box grasping, but training RT-1 on box grasping data from KUKA robotic arms enabled it to acquire a preliminary box grasping skill.

Technical approaches by learning method. Source: Yunqi Capital team research

3. By Data Source

Embodied intelligence data mainly falls into robot datasets and human datasets.

Robot datasets typically include data related to robot perception, motion, and environment interaction. Such data can provide broad strategies and behavioral patterns for robot training, but their acquisition requires substantial time and resource costs, plus complex data annotation.

Human datasets are collected from human actions, poses, and object interactions and then used to train robots. Such data generalizes well and performs well in relatively fixed, standardized scenarios such as factory production lines and home environments, but it demands tighter coupling between the algorithm and the robot's hardware form, which can make subsequent iteration more difficult.

From a data source perspective, technical approaches can also be divided into real datasets and generated datasets.

In theory, real data collected directly in actual scenarios is the most straightforward training dataset. But real data meeting training precision requirements is difficult and costly to obtain, while generated data is relatively low-cost and easily obtainable at scale. There are two paths for generated data: first, collecting robot data through teleoperation or transfer learning methods; second, using simulators to provide simulation environments for robots, generating massive data at low cost.

Technical approaches by data source. Source: Yunqi Capital team research

II. Pain Points and Commercialization Path of Embodied Intelligence

1. Core Pain Points

Large-scale, high-quality real-world data is one of the most important factors for embodied intelligence to achieve intelligence. Forming a flywheel effect of product capability, data scale and quality, intelligence level, and commercial value is key to ultimate deployment. Viewed through this lens, achieving data closed loops across different algorithms and physical embodiments is one of the core pain points of embodied intelligence. The main current difficulties in reaching this closed loop lie in data heterogeneity and hardware heterogeneity.

1) Data Heterogeneity

Data heterogeneity means that collected and processed data varies in type, source, structure, and other dimensions. As a result, even though the robotics field has accumulated abundant open-source data, the datasets are difficult to use jointly: specific data must be collected for each robot, task, and environment.

From this perspective, startups with substantial proprietary data have a competitive advantage. Simulators, efficient data collection equipment, and other tools can also help compensate for data gaps. But current simulators are not yet ready for industrial use: there is a Sim2Real domain gap between simulation environments and the real world, so even a 100% success rate in simulation does not guarantee perfect transfer to the real world.

2) Hardware Heterogeneity

Hardware heterogeneity mainly refers to the lack of standardization in embodied intelligence hardware: robots differ in degrees of freedom, end effectors, motion controllers, workspace configurations, and other physical properties. As a result, an algorithm loses precision when moved from one hardware platform to another.
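One way to see why data and policies do not move freely between robots is to describe each body as a small record and compare the fields. The sketch below is illustrative only: `EmbodimentSpec`, its fields, and the rule in `can_replay_actions` are hypothetical simplifications, not an industry standard.

```python
from dataclasses import dataclass, fields
from typing import Dict, Tuple


@dataclass(frozen=True)
class EmbodimentSpec:
    """Minimal description of a robot body (fields chosen for illustration)."""
    name: str
    dof: int            # degrees of freedom of the arm
    end_effector: str   # e.g. "parallel_gripper", "dexterous_hand"
    controller: str     # e.g. "joint_position", "joint_torque"
    reach_m: float      # rough workspace radius in metres


def spec_mismatches(src: EmbodimentSpec,
                    dst: EmbodimentSpec) -> Dict[str, Tuple[object, object]]:
    """Return every physical property (except the name) on which two robots differ."""
    return {
        f.name: (getattr(src, f.name), getattr(dst, f.name))
        for f in fields(src)
        if f.name != "name" and getattr(src, f.name) != getattr(dst, f.name)
    }


def can_replay_actions(src: EmbodimentSpec, dst: EmbodimentSpec) -> bool:
    """Assume recorded joint-space actions are directly reusable only when
    the action space (DOF, end effector, controller) is identical."""
    blocking = {"dof", "end_effector", "controller"}
    return not (blocking & set(spec_mismatches(src, dst)))


if __name__ == "__main__":
    arm_a = EmbodimentSpec("arm_a", 6, "parallel_gripper", "joint_position", 0.85)
    arm_b = EmbodimentSpec("arm_b", 7, "dexterous_hand", "joint_torque", 0.85)
    print(spec_mismatches(arm_a, arm_b))
    print("actions reusable:", can_replay_actions(arm_a, arm_b))
```

Any mismatch flagged here means the data from one robot needs retargeting or re-collection before it can train a policy for the other.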

3) Startup Closed-Loop Paths

To collect data quickly and get as close as possible to an algorithm-embodiment data closed loop, most startups currently choose to optimize algorithms and physical hardware jointly. In the medium to long term, however, we still expect software and hardware to decouple gradually and hardware barriers to decline. Different application scenarios need different hardware architectures to be served efficiently, so algorithms must transfer across embodiments and support data closed loops across different algorithms and embodiments. Data-driven algorithms will then become the core competitive advantage.

2. Commercialization Markers and Deployment Path

Current embodied intelligence solutions rely on large-scale data training, so in near-term applications such as industrial sorting their ROI is lower than that of traditional industrial robots. In the long term, however, data- and model-driven embodied intelligence has strong scale effects: the marginal cost of each improvement in intelligence declines, and the approach gradually becomes competitive with existing solutions.

In short, ROI optimization and declining learning costs for new tasks are important markers of embodied intelligence commercialization capability. This requires efficient data collection in specific scenarios to continuously iterate and converge embodied models.
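The interplay between ROI and declining learning costs can be sketched as a simple break-even calculation. The cost model is an assumption made for illustration: a traditional robot pays the same integration cost for every new task, while an embodied system pays a large upfront cost plus a per-task learning cost that shrinks geometrically. All figures are hypothetical and in arbitrary units.

```python
from typing import Optional


def traditional_cost(n_tasks: int, per_task: float) -> float:
    """Cumulative cost when every new task needs the same integration effort."""
    return n_tasks * per_task


def embodied_cost(n_tasks: int, upfront: float,
                  first_task: float, decay: float) -> float:
    """Cumulative cost when the k-th task (k = 0, 1, ...) costs first_task * decay**k."""
    return upfront + sum(first_task * decay ** k for k in range(n_tasks))


def break_even_tasks(per_task: float, upfront: float, first_task: float,
                     decay: float, max_tasks: int = 1000) -> Optional[int]:
    """Smallest task count at which the embodied system is no more expensive,
    or None if that never happens within max_tasks."""
    for n in range(1, max_tasks + 1):
        if embodied_cost(n, upfront, first_task, decay) <= traditional_cost(n, per_task):
            return n
    return None


if __name__ == "__main__":
    # Hypothetical figures: learning cost halves with each new task.
    print(break_even_tasks(per_task=10.0, upfront=50.0, first_task=20.0, decay=0.5))
    # With no decline in learning cost, the embodied system never catches up.
    print(break_even_tasks(per_task=10.0, upfront=50.0, first_task=20.0, decay=1.0))
```

With these assumed numbers the embodied system breaks even at the ninth task, and never breaks even if its learning cost does not decline, which is why the decline itself is the marker to watch.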

Considering the generalization requirements and failure costs of different scenarios, we believe embodied intelligence is best deployed first in research settings, where generalization requirements and failure costs are both low, then expanded gradually to industrial and commercial services over the medium to long term, with home service as the endgame vision.

At different stages, embodied intelligence operation tasks, hardware forms, and other aspects also differ.

- Short-term deployment scenario characteristics: Single tasks, with effective generalization within operational tasks (operational object generalization); relatively high scenario fault tolerance.

- Medium to long-term deployment scenario characteristics: Algorithms gradually mature and hardware forms stabilize; algorithms and hardware begin to decouple in simple scenarios; systems can generalize across both the objects being operated on and the objects doing the operating in some environments; model architecture gradually shifts toward end-to-end.

Taking commercial service scenarios as an example: in the short term, embodied intelligence suits simple tasks with some generalization requirements in closed or semi-closed environments, where interaction with humans is infrequent and it replaces a small portion of human work, such as retail inventory checking and restocking. In the medium term, it moves to relatively complex tasks with generalization requirements and interacts with humans in shared spaces, as in food delivery. In the long term, it can handle tasks with higher professional requirements and complexity while coexisting with humans in shared spaces, as in public services.

In terms of form, robots may evolve from wheeled chassis + robotic arm + gripper toward commercial cleaning robot + robotic arm + dexterous hand (industrial/commercial scenarios) and wheeled/bipedal + high-DOF robotic arm/dexterous hand (home service scenarios).

III. Technical Route Differences Among Startups

1. How Does Route Choice Affect Business Model?

As mentioned earlier, different technical approaches directly affect embodied intelligence commercialization scenarios and timelines. Currently, active market "players" differ in product form and deployment path choices, which are closely tied to their technical approaches. We believe that while technical approaches have no long-term moat, short-term technical route choice affects algorithm-hardware coupling requirements, which in turn affects company business models and progress.

Taking algorithm architecture as an example: most startups currently use hierarchical models, which decompose tasks into multiple neural networks at different levels that are trained separately and connected in pipeline fashion. The more independent neural networks there are, the more joint training with hardware is required and the harder decoupling becomes. Companies on this technical path must therefore bind hardware and algorithms tightly during training.

Consequently, such companies mostly ship their software together with the robot, and some develop a very high share of their hardware in-house. Their early deployment scenarios also focus on niches with low generalization requirements and high operational precision. Figure, for example, develops its own hardware architecture and some core components, initially targeting industrial assembly applications.

For companies choosing the end-to-end route, bundled hardware is not mandatory. Some startups on this route use a pure software delivery business model, which currently appears relatively far from commercial deployment.

2. How to Go Further from an Imperfect Starting Line?

In the short term, an embodied intelligence solution adaptable to all scenarios and operations is unlikely to emerge, and 60-70% operation success rates are still far from deployment. Bridging this gap depends on breakthroughs at critical bottlenecks across algorithms, data, hardware, and multiple other elements.
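A short calculation suggests why per-operation success rates of 60-70% fall well short of deployment once operations are chained into a task. The sketch assumes each operation succeeds independently and that a task needs every operation to succeed; the five-step task and the 95% target are hypothetical choices, not figures from this report.

```python
def task_success(per_step: float, n_steps: int) -> float:
    """Probability that all n_steps operations succeed, assuming independence."""
    return per_step ** n_steps


def required_per_step(target: float, n_steps: int) -> float:
    """Per-operation success rate needed to reach a target task success rate."""
    return target ** (1.0 / n_steps)


if __name__ == "__main__":
    for p in (0.6, 0.7):
        print(f"per-step {p:.0%}: 5-step task succeeds "
              f"{task_success(p, 5):.1%} of the time")
    print("per-step rate needed for 95% over 5 steps: "
          f"{required_per_step(0.95, 5):.3f}")
```

Under these assumptions a 70% per-operation rate yields roughly a 17% success rate on a five-step task, so small per-step gains matter a great deal.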

Take data as one element. Identifying differentiated deployment scenarios enables earlier deployment and data collection; this gradually reduces the marginal cost of learning new tasks and capabilities and moves continuously toward data- and model-driven scale effects, which is an important path to unlocking the core value of embodied intelligence. Meanwhile, when a technical route switch is needed, finding effective ways to manage the resulting data migration costs is also critical.

Beyond this, algorithm architecture iteration, compute resource evolution, improvement in embodiment-algorithm coupling, hardware cost reduction and efficiency gains, and other factors are all important topics on embodied intelligence's commercialization path. Around these issues, we will continue thinking and practicing alongside embodied intelligence research and commercial progress.

We welcome readers who are also closely following embodied intelligence to connect with us.