Cewu Lu on Embodied Intelligence: The Data Curse, First Principles, and the Two-Stage Rocket Large Model | Gaorong Ventures

高榕创投高榕创投·July 22, 2024

Starting from first principles, breaking the data curse in embodied intelligence.

This June, Cewu Lu, a professor at Shanghai Jiao Tong University, released a video demonstrating an embodied intelligence "brain" taking direct control of a robotic arm to shave his own face. Behind this seemingly simple task lay enormous challenges — the embodied intelligence large model, embedded with high-precision force feedback modules, had to make split-second decisions about pressure and tangential force adjustments based on the professor's head movements, delivering a clean shave without causing injury.

At the July 2024 World Artificial Intelligence Conference, Lu and his team further demonstrated how this embodied brain empowers physical robots to accomplish a series of complex tasks. For instance, a crumpled shirt tossed casually onto a surface could be folded neatly by a dual-arm adaptive robot — this general flexible object folding skill (AnyFold) involves manipulating objects with infinite degrees of freedom, requiring object understanding comparable to human capability.

Similar to the shaving demonstration, the team also showcased general object surface shaving skill (AnyShave) through cucumber peeling. The embodied brain enables robots to operate on irregular curved surfaces with precision that even exceeds human levels.

Lu has spent years working in embodied intelligence, robot learning, and computer vision. He is a Changjiang Scholar Distinguished Professor, a 2023 recipient of the Xplorer Prize (currently the only one in the embodied intelligence field), and chief AI scientist at Flexiv, a general-purpose intelligent robotics company. In 2023, Flexiv strategically incubated Noematrix, an ecosystem company focused on large-model embodied intelligence technology, with Professor Lu serving as co-founder.

At this year's WAIC, the Noematrix Brain made its public debut, performing a series of real-time demonstrations and live interactions using Flexiv's adaptive robots, comprehensively demonstrating its generality and robustness. Lu also delivered a speech titled The Noematrix Brain and Scaling Laws for Embodied Intelligence, sharing his views on the key elements, commercial products, and future trends for developing embodied intelligence.

In the AI large model field, the Scaling Law represents an important empirical finding: model performance consistently improves as model size, data volume, and training time increase. OpenAI validated the success of Scaling Laws in language and vision large models through ChatGPT and Sora.

So, has the embodied intelligence field found its own "Scaling Law"? How can embodied intelligence large models achieve broader real-world applications?

Lu believes that we cannot simply replicate the language model Scaling Law approach to build embodied intelligence large models, because embodied intelligence operates in a massive data space where extreme uncertainty and data collection difficulty create a "data curse" at the current stage.

He then approached from first principles, proposing a two-stage rocket solution built around the "Physical World Foundation Model" and the "Robot Behavior Foundation Model." By chaining these two stages together for end-to-end joint training, the growth slope can be significantly increased, making training low-cost and scalable.

The following is Professor Cewu Lu's presentation:

As we know, embodied intelligence is an intelligent system that perceives and acts through a physical body — learning, deciding, and acting through interaction between the agent and its environment. Because current language and vision large models lack data from actual physical interaction between bodies and the world, they do not fully capture the physical world laws required for embodied intelligence research. Simply increasing data volume on top of these models cannot satisfy the needs of embodied intelligence development.

So what if we replicate the language model Scaling Law and massively fill in end-to-end "vision" to "control" data for model training? Could we obtain an embodied intelligence large model with sufficiently superior performance?

The answer is that even if this is a logically sound path, it currently faces many bottlenecks. The biggest issue is the vastly different levels of data acquisition difficulty. Looking at the evolution of past language and vision large models, we see that the booming internet provided massive amounts of visual and linguistic data — data accumulation was essentially a crowdsourced, universal behavior.

However, embodied intelligence data requires 1:1 collection, and its data space is enormous, generating prohibitive costs atop massive data demands. Take autonomous driving, which similarly requires collecting "vision" to "control" data: over the past three years, roughly 100,000 vehicles equipped with advanced simulators collected this type of data, barely reaching a somewhat usable level. Yet in terms of operational, scenario, and simulation complexity, embodied intelligence agents (such as general-purpose robots) face at least several dozen times more uncertainty compared to autonomous vehicles.

This enormous uncertainty makes the required data space for embodied intelligence massive, creating a data curse. Therefore, when trying to effectively and rapidly advance embodied intelligence, we can step outside the "path" itself and approach from first principles to consider what factors are critical for completing embodied intelligence tasks.

From first principles for embodied intelligence large models: first, it must understand the physical world, knowing "what the world is"; second, it must know "how to decide" in order to demonstrate sufficiently robust behavior. Combined with language and vision large models for pre-training or assistance, through joint training on operation-related physical commonsense and behavioral decision-making with force feedback embedded in the agent process, embodied intelligence can grow rapidly.

From this we built two large models that can be viewed as a two-stage rocket for advancing embodied intelligence:

The first stage is the Physical World Foundation Model, which enables robots to acquire commonsense, low-dimensional operational physical representations during training, thereby understanding objective physical facts and aligning with human concepts. The second stage is the Robot Behavior Foundation Model, which fully couples operational physical commonsense representations with the high-precision force feedback capabilities of the agent (robots, for example), thereby making human-like force-position hybrid behavioral decisions with excellent robustness and generality in operation.

When the two-stage rocket is chained together for end-to-end joint training, data volume requirements drop substantially and the growth slope becomes more pronounced, making training sufficiently low-cost and scalable.

To continuously train the Physical World Foundation Model, we need to effectively acquire object operational structure data.

On one hand, we discovered the duality between human hand operations and object embodied knowledge, so we built a human hand operation learning platform. By observing large amounts of hand operations, we can discover operational representations that help the model acquire operational topological structure commonsense.

On the other hand, an effective virtual environment that simulates the real world and supports physical interaction is essential. We developed our own embodied intelligence simulator RFUniverse (RSS 2023 & IROS 2022 Best Paper). Combined with a series of machine learning techniques, RFUniverse can simulate the physical world at 500x speed with under 1mm error, making simulation scenarios closer to real physical laws while enabling large models to understand commonsense in a task-centric manner, achieving coupling between simulation and learning.

At this WAIC, our mechanical arm demonstration of clothes folding showcased top-tier task-centric physical commonsense understanding capability. In the AI world, understanding of manipulation objects increases with their degrees of freedom: completely rigid bodies are six-dimensional, articulated bodies are 6+k dimensional, but flexible objects like clothes have infinite-dimensional degrees of freedom. Therefore, completing folding operations from arbitrary initial clothing states requires a massive breakthrough in object and operational commonsense understanding. This research made us the first Chinese team in history to receive a Best System Paper nomination at the top international conference RSS 2023, and we believe we are also the first team globally to present complete real clothes folding at a public exhibition.

Based on understanding of operational physical commonsense, we also need to acquire sufficient force-position hybrid operational data. Traditional position-control large models only need position information, but operation without force makes end-effector manipulation insufficiently robust and general.

Currently, we are using various combined data solutions and equipment, such as a globally unique high-precision force teleoperation platform to acquire high-precision aligned force-position hybrid data, achieving "Pao Ding's dissection of an ox" level precision. We also built a mechanically fully mapped exoskeleton data collection platform that operators can wear anywhere, enabling convenient, scalable, low-cost source data collection.

Using these data generation solutions as tools, we participated in building the largest open-source real robot dataset to date, the Open X-Embodiment Dataset, which already contains over one million real robot trajectories across 22 robots, frequently cited by many authoritative figures — we welcome everyone to use it.

Built atop all these technical accumulations shared above, we officially released at this WAIC a general-purpose brain for embodied intelligence: the Noematrix Brain.

The Noematrix Brain features a full-stack embodied intelligence technology framework, providing "force-centric" embodied intelligence foundation models (the Physical World Foundation Model and Robot Behavior Foundation Model), the AnySkill atomic skill library, foundational software framework, and related developer toolchains. It can organically integrate with various robot platforms and even industrial equipment, helping robots easily master more skills and achieve more applications.

Beyond the brain itself, at the practical solution level, we can provide customers with highly generalizable, reusable hardware-software integrated platforms. Through modular combinations of different hardware forms, we meet actual needs across various scenarios.

Additionally, based on the Noematrix Brain, Noematrix provides a continuously expanding robot atomic skill library AnySkill, thereby giving intelligent agents general manipulation capabilities. The general grasping skill AnyGrasp, first released in 2021, is representative. At its initial release, AnyGrasp was already unrestricted by object type or flexibility, capable of directly grasping unknown objects with extremely fast detection speed — for the first time globally, it enabled robot grasping speed to reach human level. Through continuous optimization, AnyGrasp now possesses dynamic object grasping, high-precision force-aware grasping, diverse texture processing, and other generalization capabilities.

AnySkill, in my view, is actually a form of Scaling Law by skill. By pushing the robustness and generality of foundational skills to 99.X%, it produces a capability leap that can be observed as qualitative growth in performance. Since the vast majority of human task completion comes from combinations and permutations of foundational skills, AnySkill can use the most streamlined set of atomic general skills, combined in diverse ways and assisted by language and vision large models, to support rapid development across various scenarios.

In the future, through continuous growth of unified models and atomic general skills, the commercializable tasks we can unlock will multiply exponentially, until the unified model forms a skill space where all skills are sufficiently general to cover all industries.

When agents are empowered by embodied intelligence, they can become human assistants across many industries: tedious tasks like installing a screw on an industrial production line, dangerous missions like disassembly and demolition in extreme scenarios, delicate work like household chores, cooking, and patient care closely tied to daily life...

We will continue using technology to advance the industry, looking forward to this day arriving soon.