Investment, Technology, Commercialization — Where Is the "AI + Robotics" Wind Blowing? | Yunqi Capital Attent!on · Embodied Intelligence Recap

Yunqi Capital · August 28, 2024

Unlocking a Future You Can Touch

What new changes is AI bringing to the robotics industry? How far is the critical inflection point for embodied intelligence? Real data vs. simulated data — how to choose? What are the business models for embodied intelligence?

On August 20, on the eve of the World Robot Conference, Yunqi Capital hosted its 「Attent!on」AGI+ Salon · Embodied Intelligence Special in Beijing. Together with SEEFund, an early-stage investment firm spun out of Tsinghua University's Department of Electronic Engineering, and with the collaborative support of Beijing Zhongguancun Entrepreneurship Street Technology Service Co., Ltd. and Amazon Web Services, we gathered with nearly 100 industry peers deeply engaged in embodied intelligence to debate the real questions shaping the sector's future.

We've excerpted some highlights from the salon to share with you, hoping to spark more valuable exchange and innovation.

➤➤➤ Around key issues including investment and research dynamics, frontier research, technical breakthroughs, and commercialization, guests engaged in in-depth discussions through keynote speeches and three roundtable sessions.

Notably, the speakers spanned multiple domains across the embodied intelligence ecosystem — investment, cutting-edge research institutions at home and abroad, and the upstream and downstream industrial chain spanning computing, motors, robotic arms, dexterous hands, and humanoid robots. From their diverse perspectives, they addressed the industry questions we all care about, offering as complete a cognitive map as possible.

Selected guests from

Yunqi Capital, SEEFund, China Growth Capital, Amazon Web Services, Tsinghua University Institute for AI Industry Research, Beijing Institute for General Artificial Intelligence

Astribot, RealMan Robotics, Galaxy Universal, Songyan Dynamics, Zhaowei Machinery & Electronics

「Investor Dialogue: How Hot Is AI + Robotics?」

AI + Robotics: What Is the Critical Inflection Point? Which Scenarios Can Actually Work?

Yu Chen, Partner, Yunqi Capital

Yunqi first invested in robotics back in 2016, when China's population policies underwent major adjustments, signaling that aging and declining birth rates would become defining demographic features, with labor costs rising as a near-certain trend. Could robots supplement the workforce and take over repetitive or hazardous tasks? With this straightforward premise, we entered the space, first investing in autonomous driving and gradually adding robotics projects. Our investment criteria boiled down to three things: useful, large market, affordable. But technology was constrained back then — computing power, sensors, and other capabilities lagged far behind today's standards, so most robots we invested in were relatively specialized in function.

With significant advances in computing and hardware, we've seen this wave of embodied intelligence investment heat up. Compared to the previous robotics wave, the biggest difference with embodied intelligence is generalization. For example, once a robot learns a task, it can apply that skill across 100 different scenarios, or in the same environment, quickly learn and execute 100 different actions. This is widely seen as the path from the physical world to AGI. Because of this generalization capability, this generation of robots also places less emphasis on specific scenarios. Previously, when we invested in robots, we had to think very clearly about which industry or job it would replace; now we focus more on the robot's own learning capability.

Zhen Zhao, Partner, SEEFund

SEEFund began looking at embodied intelligence in the first half of last year and has made investments in humanoid robots and related areas. Returning to industry logic, large models represent a technical paradigm driven by high-quality data and massive data volumes, potentially allowing intelligence to emerge across many scenarios through scaling law. Can the same logic be replicated in robotics? This is how we approach deal sourcing, and we believe this generation of entrepreneurs can innovate and build along this logic.

On application scenarios, in the long run we must simultaneously achieve closed loops in business workflow, commercial value, and data and foundational R&D. In the medium term, I'm personally more optimistic about scenarios that sit between fully open-ended and highly controlled environments — where problems are simplified, data is more refined, and existing algorithms can be iterated to build complex task-completion capabilities.

Weiming Xiong, Founding Partner, China Growth Capital

Back in 2017, the robotics companies we invested in were more like distributed systems, with eyes here and arms there, and had not achieved true "embodied intelligence." This wave of enthusiasm for embodied intelligence is undoubtedly driven by the arrival of large models. Since 2022, large model progress has accelerated dramatically, with an impact on NLP fundamentally different from when BERT and the Transformer emerged, truly pushing NLP toward a qualitative leap. How do we compress large models into robotics applications? Our thinking is that even if we can't reach full embodiment in one step, we can start by combining them with specific commercial scenarios, for instance by building out domestically the scenarios that are missing or underdeveloped in overseas markets. This is the approach we're currently pursuing in our deal sourcing.

Competitive Landscape Outlook: Big Fish Eat Small Fish or All-Out Competition?

Weiming Xiong, Founding Partner, China Growth Capital

Looking back at the development of smartphones and automobiles, there were massive entrepreneurial opportunities during the "capital flooding" phase, but ultimately only the top-tier talent and top-tier capital could fight the battle from start to finish. I think hardware companies may follow a similar pattern. At this stage, the real competitors may not even have appeared yet. What's most important is clearing the path to commercialization — finding your business model as quickly as possible. While we emphasize the generalization capability of embodied intelligence, finding scenarios where the numbers work and opening up commercialization step by step is crucial.

Zhen Zhao, Partner, SEEFund

Everyone is still in exploration mode now; in the long run, there won't be many winners. At the current stage, hardware-software synergy at the system level is very important. In the medium term, the critical bottleneck is data — whoever can acquire more and better data has a chance to seize the opportunity. From there, expanding upstream or downstream in hardware or applications could secure an advantageous position. Additionally, from an application perspective, B2B and B2C scenarios may diverge significantly in capability requirements. If B2C enters homes or provides emotional value, the demands for intelligence and generalized task execution are extremely high, naturally creating a large gap from B2B.

In the long run, under the underlying assumption that scaling law can push intelligence to excellent levels for a considerable time, scenario, data, and system capability are what determine the endgame.

Yu Chen, Partner, Yunqi Capital

Costs remain high, so in the short term there won't be much substantive B2C application of embodied intelligence. Most players are now exploring applications in factory automation, for example flexible manufacturing where one robot handles operations across different workstations. The advantage of going in this direction is having users willing to help you refine your product and technology. It is a way to become self-sustaining, allowing you to survive until the day when technological development drives costs down rapidly and enables you to push into broader markets. In entrepreneurship, survival is paramount. You need to find investors who will walk the journey with you and provide ammunition, but also a good customer base willing to help you refine your product and accumulate technology.

「Industry-Research Collision: How Do Space and Intelligence Translate to Reality?」

What Was the Biggest Inflection Point for Embodied Intelligence in the Past Year?

Peiqing Xiao, Senior Startup Solutions Architect, Amazon Web Services

Over the past year, we've observed that the embodied intelligence ecosystem has become increasingly vibrant. Beyond companies building the physical robots themselves, we're seeing more and more startups in data and middleware layers — for example, in data, simulation and VR-based data generation; in middleware, PaaS and development platforms that assist with robot operations and provide SDKs.

Ke Fang, Co-founder and CFO, Astribot

Over the past year, large models have advanced from pure language models to multimodal models, and in the last six months we've seen spatial models emerge in the US. Spatial intelligence or spatial models are critically important technologies for embodied intelligence.

Looking further out, embodied intelligence is essentially the vehicle through which spatial models interact with the world. In the near term, embodied intelligence is an important means for spatial intelligence to extend into the physical world and collect data, and this will be a major research direction over the next three years.

Siyuan Huang, Head of General Vision Lab, Beijing Institute for General Artificial Intelligence

My own research focuses on 3D scene understanding and embodied intelligence. A major trend has been the maturation of components such as joints and motors; more mature, stable robot hardware is bringing us closer to the dream of building general-purpose robots. But the full potential of this hardware is still far from being realized.

How to use models to make robot bodies move more smoothly, and whether a unified model can enable different bodies to perform similar tasks — these remain open problems.

Technical Path Selection and Bottleneck Breakthroughs

Siyuan Huang, Head of General Vision Lab, Beijing Institute for General Artificial Intelligence

Personally, I don't believe there's a golden rule for technical path selection — it depends on the specific application and scenario's requirements for generalization. I often use two axes for evaluation: one is generalization, the other is the dexterity required by the task. These two axes largely determine how algorithms need to be designed at this stage. For example, some current designs use large models connected to various smaller models — this works for solving specific application needs, but from a technology development perspective it feels more like a stopgap measure for the current stage. I believe this kind of modular algorithm will eventually be replaced by more fundamental models.

A second point concerns simulated vs. real data. From the perspective of robot action libraries, simulated data can help robots learn some difficult skills that are hard to acquire from real data. But models trained in simulation still have significant limitations in generalization performance, so the direction I ultimately want to promote is using more real data combined with simulated data for embodied intelligence training.

Hao Zhao, Assistant Professor, Tsinghua University Institute for AI Industry Research

Let me share some thoughts on simulated and real data. For our generation of researchers, there's a consensus that data matters — products with more data will necessarily be better than others. In the large model era, I don't think we've moved beyond this paradigm. Everyone is still collecting better data and training with larger models. The hardest part of simulated data for embodied intelligence is that it requires geometric relationships, so 3D reconstruction and simulation have become more important in this era.

A recent paper argued that training purely on simulated data can make large language models "dumber," and that seems clear to me as well. Training on simulated data definitely needs real data to activate it; the two should have a multiplicative relationship. Only then does the resulting data solve real problems. So real data and simulated data must be combined.
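
One simple way to picture "combining" the two sources is a training batch that always mixes real and simulated samples at a fixed ratio. The sketch below is only an illustration under assumed names: `mixed_batch`, `real_fraction`, and the toy datasets are hypothetical and were not described by the speakers.

```python
import random


def mixed_batch(real, simulated, batch_size, real_fraction, rng):
    """Draw one batch containing a fixed share of real samples.

    real_fraction is a hypothetical tuning knob, not a recommended value.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    # Sample with replacement so a small real dataset can still fill its share.
    batch = rng.choices(real, k=n_real) + rng.choices(simulated, k=n_sim)
    rng.shuffle(batch)
    return batch


# Toy data: each sample is tagged with its source.
real_data = [("real", i) for i in range(10)]
sim_data = [("sim", i) for i in range(1000)]

batch = mixed_batch(real_data, sim_data, batch_size=32, real_fraction=0.25,
                    rng=random.Random(0))
n_real_in_batch = sum(1 for source, _ in batch if source == "real")
print(n_real_in_batch, len(batch))
```

The scarce real samples are drawn with replacement here, so they are reused more often than the abundant simulated ones; whether that is desirable depends on the task.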

Peiqing Xiao, Senior Startup Solutions Architect, Amazon Web Services

Current VLA models are definitely not the final form, because there isn't enough data. We can draw an analogy to autonomous driving — we're at the earliest stage of embodied intelligence. Tesla actually wants to do end-to-end, but I think what they've built is essentially a VLA model. Its mechanism is: see things through cameras, process through various neural networks in between, then output actions like accelerating, steering, braking. FSD already works very well now because Tesla has so much data — every owner is providing data for free.

But embodied robots can't yet access this much real data, which is why simulated data is valuable. And simulated data also has its own Scaling Law — the more simulations you run, the higher the success rate when transferring to real environments.

Sensor Evolution Directions

Siyuan Huang, Head of General Vision Lab, Beijing Institute for General Artificial Intelligence

From a vision perspective, I believe 3D understanding is the most important. This can be achieved either through multi-view understanding or through depth cameras and LiDAR; different manufacturers are choosing different paths. Tactile sensing also has great room for improvement. Currently, robots performing very difficult actions need precise force sensing and force control. We're seeing some products with excellent motors now.

One direction we're watching is whether we can build a highly generalizable, industry-agnostic foundational sensor that could help us construct more powerful perception systems for more general-purpose humanoid robots in the future.

Ke Fang, Co-founder and CFO, Astribot

When building robots, we place great emphasis on multimodal data. Beyond vision, we pay close attention to force and tactile sensing, including designing transmission structures with careful consideration of force and tactile applications across scenarios. I strongly agree with the view that 3D vision has limitations in understanding the real physical world. In many operations, vision only serves a guiding role. So our underlying logic for how to help robots better understand the physical world is: be like humans. Humans comprehend the world through the integration of vision, force, and touch. Thus multimodal perception is an important research direction for future robots.

In sensor design, we need to consider integration with the entire control loop. For tactile sensing, for example, beyond sensitivity we care deeply about tactile frequency — if tactile frequency doesn't match the control loop, you get a situation where by the time tactile feedback is transmitted, the object has already been dropped.
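
The mismatch can be put in rough numbers. The sketch below assumes a hypothetical 1 kHz control loop and compares tactile sensors updating at different rates; the slip speed and every figure are illustrative assumptions, not measurements from any product mentioned here.

```python
def worst_case_staleness_ms(sensor_hz):
    """Worst-case age of the latest tactile reading, in milliseconds.

    A sensor updating at sensor_hz can be up to one full period out of date.
    """
    return 1000.0 / sensor_hz


def slip_before_reaction_mm(sensor_hz, control_hz, slip_speed_mm_s):
    """Distance an object can slip before the controller can react.

    Simplifying assumption: the delay is one sensor period plus one control
    period, with no transmission or actuation delay.
    """
    delay_s = 1.0 / sensor_hz + 1.0 / control_hz
    return slip_speed_mm_s * delay_s


CONTROL_HZ = 1000        # hypothetical control-loop rate
SLIP_SPEED_MM_S = 100.0  # hypothetical slip speed of the grasped object

for sensor_hz in (20, 100, 1000):
    stale = worst_case_staleness_ms(sensor_hz)
    slip = slip_before_reaction_mm(sensor_hz, CONTROL_HZ, SLIP_SPEED_MM_S)
    print(f"{sensor_hz:>5} Hz tactile: up to {stale:.0f} ms stale, "
          f"about {slip:.1f} mm of slip before a reaction")
```

Under these assumptions a slow tactile sensor dominates the delay, so speeding up the control loop alone does little to prevent the drop.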

「Upstream and Downstream Perspectives: How Far Is Embodied Intelligence from Commercialization?」

Landing Strategy: Generalized Scenarios or Specific Scenarios?

Zhizheng Zhang, Partner, Galaxy Universal

This needs to be considered at two levels. First, should the product land in a specific scenario first? Second, if we want to achieve high product capability in that specific scenario, should we only learn from that scenario?

On the first question, our answer is yes. But to achieve high product capability in a specific scenario, learning only that scenario's data won't get you there. In AI model training, to achieve true generalization, the foundation model must reach a certain level of generality first — only then can it draw inferences and achieve closed loops. Analogous to human learning: general education comes first, then specialized education. The same applies to embodied intelligence models — if you want to handle various complex states and unstructured scenarios in one domain, you absolutely cannot train using only single-scenario data.

To summarize, the robots we want to build should be "experts" who complete general education before advancing to domain-specific knowledge, capable of replacing certain professions and handling specialized tasks in one field. And to pursue true generalization and universality, during the construction phase we must train with data from different domains and industries.

Shipu Zhang, Co-founder and CEO, Songyan Dynamics

On the data front, under the new era's scaling law, how to effectively collect data is crucial. If we plot data quantity on the x-axis and data accessibility on the y-axis, we see players like Tesla and Google collecting data 1:1 along the x-axis, while some researchers focus on interfaces first, avoiding heavy collection costs and working along the y-axis on collection solutions.

But now everyone still needs to mentally draw another graph: where exactly does scaling law converge? Different tasks converge at different speeds, which relates to the commercial scenarios entrepreneurs want to pursue. You may need to combine this with your existing resources to consider which scenarios are easier to commercialize, and the corresponding scaling law convergence speeds differ. In the current capital environment, you should find good data collection methods to accelerate PMF iteration.
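
To make this second graph concrete, one common simplification is to model task error as a power law in the amount of data and ask how much data each task needs to reach a target. The functional form, the coefficients, and the task names below are assumptions for illustration only; the speaker gave no formula.

```python
def error_rate(n_samples, scale, exponent):
    """Hypothetical power-law learning curve: error = scale * n^(-exponent)."""
    return scale * n_samples ** (-exponent)


def samples_needed(target_error, scale, exponent):
    """Invert the power law: amount of data needed to reach target_error."""
    return (scale / target_error) ** (1.0 / exponent)


# Made-up tasks: a larger exponent means the task converges faster with data.
tasks = {
    "pick-and-place (hypothetical)": {"scale": 1.0, "exponent": 0.5},
    "cloth folding (hypothetical)": {"scale": 1.0, "exponent": 0.25},
}

TARGET_ERROR = 0.1  # an assumed threshold, i.e. 90% success
for name, p in tasks.items():
    n = samples_needed(TARGET_ERROR, p["scale"], p["exponent"])
    print(f"{name}: ~{n:,.0f} samples to reach {TARGET_ERROR:.0%} error")
```

Even with identical starting points, halving the exponent squares the data requirement in this toy model, which is why the choice of scenario and the cost of collection are so tightly linked.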

How to Find Product-Market Fit Scenarios?

Suibing Zheng, Founder and CEO, RealMan Robotics

One selection criterion is: scenarios where robots create value in application while conveniently collecting data along the way.

For example, in new retail, power utilities, construction, and automotive, which are scenarios that both generate value and build robot capabilities, collecting data in the course of real work is an excellent approach. It is better than collecting data by deploying large numbers of demos or using massive human labor for training. Of course, in the initial stage, building basic operational capabilities is essential.

In the specific data selection process, I believe there's no need to pursue comprehensiveness. This is similar to how Tesla FSD, as electric vehicles were being adopted, first collected general perception and operation data and then made improvements. In short: first achieve basic human-level functionality, then surpass it.

Yidong Chen, Chief Electrical Control Expert, Zhaowei Machinery & Electronics

Dexterous hands are an extension of our product capabilities — we want to truly solve the "dexterity" of hands: flexible, lightweight, quiet, reliable, and affordable. With this in mind, we deeply explore this track by combining leading customer needs with our own capabilities.

From a data perspective, the key to empowering products is making the data for each niche scenario as precise as possible. Whether you can combine scenarios to create unique, optimal solutions is the essence of commercial success. In our industry, across all tracks, those who ultimately survive are the ones who leave specialized work to specialists.

End Effector Form: Multi-function "Swiss Army Knife" or Hand That Operates Multiple Tools

Suibing Zheng, Founder and CEO, RealMan Robotics

Some specific scenarios may only require end effectors with one degree of freedom, or even zero degrees of freedom. So end effector forms are diverse. But if we're training robots for various household services, a hand capable of holding multiple tools is more suitable. In summary: different products for different needs.

Zhizheng Zhang, Partner, Galaxy Universal

A Swiss Army knife is versatile and integrates various functional components, but it's not convenient to use — and the knife we use most in daily life isn't a Swiss Army knife. For embodied intelligence, with current hardware forms and end effectors not yet standardized and the industry lacking data, we firmly believe the core approach should be "develop human-like capabilities first, then superhuman capabilities" — first let robots learn like humans to use various tools. Only when forms are as standardized as possible and capabilities as close to human as possible can we efficiently and rapidly accumulate data and improve product quality.

On this foundation, when our data accumulation reaches a certain level and foundation model capabilities reach a certain threshold, with clear demand we would consider Swiss Army knife-style designs, or even go beyond to explore more imaginative mechanical designs, enabling robot capabilities to evolve from "human-like" to "superhuman." We believe this is a more pragmatic, responsible approach to the industry.

Yidong Chen, Chief Electrical Control Expert, Zhaowei Machinery & Electronics

To indulge in some speculation: in the not-too-distant future, with deep integration across upstream, midstream, and downstream of every industry chain, I believe we can achieve symbiosis between humans and embodied intelligent robots. End effector selection remains scenario-dependent — what matches and satisfies demand is what makes sense.