How Far Can "Embodied AI" Take This Robotics Boom? | 5Y Capital Tavern Vol. 17 x Silicon Valley 101

5Y Capital · June 3, 2024

What new technologies are emerging in the robotics industry with AI, and what challenges still lie ahead?

The combination of multimodal large models — spanning vision, language understanding, and more — with simulation training technologies has been like a shot of adrenaline for robots, making them increasingly intelligent, capable, and human-like. The emergence of "embodied AI" means machines are no longer merely passive computing devices, but active agents that can interact with the physical world.

From Tesla's Optimus to Boston Dynamics' Atlas, the progress of these humanoid robots seems to herald the end of Moravec's paradox — the historical difficulty robots faced with perception and motor tasks is gradually being overcome. As technology costs fall and maturity rises, is commercial application of humanoid robots close at hand? In this race, which players can seize the initiative? What new tools and technologies are reshaping the robotics industry under AI's influence, and what challenges still lie ahead?

In this episode of the 5Y Capital Tavern, in partnership with SV101, 5Y Capital's Peter sits down with Lily, NVIDIA's head of robotics business in China, and Jun Hong, founder of SV101, to share their perspectives on the robotics industry.

Scan the QR code above or follow "5Y Capital Tavern" on Xiaoyuzhou to listen to this episode.

Selected excerpts from the conversation

How the robotics industry has changed over the past year

Jun Hong: What's changed in the industry over the past year? From my observation, whether in terms of funding or topical interest, robotics has been getting hotter.

Peter: Looking back at the past year, I think we've seen several categories of change in robotics. First, advances in large models — whether traditional language models or vision-language models — are being applied to robotic decision-making and cognitive tasks, greatly expanding the constraints that traditionally limited robot control and decision-making, and giving people a glimpse of rapid technological progress.

Second, progress in robotics technology itself: in control, simulation, imitation, and learning, the field has made significant strides. For example, NVIDIA's simulators and training environments allow robots to complete low-level control and simulation more quickly through imitation learning or reinforcement learning, whether for quadruped, biped, or more complex robot forms.

Third, investment from industry giants and startups. Led by Tesla, major players have begun pouring significant resources into humanoid robots and embodied AI R&D. Numerous startups have also emerged in China and overseas, leveraging mature hardware supply chains to develop various legged robots or new robot forms. Some exciting products and demos have come out of this.

Jun Hong: Much of the current attention on robotics stems from what large models can bring to the table. In other words, how have advances in large models affected the robotics industry?

Peter: Let's first break down robot capabilities or tasks. A classic framework divides robots into perception, planning or decision-making, and control. Simply put, a robot needs to understand its environment and external inputs, then based on that understanding and its own state, conduct short-term and long-term planning and prediction. At the hardware level, actuators need to execute that planned control.

The most direct impact comes from language models, which possess strong commonsense and reasoning capabilities that can influence the upper levels of robot planning and control. This arguably began with Google's PaLM-SayCan in 2022, combining large models with robot execution. By 2023, vision-language models like PaLM-E achieved the integration of perception and planning. With RT-1 in late 2022 and RT-2 in 2023, Google used Transformer architectures to let large models drive and implement the control layer as well.

The trend we're seeing is that large models are not only participating in upper-level robot planning but are also beginning to enter the low-level tasks of robot perception and control. This is the first time in robotics' long history that we've seen such a change: control driven by learning algorithms, with a very large pretrained model pushing it forward. This is the biggest technical breakthrough we've seen in the past year.

Jun Hong: You mentioned PaLM-E and RT-1, RT-2 — they've actually been in development for some time. Why is the timing right now? Is it that from platform research to actual deployment takes a while?

Peter: Products from Google Research like PaLM-E represent very novel ideas. PaLM-E's biggest feature was combining Google's traditional language models with robotic embodied capabilities. Before this, people hadn't tried combining robots with large language models' upper-level reasoning — it was a gradual development process. Just as Transformer technology appeared in 2017, GPT-3 didn't emerge until 2020, and GPT-4 until 2023; technology takes time to mature and spread.

The challenge in robotics is that physical entities and testing environments aren't as convenient as pure language models. So when these models are actually applied to robotic scenarios, it takes time to implement. RT-1 and RT-2 are also results of Transformer technology gradually diffusing. After Transformer matured, we first saw Video Transformer approaches combining video with Transformer encoding. For RT-1 and RT-2, the more important development was combining robotic execution actions with the Transformer framework — a direction that only began to see active research and experimentation in the past year.

How do you train robots?

Jun Hong: There's a popular term lately: "embodied AI." Could you briefly explain the difference between embodied AI, humanoid robots, and robots in general?

Peter: Interaction through a physical body is the key characteristic. When we discuss embodied AI, its opposite is non-embodied AI, or what we traditionally call software intelligence or algorithmic intelligence. Before embodied AI was defined, a different school of thought held that intelligence could be achieved through symbolic computation and algorithmic processing alone, without physical embodiment.

Traditional AI systems typically lack physical embodiment — they don't need to interact with physical entities to achieve desired outcomes. But embodied AI must have a physical or machine entity that interacts with the environment. Intelligent behavior derives not only from fixed algorithms but from interaction with the external environment.

When we mention embodied AI, we're emphasizing that it has an entity and that this entity must interact with the external environment. By contrast, some traditional robots don't need to respond autonomously to the outside world; they may be automated equipment repeatedly executing preset programs. As AI technology develops, embodied AI has greatly expanded the traditional definition of robots. Humanoid robots are an extension of that traditional definition: more human-like, more flexible, and more general-purpose.

Jun Hong: When people talk about robots, many types come up: quadruped robots, biped robots, robotic arms, and humanoid robots. Could you explain the technical differences between these robots and their common characteristics?

Peter: Robotics technology can be imagined as a spectrum, from specialized, automated, highly repetitive machines to general-purpose, flexible, humanoid-like robots. At the far left of the spectrum are automated robotic arms executing repetitive tasks, relying mainly on low-level control such as repeated joint movements. At the far right are general-purpose humanoid robots that need to perform complex tasks requiring high flexibility and autonomy.

From a technical perspective, the three core modules of robots are perception, planning, and control. Specialized robots may not need perception and upper-level planning, relying only on low-level control. As we move right on the spectrum, robots need stronger environmental perception and task planning capabilities, and technical complexity increases accordingly. AI plays an increasingly important role in perception and planning.

General-purpose robots at the far right need powerful perception, planning, and control capabilities because they may have more joints and sensors and need to execute complex long-term and short-term tasks. While the underlying logic is shared, the technical depth varies greatly depending on requirements for flexibility and autonomy.
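As a rough illustration of the perception, planning, and control split described above, here is a minimal sketch in Python. Everything in it is a hypothetical toy, not something from the conversation: a one-dimensional robot (`Robot1D`), a proportional planner with an arbitrary gain of 0.8, and an actuator limit of 0.5 per step. A real robot would replace each function with a far more complex module.

```python
from dataclasses import dataclass


@dataclass
class Robot1D:
    """Hypothetical toy robot that moves along a line."""
    position: float = 0.0
    max_step: float = 0.5  # assumed actuator limit per control tick


def perceive(robot: Robot1D, target: float) -> float:
    """Perception: estimate how far the robot is from the target."""
    return target - robot.position


def plan(error: float) -> float:
    """Planning: decide how far to move this tick (simple proportional rule)."""
    return 0.8 * error


def control(robot: Robot1D, desired_step: float) -> None:
    """Control: execute the plan within the actuator's physical limit."""
    step = max(-robot.max_step, min(robot.max_step, desired_step))
    robot.position += step


def run(target: float, ticks: int = 50) -> float:
    """Run the perceive -> plan -> control loop and return the final position."""
    robot = Robot1D()
    for _ in range(ticks):
        error = perceive(robot, target)
        control(robot, plan(error))
    return robot.position


if __name__ == "__main__":
    print(run(3.0))
```

A specialized machine at the left of the spectrum would effectively keep only `control`, replaying a preset motion, while a general-purpose robot needs all three stages running continuously.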

Jun Hong: On robots, Elon Musk has rolled out Optimus, and Boston Dynamics also has humanoid robots. Do robots necessarily need to be humanoid? How do people think about this?

Lily: Labor shortage is a real problem, and many tasks were designed from the start for humans to perform, such as manipulation and material-handling tasks. Therefore, humanoid robots are seen as the ideal ultimate solution for the future. However, on the way to this goal, some transitional robot forms may emerge, such as robotic arms, combinations of multiple arms, or two or three arms mounted on a mobile platform, along with other complex robot forms.

Jun Hong: For example, if we give a robot three arms, it could use two to move boxes and one to sort.

Lily: Yes, the arms can collaborate. It's really the same as a human performing a task: the robot is a platform on which multiple actuated joints work together.

Peter: On the necessity of humanoid robots, we can consider it from rational and emotional dimensions. Rationally, since human society and work environments are designed for humans, humanoid robots can quickly adapt to these environments and become the most flexible and general-purpose robot form. Additionally, a unified humanoid design helps achieve mass production, reducing costs so that robot quantities can reach billions or even tens of billions. Although current technology isn't mature and achieving highly realistic humanoid robots still requires foundational breakthroughs, from a rational perspective, humanoid robots are the ideal goal for future development.

Emotionally, many robotics enthusiasts and geeks with deep interest in robots have strong emotional attachment and aspiration toward humanoid robots. Futurists might accept the view that human civilization may gradually transition from carbon-based to silicon-based civilization, that future society may have large numbers of robots coexisting with us, even surpassing us, continuing our civilization, and exploring the universe or interstellar space. In this scenario, humanoid robots may become a symbol of human civilization, establishing stronger emotional connections with us and inheriting the beautiful dreams of human civilization.

Jun Hong: I wonder if humanoid robots can also score more sympathy points — sometimes when I see someone kicking a humanoid robot, because its shape is so human-like, there's a feeling that it hurts, even though I know it doesn't feel pain; but if someone were kicking a metal sheet or box, people might think it's fine.

So next, a question for Lily — when we talk about these robots, are their requirements for chips or GPUs high?

Lily: This depends entirely on what the application requires. Because these robots need a very thorough understanding of their surroundings, which requires multimodal input, the GPU requirements will be higher. In short, the more AI capability is required, the more GPU capability is required.

Commercial prospects for robots

Jun Hong: Between more vertical and more generalized robot commercialization, who does Peter favor more? What changing trends are you seeing?

Peter: This is something people care about a lot, but let me pour some cold water: in fact, there are no precedents for general-purpose robot commercialization, or it's still at a very early stage. While people are full of expectations for robot commercialization, we must realize that today's general-purpose robots are still in the stage of early research, foundational technology breakthroughs, and technological exploration.

Products that have actually appeared on the market are basically robots for various professional scenarios or specific niche segments. Even so, truly successful scenarios with scale reaching millions or even tens of millions of units are very rare. This reflects that robot commercialization is an extremely complex and difficult process. It involves not only technological breakthroughs and innovation, but also finding good product-market fit, while meeting customer requirements for cost and reliability.

I believe robot commercialization is actually a long process. For many entrepreneurs or investors, it's easy to underestimate the difficulties robots face in the short term, while focusing more on long-term possibilities.

Jun Hong: Because I know you started investing in robots very early — do you think previous vertical application robots have commercialized well?

Peter: Although today we see numerous robot research projects and demos, including various forms of humanoid robots, legged robots, etc., with many development kits, the robot categories that have truly achieved commercialization — say, selling 1 million units — are actually quite few. To put it in perspective, annual phone sales are 3 billion, cars sell 100 million, whereas in robotics, categories selling 1 million units annually are very rare.

Currently, the truly commercialized major robot categories are mainly these few. In home scenarios, autonomous robots like robot vacuums and robotic lawn mowers have been commercialized. In commercial scenarios, the biggest is probably warehouse automation, like Amazon's Kiva system, which has reached 1 million units shipped.

Additionally, drones can be considered a commercialized robot scenario: aerial photography drones, consumer drones, and industrial drones, represented by DJI, have also reached million-unit sales levels. Using 1 million units as a standard, you'll find that even after two or three decades of development, robot categories that truly ship a million units are very few. As for other more open scenarios, they're at even earlier stages, without progress toward large-scale commercialization and productization.

Jun Hong: How do you view investment in general-purpose robots?

Peter: The market is very focused on the general-purpose robot direction, and many large companies, including NVIDIA, have made significant investments in this area. However, I must admit that I think today's general-purpose robots are still at a very early stage. Compared to the PC era or mobile internet era, I think it may be equivalent to PCs in 1980 or smartphones in 2005 — core technological elements are rapidly emerging, long-term and early exploration is very active, but very few have formed clear business and product roadmaps.

Therefore, our view is that we will actively follow this direction, but when to make large-scale aggressive investments may depend on timing, product direction, and the team itself. Based on our years of experience investing in robots, identifying technological variables, maturity, and market-facing capabilities is crucial. If we misjudge the maturity of technology, it may mean that even 10 years after company founding, we still can't deliver a truly sellable, commercializable product.

From product and market perspectives, you must find product-market fit — what we commonly call PMF. This is also very difficult for general-purpose robot products, because today we may be able to make a very cool general-purpose robot demo without regard to cost, but a cost-agnostic device can't be sold into home or commercial scenarios. So requirements for PMF and product definition are also very high.

Most critically, from a long-term perspective, it's still about the team. General-purpose robot entrepreneurship isn't like software or AI entrepreneurship: it requires not only very good software and algorithm capabilities, but also good hardware capabilities, and ultimately very strong business and go-to-market capabilities. So the entrepreneurial threshold in our view is very high, no lower than that of starting an electric vehicle company. Therefore, our standards in this regard will be very high, and we very much look forward to teams that can satisfy all of these requirements taking on the general-purpose robot field.

Jun Hong: As NVIDIA's head of robotics business in China, Lily, how do you view the current state of real-world deployment in China's robotics industry?

Lily: I think China does very well in robotics globally. Thanks to its strong base in intelligent manufacturing, China's robotics industry is developing excellently from NVIDIA's perspective. Currently, consumer services, intelligent manufacturing, automotive, and warehousing logistics are all doing quite well. Consumer services can be understood to include robot vacuums, robotic lawn mowers, and so on. Although China may not have widespread demand for robotic lawn mowers, it is a very large OEM exporter of them, so it also does very well in this area. Consumer services also include delivery, such as drone delivery and unmanned vehicle delivery, especially for fresh goods and some direct-to-home service scenarios. In industrial manufacturing, manipulation and material-handling robots are also quite widely applied.

Jun Hong: From an investment perspective, Peter, do you think this is a field that can be invested in at scale?

Peter: Objectively speaking, although the robotics industry is very active, its true scale and magnitude aren't that large. As I mentioned, robot categories with shipments exceeding 1 million units are very few. If the entire industry only produces, say, 100,000 robots, the corresponding output value is actually relatively small. Take food-delivery robots as an example: although it's a fairly active category, global annual shipments may be less than 100,000, perhaps even less than 50,000. Multiply that by unit price, and the entire industry's output value is relatively small.

In such a relatively small industry, finding particularly large investment opportunities is actually quite difficult. This is precisely why people have such high expectations for general-purpose robots and humanoid robots — people hope that robot generalization can greatly expand robot application scenarios and market scale. Although this hasn't happened yet, people maintain willingness and interest to continue investing in this prospect.

Jun Hong: I understand that the robotics industry has become hot again this year partly because people see some improvements that large models bring to robotics. As you just mentioned, the current development stage of the robotics industry may be similar to PCs in 1980 or mobile internet in 2005. So is now the time to start paying attention to and investing in general-purpose robots?

Peter: We actively follow numerous companies domestically and internationally — we've actually already engaged with and researched them. But I must admit that we believe the robotics industry is still at a very early stage. If our true goal is general-purpose robots, the challenges they face, and the time and cost required, may be enormous — the difficulty may even exceed what Tesla faced developing electric vehicles in 2005. Building the entire ecosystem is still at a very early stage, so regarding investment in general-purpose robots, I maintain a relatively cautiously optimistic attitude.

At NVIDIA GTC in March, there was a session inviting Boston Dynamics founder Marc Raibert to discuss general-purpose robots and humanoid robots. He made an interesting point: for humanoid robots, as representative of general-purpose robots, to achieve large-scale commercialization, at least 10 years or more are needed. He believes this cycle will exceed the lifespan of most venture capital funds. In other words, he doesn't recommend that venture capital truly participate in such investments, because the required time horizon far exceeds what people imagine.

Jun Hong: Is your judgment consistent with his?

Peter: This represents the conservatively optimistic attitude of scientists and researchers who have truly worked in humanoid robots for 30 years. Although we currently see many changes and variables in this field, with robots able to grasp apples or other items, these technologies are still quite far from true commercial deployment.

Take grasping technology as an example. We've invested in some grasping companies and pay close attention to this field's development. In a lab, grasping and releasing an apple in a clean environment, even with a 99% success rate, is still far from what real industrial deployment requires. In factory environments, the accuracy requirement isn't 99% but 99.99%, meaning only one failure is allowed in 10,000 grasps.
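To make the gap between 99% and 99.99% concrete, the short Python sketch below works through the arithmetic. The function names and the assumption that grasps succeed or fail independently are illustrative only; the 10,000-grasp figure comes from the example above, and the 100-grasp run is a hypothetical batch size.

```python
def expected_failures(success_rate: float, attempts: int) -> float:
    """Expected number of failed grasps over `attempts` tries."""
    return (1.0 - success_rate) * attempts


def prob_no_failure(success_rate: float, attempts: int) -> float:
    """Chance that every grasp in a run succeeds, assuming independent grasps."""
    return success_rate ** attempts


if __name__ == "__main__":
    for rate in (0.99, 0.9999):
        failures = expected_failures(rate, 10_000)
        clean_run = prob_no_failure(rate, 100)
        print(f"success rate {rate:.2%}: "
              f"about {failures:.0f} failures per 10,000 grasps, "
              f"{clean_run:.1%} chance of 100 grasps with no failure")
```

Under these assumptions, a 99% grasper fails about 100 times in 10,000 grasps and completes 100 grasps in a row without error only a little more than a third of the time, while a 99.99% grasper fails about once and almost always gets through the same 100 grasps cleanly.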

Jun Hong: We mentioned in last year's podcast on robots and large models that if we could get a robotic arm to accurately grasp a cup in any space, under any lighting, it would probably be Nobel Prize-level achievement.

Peter: Yes, this is a crown-jewel problem in robotics. But a person, even a worker with just an hour or two of training, can basically achieve 100% error-free operation on a production line, and do so very dexterously and very quickly. So deploying robots in such scenarios requires very careful thinking about the product path. Making a very cool demo is far from true commercialization. Returning to our investment work: we're investing in commercially successful companies, not companies that only excel at making demos and continuously fundraising. So this sets a very high bar for both investors and entrepreneurs.

Jun Hong: How would you evaluate whether a robotics company is good and worth investing in?

Peter: Investment is a multi-dimensional consideration. A company must choose the right entry point at the right time, and execute with a suitable team. For example, in 2014 and 2015, some robot vacuum companies emerged in China; in 2021 and 2022, some robotic lawn mower startups appeared; and in 2016 and 2017, a batch of warehouse robot companies emerged. The timing and market entry points of these companies were closely related to the maturity of technology and demand at the time.

As investors, for entrepreneurs, we need to judge at this point in time which technologies and capabilities have matured to the stage of commercializable entrepreneurship, and whether this team has truly found the right people. We must consider not only whether they may have very good backgrounds, logic, or stories, but whether they can truly deliver scenarios or products. This is quite a comprehensive matter, and a process we explore and learn from daily.

I believe Chinese companies have tremendous opportunities in the robotics entrepreneurship wave. Successful robotics companies in the past 10 to 20 years have overwhelmingly come from China, thanks to our advantages in supply chain, R&D, and rapid iteration in real deployment scenarios. If it's a team doing robotics entrepreneurship in the US, I would consider whether this team is suited for fields with very high precision and performance requirements and low cost sensitivity, since this may be what has made US companies successful in robotics entrepreneurship over the past 10 to 15 years. These elements may be exactly opposite to what makes Chinese companies successful.

Therefore, fields where US companies are more likely to succeed may have lower probability of success for Chinese companies. Conversely, in areas where Chinese companies have obvious advantages, such as robot vacuums, robotic lawn mowers, warehouse robots, service robots, and delivery robots, I believe Chinese companies have greater chances of success.

Jun Hong: Overall, the robotics track is worth looking forward to, but we're still at a very early stage.

Peter: Yes, in AI's development history, we've experienced three downturns. This is because in the short term people have high expectations for technology's possibilities and breakthroughs, but when technology can't meet these expectations, people become disappointed. I worry that the robotics industry may face similar situations in the next two or three years, especially with large amounts of capital and entrepreneurs pouring into long-term, difficult fields like humanoid robots and general-purpose robots.

If these companies don't achieve expected progress or results in the next two to four years, the market may enter a downturn or a period of disappointment. How we recover from such an environment and stand up again, truly pushing technology toward deployment and commercialization, is something I'll pay close attention to in coming years.

Jun Hong: Is this process similar to the development journey of autonomous driving?

Peter: If we went back to 2015 and told everyone that by 2024 we still couldn't achieve commercialized, profitable Robotaxi operations, I think many people might not have chosen to enter this industry, or wouldn't have invested so aggressively in this field. Technology development often requires a long process and won't be smooth sailing, especially in fields involving interaction with physical and natural environments. If we're to solve humanoid robot or general-purpose robot problems, the difficulty far exceeds that of commercializing and deploying autonomous driving technology.

Jun Hong: One benefit is that when technology bubbles appear, they attract many top smart people into the industry. Although the industry may not reach the ideal state people initially envisioned, in the process of technological development, it may bring many new commercial scenarios and application forms that people hadn't anticipated. This is a very interesting phenomenon.

[Related supplementary information]

Embodied AI: Embodied artificial intelligence refers to an intelligent system that perceives and acts through a physical body, acquiring information, understanding problems, making decisions, and carrying out actions through agent-environment interaction, thereby producing intelligent behavior and adaptability. In 1950, Turing concluded his classic paper Computing Machinery and Intelligence, which proposed the Turing test and helped found AI, by envisioning two possible development paths for AI: one focusing on abstract activities that require intelligence (such as chess), the other equipping machines with the best sensors, enabling them to communicate with humans and learn like infants. These two paths gradually evolved into non-embodied and embodied intelligence.

PaLM-E: PaLM-E is an embodied multimodal language model proposed jointly by Google and Technische Universität Berlin in 2023. This model can directly incorporate real-world continuous sensor modalities into already pretrained large language models, thereby establishing connections between words and percepts. Its core design concept is injecting continuous, embodied observations (such as images, state estimates, or other sensor modalities) into the language embedding space of pretrained LLMs.

PaLM-SayCan: A robot learning algorithm proposed by Google Research in 2022, combining large language models with pretrained robot behaviors, where the robot serves as the "hands and eyes" of the language model, and the language model provides high-level semantic knowledge about tasks. This approach enables robots to execute complex physical tasks according to natural language instructions, while ensuring these tasks are feasible in specific real-world environments.
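The combination described above can be sketched in a few lines of Python. This is a simplified illustration, not Google's implementation: the skill names and all scores are made-up numbers, and `select_skill` is a hypothetical helper. The idea it shows is that each candidate skill gets a language-model score (how useful the skill is for the instruction) multiplied by an affordance score (how likely the skill is to succeed in the robot's current state), and the robot executes the highest-scoring skill.

```python
def select_skill(llm_scores: dict[str, float],
                 affordances: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Pick the skill with the highest (usefulness x feasibility) product."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


if __name__ == "__main__":
    # Hypothetical scores for the instruction "bring me the sponge"
    # when the robot is not yet near the sponge.
    llm_scores = {
        "pick up the sponge": 0.5,   # most relevant to the instruction
        "pick up the apple": 0.3,
        "go to the table": 0.2,
    }
    affordances = {
        "pick up the sponge": 0.1,   # sponge is out of reach right now
        "pick up the apple": 0.2,
        "go to the table": 0.9,      # navigation is feasible from here
    }
    best, combined = select_skill(llm_scores, affordances)
    print(best)
```

With these made-up numbers the robot first goes to the table, even though picking up the sponge is the most relevant skill in language terms, because grasping is not yet feasible from where it stands.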

Google RT-1/RT-2: Robot learning models developed by Google's robotics research team. RT-1 (Robotics Transformer 1) is a Transformer-based model that maps camera images and natural language instructions to robot actions; RT-2 is its evolved version, a vision-language-action (VLA) model that aims to go end to end from robot observations to actions while leveraging the advantages of large-scale pretrained vision-language models. RT-2 obtains better generalization and emergent capabilities by pretraining on internet-scale vision-language tasks, then fine-tuning on real-world robot tasks.

What are your views on embodied AI and general-purpose robots? Welcome to share your thoughts after reading this content, or related perspectives in the comments. We'll select 2 featured comments to receive a gift prepared by 5Y Capital :)

5Y Capital seeks, supports, and inspires lonely entrepreneurs, providing them with support from spirit to all operational aspects. We believe that if the "crazy you" in others' eyes begins to be believed in, the world will become refreshingly different.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG