Li Feng in Conversation with Lian Wenzhao: The Promise and Hype of Large Models, the "Impossible Triangle" of Robotics, and the Future | FreeS Fund Venture Dialogues

FreeS Fund · April 19, 2024

This is an unprecedented opportunity: technology-driven, industry-anchored.

This is the second installment in FreeS Fund's Embodied Intelligence Series.

Not long ago, we spoke with Zhang Wei, founder of LimX Dynamics, about progress in humanoid robot lower-body mobility. Precise manipulation capabilities for robot upper bodies are also critical to broader robot deployment. This time, we continue our focus on frontier developments in embodied intelligence, with an in-depth discussion with Lian Wenzhao, a researcher at the Chinese Academy of Sciences' Institute of Automation, on topics including fine motor control in humanoid robot upper limbs and industry commercialization.

Lian Wenzhao graduated from the Department of Electronics at Shanghai Jiao Tong University in 2011, and earned his master's in statistics and PhD in electrical and computer engineering from Duke University in 2015. After graduation, Lian worked consecutively at top technology companies, all in roles related to robotics — including research at Vicarious, an AI company heavily backed by Silicon Valley tech luminaries, and robotics R&D at Google X. Before returning to China, he served as technical director at the buzzy robotics company Figure AI, where he led the hardware-software integration and core algorithm development for dexterous robot manipulation. In 2023, he decided to swim against the current and return to China, exploring how embodied intelligence could drive industrial commercialization at home.

Drawing on Lian's unique experience and perspective, Li Feng discussed with him:

What does the history of the robotics industry look like?
What stage are we at with fine manipulation capabilities for humanoid robot upper limbs, and what technical challenges remain?
What different advantages and challenges do China and the US face in the robotics industry?
Which technological advances in recent years have expanded application scenarios for robots?
How do imitation learning and reinforcement learning contribute to upper limb coordination and dexterous hand control?
How do you break the "impossible triangle" in the robotics industry?
What is the future of humanoid robots?

We hope this offers fresh perspectives and insights. Stay tuned for more from FreeS Fund's Embodied Intelligence Series. Beyond the entrepreneurial perspective, two of FreeS's investment colleagues will also host a discussion on embodied intelligence.

You can also find the full interview on Xiaoyuzhou / Apple Podcasts / Ximalaya by searching for and subscribing to "High Energy."

Giveaway

What robot videos have crossed your feed in the past year, and which one left the deepest impression? Share in the comments. By 5:00 PM on April 26, the five most thoughtful commenters will each receive a FreeS Fund industry research handbook and a copy of The Age of Discovery.

/ 01 / Imagination or Bubble: What Have Large Models Brought to Robotics?

Li Feng: Since your PhD, your entire career trajectory has been tied to the robotics industry. I'd like to hear your perspective from beyond the headlines — how have you observed the US robotics industry changing?

Lian Wenzhao: I'll take the long view. It's true that the US robotics industry has developed very rapidly in the past two years. But in reality, robots have been deployed for five or six decades already. Initially, they were mainly used in automotive manufacturing and 3C electronics, because production volumes in these industries were massive. Companies like ABB and KUKA were early pioneers in developing and applying industrial robotic arms. However, at that time, these robots were purely repetitive — programmed for repetitive motion with no intelligence whatsoever.

Li Feng: Let me add some historical context. First, around the mid-to-late 1940s, automation theory and early computer technology emerged simultaneously, primarily in the US. The timing is significant — this was right after World War II. But despite the American origins of these technologies, divergent national conditions over the following decades ultimately led to the last generation of robots developing in Europe and Japan rather than the US.

There's an interesting historical process behind this.

The US benefited from both world wars because it avoided major fighting on its own soil, producing weapons and materials for other countries during wartime. After WWII, America's situation differed from that of defeated or heavily involved nations. For one, it hadn't suffered massive labor losses. Second, with no need for large-scale military production, market demand suddenly contracted, making employment the urgent priority. Robots were also extremely expensive. So even though the US had the technological edge in robotics, practical applications were very limited.

During this same period, Japan and Germany had suffered enormous personnel losses, especially among young working-age men. After the war, they urgently needed to restore production and develop consumer industries, and were desperate for production tools that could substitute for labor. Initially, they purchased expensive prototype robots from the US, then deployed them widely in electronics and automotive, the two industries that would become the main engines of Japan's and Germany's post-war economic recovery. Japan and Europe successively gave rise to the globally leading robot companies of that era, the so-called "Big Four" of industrial robotics (Switzerland's ABB, Germany's KUKA, and Japan's FANUC and Yaskawa).

Lian Wenzhao: This history is highly significant and carries great relevance today. Looking at both supply and demand sides, conditions for developing robots are now relatively mature across the world — including the US, Japan, Europe, and especially China.

Returning to your earlier question, the earliest generation of robots were more like automated machines without intelligence. As technology progressed, pattern recognition arrived, enabling robots to perform simple intelligent operations like line tracking and edge detection. This gave robots a degree of visual capability and allowed visually guided motion, with practical applications such as AGVs (automated guided vehicles) and robotic arms capable of more flexible operations. Application scenarios gradually expanded.

As applications broadened, hardware costs declined and production volumes increased. This alternating iteration between hardware and software drove technological progress and has propelled the robotics industry's continuous upward spiral. Over the past 20 years, 3D vision technology has unlocked even more robotic applications, enabling object pose estimation, segmentation, and bin picking from unstructured piles. Later, we saw AGVs enter homes; Google open-sourced SLAM (simultaneous localization and mapping) technology in 2016, bringing more companies into the space and combining hardware with more advanced software to expand applications further. Now robots can not only serve enterprise customers but also enter users' homes. The past few years have been even more interesting. Large models exploded onto the scene, though they haven't yet truly found solid footing in robotics.

Li Feng: So from what you're saying, while large models have brought imagination and hype to humanoid robots, they haven't actually connected yet?

Lian Wenzhao: Large models haven't reached that stage yet. Consider how 3D vision, 2D vision, and pattern recognition have proven themselves reliable and useful in other domains. When applied to robotics, they enhance robotic capabilities. However, large models haven't yet proven themselves in other domains as reliable tools that significantly boost productivity.

They can assist humans, but haven't reached the level of independently completing tasks. To give a concrete example: using OpenAI's ChatGPT for conversation or creative work — if we don't have a preset correct answer, it can spark our imagination wonderfully. But if we have a specific request, say generating an image, and that image is very clear and concrete in our minds, then ChatGPT may require multiple attempts, and we might eventually give up.

Li Feng: Understood. In internal discussions about large models, I often use the analogy of a three-, four-, or five-year-old child. If you have a question and don't have a predetermined answer, the child might give you a surprising response that seems adorable and delightful. But if you're seeking something logical, rational, or convergent, the child may not meet expectations. Large models are similar — you need to guide and progressively narrow the question.

Lian Wenzhao: I completely agree. In the robotics industry, interaction with the physical world typically has correct answers. For instance, picking up a cup — did you successfully pick it up and pour water? Did the water reach the 3/4 level I wanted? These have standard answers.

Li Feng: How did you manage to say "3/4 cup" with such precision in your example?

Lian Wenzhao: Some situations require quantitative description. While people mostly perceive things fuzzily in daily life — like whether something is roughly to the left, right, front, or back — in specific contexts we need precise measurement, and the margin of error depends heavily on the situation. We call this common sense or intuitive physics. Large models currently lack this capability; their precision depends on the precision of their training data. But humans can adjust this precision.

Li Feng: "Intuitive physics" is a fascinating concept — we'll discuss it in more detail shortly. What's the theoretical foundation of this concept?

Lian Wenzhao: At MIT there's a professor, Josh Tenenbaum, who studies cognitive science. He began this line of research very early, during the previous wave of interest in artificial general intelligence. He tends to use small-data, probabilistic methods that draw inspiration from the human brain. When I was working at Vicarious, I also hoped to draw inspiration from the human brain to design neural networks that wouldn't be overly dependent on big data and purely statistical methods. I was heavily influenced by that approach and explored it extensively.

Li Feng: I have another question I'm curious about: how do you view the changes at these American companies today? Your academic background was quite strong; most people with that profile went to the hottest internet giants at the time, which seemed to offer boundless hope and high compensation. How did you decide to go into robotics? Did you already know that robotics would be hot ten years later?

Lian Wenzhao: I wish I had that superpower. A lot of things are actually quite random, but that was indeed how I thought about it at the time. There were no large models back then; the hottest concept in machine learning was "big data." Upon graduation, many of my senior classmates did go to companies like LinkedIn, Facebook, and Google. But I felt that at these large internet companies, what each person could accomplish was actually quite limited. I cared more about whether I could make a difference in an industry or technology.

In my final two years of PhD, I collaborated with a professor working on robotics and machine learning research. There were no large models then; we did some model-based reinforcement learning, working only on simple pick-and-place tasks. Though these efforts seem very simple and basic in retrospect, they gave me tremendous inspiration.

At the time, I felt the robotics industry was full of opportunity — like looking down from 3,000 meters in the air, with gold everywhere on the ground. If you could create anything with even a bit of intelligence, it would enormously advance the field of robotics.

So my decision point was mainly based on this kind of thinking: I wanted to try something new, something with a steep enough slope that I could bring about change. That the robotics industry later became so hot — I hadn't anticipated that. My other logic was that humanity's ultimate ideals are to eat well and be lazy, and to live forever. Betting on these two areas definitely wouldn't be wrong, so working on AI and robotics was certainly the right call.

/ 02 / Vicarious, Google X, Figure: Different DNA, Visions, and Attempts

▎Vicarious Chose a Harder Path Than DeepMind

Li Feng: You've worked at several companies — Vicarious, Google X, Figure — each with different backgrounds, and you joined them at different stages. In your view, in terms of doing robotics, how did each company's DNA and stage shape their approach?

Lian Wenzhao: Every company or organization, though doing similar things, started from completely different places. Going chronologically: Vicarious's goal wasn't merely to build robots; what we hoped to achieve was artificial general intelligence. People might ask, what could prove that AGI had been achieved? So we designed IQ tests analogous to the Turing test — if we could score 140 or 160 on an IQ test, that would prove the existence of AGI.

Li Feng: If your test result is only 140 and you want it to reach 160, wouldn't that mean building a robot smarter than yourself?

Lian Wenzhao: That's inevitable. Robots can access far more information than we can, and their capacity for self-iteration is certainly stronger than ours. We should have this mindset of acceptance — we should want to see our children become smarter than us.

At Vicarious, we didn't want to merely build an artificially designed product; we hoped to prove through organic validation that we had achieved general intelligence. DeepMind's strategy was to play games, Go, chess — these turned out to be easier paths to success because they took place in the digital world, where data is infinite. They could leverage vast amounts of historical data to learn how to play games, even simulating gameplay themselves.

Li Feng: In our previous conversation with Professor Zhang Wei, we touched on this a bit — the so-called digital world. Games, Go, chess — these completely digitized closed-loop processes have the advantage that machines can learn from what humans have done: Go game records, chess game records, and video or control processes of gameplay. Because these processes are 100% digitized, machines can self-simulate, can play left hand against right hand. AlphaGo demonstrated this eight years ago — it could play against itself, simulating all possibilities 24/7 without interruption, and eventually defeat human players.

Lian Wenzhao: There are multiple factors here. First, in the digital world, the cost of generating data is extremely low — machines can play against themselves, using cloud computing resources to run many virtual machines in parallel. Second, the digital world can discretize all spaces, observations, and action spaces; in theory, all possibilities can be exhaustively enumerated.

But in the physical world, all variables are continuous. Anyone who studied electrical engineering knows digital and analog circuits — analog circuits are more challenging. Some digital circuit design can be automated or AI-enabled; analog circuits rely more on human experience, which is difficult for machines to replace. In robotics and physical-world operational interaction, the relevant data is hard to exhaustively enumerate, or even if you wanted to do so, the cost and price would be enormous. At Vicarious, we chose this path, and it ultimately proved to be far more difficult than the digital-world path.
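The point about discretized state and action spaces can be made concrete with a toy example. The sketch below is illustrative only and is not drawn from DeepMind's or Vicarious's systems: it assumes a hypothetical take-away game (remove 1 to 3 stones per turn, whoever takes the last stone wins) whose state is a single small integer, so every position can be enumerated and the program can play against itself at essentially no cost.

```python
from functools import lru_cache

MAX_TAKE = 3  # hypothetical rule: a player may remove 1 to 3 stones per turn


@lru_cache(maxsize=None)
def current_player_wins(stones: int) -> bool:
    """Return True if the player to move can force a win.

    The player who takes the last stone wins. Because the state is one
    small integer, every reachable position can be enumerated and cached.
    """
    if stones == 0:
        # No stones left: the previous player took the last one and won.
        return False
    # A move is winning if it leaves the opponent in a losing position.
    return any(
        not current_player_wins(stones - take)
        for take in range(1, min(MAX_TAKE, stones) + 1)
    )


def best_move(stones: int) -> int:
    """Pick a winning move if one exists, otherwise take one stone."""
    for take in range(1, min(MAX_TAKE, stones) + 1):
        if not current_player_wins(stones - take):
            return take
    return 1


def self_play(stones: int) -> int:
    """Let the program play both sides; return the winner (0 or 1)."""
    player = 0
    while True:
        stones -= best_move(stones)
        if stones == 0:
            return player
        player = 1 - player
```

No comparable lookup table exists once the state is continuous, such as a joint angle or a contact force: there are infinitely many values between any two readings, and each real-world sample costs hardware time rather than a function call.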

▎Google X, Through Acquisitions and Integration, Settled on Three Main Directions: Move, Make, and Help

Li Feng: So when you arrived at Google X, what were its main approaches, thinking, and DNA?

Lian Wenzhao: Google X was more like a research division under the large company Google. In 2013, Andy Rubin, the founder of Android, was very interested in robotics. Under his leadership, Google acquired eight robotics companies in one fell swoop. After a period of turbulence, they eventually settled on three main directions: move, make, and help. Move referred to mobility, such as logistics forklifts; make was industrial manufacturing, which was the project I later joined, publicly known as Intrinsic; the third was help, service robots, which later became known as the Everyday Robot project. Google's strategy was to first acquire many companies, then integrate them into several directions.

Specifically for the Intrinsic project, we hoped to build a universal robot operating system. Each of the industrial robotics "Big Four" has its own hardware and a relatively closed operating system. These robot manufacturers originated as automation companies; their software may go decades with few updates and, once deployed, typically doesn't change, because what they value most is robustness. This also limits robots' application in more scenarios.

The goal of Intrinsic was to unlock robotic capabilities, so robots could be used not just in "cages" but in more manufacturing scenarios. Now we often talk about flexible production, meaning high-mix, low-volume scenarios with small batches and many product variants, which requires what we call software-defined hardware. That was Google X's original intention.

Google knew its strength lay in software, so they wanted to build a universal robot operating system that was hardware-agnostic, applicable to all hardware. Software engineers could sit in offices programming and learning. After learning, we had our own simulation framework to verify in simulation whether robots could complete assembly or other valuable physical labor. After verification, we could deploy to real-world scenarios, abstracting out the hardware development process so that more and more work could be done through software, achieving repeatability and scale. That was Google X's logic — the entire plan was extremely ambitious.

I think only large companies could do this. In the US, probably only Google and a handful of others could; the same is true domestically. For smaller startups, this would be very difficult because the cycle is so long.

On one hand, there's the technical question of whether software abstraction of the physical layer can be achieved, starting from the motor layer or even current commands, gradually abstracting to the top level, writing software frameworks for data transmission or control.

On the other hand, pushing this through the entire industrial chain is very complex. Industrial production includes the primary, secondary, and even parts of the tertiary sectors — the entire chain is established. In the traditional chain, closest to the user are system integrators, who play an important role: procuring components from multiple upstream suppliers, adding some software themselves, even defining some hardware themselves, then integrating into customer projects. On the factory side, there are IT departments or production departments responsible for interfacing and ensuring smooth operation. There are many links in this chain, many middle layers sharing the benefits.

If we have a new framework or system wanting to enter this chain, it's equivalent to splitting the cake with existing participants — we must prove we can provide greater value, that we can expand the entire market's cake. From a commercial deployment perspective, this is also a relatively slow process requiring massive upfront investment, but I believe it will eventually be achieved; how long it takes depends on industrial inertia.
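The layered abstraction described above, from motor-level current commands up to hardware-agnostic task software that is verified in simulation before deployment, might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Intrinsic's actual design: the class names `JointDriver`, `SimulatedJoint`, and `PositionController`, the function `run_task`, and the toy dynamics are all hypothetical.

```python
from abc import ABC, abstractmethod


class JointDriver(ABC):
    """Lowest layer: hypothetical vendor-specific interface to one motor."""

    @abstractmethod
    def set_current(self, amps: float) -> None: ...

    @abstractmethod
    def read_position(self) -> float: ...


class SimulatedJoint(JointDriver):
    """Toy stand-in for a simulator: position shifts in proportion to current."""

    def __init__(self) -> None:
        self.position = 0.0

    def set_current(self, amps: float) -> None:
        self.position += 0.1 * amps  # made-up dynamics, for illustration only

    def read_position(self) -> float:
        return self.position


class PositionController:
    """Middle layer: turns a position target into current commands."""

    def __init__(self, driver: JointDriver, gain: float = 5.0) -> None:
        self.driver = driver
        self.gain = gain

    def move_to(self, target: float, steps: int = 50) -> float:
        # Simple proportional loop: command current in proportion to the error.
        for _ in range(steps):
            error = target - self.driver.read_position()
            self.driver.set_current(self.gain * error)
        return self.driver.read_position()


def run_task(controller: PositionController, waypoints: list[float]) -> list[float]:
    """Top layer: a task is just a list of waypoints, with no hardware details."""
    return [controller.move_to(w) for w in waypoints]
```

In this sketch, `run_task` never touches a motor directly, so the same task code could in principle run against `SimulatedJoint` first and against a real driver later, provided the real driver implements the same two methods.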

▎If Today's Technical Progress Had Been Achieved Earlier, How Would Google X and Vicarious Have Been Different?

Li Feng: Looking back at your work on the Make part of robotics at Google X, and before that at Vicarious, which wanted to use robots to prove AGI in the previous wave of AI, both faced their own challenges and goals. If large models, the Transformer architecture, and other software and hardware technologies had already existed back then, would Vicarious's robots or Google X's Intrinsic project have turned out differently?

Lian Wenzhao: This is an especially difficult question, but we can do a thought experiment. Many science fiction works change one factor while holding everything else fixed, when in reality other factors would change too. For example, we might see robots running around a household in 2050 while the father still reads a printed newspaper in the morning. Changing only one element may not allow for very accurate, objective speculation.

Li Feng: Of course.

Lian Wenzhao: If we assume Vicarious had possessed today's visual large model technology back then, work in object detection and segmentation would have been much simpler. We wouldn't have needed massive data annotation and training; technology deployment would have been much easier. At the time, for each new third-party logistics task, we needed to train new models, requiring enormous human and material resources.

For some objects, deploying an untuned model directly might only achieve 70% effectiveness; without tuning to 99% precision, it couldn't be delivered. This meant we needed to deploy large numbers of PhDs and engineers on-site for debugging — the cost was enormous, and commercially unviable, because we were essentially turning a product into a customized project. But now with vision-based large models, this deployment work could save massive costs.

However, there isn't a one-size-fits-all solution on the hardware side: different objects require different end effectors. Getting a robot to grasp an umbrella versus a phone requires different "hands." There currently isn't universal hardware adaptable to all scenarios. So custom hardware may be a future development direction.

Li Feng: By custom hardware, do you mean the robot identifies an object and then autonomously swaps to an end effector suitable for grasping that object? Or after identifying the object, adjusting the control logic inside the mechanical control components, such as sensitivity, precision, and force? Simply put: is it swapping hands or swapping control methods?

Lian Wenzhao: In reality, both are necessary. At the hardware level, we've always faced a choice: build a Swiss Army knife-style collection of multiple tools, or build a hand that can use various tools. It's a dilemma. The Swiss Army knife approach is relatively straightforward — we can list out the right tool for each scenario and choose accordingly — but costs become difficult to control. The latter approach sounds appealing in theory: just build a good hand and it can adapt to countless scenarios. But this demands far more from algorithms and control systems.

If our algorithms were powerful enough to achieve artificial general intelligence, then designing a single piece of hardware that can operate any tool would be the ideal state. But that is very difficult to achieve.

Li Feng: If you were to start a company to solve this problem today, which approach would you choose? Swapping out different tools like changing hands, or giving the hand multiple different fingers — a screwdriver, bottle opener, drill bit, and so on?

Lian Wenzhao: Looking at the big picture, we absolutely need to build the hand. The reason it's called a humanoid robot is that our goal is to manufacture and use tools. That means we need to build a truly generalizable, versatile hand capable of operating various tools. Though this goal is extremely difficult in the near term, we're still pushing in this direction. Currently, the five-fingered hands we see might have six joints, and people are working hard to add more. But controlling so many degrees of freedom flexibly in multi-object, multi-contact environments remains far from ideal.

Li Feng: Setting aside tactile sensing or sensory issues, focusing purely on controlling dexterous hands, which factors are hardest to reconcile? For example, increasing joint count might outpace existing control systems, or improving joint sensitivity could limit joint force and bending degrees of freedom.

Lian Wenzhao: There are challenges across hardware, software, and their interaction. First, from a pure hardware perspective, a massive challenge is integrating multiple degrees of freedom within limited space. A humanoid robot might need around 30 motors, while we hope to integrate more than 10 degrees of freedom in the hand alone. This is technically extremely difficult, because we want the hand to achieve both fine manipulation and sufficient strength — not merely capable of picking up a paper cup, though even that is already quite hard.

But realizing these ideas within existing motor size constraints is a hardware challenge. I believe this will gradually be solved over time. The current reality is that even though we may already have humanoid hands with six degrees of freedom, their actual capabilities are weaker than our previous two-finger grippers.

Two-finger gripper operation is a simple grasping motion, like a better pair of chopsticks — easy to control and highly robust. But when we now try to control hands with more degrees of freedom, they need to coordinate with each other. Without effective algorithms, we're essentially degrading them into a pair of chopsticks, and worse ones at that.

At the software level, we need to consider how to plan and adjust joint commands in real time. For large-range movements, such as those of robotic arms, wheels, or wheeled legs, we don't need to make high-frequency decisions. Hand-object interaction is different: when human hands interact with objects, the brain doesn't need to be heavily involved, and many actions are reflexive. For instance, once you feel heat, your hand withdraws naturally, and only then do you consciously feel the burn. Or when twisting a bottle cap, when you feel resistance, your hand naturally reduces force or slows down. These are local decisions. Current algorithms struggle to achieve this kind of instantaneous response.

The third point is the interaction between software and hardware — a chicken-and-egg problem. At Figure, I was building the hand while developing algorithms, like flying a plane that wasn't finished yet — flying, repairing, and designing how to adjust its steering and speed all at once. The process was quite painful, but I believe once completed, its value will be enormous.

▎Figure AI's Vision Hasn't Changed Much; What Changed Is How People See It

Li Feng: Since you mentioned Figure, from an outside perspective the company went from obscurity to overnight fame this year. During this process, did you sense any obvious changes or qualitative leaps in the company?

Lian Wenzhao: I think Figure's overall strategy has been relatively continuous and consistent. From the company's founding to now, its vision hasn't changed much. What changed more is the outside world's view of this field and this company.

Figure had strong hardware genes from the start. CEO Brett Adcock previously founded Archer Aviation, a vertical takeoff and landing aircraft company that successfully went public. He's very sensitive to and passionate about machinery and electronics. So early on, Figure attracted top experts in drive systems and motors from companies like Boston Dynamics, Tesla, and Apple. Some Figure colleagues had worked on Boston Dynamics' early Atlas humanoid robot, Spot robot dog, and Stretch logistics robot — they have rich experience in hardware innovation. But while developing the hardware platform, Figure was also determined to integrate AI technology to differentiate from historical achievements. So Figure's core was to use the hardware platform as a foundation, gradually integrating AI capabilities, perception capabilities, and learning capabilities, growing step by step.

But Figure has always been a commercial company, so we wanted to find scenarios with near-term commercial viability, such as logistics and factories. As society and capital paid more attention to this industry, more resources began flowing in. Many strategic partners — OpenAI, Microsoft, Google, as well as chip manufacturers, cloud computing companies, and large model developers — started collaborating with Figure, combining their technical capabilities with our local robot platform to amplify its value.

Li Feng: Figure has received investments from many prominent individuals and institutions over the past year or so. With the development of large models and this recent round of notable funding news, after you left, is Figure's current progress and development noticeably different from before?

Lian Wenzhao: I haven't been gone very long, and colleagues still communicate frequently. Within what can be disclosed, I believe that with resources, Figure can certainly do more.

Robot R&D in the US is extremely costly. While not as expensive as chip manufacturing, each hardware iteration requires substantial time and capital investment. We've done many experiments and development efforts. Software development experiments can fail without consequence, but hardware experiments are quite expensive. So when these resources became available, iteration speed could increase dramatically. It's the same for large models and computing power — Figure has more resources to collect data and iterate, and this acceleration is very obvious.

Li Feng: Regarding Figure's recently released robot demo, compared to before, do you think it has changed remarkably?

Lian Wenzhao: I have to say the entire demo was fantastic. Figure does status updates every few weeks to objectively reflect the company's progress. The recent demo combined large models, mid-level imitation learning, and low-level whole-body control, demonstrating that the overall framework design can achieve very flexible tasks. I also noticed comments frequently mentioning the smoothness of movements. This is because we integrated upper-level, mid-level, and lower-level tasks with different frequency requirements, bandwidth requirements, and computational demands into one framework, defining their interfaces so they connect properly. This is no simple task, so Figure's progress is solid. We did a lot of work.
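To illustrate what running three layers at different frequencies inside one framework can look like, here is a minimal sketch. It does not describe Figure's actual system: the rates (1 Hz, 10 Hz, 100 Hz), the class `Hierarchy`, and the proportional update rules are all assumptions chosen to keep the example small. Each layer writes one value that the layer below reads, which stands in for the "interfaces" mentioned above.

```python
from dataclasses import dataclass, field

# Hypothetical rates, chosen only for illustration.
BASE_HZ = 100     # low-level control loop
POLICY_HZ = 10    # mid-level learned policy
PLANNER_HZ = 1    # high-level model


@dataclass
class Hierarchy:
    goal: float = 0.0      # written by the planner, read by the policy
    setpoint: float = 0.0  # written by the policy, read by the controller
    state: float = 0.0     # written by the controller
    calls: dict = field(
        default_factory=lambda: {"planner": 0, "policy": 0, "control": 0}
    )

    def plan(self) -> None:
        self.goal = 1.0  # placeholder for a slow, expensive decision
        self.calls["planner"] += 1

    def policy(self) -> None:
        # Move the setpoint halfway toward the current goal.
        self.setpoint += 0.5 * (self.goal - self.setpoint)
        self.calls["policy"] += 1

    def control(self) -> None:
        # Fast proportional tracking of the current setpoint.
        self.state += 0.2 * (self.setpoint - self.state)
        self.calls["control"] += 1

    def run(self, seconds: int) -> None:
        for tick in range(seconds * BASE_HZ):
            if tick % (BASE_HZ // PLANNER_HZ) == 0:
                self.plan()
            if tick % (BASE_HZ // POLICY_HZ) == 0:
                self.policy()
            self.control()  # the fastest layer runs on every tick
```

Calling `Hierarchy().run(2)` simulates two seconds: the planner fires twice, the policy 20 times, and the controller 200 times. Because the fast loop keeps tracking between the slower updates, the state changes in small steps rather than jumping each time a slower layer produces a new output.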

Li Feng: From your perspective, over the past 10 months or year, has Figure's development exceeded your expectations?

Lian Wenzhao: Objectively speaking, from a technical development perspective, achieving tasks of this complexity was within expectations and foreseeable. From a funding perspective, it definitely exceeded expectations, because people's perception of and interest in Figure changed dramatically.

Li Feng: If this major funding round had happened right when you were considering returning to China, would you still have resigned?

Lian Wenzhao: The answer is simple: absolutely yes. I believe people don't regret choices they've made, but rather choices they didn't make. When I made the decision to return, I had already considered the worst-case possibilities. I cared more about what things would look like if done well, what role I could play in it, and what difference my participation or absence would make.

Taking Figure as an example, the Bay Area has very high talent density. With abundant resources, the company can naturally attract more talent. I left, and the company continues to develop, and quite well. But after returning to China, I can do more impactful things.

The humanoid robot field seems hot right now, but viewed through the long lens of history, it's still in its early stages. Many technical paths, commercial scenarios, maturity levels, and even societal roles have yet to converge. The road ahead is long, with many decision points along the way, and I believe the choices made at each one will lead different people down different development paths.

I hope to rely on my accumulated experience, knowledge, and decision-making to carve out a distinctive path in robotics. I tend toward exploring a direction I can control myself, guiding a project to gradually advance in an unknown field. Like playing World of Warcraft — when you face a completely blacked-out map, you can only see a small portion in front of you. But the process of searching for light within this unknown and darkness is itself full of meaning.

China vs. US Robotics Industry: Where Are the New Startup Opportunities?

Li Feng: Since returning to China, how would you compare China's and America's robotics industries across dimensions like software, hardware, algorithm models, and control?

Lian Wenzhao: China's and America's robotics industries have fundamentally different DNA. There's no doubt that China's hardware manufacturing capabilities are extremely powerful. Whether in production speed or cost control, China has clear advantages.

Our robotics industry can easily amplify this advantage. Robotics is still in an early development stage requiring massive R&D and iteration. If I can iterate every two weeks or every month, versus every six months or every year, the speed of progress is naturally vastly different. This is similar to OpenAI training large models — they use hundreds or thousands of GPUs for training, while we might use only dozens. This speed difference is enormous.

But in talent accumulation, we must acknowledge there's a certain gap. America has deeper talent reserves in mechanics, electronics, control, algorithms, and other fields. This talent includes not only engineers with hands-on experience working on automation equipment and traditional robotics projects, but also people from emerging fields like imitation learning and reinforcement learning.

In China, people only gradually begin entering a field once demand has been discovered in that industry. For example, some transition from autonomous driving, others from new energy, and these people may encounter difficulties when entering a new domain. But I believe our talent reserves will improve quickly. After all, China has a demographic dividend in engineers and scientists. At least in talent quantity, we have clear advantages.

Additionally, at the government level, the support China can provide is very substantial. This has already been proven in the new energy vehicle sector. Government subsidies and encouragement are crucial for driving industry development — without early government incentives, this would have been very difficult to get off the ground.

In the US, this scenario is unlikely — at least for now, in the humanoid robot industry. The US leans more toward market incentives: factory workers are getting more expensive, so can we use robots to replace workers making $17 an hour? This logic demands more from the robotics technology itself, and will likely drive adoption more slowly.

In China, if the government pushes humanoid robot development, things will move very fast. Currently, two main problems hold humanoid robots back: first, costs are too high; second, capabilities aren't strong enough. Right now we're discussing how to achieve a scaling law: the more data, the stronger the robot's capabilities. But where does the data come from? Only when robots are widely used can we accumulate enough of it. This creates a vicious cycle: no data means no capability, no capability means no cost reduction, and without lower costs there's no widespread adoption. But if an external force such as government subsidies steps in to boost production volume, costs will come down. Once costs drop, we can promote large-scale adoption, which lets us accumulate more data, which improves robot capabilities, which further increases usage. This is a huge advantage for us domestically.
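The loop described above (volume lowers cost, lower cost raises adoption, adoption yields more data) can be sketched as a toy simulation. This is only an illustrative sketch: the learning-curve form, the demand rule, and every number below are hypothetical assumptions, not figures from the interview, and cumulative units deployed stands in for accumulated data.

```python
from typing import List


def simulate_unit_cost(periods: int, subsidy: float) -> List[float]:
    """Toy cost/adoption feedback loop; returns the unit cost in each period.

    All parameters are hypothetical and chosen only to show the direction
    of the effect, not its real-world size.
    """
    cumulative_units = 1.0     # units deployed so far (proxy for data collected)
    initial_cost = 100.0       # hypothetical cost of the first unit
    learning_exponent = 0.3    # hypothetical: cost ~ cumulative_units ** -0.3
    costs: List[float] = []
    for _ in range(periods):
        cost = initial_cost * cumulative_units ** -learning_exponent
        costs.append(cost)
        price = max(cost - subsidy, 1.0)  # what the buyer actually pays
        demand = 100.0 / price            # hypothetical: cheaper robots sell more
        cumulative_units += demand        # more deployment, more data
    return costs


without_subsidy = simulate_unit_cost(periods=10, subsidy=0.0)
with_subsidy = simulate_unit_cost(periods=10, subsidy=50.0)
print(f"final unit cost without subsidy: {without_subsidy[-1]:.1f}")
print(f"final unit cost with subsidy:    {with_subsidy[-1]:.1f}")
```

Under these assumptions the subsidized run always ends with a lower unit cost, because the early demand boost compounds through the learning curve. The sketch does not model robot capability separately.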

The US advantage lies in cutting-edge research — whether in large models, underlying control technology, or hardware design, they are indeed ahead of us. We must acknowledge that in original innovation, going from zero to one, the US remains in the lead. Right now it's hard to say which technology will ultimately win out, because the technology hasn't converged yet. Whether it's imitation learning, simulation learning, or reinforcement learning, various approaches are still being tried.

Li Feng: That's exactly right. Drawing on your experience and understanding of both China and the US, as well as current industry developments, if you were to choose a direction to start a company in China today, which areas do you think offer greater opportunity? Or put another way, where are the easiest points to break through and establish advantages?

Lian Wenzhao: A broad and simple answer can be summarized as "technology-driven, industry-anchored." Humanoid robots are still fundamentally a technology-push domain. We need to solve many technical problems; the complexity of the entire system involves sensors, hardware, algorithms, various modules, even software-level challenges. So this is a technology-driven field — if your technology can achieve absolute leadership, you can maintain competitive advantages for a relatively long time. This is a key premise.

On the other side, speaking of anchoring: no matter how humanoid robot technology develops, it ultimately needs to create value. How to create value depends on whether you can find a match point between technology and market demand as early as possible — "tech-market fit." This is crucial; the faster you find this fit, the better.

It's true that the US has excellent technological advantages, but China has massive industrial chains and market demand — in new energy and logistics, for example, there's virtually no second country in the world that can fully take this on. If we can develop some effective technical solutions, we can drive upgrades and development in these industries.

What Makes Robot Upper Bodies So Difficult?

Li Feng: Let's move on to some specific questions. Last time when chatting with Professor Zhang Wei, we joked that he's an expert in robot lower bodies. You've mainly focused on robot upper bodies, especially hand-brain coordination, dexterous hands, and end-effector manipulation. If we look at humanoid robots holistically and simply divide them into upper and lower bodies, the lower body locomotion has seen some positive developments and progress. So where does upper body technology stand today?

Lian Wenzhao: I agree with this upper-lower body distinction. For the upper body, especially hand manipulation, the situation is much more complex. Hands must interact with the environment; they need to transfer objects from an initial state to another state, which requires forceful contact. We even need to actively explore these contacts rather than treating them as disturbances. Therefore, for hands, we need to pay attention not just to objects' geometric properties but also their semantic properties — what are these objects, what are they used for? This involves so-called semantic meaning or affordance, requiring deeper semantic understanding. This connects to the commonsense and intuitive physics we discussed earlier, involving language-level understanding — this is the more challenging aspect.

Li Feng: For example, if there's a round object on a table, we need to determine whether it's a decoration or a plate for food.

Lian Wenzhao: Right. A more intuitive example: after we finish eating, there's a lot left over — do we throw it out or wrap it in plastic film and put it in the refrigerator? This kind of judgment requires deep semantic understanding of objects.

Li Feng: Now that we're promoting conservation, we'd definitely wrap it and put it in the fridge. So currently, robot upper bodies, especially hand manipulation, still face many specific challenges and problems. I think we can divide this into two issues. The first is, as you just mentioned, the human body uses reflex arcs to adjust balance. So for the upper body, especially dual-arm or multi-arm coordination, how can we use two arms simultaneously while maintaining immediate balance?

Lian Wenzhao: If we break down part of the upper body problem, such as movement and obstacle avoidance, this is essentially equivalent to controlling the lower body: we want to use minimum force to maintain stability or walk faster. If we only consider the upper body's spatial movement capability, the nature of the problem is the same. Take the electrified maze game as an example: we need to guide a ring through without touching the live wire. Tasks like this can be abstracted; the principle is similar to controlling the lower body, which is to make the robot's control precise enough to avoid touching obstacles. As for how two arms coordinate in such settings, this problem can be solved. But in unknown environments, things become much more complex, requiring coordination analogous to that between the cerebrum and cerebellum.

Li Feng: For example, a robot needs to shake a test tube with one hand while pouring a specific amount of reagent with the other, without spilling, while ensuring this action doesn't affect the shaking of the test tube.

Lian Wenzhao: If the environment is fixed, like shaking a bottle, this kind of task is relatively simple because it's essentially repetitive motion. In a complex situation, where the robot must assess state in real time and adjust coordinated dual-arm movements accordingly, the robot needs to know the target position; even if the target isn't clear, it needs to estimate the next target position. This is actually much harder. The analogy in the lower body is that predicting where your foot should land at the next moment to keep balance is a different thing from knowing exactly where to place your foot and placing it there.

Li Feng: So what is the current level of bimanual coordination? What do we hope to solve but haven't yet?

Lian Wenzhao: Uncle Feng is now asking about the academic frontier. Bimanual coordinated manipulation is a topic that academia has begun paying attention to again in recent years. In industrial robotic arm applications, dual-arm coordination in unstructured environments is rarely involved. For example, in automobile manufacturing, there may be multiple robotic arms working simultaneously, but their trajectories are offline-planned fixed trajectories, not adjusted in real-time online.

Achieving multi-arm coordination in unknown environments remains an academic challenge, because the action space of the upper limbs is very large and requires real-time state estimation and coordination across multiple degrees of freedom.

Li Feng: Let me give another specific example: how to put a giraffe in a refrigerator. This process can be divided into four steps: open the refrigerator door, take out the elephant, put in the giraffe, then close the refrigerator door. Is this kind of coordinated process theoretically feasible?

Lian Wenzhao: Large models currently do fairly well at this high-level task decomposition. There are many examples of this kind of data online — for instance, there are usually tutorials guiding how to complete certain tasks, or instruction manuals that detail every step of an operation, like "how to open a phone packaging box," "how to repair a computer." So large models perform relatively well at task decomposition; the harder part is how to actually implement each step.

Li Feng: Returning to the industry level, let's set aside the dual-arm collaboration issue for now. Has there been progress in single-arm control, degrees of freedom, dexterity, and so on?

Lian Wenzhao: It depends on the standard. For single-arm manipulation, if the environment is semi-constrained, the problem is relatively easy to solve. But in dynamic environments, to my knowledge real-time motion planning has not yet been truly deployed in complex scenarios. Parcel sorting is probably one of the most challenging scenarios that can actually be done today.

Li Feng: So in the next two years, is it possible for us to achieve upper-arm and forearm operations in larger dynamic spaces?

Lian Wenzhao: Take Google X as an example: we developed furniture assembly robots for partners. The robot needed to visually estimate the state of various components (for example, the panels of a cabinet, or the backrest, armrests, and wheels of a chair) and plan in real time how to grasp and assemble them. For instance, when the robot grasps a handle, its grip must not block the spot where a screw needs to be tightened. This involves many mid-level and low-level decisions, so it's very difficult.

Li Feng: So the chair assembly project you just mentioned wasn't done following a manual, but rather the robot dynamically estimated and operated on its own?

Lian Wenzhao: When we implemented this function, large models hadn't emerged yet, so we did have a manual; the steps were known. The greater challenge at the time was how to implement each step, because furniture components are large and heavy, with high precision requirements — for example, aligning screws with nuts. This was very challenging.

Li Feng: Understood. So, what level has robot hand technology reached now?

Lian Wenzhao: Currently, the technological path for hands hasn't fully converged either, though fortunately consensus is forming — developing a multi-fingered dexterous hand, not just a simple two-finger gripper.

Two-finger grippers are already widely used in industry, but suction cups are actually even more common, because the decision-making for a suction cup is very simple: just one contact point, described by two or three coordinates depending on your constraints. Once you move up to two fingers, you add at least one rotation angle, and the problem becomes more complex. With five fingers, the manipulation space becomes much larger, and since real-time adjustments are also needed, this is where algorithms currently struggle.

At present, on one hand algorithms haven't fully caught up; on the other hand, there aren't good hands available. For example, Shadow Robot developed a hand with 20 degrees of freedom quite early on. This hand is large and mainly used by research institutes and university labs. Usually they buy at least two of these hands — one in use in the lab, and the other either being shipped to Shadow or shipped back, because it's broken. After all, once you increase complexity, reliability inevitably decreases. So everyone is still in the exploration phase; they haven't reached the stage of focusing on reliability, but rather are hoping to make it work first.
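To make the growth in manipulation space concrete, here is a minimal sketch that counts candidate actions when each degree of freedom is discretized into a fixed number of bins. The bin count and the degree-of-freedom figures for the suction cup and the two-finger gripper are simplifying assumptions for illustration; only the 20-degree-of-freedom figure comes from the discussion above.

```python
def discrete_action_count(degrees_of_freedom: int, bins_per_dof: int) -> int:
    """Number of distinct commands if every degree of freedom is cut into equal bins."""
    if degrees_of_freedom < 0 or bins_per_dof < 1:
        raise ValueError("need degrees_of_freedom >= 0 and bins_per_dof >= 1")
    return bins_per_dof ** degrees_of_freedom


BINS = 10  # hypothetical resolution per degree of freedom

# Degree-of-freedom counts are simplified assumptions, except the 20-DOF hand.
end_effectors = {
    "suction cup (3D contact point)": 3,
    "two-finger gripper (point + one rotation)": 4,
    "20-DOF dexterous hand": 20,
}

for name, dof in end_effectors.items():
    print(f"{name}: {discrete_action_count(dof, BINS):.0e} candidate actions")
```

The count grows exponentially with the number of degrees of freedom, which is one way to see why exhaustive search stops being practical for multi-fingered hands. Real controllers work in continuous spaces, so this is only a rough intuition.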

The "Impossible Triangle" in Robotics

Li Feng: For hand design, if we consider convenience, energy consumption control, complexity, possible task scenarios, and interaction with humans, how many hands do you think a robot should ultimately have, and how many fingers on each hand would be reasonable?

Lian Wenzhao: That's a very good question. Theoretically speaking, if we ignore power consumption for a moment, more hands are better, because sometimes two hands genuinely aren't enough — you might need a third hand to help. But in practice, two hands can usually cover 90% or even more than 99% of use cases, which may already be sufficient. Of course, we could add more hands, but whether that's necessary still needs further discussion.

In current technology and application scenarios — such as factories, logistics, and services, especially elder care which we often discuss — two hands can typically meet the demand.

As for the number of fingers, four fingers may be enough in many situations, because in real life we rarely use our pinky fingers.

For example, when assembling furniture, we only used grippers and suction cups. We also once partnered with another company to do TV testing. The TV production process itself is already highly automated, but at the end, several people are needed to plug HDMI cables, power cords, and USB cables into the back of the TV, then run some tests to check for dead pixels and color accuracy. Although this requires relatively high precision and operates at a fairly fast pace, and we hoped to achieve some degree of generalization, at the time we only used two-finger grippers.

If we need to handle dozens or hundreds of wire connectors, a two-finger gripper won't be enough — we'd need to add more fingers. Since we're adding anyway, we might as well add several more. With multiple fingers, we might even be able to operate an electric screwdriver if necessary.

Li Feng: It sounds like upper limbs are indeed more difficult than lower limbs at this stage. Last time, Professor Zhang Wei mentioned that with the help of large models, reinforcement learning can play a significant role in lower limb locomotion. It's like a child learning to walk — walk well, and you get a lollipop from mom and dad; trip and fall, and you learn a lesson too. So for upper limb training, are there any technical breakthroughs currently? Especially in fine manipulation and coordinated movement, are there new algorithms or models that can help us better train robot upper limbs?

Lian Wenzhao: On the algorithm side, we have indeed made some encouraging progress. The example you just mentioned is quite good — when a child learns to walk, they may fall, but gradually they learn, and this is achieved through reinforcement learning.

Upper limb operations may be mixed — partly reinforcement learning, but more so imitation learning. The more complex the task, the slower the learning progress if we try to practice through reinforcement learning or trial-and-error.

To some extent, the less fundamental the task (that is, the more it moves toward scenarios, applications, and specific tasks), the greater the proportion of imitation learning. Honestly, in these cases, we often don't have the time, energy, or patience to let children learn slowly through trial-and-error and reinforcement. Sometimes, as parents, we can't help but teach them, demonstrating exactly how to do it. Children have remarkable imitation abilities; they can absorb information from their parents' and friends' demonstrations. They have a kind of empathetic capacity: when they see how a third party does something, they map it onto themselves and think about how to control their own arms and hands to accomplish the task.

So reinforcement learning is better suited for foundational capabilities — for example, how to get from point A to point B, walking stably and well, with low power consumption and safety. These foundational capabilities, which we might call "reflex arc abilities," are well-suited to reinforcement learning. But once we want to execute specific tasks, dealing with environmental uncertainty and semantic-level unknowns, imitation learning is probably the more suitable strategy.

Li Feng: So what challenges and problems does imitation learning face in practice today?

Lian Wenzhao: That's a very good question. Under current cost constraints and hardware-software limitations, robots struggle to achieve three goals simultaneously. The first is success rate or reliability — can we achieve 99% or even more nines of reliability? The second is speed — we want robots to complete tasks quickly, hoping they can be as fast as humans, or even faster. The third is generalization — we want robots to handle multiple tasks, rather than reprogramming or designing new tools for each task.

Li Feng: This is like the "impossible triangle" in robotics.

Lian Wenzhao: Exactly. Applying this to the field of imitation learning, let's carefully consider the example of Stanford University's shrimp-cooking robot.

Li Feng: Not savoring the shrimp itself?

Lian Wenzhao: The success standard for cooking shrimp is relatively lenient, because shrimp can actually be eaten raw. But if we demand higher success rates and precision, we need more data and more complex models.

However, in this regard, we've also seen some positive results. We once had a robot learn through imitation learning to insert a AA battery into a battery holder. Battery holders are usually tight, with springs, and the space is very cramped. We had a human demonstrate how to press down the spring, then twist and turn slightly, and finally push the battery into the holder.

Even for such tedious, fine, and complex movements, the robot can learn from a human perspective and generalize the learned strategy to other batteries of different shapes, colors, and sizes. This shows that while maintaining certain reliability, we can also achieve some degree of generalization. So I believe the imitation learning approach still has quite a few more low-hanging fruits to pick, and the ceiling is still very high.

Li Feng: Returning to an earlier question we mentioned — does human intuitive physics or basic rules affect imitation learning outcomes?

Lian Wenzhao: It definitely does. An intuitive example: when we discuss which is heavier, a pound of cotton or a pound of iron, intuitively we might think iron is heavier. That is the wrong answer, yet in a sense it's also right. If we're walking on the road and both cotton and iron fall, and we must choose one to be hit by, we'd definitely prefer cotton.

Such examples are also common in robotics. In complex multi-object environments, robots need to decide where their hands or arms should move, and need to explore which objects can tolerate contact. This is very subtle knowledge that we use daily. In imitation learning, how to teach this knowledge to robots is particularly important.

We often discuss how robots can interact with or accompany the elderly in completely open environments, or care for children. Executing these tasks requires not only understanding geometric meaning but also semantic-level information. Only by achieving this level of generality, having common sense, can robots truly interact effectively in unknown environments.

Li Feng: From what I've heard, from a business perspective, developing upper limb coordination and dexterous hand manipulation is currently quite meaningful. At the same time, considering the impossible triangle, we should use imitation learning to achieve better control, but preferably in environments that aren't completely open (that is, in semi-closed or fully closed scenarios), performing operations with some degree of generality.

We invested in an orthopedic robotics company that basically does carpentry work. Once a person's leg is fixed, the remaining problem is that it needs to drill a hole first, then chisel out the bone that needs replacing at a certain angle, and then install the artificial bone at a certain angle and secure it firmly. This basically belongs to a completely closed environment, because the leg cannot move — if the leg moves, they'll drill in the wrong place.

Lian Wenzhao: This is an excellent summary. In the impossible triangle, if we start from practical application and want to achieve some results, we must look at which vertex of the triangle we can compromise on.

For example, in some cases, speed may not be a critical requirement — we can accept a robot working in a space for 24 hours, as long as it completes the task, without caring about speed. A counterexample to this would be logistics. Vicarious once worked on high-speed sorting and packing in logistics, where we faced the challenge of completing 1,200 operations per hour, one every 3 seconds, with no time for large models to train or predict various situations.
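The throughput figure above implies a hard per-operation time budget. The sketch below works out that budget and checks whether a set of stage latencies fits inside it. The stage names and latencies are hypothetical values for illustration, and the stages are assumed to run one after another rather than in parallel.

```python
from typing import Mapping


def cycle_time_seconds(operations_per_hour: float) -> float:
    """Seconds available per operation at a given hourly throughput."""
    if operations_per_hour <= 0:
        raise ValueError("operations_per_hour must be positive")
    return 3600.0 / operations_per_hour


def fits_budget(stage_latencies_s: Mapping[str, float], operations_per_hour: float) -> bool:
    """True if the stages, run serially, finish within one cycle."""
    return sum(stage_latencies_s.values()) <= cycle_time_seconds(operations_per_hour)


budget = cycle_time_seconds(1200)  # throughput quoted in the interview
print(f"cycle budget: {budget:.1f} s per operation")

# Hypothetical stage latencies in seconds, for illustration only.
stages = {"perception": 0.4, "planning": 0.3, "motion": 2.0}
print("baseline fits:", fits_budget(stages, 1200))

# Adding a hypothetical 1.5 s large-model inference step to every cycle.
slow = {**stages, "large_model_inference": 1.5}
print("with large model fits:", fits_budget(slow, 1200))
```

At 1,200 operations per hour the budget is 3 seconds, so under these assumed latencies any slow inference step in the loop pushes the cycle over budget.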

Another vertex is reliability. In some scenarios, the cost of error is very high — autonomous driving is an extreme example where errors are absolutely not allowed. We can also find scenarios that are relatively friendly to robots or where the cost of error is not high, such as pouring water, where pouring a bit more or less doesn't matter much.

The third vertex is generalization, which is the most flexible and friendly standard. As you said, it's a controllable or semi-controllable range. For example, having a robot hammer nails at a fixed station in a factory — this task is extremely controllable, while there are also many extremely uncontrollable scenarios and tasks. We can find a suitable and comfortable middle ground that happens to fit our current imitation learning methods to achieve the desired generalization.

For example, air conditioning compressors come in different models and sizes — can we generalize across them? Also, their locations or environments — can we generalize across those? There are many similar dimensions where generalization can be achieved, and we can usually find adapted methods.

Li Feng: This is quite interesting — it reminds me of autonomous driving technology in recent years. The algorithms were impressive, but they struggled to land in practice. After countless attempts and failures, people gradually realized that in the near term, the most achievable autonomous driving application scenarios were mainly in semi-closed or fully closed environments like ports, mining areas, and campuses. Ports and mining areas tend to have few people, with fixed routes and clear traffic rules; campuses are similar, the difference being more people, but most are on sidewalks. Automatic parking is another scenario — when most cars aren't moving, it plans a route and parks the car in an available empty space.

Lian Wenzhao: This is an excellent analogy. Indeed, when autonomous driving technology first emerged, everyone wanted to build robotaxis that could go anywhere. But as time went on, people found this extremely difficult. Now, the technology is actually developing very quickly, and we're seeing the gradual adoption of autonomous driving, for example with Tesla's FSD V12 and Waymo's efforts in San Francisco and the Bay Area. These developments show that autonomous driving technology is gradually maturing.

However, for researchers and practitioners in this industry, a considerable challenge is the need to constantly make choices and adjustments, because the technology iterates very quickly, and in the face of visible progress, we need to carefully study and assess whether it can truly deliver on its promises. If you believe that existing data and computing power are still far from touching the ceiling of these technologies, then we can foresee that technology will improve rapidly.

An Unprecedented Opportunity

Li Feng: Has the rising interest in humanoid robots had any personal impact on you, beyond meeting more investors and industry professionals? Any other changes?

Lian Wenzhao: For those of us doing technology R&D, this is a very good time. Unprecedented — who knows if there will be another like it.

Li Feng: There definitely will be. Major developments like this, as we just discussed with autonomous driving, will certainly go through many cycles of bubbles rising and bursting. In this spiral development process, their sustainability will increase, technological progress will accelerate, and the scope of application will expand.

Lian Wenzhao: This time I really saw this excellent opportunity, so I made up my mind to return from the Bay Area to do something. I believe robotics and AI are less of a scientific problem and more of an engineering problem. Since it's an engineering problem, we definitely hope to be as close to application as possible, and to create value in the real world, improve productivity, and raise people's living standards.

If we can close the loop on these key points, it will provide the clearest possible signal for technological development. We need to continuously adapt our technology to meet these demands. That's why I hope to stay closer to industry and achieve real-world deployment. As mentioned earlier, "technology-driven, industry-anchored" — only through actual deployment can we abstract better scientific questions and drive technological progress through engineering challenges. Personally, I also hope to create value that carries genuine social significance in this area.

The Outlook for Humanoid Robots

Li Feng: When we started today's conversation, we took a detour through the history and development of robots. If we draw an analogy between where humanoid robots stand today and the starting point of first-generation industrial robots 60 or 70 years ago, what do you think the global landscape for humanoid or intelligent robots will look like 10 to 20 years from now?

Lian Wenzhao: I'd even expand this question beyond national boundaries — we should be thinking about interplanetary space, like Mars and other planets. Of course we wouldn't want to send millions of people to build environments on Mars; we'd send humanoid robots instead. This would be tremendously valuable for all of humanity.

So regardless of country, in this era — setting aside war and other factors — the development of humanoid robot technology is extraordinarily valuable for humanity's expansion beyond Earth, for exploring other planets and even interplanetary space.

Coming back to concrete questions from this perspective: Should China develop robotics? Should the US? Clearly both are doing so now, because productivity keeps advancing.

Human beings essentially want to live well while doing as little work as possible, and we long for immortality. To achieve these goals, besides developing the robotics industry, we also need to develop controlled nuclear fusion, interplanetary transport, aerospace technology, and so on. We need limitless energy and productivity for production and services. Development in these areas is an inevitable trend. From the demand side, this is already abundantly clear.

From the supply side, whether our technology can keep up is the challenge facing those of us in this industry. As a researcher in this field, I feel deeply honored and fortunate to be able to participate in these endeavors during this era, to do practical and meaningful work.

Li Feng: Finally, a personal question from me. I really enjoy playing badminton — eight hours every week. Let's say I wanted a professional-level badminton robot as a training partner, not just one standing in a fixed position feeding and returning shots, but one with its own strategy and technique, that could play with me without completely exhausting me. How long do you think it would take to build such a robot?

Lian Wenzhao: Actually, from a technical standpoint, I'm quite optimistic. Google X previously worked on a table tennis robot project. It wasn't a humanoid robot — it used rails and a robotic arm, tracking the ball visually and using reinforcement learning algorithms to continuously update its strategy. The robot's actual performance was quite good, reaching training-partner level.

Of course, badminton does have higher requirements — more movements, faster decision-making, and greater strategic demands. But looking at the current technology stack, there aren't obvious bottlenecks. There's no "rocket science" problem that can't be solved. I believe robots are fully capable of accomplishing this task.

Li Feng: As a cliffhanger for our audience, and as an open question: If you were to begin industrializing, which direction would you choose as your first step in commercializing intelligent robots?

Lian Wenzhao: I'll give an open-ended answer. Based on the impossible triangle we discussed earlier, I think we should prioritize improving robots' generality and generalizability while ensuring reliability — I'm willing to sacrifice some speed.

At the same time, I'd want to choose scenarios with high demands for flexibility and adaptability. To give a counterexample: driving screws or hammering nails are tasks that demand speed and high reliability but little flexibility. I'd want to leverage my expertise in imitation learning to tackle application scenarios that remain unlocked precisely because they urgently require flexibility.

Li Feng: Thank you very much for sharing with us today. We now have a much better understanding of humanoid robot upper bodies. Combined with Professor Zhang Wei's introduction to lower-body progress in our previous conversation, it's as if we've connected the whole robot together. We've already invested in several companies focused on intelligent robotics, including Covariant, which does unordered sorting; Inspire Robots, which makes dexterous hands; and a company working on six-axis force sensors at the chip level. Going forward, we look forward to opportunities to share these companies' progress in their respective specialized directions.

Reader Engagement

Over the past year, what videos of new robotic advances have come across your feed, and which one left the deepest impression? Share with us in the comments. By 17:00 on April 26, the five most thoughtful commenters will each receive a FreeS Fund industry research handbook and a copy of The Age of Discovery.
