Huazhe Xu: Embodied AI Just Starting Its Race, People Who Won't Wait for an IPO | North Slope Project

May 28, 2026

🎙️ Episode Introduction

Episode 2 of North Slope. We sat down for an in-depth conversation with Huazhe Xu — founder of Poke Robotics and assistant professor at Tsinghua University.

Over the past two years, embodied intelligence has become one of the hottest sectors in tech, with the entire industry hurtling into a kind of "acceleration narrative." But in Xu's view, most of today's progress has only covered the "first kilometer" toward true generality. Many fundamental problems haven't even been seriously tackled yet: where large-scale data will come from, how systems generalize, how long-horizon tasks can be executed stably, and how complex physical interactions in the human world can be genuinely understood.

In a sense, embodied intelligence isn't a sprint-style technological explosion. It's more like an ultra-long-cycle marathon requiring continuous iteration. And rather than asking "when do we win," Xu cares more about when the thing he wants to do can finally begin.

"Doing what you truly want to do — that's the most expensive privilege."

Because what genuinely draws him in has never been results already achieved. It's the things not yet formed, the continuous venturing into the unknown, the relentless pushing of boundaries.

When the conversation turned to music, Xu mentioned his love for Beethoven's Eroica symphony. He finds its sense of rising up and fighting on, again and again, utterly captivating.

In some ways, he's that kind of person himself.

Never stop tinkering. Never stop fighting.

👤 About the Guest

Huazhe Xu: Founder of Poke Robotics; assistant professor and PhD advisor at the Institute for Interdisciplinary Information Sciences, Tsinghua University; head of the Embodied Intelligence Lab.

B.S. from Tsinghua University's Department of Electronic Engineering; Ph.D. from UC Berkeley; postdoc at Stanford University. Long focused on embodied intelligence, reinforcement learning, and robotics. In 2023, he helped incubate the embodied intelligence company Xinghai Tu, and is regarded as one of the representative young figures among China's new generation of embodied intelligence researchers and entrepreneurs.

🕒 Selected Timestamps

03:43 "Market conditions aren't the essence. Building the robot is the mission."

06:31 Embodied intelligence is like a marathon, but the industry may have only run the first kilometer

08:42 We believe in scaling laws, but they haven't been truly "proven" yet

09:59 What is true zero-shot generalization? "Send it to me, I'll run it and see"

14:04 VLA is "not elegant enough"

17:48 From reactive to model-based: why robots need to predict the future

30:10 What China's embodied intelligence researchers truly lack is "spirit" (xinqi)

35:50 Startup teams shouldn't Frankenstein themselves to "fit the market shape"

43:18 What the strongest researchers today really want is to build something GPT-level

46:55 Berkeley's academic atmosphere: anyone can just say "This is wrong"

56:51 A lab amplifies individual shape; a company amplifies organizational shape

01:00:00 Why "living organism" is preferred over "tool" as a description of people and organizations

📚 References

Figure's extended livestream: Starting May 13, 2026, the American robotics company Figure livestreamed its F.03 robot, running the Helix-02 model, continuously for 191 hours (about 8 days) as it autonomously sorted 238,000 packages. In test clips covering 318 sorting samples in total, the success rate was approximately 99.7%.

Generalist's "one line": Refers to the "data volume vs. model performance" curve that Generalist includes with each release. The upward curve, which shows more data leading to higher robot success rates, is currently the closest thing to visual evidence of a scaling law in robotics.

UMI (Universal Manipulation Interface): An open-source robot data collection solution released in 2024 by Stanford's Shuran Song team. A person collects data with a handheld gripper fitted with a camera, with no expensive robot hardware required. While maintaining millimeter-level end-effector precision, it dramatically reduces data collection costs, and is widely regarded as one of the most cost-effective forms of embodied data collection in the industry.

Visual SLAM (Visual Simultaneous Localization and Mapping): A technology that lets a device know in real time "where am I, and what do my surroundings look like" using only cameras. The system analyzes changes between consecutive video frames, simultaneously estimating its own motion trajectory and reconstructing a 3D map of the environment. It is a foundational capability for robot vacuums, AR glasses, autonomous driving, and similar scenarios.

"Action Head": The final layer of a robot model, a small network module specifically responsible for "outputting actions." It attaches to an already-trained vision-language large model, translating the model's understanding of the scene and instructions into concrete actions the robot can execute (such as joint angles, gripper open/close).

WAM (World Action Model): A paradigm formally named in a May 2026 arXiv survey paper (arXiv 2605.12090), proposed by teams including OpenMOSS, which unifies "predicting how the future world will change" and "outputting robot actions" within the same network.

AMI (Advanced Machine Intelligence): An AI company formally founded in March 2026 by Turing Award winner and former Meta Chief AI Scientist Yann LeCun, with $1.03 billion in seed funding and a $4.5 billion valuation.

JEPA (Joint-Embedding Predictive Architecture): A world model architecture proposed by Yann LeCun in 2022. Unlike mainstream large models that directly generate pixels or text, it has AI predict future world changes in an abstract latent space — learning only "what will happen," not "what it looks like" — considered a route closer to how humans understand the world.

Diffusion Policy: A robot action generation method proposed in 2023 by Columbia University's Cheng Chi, Shuran Song, and others. It adapts diffusion models to action generation, allowing robots to "gradually denoise from a cloud of noise and generate a smooth action sequence." Used together with Action Chunk, it is one of the dominant methods in current imitation learning.

Action Chunk: A key technique in robot imitation learning, proposed by Stanford's Tony Zhao and others in the 2023 ACT paper. The model outputs a complete sequence of dozens of future actions at once, which greatly improves action coherence. It is a core component of dominant methods including Diffusion Policy.
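As a rough illustration of the Action Chunk idea, the sketch below queries a policy once per chunk and then executes the returned actions without asking the policy again. It is only a toy under stated assumptions, not the ACT implementation: `toy_policy`, `toy_step`, the 1-D action, and the chunk size of 4 are all hypothetical stand-ins for a learned model and a real robot.

```python
from typing import Callable, List, Tuple

Action = float  # hypothetical 1-D action, e.g. a change in gripper opening


def rollout_with_chunks(
    policy: Callable[[float], List[Action]],
    step: Callable[[float, Action], float],
    obs: float,
    horizon: int,
) -> Tuple[List[Action], int]:
    """Run `horizon` control steps, querying the policy once per chunk."""
    executed: List[Action] = []
    policy_calls = 0
    while len(executed) < horizon:
        chunk = policy(obs)  # one query returns several future actions
        policy_calls += 1
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        for action in chunk:
            if len(executed) == horizon:
                break
            obs = step(obs, action)  # executed without re-querying the policy
            executed.append(action)
    return executed, policy_calls


def toy_policy(obs: float, chunk_size: int = 4, target: float = 1.0) -> List[Action]:
    """Hypothetical stand-in for a learned model: equal steps toward `target`."""
    delta = (target - obs) / chunk_size
    return [delta] * chunk_size


def toy_step(obs: float, action: Action) -> float:
    """Hypothetical environment: the observation simply accumulates actions."""
    return obs + action


actions, calls = rollout_with_chunks(toy_policy, toy_step, obs=0.0, horizon=10)
print(len(actions), calls)
```

With a chunk size of 4 and a horizon of 10 steps, the policy is queried three times instead of ten, which is the trade the glossary entry describes: fewer decisions per second in exchange for smoother, more coherent motion.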

Chaconne: Originally a triple-meter dance from 16th-century Spain, later evolving into the most solemn variation form of the Baroque era. The "Chaconne" Xu mentioned specifically refers to the final movement of Bach's Partita No. 2 for Solo Violin, BWV 1004, approximately 14 minutes, widely recognized as one of the spiritual pinnacles of the violin repertoire.

Symphony No. 3 in E♭ major, "Eroica": Beethoven's third symphony, composed 1803–1804, widely regarded as the watershed work marking the transition from Classicism to Romanticism. Beethoven originally dedicated it to Napoleon; upon hearing that Napoleon had declared himself emperor, he furiously tore up the dedication and renamed the work "Eroica" ("Heroic"). At roughly 50 minutes, its scale, emotional intensity, and structural complexity far exceeded the symphonic tradition of its time, fundamentally redefining the genre.

D960 (Schubert's Piano Sonata in B♭ major): Schubert's final piano sonata, completed about two months before his death in 1828. Roughly 40 minutes long, it is suffused with a calm contemplation of death, has been praised as "music that leads to heaven," and is considered one of the greatest works in the piano literature.

Der Steppenwolf (Steppenwolf): A 1927 novel by Nobel laureate Hermann Hesse, considered one of his masterpieces alongside Siddhartha and Demian. It tells of an intellectual who sees himself as half-man, half-wolf, alienated from bourgeois society, and follows his spiritual crisis and awakening. It is a 20th-century classic about loneliness and alienation.

🎵 Music

Jordan Critz - Beau Et Rapide (Piano)

Symphony No. 3 in E♭ major, "Eroica" — First Movement

🎤 Production Team

Host | Jinjian Zhang

Produced by | Oasis Capital

Editing & Production | Oasis Capital

💬 Join the Conversation

Assistant WeChat: VB20240606

Do you have a sense of your own shape? Do you have any good methods for calming yourself down and better hearing your own signals?

Leave a comment below. Three days after this episode is published, the top 3 most-liked commenters will receive an exclusive Oasis gift 🎁

Disclaimer

All investment-related content in this podcast is shared for discussion and reference only, and does not constitute any market prediction, judgment, or investment or consulting advice. Thank you for supporting original content! If you repost or quote content from this podcast, please credit the source, and contact Oasis Capital to obtain consent before reposting.

View the full episode transcript on Xiaoyuzhou