E12. Gao Yang: A Thinking Reed, A Machine That Acts | Beipo Initiative

April 8, 2026

🎙️ Episode Introduction

Since the start of the year, SpiritAI has closed two consecutive funding rounds totaling nearly 3 billion RMB.

Embodied intelligence is rapidly becoming consensus among capital markets and industry.

Yet as new hotspots and narratives keep emerging, Gao Yang's focus remains fixed on one internal question: If we're to build a general-purpose foundation model for embodied intelligence, what problems lie ahead? And how do we actually solve them?

For the first episode of North Slope, we invited Gao Yang, co-founder of SpiritAI and assistant professor at Tsinghua University, to discuss embodied intelligence's approaching "GPT-3 moment." Moving from data to models to system capabilities, we unpack the technical premises behind this prediction and what lies beyond it. We also bring the conversation back to the individual: Gao Yang was the slowest-speaking guest we have ever had. In his words, humans are but thinking reeds. There is no need to hurry or to become anyone's "standard answer"; what matters is to keep returning to oneself when weighing external value against individual happiness, to maintain an inner rhythm, and to let one's own vitality unfold.

👤 Guest Bio

Gao Yang: Co-founder and Chief Scientist of SpiritAI, Assistant Professor at the Institute for Interdisciplinary Information Sciences, Tsinghua University.

He received his bachelor's and master's degrees from Tsinghua University and his PhD from UC Berkeley. He is a leading young scholar globally in embodied intelligence and Vision-Language-Action (VLA) models.

🕒 Selected Timestamps

04:34 In early 2024, when he talked about embodied foundation models, even his students didn't believe him

07:15 The moment ChatGPT emerged, his AI values formed at Berkeley were reshaped

08:37 If the large language model path worked, why not embodied intelligence?

13:11 Two years ago he predicted 5–8 years; now he's moved that up to 2027

17:52 Ten million hours of data, 6,000 people, several months. China has experience with this

29:42 For evaluating an embodied model today, there's only one metric that matters: generalization

32:48 He and his academic "brother" Sergey: where their technical approaches converge and diverge

39:26 The robot of the future is a "multi-spectrum" system

48:36 If there were an elixir of immortality, would he still spend a lifetime on robots?

01:06:31 For scientist-founders: what's signal, what's noise?

01:08:34 Laozi's "frugality" isn't about thrift—it's about non-dissipation

01:11:32 "Value or happiness? I choose happiness"

📚 References (a lot this time, but all valuable)

On technology:

Scaling Law: Formalized for language models by OpenAI in 2020. Refers to the predictable improvement in model performance as compute, data, and model size increase; embodied intelligence is now exploring its limits on physical data.
VLA (Vision-Language-Action): An end-to-end embodied intelligence architecture that enables robots to "see" their environment, "understand" instructions, and directly output physical actions.
World Model: An AI model that can understand and predict the next state of the physical world; in the future, it may generate massive amounts of robot training data in simulated environments.
Teleoperation: Remote operation. Refers to humans controlling robots remotely through devices; some seemingly intelligent robot demonstrations on the market currently rely on this technology.
Locomotion: Motion control. The underlying movement and balance capabilities of robots, with extremely high control frequency, similar to biological reflexes.
Transformer: The current universal underlying architecture for large models. It functions like a highly sensitive "attention converter," capable of capturing associations across vast spans in data sequences—the common foundation for both ChatGPT and embodied intelligence "brains."
End-to-End: A "direct" technical approach. The model goes straight from raw input (e.g., camera footage) to final output (e.g., robotic arm movement), without intermediate human-designed rules, letting the machine learn the mapping itself.
Generalization: The core metric for measuring embodied intelligence quality. Refers to AI's ability to make correct judgments when facing unseen environments or tasks, rather than mechanically repeating lab-practiced movements.
Universal Function Approximator: A foundational mathematical result about neural networks. A network with at least one hidden layer and enough units can approximate any continuous function on a bounded domain to arbitrary accuracy.
CRISPR: Gene editing technology. Used in the episode to imagine an extreme sci-fi scenario where humans might dramatically extend lifespan through genetic modification.
PR2 (PR two): A classic dual-arm research robot, an important platform for early scholars conducting grasping and control experiments.

On companies:

Generalist: A cutting-edge global embodied intelligence startup, leading the industry in real-world physical data collection. In its latest demo released on April 2, it claimed to have 500,000 hours of data.
Physical Intelligence (PI): A top-tier American embodied intelligence startup emphasizing the "generality" of embodied intelligence models.
World Labs: An AI startup founded by Fei-Fei Li, now pivoting to focus on embodied intelligence and "spatial intelligence" R&D.
AMI Labs: Founded by Yann LeCun, dedicated to exploring more general-purpose AI architectures.

On people:

Sergey Levine: Professor at UC Berkeley, co-founder of Physical Intelligence, described by Gao Yang as a "living Wikipedia" of robotics.
Peter Thiel: Well-known Silicon Valley investor, PayPal co-founder, and author of Zero to One.
Jitendra Malik: A towering figure in computer vision at UC Berkeley. His evolutionary perspective on "why animals need vision" inspired Gao Yang's turn toward robotics research.
Wu Yi: A distinguished young scholar at the Institute for Interdisciplinary Information Sciences, Tsinghua University, Chief Scientist of Ant Group's Reinforcement Learning Lab, responsible for large model reinforcement learning research.
Xu Huazhe: Assistant Professor at the Institute for Interdisciplinary Information Sciences, Tsinghua University, and Gao Yang's academic "brother" from their time at the Berkeley lab. His research focuses on embodied AI theory, algorithms, and applications, deep reinforcement learning, and robotics.
Fei-Fei Li: Stanford professor, pioneer in computer vision, initiator of ImageNet, founder of World Labs.
LeCun: Yann LeCun, one of the "three giants" of deep learning, Turing Award winner.

On ideas:

Thinking reed: Derived from French philosopher Pascal's metaphor. Humans are as physically fragile as reeds, but possess irreplaceable value through independent preferences and the capacity for thought.
Laozi's "three treasures": From the Tao Te Ching: "First, compassion; second, frugality; third, not daring to be first in the world." In the episode, "frugality" specifically means not dissipating one's mental energy and desires.

🎵 Music

Jordan Critz - Beau Et Rapide (Piano)

🎤 Production Team

Host | Jinjian Zhang

Produced by | Oasis Capital

Editing & Production | Shengdu Studio Podcast Workshop

💬 Engagement

Assistant WeChat: VB20240606

If you faced two choices: one with enormous "value" in secular terms but that brings you pain, and another that brings you genuine "happiness" but seems useless—which would you choose? Leave a comment!

We'll gift the book Gao Yang mentioned at the episode's end to the 3 listeners with the most-liked comments.

Disclaimer

All investment-related content in this podcast is for discussion and reference only and does not constitute any market prediction, judgment, or investment or consulting advice. Thank you for your interest in original content! If you repost or quote content from this podcast, please indicate the source. Please contact Oasis Capital and obtain consent before reposting.
