
A Conversation with Google DeepMind and LLM Researchers: Deconstructing OpenAI o1 and the LLM+RL New Paradigm
December 27, 2024
The in-depth technical breakdown of OpenAI's o1 model that listeners have been waiting for is here! Among this year's most talked-about events, the release of o1 is impossible to ignore: OpenAI CEO Sam Altman himself called it the beginning of a new paradigm. By combining reinforcement learning (RL) with Chain-of-Thought (CoT) reasoning, o1 now tackles complex problems in physics, mathematics, and programming at a level comparable to PhD students in those fields.
How does reinforcement learning bring new logical reasoning capabilities to large language models? What are the sources, implementation methods, and future potential of this capability? What impact will the "new paradigm" brought by o1 have on the industry? This three-plus-hour deep dive will offer you a fresh perspective!
For this episode, we've invited Monica, Investment VP at ZhenFund, along with several frontline researchers with hands-on LLM training experience. Two of them hail from Google — the undisputed stronghold of RL and the birthplace of world-leading RL work including AlphaGo, AlphaFold, and AlphaGeometry. Both have extensive research and practical experience in RL and MCTS (Monte Carlo Tree Search). Another guest comes from a major internet company, with firsthand experience spanning LLM pre-training to RLHF. The combination of cutting-edge perspectives from China and the US sparked plenty of insights. The lineup's speculations and analysis of o1 will not disappoint.
This discussion dives deep into technical details. Because our guests have spent many years working and studying overseas, English terms are inevitably sprinkled throughout, so please bear with us. Enjoy!
(Recorded September 27, 2024)
Host: Monica Xie, Investment VP at ZhenFund
Co-host: Cage, former ByteDance data scientist, now researcher at Shixiang Tech, contributor to the "Overseas Unicorn" WeChat account
Guests:
- Kimi Kong: Research Engineer at Google DeepMind. First encountered reinforcement learning while studying at Stanford. Research spans robotics to large language models, with a systematic understanding of RL theory and its evolution.
- Eric Li: PhD from Caltech, now Research Scientist at Google Cloud. Many speculate that o1 applies Monte Carlo Tree Search (MCTS) to LLMs as a key method for enhancing logical reasoning. Eric has published multiple papers on LLM-MCTS integration and is a recognized expert in the field.
- Su Hui: Former WeChat AI researcher, now head of large models at a top Chinese internet company.
Timeline
Data and Reinforcement Learning: o1 Experience and Capability Breakdown
- 02:45 Guest introductions and recent projects
- 24:19 Using o1 feels like "supervising a mediocre but not entirely incompetent grad student"
- 26:43 Scaling annotation and filtering for high-quality data is the fundamental challenge
- 31:41 o1's enhanced reasoning in travel scenarios demonstrates higher generalization in common-sense domains
- 39:44 High-quality data can no longer be one-shot; requires long-chain feedback optimization
- 45:07 Designing a good reward model is crucial for obtaining high-quality reasoning preference data
From Model Tool Use to Building Stronger Agents
- 51:32 o1 was trained in a strong reasoning environment, accustomed to long thought chains
- 55:08 Model reasoning capability is just the foundation for agents; building an agent system faces far more challenges
- 57:23 Four key elements for building agents: Base Model, Tool, Prompt, Learning
CoT, MCTS, Self-play, and RL's Role in LLM Reasoning
- 01:00:35 What is Chain-of-Thought (CoT)?
- 01:09:43 CoT and MCTS are fundamentally both explorations in planning and reasoning
- 01:13:13 Three core elements of reinforcement learning: Agent, Environment, Reward
- 01:33:24 A scalable pattern for Reward Models might be "Human in the loop combined with AI feedback"
Is o1 Likely a Single Model or Multi-Agent System?
- 01:36:17 In the five stages of AI development, we're still at roughly 2.1 to 2.5
- 01:39:32 In the foreseeable future, Multi-Agent capabilities will exceed Single-Agent
- 01:42:54 Multi-Agent is a transitional state before super-models emerge
- 01:44:40 Why gaming ability and game data deserve attention for LLMs
Impact and Outlook of o1's Release
- 01:49:59 Google started RL research early — why did OpenAI's o1 come out first?
- 01:56:26 OpenAI o1's new RL paradigm raises the bar for followers
- 02:05:17 Most hoped-for developments in AI over the next 1-3 years
Further Reading
- Full transcript: 30,000-Word Dialogue with Google DeepMind Researcher: Deconstructing OpenAI o1 and the LLM+RL New Paradigm | Z Talk
- OpenAI: Scaling Laws for Reward Model Overoptimization
- Allen Zhu: Physics of Language Models
- Language is primarily a tool for communication rather than thought
- OpenAI: Improving mathematical reasoning with process supervision
- OpenAI PRM 800k dataset
- Let's Verify Step by Step
- Anthropic: Constitutional AI: Harmlessness from AI Feedback
- OpenAI Hyung Won Chung: "Don't teach. Incentivize."
- Sergey Levine: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- OpenAI: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (MADDPG, Multi-Agent Deep Deterministic Policy Gradient)
- AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training
- Reasoning with Language Model is Planning with World Model
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
- OpenAI: Learning to Reason with LLMs
- OpenAI o1-mini
- OpenAI's Strawberry and inference scaling laws
- Overseas Unicorn: LLM Paradigm Shift: RL Brings New Scaling Law
- Zhang Junlin: Reverse-o1: OpenAI o1 Principle Reverse Engineering Diagram
Key Concepts
Reward Model: In reinforcement learning, a Reward Model is used to estimate or predict reward signals in an environment. It simplifies or replaces environmental feedback in scenarios where reward signals are complex, sparse, or difficult to directly define.
RLHF: Reinforcement Learning from Human Feedback — optimizing models through human-provided feedback.
RLAIF: Reinforcement Learning from AI Feedback — optimizing models through AI-generated feedback.
InstructGPT: A language model developed by OpenAI, optimized from GPT-3 to better understand and execute user instructions through human feedback. It marked a milestone in shifting from pure large-scale language generation to more precise understanding of user needs.
SFT Data: Annotated datasets used in supervised fine-tuning.
Preference Data: Data expressing user or annotator preferences, particularly used in RLHF and similar reinforcement learning phases.
Chain-of-Thought: A method to enhance LLM reasoning by guiding models to generate step-by-step reasoning, breaking complex problems into manageable sub-problems. It significantly improves performance on mathematics, logical reasoning, and multi-step problem solving.
Toolformer: A language model combining generation capability with tool use, automatically deciding to call external tools (calculators, search engines, translators) when external knowledge or specific computation is needed.
DDPM: Denoising Diffusion Probabilistic Model, a generative model for high-quality image generation and an important recent advance in generative modeling.
DPO: Direct Preference Optimization, a method that fine-tunes a model directly on human preference data, without first training a separate reward model or running a reinforcement learning loop.
PPO: Proximal Policy Optimization, a reinforcement learning algorithm proposed by OpenAI in 2017. Known for stability and efficiency, it's widely used in game AI, robot control, and language model training (including RLHF).
Reward Hacking: When optimizing against a reward model's scores, if the reward model doesn't fully represent human preferences, models may exploit loopholes to maximize rewards without achieving intended outcomes.
AlphaGo: Go AI developed by Google DeepMind, combining supervised learning from human games, reinforcement learning through self-play, and MCTS. It defeated Lee Sedol in 2016 and world #1 Ke Jie the following year.
AlphaGo Zero: Evolution of AlphaGo trained solely through self-play and RL, without human game records, achieving significantly stronger performance.
AlphaZero: Generalized version of AlphaGo Zero applicable to multiple board games (chess, Go, shogi), learning to become superhuman AI through self-play with only basic rules provided.
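To make a few of the terms above concrete, here is a minimal sketch of three formulas behind them: the pairwise loss commonly used to fit a Reward Model on Preference Data, the DPO loss for a single preference pair, and PPO's clipped objective for a single action. These are the standard textbook forms from the published DPO and PPO work, not something spelled out in the episode. The function names and the default values `beta=0.1` and `epsilon=0.2` are illustrative assumptions, and real training code operates on batches of token log-probabilities rather than single floats.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for a reward model.

    The loss is small when the preferred answer gets the higher score.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. No separate reward model is needed.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


def ppo_clipped_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO clipped surrogate objective for one action (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); clipping removes the incentive to
    move the policy far from the one that collected the data.
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(reward_model_loss(r_chosen=2.0, r_rejected=0.5))
    print(dpo_loss(-12.0, -15.0, ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

The reward model loss and the DPO loss share the same log-sigmoid shape. DPO replaces the learned reward score with the log-probability gap between the policy and the reference model, which is why it can skip the separate reward model. Reward Hacking corresponds to the case where the policy drives up the learned reward score without the outputs actually becoming better by human judgment.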
Staff
Executive Producers: Wendi, Stone, Zoe
Post-production: Keyone Studio
About ZhenFund
"This Is Serious" (此话当真) is a general business podcast produced by ZhenFund, where the investment team shares the latest trends and industry insights with leaders across fields.
Founded in 2011, ZhenFund is one of China's earliest angel investment firms. From the start, it has actively sought outstanding entrepreneurial teams and era-defining investment opportunities in artificial intelligence, chips and semiconductors, robotics and hardware, healthcare, enterprise services, new energy, cross-border expansion, and consumer lifestyle.
ZhenFund — Your First Stop for Entrepreneurship!
Contact Us
WeChat: 真格基金 (ID: zhenfund)
Website: www.zhenfund.com
Email: media@zhenfund.com
Listen on Xiaoyuzhou, Apple Podcasts, and Ximalaya. We welcome your feedback and suggestions in the comments!