A Conversation with Google DeepMind and LLM Researchers: Deconstructing OpenAI o1 and the LLM+RL New Paradigm

December 27, 2024

The most hardcore technical breakdown of OpenAI's o1 model that listeners have been waiting for is here! When it comes to the most talked-about event this year, the release of OpenAI's o1 model is impossible to ignore — OpenAI CEO Sam Altman himself called it the beginning of a new paradigm. Through reinforcement learning (RL) combined with Chain-of-Thought (CoT) technology, o1 now tackles complex problems in physics, mathematics, and programming at a level comparable to PhD students in those fields.

How does reinforcement learning bring new logical reasoning capabilities to large language models? What are the sources, implementation methods, and future potential of this capability? What impact will the "new paradigm" brought by o1 have on the industry? This three-plus-hour deep dive will offer you a fresh perspective!

For this episode, we've invited Monica, Investment VP at ZhenFund, along with several frontline researchers with hands-on LLM training experience. Two of them hail from Google — the undisputed stronghold of RL and the birthplace of world-leading RL work including AlphaGo, AlphaFold, and AlphaGeometry. Both have extensive research and practical experience in RL and MCTS (Monte Carlo Tree Search). Another guest comes from a major internet company, with firsthand experience spanning LLM pre-training to RLHF. The combination of cutting-edge perspectives from China and the US sparked plenty of insights. The lineup's speculations and analysis of o1 will not disappoint.

This discussion will dive deep into technical details. Given our guests' long overseas work and study experience, English will inevitably be sprinkled throughout — no complaints accepted. Enjoy!

(Recorded September 27, 2024)

Host: Monica Xie, Investment VP at ZhenFund
Co-host: Cage, former ByteDance data scientist, now researcher at Shixiang Tech, contributor to the "Overseas Unicorn" WeChat account

Guests:

Kimi Kong: Research Engineer at Google DeepMind. First encountered reinforcement learning while studying at Stanford. Research spans robotics to large language models, with a systematic understanding of RL theory and its evolution.
Eric Li: PhD from Caltech, now Research Scientist at Google Cloud. Many speculate that o1 applies Monte Carlo Tree Search (MCTS) to LLMs as a key method for enhancing logical reasoning. Eric has published multiple papers on LLM-MCTS integration and is a recognized expert in the field.
Su Hui: Former WeChat AI researcher, now head of large models at a top domestic internet company.

Timeline

Data and Reinforcement Learning: o1 Experience and Capability Breakdown

02:45 Guest introductions and recent projects
24:19 Using o1 feels like "supervising a mediocre but not entirely incompetent grad student"
26:43 Scaling annotation and filtering for high-quality data is the fundamental challenge
31:41 o1's enhanced reasoning in travel scenarios demonstrates higher generalization in common-sense domains
39:44 High-quality data can no longer be one-shot; requires long-chain feedback optimization
45:07 Designing a good reward model is crucial for obtaining high-quality reasoning preference data

From Model Tool Use to Building Stronger Agents

51:32 o1 was trained in a strong reasoning environment, accustomed to long thought chains
55:08 Model reasoning capability is just the foundation for agents; building an agent system faces far more challenges
57:23 Four key elements for building agents: Base Model, Tool, Prompt, Learning

CoT, MCTS, Self-play, and RL's Role in LLM Reasoning

01:00:35 What is Chain-of-Thought (CoT)?
01:09:43 CoT and MCTS are fundamentally both explorations in planning and reasoning
01:13:13 Three core elements of reinforcement learning: Agent, Environment, Reward
01:33:24 A scalable pattern for Reward Models might be "Human in the loop combined with AI feedback"

Is o1 Likely a Single Model or Multi-Agent System?

01:36:17 In the five stages of AI development, we're still at roughly 2.1 to 2.5
01:39:32 In the foreseeable future, Multi-Agent capabilities will exceed Single-Agent
01:42:54 Multi-Agent is a transitional state before super-models emerge
01:44:40 Why gaming ability and game data deserve attention for LLMs

Impact and Outlook of o1's Release

01:49:59 Google started RL research early — why did OpenAI's o1 come out first?
01:56:26 OpenAI o1's new RL paradigm raises the bar for followers
02:05:17 Most hoped-for developments in AI over the next 1-3 years

Further Reading

Full transcript: 30,000-Word Dialogue with Google DeepMind Researcher: Deconstructing OpenAI o1 and the LLM+RL New Paradigm | Z Talk

OpenAI: Scaling Laws for Reward Model Overoptimization
Allen Zhu: Physics of Language Models
Language is primarily a tool for communication rather than thought
OpenAI: Improving mathematical reasoning with process supervision
OpenAI PRM 800k dataset
Let's Verify Step by Step
Anthropic: Constitutional AI: Harmlessness from AI Feedback
OpenAI Hyung Won Chung: "Don't teach. Incentivize."
Sergey Levine: Soft actor-critic: Off-policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
MADDPG (Multi-Agent Deep Deterministic Policy Gradient) from OpenAI paper: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
AlphaZero-Like Tree-Search can Guide Large Language Model Decoding and Training
Reasoning with Language Model is Planning with World Model
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
OpenAI: Learning to Reason with LLMs
OpenAI o1-mini
OpenAI's Strawberry and inference scaling laws
Overseas Unicorn: LLM Paradigm Shift: RL Brings New Scaling Law
Zhang Junlin: Reverse-o1: OpenAI o1 Principle Reverse Engineering Diagram

Key Concepts

Reward Model: In reinforcement learning, a Reward Model is used to estimate or predict reward signals in an environment. It simplifies or replaces environmental feedback in scenarios where reward signals are complex, sparse, or difficult to directly define.

RLHF: Reinforcement Learning from Human Feedback — optimizing models through human-provided feedback.

RLAIF: Reinforcement Learning from AI Feedback — optimizing models through AI-generated feedback.

InstructGPT: A language model developed by OpenAI, fine-tuned from GPT-3 with human feedback so that it follows user instructions more reliably. It marked a milestone in shifting from pure large-scale language generation to more precise understanding of user needs.

SFT Data: Annotated datasets used in supervised fine-tuning.

Preference Data: Data expressing user or annotator preferences, particularly used in RLHF and similar reinforcement learning phases.
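One common way preference data is used in RLHF is to train a reward model on pairs of responses, so that the preferred response scores higher than the rejected one. The sketch below shows a pairwise (Bradley-Terry style) loss for a single pair. The function name and the example record are hypothetical and chosen only for illustration; the scores would normally come from a neural reward model rather than being passed in as plain numbers.

```python
import math

# A hypothetical preference record: one prompt, two candidate answers,
# and an annotator's choice of which answer is better.
example_record = {
    "prompt": "What is 17 * 24?",
    "chosen": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
    "rejected": "17 * 24 = 398",
}

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the chosen response
    well above the rejected one, and large when the order is reversed.
    """
    diff = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

if __name__ == "__main__":
    # Equal scores: the model is undecided, so the loss is log(2).
    print(pairwise_preference_loss(0.0, 0.0))
    # Chosen scored higher than rejected: the loss drops below log(2).
    print(pairwise_preference_loss(2.0, -1.0))
```

Minimizing this loss over many annotated pairs pushes the reward model to rank responses the way annotators do.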

Chain-of-Thought: A method to enhance LLM reasoning by guiding models to generate step-by-step reasoning, breaking complex problems into manageable sub-problems. It significantly improves performance on mathematics, logical reasoning, and multi-step problem solving.

Toolformer: A language model combining generation capability with tool use, automatically deciding to call external tools (calculators, search engines, translators) when external knowledge or specific computation is needed.

DDPM: Denoising Diffusion Probabilistic Model, a generative model for high-quality image generation and an important recent advance in generative modeling.

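DPO: Direct Preference Optimization, a method that optimizes a language model directly on human preference pairs (a preferred and a rejected response to the same prompt), without training a separate reward model or running an RL loop as in PPO-based RLHF.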
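As a rough illustration, the DPO loss for one preference pair can be written in a few lines. This is a minimal sketch, assuming the sequence log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model are already computed. The function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy raised the chosen response and lowered the rejected one
    # relative to the reference: the loss falls below log(2).
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The reference log-probabilities keep the policy from drifting arbitrarily far from its starting point, and `beta` controls how strongly that drift is penalized.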

PPO: Proximal Policy Optimization, a reinforcement learning algorithm proposed by OpenAI in 2017. Known for stability and efficiency, it's widely used in game AI, robot control, and language model training (including RLHF).
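Much of PPO's stability comes from its clipped surrogate objective, which limits how far a single update can move the policy. Below is a minimal per-sample sketch. The function name is hypothetical, and `clip_eps=0.2` is used only as an illustrative default. A real implementation would average this over a batch and add value and entropy terms.

```python
def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate objective for one sample (to be maximized).

    ratio     = pi_new(action | state) / pi_old(action | state)
    advantage = estimated advantage of the action taken
    """
    # Keep the probability ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound, so the
    # policy gains nothing by moving the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

if __name__ == "__main__":
    # Good action, ratio already above the clip range: the gain is capped.
    print(ppo_clipped_objective(1.5, 1.0))
    # Bad action, ratio below the clip range: the clipped term is used.
    print(ppo_clipped_objective(0.5, -1.0))
    # Ratio inside the clip range: the objective is just ratio * advantage.
    print(ppo_clipped_objective(1.0, 2.0))
```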

Reward Hacking: When optimizing against a reward model's scores, if the reward model doesn't fully represent human preferences, models may exploit loopholes to maximize rewards without achieving intended outcomes.

AlphaGo: Go AI developed by Google DeepMind, combining neural networks trained with supervised learning on human games and reinforcement learning from self-play with MCTS. Defeated Lee Sedol in 2016 and world #1 Ke Jie the following year.

AlphaGo Zero: Evolution of AlphaGo trained solely through self-play and RL, without human game records, achieving significantly stronger performance.

AlphaZero: Generalized version of AlphaGo Zero applicable to multiple board games (chess, Go, shogi), reaching superhuman strength through self-play with only the basic rules provided.

Staff

Executive Producers: Wendi, Stone, Zoe
Post-production: Keyone Studio

About ZhenFund

"This Is Serious" (此话当真) is a general business podcast produced by ZhenFund, where the investment team shares the latest trends and industry insights with leaders across fields.

Founded in 2011, ZhenFund is one of China's earliest angel investment firms. From the start, it has actively sought outstanding entrepreneurial teams and era-defining investment opportunities in artificial intelligence, chips and semiconductors, robotics and hardware, healthcare, enterprise services, new energy, cross-border expansion, and consumer lifestyle.

ZhenFund — Your First Stop for Entrepreneurship!

Contact Us

WeChat: 真格基金 (ID: zhenfund)
Website: www.zhenfund.com
Email: media@zhenfund.com

Listen on Xiaoyuzhou, Apple Podcasts, and Ximalaya. We welcome your feedback and suggestions in the comments!