Reinforcement Learning from Human Feedback
RLHF
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning AI models with human preferences by using human judgments as the reward signal in a reinforcement learning loop. The model generates outputs, humans rank or rate them, and training then optimizes the model toward the patterns that earn higher human approval.
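One common way to turn human rankings into a trainable reward signal is a pairwise (Bradley-Terry style) preference loss. The sketch below is illustrative only: the text above does not specify a loss function, and the two-dimensional feature vectors, the linear reward model, and the names preference_loss, reward, and train_reward_model are all hypothetical.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).

    Written as a numerically stable softplus of the negated margin.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def reward(weights, features):
    """Toy linear reward model (an assumption; real reward models are neural networks)."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Fit weights so that human-preferred outputs score higher than rejected ones."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -sigmoid(-margin)
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            weights = [
                w - lr * grad_scale * (c - r)
                for w, c, r in zip(weights, chosen, rejected)
            ]
    return weights

# Hypothetical preference data: (features of preferred output, features of rejected output).
PAIRS = [
    ([1.0, 0.2], [0.3, 0.9]),
    ([0.8, 0.5], [0.4, 0.5]),
    ([0.9, 0.1], [0.2, 0.3]),
]

if __name__ == "__main__":
    learned = train_reward_model(PAIRS, dim=2)
    for chosen, rejected in PAIRS:
        print(reward(learned, chosen) > reward(learned, rejected))
```

In a full RLHF pipeline the learned reward would then drive a policy-optimization step; that stage is omitted here.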
The method was notably deployed in OpenAI's InstructGPT, where it marked a shift from pure language generation toward following user instructions more precisely. RLHF sits between supervised fine-tuning and what some researchers consider "true" reinforcement learning. It lifts model performance from "generative human" level to "discriminative human" level, because people find it easier to judge outputs than to create them, and it gains a further boost from aggregating judgments across many human raters. Andrej Karpathy, in a 2024 discussion translated by 5Y View, characterized RLHF as "just barely RL": he argued that it cannot in principle exceed the collective judgment of expert human panels, and that genuinely superhuman performance would require RL with automated rather than human feedback.
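The aggregation effect can be illustrated with a simple majority-vote calculation. This is a sketch under an idealized assumption that is not in the source: each rater independently picks the better of two outputs with the same probability p. Real raters are correlated, so the actual gain is likely smaller. The function name majority_accuracy is hypothetical.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent raters is correct.

    p: chance that a single rater picks the better output (assumed identical
       and independent across raters, which is an idealization).
    n: number of raters; must be odd so that ties cannot occur.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_accuracy(0.7, n), 5))
```

With p = 0.7, a single rater is right 70% of the time, while a majority of three is right about 78.4% of the time and a majority of five about 83.7%.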
In practice, RLHF's effectiveness depends heavily on the design of the reward signal and the quality of the human preference data. Multi-turn conversations are particularly hard: outcome rewards are delayed and sparse, which makes credit assignment difficult, and models may learn superficial behaviors, such as excessive questioning, that game the optimization target without providing genuine value. The technique has also been applied beyond text, for instance to music generation, though whether LLM alignment methods transfer cleanly to aesthetic domains remains uncertain.
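The multi-turn credit assignment problem can be made concrete with a small sketch. It assumes a conversation that receives a single outcome reward at the final turn and uses discounted returns, which is one common heuristic and not something the source prescribes. The function name discounted_returns and the discount factor of 0.9 are hypothetical choices.

```python
def discounted_returns(rewards, gamma=0.9):
    """Propagate per-turn rewards backward through a conversation.

    rewards: one reward per turn, in order; sparse outcome rewards mean most are 0.
    gamma: discount factor applied per turn (an assumed value).
    Returns the discounted return credited to each turn.
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

if __name__ == "__main__":
    # Four-turn conversation, reward only at the end.
    print(discounted_returns([0.0, 0.0, 0.0, 1.0]))
```

Every turn receives a share of the final reward based purely on its position, regardless of whether it contributed to the outcome. That is one way a low-value habit, such as an unnecessary clarifying question early in the conversation, can end up reinforced.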
As of mid-2025, the field is shifting more broadly toward pure RL and more scalable evaluation frameworks, with some researchers treating RLHF as an early, data-intensive phase in a progression toward more autonomous learning systems.
AI-generated — may contain errors, please verify.
Coverage
Moonshot AI Founder Zhilin Yang's Latest Take: Deep Reflections on OpenAI's o1 Paradigm Shift | Z Talk
The Next Phase of Foundation Models: A New Paradigm?
ZhenFund · Ten-Thousand-Word Conversation with Scale AI Founder Alex Wang: Why Data, Not Compute, Is the Biggest Bottleneck for Large Models | Z Talk
We've exhausted all the easily accessible data.
ZhenFund · Heaven's Feel: The Vortex at the Root of AI and Games | 5Y View
Any sufficiently advanced technology is indistinguishable from magic.
5Y Capital ·


