5Y View｜Welcome to the Era of Evals

5Y Capital · July 22, 2025

The Inevitable Path from Foundation Models to AI Agents

Recommended by

Shi Yunfeng, 5Y Capital

Scaling laws have let us coast on the free data buffet of the internet. That data was enough to crack classic NLP tasks, but not enough to make models into general, reliable agents. Imagine training GPT-4o on every text written before the internet existed — even with unlimited compute, the data would fall far short.

Looking back at the "bitter lesson" of the past two decades, architectural improvements have always been incremental steps, while data-driven innovation has delivered outsized impact.

I expect this trend to continue. Today's reinforcement learning (RL) still sits in a "pre-GPT-3" paradigm: RL datasets are tiny. DeepSeek-R1, for instance, used only about 600,000 math problems for RL training — if each problem takes a human five minutes, that's roughly six years of continuous human effort. By contrast, reconstructing GPT-3's 300 billion token training corpus would require tens of thousands of years of writing at normal human speed.
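The two estimates above can be checked with a back-of-the-envelope calculation. The problem count, minutes per problem, and token count come from the text; the words-per-token ratio and the writing speed are assumptions chosen only for illustration, so the second figure should be read as an order of magnitude rather than a precise number.

```python
# Back-of-the-envelope check of the two human-effort estimates.
MINUTES_PER_YEAR = 60 * 24 * 365  # continuous effort, no breaks

# RL side: both figures are taken from the text.
rl_problems = 600_000
minutes_per_problem = 5
rl_years = rl_problems * minutes_per_problem / MINUTES_PER_YEAR

# Pretraining side: the token count is from the text; the two rates
# below are assumptions made for illustration only.
gpt3_tokens = 300e9
words_per_token = 0.75   # assumption: rough rule of thumb for English text
words_per_minute = 30    # assumption: sustained human writing speed
gpt3_years = gpt3_tokens * words_per_token / words_per_minute / MINUTES_PER_YEAR

ratio = gpt3_years / rl_years

print(f"RL problems:  {rl_years:.1f} years")
print(f"GPT-3 corpus: {gpt3_years:,.0f} years")
print(f"ratio:        {ratio:,.0f}x")
```

Under these assumptions the RL dataset works out to about 5.7 years of continuous effort and the GPT-3 corpus to roughly 14,000 years, a gap of about 2,500x. A slower writing speed or a normal working schedule pushes the second figure further into the tens of thousands of years.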

Building evaluations and RL environments is the highest-leverage, most durable use of human time.

Welcome to the Era of Evals

Author: Brendan Foody

Original link:

https://mercor.com/blog/welcome-to-the-era-of-evals/

Reinforcement Learning (RL) is driving the most exciting breakthroughs in AI. As RL becomes dramatically more effective, models will soon "saturate" every existing evaluation. This means the only remaining barrier to deploying agents across the entire economy is building evaluations (evals) for everything.

Yet AI labs are facing an "eval famine": the benchmark-style evaluations borrowed from academia are seriously misaligned with what consumers and enterprises actually need.

Evals are becoming the new PRD. In accelerating knowledge work, progress will converge on building environments and evaluation systems that map to real work scenarios and deliverables. This RL-centric new paradigm of human data is vastly more efficient than pretraining, supervised fine-tuning (SFT), or RLHF. Most knowledge work inherently involves repeatable workflows — a variable cost — but encapsulating them as environments or evaluation systems transforms them into a one-time fixed cost.

Training on Verifiable Rewards

RL environments allow scoring both final outcomes and intermediate steps. Models make multiple attempts at the same problem, using test-time compute to "think before answering." Human-written autograders reward trajectories that were "on the right track." Continuously reinforcing on these "good trajectories" teaches models to use correct chains of thought to solve various problems — researchers simply keep "hill-climbing" on evaluations. These environments roughly fall along a spectrum of verifiability:

Objective domains

Games like Pac-Man, chess, and Go have clear state spaces, action spaces, and target outcomes. Math, coding, and even certain tasks in biology can be verified with near game-like precision. Objective domains are where RL has already scored decisive victories: AlphaProof, AlphaFold, DeepSeek-R1, and numerous code generation models.

Subjective domains

In the real world, some tasks resist easy quantification — writing investment memos, preparing legal documents, providing psychotherapy. It's difficult to judge whether a model has achieved the intended goal. Moreover, experts often hold diverse, coexisting views on "ideal processes and outcomes." In such contexts, rubric-based reward mechanisms can learn from the complexity of expert human judgment. Building environments and training with such rubrics is a promising research direction, with early foundations traceable to Anthropic's Constitutional AI and RLAIF projects.

Computer-use agents: somewhere in between

For most tasks humans perform on computers, goals may appear fuzzy at first, but once clearly defined, both behaviors and outcomes can be verified programmatically. These include planning trips, replying to emails, online shopping, social media posting, and more. With containerized environments, models can learn online from thousands of parallel interactions, with virtually unlimited horizontal scaling.
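To make the rubric-based rewards mentioned under subjective domains more concrete, here is a minimal sketch of how a rubric might be turned into a scalar reward. Everything in it is hypothetical: the `Criterion` class, the `rubric_reward` function, the memo rubric, and the keyword checks, which stand in for what would in practice be expert or model judgments. It is not a description of any lab's actual grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an expert or model judgment


def rubric_reward(response: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the response satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Hypothetical rubric for an investment memo (keyword checks are placeholders).
memo_rubric = [
    Criterion("states a recommendation", 2.0, lambda r: "recommend" in r.lower()),
    Criterion("discusses risks", 2.0, lambda r: "risk" in r.lower()),
    Criterion("cites a valuation", 1.0, lambda r: "valuation" in r.lower()),
]

memo = "We recommend investing. Key risks: customer concentration."
score = rubric_reward(memo, memo_rubric)
print(score)  # 0.8
```

Weighting the criteria is one simple way to encode that experts care about some aspects more than others; differing expert opinions could be represented as several rubrics whose scores are averaged.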

Environments as Experience

Eventually, our AI systems will learn automatically from real-world signals — students' test scores improving, sales closing, perhaps even bridges being built. But even then, intermediate rewards will remain indispensable. Just as humans learn from others, models need guidance on which teaching methods or sales techniques work best. Humans will remain an irreplaceable part of the environments models learn from.

We can never escape the "era of data"; it must follow us to the frontier. And that frontier is environments built by human hands, providing sustainable supplies of experiential data. These environments serve both training and evaluation.

The Path Forward

To satisfy today's data hunger, we must rethink "how to distill signal from human labor." Building evaluations and RL environments is the highest-leverage, most durable use of human time. Mercor has pioneered environment generation using autograders, continuously pushing the boundaries of RL data across simulated workspaces, multi-turn interactions, and multimodality.

Knowledge work will rapidly converge on one thing: building RL environments and evaluations for agents to learn and evolve within. When AI truly enters the workplace, engages with proprietary information, and operates within unique professional contexts, these environments codify both knowledge and goals for the agents. Once every step of a workflow becomes sufficiently reliable, the only remaining task is to make human-defined goals the North Star of RL training.

Original text below:

Reinforcement Learning (RL) is driving the most exciting advancements in AI. RL is becoming so effective that models will be able to saturate any evaluation. This means that the primary barrier to applying agents to the entire economy is building evals for everything. However, AI labs are facing a dire shortage of relevant evaluations. Academic evaluations that labs goal on don't reflect what consumers and enterprises demand in the economy.

Evals are the new PRD. Progress in accelerating knowledge work will converge on building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is vastly more data efficient than pretraining, SFT, or RLHF. Most knowledge work includes recurring workflows as variable costs, but creating an environment or evaluation can transform that into a one-time fixed cost.

Training on Verifiable Rewards

RL environments allow for rewarding outcomes and intermediate steps in an evaluation. Models take many attempts at a problem, using test-time compute to "think" before they answer. Human-created autograders reward the attempts that were "good". Reinforcing on those "good" trajectories upweights the chains of thought that were used to get to the answer. This teaches models to think correctly about different types of problems as researchers iteratively hill climb evals.
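The loop described in this paragraph (many attempts, an autograder, keeping the "good" trajectories) can be sketched with a toy example. The "model" here is a hypothetical stand-in that adds two numbers with occasional slips, and the sketch only samples and filters, which is closer to rejection sampling than to a full RL algorithm: a real training run would use the rewards to update the model's weights. All function names are illustrative.

```python
import random


def attempt(problem: tuple[int, int], rng: random.Random) -> dict:
    """Toy 'model': adds two numbers but sometimes makes an off-by-one slip."""
    a, b = problem
    slip = rng.choice([0, 0, 1, -1])
    return {"steps": f"{a} + {b}", "answer": a + b + slip}


def autograder(problem: tuple[int, int], trajectory: dict) -> float:
    """Verifiable reward: 1.0 if the final answer is correct, else 0.0."""
    a, b = problem
    return 1.0 if trajectory["answer"] == a + b else 0.0


def collect_good_trajectories(problems, attempts_per_problem=8, seed=0):
    """Sample several attempts per problem and keep only the rewarded ones."""
    rng = random.Random(seed)
    kept = []
    for problem in problems:
        for _ in range(attempts_per_problem):
            trajectory = attempt(problem, rng)
            if autograder(problem, trajectory) == 1.0:
                kept.append((problem, trajectory))
    return kept


problems = [(2, 3), (10, 7), (41, 1)]
good = collect_good_trajectories(problems)
print(f"kept {len(good)} of {len(problems) * 8} attempts")
```

The kept trajectories are what a training step would then upweight; hill climbing an eval amounts to repeating this loop and watching the kept fraction rise.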

These environments can be thought of as existing on a spectrum of rigidity between two categories:

Objective domains: Games like Pac-Man, chess, and Go have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can often be formulated with near game-like verifiability. This is where RL has already achieved massive early success, notably with AlphaProof, AlphaFold, DeepSeek R1, and the many code generation models on the market today.

Subjective domains: It's more difficult to measure accuracy in many real-world tasks, such as generating investment memos, drafting legal briefs, or providing therapy. This makes it difficult to verify that a model achieved the desired outcomes. Additionally, experts often hold multiple valid opinions about desired processes and outcomes. Rubric-based rewards serve as a way to learn from the messiness of expert human opinions. How to evaluate and train with rubrics as environments is an exciting area of research, with roots laid as early as the Constitutional AI and RLAIF work from Anthropic.

Computer-use agents sit somewhere in the middle. For most of the tasks humans do on computers, goals start to become ambiguous and multi-faceted. Once defined, the actions and outcomes are programmatic and verifiable. These could include planning trips, responding to emails, shopping, or posting on social media. In all of these cases, containerized environments allow for horizontal scaling to learn online from thousands of interactions in parallel.
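As an illustration of what "programmatic and verifiable" could mean for a computer-use task, the sketch below runs a few scripted episodes against a toy mailbox and checks the final state with a small verifier. The mailbox, the action names, and the success condition are all hypothetical, and a thread pool stands in for the containerized environments the text refers to.

```python
from concurrent.futures import ThreadPoolExecutor


def run_episode(actions: list[tuple[str, str]]) -> dict:
    """Apply a scripted list of (action, argument) pairs to a fresh toy mailbox."""
    state = {"inbox": ["invoice from Acme"], "sent": []}
    for name, arg in actions:
        if name == "reply":
            state["sent"].append(arg)
        elif name == "archive" and arg in state["inbox"]:
            state["inbox"].remove(arg)
    return state


def verify(state: dict) -> bool:
    """Programmatic outcome check: the invoice was answered and the inbox is empty."""
    replied = any("invoice" in msg.lower() for msg in state["sent"])
    return replied and not state["inbox"]


episodes = [
    [("reply", "Re: invoice received, thanks"), ("archive", "invoice from Acme")],
    [("reply", "Re: invoice received, thanks")],  # forgot to archive
    [("archive", "invoice from Acme")],           # forgot to reply
]

# Each episode gets its own fresh state, so episodes can run independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    final_states = list(pool.map(run_episode, episodes))

rewards = [verify(state) for state in final_states]
print(rewards)  # [True, False, False]
```

Because every episode starts from its own isolated state, adding more workers (or, in a real system, more containers) scales the number of interactions without changing the verifier.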

Environments Create Experience

Eventually, our AI systems will learn automatically from signals in the real world like pupils' test scores increasing, sales closing, maybe even bridges being built. However, intermediate rewards will always remain critical. Similar to how humans learn from other people, models will need guidance on which styles of teaching and sales techniques are most effective. Humans will remain an integral part of the environments models learn from.

We will never escape the era of data; it must follow us to the frontier. That frontier is human-created environments that provide durable sources of experiential data. These environments can serve to train and evaluate models.

The Path Forward

Meeting today's data demand requires rethinking the way we generate signal from human efforts. Creating evals and RL environments is the highest-leverage and most durable use of people's time. Mercor has helped pioneer environment generation using autograders and continues to push the boundaries of RL data with simulated workspaces, multi-turn support, and multi-modality.

Knowledge work will quickly converge on building RL environments and evaluations for agents to learn from. As AI enters the workforce and operates over proprietary information and under unique professional contexts, these environments codify knowledge and goals for agents. Once individual steps of agentic workflows reach sufficient reliability, all that will be left will be RL training on the goals laid out by humankind.

5Y Capital seeks out, supports, and inspires lone entrepreneurs, offering everything from moral support to hands-on operational help. We believe that once the "crazy" person others see in you starts to be believed in, the world becomes a different place.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG