The Next Step in Intelligence Enhancement: Is Agent a Patch or the Future? | MonoX

Monolith砺思资本 · September 15, 2025

The Value Hierarchy of Large Model Training Is Quietly Being Rewritten

Over the past few years, the rise of large language models has reshaped the narrative around artificial intelligence. From the relentless scaling of parameters to explorations in multimodality and experiments with agent architectures, the field has been evolving at breakneck speed.

Meanwhile, attention within the industry has shifted as well: How should we measure whether a model is "good"? Can data feedback mechanisms really sustain long-term iteration? Is multimodal fusion the future, or a technical trap? Are agents opening up an entirely new form of intelligence?

The answers to these questions may not converge, but together they sketch out the central uncertainties and opportunities in large model development.

Recently, Monolith invited several tech leads from leading LLM companies to the latest MonoX "LLM Tech Lead" offline event, where they discussed the most cutting-edge technical advances in large language models.

We've compiled the core takeaways from that discussion into this article, hoping it proves useful for readers in the industry.

MonoX continues to track the latest developments in LLMs, exploring frontier technologies and engineering topics. We'll be hosting another tech lead event in Haidian at the end of September.

Contents:

What Makes a Good Model and Good Data
The Omnimodal Model Dilemma: 1+1+1 < 3
The Evolution of LLMs: From Models to Agents
Crossing the Signal-to-Noise Ratio: How LLMs Can Improve on Subjective Questions

1. What Makes a Good Model and Good Data

1.1 A "Hackable" Model Is a Good Open-Source Model

To say what makes a model good, we first need to be clear about the standard we're using to define "good."

If the goal is to push the upper limits of intelligence, general-purpose models remain the mainstream direction. All domain-specific models are, in essence, data for general models.

But the intelligence level of a general model isn't equivalent to the value it delivers to users. Turning a general model into a product with real user value is hard, involving substantial non-technical barriers and organizational R&D challenges.

On the last mile connecting model capability to users, there will always be wrappers. Some simple engineering patches meant to fix model shortcomings will naturally be phased out as base models improve.

Other wrappers are complex systems engineering — for instance, orchestrating multiple models to work together. The value of these wrappers actually grows more prominent as base model capabilities strengthen.

From a technical standpoint, a truly valuable open-source model may not be one that pursues extreme performance above all else. What matters most is whether it lets the community "hack on it."

What does "hack on it" mean? Essentially, downstream development. A good open-source model needs to leave enough room for improvement and R&D, so researchers can build on it. Alibaba's Qwen series, for example, is a "good" open-source model in this sense: it spans a wide range of sizes, offers solid quality, yet is "imperfect enough" — making it more energizing for the community than something monolithic and hard to modify like DeepSeek.

Figure: The Qwen model family

At its core, a good open-source model should provide ecosystem participants with a platform to validate their ideas and realize value.

1.2 The Failure and Reconstruction of the "Data Flywheel"

In current AI product development, the concept of the "data flywheel" from the internet era is frequently discussed.

The basic logic: after a product launches, it collects real user feedback from actual scenarios, continuously post-trains the model, improves product performance, attracts more users, and creates a virtuous cycle.

In practice, however, the data flywheel doesn't always work.

Real user data certainly has advantages synthetic data can't match. Synthetic data tends to be homogeneous, while real users' questions in complex scenarios have a diversity and uniqueness that's hard to generate through preset logic. For example, when a user takes a photo at a historical site and asks the model "Where am I? Can you identify the historical background here?" — a real question combining multimodal information and complex reasoning — it carries enormous value for model iteration.

But this doesn't mean all user data is valuable. Large-scale, unfiltered feedback data may just be noise. The core challenge is "mining": can you identify, in massive amounts of information, the effective signals that truly drive model improvement? Once data volume reaches a certain point (say, feedback from 100 users), the key is no longer adding more data but increasing the depth and precision of data mining.

Meanwhile, practitioners have found that the effectiveness of data flywheels varies dramatically across application scenarios.

For instance, in open-ended interactions like emotional companionship and casual chat, building a data flywheel faces enormous challenges.

First, the signal-to-noise ratio of user feedback is extremely low. One team tried having users mark responses as "good" or "bad" for data feedback, but found such feedback highly subjective: users were often just expressing their mood in the moment, not making objective assessments of content quality. This kind of random feedback can't serve as a reliable basis for model optimization.

Second, model capability constrains data quality. What users can talk about and how deep they can go depends heavily on the model's own abilities. If the model is weak, users quickly lose interest and fail to produce deep, valuable interaction data. This creates a negative cycle: weak model → shallow user interaction → poor feedback data quality → slow model improvement.

However, in current entrepreneurial practice, several more effective data acquisition approaches have emerged:

Incentivize high-quality users: Use points or similar mechanisms to encourage articulate, high-interaction-quality users (especially women) on the platform to engage in deep conversations, thereby unlocking high-value internal data.
External procurement: Purchase data from professional chat companion companies. Because this data has been market-validated (users paid for it), it tends to be higher quality.
Expert annotation: In social and emotional companionship domains, recruit socially adept "experts" to annotate data. It turns out that carefully curated and annotated data from domain experts far outperforms massive amounts of ordinary user data.

In short, in companionship scenarios with fuzzy goals and open-ended interaction, relying on large-scale automated collection of user feedback data isn't viable. Fine-grained operations and expert-driven, high-quality data annotation are key to improving model capability.

In stark contrast to companionship scenarios, data flywheels can operate efficiently where there are clear optimization objectives.

Typical examples are search and advertising. In GEO (generative engine optimization) scenarios, when a user searches "where to buy gum," whether the user clicks the first brand the model recommends is a very clear "outcome signal" that can fuel the data flywheel to improve the model.

Similar practices include Meta using AI to generate ad copy and directly optimizing the model through user click-through and conversion rates — with notable success, boosting ad conversion by as much as 10%.
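To make the idea of a clean "outcome signal" concrete, here is a minimal sketch that aggregates impression logs into per-item click-through rates. The log format, the function name, and the brand identifiers are hypothetical illustrations, not anything described at the event.

```python
from collections import defaultdict

def click_through_rates(impressions):
    """Aggregate (item_id, clicked) records into a click-through rate per item.

    `impressions` is a hypothetical log format: one tuple for each time an
    item was shown as the top recommendation.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for item_id, was_clicked in impressions:
        shown[item_id] += 1
        if was_clicked:
            clicked[item_id] += 1
    return {item: clicked[item] / shown[item] for item in shown}

# Illustrative data only.
log = [
    ("brand_a", True),
    ("brand_a", False),
    ("brand_b", True),
    ("brand_b", True),
]
rates = click_through_rates(log)
```

Each record is an unambiguous yes or no, which is what separates this setting from the mood-driven "good" and "bad" ratings in companionship products.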

Figure: Meta AI ad products

That said, even in this ideal scenario, data flywheels have their limits. Sustained optimization through massive data mainly corrects the model's bias relative to the distribution of real user data, rather than bringing fundamental breakthroughs in model capability. Marginal returns diminish after a certain point.

To sum up, building a data flywheel comes down to two core questions. One, does the business have clear optimization objectives? Two, what is the signal-to-noise ratio of the feedback data? On this question, the old maps from the internet era may not lead AI product developers to new lands.

2. The Omnimodal Model Dilemma: 1+1+1 < 3

As large model technology advances, "All-in-One" models capable of simultaneously understanding and generating text, images, audio, and other modalities are becoming the next frontier for industry exploration.

Current industry focus centers on whether adding multimodal data (like images, audio) to text-based large model training can improve overall model capabilities, and specifically whether it can enhance the core text capability.

An optimistic view holds that information from different modalities can complement each other, strengthening the model's overall cognition and generation. For example, after 5T of multimodal data is added to a text model trained on 35T of pure text, the model should not only preserve its original text ability but actually improve it: a "1+1 > 2" effect.

However, reality has been less ideal. In current practice, after mixing different modal data for joint training, overall model performance often falls short of expectations. Under existing technical frameworks, information fusion across modalities isn't lossless; instead, it frequently triggers declines in core capabilities including text understanding and generation.

For example, one developer found that when doing mixed training with text and audio, the model's degradation was far more severe than expected: its core text dialogue capability dropped by roughly 10%. That said, he observed that while text capability suffered, the model's consistency and overall performance on audio tasks improved significantly.

To address this, one closely watched frontier direction is the "pure pixel input (Pixel in)" approach — bypassing traditional image encoders to integrate visual information more directly into the model. Preliminary small-scale experiments suggest this method may improve multimodal understanding. But it hasn't been validated at larger scales, and its specific impact on text capability remains unknown.

3. The Evolution of LLMs: From Models to Agents

3.1 What New PMF Can Agents Bring?

Over the past year, new models have shown substantial capability improvements. Yet it's obvious that these gains aren't evenly distributed across all domains.

Take programming: new models have generally become much better at coding, which has directly strengthened the product-market fit (PMF) of related applications. Developers can use AI more efficiently for code generation, debugging, and other tasks. However, in areas such as deeper understanding of backend logic, the differences between old and new models aren't pronounced.

The current trend in model iteration is more about deepening and optimizing in already-validated application directions, rather than opening up entirely new scenarios. Where might the key path to breaking product barriers and creating new PMF lie? An increasingly clear answer is: Agent.

In frontier practice, multi-agent systems are seen as highly promising. This concept stems from insight into real-world complexity: many problems are inherently highly parallel, which creates a natural conflict with the limited parallel channels (typically 4-5) of a single agent or human.

The traditional approach might have multiple agents share one massive knowledge base and context, but this leads to inefficiency and exploding token consumption. A superior paradigm, similar to the vision in the paper Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution, is to have multiple agents work in parallel on their respective tasks, communicating efficiently through intermediate exchange mechanisms rather than sharing everything.

Figure: Alita architecture concept

This collaborative mode can improve both efficiency and effectiveness, because through task decomposition and parallel processing, the system's total problem-solving efficiency far exceeds the serial execution of a single agent. Meanwhile, communication between agents can spark solutions beyond individual average capability. And this mode is sustainable.

Since each agent only needs to focus on context relevant to its own task, with small-scale, high-efficiency communication, total system token consumption can be effectively controlled.
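The token argument can be sketched with simple arithmetic. The cost model below is an assumption made for illustration: in the shared setup every agent reads the full pooled context, while in the message-passing setup each agent reads only its own context plus one short summary from each peer. The context sizes and the summary length are made-up numbers.

```python
def shared_context_tokens(task_contexts):
    """Every agent reads the full pooled context of all tasks."""
    pooled = sum(task_contexts)
    return pooled * len(task_contexts)

def message_passing_tokens(task_contexts, summary_tokens):
    """Each agent reads its own context plus one short summary per peer."""
    n = len(task_contexts)
    return sum(ctx + (n - 1) * summary_tokens for ctx in task_contexts)

# Four agents with hypothetical per-task context sizes, in tokens.
contexts = [8000, 12000, 10000, 6000]
shared = shared_context_tokens(contexts)
passing = message_passing_tokens(contexts, summary_tokens=200)
```

Under these assumptions the shared setup reads 144,000 tokens and the message-passing setup 38,400. The gap widens as agents are added, because the shared cost grows with the number of agents times the pooled context.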

When such an agent system emerges — with complex task planning and execution capabilities, dramatically improved efficiency and effectiveness, and controllable costs — new PMF will surface as well.

3.2 The Three Pillars of Agent Training: Base Model, Scale, and Data

When an agent begins to struggle with tasks exceeding 10 steps, how do we extend its capability to 100 steps or beyond? The answer still lies in three foundational and critical pillars: base model capability, training scale, and the decisive role of data.

First, the upper limit of an agent's capability is fundamentally determined by its base model. If the model's capability is insufficient during pre-training, it's hard to achieve qualitative breakthroughs in subsequent fine-tuning stages, whether through supervised fine-tuning (SFT) or reinforcement learning (RL).

Second, despite the flood of algorithms and strategies for agent training, looking back at the development trajectory so far, the key driver has always been scale, not algorithmic sophistication. When we expand the training environment (e.g., data volume) by 10x, the marginal differences between different algorithms may become negligible.

Third, the core of scale is data. Pre-training data determines the model's knowledge boundary. If a capability, knowledge point, or reasoning pattern never appeared or appeared extremely rarely in pre-training data, it's very hard to elicit in later stages. If the probability of a correct behavior pattern existing in the model falls below a certain threshold, then for training purposes, it effectively doesn't exist. You can't teach a model a concept it fundamentally can't comprehend.
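One way to see why a rare behavior "effectively doesn't exist" for training is a standard probability argument, added here as an illustration rather than something presented at the discussion. If a single rollout shows the correct behavior with probability p, and rollouts are assumed independent, the chance of seeing it at least once in k rollouts is 1 - (1 - p)^k. The function names below are hypothetical.

```python
import math

def prob_at_least_one_success(p, k):
    """Probability that k independent rollouts contain at least one success."""
    return 1.0 - (1.0 - p) ** k

def rollouts_needed(p, target=0.5):
    """Smallest k for which at least one success occurs with probability >= target."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

common = rollouts_needed(0.1)   # behavior appears in 10% of rollouts
rare = rollouts_needed(1e-6)    # behavior appears in one rollout per million
```

A behavior that shows up in 10% of rollouts needs only 7 samples for an even chance of being observed once. At one in a million it needs several hundred thousand, so reinforcement learning almost never gets a positive example to reward.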

3.3 How to Train GUI Agents

Unlike agents that rely on structured APIs, GUI agents must learn in complex, dynamic, and highly uncertain visual environments.

The core bottlenecks in training GUI agents are twofold: the difficulty of scaling interaction environments and the inherent instability of distributed systems.

Practitioners at the discussion clearly pointed out that the most critical action for improving GUI agent capability remains scale. This scaling isn't simply about increasing data volume, but manifests across three dimensions:

More: Provide massive numbers of interaction instances, letting the model learn from large quantities of action sequences.
Wider: Build extremely diverse interaction environments. If an agent trains in only a few environments, what it learns may be "hacks" for specific environments rather than general operational capability. Only with sufficiently rich environments can the model truly generalize.
Faster: Improve environment execution and feedback efficiency to support larger-scale training iterations.

The ultimate goal is that when data and environment scale cross a certain threshold, model capability will qualitatively transform through quantitative change, naturally generalizing to all scenarios without relying on some "silver bullet" algorithmic innovation.

However, on the path to scale, a theoretically unsolvable problem always remains: environment stability.

When an agent does online reinforcement learning, it needs to perform actual operations in thousands of parallel virtual machines or simulators. This is essentially a massive distributed system, where one VM might temporarily lag due to network jitter or momentary resource insufficiency, or might have completely crashed.

From the outside, these two states are hard to distinguish. To handle unresponsive VMs caused by the above situations, we must set a timeout mechanism. But this timing is extremely tricky: set it too short (e.g., 1 hour) and you might misjudge a recoverable environment as dead, losing training progress. Set it too long (e.g., 10 hours) and you severely drag down training efficiency.

This is a classic problem in distributed systems, so training GUI agents requires complex engineering strategies to cope with and tolerate this inherent instability.
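One common engineering pattern for softening this trade-off is a two-tier timeout. It is offered here as an assumption, not as the approach any participant described. A VM that has been silent for a while is first marked "suspect" and receives no new work, and it is declared dead only after a longer silence. The class, function, and threshold values below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TimeoutPolicy:
    soft_timeout_s: float  # after this much silence, stop scheduling new work
    hard_timeout_s: float  # after this much silence, declare the VM dead

def classify_vm(last_heartbeat_s, now_s, policy):
    """Classify a VM by how long it has been silent (times in seconds)."""
    silent_for = now_s - last_heartbeat_s
    if silent_for >= policy.hard_timeout_s:
        return "dead"
    if silent_for >= policy.soft_timeout_s:
        return "suspect"
    return "healthy"

# Illustrative thresholds only.
policy = TimeoutPolicy(soft_timeout_s=60, hard_timeout_s=600)
states = [classify_vm(0, now, policy) for now in (30, 120, 600)]
```

The soft tier limits the cost of waiting on a VM that may never return, while the hard tier avoids discarding one that was only briefly stalled. It does not remove the underlying ambiguity; it only makes a wrong guess cheaper.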

Where does the real value of GUI agents lie? One core application direction is as a "human simulator," providing data for training more powerful reward models. By simulating human testing, exploration, and operation across various GUI environments, agents can generate massive interaction data. Though full of uncertainty, this data can effectively train a reward model capable of judging "what is good interaction" and "what is successful task execution."

4. How LLMs Can Improve on Subjective Questions

4.1 How Hard Subjective Question Evaluation Is

The biggest challenge in current model evaluation lies in subjective questions. The core difficulty is the "extremely low signal-to-noise ratio" of evaluation signals, making it hard to form unified, stable judgment criteria.

In practice, one practitioner once organized 30 annotators to make simple "good" or "bad" judgments on AI responses in companionship scenarios. The result was a 50-50 split, making it impossible to judge the true quality of AI responses. This dilemma is further amplified in multimodal generation domains like video and images. Even if a model has decent FID metrics on video generation tasks, from a human sensory perspective, the generated video might still have jittering issues.
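A quick way to quantify how little signal a split like this carries is pairwise agreement: the fraction of annotator pairs that gave the same label. The sketch below is an illustration and not the practitioner's actual analysis. It assumes the 50-50 result means 15 "good" and 15 "bad" votes on a response.

```python
from math import comb

def pairwise_agreement(good_votes, bad_votes):
    """Fraction of annotator pairs that assigned the same label."""
    n = good_votes + bad_votes
    if n < 2:
        raise ValueError("need at least two annotators")
    agreeing_pairs = comb(good_votes, 2) + comb(bad_votes, 2)
    return agreeing_pairs / comb(n, 2)

even_split = pairwise_agreement(15, 15)    # the 50-50 case
clear_majority = pairwise_agreement(27, 3)  # hypothetical contrast
```

With a 15-15 split, two randomly chosen annotators agree about 48% of the time, which is what independent coin flips would produce. A 27-3 split gives about 81%.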

In short, when it comes to subjective experience, humans can barely reach consensus among themselves — let alone define a clear optimization target for machines.

To better align models with humans, reinforcement learning, especially RLHF (reinforcement learning from human feedback), is widely applied. However, in complex scenarios involving multi-turn dialogue (like emotional companionship), applying RL faces the challenge of multi-turn on-policy data.

Specifically, a complete dialogue session often aims at a final goal, such as improving user satisfaction. But achieving this goal depends on a coherent, smooth multi-turn interaction process. Traditional RL methods might only optimize immediate reward for single-turn responses, which is far from sufficient. Ideal optimization should run through the entire dialogue, from first turn to last. Yet obtaining such on-policy interaction data requires the model to maintain continuous communication with real users — extremely difficult and costly to construct.

Currently, one attempt is to deploy models directly to online users and observe real feedback, but this hasn't yet matured into a scalable solution.

Overall, whether subjective evaluation or multi-turn dialogue, the root problem points to the same issue — reward design and definition.

In long-horizon dialogue, the final outcome reward (e.g., whether the user liked it) is sparse and delayed. It's like chess: you know you won in the end, but it's hard to attribute which intermediate move was the key one. To address this, researchers have tried introducing process reward models (PRM) to score intermediate steps. But practice shows PRM is also hard to get right, because the value of an intermediate step often needs to be determined in reverse from the final outcome — the signal-to-noise ratio of forward inference is low.
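The attribution problem can be seen in the simplest baseline, which spreads the final outcome backward over the turns with a discount factor. This is a generic heuristic shown for illustration, not a process reward model and not a method attributed to the participants. The function name and the discount value are assumptions.

```python
def backward_credit(num_turns, final_reward, gamma=0.9):
    """Assign each turn a share of the final reward, discounted by its
    distance from the end of the dialogue."""
    return [
        final_reward * gamma ** (num_turns - 1 - turn)
        for turn in range(num_turns)
    ]

# A three-turn dialogue that ended with positive feedback.
credits = backward_credit(num_turns=3, final_reward=1.0, gamma=0.5)
```

Here the turns receive 0.25, 0.5, and 1.0. Credit depends only on position, so a pivotal early turn and a filler turn in the same slot are scored identically, which is exactly the gap a PRM is meant to close.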

More interestingly, models exhibit opportunistic behavior during optimization. For example, in dialogue, a model might discover that constantly asking the user questions is the simplest, least error-prone interaction pattern, and thus tend to do so frequently. While this may perform well on certain local metrics, it violates the original intention of providing substantive help or meaningful exchange.

4.2 The Solution: Finding High-Signal-to-Noise "Proxy Signals"

Facing the dilemma of fuzzy definitions and extremely low signal-to-noise ratios, frontier exploration no longer obsesses over "positive definition" but instead turns to finding a quantifiable, high-signal-to-noise proxy signal and indirectly solving the problem by optimizing this signal.

The core methodology: abandon trying to define the problem from first principles, and instead look for learnable examples or optimizable metrics from the desired outcome.

Scenario One: Solving the Model "EQ" Problem

Unlike logical reasoning or knowledge Q&A, concepts like "emotional intelligence" share the characteristic of lacking strong consensus. We can't establish an objective, unified evaluation standard for "good emotional interaction" or "valuable response" the way we can judge a math problem.

In emotional companionship scenarios, practitioners at the discussion noted that specially optimized proprietary models can reach an average number of conversation turns twice that of general-purpose large models (like Doubao). This shows that a model's text capability, being "well-written" or "knowledgeable," isn't fully equivalent to high EQ. Users are after something harder to quantify: emotional resonance.

How do we find that missing ingredient? In this scenario, finding a "proxy signal" means no longer asking "what is high EQ," but instead asking "who are the high-EQ people." Through data analysis, identify users on the platform with exceptional chat appeal, then use the conversation data of these market-validated "high-quality communicators" as the core corpus for fine-tuning.
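As a rough sketch of that selection step, the code below ranks users by the average number of turns in their conversations and keeps the top fraction. Using turn count as the measure of "chat appeal" is an assumption borrowed from the comparison above. The function name, thresholds, and data are hypothetical.

```python
from statistics import mean

def top_communicators(turns_by_user, top_fraction=0.1, min_sessions=5):
    """Return the users with the highest average turns per conversation.

    Users with fewer than `min_sessions` conversations are skipped, so a
    single unusually long chat does not dominate the ranking.
    """
    averages = {
        user: mean(turns)
        for user, turns in turns_by_user.items()
        if len(turns) >= min_sessions
    }
    ranked = sorted(averages, key=averages.get, reverse=True)
    if not ranked:
        return []
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]

# Illustrative data: turns per conversation for each user.
sessions = {
    "u1": [40, 50, 60],
    "u2": [5, 6, 7],
    "u3": [100],          # only one session, so excluded
    "u4": [20, 25, 30],
}
selected = top_communicators(sessions, top_fraction=0.34, min_sessions=3)
```

The conversations of the selected users would then form the candidate fine-tuning corpus, ideally after a human review pass.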

Scenario Two: Solving Whether the Model Engages Reasoning Mode

When we interact with AI, sometimes we want concise answers, other times we expect it to show detailed chain-of-thought (CoT). How should AI dynamically switch modes to balance user experience with computational cost?

A basic principle: whether to engage long chain-of-thought should depend on whether it brings tangible value gain to the user. If a brief answer already fully satisfies user needs, then invoking the more computationally expensive deep-thinking mode is unnecessary waste.

This problem is essentially the "EQ" challenge projected onto another dimension. Because "value" is likewise subjective: sometimes users enable chain-of-thought not to examine the logic, but simply because they "enjoy watching it think" — shifting from a logical need to an emotional experience. Thus, preset rules are hard to make work.

Here, we again see the "proxy signal" methodology applied, but in a more abstract and mathematical form.

One practitioner explored a solution: transforming a fuzzy "user intent classification" problem into a clear, quantifiable mathematical optimization objective.

Specifically, one can construct a "cost-benefit" evaluation coordinate system:

X-axis: Represents token count, i.e., computational cost.
Y-axis: Represents response performance score.

Any AI response is a point in this coordinate system. The ultimate optimization goal is to get the model to learn, in all situations, to make decisions that maximize the dynamic area under this "cost-benefit" curve.
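One plausible reading of this objective, stated as an assumption because the discussion did not spell out the formula, is to trace the score a policy reaches at each token budget and compare policies by the area under that curve. The sketch below uses the trapezoid rule. The function name and the sample points are hypothetical.

```python
def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (token_count, score) points."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Two hypothetical policies measured at the same token budgets.
early_gains = [(0, 0.0), (100, 0.5), (200, 0.75)]
late_gains = [(0, 0.0), (100, 0.25), (200, 0.75)]

area_early = area_under_curve(early_gains)
area_late = area_under_curve(late_gains)
```

Both policies end at the same score, but the one that gets there with fewer tokens covers more area (87.5 against 62.5). That is the behavior the objective rewards: spend tokens on long reasoning only when they buy a better answer.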

5. Conclusion

To summarize, we believe that whether it's agent systems, multimodal interaction, or reward model optimization, these explorations ultimately point toward a shared goal: enabling artificial intelligence to deeply couple with human society, rather than remaining in a tool role.

When models possess explanatory power, can self-correct, and engage in long-term interaction with environments, they won't merely answer questions but truly participate in decision-making and creation. That may represent an inflection point in near-term intelligence evolution, and the direction most worth looking forward to.