2025: Three Keywords for Large Language Models

Linear Capital · December 18, 2024

Notes after a week of watching AI product launches.

Over the past week, OpenAI and Google rolled out a series of model and product updates. These launches showcased the cutting edge of the technology, and they also revealed the dominant theme of AI's near-term development: the race to scale up model size has slowed while hardware capabilities await their next leap. After all, even if the tech giants can afford to train larger models, everyday users still need to be able to afford to run them. In place of raw scaling, we're seeing gradual innovation across other paradigms, a trend that is more pragmatic and has a more direct impact on real-world applications.

Native Multimodal Capabilities

When GPT-4o launched, we noted that its multimodal capabilities might be the most important milestone of the year. Its native voice capability transcended the traditional ASR-LLM-TTS pipeline, delivering low-latency, emotionally expressive interactions. This didn't just improve interaction quality — it opened up entirely new application scenarios.
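To make the contrast with the traditional pipeline concrete, here is a toy sketch. Every name in it (Utterance, asr, llm, tts, native_reply) is a hypothetical stand-in, not a real API, and the "native" path is only a caricature of how a speech-native model behaves. The point it illustrates is that a cascade passes only text between stages, so anything that is not words is lost at the first hop.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """A spoken turn: the words plus how they were said (hypothetical type)."""
    words: str
    emotion: str


def asr(audio: Utterance) -> str:
    # Speech-to-text keeps the words and drops tone, pace, and emotion.
    return audio.words


def llm(prompt: str) -> str:
    # Stand-in for a text-only model; it never sees how the user sounded.
    return f"You said: {prompt}"


def tts(text: str) -> Utterance:
    # Text-to-speech has no emotional cue to work from, so it defaults to neutral.
    return Utterance(words=text, emotion="neutral")


def cascaded_reply(audio: Utterance) -> Utterance:
    # Three sequential hops; each one adds latency and can lose information.
    return tts(llm(asr(audio)))


def native_reply(audio: Utterance) -> Utterance:
    # Stand-in for a native speech model: one hop, with the emotion still visible.
    return Utterance(words=f"You said: {audio.words}", emotion=audio.emotion)


user_turn = Utterance(words="I finally passed the exam", emotion="excited")
cascaded = cascaded_reply(user_turn)
native = native_reply(user_turn)
print(cascaded.emotion, native.emotion)
```

In this sketch both paths return the same words, but only the single-hop path can respond to how the user sounded.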

Gemini 2.0 pushes further, extending multimodal capabilities to video interaction. Compared to voice, real-time video interaction sees less demand, but its contextual applications are far broader. In my limited hands-on experience, once you turn on the camera, the model can discuss anything in the video stream with impressive recognition speed and accuracy. It can also look at my other screen and talk me through things in real time, almost like pair programming. Various technical and non-technical limitations remain. For instance, it can read the contents of an unfolded letter but refuses to read an open book. And possibly due to cost considerations, AI Studio doesn't yet seem to let you prompt the model to react to individual events in the video stream.

Multimodal fusion is relatively straightforward with voice, since speech and text map to each other one-to-one. But how video capabilities can better integrate with language models is arguably more of a biomimetic challenge. Human visual perception involves far more than understanding scene content, so beyond direct video comprehension and generation, projects like GameNGen and Oasis are exploring other dimensions such as spatial awareness.

The greatest value of multimodal capabilities doesn't lie in human interaction. Rather, it lies in using human-like cross-modal perception to tackle more general tasks. Models will certainly continue advancing along cross-modal directions next year, and I'm looking forward to seeing applications in different scenarios begin to harness the other modalities built into large models.

Native Long-Horizon Reasoning

Strong reasoning capability is a crucial step toward true agents. The reasoning in OpenAI's o1 is more of an intermediate state than a final answer. Whether for performing reasoning or for optimizing reasoning capability, having the model talk to itself through generated text is simply too inefficient. We should make full use of existing tools to explore the boundaries of the problem, but they won't be the means of reaching the ultimate destination.

The native long-horizon reasoning model we envision would internalize this reasoning, or System 2 thinking, within the model itself. It would likewise scale inference compute to solve complex reasoning problems. But the massive efficiency gap between computation and I/O means such architectures would be far more efficient than o1. There's considerable exploration in this direction, mostly attempting to connect search or reinforcement learning with language models, with the biggest challenge being how to ensure training efficiency. Future breakthroughs will likely come from two fronts:

Data: Providing more data suited for reasoning-based learning.
Model architecture: Discovering new training architectures as efficient as the transformer.
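As a point of reference for what "scaling inference compute" means today, here is a minimal sketch of majority voting over sampled answers (often called self-consistency). This is a swapped-in, well-known technique, not the internalized architecture described above. The solver, its 60% success rate, and the answer values are all assumptions made up for illustration. Each sample stands for a full chain of text the model has to generate, which is exactly the I/O cost the argument above wants to avoid.

```python
import random
from collections import Counter

CORRECT = 42  # hypothetical ground-truth answer for a toy problem


def noisy_solver(rng: random.Random, p_correct: float = 0.6) -> int:
    """Stand-in for one sampled reasoning chain: right 60% of the time (assumed)."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice([41, 43, 44])  # wrong answers are scattered


def majority_vote(n_samples: int, rng: random.Random) -> int:
    """Spend more inference compute: sample n chains, return the most common answer."""
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 500, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(n_samples, rng) == CORRECT for _ in range(trials))
    return hits / trials


acc_single = accuracy(n_samples=1)
acc_voted = accuracy(n_samples=25)
print(f"1 sample: {acc_single:.2f}, 25 samples: {acc_voted:.2f}")
```

Accuracy rises with the number of samples, but so does the number of generated chains, so the gain is bought with roughly 25 times the decoding work in this toy setup.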

True Agents

Chat might be the pleasant surprise AI brought humanity, but agents are the form in which AI achieves scalable real-world deployment. Gemini 2.0's tagline cuts straight to it: "our new AI model for the agentic era."

So far, agents have resembled software modules more than anything else — beyond differing contexts, there's not much distinction, just breaking problems apart and executing them step by step. The true agents on the path to artificial intelligence require solving several additional model-level problems beyond the reasoning we've already discussed.

1. Specialization / Self-Learning

Each agent should demonstrate outstanding capability on the types of problems it specializes in.
As an agent processes more problems, its performance should continuously improve.

The first point is addressed today through targeted pre-training on important tasks, such as math or coding. How to improve a model's capability in a different domain (where underlying data distributions differ) using limited data and without pre-training remains a headache. The second point is even less amenable to pre-training and can only be compensated for through external knowledge retrieval, but distribution differences determine the ceiling.

OpenAI's Reinforcement Fine-Tuning (RFT) took a first shot in this direction. One can imagine application developers using existing domain data to perform reinforcement learning on o1's chain of thought. And after deployment, more feedback-bearing data could continuously optimize the model. ByteDance's earlier ReFT work explored a similar direction.

Current RFT still faces challenges in reward mechanisms and optimization efficiency. In an ideal scenario, we'd want complete chain-of-thought data with step-by-step rewards, but as we've noted, obtaining such data (and rewards) is a major problem. Perhaps clever data curation plus synthetic generation could help significantly here.
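The difference between a final-answer reward and step-by-step rewards can be sketched as follows. This is a toy illustration under stated assumptions, not OpenAI's RFT interface: the Step type, the is_valid labels, and both reward functions are hypothetical, and in practice the per-step labels are precisely the data that is hard to obtain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    is_valid: bool  # hypothetical per-step label from a grader or annotator


def outcome_reward(steps: List[Step], final_answer: str, target: str) -> List[float]:
    """Sparse signal: only the last step is rewarded, based on the final answer."""
    rewards = [0.0] * len(steps)
    if steps:
        rewards[-1] = 1.0 if final_answer == target else 0.0
    return rewards


def process_reward(steps: List[Step]) -> List[float]:
    """Dense signal: each step is scored; steps after the first error get no credit."""
    rewards = []
    still_valid = True
    for step in steps:
        still_valid = still_valid and step.is_valid
        rewards.append(1.0 if still_valid else 0.0)
    return rewards


chain = [
    Step("12 * 3 = 36", True),
    Step("36 + 5 = 42", False),  # arithmetic slip: the sum is 41
    Step("So the answer is 42", True),
]

sparse = outcome_reward(chain, final_answer="42", target="41")
dense = process_reward(chain)
print(sparse)
print(dense)
```

The outcome reward says only that the chain failed, while the step-level reward also shows that the first step was sound and the error entered at the second, which is a much more useful training signal.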

2. More Economical Inference

Scalable agent deployment doesn't necessarily require smaller models per se, but the resource footprint per inference should be smaller. This somewhat contradicts the trend toward inference-time compute. History moves in spirals: going forward, we'll see more efficient inference architectures and a new generation of hardware optimized for inference. We've previously discussed the continued progress of small models, and this time Gemini 2.0 Flash outperformed the previous-generation Gemini 1.5 Pro on key benchmarks. Along this parallel track, data optimization can produce smaller, more economical models without sacrificing performance, further reducing inference costs.

3. Controllability

Our ultimate expectation for agents is obviously for them to team up and complete entire tasks end-to-end. But the biggest difference between today's systems and all previous ones is the uncontrollability at their core. This problem can't be fully solved through prompting, which is why most important applications to date still require AI, humans, and deterministic programs to perform intermediate and final checks.
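The pattern of wrapping a model in deterministic and human checks can be sketched roughly like this. The refund scenario, the 100 and 500 limits, and all names (Proposal, deterministic_check, run_guarded, cautious_human) are hypothetical and chosen only for illustration. The model's proposal is treated as untrusted input: a fixed rule approves the safe cases, and everything else is escalated to a person.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    """An action suggested by a model (hypothetical structure)."""
    action: str
    amount: float


def deterministic_check(proposal: Proposal, limit: float = 100.0) -> bool:
    # A plain rule that holds no matter what the model generated.
    return proposal.action == "refund" and 0 < proposal.amount <= limit


def run_guarded(
    proposal: Proposal,
    human_review: Callable[[Proposal], bool],
    log: List[str],
) -> bool:
    """Execute only what passes the rule check; escalate everything else."""
    if deterministic_check(proposal):
        log.append(f"auto-approved {proposal.action} {proposal.amount}")
        return True
    approved = human_review(proposal)
    verdict = "approved" if approved else "rejected"
    log.append(f"human {verdict} {proposal.action} {proposal.amount}")
    return approved


def cautious_human(proposal: Proposal) -> bool:
    # Stand-in for a human reviewer with a higher, but still finite, limit.
    return proposal.action == "refund" and proposal.amount <= 500.0


audit_log: List[str] = []
small = run_guarded(Proposal("refund", 40.0), cautious_human, audit_log)
large = run_guarded(Proposal("refund", 900.0), cautious_human, audit_log)
print(audit_log)
```

The guard does not make the model itself more controllable. It only bounds what an uncontrolled output can do, which is why this kind of engineering practice is common even while the underlying research question stays open.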

This may be the area where AI research progress is least clear (though there are plenty of engineering practices). Most work I'm aware of currently focuses on interpretability. But as deployment scenarios grow broader and deeper, the importance of interpretability and, further, controllability will become increasingly prominent.

Final Thoughts

Last week's series of product and technology launches undoubtedly excited me. Our forward-looking projections here aren't meant to suggest today's models can't be deployed because of these issues. Quite the opposite — even if model progress stopped today, the AI capabilities we already possess are sufficient to generate tremendous value across a wide range of future scenarios. What matters in between is how application developers unearth these scenarios. This article is also about sharing our envisioned roadmap for the future, hoping that on one hand, you'll choose approaches suited to today's model capabilities for delivering software or services, and on the other, you'll prepare for tomorrow. The pace of technological development is like a river — perhaps every sharp bend is unexpected, but it always flows forward, eventually reaching the sea.
