Becoming China's Most Daily-Interaction Heavy AI Company: 15 Things MiniMax Got Right

暗涌Waves · September 2, 2024
By Lili Yu

There are two optimistic signals regarding large language models.

Another Chinese large model startup has joined the frenzied race in video generation. Following companies like Kuaishou, Aishi Technology (PixVerse), and Zhipu AI, MiniMax officially launched its video model abab-video-01 and music model abab-music-01 at the end of August.

Compared to other video models, MiniMax founder Junjie Yan believes abab-video-01 offers "higher compression rates, better text responsiveness, and greater stylistic diversity." He also explained the reason for the end-of-August release — "the underlying infrastructure we had built for text all needed to be upgraded" and "video is, most of the time, more complex than text."

Currently, abab-video-01 only offers text-to-video capabilities, with image-to-video, editing, and controllability features planned for future iterations. More crucially, Yan previewed that within the coming weeks, MiniMax will release "a new version, abab 7, that can match GPT-4o in both speed and quality."

Among China's seven large model startups, MiniMax has long been one of the earliest founded and the fastest movers in consumer productization and commercialization. Now, Yan tells us, it is also "the company with the largest daily interaction volume among all Chinese large model startups."

The following are excerpts and edited remarks from Junjie Yan and MiniMax Technical Director Jingtao Han at "2024 MiniMax Link Partner Day," offering multiple perspectives on MiniMax's underlying thinking around R&D, product, and commercialization.

1. One Person, 3,000 Lives in a Day

Currently, MiniMax's large models interact with end users (including our own products and open platform partners) 3 billion times daily: processing over 3 trillion text tokens, generating 20 million images, and generating 70,000 hours of voice content every day. Among all Chinese large model companies, MiniMax now handles the largest daily interaction volume. Three trillion text tokens means one person experiencing 3,000 lives in a single day.
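As a rough sanity check of the "3,000 lives" framing (the lifetime reading budget below is our assumption, not a figure MiniMax gave), dividing the daily token count by 3,000 implies about one billion tokens per lifetime, which spreads out to a plausible daily reading load:

```python
# Hedged back-of-the-envelope check of the "3,000 lives" claim.
# Only DAILY_TOKENS and LIVES come from the article; the rest is assumption.

DAILY_TOKENS = 3_000_000_000_000   # 3 trillion text tokens per day (article)
LIVES = 3_000                      # claimed equivalent "lives" per day

tokens_per_life = DAILY_TOKENS // LIVES
print(f"Implied tokens per lifetime: {tokens_per_life:,}")  # 1,000,000,000

# Spread over ~75 years, that is a believable daily reading volume.
years = 75
tokens_per_day = tokens_per_life / (years * 365)
print(f"≈ {tokens_per_day:,.0f} tokens/day over {years} years")
```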

2. Most Human Communication Happens in Multimodal Form

Why build a video generation model? Fundamentally, most of what we consume daily isn't text but dynamic content. Open Xiaohongshu (RED) and you see images and text; open Douyin and you see video; even open Pinduoduo to shop, and most of the time you're looking at images. The majority of information is expressed through multimodal content; text is often just its most distilled fraction.

3. Why Two Months Behind Kling?

The core reason is that we were solving a harder technical problem. Video, most of the time, is more complex than text because video context is naturally very long. A single video represents millions of tokens in input and output — inherently difficult to process.

Second, video data volume is massive. A five-second video is several megabytes, but five seconds of text (roughly 100 characters) is less than 1 KB: a thousand-fold difference in storage. This meant the underlying infrastructure we had built for text, covering how we process, clean, and annotate data, was no longer adequate and all needed upgrading.
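The thousand-fold figure is easy to verify with a back-of-the-envelope calculation. The bitrate below is our assumption (a typical compressed short-video stream), not a number from the article:

```python
# Rough check of the video-vs-text storage gap described above.
# Assumption (not from the article): ~5 Mbit/s for compressed short video.

video_seconds = 5
video_bitrate_bps = 5_000_000                          # ~5 Mbit/s
video_bytes = video_seconds * video_bitrate_bps // 8   # ≈ 3.1 MB

text_bytes = 1_000                                     # ~100 characters, under 1 KB

ratio = video_bytes / text_bytes
print(f"video ≈ {video_bytes / 1e6:.1f} MB, text ≈ {text_bytes} B, "
      f"ratio ≈ {ratio:,.0f}x")                        # on the order of 1,000x
```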

4. "Fast Is Good"

"Fast" is the core R&D objective for MiniMax's underlying large models. Because of the scaling law in large language models — with the same algorithm, more training data and parameters means better performance. So between two similarly performing models, the one that trains and infers faster can use compute resources more efficiently to iterate on more data, thereby achieving better model capability.

Before MoE architecture gained industry recognition, we were the first in China to achieve a core breakthrough in the MoE algorithmic approach. In our previous generation model abab 6.5s, the MoE model was 3-5x faster than dense models. This is also a core reason why abab 6.5s could handle billions of daily interactions.

Additionally, Linear Attention not only brings an order-of-magnitude speed improvement but is also a critical step toward infinite-length input and infinite-length output. The core technology of the abab 7 model is built on MoE + Linear Attention. Beyond that, we've also built multimodal understanding capabilities into abab 7.
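MiniMax has not published the exact form of abab 7's attention, but the generic idea behind Linear Attention can be sketched: replace the softmax with a positive kernel feature map so the keys and values collapse into a small d×d summary, making the cost linear in sequence length (O(n·d²) instead of O(n²·d)). The ELU+1 feature map below is a standard textbook choice, not MiniMax's implementation:

```python
import numpy as np

def feature_map(x):
    # A common positive feature map for linear attention: ELU(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Kernelized attention: softmax(QK^T)V is replaced by
    # φ(Q) (φ(K)^T V) / (φ(Q) φ(K)^T 1), avoiding the n×n attention matrix.
    Qf, Kf = feature_map(Q), feature_map(K)        # (n, d) each
    kv = Kf.T @ V                                  # (d, d) summary, no n² term
    z = Qf @ Kf.sum(axis=0, keepdims=True).T       # (n, 1) normalizer
    return (Qf @ kv) / z

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

Because the (d, d) summary `kv` can be updated incrementally as tokens arrive, this formulation is what makes very long (in the limit, unbounded) contexts tractable.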

5. On Scaling Laws

Scaling laws have held true at least in recent years, and we've been able to stay on the predicted curve. Beyond parameters, data volume, and compute, context length is also a crucial dimension. Getting Linear Attention right is equally important.

In academia, the idea of Linear Attention has existed for a while — some believed in it, others didn't. We encountered many engineering bottlenecks along the way, but our research has now reached a state where we feel we have good, usable control over Linear Attention.

6. Science and Technology Are the Primary Productive Forces

Whenever our models improve significantly, or processing speed increases markedly, we see user scenarios and depth of use rise significantly. Conversely, a real incident: a bug caused conversation-repetition error rates to spike, and conversation volume dropped 40% that same day. This is the fundamental reason we persist in technological innovation.

7. The Same Secret in Positive and Negative Feedback

Technical R&D requires massive investment — one look at our monthly bills is heart-wrenching. When something is this expensive, you often wonder whether to take shortcuts. For instance, should we stop doing technology and just improve the product first?

Also, in technology, something usually takes three experiments to succeed. When the third one finally works, you wonder if the first two were necessary. Like eating three buns to feel full and wondering if you could have skipped the first two.

But our practical experience has proven: taking shortcuts gets you slapped in the face — we've verified this more than ten times. And when we summarize most positive feedback, we find it traces back to technological progress. So whether it's positive or negative feedback, the key ultimately turns out to be technology.

8. More Than Just Chat Companions

Products like our Talkie (Xingye) are fundamentally designed not as chat companions but as content communities. Users can create stories, build worlds. Other users can then interact within those created worlds. What we hope to achieve with Talkie is personalization with substantial user input. Its core is a content community.

9. Going Global

Among all large model companies, we're the only one with significant international operations. We launched a product early on, put it overseas, and found people in twenty to thirty countries across the world, speaking different languages, started using it. But what we built wasn't chat companionship or AI emotional support — it's a next-generation content generation platform.

Overseas users have good payment habits, so monetization is relatively clear and faster. Our voice model, for example, is in the global first tier. The question is how to package it into a polished product, through API and self-service models, so users are willing to pay $5 or $10 monthly for a subscription. Our technology is there now; it's more a matter of company focus and resource allocation — how to monetize what we have.

10. Commercialization in China's B2B Market

To truly make money in B2B, you need to become an industry standard. In China, much of B2B business devolves into project-based work. If large models purely output technology and customize for every enterprise, the math doesn't work as a business model.

Additionally, in today's product form, ordinary consumers have zero loyalty. The moment you charge, they switch to another product; the model doesn't hold. Our current thinking is to keep refining our more tool-oriented products, like Hailuo, adding new features until there's user stickiness and differentiation. Once stickiness is built, then we invest in acquisition. ROI will turn positive one day, but not with today's product form.

11. Two Optimistic Signals from the Price War

Last year, Chinese models had zero competitiveness overseas. But the hundred-model war, including the price war, has brought significant change. After prices dropped, companies that thought large models were expensive started seeing them as cheap. We were astonished to find that after the price war, many very traditional enterprises became willing to use large models — they figured the cost was low anyway, and if it made a mistake, they could just call it again. This massively increased model usage. After this reached a certain stage, we found that overseas — in Southeast Asia and elsewhere — our models had become competitive.

These are the two positive changes we see: domestic large model usage is indeed growing significantly, and Chinese models are increasingly competitive overseas.

12. The True Marker of Large Model Transformation

All models currently have double-digit error rates. True transformation comes when a model reduces error rates to single digits; that would be a fundamentally qualitative change. It would make many complex tasks go from impossible to possible, because complex tasks require multiple steps, per-step success rates multiply, and a high-error-rate model's compounded success rate collapses.
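The multiplicative argument can be made concrete with a toy calculation (illustrative numbers, not MiniMax's measurements): if each step of a 10-step task fails independently at some per-step error rate, end-to-end success is the product of per-step success rates, so the gap between double-digit and single-digit error rates is dramatic:

```python
# Why single-digit error rates matter for multi-step (agent-style) tasks:
# per-step success probabilities multiply, so small error gains compound.

def task_success(step_error_rate: float, steps: int) -> float:
    return (1.0 - step_error_rate) ** steps

for err in (0.20, 0.10, 0.05, 0.01):
    print(f"{err:.0%} per-step error, 10 steps -> "
          f"{task_success(err, 10):.1%} task success")
# 20% per-step error, 10 steps -> 10.7% task success
# 10% per-step error, 10 steps -> 34.9% task success
# 5% per-step error, 10 steps -> 59.9% task success
# 1% per-step error, 10 steps -> 90.4% task success
```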

Why are there no agent applications now? It's not because the agent frameworks aren't well-written enough — it's because the models themselves aren't good enough. My judgment is: if scaling laws hold, this model will definitely appear, and the marker will be error rates dropping to single digits.

13. Three Optimization Directions for AI Applications

  1. Continuously reducing model error rates.
  2. Infinite-length input and output.
  3. Multimodality.

14. Three Early Judgments in Entrepreneurship

  1. Next-generation AI is an intelligence infinitely close to passing the Turing test — natural interaction, within reach, everywhere.
  2. Achieving this goal is a massive systems engineering project; you can't settle for 5% or 10% improvements — you need technological breakthroughs that bring order-of-magnitude gains.
  3. Start with high-fault-tolerance applications like casual chat and writing. As technology improves step by step, you can build more powerful, problem-solving-oriented applications. Ultimately bringing the extension of intelligence to everyone.

15. A Sense of Mission

During the 2021 Spring Festival, I went home to visit my grandfather. The life his generation lived through was my favorite story as a child. My 80-year-old grandfather wanted to write a memoir, but he couldn't type and didn't have the energy to research materials. In theory, AI was perfectly suited for this task — but regrettably, AI at that time couldn't do it. This made me realize that the ultimate goal of AI development is to become more general, to help everyone. Three words to summarize: Intelligence with Everyone.

Image source | IC photo

Layout | Yao Nan