Back-to-School Review: Is MiniMax's New Multimodal "Full Suite" Any Good? | Yunqi Capital

Yunqi Capital · September 3, 2024

Video, music, voice — it can do it all. Let's test it out together.

Today, multimodality has become a must-have for advancing AI capabilities.

On August 31, MiniMax, the AI large-model unicorn and Yunqi Capital angel-round portfolio company, unveiled its latest video generation model, music generation model, and speech model at its inaugural Partner Day, alongside the technical architecture of its next-generation multimodal model abab7.

From day one, MiniMax built multimodality into the foundation of its technology stack. So how good is this latest multimodal "full suite"? In this Yunqi Capital feature, we put it to the test.

Before we dive in, here's some on-the-ground feedback from Yunqi Capital partner Chen Yu, who attended MiniMax Partner Day.

Chen Yu, Partner at Yunqi Capital

As one of the few "available now" models already open to users domestically, the video generation model abab-video-1 was the star of the show. The results exceeded our expectations — it sits squarely in the top tier of domestic video generation models. The music and speech models launched that day are also quite mature; the music generation demo in particular was stunning. Also worth watching is abab7, the multimodal model built on linear attention that processes long text faster and more cost-effectively.

A few more figures caught our attention:

  • MiniMax's large models interact with users worldwide 3 billion times a day
  • They process over 3 trillion text tokens per day, equivalent to experiencing 3,000 lifetimes in a single day
  • They generate 20 million images daily, comparable to the art collections of 400 Forbidden Cities
  • They synthesize 70,000 hours of speech per day, like reading 7,000 books in one day

In short, MiniMax is the domestic large-model company with the highest daily processing volume and interaction time. And now it's bringing new technical breakthroughs.

Video Generation Model: abab-video-1

This short film, The Magic Coin, is a demo of MiniMax's video generation model. Castles, ice fields, rainforests, magical realms — whether real-world locations or fantasy settings, the generated videos carry genuine cinematic quality.

The newly launched video-01 can generate high-resolution, high-frame-rate native video from text prompts, with strong performance in compression rate, text responsiveness, and style diversity. It can produce 6-second video clips from text, supports text generation within videos, and natively outputs up to 1280×720 at 25 fps. Users can currently try text-in-video generation for free on the Hailuo AI website (www.hailuoai.com/video).

In the first week of September, we also used "back to school" as a theme and asked video-01 to create a clip of a robot carrying a backpack to school. Within about two minutes, we received the following segment. Click play to see how it compares to your own generated version.

Music Generation Model: abab-music-1

The hip-hop track above, themed around "back to school," was created using MiniMax's music generation model abab-music-1. The model supports versatile end-to-end music generation, capable of synthesizing multiple musical forms including instrumental pieces and a cappella works, greatly simplifying music recording and creation — it handles both lyrics and composition.

The music generation feature is also currently free for users and available through the Hailuo AI web version. Pop, urban, rock, blues, and more — come "unbox" your own tracks.

Speech Generation Model: abab-speech-1

The updated abab-speech-1 supports multiple languages including Cantonese, Korean, Spanish, and Japanese, with hyper-realistic, naturally nuanced emotional variation. We also generated two audio clips with different effects — have a listen.

Next-Generation MoE + Linear Attention Model Architecture

Partner Day also saw the release of a next-generation model architecture, MoE (mixture of experts) + Linear Attention, which enables faster, more cost-effective long-text processing. The multimodal model abab7, built on this architecture, will launch soon.

This architecture supports efficient training on massive datasets, dramatically improving practicality and response speed while significantly reducing large-model training and inference costs. Compared with standard Transformer architectures, it reduces costs by over 90% at a sequence length of 128K, and the advantage grows as sequences get longer.

Benchmarked against models of the same generation as GPT-4o, the next-generation abab model is twice as efficient when processing 100,000 tokens, and the gain increases as sequences grow longer.
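To get a feel for why the gap widens with length, here is a rough back-of-the-envelope sketch. It is our own illustration, not MiniMax's design or published numbers: it uses textbook operation counts for standard softmax attention and for a generic kernelized linear attention, assumes a hypothetical head dimension of 128, and counts only the attention mixing step rather than the whole model.

```python
# Back-of-the-envelope comparison of attention cost growth with sequence length.
# Assumptions (not from MiniMax): textbook op counts, a hypothetical head
# dimension of 128, and only the attention mixing step is counted.

HEAD_DIM = 128  # hypothetical per-head dimension, chosen for illustration


def softmax_attention_ops(seq_len: int, head_dim: int = HEAD_DIM) -> int:
    """Multiply-adds for Q @ K^T plus weights @ V: two n x n x d products."""
    return 2 * seq_len * seq_len * head_dim


def linear_attention_ops(seq_len: int, head_dim: int = HEAD_DIM) -> int:
    """Multiply-adds for K^T @ V plus Q @ (K^T V): two n x d x d products."""
    return 2 * seq_len * head_dim * head_dim


def cost_ratio(seq_len: int, head_dim: int = HEAD_DIM) -> float:
    """How many times more work quadratic attention does at this length."""
    return softmax_attention_ops(seq_len, head_dim) / linear_attention_ops(seq_len, head_dim)


if __name__ == "__main__":
    for seq_len in (1_024, 8_192, 32_768, 131_072):
        print(f"{seq_len:>7} tokens: ratio = {cost_ratio(seq_len):.0f}x")
```

Under these assumptions the ratio simplifies to sequence length divided by head dimension, so it keeps growing as inputs get longer. Whole-model savings would be smaller than this ratio suggests, since the rest of the network does not change, so the sketch should not be read as a derivation of the figures quoted above.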

Key Future Optimization Directions

At Partner Day, MiniMax founder and CEO Junjie Yan also shared the company's original vision and mission. On the key directions for improving AI going forward, he highlighted three areas:

1. Continuously reducing model error rates: Current models still have relatively high error rates and can be impressive one moment and unreliable the next. This is what keeps models from handling complex tasks: a complex task typically requires multiple steps, and because every step has to go right, the chance of completing the whole task falls off exponentially as steps are added. Lowering model error rates is the most fundamental prerequisite for enabling complex task handling, and the core lever for deepening user engagement.

2. Unlimited-length inputs and outputs: Why does this matter? The simple reason is that humans have this capability: we can process input and output of unlimited length. In traditional large models, computational demands scale quadratically with input and output length, quickly hitting ceilings that compute power can't sustain. Solving this requires foundational innovation.

3. Multimodality: It's easy to see in daily life that text interaction is just a small slice; voice and video interactions are far more common. Multimodal content — sound, images, text, and video — has become the mainstream of information transmission. To increase penetration, multimodality is the only path forward.
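The compounding effect behind point 1 can be illustrated with a small sketch. The assumptions here are ours, made for illustration only: every step must succeed for the task to succeed, and each step fails independently with the same probability. The function name is hypothetical.

```python
# Sketch of how per-step errors compound over a multi-step task.
# Assumptions (ours, for illustration): every step must succeed and
# steps fail independently with the same probability.


def task_success_probability(step_error_rate: float, num_steps: int) -> float:
    """Probability that all num_steps steps succeed."""
    if not 0.0 <= step_error_rate <= 1.0:
        raise ValueError("step_error_rate must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return (1.0 - step_error_rate) ** num_steps


if __name__ == "__main__":
    for error_rate in (0.10, 0.05, 0.01):
        for steps in (5, 20, 50):
            p = task_success_probability(error_rate, steps)
            print(f"error {error_rate:.0%}, {steps:>2} steps: success {p:.1%}")
```

Under these simplified assumptions, a 5% per-step error rate leaves only about a 36% chance of finishing a 20-step task, while a 1% error rate raises that to roughly 82%. Real tasks allow retries and have correlated errors, so treat the numbers as an intuition aid rather than a measurement.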

For more of Junjie Yan's candid remarks, click "Read More."