Who Can Build China's ChatGPT? We Invested in This Company 15 Months Ago | Yunqi Capital ChatGPT Special

Yunqi Capital · February 23, 2023

The first multimodal AI large model startup in China

From tech applications to philosophical musings, discussions about ChatGPT keep heating up. One central question dominates:

Who will build China's ChatGPT?

China's large model space was virtually a blank slate until recently, when major tech firms began announcing their own models and star entrepreneurs started issuing "hero posts" (open calls for co-founders and talent) to jump in.

But as early as late 2021, a group of young people, driven by faith in AGI (Artificial General Intelligence), made large models their primary R&D focus from Day 1. Their first product, "Glow," now has nearly 5 million users, making them possibly the first Chinese startup with large model capabilities across three modalities — voice, image, and text.

"Yunqi Tech π" presents a ChatGPT special. This episode brings you the story of MiniMax, which Yunqi invested in 15 months ago, and its journey building AI large models.

Chen Yu, a partner at Yunqi Capital, the only early-stage institutional investor in the angel round, said: "ChatGPT has pointed the way forward for AGI. Investing in and building foundation-layer large models, much like L4 autonomous driving, demands heavy resources in technology, data, and capital, because the technical barriers are high, the R&D cycles long, and the funding requirements massive. Early entry requires foresight, and long-term commitment demands resilience and conviction. We're honored to have partnered with the MiniMax team 15 months ago. We believe that with their rapid iteration, commercial execution, and cutting-edge technical vision, we can build the future of AGI together."

➤➤➤ "AI technology represented by ChatGPT will fundamentally change every software service category." This assertion by Microsoft CEO Satya Nadella has become the consensus among most tech practitioners worldwide.

But while the overseas tech industry is throwing itself into this wave with full force, Chinese practitioners have made a dismaying discovery: the domestic large model landscape is virtually a blank slate. Only a handful of major companies have sporadically announced plans to launch their own models in the future, and a few star entrepreneurs have revealed intentions to start businesses in this space.

Against this backdrop, the seemingly out-of-nowhere emergence of the startup MiniMax is a welcome surprise. Founded in December 2021, a little over a year ago, the company has made large models its primary R&D direction since day one. Today it has foundation large models across three modalities: voice, image, and text generation.

Built on its self-developed large models, the company has launched Glow, a platform for creating intelligent conversational agents. Four months after launch, the app has nearly five million users and handles hundreds of millions of user calls to its models every day.

Before ChatGPT went viral, building large models was a "dumb business": massive investment, a niche track, and highly uncertain commercial prospects. Ordinary entrepreneurs stayed away, and even resource-rich internet giants rarely touched it or invested only modestly. This is the direct reason the domestic large model landscape is a blank slate today.

Precisely because of this, MiniMax's existence provokes curiosity. After speaking with several early members and technical leads, we found a group of technical idealists with wildly different experiences and backgrounds who share a long-standing habit of thinking about and exploring AI. They came together because of their faith in AGI (Artificial General Intelligence).

At a time when people lament the difficulty of technological long-termism, the appearance of such a team seems to be exactly what people have been waiting for.

Three Large Models

From its very first day, MiniMax chose to make large models its primary R&D direction.

Currently, MiniMax has three foundation large models with distinct capabilities across three modalities: Text-to-Text, Text-to-Visual, and Text-to-Audio.

Each model transforms or generates content from one form into another. Text-to-Text produces text from text (for example, answering a question); Text-to-Visual produces images from text (for example, rendering an image from a description); Text-to-Audio produces sound from text.

Large models are a complex systems engineering challenge. MiniMax co-founder Allen (Yang Bin) compares it to building a rocket — the technology and papers are public, but that doesn't mean you can actually build the rocket. And as a startup, you need to achieve set goals within limited time and resources.

Early team member Gewen describes it this way: "Every technical judgment directly affects the final outcome, and every step is connected in series, so every decision matters." The team's diverse technical backgrounds allow them to complement each other's perspectives and engage in thorough discussion.

Allen told GeekPark that the team's first milestone was to bring all three large models to world-leading levels within six months. This tests the team's ability to make correct decisions amid countless technical choices, and drives them to explore more fundamental, low-level technologies. Allen said, "At the low-level technology layer, we've done things that startups typically don't do."

The deepest layer of MiniMax's self-developed stack is the hardware infrastructure built to support large models. It uses efficient GPUs to provide stable, reliable parallel computing power, supports multimodal computation across voice, text, and vision, and offers strong in-house training capacity and adaptability. Through this infrastructure layer, data and computing power feed the large models.

Technical sophistication aside, the ultimate purpose of large models is to deliver services. In November 2022, the company released its first product, Glow. Four months later, the app has nearly five million users.

Some users describe Glow as "an open world in first-person perspective" — a characterization the team finds apt. Players build their own world through dialogue with AI-driven intelligent agents. Glow offers experiences conversing with multiple intelligent agents of different "personas" — players can choose existing agents, such as characters from the novel The Three-Body Problem, or use language to describe personality and "sculpt" their own agent.

Glow's significance for MiniMax lies in proving out the interaction between large models and the real world. Through this product, large model capabilities serve users in concrete forms. For example, users can generate an agent's avatar through language description — this is the Text-to-Visual image generation capability; different agents possess different voice timbres and qualities — this is Text-to-Audio voice generation.

Creating your own intelligent agent on Glow | Source: Glow

Glow currently sees hundreds of millions of daily user calls. Making large model capabilities so widely accessible requires solving technical challenges of low cost, high efficiency, and stability. Therefore, above the models, MiniMax has built an inference platform (Computing Platform).

Allen frames the problem this way: "How do you make something very heavy feel very light to use? This is actually an enormously difficult engineering problem." In the future, this inference platform will support more applications. Through these applications, the models will interact broadly with human behavior in the real world, and the resulting data will guide continuous model iteration.

A Team That Believes in AGI

MiniMax was founded in December 2021. Several of the team's core technical leads come from well-known AI companies and major tech firms at home and abroad.

Gewen (alias) graduated from Johns Hopkins University, where he spent 10 years in a university lab researching natural language processing. His final internship before graduation was at Microsoft headquarters in the US, where he encountered generative dialogue systems and was excited by their technical possibilities.

"Studying natural language processing, what I wanted to build was an algorithm, model, or intelligent agent that can understand human speech and communicate with people — that was my original intention for choosing this major," he says. The most attractive aspect of joining a startup like MiniMax was the opportunity to build a language model that interacts with a large number of real-world users and iterates from their feedback.

Founding employee Dacong (alias) previously worked at SenseTime. He deeply believes in AI's possibilities, but having lived through the previous AI wave, he also profoundly recognized the limitations of the last generation's AI technical paradigm.

In the past, AI technical teams worked by customizing individual models for specific application scenarios. Models proliferated but never truly connected, and maintaining thousands of models long-term was impractical. Even as technical levels kept improving through enormous effort, AI's real-world impact became increasingly limited. He started following language model developments when GPT-1 emerged in 2018, gradually realizing that language could serve as an interaction interface integrating different modalities.

Allen holds a PhD in computer vision. While studying overseas, he was a founding member of Uber ATG Research, where he saw the institute built from the ground up and later lived through the sale of Uber's autonomous driving unit. He then joined the autonomous driving startup Waabi as a founding member, and has extensive experience in data-driven end-to-end systems. In 2021, Allen met his current partners, and they regularly exchanged notes on breakthroughs in recent papers. Step by step, these breakthroughs made him feel that AGI was drawing nearer.

For the team, three small events across different industries between 2020 and 2021 solidified their conviction about AGI's arrival.

The first was GPT-3's release in June 2020. Parameter counts jumped from the millions and hundreds of millions to the hundreds of billions (GPT-3 has 175 billion), and training shifted from relying on labeled data to learning from large and varied text corpora. The simultaneous quantitative leaps in parameters and data triggered a striking qualitative change, giving GPT-3 reasoning and generalization abilities that earlier AI models lacked.

The second came half a year later, in January 2021, with the debut of the cross-modal model CLIP. CLIP learned to connect images with natural-language descriptions, which made it possible both to describe images in words and to guide image generation from text. This bridged language and images, two different media forms. OpenAI's later text-to-image tool DALL-E 2 was built on CLIP.

The significance was that where different modalities previously required specially designed proprietary models, one technical framework could now process data across modalities and achieve very good cross-modal generation and transformation.

The third event came roughly half a year after that. In August 2021, Tesla showcased its latest autonomous driving technology at AI Day, demonstrating for the first time that an end-to-end, fully data-driven approach could be applied successfully to real-world autonomous vehicles. Only afterward did most autonomous driving companies around the world gradually come to believe that an end-to-end deep learning stack could work in the real world.

Allen says these three events across different industries were connected by this group of people who had always harbored AGI dreams. They believe AI technology will undergo qualitative changes and upgrades within two to three years; based on this upgrade, AGI may arrive within this generation's lifetime.

Therefore, four months after Tesla AI Day concluded, MiniMax was formally established. According to the team, MiniMax at that time was likely the first company in China to go all-in on AGI.

There's another interesting detail: while preparing to start the company, several team members were avid players of Detroit: Become Human. In Allen's view, the game depicts an era of human-machine coexistence after AGI is achieved.

He believes human-machine coexistence will certainly be realized. Robots may have physical forms or exist virtually, but they will be intelligent enough to form genuine relationships with humans, whether by providing productivity or emotional companionship.

User-shared co-created storylines on Glow | Source: Xiaohongshu

"User-in-the-Loop"

"Since ChatGPT blew up, we've been pretty happy — saves us a lot of effort educating the market." At a small media briefing, a MiniMax founding member said this while chatting with attending journalists. This was also the company's first formal small-scale appearance; for the previous 14 months, the company had rarely spoken publicly, quietly focusing on technology and product R&D.

ChatGPT has opened paid accounts and passed 100 million users within just two months of launch, which makes it something entirely new. It is a large model, but its popularity and frequency of use also make it something like a "product."

"The biggest revelation of ChatGPT seems to be validating that what we're doing does have demand," says Gewen, who considers this a tremendous encouragement.

In Allen's view, this is precisely the most magical aspect of large models right now: "When it's general enough, with strong enough generalization capabilities, it has sufficient multi-task general abilities that it can often be used directly."

Many people already use ChatGPT to fix code bugs, look up information, write articles, and even draft reports, each according to their own needs. A low barrier to entry and usability across diverse populations give large models inherent product attributes.

"AGI companies are actually also an entirely new type of company," Allen explained at the briefing. Large model companies no longer build targeted solutions based on AI technology, but instead enable more people to directly engage in dynamic, real-time updated interaction with the technology through various means.

Under this framework, the traditional concepts of B2B and B2C no longer matter. Dacong notes: "We don't deliberately distinguish between them. What matters is how large a user base we can cover, and how much efficiency improvement or other value we can bring them."

One can imagine that when MiniMax was just founded in 2021, this logic would have led to repeated rejections in their early search for investors, partners, even employees. "Couldn't convince investors, because no one could understand. We explained countless times, and few believed," one founding member said.

With core technology on one end and specific users on the other, achieving truly smooth feedback and linkage between the two is one of MiniMax's core operating principles. The team summarizes this as "User-in-the-Loop."

Allen says the inspiration for this still came from Tesla AI Day 2021. Many technologies showcased at AI Day had first-version academic prototypes originating from him and former collaborators, but Tesla loaded these technologies into countless cars to interact with real-world users and iterate from feedback.

"I think it taught me one thing: when you have very cutting-edge technology, how do you, from a commercial company's perspective, place it in the real world and make real impact for everyone."

When asked about next steps, the team members' favorite phrase is "at our own pace." They say they will open up model APIs this year and will develop new products based on model capabilities going forward.

Cover image source: MiniMax