Yunqi Capital | In Conversation with MiniMax's Junjie Yan: AGI Isn't a Weapon — It's a Product Ordinary People Use Every Day

Yunqi Capital · April 18, 2024

MiniMax is already one of China's most highly valued large language model companies.

In China's large-model startup scene, the debate between "technology believers" and "market believers" has remained heated.

MiniMax appears to be something of a middle path: technology believers who were convinced of AGI before ChatGPT's release, developing proprietary multimodal general-purpose large models; market respecters who, before industry consensus formed, bet on MoE models to reduce compute costs, building multiple products and opening API interfaces to carve out commercial paths.

With its "technology + product" dual-engine approach, the company has already achieved milestone results. Its released abab 6.5 series trillion-parameter MoE model approaches the world's leading large models like GPT-4. It serves over 20,000 enterprises and individual developers, with multiple applications exceeding 1 million daily active users; the company's valuation has surpassed $2.5 billion.

Choosing the road less traveled — gazing at the stars while keeping feet on the ground — this is the "tech pragmatist" we've been searching for.

In early 2021, Yunqi Capital partner Chen Yu approached MiniMax's founding team. Their first conversation covered foundation models and visions for AGI. These technological idealists hoped to co-create with users, turning ideals into reality.

Shortly after, we became MiniMax's sole early-stage investor in its angel round. We've been investing in AI for nearly a decade, and look forward to joining more tech pragmatists on the path to an AGI future.

This edition of "Yunqi Partners" brings you MiniMax founder Junjie Yan's first public interview, discussing the choices and persistence behind non-consensus bets.

"The technical path we chose also has the highest ceiling — with almost no retreat."

This article is republished with permission from LatePost (ID: postlate)

By Cheng Manqi

Edited by Song Wei

In China's bustling large-model startup scene, MiniMax founder Junjie Yan may be the most mysterious figure. He has never appeared publicly or given any interviews. Even with his company's $2.5 billion valuation, he remains silent.

Among high school and college students, products like Glow and STARFIELD have become popular — used for dating simulation games, conversations with virtual beings, "unhinged" creative writing, even building cities. But few know they all run on MiniMax.

Following its recent Series B, MiniMax is now among China's highest-valued large-model companies, with both Tencent and Alibaba on its shareholder list. It has only been around for two years.

Before founding MiniMax, Yan had served as VP at SenseTime, deputy dean of its research institute, and CTO of its smart city business group. Earlier, he researched computer vision at the Chinese Academy of Sciences and Tsinghua University. While technical talent often excels in narrow domains, Yan has thrived across multiple fields.

The 35-year-old Yan is a soft-spoken, perpetually smiling manager. Investors describe him as an excellent listener who puts conversation partners at ease — then quietly drops sharp observations when they relax.

Junjie Yan presenting at the 2021 World Artificial Intelligence Conference Algorithm Competition Finals.

In founding MiniMax, Yan showed no signs of a first-time founder. He appeared willing to make big bets.

In late 2023, while most Chinese peers continued iterating dense models — which offered more reliable performance gains — Yan directed nearly all R&D and compute resources toward something far more uncertain: MoE (Mixture of Experts) models. His judgment: if they were to serve tens of millions or even hundreds of millions of users, MoE was essential, or else "the cost and latency per generated token would be unacceptable — we'd collapse quickly."

Before large models became hot, MiniMax had secured substantial GPU compute from ByteDance's Volcano Engine at relatively favorable prices, stockpiling ammunition early. Yet MiniMax owns no GPUs, because Yan believes holding assets would distort decision-making.

"The technical path I chose has the highest ceiling, with almost no retreat; the compute approach I chose is also aggressive," Yan says.

Robin Li believes the "dual-engine" approach — simultaneously building models and applications — isn't a good model for startups. But Yan believed from day one that for large-model startups to grow independently and become substantial, both technology and product must be done well.

MiniMax is the earliest, most prolific, and most product-invested among China's large-model startups: of roughly 300 people, nearly 200 work on product. Its first product, Glow, launched in October 2022, followed by STARFIELD, Hailuo AI, and at least four others — spanning companion social-entertainment apps to productivity Q&A tools, with multiple apps surpassing 1 million DAUs.

Yan believes startups have only one path to independent development: before the window of rapid technological evolution closes, build 2C products with massive user bases.

"Without products to capture it, even if you have a technical breakthrough, it ultimately won't be yours," Yan says.

China's market now has six large-model unicorns (MiniMax, Moonshot AI, Zhipu AI, Baichuan, 01.AI, StepFun), with over 100 models total. They are racing to catch GPT-4 — released a year ago — with fewer resources, while navigating intense competition.

Some prioritize survival first. Others believe in "go big or go home" and are unwilling to occupy a middle ground. Yan belongs to the latter group.

Below is LatePost's conversation with MiniMax founder Junjie Yan.

"Everything only becomes good when taken to the extreme"

· LatePost: An OpenAI engineer told us he judges whether an AI founder has true AGI conviction by whether they started their company before or after ChatGPT's release.

Junjie Yan: MiniMax was founded at the end of 2021. At that time, AGI was still a massive non-consensus in China.

We calculated back then that scaling GPT-3 by 100x would require enormous capital — perhaps several billion dollars. But at that point, we clearly didn't think China would have that much money willing to support a startup.

· LatePost: Some say you started with the metaverse and pivoted to AGI once large models became hot. How much did you actually believe in AGI at the outset?

Junjie Yan: We were founded before ChatGPT. Most companies came after. That's the core difference.

Before ChatGPT, there was no reference point. You had to experiment more. But the core was always technological progress; what was uncertain was product direction.

Our initial vision for AI products was an intelligent agent with simultaneous voice, visual, and text capabilities. We built a version with a 3D avatar, somewhat like a metaverse digital human, but its language, voice, and other capabilities were still driven by large models.

· LatePost: What do you think AGI actually is? If one day AGI is realized, how would we know it has arrived?

Junjie Yan: We had a fuzzy definition then, and it has barely changed: the day people stop thinking of AI as AI, that's probably when it arrives.

Just as when we talk about Douyin today, you don't think of it as a recommendation-system-based content distribution software — you just think Douyin is Douyin.

· LatePost: MiniMax was the first domestic company to advocate for AI 2C. Why?

Junjie Yan: Before deciding to start a company, I kept thinking about what technological progress could generate sufficient societal feedback. What came to mind were electric vehicles, mobile internet. The biggest characteristic of these industries was serving ordinary people. And the prerequisite for serving ordinary people was commercialization — it's a product, not a project.

At the time, the entire AI industry was in difficulty, while industries that had truly succeeded operated differently. The conclusion was almost singular — to build sufficiently productized AI technology and products that serve the masses, rather than projects serving a few large clients.

So I never believed AGI would be like an atomic bomb, some doomsday weapon. It's a product, a service that ordinary people use every day — this is what we've insisted on most.

Moreover, AGI shouldn't be built by one company alone. It requires that company and its users to build it together.

· LatePost: This January you were the first domestically to launch an MoE large model. Other companies mainly iterated dense models last year, as progress was faster and more certain. Was betting on MoE a gamble?

Junjie Yan: Initially I thought we were gambling. Those months, others were advancing quickly on steadier paths, while we were betting on something harder.

We allocated over 80% of compute and R&D resources to MoE, with no Plan B.

· LatePost: MoE development began in summer 2023. Why was it necessary then?

Junjie Yan: First, we knew our basic resources and data. Based on our compute and data, only MoE could complete training at that scale — from the upper limit of what you could train, it had to be MoE.

Second, we already had many users across 2B and 2C products, with models processing massive token volumes daily. We found that if we continued with dense models, the cost and latency of generating tokens would be unacceptable and we would collapse quickly, so MoE was the only option.

Of course this may be industry consensus now: if you're building a trillion-parameter model, you can't do it dense.
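The cost argument can be illustrated with a rough back-of-the-envelope sketch. In an MoE model each token is routed to only a few experts, so the parameters active per token, which drive per-token compute, are far fewer than the total. The configuration below is entirely hypothetical and is not MiniMax's architecture; the estimate of about 2 FLOPs per active parameter per token is a common approximation that ignores router overhead and memory bandwidth.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE layout; all numbers are illustrative assumptions."""
    shared_params: float   # parameters every token uses (attention, embeddings)
    expert_params: float   # parameters in a single expert
    num_experts: int       # experts available in total
    top_k: int             # experts actually selected per token

    def total_params(self) -> float:
        return self.shared_params + self.num_experts * self.expert_params

    def active_params(self) -> float:
        return self.shared_params + self.top_k * self.expert_params


def forward_flops_per_token(active_params: float) -> float:
    # Rough rule of thumb: about 2 FLOPs per active parameter per token.
    return 2.0 * active_params


# Assumed example values, chosen only to make the arithmetic easy to follow.
cfg = MoEConfig(shared_params=10e9, expert_params=10e9, num_experts=8, top_k=2)
dense_equivalent = forward_flops_per_token(cfg.total_params())
moe_cost = forward_flops_per_token(cfg.active_params())

print(f"total params:  {cfg.total_params():.2e}")
print(f"active params: {cfg.active_params():.2e}")
print(f"per-token compute vs. a dense model of equal size: {moe_cost / dense_equivalent:.2f}x")
```

Under these assumed numbers the MoE model stores 90 billion parameters but touches only 30 billion per token, which is the kind of gap that matters when serving very large token volumes.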

· LatePost: How did you finally crack it?

Junjie Yan: The process was painful. We failed twice. We already had much uncertainty; doing something new added more. Challenges were expected.

For example, we'd train a model for half a month and watch certain metrics drift further and further from our initial estimates. It's like launching a rocket you expected to reach 30,000 meters, only to watch it veer off course. You start diagnosing what went wrong, fix the problem, and realize you're still not back in a good state — another failure. But you accumulate experience, pool it together, and try again.

Each attempt costs a lot of money. More importantly, time.

I later realized this wasn't really gambling. Many of the challenges weren't inherent to MoE itself, but stemmed from more fundamental issues: experimental methodology, network architectures, data structures, and so on.

Eventually, solving these problems wasn't about cracking MoE per se, but about identifying past deficiencies and making the entire R&D team more efficient and more scientific.

· LatePost: Someone who's worked with you described you as having a strong engineering mindset — you aim to achieve the best possible outcome within given constraints.

Junjie Yan: It's all calculated. Most decisions at our company are based on optimizing for specific variables. We're essentially solving equations.

· LatePost: Resources — those constraints — are shifting rapidly for every company right now. When you run the numbers, do you tend toward conservatism or risk-taking?

Junjie Yan: We basically always choose the most aggressive option, because you only get good results by pushing things to the extreme.

The technical routes I choose are also the ones with the highest ceiling, with almost no fallback. Our approach to compute is similarly aggressive.

· LatePost: I heard you don't buy GPUs, only rent them.

Junjie Yan: We don't own a single GPU, though we're probably the Chinese startup that actually uses the most GPUs in practice.

Because owning assets distorts your behavior. If I had a lot of GPUs, the commercially optimal thing would be to rent them out to others. I want to keep the company simpler.

· LatePost: Last October you faced a compute crunch. How do you avoid similar risks?

Junjie Yan: Become the biggest customer in the market.

For Chinese startups, the better approach is to think about technology and product simultaneously

· LatePost: Robin Li has said the "dual-engine" model isn't good for startups, yet you prioritized product from day one. How did you decide that?

Junjie Yan: When you first start out, you don't really have the luxury of thinking about such things — you have no technology, no product, no users. The first six or seven months were just about building the most primitive model. Only then did products become possible.

If everything were free, if you had an infinitely capable organization, then having great technology would matter most, because your user base, traffic, and monetization capabilities would already be in place — you could rapidly test many products.

But that's not how it works for startups. Without strong product capabilities to capture value, even technical breakthroughs ultimately don't belong to you. An independently developing startup must consider product.

· LatePost: OpenAI only started building the killer app ChatGPT after creating GPT-3.5. Before that, OpenAI didn't prioritize product as much.

Junjie Yan: That's because OpenAI had order-of-magnitude leads in technology, talent, and data accumulation, which gave it a roughly one-year startup window. I don't think any company in the world will get such a unique window again.

No one will be 10x OpenAI. No one can quickly produce something ten times better than the rest of the world.

This leads to the conclusion that for startups — at least for Chinese startups — the better approach is to think about technology and product simultaneously.

· LatePost: Some investors think you're building product too early. "You can't build Douyin on a BlackBerry."

Junjie Yan: By that logic, you shouldn't be doing technology either, since current technology won't be the technology of five years from now.

But clearly everyone agrees we need to build technology now: only by developing today's technology can you deeply understand it, and only then can you build the technology of three or five years hence.

· LatePost: Technological development is gradual. Is product development too? Products in this era are completely different from the last era's.

Junjie Yan: Product too. Many successful Chinese companies — miHoYo, Meituan, ByteDance, Li Auto — share one common trait: none of them succeeded with their first product. They all succeeded with their second or later product.

That's not my observation; a friend of mine summarized it.

· LatePost: So why not just focus purely on product? There are plenty of open-source large models available now.

Junjie Yan: The core reason is that understanding models is essentially equivalent to understanding products. The deeper you go into product, the deeper your model understanding needs to be.

Another practical reason is cost and response time. Without strong control over the model, it's hard to manage product cost dynamics or tune user response times. And when building products, you'll encounter many problems: which ones can be solved? Which can't? How do you iterate? These all require technical mastery.

One reality: many products last year were built on GPT-4. Why did no one create an experience rivaling ChatGPT?

· LatePost: Among product-focused companies, some concentrate on one main product, but you're doing many simultaneously — Glow, STARFIELD, Hailuo AI, and others. Why build a product portfolio instead of focusing on one or two?

Junjie Yan: OpenAI's products after ChatGPT haven't been that successful either. If even OpenAI fails at product, it shows there's a gap between what product understands about technology and what technology can actually deliver.

The core issue is: even with the best technology and the best product, there will be mismatches.

If you accept this gap, the objective law is: you should try more, fail more, and find what can actually succeed.

· LatePost: That sounds a bit like ByteDance's approach to product.

Junjie Yan: We're not qualified to operate like ByteDance yet.

Every company chooses the form that suits it best. For ByteDance, the most important thing is technical resources, because all its products are ready to go and it has unlimited product resources — so the more attempts, the better. And each investment, even in failed products, brings more experience and knowledge, which is hugely valuable for them.

We're the same. And compared to model R&D investment, product investment represents a smaller share of resources. Based on our company's current situation, you can calculate that this approach has the highest probability of success.

· LatePost: Both technology and product matter — have you agonized over which matters more?

Junjie Yan: We did before, but not anymore.

In the second half of 2022, we were building Glow, and had a very painful experience. The entire team got COVID, which caused a bug in our final year-end release that degraded user conversation experience by about 15%. Our DAU dropped 40% over the three-day New Year's holiday. We finally couldn't take it anymore and found the bug on the last day of the holiday — it was literally one tiny line of algorithm. We fixed it, and user numbers quickly recovered.

The lesson: at this stage, the core source of product value is still your model performance and algorithmic capability.

We've gone through this several times. You can build many product features, but you'll find that almost all major improvements come from model progress itself.

· LatePost: Building large models and this many products simultaneously — what's the biggest challenge?

Junjie Yan: The technology isn't good enough. That's the most fundamental issue. Our iteration speed is already fast, but we still lag behind the world's top models.

Tenfold Scaling Laws

· LatePost: Mistral, a leading European AI company, has already open-sourced an MoE model, and the industry widely believes OpenAI's GPT-4 is also MoE. Will MoE be a competitive battleground in large models this year?

Junjie Yan: MoE is just one piece of the puzzle; there are many other pieces. If something can be written in a single paper, you can basically assume it's not an absolute moat.

· LatePost: In this technology race, what non-consensus judgments does MiniMax hold?

Junjie Yan: If there's any non-consensus in this industry, within 6-9 months it quickly becomes consensus.

There are three things everyone can see now: First, Scaling Laws. Second, achieving the same model precision may require several-fold less compute and capital investment each year, because algorithms and publicly available academic research are increasing, and many people are doing free exploration. Third, focusing on improving data quality currently yields greater returns.

So from these three points — Scaling Laws, declining costs for same-precision models, and the importance of data quality improvement — you can basically derive some of the decisions we and other companies make. I think it's fairly straightforward.

· LatePost: How do you understand Scaling Laws? What possibilities does it reveal to you?

Junjie Yan: Scaling Laws is just a curve. You can believe in the original Scaling Laws, or you can believe in ten-times-faster, even hundred-times-faster Scaling Laws.

The 2020 paper that originally proposed Scaling Laws for large models, "Scaling Laws for Neural Language Models," identified compute, data volume, and parameters as the most important variables affecting model performance, and specified the numerical relationship between them: C≈6ND, where C is training compute, N is the number of parameters, and D is the dataset size in tokens. It argued that factors like model architecture and layer count had less impact on performance.

It was more about providing a methodology: you can predict results of larger experiments through smaller-scale ones. Second, it helps align the industry toward common goals, because this endeavor requires division of labor across data, compute, chips, algorithms, and product — Scaling Laws gives everyone relatively consistent expectations.

As for that paper's formula and some conclusions, they may not hold up now. For instance, it argued that layer count and architecture mattered less, but at least several variables now appear to be important.
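As a worked illustration of the C≈6ND relation quoted above, the sketch below computes the approximate training compute for a given model size and token count, and inverts the relation to see how many tokens a fixed budget affords. The function names and the input values are assumptions chosen for illustration; as the interview notes, the formula itself may not hold exactly for current models.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute C from C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Invert C ≈ 6ND: tokens D affordable for a given budget and model size."""
    return flops_budget / (6.0 * n_params)


# Illustrative inputs: a 175-billion-parameter model trained on 300 billion tokens.
c = training_flops(175e9, 300e9)
print(f"C ≈ {c:.2e} FLOPs")

# With the same budget, a model ten times larger sees ten times fewer tokens.
d = tokens_for_budget(c, 1.75e12)
print(f"D ≈ {d:.2e} tokens")
```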

· LatePost: Such as? What variables might enable tenfold or hundredfold Scaling Laws?

Junjie Yan: Network architecture itself matters, for example. When we started MoE, we assumed good MoE architectures would resemble good dense architectures. We later found this wasn't the case — MoE itself can accelerate Scaling Laws.

Also improving data quality. Also compute allocation: you can allocate compute to training, or to data processing. Different choices can all accelerate Scaling Laws.

· LatePost: The power of Scaling Laws comes from its simplicity. When you introduce more variables, you break it.

Junjie Yan: Improving data quality, optimizing algorithms, and refining training methods — none of these have a ceiling. You keep doing them, and you keep getting better.

The real tradeoff is that each of these accelerates Scaling Laws by a different amount at different stages. But you can use small-scale experiments to predict which variables matter more at which stage. This is still the Scaling Laws methodology at work.

Why does China have to achieve several-fold Scaling Laws? When compute is abundant, you can optimize the original Scaling Laws. When compute is scarce, you have to optimize a several-fold version to reach similar results.

This isn't impossible. Another Silicon Valley AI company, Anthropic, built Claude-3, comparable to GPT-4, in less time. That's essentially amplifying the original Scaling Laws. If one company can do it, there will be a second, a third.
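The methodology described above, using small-scale experiments to predict larger ones, can be sketched in its simplest single-variable form: fit a power law to a few cheap runs and extrapolate. The data here is synthetic, generated from a known curve purely to show the mechanics; the constants, function names, and compute values are all assumptions, and real loss curves are noisier and may not follow a single power law.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, compute):
    """Evaluate the fitted curve at a new compute budget."""
    return a * compute ** (-b)


# Synthetic "small runs" generated from a known curve (a=20, b=0.05).
TRUE_A, TRUE_B = 20.0, 0.05
small_compute = [1e17, 1e18, 1e19, 1e20]
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
big = 1e23  # a budget 1000x beyond the largest small run
print(f"fitted a={a:.3f}, b={b:.4f}")
print(f"extrapolated loss at {big:.0e} FLOPs: {predict_loss(a, b, big):.4f}")
```

In this framing, a "several-fold" Scaling Law would correspond to a change (better data, a better architecture) that shifts the fitted curve so the same loss is reached with several times less compute.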

· LatePost: Long Context is being discussed a lot lately. Will it become a differentiated path in the large model race?

Junjie Yan: A good large model should support long context by default. We've always had it. We haven't emphasized it in our products mainly because of compute costs.

· LatePost: What's the technical approach to achieving even longer context?

Junjie Yan: Standard Transformers previously used non-linear attention. Over the past year or so, many people have been researching linear attention, which enables long context.

The benefit of linear attention is that when text gets very long, its computational complexity grows linearly rather than quadratically. But in practice, at 200,000 or 300,000 tokens, linear and non-linear perform similarly, because quadratic functions approximate linear ones early on. The difference only becomes pronounced at 800,000 to 1 million tokens.

From what I know, Google's Gemini 1.5 was the first model to approach linear attention. With other APIs now, when text gets very long, response times slow to a crawl. But Gemini 1.5 truly achieved 1 million tokens — compared to 500,000, response time only doubles rather than quadrupling.

So long context solves not the 200,000 or 300,000 token problem, but the 1 million and beyond problem.
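The doubling-versus-quadrupling comparison above follows directly from how cost grows with sequence length. The sketch below is an idealized model that assumes cost is proportional to the token count raised to a fixed exponent; it ignores constant factors and the non-attention parts of the model, and the function name is hypothetical.

```python
def relative_cost(n_tokens: float, baseline_tokens: float, exponent: int) -> float:
    """Cost ratio versus a baseline when cost grows as n_tokens ** exponent."""
    return (n_tokens / baseline_tokens) ** exponent


# Going from 500,000 to 1,000,000 tokens under each growth model.
for label, exponent in (("linear", 1), ("quadratic", 2)):
    ratio = relative_cost(1_000_000, 500_000, exponent)
    print(f"{label:>9} attention: {ratio:.0f}x the cost of 500,000 tokens")
```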

· LatePost: 1 million tokens roughly equals 1 million Chinese characters. How many people need that?

Junjie Yan: User demand and the capabilities you provide co-evolve. Put a model here that far exceeds expectations, and gradually it will create demand from many people.

Before ChatGPT had voice calls, no one would have said their need was voice calling. But once it was released, many people used it.

Our voice conversation product — Hailuo AI's call feature — is also very popular. My grandfather is 80. The first time he used this product, he discussed historical figures with it for 40 or 50 minutes. I never would have imagined someone using it that way.

· LatePost: It seems you prioritized voice and other multimodal capabilities in your products rather than long context. How do you decide which technical capabilities to optimize first?

Junjie Yan: We have a phrase: Intelligence with everyone. We don't own this technology. That's our core belief.

AI was very hot last year, but probably only 100 to 200 million people worldwide have used AI products, with only tens of millions being heavy users. Because asking a good question and following up continuously has a very high barrier. The people truly willing to type are probably just the people in this room. More people are still accustomed to voice.

We value multimodality because it allows more people to use AI, including the elderly and children. When we added images and voice to our products, we could clearly observe changes in user onboarding barriers, even penetration rates. The exact same thing happened once before in mobile internet, from Toutiao to Douyin.

The Closer to the Endgame, the Higher the Value of Users

· LatePost: Your first product, Glow, let users chat with customized AI characters, similar to otome games (romantic roleplay). It was very popular in anime circles. How did you come up with this direction?

Junjie Yan: When we were doing cold start for the product, we deliberately sought out young demographics — AI enthusiasts, anime fans — and iterated the first several versions based on their experiences and feedback.

After we gained traction, we watched social media every day to see how users were using it. In our early product days, we didn't do A/B testing. We observed users, read user feedback, then verified with data and iterated.

· LatePost: What pitfalls have you hit in building products?

Junjie Yan: Early on we worked on intelligent agents. Our vision was that it would simultaneously have voice, visual, and text capabilities. That's why the company built three models from day one — language, voice, and vision.

We quickly abandoned 3D avatars because they don't scale. The only major industries previously using 3D were gaming and film, with development cycles measured in years. At the same time, I realized using deep learning for 3D was the wrong approach.

On the current platform — mobile phones — if a 3D person is constantly staring at you, that's inherently strange. Most of the time, interaction doesn't actually need a real visual presence.

· LatePost: Did you see this from certain data after launch?

Junjie Yan: Not data. When we made the first version of the avatar, we found two models to film. The moment we put the 3D figure on a phone, we knew this was wrong.

· LatePost: You hired a product manager before your first model was even built. How did you describe to him what kind of product you wanted?

Junjie Yan: I didn't know.

· LatePost: You didn't know?

Junjie Yan: It was unclear at the time because there was no reference. We just imagined an intelligent agent that could have free, long conversations with you. Its essence was information exchange and processing.

What we could be sure of was that the model's most important purpose was to serve the masses, so it would definitely be a product. That's why we found a product manager very early on.

· LatePost: Users have many demands. Which do you satisfy and which don't you?

Junjie Yan: Our tradeoffs became simpler over time. We look at whether a demand aligns with technology development trends, and whether it can bring 10x or greater change to that user group's experience.

· LatePost: In terms of product taste, what do you think makes a good product? Your current products have many features — somewhat complex.

Junjie Yan: Honestly, we haven't made it yet, so I don't have an answer.

When you ask whether products should be complex or simple, most people will definitely say simple. But I'm somewhat skeptical of this, especially in the early stages of an industry. Look at Tencent — before it made WeChat, it first made QQ, and QQ was a very complex product.

ChatGPT has roughly 30 million DAU and seems to have trouble growing beyond that. My conclusion is that a relatively simple AGI product, at the current technological stage, probably has a ceiling around there. But ultimately, I believe there will be very simple interaction forms that satisfy broader needs.

· LatePost: What inspiration did Sora (OpenAI's text-to-video large model) give you?

Junjie Yan: If Sora's response speed becomes very fast in the future — generating a 1-minute video not in 20 minutes as it is now, but in real time — that would be a huge change.

Then would it be a better video generation tool, or a better video generation community?

· LatePost: A video generation community — isn't the next step a super content platform?

Junjie Yan: Anything is possible. It depends on whether you believe the space is large enough, and whether you believe response times can become low enough.

· LatePost: What do you think might be the AI product with the largest user base in the future?

Junjie Yan: We've only made products with millions of DAU. We haven't made tens of millions or billions. Honestly, I don't know. I think it might still be information exchange and processing. Its value is enormous.

· LatePost: MiniMax's product DAU is already close to Character.AI (an American AI unicorn's app for chatting and interacting with various AI characters), and time spent is even longer. But some people question whether your good numbers come from good technology or from soft porn.

Junjie Yan: We've analyzed this. What truly makes users stay is absolutely not so-called soft porn. Take our product STARFIELD — its core is providing users a platform to exercise creativity and imagination.

We've spent tremendous time and energy ensuring content is more positive, continuously improving platform safety capabilities.

· LatePost: How much can technical improvement boost a product? You used MiniMax's self-developed MoE model on STARFIELD. How were the results?

Junjie Yan: Message volume increased 40% on launch day. Responses became faster — previously 4 seconds, now 1 second. This wasn't just because of MoE, but also other inference optimizations.

· LatePost: Is faster technical improvement and larger user base a causal relationship?

Junjie Yan: This is very tricky. If you're the industry leader, OpenAI, then it's probably causal. If you're not the leader, then it's not.

Over the past year, many Chinese large model companies haven't had many users, yet their technology still improves — because you can progress just by learning from the leader. But long-term, if you believe your model can approach the best models, then the weight and value of users becomes increasingly higher.

This is like compute. Does having more compute let you build better models? Not necessarily — improving data quality might have higher ROI. But long-term, with more compute, you can definitely build better models. So it depends on the time horizon.

· LatePost: How do you think AI-native super products will differ from mobile internet-era super products?

Junjie Yan: When building mobile internet products, people cared deeply about whether they'd uncovered a user pain point. But last year, the six or seven AI-native products that surpassed a million DAU weren't designed around pain points. They released breakthrough technology and gradually became products. Conversely, when they later tried targeted feature design, such as ChatGPT Plugins and GPTs, it wasn't very successful. If technology progress slows, it will become product-driven again.

The current product approach is still technology-driven, not product-driven.

· LatePost: Your product features are already quite granular now. For example, Hailuo AI frequently sends push notifications to attract users to open the app. You've actually done quite a lot of product optimization?

Junjie Yan: Recently we've also been reflecting — having too comprehensive a set of product features might be a somewhat negative signal, indicating you haven't spent the most energy on your core function.

· LatePost: What goals have you set for the team this year?

Junjie Yan: Technically, how to reach GPT-4 level. On the product side, how to 10x our user base, with a single product breaking through 10 million DAU.

· LatePost: 10x growth — that's massive.

Junjie Yan: Actually it's not that big. Mobile internet products are measured in hundreds of millions of DAU.

You Can't Kill Competitors With Funding Alone

· LatePost: How many AGI startups do you think China's current market capital and resources can support?

Junjie Yan: It won't be just one. The total resource pool is sufficient.

· LatePost: Many investors have already stopped looking at large language models. They believe startups have no chance competing in foundation models.

Junjie Yan: I lived through the previous AI development phase that was built on stacking up funding rounds. If a company needs to keep raising money to survive, then what it really optimizes for is how to convince investors to give it more money.

My own preferred path is to gradually serve users and build a reasonable level of commercialization. Of course, with massive R&D investment, this is hard to achieve in the short term, but I believe we should explore this path.

· LatePost: When total market resources are limited, shouldn't the number one player try to raise the most money and starve everyone else? That's how much of last generation's mobile internet competition played out.

Junjie Yan: The idea that you raise money frantically so others can't — I think that's wrong. You can't kill competitors with funding alone.

Because among the leading Chinese startups, no one has an order of magnitude more resources than anyone else. Inflection points can only come from leads in technology, product, or commercialization efficiency.

· LatePost: Then what about compute? Compute resources are scarce too.

Junjie Yan: China has more compute now than before. Also, this goes back to Scaling Laws. When compute is insufficient, you need to find ways to make your Scaling Laws several times more efficient to achieve similar results.

· LatePost: How do you assess the gap between you and OpenAI?

Junjie Yan: We have our own metric we call "out-of-the-box usability rate" — whether a customer or developer can quickly complete a complex need after plugging into a large model API.

From our own open platform's perspective, GPT-4 can handle almost any request. For example, last year we encountered a need where a user provided a novel and asked the model to generate a multi-character audio drama with expressive intonation.

Very careful use of GPT-4 could do this. Our own model couldn't at the time, but now it can.

· LatePost: What about the gap between you and your Chinese peers?

Junjie Yan: We haven't tested against all of them. Because testing or not testing won't change what we do.

· LatePost: What will happen in China's large model industry in 2024?

Junjie Yan: Chinese companies will produce something comparable to GPT-4, and more than one will do it. But the more important question to think about is: what comes after that?

Treating the Company as a Function

· LatePost: You just said what's written in papers isn't a moat. Then what is the real moat in this field?

Junjie Yan: It's quite remarkable when you think about it — Pinduoduo started as Pinhaohuo, Meituan started as group buying, ByteDance started with Toutiao. None of these were the products that ultimately made them great.

The difference between moderate success and great success is that the great companies all made organizational innovations, which enabled them to keep producing increasingly powerful things.

· LatePost: Isn't the moat the people who wrote the papers?

Junjie Yan: I'll say something pretty scary: among the top 20, even top 50 people who've contributed to this field of large models, probably not a single one works at a Chinese company.

The genius path doesn't work for us right now. The only viable approach is to gather people with sufficiently strong fundamentals, build a good growth-oriented organization, continuously break through challenges together, and help everyone grow rapidly. I hope that in three years, the top 20 or top 50 contributors to this field will come from Chinese companies.

· LatePost: How do you want to build this organization?

Junjie Yan: I think of it as optimizing a function. This function has no closed-form solution, so the essence is finding the direction of steepest descent.

· LatePost: For example? How do you find the direction of steepest descent?

Junjie Yan: For accelerating technical progress, it's learning from OpenAI, because that's the most certain path.

Not in terms of making model parameters identical to theirs, but learning how to make experimental methodology more scientific; how to fail faster and iterate more efficiently; how to define problems more clearly and concisely.

· LatePost: Pursuing gradient descent can trap you in local optima that miss long-term targets. How do you avoid that?

Junjie Yan: Our own evolution has been from looking at data very vaguely, to looking very deeply at data, to realizing that data alone isn't enough — you need to add better insight.

Much insight actually comes from long-term thinking. For example, if you only looked at short-term product data, you wouldn't realize you need to build a new multimodal model.
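The local-optimum concern raised in the question above can be made concrete with a minimal sketch. The toy objective `f`, the learning rate, and the starting points are all hypothetical choices for illustration and have nothing to do with how MiniMax actually runs its organization or its models. The sketch only shows that plain gradient descent ends up in whichever valley is nearest to where it starts.

```python
def f(x):
    # Hypothetical non-convex objective with two valleys of different depth.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    # Repeatedly step against the gradient, starting from x0.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


# Starting on the right, descent settles in the shallower valley.
local_min = gradient_descent(1.5)
# Starting on the left, it reaches the deeper valley.
global_min = gradient_descent(-1.5)

print(f"start  1.5 -> x = {local_min:.4f}, f(x) = {f(local_min):.4f}")
print(f"start -1.5 -> x = {global_min:.4f}, f(x) = {f(global_min):.4f}")
```

Both runs follow the same rule, yet only one finds the deeper valley. Picking a better starting point is the role the answer above assigns to long-term insight.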

· LatePost: But can function-optimization methods handle human problems? Like tension between technical and product teams.

Junjie Yan: When designing experiments or products, we make data instrumentation finer-grained, and use these data points to infer the real problems as much as possible, rather than relying on my subjective judgment or anyone else's.

We believe in data science. We didn't invent these methods — Chinese internet companies have already implemented them extremely well.

· LatePost: You previously said you wanted the organization to stay light, but you're already at 300 people, most hired in the past year.

Junjie Yan: It's actually still very simple — only three layers in the organizational structure: me, my direct reports, and their direct reports.

You could say we have only three departments: a technology department that I lead; a product department split between C-end products and the open platform, each with one head; and an operations and growth department that handles both product growth and company growth, with HR also under it, all under one overall head.

· LatePost: Your peers — Zhipu AI has roughly 1,000 people, Moonshot AI has roughly 200, you're at 300. What's behind these differences in headcount?

Junjie Yan: It depends on what you believe in. We don't need to prove anything to anyone else; we just believe in what we're doing. We simply don't create positions we don't need. We hire for whatever we need to do.

But we do want to build frontend products at a certain scale, so beyond algorithm and applied data talent, we also need people for inference systems, online services, development, and product operations.

· LatePost: What talent do you most lack at this stage?

Junjie Yan: More algorithm talent. We now know how to run experiments, and our resources allow us to run many experiments, but we don't have enough people to run them.

This year, video generation models will become practically useful. Based on last year's pattern, the first product to market has a better shot at major success. Many companies are now racing to be first.

· LatePost: How do you identify people who fit you?

Junjie Yan: People whose joining raises the team's overall output. But this can only be verified after the fact: some very strong people can't actually integrate into the team, while some who seem less strong make the team's overall output stronger.

So in interviews, I focus on how they've collaborated with people around them on important projects — with mentors, with upstream and downstream colleagues.

· LatePost: You managed a large technical team at SenseTime. What have you learned about managing technical talent?

Junjie Yan: When you start wanting to do management, you may already be going off track.

What matters most is how to get everyone to build stronger things together, exceeding user expectations and the team's own expectations. AI may be a trendy industry right now, but it's not that magical. At minimum it's a science, so approach it scientifically: first, keep the overall talent level high; second, give the organization a data-science-like methodology for quickly identifying what works.

Put these two together, and that's what we really need to do.

· LatePost: How do you attract stronger people to join you?

Junjie Yan: Fundamentally, the organization needs to be strong and capable of continuously doing good things. This is the only path we can find.

· LatePost: What culture do you hope the company develops?

Junjie Yan: First, no shortcuts — we've taken shortcuts many times and gotten badly burned each time. Second, User-in-the-Loop. Third, technology-driven.

These are all distilled from our previous experiences and lessons.

I Seem to Be Slowly Becoming a Set of Basis Functions

· LatePost: SenseTime was your first job. What mark did it leave on you?

Junjie Yan: Mainly, I think, confidence in the technical path of concentrating resources to accomplish big things.

Some feedback was also searing, which is why I want MiniMax's organization to stay simple enough. In an organization, when people feel something is wrong but don't say it directly — that's hugely damaging to everyone.

· LatePost: AGI was still non-consensus back then. How did you realize it was the direction?

Junjie Yan: It actually came from a moment of accidental reflection. In 2020 I was still leading a technical team at SenseTime. One day I suddenly realized I could no longer keep up with all the daily papers in AI. That struck me deeply.

As a technical person, the daily technical progress had already exceeded my comprehension. Human evolution is very slow. The only way is to have better artificial intelligence to help technology develop, or to accelerate human research speed.

I had another observation then: pre-2020 AI, including much of what I did at SenseTime, wasn't creating that much value for society.

This created a huge contradiction: on the one hand, you believe AI has long-term value for society and that only it can make human technological progress faster; on the other hand, much of what you were doing wasn't directly contributing to that.

Was it due to insufficient attention? Clearly not — society's attention on AI and capital invested were enormous. Considering all this, the only possibility was that our technical path was wrong, or the problems we focused on weren't what AI should truly be solving.

· LatePost: Many in the previous generation of AI practitioners actually recognized this contradiction, but no one could find a way out.

Junjie Yan: OpenAI's release of CLIP in early 2021 was very important to me. That's when I started to realize there's no essential difference between natural language and computer vision — they're one unified machine learning system. I saw the technical possibility of more general AI emerging.

When this happens, if you truly believe in AI, you should go do something about it.

· LatePost: How do you learn?

Junjie Yan: First, meeting people stronger than myself. This may be one of the few short-term satisfactions entrepreneurship can bring me. I've been fortunate to meet some truly top-tier people who gave me new perspectives. When you think from a higher level, many things actually become less difficult. Second, I read a lot of papers.

· LatePost: You said to avoid being comprehensively excellent in product. Are you yourself comprehensively excellent? Your promotion trajectory at SenseTime was very fast — you started in R&D and became a group vice president, as if you could handle any function.

Junjie Yan: I don't think I'm comprehensively excellent. The fact that I could do many different jobs in the past probably has to do with my upbringing. I was born in a small county in Henan province. There was no one around to teach me much of anything, so I had to figure things out on my own. That developed into an ability to independently understand how things work. I didn't want to be this way — I was forced into it.

But looking back today, that ability has proven incredibly useful. When I take on something I've never done before, I can quickly identify the underlying logic.

· LatePost: What do you see as your weaknesses?

Junjie Yan: Although I've done some technical work, I'm not a top-tier researcher. Maybe just second-rate.

· LatePost: That's hardly fair — your papers have nearly 30,000 citations on Google Scholar.

Junjie Yan: The top person in the world probably has 300,000.

· LatePost: You said to treat the company as a function. What kind of function are you?

Junjie Yan: (thinks for a long time) Back in school, when I learned about Taylor series expansion, I saw how something complex could be approximated by combining simple functions.

In other words, you can use a set of basis functions to approximate any function. I feel like I've gradually become a set of basis functions myself — combining in different weights to take different forms as needed.
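The Taylor-series idea mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the exponential function as the target and the monomials 1, x, x^2, ... as the basis; the helper name `taylor_exp` is hypothetical. Each basis function is weighted by 1/k!, and adding more terms tightens the approximation.

```python
import math


def taylor_exp(x, n_terms):
    # Weighted sum of the basis functions 1, x, x^2, ..., x^(n_terms - 1),
    # using the Taylor coefficients 1/k! of exp around 0.
    return sum(x**k / math.factorial(k) for k in range(n_terms))


# Compare the approximation of e = exp(1) as more basis functions are added.
for n in (2, 4, 10):
    approx = taylor_exp(1.0, n)
    print(f"{n:2d} terms: {approx:.8f}  (error {abs(approx - math.e):.2e})")
```

Changing the weights while keeping the same basis yields a different function, which is the sense in which one set of basis functions can take different forms as needed.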

· LatePost: We've talked for so long without touching on changing the world, changing humanity.

Junjie Yan: The things you truly want to do shouldn't be talked about every day.

· LatePost: Can you talk about it today?

Junjie Yan: It's still "Intelligence with everyone." This phrase has two meanings: first, we want to serve every person with the best technology; second, in our journey toward AGI, we need to iterate and grow together with our users.

And I've seen technological progress moving faster than I imagined.

Qianming He also contributed to this article.

Cover image source: Ford v Ferrari