A Conversation with Cao Yue of Sand.ai: Farther from Sora, Closer to the Endgame

暗涌Waves·April 23, 2025

An adventure that tried to break free from inertia.

By Yu Lili

In the history of China's large language models, "Lightyear Beyond" is undoubtedly the most dramatic episode.

It was the earliest "Chinese OpenAI" in the public imagination. It began in February 2023 with Wang Huiwen's heroic call to arms, and ended, after Wang fell ill, in a hasty exit. The whole affair lasted barely four months.

Of the two co-founders Wang had recruited, "one had an infrastructure background, the other an algorithms background." The former was Yuan Jinhui, who later founded SiliconFlow. The latter was Cao Yue, who founded Sand.ai (Sandai Technology).

Few expected such a turbulent opening. In its wake, Cao compared himself to "a tree in winter": "Nothing seemed to change on the surface, but roots were spreading underground." A few months later, in January 2024, he founded Sand.ai.

2024 was a year of clamor in video generation: Sora detonated the field early in the year, and Kling made a stunning debut mid-year. ByteDance, several leading startups, and even large-model companies like MiniMax all entered the fray with noteworthy products.

Sand.ai, however, remained silent — apart from a July mention by venture capital "queen" Kathy Xu as evidence that she hadn't exited the primary market.

From this perspective, Sand.ai was undeniably the latecomer.

But in Cao Yue's view, their lateness stemmed from choosing a harder, more fundamental path — one fundamentally different from Sora's.

On April 21, Sand.ai officially launched Magi-1. Unlike Sora, which chose the DiT (Diffusion Transformer) route, Sand.ai integrated diffusion models with the higher-ceiling autoregressive (AR) approach.

Grounded in first-principles thinking about video data, Cao believes that in video generation — where technical approaches have yet to converge — autoregressive (temporal autoregression) may be the solution closer to the endgame. He stated: "Magi-1 is our first attempt to bring autoregressive video generation back into mainstream view. This will be an interesting beginning."

In this direction, he believes Sand.ai is not late at all — but rather, first to arrive.

Earlier in his career, Cao worked at MSRA's (Microsoft Research Asia) Visual Computing Group, a star-studded incubator that produced Sun Jian, He Kaiming, Cao Xudong, Zhang Xiangyu, Ren Shaoqing, Dai Jifeng, and others.

In 2021, as one of the four co-first authors, he published Swin Transformer, which won that year's ICCV Best Paper Award (the Marr Prize). Afterwards, drawn by thinking about organizational forms closer to a "Chinese OpenAI," he joined the Beijing Academy of Artificial Intelligence (BAAI).

Then came the tidal wave of ChatGPT. This post-90s researcher, who before age 30 had never considered entrepreneurship and idolized Ilya, was suddenly plunged into the real commercial jungle.

Recently, we sat down with this young founder to talk about his adventure.

A video generated by Magi-1.

The conversation follows:

Part 1: A Different Story from Sora

"Waves" (暗涌): After Lightyear Beyond was acquired by Meituan in late June 2023, you largely disappeared from public view. What were you doing at that time?

Cao Yue: I left around August or September. I felt I needed time to think and figure out what to do next. A friend described me then as very much like a tree in winter. Nothing seemed to change on the surface, but the roots underground were growing furiously. Overall, I really enjoyed that process.

"Waves": Why did you choose to start your own company after leaving Lightyear?

Cao Yue: Pursuing extreme personal growth is a very fundamental drive for me. Entrepreneurship felt like such an option. For a long time, I didn't know exactly what I wanted to do, but I was very clear about what I didn't want.

For example, I didn't want to stay in a stable system where the path ahead was visible. Because in that kind of track, there wasn't the growth I wanted.

Even though entrepreneurship is incredibly tough, most founders gain far more refinement and growth than ordinary people. My time at Lightyear also deepened my understanding of entrepreneurship. I realized this was the career and state I'd been pursuing.

"Waves": In February 2024, Sora released its demo. Did that influence your eventual decision to enter video generation?

Cao Yue: Not directly. I decided on the video direction around November 2023. Before that, I'd looked at many directions, including companionship, agents, coding, and so on.

I ultimately chose video generation because it's a direction with both very high technical ceiling and very high commercial ceiling. A long slope, thick snow. If you think from the end backwards, AGI also cannot do without compression of video data.

"Waves": But in the video generation track, compared to companies that entered earlier and have already bombarded the market with products, this timing makes you a latecomer.

Cao Yue: It's actually hard to put everyone in the same race. I think in the direction of using autoregressive technology to compress video data, we are the pioneers. This path is not equivalent to video generation; video generation is just its first clear application scenario.

"Waves": At that time, more companies were probably thinking about replicating Sora, while you told a completely different story.

Cao Yue: First, Sora's current form may not even be OpenAI's goal in this direction — it might even be a smokescreen.

Thinking from first principles of technology, Sora's technical route has clear problems, and its ceiling likely isn't high — not sufficiently scalable.

When the upper limit of Sora's generation quality isn't high enough, its contribution to AGI itself is limited. When the company was first founded, we had intensive discussions to find a solution closer to the endgame.

"Waves": Did you find an answer?

Cao Yue: We believe video generation needs AR (autoregression). From our perspective, this is a solution closer to the endgame.

Magi-1, which we just released, is a beginning — the first milestone of the AR route. It proves that the AR approach is feasible, and its effects reach first-tier levels among video generation models on the market.

For the entire community, the AR route has produced a model completely competitive with pure diffusion routes. I think that's quite interesting.

"Waves": Why are you convinced that the autoregressive route is the solution closer to the endgame?

Cao Yue: This is our intuition about the technical direction. We believe video is ultimately causal in the temporal dimension. Like language models, you can only read text sequentially, from top-left to bottom-right — no one reads backwards. Video is the same. Many physical laws are essentially functions that change over time.

But Sora doesn't have these settings. In early Sora or Sora-like solutions, when a person walks, you often get cases like left leg, left leg, right leg, right leg — instead of: when the left leg was forward in the previous second, the right leg should come next. This is because during training, the model only learned temporal correlation, not temporal causality.

Temporal causality is one dimension. I also believe the autoregressive route is more scalable.
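The correlation-versus-causality point above can be made concrete with an attention mask. A bidirectional model lets every frame attend to every other frame, so it learns only temporal correlation; a temporally autoregressive model restricts each chunk of frames to attend only to itself and earlier chunks. A minimal NumPy sketch of the idea — the chunk size and function name here are illustrative assumptions, not Magi-1's actual implementation:

```python
import numpy as np

def block_causal_mask(num_frames: int, chunk: int) -> np.ndarray:
    """Boolean attention mask over frames: frame i may attend to
    frame j only if j's chunk is no later than i's chunk."""
    idx = np.arange(num_frames) // chunk      # chunk index of each frame
    return idx[:, None] >= idx[None, :]       # True = attention allowed

# DiT-style bidirectional attention: every frame attends to every frame,
# so training captures temporal correlation but not temporal causality.
bidirectional = np.ones((6, 6), dtype=bool)

# AR-style block-causal attention (chunks of 2 frames): no frame attends
# to a future chunk, enforcing causality in the time dimension.
causal = block_causal_mask(6, chunk=2)
print(causal.astype(int))
```

Under a mask like `causal`, the state of "the left leg was forward in the previous chunk" can condition what comes next, which is exactly the structure a purely bidirectional mask does not impose.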

"Waves": How does the difference in temporal causality directly translate to different user experiences?

Cao Yue: At the most fundamental level of model capability, the difference is open-ended; it can only be verified a posteriori.

From the product perspective, it unlocks some natural product characteristics. For example, breaking free from duration limitations, and enabling more granular control in the time dimension — Sora's storyboard tries to do this, but with poor results because of the model's inherent limitations.

"Waves": However, at that time it was already something of an industry consensus that the DiT architecture had a relatively low ceiling and that the AR architecture had the potential to break through it.

Cao Yue: At that time, there was rough consensus on the importance of this direction, but no public, widely recognized solution. And when to do it, how to do it, and how far to explore it — these make enormous differences.

Some companies might treat making Sora as step 1 and AR as step 2, but directly incorporating the autoregressive route into the video generation task — we were probably among the earliest to do that. In fact, we've already waded through many parts that other companies haven't had time to do yet.

"Waves": Looking abroad, as early as December 2023, Google Research launched a video generation model based on an autoregressive architecture. Closer to home, OpenAI's GPT-4o, DeepSeek's released multimodal models, and ByteDance's published papers all mention autoregressive architectures or models. How are these different from what you're building?

Cao Yue: All of these involve autoregression, but at very different granularities. Take 4o — to me, what's more crucial is that it turns image generation and language models into one model, unifying so-called multimodal generation and understanding.

In autoregressive architecture, video generation requires causality in the time dimension. This concept may have been thought about or mentioned by many, but actually putting it into practice, at relatively large scale, making it work, and making it work well — we haven't seen that yet.

"Waves": So this wasn't an easy decision to make?

Cao Yue: Few people and organizations can make independent judgments about technical routes — OpenAI has already handed you the homework, yet you choose another path.

Often, following is easy because you can easily convince others and the whole organization. Not following is much harder. Everyone needs to realign, build shared understanding, and because it presents more challenges, it will also be much slower.

Part 2: Identifying the True First Curve

"Waves": What do you think of Kling 2.0's release? It's said to have systematically studied the scaling law characteristics of video generation's DiT architecture. And when Kling first launched in June last year, did it shock you?

Cao Yue: Studying and understanding scaling laws is the first step. The key is still which path — DiT or AR — is more scalable. We're still in early days, with a long road ahead.

At that time, Kling came out so quickly with quite good results — that was surprising, since other companies basically achieved that around September or later.

"Waves": What do you think might be the secret to its speed and efficiency?

Cao Yue: There are many factors; it's hard to make direct attribution.

Broadly speaking, they had accumulated capabilities in both engineering and organization. Processing video data is quite different from language models — the storage and processing overheads are much larger, so a startup really needs engineering support. Organizationally, at that point in time, they were also very united, very much like a startup.

"Waves": In your view, what are the differences in approach between Kuaishou's Kling and ByteDance's Jimeng?

Cao Yue: From the product side, Kling seems more focused on being a good tool — model as product — while Jimeng wants to build a new type of content platform, or an AI video version of Douyin.

"Waves": Judging by overall funding activity, this track has cooled considerably. Not only have the major platforms secured more favorable positions, but some large-model startups are also pushing in this direction. Do you think startups still have a chance to break through?

Cao Yue: Long-term, whether in technical ceiling or commercial ceiling, I absolutely believe this direction can support a good startup. Midjourney previously achieved $200 million to $500 million in annual revenue, with very high gross margins. I think the entire market can absolutely reach that level this year, or even higher.

If the track seems depressed, it's because model quality isn't yet good enough to sit in the first tier. From another perspective, if a startup reached Kling's position, would it lack money?

Also, for startups, I think the main thread is very important. Every decision determines what kind of company you want to become. Founders must first identify what the true first curve is, then go deep enough in that direction, iterate continuously. You have to ask whether so-called second curves, third curves exist because the first curve was never stable to begin with.

In this era, if you don't have sufficient judgment about technology itself, don't know which direction to develop, it's hard to feel secure.

"Waves": How's your sense of security? What do you rely on to make strategic choices?

Cao Yue: I'd say I'm strategically optimistic, tactically pessimistic. Strategic choices often depend on the team and CEO's thinking about their own endowments and long-term chances of winning. For example, are you fundamentally a company that can make great models, or one that's more agile at productization and commercialization?

My background determines that I'm better at model and algorithm-related problems. Our team was also built from the start as a company relatively close to the model side. These give us a relatively short decision chain on technical matters — this is our strength.

"Waves": And the weakness?

Cao Yue: I'm a first-time founder. I haven't done products, haven't done commercialization — this part definitely requires time to learn.

"Waves": So you tend toward a strategy of amplifying your strengths.

Cao Yue: Looking at DeepSeek, you could say it's sometimes strategically conservative. I think the key is that in execution, it has very full and accurate understanding of its own capabilities — this is most critical.

You should grasp what's most important to you. At the current stage, I believe technology is still most important.

"Waves": But not all startups have DeepSeek's real-world conditions. Many startups can't focus precisely because of more direct commercialization pressures.

Cao Yue: Our product was just released; we'll consider commercialization later. Probably consider going overseas to some high-value regions first. Overall, we'll still do more model-as-product rather than heavy productization. Model products can actually be used in many scenarios.

Also, unlike large language models, video generation is relatively close to commercialization. Runway, Kling, and Hailuo AI all have very high revenue.

The reason is simple: human productivity in video is just terrible. Even though current quality isn't good enough, hallucinations are quite serious, and many complex motions are still impossible — even in this state, it already satisfies many scenarios.

For example, even if you have to generate 20 times to get one good result, that's still far more convenient than traditional filming, where you need to build sets, prepare props, and fly in and accommodate actors. And video's penetration rate is an order of magnitude higher than text's. If its production cost and cycle can come down, the value created for the entire market is enormous.

"Waves": Both Li Yanhong and Allen Zhu have previously expressed varying degrees of pessimism about the video generation direction.

Cao Yue: I'm not sure how others see it, but I think if you think toward AGI, video may be another critical data type beyond language. It's more accessible, sufficiently rich, sufficiently diverse, and the information itself is relatively self-consistent.

It may be some kind of connection between the virtual world and the real world, while language leans more toward the virtual world.

Moreover, language models have already developed to a relatively late stage; video is still in a relatively early stage. I think future true video foundation models, or long-term "so-called world models," will emerge in this direction.

Part 3: Some Philosophies of Doing Things

"Waves": In this fragile era, why did you name the company Sand.ai?

Cao Yue: The main element in sand is silicon. We carbon-based humans are now essentially working on silicon-based intelligence, and in science fiction, silicon-based life forms feed on sand.

"Waves": In 2023, what was the catalyst for Wang Huiwen pulling you into entrepreneurship? What did you mainly do during those four months?

Cao Yue: Everyone was attracted by the grand vision of AGI. During my months at Lightyear, I mainly recruited people.

"Waves": What was your direct impression of him?

Cao Yue: This person is really strong. He can articulate very sharp perspectives, and knows in what scenarios they'll work. When he wants to present a point, he can give you a 10-minute, 40-minute, or even 1-hour or 4-hour version depending on your situation and comprehension level at that moment.

Often, the points he tells you require repeated pondering, chewing over, understanding in different states. You feel that many methodologies truly come from practice, not from books.

"Waves": For example?

Cao Yue (thinking for a long time): Around 2021, I had been studying OpenAI and DeepMind for a long time, and when meeting people I would often ask why China hadn't birthed such organizations.

The first time I met Wang, I asked him too. His perspective was: because China wasn't rich enough before.

"Waves": Do you agree with this answer? For some highly profitable major platforms, this argument doesn't quite hold.

Cao Yue: Before 2024, compared to Silicon Valley, I think this statement held true.

Of course, that may be only one side of it; the other side may be that we got rich too fast. The psychological state hasn't caught up yet, hence the saying that it "takes a generation."

But from the beginning of this year, after DeepSeek and several other companies emerged, there have clearly been very different signals. As Liang Wenfeng said, we just need some facts and a process. We're already in that process.

"Waves": Why were you particularly curious about OpenAI and DeepMind at that time?

Cao Yue: I was somewhat lost myself then. Though we had done some work, there was clearly an essential gap compared to more impressive things.

In 2020, overseas labs produced AlphaFold 2 and GPT-3, along with seemingly less influential works like DALL-E and CLIP. When I carefully studied OpenAI's earlier work, I felt that these people's way of working, way of thinking, and organizational form were quite different from ours.

Domestically, it was still generally paper-driven, while they were some kind of more organized research. They didn't pursue whether methods were novel, but pursued solving essential, important, influential problems. When I discovered this gap, I couldn't stay at MSRA any longer.

"Waves": Was this why you decided to join BAAI in 2022?

Cao Yue: Yes. Most Chinese institutions' organizational forms couldn't escape being paper-driven. And if what you pursue is publishing papers, a niche problem may be easier to publish on than a widely watched one.

MSRA was already one step further than most organizations, shifting from paper-driven to impact-driven, and BAAI was one step further than MSRA.

When I arrived, BAAI had just undergone some changes. Early on, it supported university faculty doing organized research in-house; later, it recruited its own researchers and built a value system for doing more influential, more exploratory work in an organized way.

Because it's a non-profit institution, you could eventually open-source the results. At that point in time, BAAI was the organizational form with the most chance of approaching OpenAI.

"Waves": Has doing influential things always been attractive to you?

Cao Yue: I joined MSRA's Visual Computing Group in 2018. This group is probably the top domestic group doing deep learning. It previously gathered people like Sun Jian, He Kaiming, Zhang Xiangyu, Ren Shaoqing, Cao Xudong, Dai Jifeng, and others.

Though I didn't fully overlap with all of them — some I met only later — I eventually concluded that we were all doing research with broadly similar methodologies, and that a philosophy of how to do things was passed down.

"Waves": For example?

Cao Yue: My later summary is: do the most essential, most critical, most widely concerned problems. The most important problems are essentially like the 100-meter dash in the Olympics — you need to hone yourself to approach some kind of limit, build deep and fundamental understanding of the problem, make experiments sufficiently solid, sufficiently fine-grained, to have a chance at a little progress on truly the hardest problems. And important problems are like a kind of leverage — once you have genuine progress in them, they produce enormous influence.

"Waves": Before 2023, had you ever imagined being this close to business? Back then, your idol was a scientist, not an entrepreneur.

Cao Yue: It was Ilya. This is easy to understand — he was very close to the direction I worked in. In deep learning, he's also one of the few who can connect the important nodes of the entire AI era, and has produced real, enormous value for this world.

"Waves": Is this also the meaning of your entrepreneurship this time?

Cao Yue: In recent years, one of my biggest realizations is that life is quite short. For me, business is a form of leverage; the goal is still to produce some interesting value for this world.



