Back to Reality! Sora Breaks Through — 4 Key Questions We're Watching | Yunqi FutureScope

Yunqi Capital (云启资本) · February 22, 2024

Creating Virtual Worlds, Returning to Real-World Problems

The image above is from Sora's official landing page — countless paper airplanes soaring freely into the depths of a forest, embarking on new explorations. In Japanese, Sora means "sky," with a further connotation of "freedom" and "boundless creativity."

Sora's surprise release sparked a wave of discussion nearly rivaling the frenzy around the launch of ChatGPT (GPT-3.5). This text-to-video model demonstrated a "new generational leap" in its comprehension of language, consistency across multi-shot generation, image clarity, ability to construct complex scenes, and simulation of the physical world.

The AI-watchers at Yunqi Capital — our "Yunqi-ers" — eagerly dove into the conversation from the moment of its release in the early hours of February 16. During the first week back at work, colleagues in the US and China held in-depth discussions on where Sora's "magic" comes from and what new impacts it might have on the AI landscape.

In this edition of "FutureScope," we've excerpted some highlights to share. "One day on the ground, one year in AI" — for a technology theme still evolving this rapidly, we've kept much of our discussion open-ended. We welcome founders, technical experts, and industry specialists to share your latest insights in the comments.

1. OpenAI Again?

How does Sora achieve such remarkable capabilities? According to OpenAI's technical report, the re-captioning technique introduced with DALL·E 3 is applied to the training videos: a captioning model writes highly descriptive text for each clip, which improves the overall quality of the generated videos. In addition, brief user prompts are first expanded by GPT into longer, more detailed descriptions before being passed to Sora for generation. Both re-captioning and prompt expansion stem from OpenAI's possession of the most powerful foundation model available. The model's inherent capabilities not only yield high-quality descriptions but also directly reduce data annotation costs.
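The two steps described above can be sketched roughly as follows. This is only an illustrative outline, not OpenAI's implementation: `VideoSample`, `recaption_dataset`, `generate`, and the three `toy_*` functions are hypothetical names invented here, and the toy functions merely stand in for a real captioning model, a real language model, and a real video model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VideoSample:
    video_id: str
    caption: str  # text description paired with the clip


def recaption_dataset(
    samples: List[VideoSample],
    captioner: Callable[[VideoSample], str],
) -> List[VideoSample]:
    """Training-time step: replace each clip's caption with a more detailed one."""
    return [VideoSample(s.video_id, captioner(s)) for s in samples]


def generate(
    prompt: str,
    expander: Callable[[str], str],
    video_model: Callable[[str], str],
) -> str:
    """Inference-time step: expand a brief prompt, then pass it to the video model."""
    detailed_prompt = expander(prompt)
    return video_model(detailed_prompt)


# Hypothetical stand-ins (assumptions for illustration only).
def toy_captioner(sample: VideoSample) -> str:
    return f"{sample.caption}, described in detail: subjects, motion, lighting, camera angle"


def toy_expander(prompt: str) -> str:
    return f"{prompt}. Include setting, subject appearance, motion, and camera movement."


def toy_video_model(detailed_prompt: str) -> str:
    return f"<video conditioned on: {detailed_prompt}>"
```

For example, `generate("paper airplanes in a forest", toy_expander, toy_video_model)` returns a placeholder string that embeds the expanded prompt. The point of the sketch is that the same language capability is used twice: once to enrich the training captions and once to enrich the user's prompt.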

By achieving "cost reduction and efficiency gains" through model capability alone, OpenAI maintains an overwhelming lead. The progression runs from GPT-4, to simple analysis of multimodal content, to generation of multimodal content, and Sora's success suggests that this path is currently the most effective one for technology products. Traced back to its source, the magic of video still rests on GPT-4. Most foundation models in China still have ground to make up.

We've been tracking foundation models since 2021 and led the angel round in the model company MiniMax. In January 2024, MiniMax released its self-developed large model "Abab6," whose benchmark results already significantly outperform GPT-3.5.

Sora-generated Lunar New Year of the Dragon

2. Is Being a "World Simulator" Good Enough?

Based on published demos, as long as prompts conform to basic physical laws, most videos Sora produces are indistinguishable from reality. And the more specific and detailed the request, the closer the generated video approaches reality even with complex background details. For prompts that violate physical laws or common sense, Sora's videos can be identified as computer-generated effects; additionally, some published demos show imprecise rendering of water droplets falling or dog fur rippling. OpenAI's technical report reveals that Sora can understand the physical world in motion but still struggles to accurately simulate certain interactive physical properties.

Empowered by LLM cognition, Sora makes generative video a true reflection of the physical world. Yet must video "creation" strictly conform to reality? In beloved animated works like 5 Centimeters per Second and My Neighbor Totoro, the most emotionally resonant moments often defy physical laws, existing only in imagination. To truly become a mature tool for video creators, both "reality" and "imagination" are indispensable.

By Sora

3. Sora and AI Applications: New Opportunities vs. New Challenges?

OpenAI's sustained leadership in multimodal content generation further compresses the market space for single-modal models. Elon Musk commented on Twitter: "AI-augmented humans will create the best work in the coming years." Note that Musk's time frame is "the coming years." For now, Sora's applications in video creation, digital humans, and virtual worlds have yet to truly unfold, and its value in commercial video production workflows remains to be validated. Our GenAI portfolio companies have already obtained Sora beta access and are actively exploring how new content generation methods could positively impact their businesses.

Text-to-video technology remains in testing and hasn't been fully opened for general use. Generating complex videos requires processing tens of thousands of causal relationships and logical inferences in the physical world, compounded by persistent issues like hallucination — numerous challenges remain before large-scale commercial deployment.

By proposing the concept of a "world simulator," OpenAI continues its exploration of parallel "digital worlds." Returning to commercial application logic: how can technological progress truly benefit production and daily life? We believe technology must connect with the physical world — embodied intelligence, autonomous driving, and similar domains. Some argue that Sora's video recognition capabilities could potentially enhance the generalizability of autonomous driving.

4. Where Are OpenAI's Boundaries?

Sora's success extends OpenAI's playbook of "brute-force aesthetics." There was no major theoretical breakthrough, and the model architecture is a hybrid of existing approaches. The data sources, data quality, and engineering details remain undisclosed, which is, of course, precisely where the deepest moat lies.

Following this path, OpenAI has transformed from a nonprofit that described itself as a "research institution" into the most closely watched company in tech. Although it naturally leads in model cognition, compute cost remains an unavoidable challenge, since the company needs massive quantities of GPUs for training. Earlier this month, media reports said that Sam Altman is pursuing a chip-making initiative and is seeking to raise as much as $7 trillion to launch it.

What other changes will Sora bring? We posed this question to GPT-4. Below is our conversation transcript:

Hi, as an AI industry expert, imagine a fundamental model that generates 1-minute high-fidelity video and simulates the workings of the physical world. What impact will this model have on the world?

ChatGPT 4

The impact of such a model on the world could be significant. It could revolutionize industries such as manufacturing, architecture, and entertainment by enabling more accurate and efficient simulations of physical systems. This could lead to cost savings, faster development cycles, and more innovative solutions. Additionally, it could have implications for education and training, allowing for more immersive and realistic simulations of real-world scenarios. Overall, a base model that generates 1-minute high-fidelity video and simulates the workings of the physical world has the potential to have a transformative impact on various industries and society as a whole.

What new changes do you think Sora will bring? We welcome your comments and engagement below! Our observations and reflections on AI will continue — see you next time on "FutureScope"~