Sora Is Finally Here. Is It as Powerful as You Imagined? | Yunqi Tech π

Yunqi Partners · December 10, 2024

Late to the party — where's the surprise?

After a ten-month delay, Sora — long hyped as the "ceiling of AI video generation" — is finally available to the public. In the early hours of December 10 Beijing time, OpenAI officially unveiled Sora on day three of its Shipmas event.

Much has changed in AI video generation over the past ten-plus months. Players old and new, domestic and international, have raced to ship ready-to-use products, with meaningful breakthroughs in duration, resolution, consistency, and other key dimensions. With user expectations continually ratcheting upward, is the belated Sora still stunning enough?

In this episode of "Yunqi Tech π," we take you through Sora's features.

This article is republished from GeekPark.

Original title: "OpenAI Officially Releases Sora: A Complete Guide to What Makes Its Text-to-Video Feature So Powerful"

Author: Shiyun Li

As the outside world had speculated, on the third day of its 12-day livestream, OpenAI officially released its text-to-video product, Sora.

At 2 a.m. Beijing time on December 10, Sam Altman and several OpenAI employees demonstrated Sora's capabilities and real-world use cases via livestream. After OpenAI released sample videos in February this year, Sora sparked a global AI frenzy, and AI companies in China and abroad went on to roll out their own text-to-video products. Today Sora, the pioneer of this field, has finally been unveiled.

Overall, the range of product features Sora demonstrated shows that it surpasses current text-to-video products in video generation quality, functional originality, and technical complexity.

Building on the foundational text-to-video and image-to-video capabilities, it adds features like storyboard (essentially creating your own story through shot breakdowns), text-based video adjustment, and blending of videos from different scenes (equivalent to adding special effects directly to video). The entire product design seems geared toward bringing video closer to creators' self-expression and helping them realize an ideal cinematic story.

Late on December 9 local time, users in the United States and most other countries could access Sora through the official website. It is included in ChatGPT Plus and ChatGPT Pro subscriptions at no extra charge. Plus subscribers can generate up to 50 premium videos, each at up to 720p resolution and up to 5 seconds long, while Pro subscribers get up to 500 premium videos, each at up to 1080p resolution and up to 20 seconds long, plus watermark-free downloads.

Sam Altman outlined three reasons for building Sora:

First, from a tools perspective: OpenAI loves making tools for creative people, which is important to the company's culture.

Second, from a user interaction perspective: AI systems can't interact solely through text; they should also understand and generate video to help humans use AI. This echoes what domestic large-model companies have said — "every time a model expands its modality, user penetration rises."

Third, from a technical perspective: this is critical to OpenAI's AGI roadmap. AI should learn more about how the world works — the so-called "world model" that understands the laws of physics.

Using technology to change the world while using products to catalyze human creativity — that's what Sora is doing.

01 Beyond Video Generation: Storyboarding, Special Effects, and Infinite Creation

At its most basic, Sora offers text-to-video and image-to-video generation. Opening the main interface, users can view and manage all their video generation content, switch between grid and list views, create folders and favorites, and check bookmarks. Researchers said this main interface design is meant to better help users craft stories. At the bottom center of the home page are Sora's text-to-video and image-to-video functions.

For example, Sam Altman first entered the text prompt: "A woolly mammoth walks through a desert, shot with a wide-angle lens." He then selected the aspect ratio, resolution, duration (5-20 seconds), and number of videos to generate (up to four clips to choose from). The resulting video was highly realistic and richly textured, and it largely followed the prompt. Sora's outstanding video generation quality may come as no surprise.

After entering the text "A woolly mammoth walks through a desert, shot with a wide-angle lens," Sora generated four video clips | Image source: OpenAI

But this time, Sora also released a series of unique, advanced product features. In GeekPark's view, these functions essentially revolve around more precise video expression — enabling people to create the stories they want through shot breakdowns, special effects, and other means.

First is storyboard, which researchers called a "brand-new creative tool."

From a product design perspective, it essentially slices a story (video) into multiple story cards (video frames) along a timeline. Users only need to design and adjust each story card (video frame), and Sora will automatically stitch them into a coherent story (video). This works much like a storyboard in film or a draft manuscript in animation: once the director has drawn the storyboard, the film is essentially shot; once the comic artist has completed the manuscript, the animation is essentially designed.

For instance, the first shot a researcher envisioned was: "A beautiful white crane stands in a stream, with a yellow tail." The second shot: "The crane dips its head into the water and catches a fish." What he did was create these two story cards (video frames) separately, with roughly a five-second gap between them. This interval matters to Sora — it gives the model room to connect the two actions.

In the end, he got a complete video shot: "A beautiful white crane stands in a stream. It has a yellow tail. Then the crane dips its head into the water and catches a fish."

Through two story cards (video frames), Sora generated a complete story (video) | Image source: OpenAI
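As a rough illustration of the storyboard idea described above, the sketch below models story cards placed along a timeline and measures the gap between them. It is only a mental model: `StoryCard`, `Storyboard`, and their fields are hypothetical names invented for this example, not part of any Sora interface, and the 10-second timeline length is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class StoryCard:
    start_s: float  # where the card sits on the timeline, in seconds
    prompt: str     # text describing the shot


@dataclass
class Storyboard:
    duration_s: float  # total timeline length (assumed value, for illustration)
    cards: list = field(default_factory=list)

    def add(self, start_s: float, prompt: str) -> None:
        # Reject cards that would fall outside the timeline.
        if not 0 <= start_s < self.duration_s:
            raise ValueError("card must sit inside the timeline")
        self.cards.append(StoryCard(start_s, prompt))
        # Keep cards in timeline order regardless of insertion order.
        self.cards.sort(key=lambda card: card.start_s)

    def gaps(self) -> list:
        # Seconds between consecutive cards: the room left for the
        # model to connect one action to the next.
        return [b.start_s - a.start_s for a, b in zip(self.cards, self.cards[1:])]


board = Storyboard(duration_s=10)
board.add(5, "The crane dips its head into the water and catches a fish.")
board.add(0, "A beautiful white crane stands in a stream, with a yellow tail.")
print(board.gaps())
```

Here the two cards from the crane example sit five seconds apart, which is the interval the researcher left for the model to bridge the two actions.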

What's even more remarkable is that on this storyboard, creative elements aren't limited to story cards — they can also be direct images or videos. That is, any image or video can be dragged onto the storyboard and combined with story cards for creation.

Taking video as an example, researchers trimmed the crane video above and imported the clip into the storyboard. This left gaps before and after the clip for further creation, meaning a new beginning and a new ending can be generated around it.

The implication is that the storyboard enables infinite creation. A 20-second Sora video can be endlessly created, cut, created again... until it perfectly matches the ideal shot in one's mind. This process resembles an editor or director slowly cutting their envisioned film through iterative storyboard design and footage generation.

Unlike the real world, Sora offers unlimited source material. And unlike other text-to-video products, Sora's videos can be modified and refined. This means its generated videos will inevitably align more closely with users' imaginations and creativity.

This seems to be the core philosophy behind Sora's product: to make generated videos match users' creative vision as closely as possible.

This helps explain Sora's other features: modifying videos directly through text, seamlessly blending two different videos, changing the visual style of a video — essentially adding "special effects" directly to video. With typical text-to-video products, users might need to constantly adjust prompts and regenerate videos from scratch.

Users can directly adjust videos by modifying text | Image source: OpenAI

Sora can merge two video clips into a seamless edit | Image source: OpenAI

Overall, beyond Sora's unsurprisingly excellent video generation, it brings more unique video creation features — equivalent to adding storyboarding, editing, and special effects to video. This means everyone has the opportunity to create what they truly want to express, moving closer to being a director.

"If you come into Sora with the expectation that you just click a button and generate a movie, I think your expectation is wrong," an OpenAI researcher said.

He explained that Sora is a tool that allows people to try multiple ideas in multiple places simultaneously, attempting things that were previously completely impossible. "We actually think of this as a super special extension of the creator."

02 Serving the Masses Without Separate Charges, Still Relying on Foundational Model Capabilities

Although Sora pioneered the text-to-video field, it launched later than its peers. The OpenAI research team explained that deploying Sora widely required finding ways to make the model faster and cheaper, and that the team invested substantial work toward this end.

During the livestream, OpenAI announced the launch of Sora Turbo, a new high-end accelerated version of the original Sora model. It incorporates all the capabilities discussed in OpenAI's earlier technical report on video generation models as world simulators, plus additional features for generating video from text, animating images, and blending videos. This is the technical foundation behind Sora's product features.

While video inference appears costlier than text, OpenAI is not charging separately for Sora this time. Both the $20/month ChatGPT Plus membership and the $200/month ChatGPT Pro membership include access to Sora.

Plus benefits include up to 50 premium videos, each at up to 720p resolution and up to 5 seconds long. Pro benefits include up to 500 premium videos plus unlimited standard videos, each at up to 1080p resolution and up to 20 seconds long, with watermark-free downloads.

Sora usage allowances by membership tier | Image source: OpenAI
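The tier allowances above can be written down as a small lookup table. The sketch below is purely illustrative: the figures come from this article, while `PlanLimits`, `PLANS`, and `check_request` are hypothetical names made up for the example and do not correspond to any OpenAI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanLimits:
    price_usd_per_month: int
    premium_videos: int     # premium video allowance as reported above
    max_resolution_p: int   # maximum vertical resolution, e.g. 720 for 720p
    max_duration_s: int     # maximum clip length in seconds


# Figures as reported in this article; names are hypothetical.
PLANS = {
    "plus": PlanLimits(20, 50, 720, 5),
    "pro": PlanLimits(200, 500, 1080, 20),
}


def check_request(plan: str, resolution_p: int, duration_s: int) -> list:
    """Return the reasons a requested clip exceeds the plan's limits."""
    limits = PLANS[plan]
    problems = []
    if resolution_p > limits.max_resolution_p:
        problems.append(f"resolution {resolution_p}p exceeds {limits.max_resolution_p}p")
    if duration_s > limits.max_duration_s:
        problems.append(f"duration {duration_s}s exceeds {limits.max_duration_s}s")
    return problems


print(check_request("plus", 1080, 20))
print(check_request("pro", 1080, 20))
```

Under these figures, a 20-second 1080p clip exceeds both Plus limits but fits within Pro.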

Sora's significance to OpenAI extends further. The team found that video models exhibit many interesting emergent capabilities when trained at scale, enabling Sora to simulate certain aspects of the real world involving people, animals, and environments. "Our results suggest that scaling video generation models is a promising path toward building general-purpose simulators of the physical world."

Perhaps precisely for this reason, getting Sora into widespread use as quickly as possible — using data to better train the world model — matters so much for OpenAI's ultimate AGI vision.

In iterating on its technology, OpenAI has also advanced human creativity along the way.

"This version of Sora will make mistakes. It's not perfect. But it's at a point where we think it will be incredibly useful for augmenting human creativity. We can't wait to see what the world does with it," OpenAI said of its creation.