OpenAI Sora Launch: The Business Insights and Tech Innovation Behind It

February 22, 2024

On February 16, OpenAI's text-to-video model "Sora" burst onto the scene, sending shockwaves through the industry.

From the landing position of a Dalmatian in a generated video to the cursor color in a game demo, to the vision of a "world simulator" — our thinking had already taken flight long before Sora became publicly available. Through in-depth conversations and clashing perspectives, we attempt to grasp the true currents beneath the froth.

In this episode, we invited Yusen Dai, managing partner at ZhenFund, and Peak Ji, an AI entrepreneur of ten years, to discuss the significance of Sora from the perspectives of investor and founder. We also explore: Is Sora the so-called GPT moment? What does Sora's breakthrough mean for startups? And what have we observed firsthand in recent AI application entrepreneurship and investment?

Host: Yan Xie, Vice President of Investment at ZhenFund

Guests: Yusen Dai, Managing Partner at ZhenFund

Peak Ji, ZhenFund EIR and Founder of Magi

Episode Highlights

First Impressions of Sora

  • 03:04 Logical and consistent: a major breakthrough in world simulation
  • 08:35 What investors care about: compute cost, training data, simulation quality
  • 11:06 Product perspective: video generation speed, ease of use
  • 12:57 Hollywood 3D rendering takes days too: what does current speed mean for large models?

Overestimation and Underestimation of Text-to-Video Technology at Present

  • 15:49 Underestimated: the potential to scale up; Overestimated: current model performance
  • 18:19 Must one understand the underlying laws of the world to produce reasonable behavior?
  • 19:40 The scapegoat theory: why can't we trust AI?
  • 26:49 Early in technological progress, the barrier from 0 to 1 is high
  • 29:52 Venture capital is about finding the beer beneath the foam
  • 32:47 A truly general-purpose world simulator may not be Sora

Two Technical Paths: Diffusion vs. Autoregressive Transformer

  • 33:36 Sora continues the diffusion model lineage; VideoPoet continues the autoregressive model lineage
  • 35:28 Large models "kill" small models: general models unify different tasks
  • 36:34 Building a broader ecosystem based on general-purpose capability

Is Sora the ChatGPT Moment for Video Generation?

  • 37:56 How do we define a ChatGPT moment?
  • 39:48 Why Sora is not the ChatGPT moment for video generation

Opportunities and Challenges

  • 45:43 How should we view "wrapper companies"?
  • 47:11 Don't be a "sashimi-style" startup
  • 48:40 Building tools doesn't mean you can't make money
  • 56:50 Technological innovation and demand insight: the two axes of AI entrepreneurs
  • 01:00:56 The AI Native era hasn't yet produced a particularly strong business model
  • 01:07:17 World simulators may rewrite humanity's definition and perception of reality
  • 01:08:07 High-speed simulation of the world is also conservation and expansion of life

Further Reading

Deep Dive into Sora: Technical Surprises and Disappointments, the Possibilities and Imagination of "World Models" | Crossover with OnBoard!

Reference Materials

Transformer Architecture

The Transformer is a neural network architecture built around the attention mechanism. The best-known Transformer models are pre-trained language models, mostly trained using self-supervised learning on large amounts of raw corpora. In other words, training these models does not require human-annotated data. Autoregressive models, such as the GPT series, are one type of Transformer model.
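
To illustrate why self-supervised training needs no human annotation, here is a minimal sketch of how raw text can supply its own training targets: each word becomes the label for the words before it. The function name `next_token_pairs`, the `context_size` parameter, and the whitespace tokenization are assumptions made for illustration; production models use subword tokenizers and far longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, target) pairs with no manual labels.

    Whitespace splitting is a simplification chosen for this sketch.
    """
    tokens = text.split()
    pairs = []
    for i in range(1, len(tokens)):
        # The target is the next token; the context is what came before it.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("a dalmatian walks along the windowsill")
for context, target in pairs:
    print(context, "->", target)
```

Every pair comes directly from the text itself, which is why large unlabeled corpora are enough to train such models.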

Autoregressive Model

The autoregressive model, commonly abbreviated as the AR model, is a statistical model used for time series analysis and forecasting. It predicts future values from a time series' own historical values, modeling the relationship between the observation at the current moment and those at previous moments. Generative models apply the same idea to token sequences: each new token is predicted from the tokens that precede it.
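
As a rough illustration of the idea, the sketch below fits the simplest autoregressive model, AR(1), where each value is assumed to be a constant multiple of the previous one, and then forecasts by feeding each prediction back in as input. The function names and the toy series are hypothetical; this is the classical statistical form, not the way Sora or VideoPoet is implemented.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    num = sum(cur * prev for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


def forecast(series, phi, steps):
    """Roll the model forward, reusing each prediction as the next input."""
    out = []
    last = series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out


history = [8.0, 4.0, 2.0, 1.0, 0.5]  # toy data: each value halves
phi = fit_ar1(history)
future = forecast(history, phi, 3)
print(phi, future)
```

The feedback loop in `forecast` is the defining trait: the model's own output becomes part of the history it conditions on, just as an autoregressive language model appends each generated token to its context.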

Diffusion Models

Diffusion models generate target data samples from noise drawn from a simple distribution. By learning to reverse a process that gradually turns images into pure noise, a diffusion model can transform a pure-noise image into a meaningful image, step by step, thereby completing image generation.
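
The "gradually turning images into pure noise" half of the process can be sketched in a few lines. The example below follows the commonly used formulation in which a noised sample is a weighted mix of the original data and Gaussian noise, with the weight on the data shrinking over time. The linear schedule, its start and end values, the step count, and the four-number "image" are all assumptions chosen for illustration; nothing here reflects Sora's actual configuration.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal remaining after each step."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        # Noise added per step grows linearly from beta_start to beta_end.
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Mix clean data with Gaussian noise for a given signal fraction."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
pixels = [0.9, -0.3, 0.5, 0.1]  # stand-in for image pixel values
slightly_noisy = add_noise(pixels, alpha_bars[0], rng)
almost_pure_noise = add_noise(pixels, alpha_bars[-1], rng)
print(alpha_bars[0], alpha_bars[-1])
```

A trained diffusion model learns the opposite direction: given a noisy sample and its step, it estimates the noise so that it can be removed a little at a time. That learned network is omitted here.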

VideoPoet

VideoPoet is a large language model focused on video generation, released by Google at the end of 2023. It can perform a variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and extension, and video-to-audio. Unlike the vast majority of video models, VideoPoet does not take the diffusion route. Instead, it builds on the autoregressive Transformer architecture and integrates multiple video generation functions into a single LLM, demonstrating the Transformer's potential for video generation tasks.

Staff

Executive Producers: Jiafen, Yifei

Post-production: Keyone Studio

About ZhenFund

"This Is Seriously True" (《此话当真》) is a general business podcast produced by ZhenFund, in which the ZhenFund investment team shares the latest hot topics and industry insights with leaders from various fields.

Founded in 2011, ZhenFund is one of China's earliest angel investment institutions. Since its inception, ZhenFund has been actively seeking out the most outstanding entrepreneurial teams and era-defining investment opportunities in artificial intelligence, chips and semiconductors, robotics and hardware, healthcare, enterprise services, new energy, cross-border expansion, consumer lifestyle, and other fields.

ZhenFund — your first stop for entrepreneurship!

Contact Us

WeChat Official Account: ZhenFund (ID: zhenfund)

Website: www.zhenfund.com

Email: media@zhenfund.com

You can listen to us on Xiaoyuzhou, Apple Podcasts, and Ximalaya.

If you have any suggestions or expectations for the show, feel free to share them in the comments.