Haiper's Yishu Miao: Video Generation Is Still at the "GPT-2" Stage | 5Y View

5Y Capital · August 14, 2024

Before becoming the TikTok of the GenAI era, Haiper is trying to become Xiaohongshu first.

Recommended by

Steven Shi, Vice President at 5Y Capital

Tools and demand don't always go hand in hand. I believe that in GenAI, tools must precede demand. We're still in the stage of inventing and improving new tools. Only when tools mature enough will we enter the phase of unlocking new use cases.

We can't assume, just because short video succeeded in the previous generation, that video generation products will naturally leap ahead to creating new media formats and defining new consumption loops. On the contrary, video generation is still in its very early days: there is still ample room to explore data recipes and algorithmic architectures for scaling.

I believe a new modality whose methodology is still uncertain and lacks consensus offers precisely the right moment for top-tier researchers and engineers like the Haiper team to start a company. I welcome more people at this "immature" stage to reach out and explore together: Stevenshi@5ycap.com

Has the "ChatGPT moment" for video generation actually arrived?

When Sora launched this February, many believed that moment had come. OpenAI used a Transformer-based Diffusion Model, enabling Sora to generate videos up to 1 minute long; at the time, Pika could only generate 3 seconds, and Runway 18 seconds. Moreover, Sora's videos "moved" more than those from Runway and other models, looking more like genuine motion than GIFs. However, to this day, every Sora release has remained a demo with limited access for select testers, rather than being opened to everyone as ChatGPT was.

Patrick Cederberg, a post-production member of Toronto-based video production team Shy Kids, noted after testing that only about 1 in roughly 300 videos generated by Sora was usable, meaning Sora's "gacha rate" (i.e., yield rate) was insufficient.

If the video generation market has two factions — the duration camp and the yield camp — with Sora representing the former, London-based startup Haiper belongs to the latter. Haiper founder Yishu Miao told Neocortex that while Haiper's video model currently generates only 8-second videos, "Haiper's gacha rate is quite high — approximately 1 out of every 2 videos gets downloaded and used by users."

In Miao's view, the reason for prioritizing yield over duration is that users don't actually expect ultra-long videos, and the "optimal duration" for video generation remains under exploration. Meanwhile, a 4-second generation length can already accomplish a great deal for users — such as serving as advertising content or telling a short story.

Many people expect video generation to produce a TikTok for the GenAI era, with the difference being that videos would come from AI generation rather than camera capture. Miao also believes such a platform will emerge, though he contends that even if current models solve the yield problem, they still won't solve the storytelling problem — high-quality creation remains the hardest challenge. For now, this work still requires human intervention. Thus overall, video generation currently stands at roughly the GPT-2 stage of language models: capable of generating content resembling natural language and video, but the question of "whether the content is meaningful" remains unsolved.

Haiper is building a community where professionals and novices can exchange video generation experiences. In Miao's words, this community resembles Xiaohongshu more than TikTok — for video generation, the former functions more like a learning community for exchanging video generation experiences, while the latter would be a marketplace for trading finished AI videos. Before becoming the TikTok of the GenAI era, Haiper is trying to first become the Xiaohongshu of the GenAI era.

This article is adapted from Neocortex's conversation with Yishu Miao, in which he discussed the differences in target audience between Haiper and Sora, Haiper's current product positioning, and the competitive factors in the current video model landscape. He also talked about what he gained from working at DeepMind, and his understanding of why DeepMind was overtaken by OpenAI.

Success Rate Matters More Than Duration

Neocortex: As a video generation company, does Haiper's technical approach more closely resemble Runway, Pika, or Sora?

Yishu Miao: I'm not certain about the specific technologies other companies use, but what I can say is that different companies' video generation models will differ significantly in architectural details. Video generation is a complex engineering system — from the data layer to model architecture design to final output selection, many factors come into play.

We use Latent Diffusion Model + Transformer. At this point, we can't simply say that using a particular architecture creates technical advantage. It's a process requiring continuous research and hybridization.

Because AI products have similar UI/UX designs, our product may look similar to Runway and Pika in early stages, but users will discover significant differences after using it.

Neocortex: After Sora's release, domestic video generation companies in China all seemed to pivot toward becoming the next Sora. In Silicon Valley or London, are there still attempts at different technical approaches in video generation?

Yishu Miao: I don't think a single dominant technology will emerge in video generation anytime soon. There may be general-purpose architectures similar to language models, but content diversity will lead to diversification in video generation models — platforms like YouTube, Bilibili, and Netflix already show clear differences at the content level.

Technically, the video generation industry is still at a very early stage. No technical consensus has formed, so continued research is needed to push it forward. For example, we might propose a video architecture, then encounter bottlenecks when scaling it, forcing us to propose new network architectures. New algorithms complicate this process further, and previous training may become invalid.

Neocortex: Is Haiper's goal not to become the next Sora?

Yishu Miao: In my view, rather than calling Sora a video product, it's better understood as a significant milestone on OpenAI's path toward AGI. As a project, it is still a long way from being a mature product that ordinary users can use.

For startups, we need to be closer to users than large companies are, considering why they want to generate videos and what they use them for. Different user needs lead to different technical approaches. We could pursue the route of continuously scaling up models, but what's harder is accounting for how much users actually use the model and how efficiently the model can be iterated during training.

Selecting an excellent demo video is relatively easy, but reaching product-level quality and ensuring user satisfaction is entirely another matter. Truly commercializing a model and deploying it on the cloud for all users requires market validation.

Neocortex: In the current video model competition, what are the competitive factors for each company? Duration, resolution, coherence, stability, or alignment with user prompts?

Yishu Miao: All these factors matter, but with different emphases. Professional users may prioritize high resolution and duration, while ordinary users may care more about language understanding, coherence, and how interesting the content is. When these factors converge in a product, serving different user types requires trade-offs.

Neocortex: Currently, Haiper's maximum generation length is 8 seconds. Is there still a gap compared to Sora in video length?

Yishu Miao: Technically, we've already achieved unlimited-duration video generation, but it's not ready for market release. Actually, simply extending video length isn't difficult, but as generation length increases, video content quality declines. (Note: On July 17, Haiper released version 1.5 of its video generation model, increasing video duration from 4 seconds to 8 seconds. The new model also added an upscaler that can enhance low-quality video to 1080p resolution, improving image quality and detail. Additionally, the model added image generation functionality, allowing users to check image effects before video generation to increase success rates.)

Neocortex: What does duration signify in current video generation competition?

Yishu Miao: In video generation, currently launched products typically don't support generating overly long videos, while products supporting long-form generation haven't reached production standards.

My view is that startups shouldn't pursue excessively large models and long durations from the start — this may deviate from application deployment goals. Beyond technical reasons, I believe users don't actually expect ultra-long videos. Discussing how many seconds a model can generate without considering video quality and user experience is meaningless.

Previous reports mentioned that videos ultimately released by studios collaborating with Sora were the result of post-production editing. For a particular shot, only 1 out of 300 videos generated by Sora was usable. Such a success rate is far from sufficient for product-level applications, because ordinary users can't wait to generate 300 videos and pick one. So I believe Sora actually targets professional users, but our strategy differs — we aim to provide something ordinary users can use.

For consumer-facing products, pursuing video duration early on isn't a wise choice. Pursuing long-form generation means requiring larger models, which leads to longer user wait times and increased probability of generating flawed videos.

We want ordinary users to obtain satisfying material in a short time, lowering the cost of trying our product. Current optimization directions include language understanding, creative style combinations, etc., but the most important factor is success rate. High success rate is an important product advantage — it means users can get satisfying video results faster, reducing waiting and filtering time.

Neocortex: What specifically does success rate refer to?

Yishu Miao: Success rate, which users also call the "gacha rate," is the share of usable videos among those generated consecutively. For example, if we generate 10 videos in a row and half are usable, the success rate (gacha rate) is 50%. Currently, judging by user download rates, Haiper's success rate is quite high: approximately 1 out of every 2 videos gets downloaded and used.
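
The arithmetic behind this definition can be sketched in a few lines of Python. This is only an illustration, not Haiper's own metric code: the function names `success_rate` and `expected_generations` are hypothetical, and the second function assumes every generation is an independent attempt with the same chance of being usable, which the interview does not claim. The inputs are the figures quoted in this article (about 1 usable video in roughly 300 for the Sora shot, and about 1 in 2 downloaded for Haiper, where downloads serve as a proxy for usability).

```python
from fractions import Fraction


def success_rate(usable: int, generated: int) -> Fraction:
    """Share of generated videos that turned out usable (the "gacha rate")."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    if not 0 <= usable <= generated:
        raise ValueError("usable must be between 0 and generated")
    return Fraction(usable, generated)


def expected_generations(rate: Fraction) -> Fraction:
    """Average number of generations needed to get one usable video.

    Assumption (not from the interview): attempts are independent and share
    the same success rate, so the expected count is simply 1 / rate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1 / rate


# The example from the interview: 10 videos generated, half usable.
print(f"{float(success_rate(5, 10)):.0%}")  # 50%

# Figures quoted in the article, used here purely as illustrative inputs.
sora_shot = success_rate(1, 300)   # roughly 1 usable video in 300
haiper = success_rate(1, 2)        # roughly 1 in 2 downloaded

print(expected_generations(sora_shot))  # 300
print(expected_generations(haiper))     # 2
```

Under that independence assumption, a 1-in-300 rate means a user would on average sit through about 300 generations per usable clip, against about 2 at a 1-in-2 rate, which is the waiting and filtering cost Miao refers to.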

Even When AI Can Generate Videos for Users, Storytelling Still Requires Human Work

Neocortex: Is Haiper's current product positioning as a tool, community builder, or technical effort toward AGI?

Yishu Miao: We're currently in a community-building phase; achieving AGI is our long-term mission. Before having clarity on the specific form of video generation AGI, discussing AGI is still too distant. Our ultimate goal with AI isn't AI itself, but finding our own path on the road to AGI — that's our long-term direction. But we won't abandon product deployment, which helps us interact with users and get real-world feedback, rather than just entertaining ourselves at the technical level.

We value content created by ordinary users enjoying AI, rather than just providing tools for Hollywood filmmakers. Early on, our product may look more like a tool, but our main next effort is building community, hoping to eventually develop from community to platform.

Neocortex: How do you view aesthetic diversity among different users?

Yishu Miao: Users' ability to appreciate video content is a concern, but it doesn't trouble us. AGI will inevitably diverge from humans — this is reality we must accept. AI-generated content sometimes doesn't match user expectations, but this divergence and aesthetic difference is unavoidable.

AGI is a collection of intelligence; it will have its own views, but doesn't need to satisfy everyone. Our goal with AI products is to meet most users' needs, not to pursue satisfying everyone.

Neocortex: Not offering too many possibilities — this might be a better commercialization path?

Yishu Miao: Right, professional users' aesthetics are indeed not easily expressed in a single model. AGI can't satisfy everyone's tastes, but it can reach most people's aesthetic expectations.

Neocortex: The definition of community sounds broad — is it a community like TikTok, or like Character AI?

Yishu Miao: TikTok is already a mature platform where users create and consume content. The community we're talking about is earlier-stage, mainly building channels for users to exchange and share, establishing communication between professional and ordinary users, lowering the barrier to generating AI videos. Our community may be more like Xiaohongshu, where users can share video generation tips, prompts used, design ideas, etc., and other users can build upon these for secondary creation.

Neocortex: Does building this community imply that the barrier to generating consumable content with video models remains high?

Yishu Miao: Right, the reason for building a community before a platform is that video generation does have barriers — ordinary users find it difficult to create high-quality content in one step. Currently, we haven't reached a stage where most users only consume without participating in creation, so encouraging active user creation is important. Ultimately we hope the community can develop into a platform that not only supports user creation and consumption but also attracts new users to participate through these creations, forming a self-growing ecosystem.

Neocortex: Short video platforms like Douyin are also developing video generation tools to offer users — how does your product differ from theirs?

Yishu Miao: Large companies may focus on building tools and ecosystems around existing platforms; our emphasis is on exploring a new creation method. Although many people have used platforms like TikTok, few actually become creators because the creation barrier is relatively high. Currently, content creation on such short video platforms still requires camera involvement. We prefer using AI to generate video in one step, without needing a camera. Our service allows users to create video directly from text and images, which differs from existing mainstream creation methods.

Neocortex: Haiper currently generates videos of only 4 to 8 seconds — what can this duration accomplish?

Yishu Miao: A 4-second video can do many things, such as serving as advertising content or telling a short story. For longer stories, splicing and editing may be needed.

Neocortex: Can one-click generation currently produce videos that are ready to consume?

Yishu Miao: One-click video generation is asking too much; current technology is better suited for multi-segment expression. It's not because 4 seconds is inherently difficult — storytelling itself is difficult, with extremely high requirements for content understanding. It's hard to tell a complete story or achieve a small humorous moment in 4 seconds.

Neocortex: Has AI lowered the barrier to video creation for users?

Yishu Miao: The barrier at the tool level has been lowered, but the barrier of storytelling itself hasn't been lowered — high-quality creation remains difficult. This isn't a tool problem, but a creative capability problem. Doing storytelling is something AGI could do; before AGI arrives, this is the hardest thing.

Neocortex: Could the storytelling work potentially be done by another AI?

Yishu Miao: Possibly, but not now. If AI could do storytelling, that would signal AGI's arrival. Before AGI arrives, storytelling is what we consider the hardest thing.

Neocortex: Would you consider adding storytelling agents to the community?

Yishu Miao: We will definitely try.

The ChatGPT Moment for Video Models Hasn't Arrived Yet

Neocortex: Currently, video products differ from language models in user base and activity levels — what causes this?

Yishu Miao: First, video model product maturity itself lags behind language models. Additionally, market education is insufficient — users may not yet realize AI can do many things with video content.

However, although video models are immature, some practical application cases have already emerged, such as commercial applications in advertising and other fields. Early language models like GPT-2 were mainly applied to sentiment analysis, classification, or content moderation — not large-scale applications. Compared to language models, video models are closer to consumers; even with immature technology and products, their generated content has greater value.

Neocortex: Has video generation reached its "ChatGPT moment"?

Yishu Miao: Not yet. A ChatGPT moment at minimum means everyone can use the technology, whereas currently Sora and similar products, while bringing new experiences, remain demos without large-scale application.

Neocortex: Which GPT stage does video generation technology currently correspond to?

Yishu Miao: Roughly GPT-2, but I believe it's not as primitive as GPT-2 in terms of applications.

Neocortex: A view exists in the language model field that expected results can be achieved with sufficient time and data volume — is the same true for video model development?

Yishu Miao: I have previous experience in language models; actually, language models aren't as simple as people imagine. Though time and data volume matter, merely increasing these doesn't guarantee success. First, there's a high barrier in engineering implementation — simply increasing data volume, scaling up models, or using more compute resources doesn't fully solve the problem. For example, a model trained on different numbers of GPUs produces different results, involving complexity in model scaling.

Video models need to consider more issues compared to language models — video models need to accommodate GPU memory, process large amounts of metadata, consider video duration, style, and content diversity, etc. These factors all increase video model development complexity.

Therefore, while scaling up is a development direction, it's actually not a simple thing. People may try to find a simple explanation in scaling laws, but this is a massive systems engineering effort requiring comprehensive consideration of multiple factors.

Neocortex: What role do you see video generation playing in achieving AGI? Compared to language models, which provides an easier path to AGI?

Yishu Miao: I lean toward video, because while language is a carrier of intelligence containing much logic, it can't represent all intelligence. Wittgenstein once said, the limits of my language mean the limits of my world. If I have a friend lying in a hospital unable to go out, I could describe the world I see to him in language every day, but this doesn't mean he truly sees the world. There are always things that language cannot describe, and they build our unique understanding of the world — this is multimodality.

Current large language models can understand and read video, but generating video is another matter. Understanding video, using video as input and text as output, is easy because it can become a mode of expression. But perception is a much more diverse and advanced capability, an important step on the AGI path. If our AI remains at the level of logical intelligence, disconnected from the physical world, communicating with humans only in text form, I don't think such AI can be called AGI. Visual content generation is an indispensable part of the AGI path.

DeepMind Taught Us How to Allocate Resources

Neocortex: Haiper has its office in London's King's Cross area — why have so many tech companies like Google and Facebook also chosen this location?

Yishu Miao: King's Cross has indeed become a tech industry cluster. Since Google came to King's Cross in 2012, driving this trend, it subsequently attracted Meta, Uber, Waymo and others, forming a natural agglomeration effect.

Neocortex: How does London compare to Silicon Valley in AI development?

Yishu Miao: Due to DeepMind's influence, London has sufficient AI talent reserves, especially research scientists — not fewer than Silicon Valley. However, London's startup culture relatively lags behind, with graduates rarely choosing to start companies directly.

Additionally, compared to Silicon Valley, London's preferred research directions differ — the UK has a tendency to explore the relationship between science and humanity, so there's relatively strong interest in AI safety and similar topics.

However, an interesting phenomenon is that while Silicon Valley is the dream destination for many tech talents, many in London are unwilling to go to the US. Many of my colleagues have deep cultural attachment to Europe; they love the European lifestyle. At most they'll go to Paris for new opportunities, but very few will work in the US.

Neocortex: Have you considered opening an office in the Bay Area?

Yishu Miao: We've considered it, but the timing isn't right yet. We do hope to access top global talent in the Bay Area, but managing a new office requires experienced managers to plan. We probably won't open one soon, but we are indeed exploring the possibility and rationale of this option.

Neocortex: You and your other partner Ziyu Wang both previously worked at DeepMind — what did DeepMind teach you?

Yishu Miao: Ziyu and I met 10 years ago; he was my classmate at Oxford University. We belonged to the same research group at school — I worked on language models, he worked on optimization and deep reinforcement learning. At DeepMind I mainly worked on language models, which was still a very niche direction at the time — people thought language models were just for translation. Colleagues would often joke: language models are interesting, but what use are they?

DeepMind indeed taught us a lot. As a pioneer attempting to achieve AGI, DeepMind has a very complete project management and scientific research management system, clearly distinguishing between research scientist and research engineer roles. Project leaders are also very visionary, able to foresee project development, allocate necessary resources reasonably, and ensure team communication.

DeepMind didn't seize the initiative in this round of generative AI development, possibly because it didn't do as well as companies like OpenAI at "getting your hands dirty."

Neocortex: How do you divide responsibilities now?

Yishu Miao: During the pandemic, Ziyu and I reconnected and decided to do something together, given our extensive experience in multimodality and visual content generation. Currently my work focuses more on product, business, and management, while Ziyu is responsible for large model systems and fundamental research.

Neocortex: What is Haiper's current team scale in London and Canada?

Yishu Miao: The London team has 15 people, covering product, engineering, and machine learning, while the Canada team has about 6 people, responsible only for machine learning.

Neocortex: Last year, your team's development direction shifted from 3D to video generation — how did this transition occur?

Yishu Miao: The transition occurred partly based on our judgment of content, and partly related to our team's technical accumulation. In the 3D domain, we already had relatively mature technical accumulation — we were among the earliest teams to apply Neural Radiance Fields (NeRF) in 3D, and launched a user-facing iOS product.

We founded Haiper with the original intention of building an influential product that ordinary users could enjoy and benefit from technologically. But from early last year, we realized that 3D content creation and consumption both lean more toward professional users, with main application scenarios and output scenarios tending toward enterprise-facing services, such as gaming or AR/VR. For ordinary users, the barrier to 3D content creation is high, not easy to appreciate or consume. After evaluation, we believed video generation would be a competitive market — video content is closer to actual application scenarios, easier for users to consume, and closer to our ultimate goal for content generation.

Additionally, we were confident in our team's technical reserves in video. I myself have a language model background — among the earliest to work on large language models, with deep understanding of them. Regarding how to scale up, how to optimize from data to model level, expanding model scale — we have corresponding technical reserves.

Neocortex: Was there a specific catalyst for this shift? Were you inspired by some product or model on the market?

Yishu Miao: We weren't triggered by any specific event, but realized during the process of rendering 3D content to video that with sufficiently powerful video generation models, we wouldn't need 3D models. Our research also proved that 3D and 2D video can essentially be converted into each other. Additionally, we saw generation effects from similar products on the market and believed we could do better.

Neocortex: Is the underlying technology for 3D generation and video generation the same?

Yishu Miao: Both technologies' underlying approaches relate to Diffusion Model, but with different emphases. Video generation technology requires building larger models — this is unavoidable. While 3D technology doesn't necessarily require such large model scale; 3D model parameters haven't hit bottlenecks yet. However, the two technologies are essentially connected — early video generation technology had higher correlation with 3D technology, but now with rapid video generation technology development, the two have become quite different.

Neocortex: Haiper already has some commercial cooperation cases, such as JD.com and University of the Arts London. What forms do these collaborations with organizations mainly take?

Yishu Miao: First, I believe generative AI's greatest potential remains in the consumer market. Customizing relatively closed-source models for enterprises is a market that can be broken into, but it's not yet mature, because it involves a series of processes and challenges, not as straightforward as LLMs. We currently mainly provide services through API form.

Neocortex: Your clients include both e-commerce companies and universities — they seem quite scattered. What industries are actually your target customers?

Yishu Miao: We are selective when choosing partner industries, but currently we're more in a broad exploration phase emphasizing breadth. We hope to engage with different industries and explore where our models can play a role. Eventually, our partners may gradually converge to specific industries, but this process isn't planned in advance. It takes shape through continuous exploration and mutual adjustment.

5Y Capital seeks, supports, and inspires lone entrepreneurs, providing everything from moral support to help with all aspects of business operations. We believe that if the "crazy" you in others' eyes begins to be believed, the world will become refreshingly different.

BEIJING · SHANGHAI · SHENZHEN · HONG KONG