VAST Closes New Funding Round: Is 3D the Last Ticket for AIGC?

AnYong Waves · September 18, 2024

Will 3D Be the Future?

By Jiaxiang Shi

AnYong Waves has learned that VAST, a company in the 3D generative model space, has completed a Pre-A funding round led by Fortune Venture Capital and Primavera Venture Partners, with Inno Angel Fund and Tsinghua Alumni Seed Fund participating. VAST stated that the round raised several hundred million RMB, the largest single financing to date in the 3D large model sector. VAST previously received angel funding from Oasis Capital in June last year.

"Post-95 entrepreneur" and "MiniMax Employee No. 001" are the labels most often attached to VAST founder Yachen Song. In 2018, before graduating, Song began working in the office of SenseTime's CEO. In 2021, he left SenseTime to co-found MiniMax, becoming its first employee.

On September 29, 2022, Google released DreamFusion, a text-to-3D technology that used pretrained 2D text-to-image diffusion models and proposed the SDS (Score Distillation Sampling) method, completing open-domain text-to-3D synthesis for the first time. VAST CTO Ding Liang said this marked the moment when 3D AIGC truly became applicable and potentially viable as real projects.

At the end of 2022, Song left MiniMax and founded VAST the following March. Unlike many AI entrepreneurs who chose the application-plus-community direction, Song opted for the most difficult path: "self-developed models plus applications." Currently, 80% of the team's expenses go to technical staff and computing power.

SenseTime, one of China's "Four AI Dragons," has reportedly produced more founders of large model startups than any other company; its alumni ventures include MiniMax, Infinigence-AI, RightBrain Tech, Yantu Intelligence, and VAST. VAST CTO Ding Liang previously led SenseTime's general models effort.

However, the SDS method for generating 3D models had flaws, such as producing "multi-headed" models (objects with duplicated faces or heads) and generating slowly. VAST therefore decided not to pursue the 2D optimization route and chose a 2D-3D fusion approach instead. That choice required large amounts of high-quality 3D data.

Unlike text, images, video, and other content formats, 3D data is extremely scarce and is found almost exclusively in games and films. For a startup, paying tens or even hundreds of dollars for a single model was also completely unaffordable. "You just have to wheedle and cajole to get it," Song told AnYong Waves.

Starting in March 2023, he spent three months gathering data from various "nooks and crannies" (such as 3D modeling training courses), and also established partnerships with companies in gaming, animation, film and television, modeling, communities, and databases.

In both the PC wave twenty years ago and the mobile internet wave ten years ago, the most profitable ventures were content platforms built around each information format. "Text, images, video, even sound all have their own content platforms." In Song's view, a 3D content platform never emerged because the barrier to creation was too high ("it's still at the stage of writing with a brush pen") and creators couldn't make the ROI work. Now, Song says, the cost of producing 3D content has approached zero.

"Before Douyin and Kuaishou exploded, there had to be something called a smartphone camera." By Song's analogy, VAST's self-developed 3D large model, Tripo, is that smartphone camera: aimed at ordinary users, it generates 3D models directly from text and images.

Startups are always asked how they compete with tech giants. Song's response: "First I'd ask: do they play games? Do they really want to enter the virtual world? If you told them to not go to work and just stay home playing games in XR glasses, would they be willing? If they'd go crazy after a month, then sorry, they're fundamentally different from us."

VAST's entrepreneurial drive stems from Song's fanatical love of games and anime. "I'm one of those post-90s kids poisoned by the electronic heroin." At SenseTime, Song saw graduates from China's top eight art academies who were genuinely talented, but their energy was consumed by endless modeling. He hopes VAST can liberate creative talent from "labor-intensive" industries.

Another star entrepreneur in AI-generated 3D is Yuanming Hu. A graduate of Tsinghua's Yao Class with a PhD from MIT, he is a well-known computer graphics researcher and the author of the Taichi programming language. His most famous demonstration recreated the snow effects of "Frozen" in 99 lines of code.

Last November, Hu announced his new venture Meshy, dedicated to 3D generative models. The product is now in its fourth generation and likewise supports text-to-3D and image-to-3D generation.

However, Hu believes that while AI-generated 3D has dramatically lowered the barrier to 3D content creation, enabling mass 3D creation, the use cases for 3D assets are still not mature enough.

In fact, when MiniMax was first founded, it worked on language, voice, and vision models simultaneously, hoping its agents would have voice, appearance, and text capabilities. But MiniMax quickly abandoned 3D avatars because the team believed they could not scale. At the time, the only industries that could support 3D were gaming and film, where R&D cycles typically run for years.

"At the same time, I realized using deep learning for 3D was the wrong approach. On the current carrier — smartphones — if a 3D person keeps staring at you, that's inherently strange. Most of the time, interaction doesn't actually need a real physical form," said MiniMax founder Junjie Yan in an interview.

The best outcome, of course, would be that as devices like Vision Pro and Quest 3 become more widespread, demand for 3D gradually increases, and the productivity gains AI brings to 3D creation happen to meet that emerging demand.

However, even giants like Meta and Apple have struggled to break through in this space. Since late 2020, Meta's VR division has accumulated losses of approximately $50 billion. After a brief surge following its launch, Vision Pro shipments also fell significantly short of expectations. In mainland China, Tencent and ByteDance announced XR department layoffs in February last year. Demand for 3D remains unproven.

A recent positive development may be that "AI godmother" Fei-Fei Li announced her new company World Labs, a spatial intelligence enterprise dedicated to building complete worlds using physics, logic, and rich real-world details. World Labs' founding team told Wired that in their first phase, they will build a model with deep understanding of three-dimensionality, physical properties, and spatial and temporal concepts, with the next phase supporting AR. Meanwhile, World Labs co-founder Ben Mildenhall is also an author of DreamFusion.

For anxious investors, the content categories behind the "C" in AIGC (text, images, and video) have already been picked over, with company valuations exceeding $3 billion. AI-generated 3D companies, whose valuations are still reasonable, may be their last ticket aboard.

Image source | Unsplash

Layout | Yao Nan