XVERSE: AI + 3D, Unlocking a More Realistic, More Intelligent Digital New World | Gaorong Future
Always standing at the cutting edge of 3D and AI technology.
XVERSE was founded in 2021 by Yao Xing, former Vice President of Tencent and head of Tencent AI Lab and Robotics X Lab. Yao and his team have lived through the full cycles of both the PC and mobile internet eras and, during the reinforcement learning wave, led the development of star products such as the Go AI "Jueyi" and the game AI "Juewu."

XVERSE was founded at a time when "mobile internet was hitting a bottleneck and beginning its transition toward the immersive internet." Gaorong Ventures co-led XVERSE's angel round and continued to participate in subsequent rounds.
From XVERSE's perspective, the so-called "immersive internet" may be a more authentic, more intelligent digital world — and there are two golden keys to this world: AI + 3D.
To that end, over the past four years, XVERSE has become the only company in China to build both AI and 3D technologies at the same time. On one hand, the company developed proprietary technical solutions, including an end-cloud collaborative 3D engine, establishing distinctive advantages in 3D production and rendering. On the other hand, riding the large model wave, it independently developed large-scale open-source models, steadily accumulating expertise in innovative Mixture-of-Experts (MoE) architectures.
While keeping its eyes on the horizon, XVERSE is also turning AI and 3D technology into present-day business. It acquires users and drives growth through 3D interactive content production, a large-space VR brand, and AI applications, and is gradually building out a one-stop platform for 3D content production and consumption.

Recently, at a joint livestream recruiting event for AI application companies organized by Gaorong Ventures and BOSS Zhipin, XVERSE Co-founder and CTO Xiao Zhili explained how XVERSE is using "AI + 3D" to explore a future-facing digital world. Xiao is an expert in cloud storage, cloud computing, and rich media storage and transmission technologies, and previously served as General Manager of Tencent's Infrastructure Department.

The following is Xiao Zhili's presentation (edited):

Why Build Both 3D and AI?
What core technical foundations are needed to support the transition from mobile internet to immersive internet? We can think about this from two dimensions: information and intelligence.
Information dimension: Looking back at internet development trends, mainstream information media have evolved from text to images to video, continuously increasing in dimensionality. Video essentially represents the experiential ceiling of mobile internet. To move closer to the real world, the viable path is to add spatial dimensions — that is, to develop toward 3D, a higher information dimension.
Intelligence dimension: For a future digital world to be sufficiently authentic, visual experience alone isn't enough; general intelligence must also be achieved. From reinforcement learning to large language models, multimodal systems, and agents, the wave of intelligence keeps advancing, giving future digital worlds the ability to grow and adapt on their own.
Therefore, from its founding, XVERSE established 3D (perceptual intelligence) and AI (cognitive intelligence) as its dual-track technical strategy.

On the 3D technology dimension, XVERSE develops proprietary 3D engines, sets content production benchmarks, and builds "authentic" virtual worlds. On the AI dimension, XVERSE independently develops full-chain, multi-size large models, empowering content production and entertainment consumption to create "intelligent" digital life. These two directions gradually converge and develop together, ultimately serving our vision — building an authentic and intelligent digital world for everyone.

3D Technology Innovation: From Content Production to Consumption
From the perspective of information evolution, the next step up in information dimensionality is inevitably 3D interactive content. The higher the information dimension of a piece of content, the lower the barrier for users to consume it, but the higher the cost of producing it. For example, watching a short video is usually easier than reading an article, yet producing that video may cost many times more.
XVERSE took the lead in innovating from the 3D content production side, lowering production barriers and improving efficiency.
To date, XVERSE has established distinctive advantages in 3D production and rendering. For example, XVERSE is able to modify the engine itself and has made numerous improvements on top of UE5 (Unreal Engine 5). Among these, it independently developed an industry-leading "end-cloud collaborative" 3D interactive technology solution, which renders high-definition assets in real time on mobile devices while keeping interaction responsive, and delivers high-definition visuals even on low-power devices. This technology has also enabled XVERSE to create numerous benchmark 3D interactive content projects for clients in the culture and tourism and commercial sectors.
Additionally, XVERSE has developed multiple innovative AIGC tools to lower 3D content production barriers. In August 2024, XVERSE launched MotionGen, China's first AIGC model for 3D motion generation, which lets users enter simple text commands to quickly generate realistic, smooth, and complex 3D motions. The model covers motion needs ranging from basic walking to complex limb movements, such as quickly generating a dance animation for a character.
In 3D worlds, scenes are also a key element. Rapidly reconstructing real-world scenes in virtual worlds involves numerous technical challenges. XVERSE developed XScene-UE Plugin, a 3D scene generation tool based on 3DGS. 3DGS, or 3D Gaussian Splatting, can generate extremely high-quality 3D scenes from 2D images at remarkable speed, addressing rendering quality and efficiency together, and has been hailed by the industry as "the future of 3D." XVERSE adopted 3DGS just three months after the research paper was released, among the first to do so, and launched a free plugin tool.
Through this plugin, developers can handle the whole workflow for a given 3D scene in one place: capture on a mobile phone, 3DGS generation on a computer, editing in UE5, and interactive development. For example, photos taken of the real XVERSE office can be rapidly reconstructed in 3D and automatically converted in the virtual world into standard materials and lighting information that support further processing. A virtual character can then perform interactive actions in the scene, achieving a "virtual-real fusion" effect.
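For readers unfamiliar with 3DGS, the blending step at the heart of its renderer can be sketched in a few lines. This is a minimal illustration of front-to-back alpha compositing for a single pixel, not XScene-UE Plugin code. The function name composite_pixel and the input layout are hypothetical, and each splat's alpha is assumed to be precomputed (in a real renderer it comes from the splat's opacity and its projected Gaussian falloff at that pixel).

```python
def composite_pixel(splats):
    """Blend splats covering one pixel, nearest first.

    splats: list of (depth, (r, g, b), alpha) tuples. alpha is assumed to be
    the splat's already-evaluated contribution at this pixel, in [0, 1].
    Returns the blended color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through

    # Sort by depth so nearer splats occlude farther ones.
    for _depth, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = transmittance * alpha
        for ch in range(3):
            color[ch] += weight * rgb[ch]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # pixel is effectively opaque; stop early
            break

    return tuple(color), transmittance


# A half-transparent red splat in front of an opaque blue one.
pixel, remaining = composite_pixel([
    (2.0, (0.0, 0.0, 1.0), 1.0),
    (1.0, (1.0, 0.0, 0.0), 0.5),
])
```

Because each pixel only needs a sort and a weighted sum, this kind of rendering runs well on GPUs, which is one reason 3DGS scenes can be displayed in real time.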
Based on these technical accumulations, XVERSE not only achieves better results in 3D interactive content production, but also significantly reduces content production costs.
At the current stage, with head-mounted displays not yet widespread, XVERSE is first introducing users to new content consumption formats through offline channels, which also lets it deploy its technology commercially. In offline settings, users are far more willing to pay to try new interactive content than they are online. For example, last year the VR immersive exhibition "The Disappeared Pharaoh" sparked a wave of enthusiasm domestically. As new devices become more prevalent, the digital consumption market will be reshaped, and XVERSE's accumulated advantages can gradually come into play.

Starting in July 2024, XVERSE incubated the large-space VR brand "VISION WALK," which is built around next-generation immersive experiences and has quickly become an industry leader. Within one year, it has opened over 150 VR content experience stores domestically and is actively expanding into overseas markets. XVERSE's accumulated capabilities in "full-stack proprietary industry chain, extreme cost control, and unlimited scene adaptability" are the underlying drivers of this growth.


AI: Multimodal and Reasoning Capabilities Driving Innovative AI Applications
Over the past two years, XVERSE has also kept pace with the large model technology wave. Its independently developed XVERSE large models, with advantages in multimodality, long context, and innovative MoE architecture, have elevated domestic open-source models to world-class levels.
In 2024, XVERSE large models reached multiple milestones. In January, the company released XVERSE-Long, then the world's longest-context open-source large model. In April, it launched the visual multimodal large model XVERSE-V, giving the model visual understanding capabilities. In September, it released a 255-billion-parameter MoE model, then the largest open-source MoE model in China.

XVERSE large models are developed independently across the full chain, and several key technical innovations along the way improve model quality and computational efficiency. For example, the team employs a 4D topological architecture and computation overlapping mechanisms to improve computational efficiency, and designs expert weighting and learning rate scheduling strategies to make training more effective.
Looking at this year's directions for large model development, two trends stand out. First, large language models are expected to become more intelligent, moving beyond text output to completing more complex logical tasks through agents. Second, multimodal models, including image, voice, and video models, continue to develop and are gradually becoming usable in consumer scenarios.
XVERSE is also building AI applications on its large model capabilities. In 2024, XVERSE launched Saylo, an AI role-playing interactive web-novel app primarily targeting overseas markets. The product's DAU, session length, retention, and paid conversion have all trended upward. The app once ranked #1 on the entertainment charts in Hong Kong, Macau, and Taiwan, and has ranked highly on local free and grossing charts in markets including the US, Japan, the UK, Malaysia, and the Philippines.
As of July this year, Saylo had surpassed 3 million cumulative users. Notably, Saylo's user stickiness is exceptionally strong: average retention exceeds 65%, and active users spend an average of 110 minutes per day in the app, far above industry averages.
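To make the idea of expert weighting in an MoE layer concrete, here is a generic sketch of top-k gating, the routing scheme commonly used in MoE models. It is not XVERSE's actual strategy, which the talk does not detail. The names top_k_gate and moe_forward, the toy experts, and the choice of k = 2 are all assumptions made for illustration.

```python
import math


def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those.

    Returns {expert_index: weight}, with weights summing to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of the selected experts using the gate weights.

    Only k experts run per input, which is why an MoE model can hold many
    parameters while activating a small share of them for each token.
    """
    weights = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Three toy "experts"; the router strongly disfavors the third one.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x]
output = moe_forward(4.0, experts, router_logits=[0.0, 0.0, -5.0], k=2)
```

In a trained model the router logits come from a learned layer, and training typically adds a balancing term so that no single expert receives most of the traffic.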

Saylo's highly authentic, highly interactive user experience rests on XVERSE's accumulated large model technology. For example, XVERSE independently developed an MoE foundation model for the pan-entertainment domain, using a proprietary post-training technique that grafts a dense model onto an MoE framework to strengthen the model's reasoning capabilities, compress its parameter scale, and reduce training costs. Saylo has also fully integrated XVERSE's AI image generation, AI voice calling, and AI video capabilities.
In the post-training phase, XVERSE excels at combining reinforcement learning techniques with user feedback, so that AI outputs better match what users want: content that is "fun and engaging."
Beyond Saylo, XVERSE is also launching and exploring new application forms, including AI + entertainment, AI + social, and AI + gaming directions.

Always Standing at the Forefront of 3D and AI Technology
Looking back at XVERSE's entrepreneurial journey, we have always stood at the forefront of 3D and AI technology and have stayed firmly committed to investing in technology. Currently, XVERSE has a team of over 200 people, more than 75% of whom work in R&D.
Looking ahead, the two-way empowerment of AI + 3D will unlock more potential. Today, AI is applied mostly on the generation side of 3D content; we also look forward to seeing AI and 3D converge more on the consumption side. For example, in a real-time video chat between a user and a 3D virtual character, a large model could take the character's text responses and generate the character's body movements, facial expressions, lip sync, and more in real time.
Once the application path for AI + 3D on the consumption side is opened up, the possibilities will be immense. A more authentic and intelligent digital world will also unfold before our eyes more quickly.




