Before Sora, Had AI Already Achieved "Physical World Simulation Freedom"? | Yunqi Tech π

Yunqi Capital · November 25, 2024

Pushing Open the Door to Physical AI

Two years into the generative AI wave, the transformative impact of AI on the digital world needs no further elaboration. The industry's new focal point is how to move AI beyond the digital realm and help people solve more problems in the physical world.

Sora, unveiled earlier this year, was positioned by OpenAI as a "world simulator" — designed to enable AI models to understand and simulate the physical world in motion, thereby generating video content that conforms to physical laws.

But AI's exploration of physical-world simulation began much earlier. Over a decade ago, when the intersection of AI and 3D spatial design was still in its infancy, Yunqi Capital portfolio company Manycore Tech set out to simulate the world. Through continuous iteration of its rendering and simulation technology, it built the 3D home design platform Kujiale, which now has nearly 80 million monthly active users.

Recently, Manycore Tech publicly unveiled for the first time its two technical engines that bridge the physical and digital worlds — the Qizhen (rendering) engine and the Matrix (CAD) engine. In this edition of Yunqi Tech π, we explore how this unicorn company is breaking down barriers between the two-dimensional digital world and multi-dimensional physical space, pioneering the free simulation of the physical world.

This article is republished from "Silicon Star Pro" with the original title "What Gives This Chinese Company the Confidence to Call Itself a True 'Physical World Simulator'?" Author: Yoky

In 2024, the AI field is witnessing an interesting inflection point.

OpenAI's pace of progress has visibly slowed. GPT-5 remains elusive, "Scaling Law" has become something of a pipe dream, and even Sora — the video generation model that stunned the industry earlier this year — failed to deliver on its promise of "full open access."

Behind this phenomenon lies a deeper problem: large models trained on Internet data are hitting their cognitive boundaries. Simply stacking parameters and expanding training data no longer yields qualitative breakthroughs. Meanwhile, embodied intelligence, wearable devices, and the renewed attention on AR/VR technologies all point toward a common direction: AI must establish tighter connections with the physical world.

From Internet to World, from AI to Physical AI — this pivot marks a new phase in AI development. Physical AI is not merely superficial imitation of the real world; it requires embedding the fundamental laws and true characteristics of the physical world into the foundational design of AI systems. Its ultimate goal is to build a "multi-dimensional physical world simulator," far more complex than Sora's two-dimensional video generation.

Physical AI's relatively slower development stems not from lack of importance, but from sheer difficulty. First is the scarcity of physical world data — collecting real-world physical data is costly and challenging. Second is the fundamental difference in algorithmic paradigms — simulating geometric relationships, light propagation, mechanical laws, and other physical phenomena requires more than simple imitation of human neural networks. Finally, there are the massive computational resource demands, placing even greater strain on existing computing power.

Yet despite not yet entering mainstream discourse, Physical AI is already making tangible impacts across industries. In computer vision, it helps autonomous vehicles understand real road environments. In industrial manufacturing, it enables robots to execute complex tasks with greater precision. In the metaverse, it is building virtual spaces that conform to physical laws.

All of this suggests that Physical AI is not merely the next technology trend, but a bridge between the real and digital worlds. In this field full of possibility, some companies have already built up distinctive technical foundations, weaving the laws of the physical world into AI's future landscape.

The ImageNet of 3D

Anyone familiar with AI knows ImageNet. The emergence of this dataset was like a bombshell, completely transforming the trajectory of computer vision development.

In 2009, AI had reached a critical juncture for image recognition. Fei-Fei Li and her team at Stanford University keenly recognized the importance of datasets. They launched a project called "ImageNet," collecting images from the internet and annotating them manually. This massive undertaking ultimately encompassed over 14 million images across more than 20,000 categories, made freely available to all developers — laying the foundation for image recognition and classification technology.

ImageNet's influence far exceeded expectations. It not only provided training data but also spawned the famous ImageNet Challenge. The 2012 competition became a turning point in AI history.

Professor Geoffrey Hinton, who won this year's Nobel Prize in Physics, developed AlexNet together with two of his students. The model, based on convolutional neural networks (CNNs), lifted top-5 image recognition accuracy in the challenge from around 75% to roughly 84%, igniting the deep learning revolution. The breakthrough sent shockwaves through academia and industry, marking AI's formal entry into the image era.

Six years later, in 2018, a project that could be called "the ImageNet of 3D" was quietly born. The Computer Robot Vision Lab at Imperial College London partnered with a Chinese company to launch InteriorNet, a deep learning dataset for indoor scene cognition. It included 16 million sets of pixel-level labeled data, 15,000 sets of video data, totaling approximately 130 million images — for training and testing AI systems' visual recognition and understanding capabilities in indoor environments.

This was the world's largest indoor scene dataset to date, and it was fully open-sourced.

Surprisingly, the Chinese company involved in this breakthrough project was not a household name: Manycore Tech. But you would certainly recognize its 3D spatial design product: Kujiale.

Built on high-performance rendering of the physical world, Manycore Tech's platform has accumulated a massive library of design solutions and over 320 million 3D models. These inherently contain complete three-dimensional spatial information and capture designers' professional understanding of space. More importantly, the design solutions and product materials that its large user base keeps creating give Manycore a steady stream of new data while keeping that data accurate and diverse.

Why create a "3D version of ImageNet"? Rui Tang, Chief Scientist at Manycore Tech and head of Kujiale's KooLab, told us: "Once we had large amounts of spatial data, we began thinking about whether it could be applied in other R&D scenarios. Our collaboration with Imperial College London explored applications of spatial data in UAV flight simulation experiments. Since then, based on our physically correct rendering engine, we've applied spatial data to embodied intelligence and other cutting-edge technologies — this is what Manycore's spatial intelligence platform SpatialVerse does."

What makes indoor scene data so scarce? Compared to a two-dimensional image, a complete description of a three-dimensional object involves far more parameters: geometric relationships, material properties, spatial positioning, and more. Even describing an ordinary chair means precisely recording each component's dimensions and shape, the optical characteristics of its materials, and relational information such as the physical distance between the chair and a table. Traditional data collection is costly, and collecting physical data inside homes faces additional restrictions around privacy protection and regulatory compliance, which makes acquiring high-quality 3D data an industry-wide challenge.
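To make the chair example concrete, here is a minimal Python sketch of what one such record might hold. The class and field names (Material, SceneObject, roughness, metallic) are illustrative assumptions, not Manycore's actual schema.

```python
from dataclasses import dataclass
import math


@dataclass
class Material:
    # Hypothetical optical parameters, in the style of common PBR workflows.
    base_color: tuple[float, float, float]  # linear RGB in [0, 1]
    roughness: float                        # 0 = mirror-like, 1 = fully rough
    metallic: float                         # 0 = dielectric, 1 = metal


@dataclass
class SceneObject:
    name: str
    size_m: tuple[float, float, float]      # width, depth, height in metres
    position_m: tuple[float, float, float]  # object centre in room coordinates
    material: Material


def distance_m(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between the centres of two objects."""
    return math.dist(a.position_m, b.position_m)


wood = Material(base_color=(0.45, 0.30, 0.18), roughness=0.6, metallic=0.0)
chair = SceneObject("chair", (0.45, 0.50, 0.90), (1.0, 2.0, 0.45), wood)
table = SceneObject("table", (1.20, 0.80, 0.75), (1.0, 2.8, 0.375), wood)

print(f"chair-to-table distance: {distance_m(chair, table):.3f} m")
```

Even this toy record needs geometry, material, and position for every object, plus relations between objects, which hints at why 3D data is so much more expensive to capture than a flat image.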

However, while the design scenes on the platform are physically correct, they remain distant from real living conditions. On Kujiale there are numerous home design solutions in which tables and living rooms are perfectly tidy, whereas a real living room might have toys and household waste scattered about. Manycore's spatial intelligence platform addresses this by rendering authentic everyday elements into design scenes, bringing these virtual spaces closer to real-life conditions. A robot vacuum, for example, might previously have detected cat feces only by running into it, with unfortunate results. Now, after pre-training in virtual space, it can accurately identify cat feces and steer around it.
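The idea of adding everyday clutter to tidy design scenes can be sketched as a small randomization step. Everything here is hypothetical (the CLUTTER_TYPES list, the scatter_clutter function, the rectangular room and axis-aligned furniture footprints); it only illustrates the general technique of placing labeled objects on free floor area before rendering.

```python
import random

# Hypothetical clutter categories; a real pipeline would draw on a 3D asset library.
CLUTTER_TYPES = ["toy", "sock", "cable", "pet_waste", "paper_scrap"]


def scatter_clutter(room_w, room_d, furniture, count, seed=None, max_tries=1000):
    """Place up to `count` clutter items on free floor area.

    `furniture` is a list of axis-aligned footprints (x_min, y_min, x_max, y_max)
    in metres. Returns a list of (label, x, y) tuples that can double as
    ground-truth annotations for the rendered image.
    """
    rng = random.Random(seed)
    placed = []
    tries = 0
    while len(placed) < count and tries < max_tries:
        tries += 1
        x, y = rng.uniform(0, room_w), rng.uniform(0, room_d)
        # Rejection sampling: skip points that fall under a furniture footprint.
        if any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in furniture):
            continue
        placed.append((rng.choice(CLUTTER_TYPES), x, y))
    return placed


sofa = (0.0, 0.0, 2.0, 0.9)
items = scatter_clutter(room_w=4.0, room_d=3.0, furniture=[sofa], count=5, seed=42)
for label, x, y in items:
    print(f"{label:12s} x={x:.2f} y={y:.2f}")
```

Because the scene is synthetic, every placed item comes with an exact label and position for free, which is the property that makes such data useful for pre-training a detector.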

These advantages become apparent when Manycore provides customized data services to embodied intelligence, large model and AIGC, and AR/VR companies.

"The Wall-Breaker"

So the question becomes how to map real physical data into the digital world so that it can be manipulated. Doing so requires breaking through the "dimensional wall" between the two.

The spaces humans inhabit contain vast amounts of physical data, which can be translated into a language machines understand. Manycore's Matrix (CAD) engine, together with its self-developed billion-parameter multimodal CAD large model, translates design data generated from or existing in the physical world, makes it compatible, and lets it flow between the two worlds.

Imagine when an architect draws a wall line on a blueprint — this line represents not just a simple geometric shape, but contains information about wall thickness, material, position, and more. The multimodal CAD large model's reverse parsing engine acts like an experienced engineer, accurately identifying each element in the blueprint, understanding their relationships, and converting this unstructured information into three-dimensional structured data that computers can understand.
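As a rough illustration of what "structured" means for that wall line, the sketch below stores one wall with its thickness, height, and material, then extrudes the 2D centreline into a 3D box. The WallLine class and both functions are hypothetical and much simpler than a real CAD parser, which would first have to recognize the line in the drawing.

```python
import math
from dataclasses import dataclass


@dataclass
class WallLine:
    # One wall as it might be recovered from a 2D plan (all names are illustrative).
    start: tuple[float, float]   # centreline start point, metres
    end: tuple[float, float]     # centreline end point, metres
    thickness: float             # metres
    height: float                # metres
    material: str


def wall_to_box(wall: WallLine) -> list[tuple[float, float, float]]:
    """Extrude a 2D wall centreline into the 8 corner vertices of a 3D box."""
    (x0, y0), (x1, y1) = wall.start, wall.end
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        raise ValueError("wall has zero length")
    # Unit normal to the centreline, scaled to half the wall thickness.
    nx = -(y1 - y0) / length * wall.thickness / 2
    ny = (x1 - x0) / length * wall.thickness / 2
    footprint = [
        (x0 + nx, y0 + ny), (x1 + nx, y1 + ny),
        (x1 - nx, y1 - ny), (x0 - nx, y0 - ny),
    ]
    bottom = [(x, y, 0.0) for x, y in footprint]
    top = [(x, y, wall.height) for x, y in footprint]
    return bottom + top


def wall_volume(wall: WallLine) -> float:
    """Volume in cubic metres, the kind of quantity a material estimate needs."""
    (x0, y0), (x1, y1) = wall.start, wall.end
    return math.hypot(x1 - x0, y1 - y0) * wall.thickness * wall.height


wall = WallLine(start=(0.0, 0.0), end=(4.0, 0.0), thickness=0.2, height=2.8, material="brick")
vertices = wall_to_box(wall)
print(len(vertices), "vertices,", round(wall_volume(wall), 3), "cubic metres")
```

Once the line carries these attributes, downstream steps such as rendering or quantity estimation can work from the same record instead of re-reading the drawing.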

But merely understanding individual CAD drawings is far from sufficient. In actual engineering, design information often exists in multiple forms: 2D image drawings, 3D models, design specifications, even construction standards. This is where the multimodal CAD large model's support becomes essential. Compared to large language models, whose descriptions of space tend to be vague and uncertain, the CAD large model represents space in a more accurate and structured way.

This "brain" can simultaneously process multiple forms of input, extract key features, learn design rules, and ultimately unify all information into standardized digital expression.

After the multimodal CAD large model translates unstructured data from the physical world and generates three-dimensional structured data in the digital world, the other two core technologies of Manycore's Matrix (CAD) engine — the geometric parameterization engine and BIM engine — serve as another bridge between digital and physical worlds. After design completion, they reverse-convert to unstructured data, further generating renderings, construction drawings, and more to guide construction and production.

Simply put, this completes the full cycle from reverse parsing of the physical world to the digital world, then forward modeling from the digital world back to the physical world.

Viewed abstractly, this capability mirrors a form of human intelligence: compressing complex physical-world data with high fidelity, extracting its features, and storing the result in a computer while retaining the ability to reconstruct it.

The core algorithms for data processing, combined with large numbers of real-world application scenarios, enable Manycore to accumulate the largest spatial dataset.

"Do You Believe in Light?"

Machines can read the physical world once technology has processed its data into structure. For humans to read the digital world, there is a more fundamental obstacle: the digital world has no light.

All "visualization" in the physical world is essentially the result of photon movement — combinations of reflection, refraction, scattering, and a series of physical reactions. Without light, we cannot judge a chair's shape, color, position, or its mechanical relationship with other objects.

Presenting a chair is, fundamentally, reproducing the photon distribution in a specific space. In digital design, this process is called photorealistic rendering. Manycore Tech's rendering engine is built precisely on this principle, achieving physically correct visual effects by precisely calculating light's propagation path through space and simulating different materials' optical characteristics.

But rendering is only part of the technical system. During the design process, the system needs to simultaneously handle problems across multiple physical dimensions. A furniture design must satisfy not just visual aesthetics but also conform to mechanical principles. This is reflected to some degree in Manycore's products: the system can perform mechanical analysis on designs during the design phase, promptly identifying structural issues.

In achieving photorealistic rendering, Manycore Tech's rendering engine employs Physically Based Rendering (PBR). The core of this rendering technology is solving the rendering equation, simulating real-world lighting effects by calculating the interaction between light and object surfaces. When processing each material, the system considers its micro-surface structure, including surface roughness, metallic properties, and other physical attributes, thereby accurately reproducing the material's reflective characteristics.
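For reference, the rendering equation named above is usually written in the form below. The notation is the standard textbook one and is an assumption here, since the source does not give symbols: L_o is the radiance leaving surface point x in direction ω_o, L_e is emitted radiance, f_r is the material's reflectance function (where roughness and metallic properties enter), L_i is incoming radiance from direction ω_i, and n is the surface normal.

```latex
L_o(x, \omega_o) = L_e(x, \omega_o)
  + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (n \cdot \omega_i)\, \mathrm{d}\omega_i
```

The integral runs over the hemisphere Ω above the surface. It has no closed-form solution for general scenes, so renderers estimate it numerically, typically by sampling many light paths.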

Especially when designers adjust indoor lighting, Manycore's Qizhen (rendering) engine, based on ray tracing technology, can simulate optical phenomena in virtual scenes as they would occur in the physical world — including reflection, refraction, scattering, and more — thereby delivering rendering effects comparable to the real world, making creators' work more realistic. Moreover, with AI technology enhancing photorealism in lighting, color, and other elements, the Qizhen engine has overcome traditional renderers' difficulties in rendering organic materials, and can render 99% of materials found in the physical world.
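Two of the optical phenomena listed above, reflection and refraction, reduce to short vector formulas that any ray tracer evaluates at each surface hit. The sketch below is a generic textbook version in plain Python, not code from the Qizhen engine; the function names and the tuple-based vectors are choices made for this illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reflect(d, n):
    """Mirror unit direction d about unit normal n: r = d - 2 (d . n) n."""
    k = 2 * dot(d, n)
    return tuple(di - k * ni for di, ni in zip(d, n))


def refract(d, n, eta):
    """Refract unit direction d at a surface with unit normal n.

    eta = n1 / n2 is the ratio of refractive indices (incoming side over
    outgoing side). n must point toward the incoming side. Returns None
    on total internal reflection.
    """
    cos_i = -dot(d, n)
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell's law, squared
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * di + (eta * cos_i - cos_t) * ni for di, ni in zip(d, n))


s = math.sqrt(0.5)
incoming = (s, -s, 0.0)   # travelling down and to the right at 45 degrees
normal = (0.0, 1.0, 0.0)  # surface facing up

print("reflected:", reflect(incoming, normal))
print("air to glass:", refract(incoming, normal, 1.0 / 1.5))
print("glass to air:", refract(incoming, normal, 1.5))
```

The last call returns None because at 45 degrees a ray inside glass cannot exit into air, which is the total internal reflection case a renderer has to handle separately.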

The core technological advantage stems from data accumulation. Because large amounts of data originate from the physical world and are applied back to the physical world, this process naturally completes one of the most important aspects of Physical AI: physical correctness.

Compared with generative AI products currently in the spotlight, such as Sora, Manycore's solution demonstrates clear advantages in physical correctness. The root of this difference lies in the nature of the training data: Sora relies primarily on two-dimensional video data, which, while visually rich, lacks an essential connection to the physical world. The frequent physical errors in its generated content, such as implausible object motion or material representation, are direct manifestations of this limitation.

By contrast, what Manycore Tech has accumulated over years is complete three-dimensional data, encompassing geometric information, physical parameters, material properties, and multiple other dimensions. This data is not only validated by professional designers but more importantly maintains close ties to actual engineering practice. Combined with dedicated physics engines, this data supports a physical world simulator that comes closer to reality.

"When 4 GPUs Changed the World"

In Hinton's story, the global sensation had two causes: first, the cliff-like improvement in recognition accuracy; second, the fact that AlexNet achieved it using just 4 NVIDIA GPUs, outperforming Google's "cat" neural network, which had been built with 16,000 CPU cores. This achievement shocked academia and industry, completely altering deep learning's trajectory.

People discovered that in specific computing scenarios, efficiency of compute usage far outweighs scale.

In theory, the photons in a space are effectively infinite in number, and their interactions are as complex as those of neurons in the human brain. Correctly reproducing the physical world therefore places extremely high demands on underlying compute power. Whether a system can simultaneously trace 10 photons, 100 photons, or 10,000 photons determines rendering speed, which in turn depends on parallel computing efficiency.
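Why the number of simultaneously traced photons matters can be seen in a toy Monte Carlo estimate. The sketch below is an assumption-laden illustration rather than a renderer: it estimates the light arriving at a surface point under a uniformly bright sky, where the exact answer is π. Each sample is independent, which is why this kind of workload maps well onto parallel hardware, and the error of such estimates shrinks roughly with the square root of the sample count.

```python
import math
import random


def estimate_irradiance(num_samples, rng):
    """Monte Carlo estimate of the integral of cos(theta) over the hemisphere.

    Assumes constant incoming radiance of 1 from every direction, so the
    exact answer is pi. Directions are sampled uniformly over the hemisphere
    (pdf = 1 / (2 * pi)), for which cos(theta) is uniform on [0, 1].
    """
    pdf = 1.0 / (2.0 * math.pi)
    total = 0.0
    for _ in range(num_samples):
        cos_theta = rng.random()
        total += cos_theta / pdf
    return total / num_samples


rng = random.Random(0)
for n in (10, 100, 10_000):
    estimate = estimate_irradiance(n, rng)
    print(f"{n:>6} samples: estimate={estimate:.4f}, error={abs(estimate - math.pi):.4f}")
```

Halving the noise takes about four times as many samples, so image quality at a fixed time budget depends directly on how many samples the hardware can process in parallel.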

Manycore's story similarly began with "unlocking" GPU compute power. The three founders had long focused on computer graphics, high-performance computing, and related directions. Before founding the company, Xiaohuang Huang had worked at NVIDIA on CUDA development. It was then that a book titled Physically Based Rendering: From Theory to Implementation sparked his curiosity about exploring Physical AI.

But at that time, reproducing the physical world through rendering took around 1-2 hours per image on average, and a single image could cost as much as 1,000 yuan.

Could a lower-cost cloud GPU cluster, using inexpensive graphics cards, achieve commercial supercomputing performance, driving down the price and time cost of "rendering" — or even achieve better rendering effects?

In 2011, Manycore's founding team assembled a high-performance GPU cluster using low-cost graphics cards in an edge-cloud collaborative architecture, and through optimizing compute resource scheduling strategies, substantially improved GPU utilization rates. This dramatically reduced compute costs while achieving faster calculation speeds.

Across Manycore's four generations of rendering engine updates, the first generation reduced rendering time from hours to minutes through basic parallel optimization and rendering pipeline reconstruction. The second generation enhanced photorealistic rendering capabilities.

The third generation achieved cloud real-time and ray tracing, allowing designers to create in real-time environments through self-developed algorithms and dynamic load balancing. The fourth generation unveiled today fuses rendering and AI, achieving substantial improvements in rendering speed, realism, versatility, and intelligence.

From hours to seconds to real-time, reducing output costs from 1,000 yuan to free — this is essentially the result of further improving computational efficiency while guaranteeing computational quality.

The team's deep CUDA development background also brought Manycore unique advantages. They intimately understood GPU architecture characteristics, enabling bottom-level optimization of computational efficiency. For example: optimizing memory access patterns to reduce data transfer overhead; intelligent task allocation to improve GPU core utilization; reconstructing computation pipelines to maximize parallel computing effects. These technical accumulations give Manycore clear performance advantages when handling complex rendering tasks.

For instance, through intelligent analysis of scene complexity and advance planning of compute resource allocation, GPU utilization efficiency is substantially improved. The system can dynamically adjust rendering strategies based on different scene characteristics, maximizing computational efficiency while guaranteeing quality.
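One simple way to picture "planning compute allocation in advance from scene complexity" is a greedy scheduler that assigns the most expensive jobs first, each to the currently least-loaded GPU. This is a standard heuristic (longest processing time first) used here purely as an illustration; the source does not describe Manycore's scheduler, and the job names and cost estimates are made up.

```python
import heapq


def schedule_jobs(job_costs, num_gpus):
    """Greedy longest-processing-time-first assignment.

    job_costs maps a job name to an estimated cost (for example, predicted
    seconds derived from scene complexity). Returns (assignments, makespan),
    where makespan is the load of the busiest GPU.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    # Min-heap of (current load, gpu index): the least-loaded GPU pops first.
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    assignments = {gpu: [] for gpu in range(num_gpus)}
    for name, cost in sorted(job_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, gpu = heapq.heappop(heap)
        assignments[gpu].append(name)
        heapq.heappush(heap, (load + cost, gpu))
    makespan = max(load for load, _ in heap)
    return assignments, makespan


jobs = {"villa": 90, "kitchen": 40, "bedroom": 30, "bathroom": 20, "hallway": 10}
plan, makespan = schedule_jobs(jobs, num_gpus=2)
print(plan)
print("busiest GPU load:", makespan)
```

With these made-up costs the total work is 190 units, so two GPUs can never finish in under 95; the greedy plan reaches 100. Utilization hinges on keeping all devices busy for about the same length of time.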

In industrial design, real-time rendering technology can be used for product prototype verification, substantially reducing physical sample production costs. In architectural design, it supports real-time scheme adjustment and effect preview, improving design efficiency. In virtual reality, real-time capability is the foundation for immersive experiences.

With computing capability, practical application scenarios, and long-term data accumulation in hand, a new space of imagination is opening up.

Just as 4 GPUs changed the trajectory of AI development back then, the door to Physical AI is now being pushed open.

Conclusion

In 1993, Jensen Huang co-founded NVIDIA with partners Chris Malachowsky and Curtis Priem. At its founding, they aimed to provide high-performance graphics processing solutions for the personal computer market.

As premium gaming developed, high-definition effects and elaborate animations drove ever-increasing demand for graphics processing power in the video game industry. In 1999, NVIDIA launched the GeForce 256, which it marketed as the world's first "GPU" (Graphics Processing Unit). GPUs could handle complex 3D graphics tasks, but at that stage they were used almost exclusively to improve game visual effects.

In 2006, Jensen Huang began pushing NVIDIA to develop the CUDA platform, enabling developers to leverage GPU's powerful computing capabilities for various complex computational tasks.

With everything in place, the AI wave arrived.

As deep learning algorithms developed, demand for computing power increased dramatically, and the GPU's parallel processing capabilities made it the natural choice for AI research and applications. GPUs were subsequently applied across gaming, professional visualization, autonomous driving, cloud computing, large models, and other technology domains. NVIDIA went on to launch RTX (real-time ray tracing) and DLSS (deep learning super sampling), further enhancing graphics processing and AI application performance.

Opportunities of an era always favor those who are prepared.

For Manycore, perhaps its moment has arrived as well.