Jensen Huang's Vision of a "Universal Translator" — How Does It Cross the Visual Divide? | Yunqi Tech π

Yunqi Capital · November 29, 2024

The Visual Puzzle of Large Models

How can AI capabilities be truly embedded into the physical world to unlock greater technological value? Around this industry-wide question, Yunqi Capital has recently engaged with the community through various formats, including research reports and portfolio updates, to collectively piece together a fuller picture.

The allure of exploration lies in fresh perspectives and new attempts. Recently, at the IDEA (International Digital Economy Academy) Annual Forum, a discussion on vision models and embodied intelligence approached the problem from the angle of building general-purpose vision models. It analyzed the obstacles that large models, which Jensen Huang calls the "universal translator," must overcome when they land in the messy complexity of the physical world.

This edition of "Yunqi Tech π" shares highlights from that discussion. May these diverse perspectives offer some inspiration at a time when consensus remains elusive.

This article is excerpted from "Xinpi Ceng NewNew Thing," originally titled "The Vision Problem of Large Models | Xinpi Ceng," by Yangyang Wu.

On November 23, Jensen Huang flew to Hong Kong for a conversation with Harry Shum at the Hong Kong University of Science and Technology. This time, he notably refrained from mentioning his pet topic of "accelerated computing," and instead touched on something more fundamental and more insightful.

As NVIDIA's CEO for more than 30 years, Jensen Huang has witnessed multiple waves of artificial intelligence. He believes the essence of the last AI wave was the "Universal Function Approximator": deep neural networks could, through reinforcement learning, fit any function. This technology gave us image recognition. Transformer-based generative AI, by contrast, is a "Universal Translator": beyond translating one language into another, it can translate text into images, video, proteins, or programming languages, and vice versa. All it needs for training is enough paired data: text-image, text-video, text-protein, or text-code.

Being a "translator" means generative AI doesn't derive answers from first principles the way humans do; it finds patterns in the data it observes and produces answers from them. In that sense, it is only "simulating" language, "simulating" physics, "simulating" intelligence. Yet Huang believes this simulation holds tremendous value: it could fundamentally change how scientists discover new theories. "In many scientific fields, we understand first principles, we understand the Schrödinger equation, Maxwell's equations, and many such equations, but we cannot simulate and understand large systems," Huang said. If we train an AI capable of simulating extremely large-scale systems, such as climate systems, ocean systems, financial markets, proteins, and biological systems, we can understand them better than we could when such systems were beyond our ability to simulate. In other words, simulation facilitates understanding. Ultimately, the paradigm of scientific research will shift from humans learning from data to discover patterns and intelligence to machines doing the same.

Harry Shum responded by citing a University of Washington professor's post on X: "Top American universities haven't contributed that many groundbreaking papers over the past decade; the astonishing work has mostly come from top companies like NVIDIA, Microsoft, OpenAI, and Google DeepMind — partly because they have sufficient compute."

That UW professor shared this insight a few years ago, before ChatGPT's release, when artificial neural networks were still primarily "universal function approximators," not yet "universal translators." Today, machines' ability to learn from data is clearly far more powerful. Yet the discussion on vision models and embodied intelligence at the IDEA Annual Forum also shows that the boundaries of the "universal translator" are becoming apparent — particularly when entering the real physical world.

Existing data is nearly exhausted; synthetic data is the next billion-dollar problem

At the IDEA Annual Forum, held the day before his conversation with Huang, Shum noted that OpenAI's training of GPT-4 had nearly exhausted the usable data on the internet, some 20 trillion tokens. He estimates that training GPT-5 would require roughly 10 times that amount, a volume that humanity has not produced since the dawn of civilization. If the Scaling Law (more training data makes models smarter) still holds, data must be deliberately synthesized in bulk rather than left to emerge naturally from human society.

In terms of data types, we need both "pre-training" data that keeps the Scaling Law effective and strongly logical data that o1-like models can use for reinforcement learning. Shum believes synthetic data is a billion-dollar problem.

General language models exist; general vision models do not

To date, Sora, which pioneered visual generation models, remains in testing and has not officially launched. The ChatGPT moment for vision has yet to arrive. DINO-X, released by IDEA on November 22, claims to detect nearly all objects in open environments. Fei-Fei Li's "spatial intelligence" startup reportedly uses this DINO model in its workflow: robots first detect the various objects in a space, then extract their 3D structures using 3D techniques. However, DINO-X is not a general vision model. It can identify objects from prompts only in 2D, not in 3D.

Lei Zhang, head of IDEA's Computer Vision and Robotics Research Center, explained that general vision models are difficult to build partly because visual problems have highly diverse inputs and outputs. Language models take in and output sequences of words. Vision models, by contrast, sometimes take images, sometimes video, sometimes 2D, sometimes 3D — and output formats are equally varied: classifying images or videos, describing images, providing object coordinates, or obtaining 3D structures. "At the algorithmic level, vision models haven't achieved a truly unified form," Zhang said.

Vision's largest application scenario is embodied intelligence — applying vision models to robotics, autonomous driving, drones, and similar fields. This means 3D data isn't optional for vision models; it's essential. Yet such data is the scarcest of all visual data. Moreover, how to fuse visual data with depth sensors, tactile sensors, and other multi-dimensional data remains a challenge.

Vision models are a short step from embodied intelligence, but coordination with motor systems is a hurdle

Does collecting enough 3D data mean robots can achieve human-like spatial intelligence? The answer may be no. Zhang noted that to grasp an object, current robots must still calculate its X, Y, and Z coordinates, plan a motion accordingly, and only then execute the grasp with a robotic arm; the entire process is like "grabbing with your eyes closed." The human brain doesn't calculate 3D coordinates, and the human eye provides real-time visual feedback on whether a grasp will succeed. In other words, vision in embodied intelligence isn't purely a vision problem; it involves collaboration with motor systems.
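The contrast Zhang draws can be illustrated with a toy sketch in Python. It is an illustration under stated assumptions, not a description of any real robot stack: the function names, the gain, and all numbers are hypothetical, and the closed-loop version assumes the camera reports the remaining hand-to-target offset without noise.

```python
from typing import Callable, List

Vec = List[float]  # hypothetical 3D position in meters: [x, y, z]


def distance(a: Vec, b: Vec) -> float:
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def open_loop_reach(estimated_target: Vec) -> Vec:
    """Plan once from a single estimate, then move without looking again.

    The hand ends up wherever the one-shot estimate said the object was.
    """
    return list(estimated_target)


def closed_loop_reach(start: Vec,
                      observe_offset: Callable[[Vec], Vec],
                      gain: float = 0.5,
                      steps: int = 20) -> Vec:
    """Re-observe the remaining offset at every step and correct toward it."""
    pos = list(start)
    for _ in range(steps):
        offset = observe_offset(pos)  # what the camera sees now: target minus hand
        pos = [p + gain * o for p, o in zip(pos, offset)]
    return pos


true_target = [0.40, 0.10, 0.25]
estimated_target = [0.43, 0.08, 0.25]  # one-shot estimate, off by a few centimeters

open_err = distance(open_loop_reach(estimated_target), true_target)
closed_err = distance(
    closed_loop_reach(
        [0.0, 0.0, 0.0],
        lambda pos: [t - p for t, p in zip(true_target, pos)],
    ),
    true_target,
)
```

In this toy setup the open-loop error stays at whatever the initial estimate got wrong, while the closed-loop error shrinks by half at every step. Real visual feedback is noisy and delayed, so this only sketches the direction of the difference.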

"Our current vision models and language models process vision and language differently from humans," said Han Lei, head of the Intelligent Agent Center at Tencent Robotics X Lab. The industry acknowledges this difference, which means robots may need to find their own solutions to many problems, many of which remain unsolved. Take "rapid decision-making": when a human hand touches a very hot cup, it retracts immediately, without waiting for the brain to identify what the hand is touching before deciding whether to pull back. Large models' "predict the next token" decision mode may produce "hallucinations," while reasoning models that try to reduce such erroneous predictions (like o1) require longer "thinking" time. Han believes embodied intelligence still needs to learn from humans at the algorithmic level: some tasks require slow, deliberative "brain" thinking, while others can be delegated to the "cerebellum" for rapid response.
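The "brain" and "cerebellum" split Han describes can be sketched as a two-tier controller. This is a minimal illustration in Python, not Tencent's design: the threshold, the reading format, and the action names are all hypothetical, and `deliberate` is only a stand-in for a slow model call.

```python
HOT_THRESHOLD_C = 60.0  # hypothetical temperature above which the reflex fires

deliberate_calls = []  # records each time the slow path actually ran


def reflex(reading):
    """Fast path ("cerebellum"): a cheap check that needs no object recognition."""
    if reading["temperature_c"] > HOT_THRESHOLD_C:
        return "retract"
    return None  # nothing urgent, so hand over to the slow path


def deliberate(reading):
    """Slow path ("brain"): stand-in for an expensive reasoning step."""
    deliberate_calls.append(reading["label"])
    return "grasp" if reading["label"] == "cup" else "ignore"


def act(reading):
    """Try the reflex first; deliberate only when no reflex fires."""
    fast = reflex(reading)
    return fast if fast is not None else deliberate(reading)


hot = act({"temperature_c": 90.0, "label": "cup"})
cool = act({"temperature_c": 25.0, "label": "cup"})
```

Here the hot reading triggers a retraction without the slow path ever running, while the cool reading falls through to deliberation. Which tasks belong on which tier is exactly the open design question in the passage above.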

Mao Yinian, Meituan VP and head of the Drone Business Unit, discussed the problem of multi-sensory competition in robots. He explained that humans have multiple complementary sensory organs — eyes, nose, ears — with different priorities during task execution. When grabbing a cup, for instance, vision dominates decision-making before contact; after touching the cup, touch and other bodily sensations take over, and one can even stop looking at the cup while bringing it to drink. This modality switching is almost seamless, which robots today cannot achieve — they don't even know which sensors are reliable and which aren't.

Take drones: to handle various flight environments, such as sunny days, nighttime, rain, and urban canyons, they already carry devices for navigation, positioning, and velocity estimation that rely on satellites, radio signals, cameras, millimeter-wave radar, LiDAR, and barometers. Yet when it comes time to make final decisions, drones cannot determine which sensor is most trustworthy. Moreover, when a dominant signal disappears or is disrupted, the human brain automatically switches to another sense: entering a dark room from a bright one, a person shifts from visual navigation to tactile navigation by feeling the walls. Drones cannot make this switch. "They don't have this knowledge; they can't even distinguish ground from sky," Mao said.
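One textbook way to combine sensors of differing reliability is inverse-variance weighting, sketched below in Python. It is offered only as an illustration and is not Meituan's method. The sensor names and numbers are hypothetical, and the sketch assumes each sensor reports a trustworthy variance, which is precisely the knowledge Mao says drones lack today.

```python
from typing import Dict, Optional, Tuple

# (estimate, variance) for one quantity, or None when the signal is lost
Reading = Optional[Tuple[float, float]]


def fuse(readings: Dict[str, Reading]) -> Tuple[float, Dict[str, float]]:
    """Fuse available readings, weighting each by 1 / variance.

    Sensors that report None are skipped, so the remaining ones take over.
    Variances are assumed to be strictly positive.
    """
    available = {name: r for name, r in readings.items() if r is not None}
    if not available:
        raise ValueError("no sensor available")
    inverse = {name: 1.0 / var for name, (_, var) in available.items()}
    total = sum(inverse.values())
    weights = {name: w / total for name, w in inverse.items()}
    estimate = sum(weights[name] * available[name][0] for name in available)
    return estimate, weights


# Hypothetical altitude readings in meters, with hypothetical variances.
day = {"satellite": (50.0, 4.0), "barometer": (52.0, 1.0), "camera": (51.0, 0.25)}
night = {"satellite": (50.0, 4.0), "barometer": (52.0, 1.0), "camera": None}

day_estimate, day_weights = fuse(day)
night_estimate, night_weights = fuse(night)
```

With these numbers the camera dominates by day, and the barometer takes over at night once the camera drops out. The hard part in practice is the case the sketch ignores: a sensor that keeps reporting confidently while being wrong.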

Is end-to-end the solution?

Tesla's exploration in autonomous driving has led many to see the potential of "end-to-end" approaches — training robots on enough scenarios until they produce specific behaviors upon seeing or receiving certain signals, thereby gaining the ability to handle various situations. This model essentially treats sensor signals as one language and vehicle or robot behavior as another, with the model serving as translator between sensor language and behavioral response.
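The "translator" framing can be made concrete with a deliberately tiny sketch in Python: a policy that maps a sensor vector straight to a behavior by copying the closest demonstration. Real end-to-end systems use large neural networks, so the nearest-neighbor lookup here is a swapped-in stand-in, and the signal layout, demonstrations, and action names are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[Sequence[float], str]  # (sensor signal, behavior shown by a demonstrator)


def nearest_neighbor_policy(demos: List[Demo]) -> Callable[[Sequence[float]], str]:
    """Build a policy that maps signals to behaviors using demonstrations only.

    There is no hand-written perception or planning stage in between:
    the mapping from signal to behavior comes entirely from the examples.
    """
    def policy(signal: Sequence[float]) -> str:
        def squared_distance(demo: Demo) -> float:
            return sum((a - b) ** 2 for a, b in zip(demo[0], signal))
        return min(demos, key=squared_distance)[1]
    return policy


# Hypothetical signal: (lateral offset from lane center in m, gap to lead vehicle in m)
demos: List[Demo] = [
    ((0.0, 30.0), "keep_lane"),
    ((0.5, 30.0), "steer_left"),
    ((-0.5, 30.0), "steer_right"),
    ((0.0, 5.0), "brake"),
]

policy = nearest_neighbor_policy(demos)
drifting = policy((0.4, 28.0))
closing_in = policy((0.1, 6.0))
```

The policy can only echo behaviors that appear in its demonstrations, which is a small-scale picture of why such approaches depend so heavily on how much scenario data has been collected.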

Zhang noted that over the past year, numerous robotics companies have attempted to solve robots' 3D spatial motion problems through end-to-end approaches, achieving "stunning" results, particularly in arm manipulation. Yet this path still faces two challenges. First, the perennial problem of end-to-end models: they're black boxes — when something goes wrong, no one can explain or fix it. Second, Zhang believes these models lack sufficient "generalization" — a robot learning to grasp a part of a certain shape upon seeing it cannot easily extend this capability to other domains, like doing laundry.

"Autonomous driving can do end-to-end because Tesla, for example, has massive numbers of production vehicles driving normally on roads, giving it sufficient data for that scenario," Zhang said. But such a system works only for driving. Cars existed long before autonomous vehicles, so there was precedent for carrying data over. Large numbers of robots, by contrast, have never existed in our world; data for robot scenarios must be accumulated from scratch, and it is orders of magnitude smaller than internet data.