35 Days Until Robots Enter Your Home: X Variable Robotics Unveils World's First World Unified Model | Yunqi Capital Portfolio
AI Goes Deep Into Real Life

A slipper kicked under the couch. A cat knocking over a glass of water. For humans, these are trivial matters. For robots, they remain formidable hurdles.
On April 21, X Variable Robotics — an early Yunqi Capital portfolio company — unveiled its "new solution" for bringing robots into homes: robots equipped with the next-generation embodied intelligence foundation model WALL-B will enter real user households in 35 days. The model is the world's first embodied foundation model based on the World Unified Model (WUM) architecture, potentially overcoming the limitations of traditional VLA architectures when facing complex consumer scenarios.
This edition of "Yunqi Partners" takes you inside the details.
The following is adapted from X Variable Robotics:
On April 21, X Variable Robotics held a product launch, introducing its next-generation robot-for-the-home initiative. In one month, its robots will be powered by the company's self-developed next-generation embodied intelligence foundation model WALL-B. This is the world's first embodied intelligence foundation model built on the World Unified Model (WUM) architecture, marking a major leap for embodied foundation models from VLA architecture toward native multimodal fusion architecture.

X Variable founder and CEO Wang Qian and co-founder and CTO Wang Hao delivered a comprehensive breakdown of WALL-B's technical architecture, data strategy, and training mechanisms, announcing that in 35 days, the next-generation WALL-B-equipped robots will enter real households for the first time, embarking on their journey to serve family life.
The Home Is Embodied Intelligence's True Proving Ground
"Seven a.m. The alarm goes off. You get out of bed and walk to the living room. Your slippers are nowhere to be found. The kitchen dishes are unwashed. Your kid's backpack is on the floor. The cat has knocked over a glass of water." Wang Qian opened with this everyday scene, vividly revealing the essence of home environments — random, fragmented, constantly shifting. Currently, no robot on Earth can independently complete the integrated tidying tasks in this scenario without remote operation.
This reality contrasts sharply with public perception. Backflips, breakdancing, calligraphy — these robot demos on stage pack visual punch, but they are essentially "command-line robots" running preset trajectories, with every move pre-programmed or remotely controlled. Industrial robots already deployed in factories don't offer a useful comparison either: in a factory, one action may repeat ten thousand times under identical conditions; in a home, ten thousand actions may each happen once, under different conditions every time.

"The hardware is already there — bipedal locomotion, dexterous hands, force-controlled joints. But the brain hasn't caught up. The core bottleneck for current robots isn't the body, it's intelligence. Every second in a home environment can bring a novel event: when the cat jumps on the table, where the child throws a toy, how carpet friction differs completely from lab flooring. Existing technology cannot handle this randomness and fragmentation, and getting robots into homes is regarded as 'one of the hardest technical problems of our era'."
From WALL-A to WALL-B: VLA Architecture's Limits and Breakthrough
From its founding, X Variable Robotics has focused on building the "brain" for robots — an end-to-end embodied intelligence foundation model. In late 2024, the company released its first-generation embodied foundation model WALL-A, based on the VLA (Vision-Language-Action) architecture. In September 2025, it open-sourced WALL-OSS, a lightweight version using the same architectural approach.
On the application front, X Variable partnered with 58.com to deploy WALL-A-equipped robots in real households, where they work alongside cleaning professionals. The company describes this as the world's first deployment of robots serving complex domestic life in real homes, and the first large-scale consumer-facing deployment in complex environments.
These real-world deployments revealed the "ceiling" of VLA architecture to the team. Wang Hao explained that VLA is essentially three independent modules stitched together: a vision module for object recognition, a language module for instruction understanding, and an action module for trajectory generation.
Data passes sequentially through these modules, with information loss and latency at every boundary. More fundamentally, VLA models can only imitate trajectories seen in training data, without truly understanding the laws of the physical world. "It doesn't understand why a cup falls, or why a plate teetering on a table edge needs to be pushed back. It's just repeating what it's seen."
WALL-B is the response to this impasse. It is not the next version of WALL-A, but a complete rewrite from underlying architecture to training paradigm.
World Unified Model (WUM): From "VLA" to "Unified Whole"
What truly distinguishes WALL-B from other industry solutions is its architectural revolution from VLA to WUM.
The design philosophy parallels Apple Silicon's unified memory architecture: before the M1 chip, Mac CPUs, GPUs, and memory were separate, with data movement creating latency and waste that bottlenecked performance; Apple unified all processing units to share the same memory pool, dramatically boosting performance.
In robotics, VLA resembles the pre-M1 laptop architecture — vision, language, and action modules each doing their own thing, data shuttling between them, losing information with every transfer. Rich visual learning gets reduced to a blurry summary by the time it reaches the action module.
WUM, as adopted by WALL-B, applies the same core idea — putting vision, language, action, physical prediction, and all other capabilities into the same network, jointly trained from scratch as an integrated whole, eliminating inter-module boundaries and data movement waste.

X Variable co-founder & COO Yang Qian presenting simultaneously at the Shenzhen satellite venue
Based on this architecture, WALL-B achieves three core technical characteristics that distinguish it from existing industry models:
First, Native Multimodality
From day one of training, WALL-B is jointly trained on synchronously annotated multimodal data spanning vision, audio, language, touch, and action, achieving "multimodal in, multimodal out." This means the model doesn't need to "pass messages" between modules: it prepares to reach as it sees the cup, and adjusts its force as it feels the weight.
This architecture also endows the model with a capability called "native proprioception" for the first time: WALL-B can intrinsically perceive its own spatial dimensions — height, width, arm reach — and judge whether it can pass through a space or reach an object, without continuous self-observation or heavy reliance on external sensors. This is an endogenous spatial awareness, not obtained through external measurement or modeling. Wang Hao noted that even many animals lack this capability.
Second, A "Worldview" of Physical Reality
WALL-B perceives and predicts basic physical quantities and laws, including gravity, inertia, friction, and velocity. In never-before-seen scenarios, such as a plate hanging halfway off a table edge, the model can infer that the plate will fall and break, then take preventive action.
This understanding of physical laws provides the foundation for zero-shot generalization. Physical laws remain consistent across environments; in any home it's never visited, WALL-B can leverage its understanding of basic physical commonsense to handle new situations, without retraining for each household.
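The plate example comes down to a textbook statics rule: a rigid object resting on a flat surface tips over the edge once its center of mass is no longer above the supporting surface. The short Python sketch below illustrates only that rule. It is not WALL-B's method, which the company describes as learned rather than hand-coded, and the function names, the uniform-disc assumption, and the 2 cm safety margin are all hypothetical.

```python
def will_tip(diameter_cm: float, overhang_cm: float) -> bool:
    """Return True if a uniform circular plate would tip off a table edge.

    Assumption (illustrative only): the plate is a uniform disc, so its
    center of mass is at its geometric center. It tips once more than
    half of its diameter hangs past the edge.
    """
    if diameter_cm <= 0:
        raise ValueError("diameter must be positive")
    if not 0 <= overhang_cm <= diameter_cm:
        raise ValueError("overhang must be between 0 and the diameter")
    return overhang_cm > diameter_cm / 2


def safe_push_back_cm(diameter_cm: float, overhang_cm: float,
                      margin_cm: float = 2.0) -> float:
    """Distance to push the plate inward so that its center of mass
    ends up at least margin_cm inside the table edge."""
    target_overhang = max(diameter_cm / 2 - margin_cm, 0.0)
    return max(overhang_cm - target_overhang, 0.0)
```

For a 24 cm plate with 13 cm hanging over the edge, `will_tip(24, 13)` is true, and `safe_push_back_cm(24, 13)` suggests pushing it back 3 cm. Because the rule depends only on geometry, it holds in any home, which is the intuition behind the generalization claim above.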
Third, Interacting with the World and Self-Evolving
This is the most fundamental characteristic distinguishing WUM architecture from all existing VLA models. Currently, mainstream robots typically halt on task failure, returning error messages without learning from failure. WALL-B behaves differently: after failure, it adjusts strategy and retries; if successful, it directly updates this successful experience into its model parameters.

This mechanism enables the model to self-iterate in real environments, without engineers retraining it, without manual data injection, and without returning to the lab. Wang Hao compared it to a person learning to use chopsticks: they are dropped countless times, but each failure refines hand control until a stable skill forms. According to the company, WALL-B overcomes the Transformer architecture's difficulty with long-term internalized memory: experience is written back into the model itself through a human-memory-like mechanism, stored as native multimodal memory.
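The fail, adjust, retry, then internalize loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not X Variable's implementation: the `Policy` class, its single `grip_force` parameter, the `attempt_grasp` environment, and the simple averaging update are hypothetical stand-ins for what the article describes as updates to model parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    """Hypothetical one-parameter 'model': the grip force it tries first."""
    grip_force: float = 1.0


def attempt_grasp(force: float, required: float = 3.0,
                  tolerance: float = 0.5) -> bool:
    """Toy environment: the grasp succeeds only if force is close to required."""
    return abs(force - required) <= tolerance


def run_task(policy: Policy, max_retries: int = 10,
             step: float = 0.5, lr: float = 0.5) -> Optional[int]:
    """Return the number of attempts needed, or None if every retry failed."""
    force = policy.grip_force
    for attempt in range(1, max_retries + 1):
        if attempt_grasp(force):
            # On success, move the stored parameter toward what worked.
            policy.grip_force += lr * (force - policy.grip_force)
            return attempt
        force += step  # On failure, adjust the strategy and retry.
    return None  # Give up and leave the policy unchanged.
```

Calling `run_task` four times on the same `Policy()` takes 4, 3, 2, and then 1 attempt, because each success shifts the starting force closer to the value that worked. That is the sense in which experience accumulates without an engineer retraining anything.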
Data Strategy: From "Sugar Water" to "Milk"
Currently, most industry training data comes from laboratories: fixed lighting, fixed object positions, disturbance-free environments. Wang Hao likens such lab data to "sugar water data" — clean, controllable, abundant, but significantly divergent from the real world, especially from homes with ever-changing natural light, casually placed objects, and random movements of children and pets. Models trained on such data rapidly fail in real environments.
By contrast, what Wang Hao calls "milk data" — noisy, variable, randomness-filled data from real home environments — is the path X Variable has chosen.

To obtain such data, the X Variable team entered hundreds of volunteers' real homes for model training. Every household differed in layout, lighting, object placement, and degree of mess. Some had floors scattered with slippers, delivery boxes, toys, and socks; some had cats suddenly jumping on tables; some had warm-toned kitchen lighting and cool-toned living rooms. These variables cannot be simulated in labs, yet they are everyday reality in homes — precisely the real conditions models must learn to handle.
In summary, X Variable's data strategy can be described as: lab data for foundation, real scenarios for quality. Lab data builds basic capabilities — recognizing common objects, executing basic actions; real home data teaches the model to survive in uncertain environments. A data flywheel driven by genuinely random, unpredictable real-world data is the true moat.
Next-Generation Robots Enter Real Homes in 35 Days
As robots enter homes, privacy concerns cannot be ignored. Wang Qian presented the X Variable team's clear solution:
Visual Desensitization
The robot performs real-time blurring of raw images on-device; raw images never leave the device, and what the robot "sees" is already scene data with personal characteristics removed;
Transparent Authorization
The device powers on only after the user actively confirms consent; there is no "default consent," and if the user declines, it does not turn on;
Purpose Limitation
Data is never shared with third parties; the robot recognizes only one owner and locks immediately upon detecting suspicious commands.
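The first two safeguards above can be pictured as a gate in front of the camera feed: nothing is produced without explicit consent, and only a blurred frame is ever handed onward. The Python sketch below is a simplified illustration, not the company's pipeline. The function names are hypothetical, and a plain box blur over the whole frame stands in for whatever desensitization method the robot actually uses.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255


def box_blur(image: Image, radius: int = 1) -> Image:
    """Replace each pixel with the integer mean of its neighborhood,
    clipping the window at the image borders."""
    height, width = len(image), len(image[0])
    blurred: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            values = [image[j][i] for j in ys for i in xs]
            row.append(sum(values) // len(values))
        blurred.append(row)
    return blurred


def frame_for_export(raw: Image, user_consented: bool) -> Image:
    """Return only a blurred copy, and only after explicit consent."""
    if not user_consented:
        # No "default consent": without an explicit yes, nothing is produced.
        raise PermissionError("no explicit consent: device stays off")
    return box_blur(raw)  # the raw frame itself is never returned
```

Keeping the blur inside the same function that enforces consent means no caller can obtain a frame that skipped either check.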
"Promises are cheap; user trust is the most expensive thing," Wang Qian stated.
On commercialization, X Variable's timeline is set: in 35 days, next-generation robots equipped with WALL-B and hardware upgrades for home environments will enter the households of first-batch users. Wang Qian noted the current model remains at the "intern" stage — it makes mistakes, needs remote assistance, and might put slippers in the kitchen or pause mid-table-wiping to "think." But it can work 24 hours uninterrupted, and grows "smarter" every working day through newly generated data.
Starting today, X Variable is recruiting "parents" for its first batch of home-entering robots; users can apply through official channels.

Continuous advancement of embodied foundation models has been X Variable's pursuit since founding. Building a robot brain that can truly understand the world and continuously learn within it, entering homes to serve humanity every day, is X Variable Robotics' long-term vision.
"The robots entering homes now are still clumsy, moving slowly, often making mistakes. So too was humanity's first step as infants. Every great journey begins with a staggering first step. Today, robots have already begun their journey of learning and evolution in the most complex place of all."
The model continues to iterate; full WALL-B details and its ecosystem foundation will be unveiled on April 27 at the first Guangdong Province AI Application Matchmaking Conference in Shenzhen.






