Product

Vision-Language-Action

VLA

Vision-Language-Action (VLA) is a model architecture that connects visual perception, language understanding, and physical action for robotics and embodied AI. As described by Yunqi Capital portfolio companies, it is a direct-action paradigm: the model translates what it sees and what it is told directly into movement. This contrasts with world-model approaches, which add an intermediate step of "imagining" the consequences of an action before acting.
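The contrast between the two paradigms can be illustrated with a minimal Python sketch. Everything here is hypothetical and chosen only for illustration: the `Observation` type, the toy scaling rule inside `direct_action_policy`, and the `predict` and `score` callables are assumptions, not any company's actual model or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Action = List[float]  # hypothetical action representation, e.g. joint or steering commands


@dataclass
class Observation:
    image_features: List[float]  # stand-in for an encoded camera frame
    instruction: str             # natural-language command


def direct_action_policy(obs: Observation) -> Action:
    """VLA-style: map (vision, language) straight to an action in one step.

    A real model would be a learned network; this toy rule only flips the
    sign of the features when the instruction mentions "reverse".
    """
    scale = -1.0 if "reverse" in obs.instruction else 1.0
    return [scale * x for x in obs.image_features]


def world_model_policy(
    obs: Observation,
    candidates: Sequence[Action],
    predict: Callable[[Observation, Action], float],
    score: Callable[[float], float],
) -> Action:
    """World-model style: imagine each candidate's outcome, then pick the best."""
    return max(candidates, key=lambda action: score(predict(obs, action)))
```

The direct policy makes a single forward mapping, while the world-model policy pays for one `predict` call per candidate action before committing to a move.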

The architecture has become central to commercial autonomous driving and robotics development. DeepRoute (元戎启行) has built its assisted-driving platform on VLA, with mass-production deliveries exceeding 30,000 units per month as of September 2025 and nearly 200,000 vehicles expected on the road by year-end. In robotics, companies such as Astribot and Stardust Intelligence (星尘智能) are pushing VLA toward general-purpose physical AI. Astribot's CLAP framework trains VLA models on both robot trajectory data and unlabeled human videos, using cross-modal alignment to bridge the gap between human demonstration and executable robot action. Stardust Intelligence's Lumo-1 takes a three-stage approach: it first builds embodied visual-language understanding, then performs joint training across robots, and finally grounds reasoning in real-world manipulation trajectories collected from its S1 cable-driven robot.
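The source does not specify how CLAP performs cross-modal alignment. As a rough illustration of the general idea only, the sketch below uses a generic InfoNCE-style contrastive loss, a swapped-in standard technique rather than Astribot's actual objective. The function names, the `temperature` value, and the assumption that the i-th human-video embedding is paired with the i-th robot-trajectory embedding are all hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(
    human: Sequence[Sequence[float]],
    robot: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss: human[i] should be closest to robot[i].

    Returns the mean cross-entropy of picking the paired robot embedding
    out of all robot embeddings in the batch.
    """
    total = 0.0
    for i, h in enumerate(human):
        logits = [cosine(h, r) / temperature for r in robot]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(human)
```

Minimizing a loss of this kind pulls paired human and robot embeddings together and pushes unpaired ones apart, which is one common way to place two modalities in a shared space.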

The field's intellectual lineage traces to Google's RT-1 and RT-2 models; RT-2 in particular started from vision-language models pretrained on internet-scale data and fine-tuned them to output robot actions. As 5Y Capital (五源资本) noted in mid-2026, Chinese developers still evaluate these systems with relatively crude benchmarks, "the way Edison tested filaments", and the core challenge remains multi-step reasoning about how actions change the physical world.

AI-generated — may contain errors, please verify.

Mentioned in 6 articles

Coverage