Billions of years ago, living organisms built the first "world model."

FreeS Fund (峰瑞资本) · April 17, 2026

How Far Are We From a World Model, Now That We've "Uploaded" a Fruit Fly Brain?

A fruit fly brain has been "uploaded." In March 2026, San Francisco-based startup Eon moved an adult fruit fly's complete brain into the virtual world: roughly 140,000 neurons and over 50 million synaptic connections, replicated one-to-one in the MuJoCo physics engine. With no reinforcement learning and no behavioral programming, the digital fly simply started walking on its own. It turns, it feeds, it grooms its antennae, matching real fruit fly behavior with a reported 95% accuracy. No one "taught" it to do any of this. The behavior emerged directly from the neural structure.

Digital fruit fly. Image source: Eon, "We've Uploaded a Fruit Fly"

This experiment cuts deeper than the question of "digital life." Embodied intelligence is one of the hottest sectors today: humanoid robot funding keeps hitting new highs, and physical AI is on the lips of Jensen Huang and every investor. But the industry has hit a wall: large language models don't understand the physical world. An LLM can discuss gravity, but it has never felt gravity. It can describe how "a cup falling off a table will break," but it cannot predict what happens when an object it has never seen before falls. Turing Award winner Yann LeCun left Meta in late 2025 after 12 years and founded AMI Labs with over $1 billion in funding, betting on one thing: LLMs are a dead end, and we must pivot to world models.

LeCun put it bluntly: the reason we still don't have a household robot as agile as a house cat, and no true self-driving cars, is exactly this. Current VLA (vision-language-action) models are still imitating patterns from training data, lacking causal understanding of the physical world and incapable of long-term planning.

So how do we actually build world models? More data? 3D modeling? Physical simulation?

Just as AI development was inspired by the neural networks of the human brain, we look to biology for answers: those tactile receptors buried beneath human skin, refined over billions of years of evolution. Starting from the tactile system, we trace through Fourier transforms, neural signal transmission, and world models, offering a complete framework. This cross-disciplinary approach to problem-solving is also what FreeS Fund has consistently championed.

The perceptual systems that biology evolved over billions of years are themselves functional "world models." Understanding their structure may help us see where world models currently get stuck, and where they evolve next. Biology, medicine — disciplines that seem unrelated to AI — may be exactly the reference frame we need to understand world models.

In this report, we explore the following topics:

Why did biology make perceptual systems so complex?
Why can large language models pass the bar exam, yet fail to control a robot steadily holding a cup of water?
AI is so smart now — could it become conscious?
From protein folding to fruit fly behavior, does the principle "structure determines function" still hold in the macroscopic world?
What does billions of years of biological evolution tell us about building "world models"?

Interactive Giveaway:

Do you think AI will develop consciousness? Share your thoughts in the comments. By 17:00 on April 23, 2026, the two most thoughtful commenters will receive a copy of A Brief History of Intelligence.

A Brief History of Intelligence

By Max Bennett

Translated by Lin Qiaojin

China Publishing House

China Translation & Publishing Corporation

The Precision Instrument Buried Beneath the Skin

Among the five human senses, touch is probably the most overlooked. Vision has 4K displays, hearing has Hi-Fi speakers — but when was the last time you seriously thought about "touch"? Touch is like air: always present, rarely noticed. In fact, touch is biology's oldest perceptual system; the fetus develops touch first in the womb, and it comes with a remarkably complex "collaborative system." Beneath your skin lies an entire array of precision sensors. Mechanical touch alone has four core receptors, each with its own role.

Merkel discs have high resolution and handle fine tactile discrimination — this is what blind people use to read Braille.
Meissner corpuscles sense low-frequency vibration; when you hold a cup and judge how much force to use, whether it might slip, this is what's working.
Pacinian corpuscles sense high-frequency vibration; animals detecting distant ground tremors, judging whether danger approaches — this deep, skin-buried receptor handles that.
Ruffini endings sense skin stretch and tension, helping you perceive hand posture.

Using genetic tools and antibody staining techniques, the Ginty Lab captured fluorescent portraits of tactile neurons. Image source: Quanta Magazine, "Touch, Our Most Complex Sense, Is a Landscape of Cellular Sensors"

These four are the leads, but beneath the skin there's a full supporting cast: receptors for cold, heat, pain, itch. Notably, cold receptors are also sensitive to menthol, which is why mint feels "cool." Pain splits into fast and slow: touch something hot, and you first feel sharp pain (Aδ fibers, 30 meters per second), then a wave of burning (C fibers, 0.5–2 meters per second). The former makes you pull away; the latter makes you scream.
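As a rough worked example of the two pain waves, the sketch below turns the conduction speeds quoted above into arrival delays. The 1-meter path length (roughly fingertip to brain) is an assumption for illustration, and synaptic and processing delays are ignored.

```python
# Conduction speeds quoted in the text, in meters per second.
A_DELTA_SPEED = 30.0          # fast, sharp pain
C_FIBER_SPEEDS = (0.5, 2.0)   # slow, burning pain (range)

# Assumed path length from fingertip to brain; illustrative only.
PATH_LENGTH_M = 1.0


def arrival_delay_ms(speed_m_per_s: float, distance_m: float = PATH_LENGTH_M) -> float:
    """Time for a signal to travel distance_m at the given speed, in milliseconds."""
    return distance_m / speed_m_per_s * 1000.0


fast_ms = arrival_delay_ms(A_DELTA_SPEED)
slow_ms = [arrival_delay_ms(v) for v in C_FIBER_SPEEDS]

print(f"A-delta fibers: about {fast_ms:.0f} ms")
print(f"C fibers: about {min(slow_ms):.0f} to {max(slow_ms):.0f} ms")
```

Under these assumptions the sharp pain arrives within a few tens of milliseconds, while the burning wave trails by roughly half a second to two seconds, which is why the two sensations feel like separate events.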

One type deserves special mention. C-tactile fibers wrap around hair follicles; they don't sense pressure or temperature, but emotion. The comfort and intimacy you feel when someone strokes you gently, or when you stroke a baby — that's C-tactile fibers at work. They connect directly to the brain's emotional center (the insula), traveling a completely different neural pathway from the "tool-type" receptors like Merkel discs.

Thus touch is more than a defense system — it shapes emotion. In the 1950s, American psychologist Harry Harlow conducted a famous experiment. He separated infant monkeys from their mothers and provided two "surrogate mothers":

"Wire mother": Made of cold, rigid wire mesh, but with a milk bottle on its chest providing food.
"Cloth mother": Wrapped in soft, warm fabric, but no bottle, no food.

The prevailing view held that "infant attachment to mothers is purely about satisfying physiological hunger." But the results proved otherwise: the baby monkeys spent nearly all their time clinging to the cloth mother, only briefly running to the wire mother when extremely hungry.

This experiment demonstrated that contact comfort and emotional security matter as much as food, if not more. Long-term tactile deprivation causes severe physical and developmental disorders. A growing body of research also links ASD (autism spectrum disorder) and ADHD (attention deficit hyperactivity disorder) to abnormalities in the tactile system.

Touch is a severely underestimated source of foundational intelligence.

And this system isn't fixed — it's highly plastic. In your cerebral cortex, there's a mapping called the "cortical homunculus" — each body part corresponds to a cortical region.

Somatosensory homunculus. Image source: Adobe Stock

But this little person's proportions bear no resemblance to your actual body: hands, lips, and tongue occupy huge areas, while torso and lower limbs get tiny patches. Cortical area doesn't depend on a body part's physical size, but on its functional intensity and sensitivity.

Crucially, this map changes. A blind person reading Braille long-term will physically expand the cortical region corresponding to their fingertips. The same happens to a violinist's left-hand fingers. Conversely, if a limb is amputated, the corresponding cortical region doesn't sit idle — neighboring areas "invade" it, which is precisely why amputees experience phantom limb pain.

Your Cochlea Is a Fourier Analyzer

Understanding the richness of the tactile system naturally raises a question: why so many types of receptors?

Separating touch and pain makes sense — one for daily use, one for alarm. But why separate cold and heat? Touch and caress? Most puzzling: why separate low-frequency and high-frequency vibration? Can't one sensor capture the full spectrum?

To answer, we detour through hearing — because hearing is essentially a specialized form of "touch": the receptors in your ear receive pressure signals from air waves, also mechanical force.

If an engineer designed a sound collector, the simplest solution would be a single diaphragm, like a condenser microphone, capturing all frequencies at once.

But biology didn't do this.

The human cochlea is a spiral structure. Sound waves entering it cause different frequencies to resonate at different positions: high frequencies at the base, low frequencies at the apex. Hair cells at each position convert vibrations in their frequency band into neural signals, each sent separately to the brain.

Structure of the peripheral auditory system.

Image source: Ento Key, Anatomy and Physiology of the Auditory System

So we might say: the cochlea is a natural, passive Fourier analyzer. It takes complex sound waves mixed together and decomposes them into simple frequency components.

What is a Fourier transform? Imagine you're at a noisy party, simultaneously hearing voices, music, clinking glasses. What the Fourier transform does is unpack that sonic mess, telling you which frequencies are present and in what proportions. However complex a signal, it can be decomposed into a superposition of simple waveforms.

These "simple waveforms" are also called basis functions. The Fourier basis functions are sine and cosine waves, and the Fourier transform projects a signal onto them: each coefficient measures how much of one particular frequency the signal contains.
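To make the "unpacking" concrete, here is a minimal sketch of a discrete Fourier transform written with only the Python standard library. The sample rate, the two test frequencies (5 Hz and 12 Hz), and their amplitudes are assumptions chosen for illustration; real applications would use an optimized FFT routine instead of this naive loop.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive discrete Fourier transform; returns an amplitude per frequency bin."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        # Project the signal onto the complex sinusoid of frequency bin k.
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        # Scale so a sine of amplitude A reads as A (valid for 0 < k < n/2).
        mags.append(2 * abs(coeff) / n)
    return mags


SAMPLE_RATE = 64  # samples per second (assumed)
N = 64            # one second of signal, so bin k corresponds to k Hz

# A "noisy party" made of two mixed tones: 5 Hz (amplitude 1.0) and 12 Hz (amplitude 0.5).
signal = [
    1.0 * math.sin(2 * math.pi * 5 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 12 * t / SAMPLE_RATE)
    for t in range(N)
]

mags = dft_magnitudes(signal)
present = [(k, round(m, 2)) for k, m in enumerate(mags) if m > 0.1]
print(present)
```

Only the 5 Hz and 12 Hz bins rise above the threshold, with amplitudes matching the proportions in which the two tones were mixed.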

This mathematical tool underpins all modern communications. Mobile signals, Wi-Fi, 4G/5G — all rely on Fourier transforms to allocate frequency bands, compress data, filter noise. Without Fourier transforms, no digital communication. And biology "invented" the same principle hundreds of millions of years ago.

Now back to touch. If we understand each tactile receptor as a "basis function" (Merkel discs for static pressure components, Meissner corpuscles for low-frequency vibration, Pacinian corpuscles for high-frequency vibration, Ruffini endings for tension), then what the tactile system does is closely analogous to a Fourier transform: decomposing complex physical-world signals onto a set of basis functions and extracting the features of each component.

This analogy extends further: perception = projecting and expanding the world onto multi-layer neural basis functions.

And biology does this because the brain never contains a 4K world model map. The purpose of biological perception is the reverse: inferring generative factors from chaotic sensory input, extracting low-dimensional features most useful for survival and action.

This is a design of parsimony: robust, heavily redundant, highly plastic, low computational cost. Strip noise and survival-irrelevant information; retain only key generative factors.

The world is "low-dimensional generation, high-dimensional projection." The core of foundational intelligence is finding low-dimensional distributions within high-dimensional data. Biological perceptual systems are inherently built to do exactly this.
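The idea of "low-dimensional generation, high-dimensional projection" can be illustrated with a toy experiment: one hidden factor generates ten observed channels, and a principal-component analysis recovers the fact that a single direction explains almost all of the variance. This is a sketch under stated assumptions (ten channels, Gaussian noise, power iteration in place of a full eigendecomposition), not a model of any real sensory system.

```python
import random

random.seed(0)
DIM = 10          # number of observed channels (assumed)
N_SAMPLES = 200

# One hidden generative factor z is projected into DIM dimensions, plus small noise.
direction = [random.uniform(-1, 1) for _ in range(DIM)]
data = []
for _ in range(N_SAMPLES):
    z = random.gauss(0, 1)
    data.append([z * d + random.gauss(0, 0.05) for d in direction])

# Center the data and build the covariance matrix.
means = [sum(row[i] for row in data) / N_SAMPLES for i in range(DIM)]
centered = [[row[i] - means[i] for i in range(DIM)] for row in data]
cov = [
    [sum(r[i] * r[j] for r in centered) / (N_SAMPLES - 1) for j in range(DIM)]
    for i in range(DIM)
]

# Power iteration: repeatedly applying cov converges to its leading eigenvector.
vec = [1.0] * DIM
for _ in range(100):
    nxt = [sum(cov[i][j] * vec[j] for j in range(DIM)) for i in range(DIM)]
    norm = sum(v * v for v in nxt) ** 0.5
    vec = [v / norm for v in nxt]

# Fraction of total variance captured by that single direction.
top_variance = sum(vec[i] * cov[i][j] * vec[j] for i in range(DIM) for j in range(DIM))
total_variance = sum(cov[i][i] for i in range(DIM))
explained = top_variance / total_variance
print(f"variance explained by one component: {explained:.3f}")
```

Although the data lives in ten dimensions, one component accounts for nearly all of its variance, which is the sense in which the observations are a high-dimensional projection of a low-dimensional cause.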

1 Million Fibers in Parallel: No Compression, No Encoding

After receptors decompose the world into low-dimensional components, the next step is transmission. Biology's approach differs fundamentally from everything communications engineering teaches us. Modern communications pursues signal compression and encoding, stuffing maximum information into limited bandwidth. Biology doesn't do this.

Take vision as an example (the optic nerve is thicker than tactile nerves, easier to observe anatomically). Each human optic nerve contains over 1 million nerve fibers. The retina has about 120 million photoreceptors (rods plus cones) — roughly 100 photoreceptors per optic nerve fiber. These fibers bundle into a cord about 2 millimeters in diameter and 6 centimeters long, passing directly from the back of the eye into the cranium, connecting to the brain's visual center.

1 million fibers transmitting simultaneously, each carrying signals from its own receptive field. No compression, no encoding, no mixing. Raw, parallel, direct delivery.

And these signals aren't transmitted as analog values; they're digital. The "action potentials" running along nerve fibers are essentially 0-and-1 pulses: fire or don't fire, no intermediate state. This all-or-none coding keeps the signal accurate over long distances.
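The all-or-none pulses described above can be sketched with a leaky integrate-and-fire neuron, a standard textbook simplification. The time constant, threshold, and input levels below are illustrative assumptions, not physiological measurements.

```python
def simulate_lif(input_current, steps=200, dt=1.0, tau=20.0, threshold=1.0, reset=0.0):
    """Leaky integrate-and-fire neuron; returns one 0/1 spike flag per time step."""
    v = 0.0
    spikes = []
    for _ in range(steps):
        # The membrane potential leaks toward 0 and is driven up by the input.
        v += dt * (-v + input_current) / tau
        if v >= threshold:
            spikes.append(1)   # fire: a full pulse, never a partial one
            v = reset
        else:
            spikes.append(0)   # stay silent
    return spikes


weak = simulate_lif(0.8)     # never reaches threshold
medium = simulate_lif(1.5)
strong = simulate_lif(3.0)

print(sum(weak), sum(medium), sum(strong))
```

Every output value is either 0 or 1. In this simplified model a stronger input does not make a bigger pulse; it makes pulses arrive more often, so intensity is carried by the firing rate.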

Once signals reach the brain, they don't get thrown into one pot.

Visual signals first reach the lateral geniculate nucleus (LGN) of the thalamus. Thalamic slices clearly show layered structures: signals from different types of retinal ganglion cells are sorted and stored in distinct layers. Not scrambled and recoded — structured, layered preservation.

Location and sliced structure of the macaque lateral geniculate nucleus in the brain.

Image source: Chinese Academy of Sciences Shanghai Branch, "From Retina to Visual Cortex — How Much Do You Know About the Visual System?"

Further up, signals enter the primary visual cortex V1. Neurons in V1 selectively respond to visual signals of different spatial frequencies; this layered processing mechanism closely resembles multi-level filtering in Fourier analysis. V1 neurons have receptive fields approximating Gabor functions, capable of extracting edges, textures, and local structures from images.

Parallel processing model of the visual pathway.

Image source: Chinese Academy of Sciences Shanghai Branch, "From Retina to Visual Cortex — How Much Do You Know About the Visual System?"

Those familiar with deep learning will recognize these terms: the first-layer filters that CNNs learn for image recognition closely resemble Gabor filters, and the convolutional architecture itself was originally inspired by the visual system. So it's not biology learning from AI; it's AI learning from biology.
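The Gabor-shaped receptive fields mentioned above can be sketched in a few lines: a Gabor function is a sinusoidal grating multiplied by a Gaussian envelope, and it responds most strongly to patterns that match its own orientation. The kernel size, wavelength, and envelope width here are illustrative assumptions.

```python
import math


def gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0):
    """2D Gabor kernel: a cosine grating under an isotropic Gaussian envelope."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the direction theta in which the grating varies.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def grating(size=9, wavelength=4.0, theta=0.0):
    """A plain cosine grating used as a test image patch."""
    half = size // 2
    return [
        [math.cos(2 * math.pi * (x * math.cos(theta) + y * math.sin(theta)) / wavelength)
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]


def response(kernel, patch):
    """Dot product of kernel and patch, i.e. one position of a convolution."""
    return sum(k * p for krow, prow in zip(kernel, patch) for k, p in zip(krow, prow))


kernel = gabor_kernel(theta=0.0)
matched = response(kernel, grating(theta=0.0))
orthogonal = response(kernel, grating(theta=math.pi / 2))
print(f"matched: {matched:.2f}, orthogonal: {orthogonal:.2f}")
```

The filter responds strongly to the grating that shares its orientation and barely at all to the one rotated by 90 degrees, which is the orientation selectivity attributed to V1 neurons.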

Vision, hearing, and touch all share similar Fourier analyzer structures: low-dimensional acquisition → digital parallel transmission → structured layered delivery to thalamus → cortical alignment and integration analysis.

The thalamus is far from a simple relay station. Recent research reveals it as a center for cross-modal information integration: its higher-order nuclei (intralaminar, medial) are supramodal, cognition-related "gating" centers, integrating broad inputs from cortex, basal ganglia, and limbic system, evaluating information's "global relevance." When you touch an object, you simultaneously use vision to confirm its texture; touch helps calibrate visual expectations. This cross-modal integration happens in the thalamus.

Biology's World Model Is Necessarily "Embodied"

With the full architecture of biological perception clarified — low-dimensional acquisition, digital parallel transmission, thalamic structured integration — a larger question emerges: what does biology's "world model" look like?

In a word: it's embodied.

What does embodied mean? The world model isn't an abstract computation independent of the body — it grows directly out of the body.

First, the body is space. Your body gives you a natural coordinate origin; you are the center of the world. Your arm length measures nearby space; your stride measures distant space. Touch, vestibular sense, and proprioception integrate to merge body and external world into one whole. You know where your hand is with eyes closed; you walk without looking at your feet. This isn't calculated — it's a spatial model built into the body.

Second, biology needs no labels; it learns through self-supervision. Aligning a large language model still relies on curated examples and human feedback, but a baby learning to walk has no one telling them "this step right, that step wrong." They fall, get up, and the brain continuously compares "what I predicted would happen" with "what actually happened," adjusting automatically.

How can biology do this? Because low-dimensional data carries its own constraints. Perceptual input is low-dimensional and structured, so the brain can distinguish deviations between prediction and observation, forming closed-loop feedback. This is also why true end-to-end learning in AI today only works in low-dimensional input scenarios: autonomous driving is essentially one-dimensional linear control, proteins are sequences, language is sequences.
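The loop of comparing "what I predicted would happen" with "what actually happened" can be sketched as a tiny self-supervised learner. The single hidden gain, the learning rate, and the noise level are all hypothetical choices for illustration; the update is a standard delta rule, not a claim about how the brain implements it.

```python
import random

random.seed(1)
TRUE_GAIN = 2.5       # hidden property of the "body": movement per unit motor command (assumed)
LEARNING_RATE = 0.1

estimated_gain = 0.0  # the learner's internal model starts out knowing nothing
errors = []
for _ in range(200):
    command = random.uniform(-1, 1)
    predicted = estimated_gain * command                     # what I predicted would happen
    observed = TRUE_GAIN * command + random.gauss(0, 0.01)   # what actually happened
    error = observed - predicted
    # No external label: the model is corrected only by its own prediction error.
    estimated_gain += LEARNING_RATE * error * command
    errors.append(abs(error))

early = sum(errors[:20]) / 20
late = sum(errors[-20:]) / 20
print(f"estimated gain: {estimated_gain:.2f}, early error: {early:.3f}, late error: {late:.3f}")
```

Because the input is a single structured quantity, the prediction error points unambiguously at what to fix, and the estimate settles close to the hidden gain without anyone supplying a correct answer.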

Third, foundational intelligence isn't "computed"; it's "grown." Perception, movement, and similar capabilities are built step by step through bodily development and neural plasticity during ontogeny. This grown neural structure constitutes biological foundational intelligence: the organism's embodied world model. Once established, this world model doesn't disappear; like riding a bicycle, you don't forget it even after a decade. We commonly call this muscle memory.

Following this logic, FreeS Fund has made extensive investments in embodied intelligence:

FreeS Fund's investment map in the embodied intelligence sector.

The Lesson of a Digital Fruit Fly

If intelligence is "grown," we naturally think of a principle from molecular biology: structure determines function.

This concept has long been axiomatic in molecular biology: a protein's three-dimensional structure determines its function. AlphaFold shocked academia precisely because it could predict protein structure from amino acid sequences, thereby inferring function.

This principle holds equally at macroscopic scales. The most powerful evidence is that digital fruit fly from the opening.

In March 2026, Eon's team replicated the fruit fly brain's connectome (roughly 125,000 to 140,000 neurons, over 50 million synapses) into the MuJoCo physics engine. No behavioral rules were written, no reinforcement learning training performed — just faithful reconstruction of neuronal connectivity. Adjust weight parameters, connect to the physics engine, and the fly "came alive."

It walks, turns, grooms itself, feeds, vibrates its wings — 95% behavioral similarity to real fruit flies.

The significance of this experiment: behavior emerges directly from neural structure. No programming, no training — the structure itself encodes function.
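A toy illustration of "structure encodes function," and emphatically not Eon's simulation: the three-neuron wiring below is hypothetical, with each neuron exciting only the next one in a ring. Nothing is trained and no behavior is scripted, yet a repeating rhythm appears purely because of how the units are connected.

```python
# Hypothetical 3-neuron wiring: each neuron excites only the next one in the ring.
# WEIGHTS[i][j] is the connection strength from neuron j to neuron i.
WEIGHTS = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
THRESHOLD = 0.5


def step(state):
    """One synchronous update of binary threshold units."""
    return [
        1 if sum(w * s for w, s in zip(row, state)) > THRESHOLD else 0
        for row in WEIGHTS
    ]


state = [1, 0, 0]      # a single initial pulse
history = [state]
for _ in range(6):
    state = step(state)
    history.append(state)

for row in history:
    print(row)
```

The activity cycles through the three neurons with period three. Changing the wiring matrix changes the pattern, while the update rule stays the same, which is the sense in which the connectivity itself carries the behavior in this toy case.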

This also explains the famous Moravec's Paradox in AI: higher cognition (logical reasoning, chess) is relatively easy for AI, while basic capabilities (walking, grasping, environmental perception) are extraordinarily difficult.

The answer lies in structural complexity differences. Logical reasoning may seem "advanced," but relies on relatively simple, formalizable structure — symbolic manipulation, rule-based inference. Walking, grasping — these "simple" capabilities rest on complex neuro-musculo-skeletal structures refined over billions of years of evolution. Walking feels simple because your body does all the computation for you.

So large language models can pass the bar exam, yet fail to control a robot steadily holding a cup of water.

The former is higher cognition with simple structure; the latter is foundational intelligence with complex structure.

Will AI Become Conscious?

Understanding "structure determines function," we can attempt a slightly philosophical question: AI is so smart now — could it become conscious?

Five years ago, FreeS Fund's internal discussions still centered on "can AI reach human-level intelligence." Today that question barely needs discussion — AI has surpassed humans in many dimensions. But "conscious" and "very smart" are two different things.

And the gap between them may be "sense of time."

A foundational prerequisite of consciousness is continuous subjective experience. And continuous subjective experience first requires temporal sense: you must feel "before" and "after," feel things flowing. Without temporal sense, no continuous experience; without continuous experience, no consciousness.

But time is peculiar. Those who've studied physics know that in quantum mechanics, time is not an operator, merely a parameter. Time cannot be directly perceived; you have no "time receptor" the way your eyes serve as light receptors or your cochlea as a sound-wave receptor.

So where does biological temporal sense come from?

It emerges from the structure of a "touch-dominated neural dynamical system." Through NMDA receptors, spike-timing-dependent plasticity (STDP) at synapses, and complex dynamical interactions among time cells, temporal sense emerges as a function from structure. This is the same logic as the digital fruit fly's behavioral emergence.
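Of the mechanisms listed above, STDP is the easiest to sketch, because its core is a rule about temporal order: a synapse strengthens when the presynaptic spike precedes the postsynaptic one and weakens when the order is reversed. The exponential window below is the common textbook form; the amplitudes and the 20 ms time constant are illustrative assumptions.

```python
import math

A_PLUS = 0.010    # maximum strengthening per spike pair (assumed)
A_MINUS = 0.012   # maximum weakening per spike pair (assumed)
TAU_MS = 20.0     # width of the timing window in milliseconds (assumed)


def stdp_delta_w(t_pre_ms: float, t_post_ms: float) -> float:
    """Weight change for one pre/post spike pair under an exponential STDP window."""
    dt = t_post_ms - t_pre_ms
    if dt > 0:
        # Pre fired before post: the input plausibly helped cause the output, so strengthen.
        return A_PLUS * math.exp(-dt / TAU_MS)
    if dt < 0:
        # Post fired before pre: the input arrived too late, so weaken.
        return -A_MINUS * math.exp(dt / TAU_MS)
    return 0.0


for pre, post in [(0, 10), (0, 50), (10, 0), (50, 0)]:
    print(f"pre={pre:>2} ms, post={post:>2} ms -> dw={stdp_delta_w(pre, post):+.5f}")
```

The same pair of spikes produces opposite changes depending on which came first, and the effect fades as the gap widens. In this sense the synapse itself registers "before" and "after."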

Look back at large language models. LLMs have no temporal sense; no intrinsic temporal flow, only token sequences — discrete, not continuous. The "time" and "causality" they process come from input text structure, not the system's own dynamics.

This explains a phenomenon many intuitively sense but struggle to articulate: LLMs have strong logical reasoning but weak causal understanding. Logical reasoning needs no temporal arrow; "if A then B" is a static formal relation, emergent from linguistic structure. But causality is inherently temporal — "A causes B" means A before B, with a temporal direction between them. A system without intrinsic temporal flow cannot truly understand causality.

At least with current AI architectures, consciousness cannot emerge.

This isn't saying AI isn't smart enough. Rather, consciousness may require a different underlying structure: one with intrinsic temporal dynamics, embodied, emerging from development. And this is precisely what biology spent billions of years evolving.

Conclusion

Starting from tactile receptors, passing through Fourier transforms, neural signal transmission, thalamic integration, embodied world models, structure determining function, and finally landing on AI consciousness — threading this arc together, we arrive at a core conclusion:

Biological intelligence isn't "computed" — it's "grown."

The perceptual system uses Fourier decomposition to extract low-dimensional features; the nervous system uses digital parallel transmission for faithful delivery; the thalamus performs cross-modal integration; the cortex completes higher-level processing. This entire architecture wasn't designed by any genius — it's the product of billions of years of evolution. Every layer of structure carries function; every function is rooted in structure.

Today's AI takes a different path: larger parameters, more data, stronger compute. This path has achieved astonishing results in logical reasoning and language understanding, but in physical world perception, movement, and causal understanding, it has hit a wall.

LeCun is betting on world models as the way out. From a biological perspective, we see a more fundamental hint: perhaps the answer lies not in algorithms, but in structure. As the digital fruit fly tells us: when the structure is right, function emerges on its own.

Physicists pursue grand unified theories, but in the colorful biological world there are no unified models — every living organism is a unique model.

One flower, one world; one grain of sand, one heaven. There is no model of the world, only worlds of models.
