In the Age of AI, Humans Are Panicking, Screens Stay Square

峰瑞资本 (FreeS Fund) · March 12, 2026

Got an 8K screen but your eyes still feel exhausted?

We are living in an era drowning in AI hype cycles.

From OpenAI kicking off the large language model era to AI-generated videos convincing enough to fool the eye, the shocks keep coming: Sora's afterglow still lingers, yet Seedance 2.0 has already jolted the industry, to say nothing of the recent viral sensation OpenClaw. AI technology is iterating almost weekly. We fear missing any hot trend, and fear even more that missing one means being left behind by the times.

Yet strip away all the flashy tech, and you'll find they all ultimately point to the same war — the battle for human attention.

As the core terminal for carrying information, the screen is the most vivid microcosm of this war, and its form has kept evolving. The industrial era pursued horizontal expansion; the early smartphone era relentlessly narrowed and lengthened screens for grip comfort, spawning various "remote control" or "cinemascope" displays. Now screens are converging, almost in unison, on a stable form: the square.

Take Apple's rumored first foldable iPhone, for instance. To solve the pain point of traditional foldables' "overly narrow inner screen," it has opted for a "wide fold" solution. Its unfolded 7.8-inch inner screen is no longer a slender rectangle, but a broad display with an aspect ratio of roughly 1.4:1.

Beyond phones, the same pattern holds across other domains: IMAX screens use a 1.43:1 ratio; Apple Vision Pro 2 maintains a single-eye display ratio of approximately 1.21:1.

This return to squareness isn't some Renaissance revival. It's the inevitable technical choice made by the audiovisual industry after a century of "horizontal expansion," now confronting information overload.

In fact, the square may be more suited to human physiological instincts than any widescreen format.

Goldmann's visual field map. The central white area corresponds to "the field both eyes can see together"; the textured areas correspond to "peripheral regions visible to one eye but not the other."

Source: 1964 NASA report, Bioastronautics Data Book.

As shown, the field that both eyes see together naturally approximates a rounded square. And at a moment when AIGC has driven the cost of content generation toward zero and content supply is growing exponentially, square screens can capture users' shrinking attention more effectively than widescreens, because they project information more efficiently and precisely onto the focal region of the human eye.

In this industry research report, Technology-Driven Sensory Revolution, we will start from human physiological limitations and the history of audiovisual industry evolution, unpacking why the square may be the endpoint of screen evolution — and based on this view, how future technological evolution and entrepreneurial directions might shift.

We will answer the following questions in this article:

With attention thresholds now shrunk to a matter of seconds, how do we combat information overload?
Was the spread of the 16:9 ratio driven by demand or compromise?
Why do Empresses in the Palace, EVA, and Hamlet share the same underlying logic?
How does AIGC leverage "Save the Cat" beats to drive content production costs to zero?
Why has the widescreen become an obstacle in the short drama era?
From IMAX to tri-fold screens, why are screens returning to "square"?

We continue to track AI developments in the audiovisual domain. If you are an entrepreneur or practitioner in this space, feel free to contact the author Li Gang at lig@freesvc.com.

Reader Giveaway: What do you think audiovisual carriers will evolve into? Share your thoughts in the comments. The most thoughtful comment posted by 17:00 on March 19, 2026 will win a copy of World Cinema History.

Previous Research Reports:
Looking Ahead to 2026: What Innovation Opportunities Exist in AI? | FreeS Report
How Hard Is the "Artificial Sun"? Unpacking Core Technologies and Startup Opportunities in Controlled Nuclear Fusion | FreeS Report
Succeeding Smartphones, Smart Wearables Are Becoming the New Engine of Consumer Electronics | FreeS Research

01

Humanity's "Token Retention" Mechanism

Before exploring the century-long evolution of audiovisual carriers, we must first confront a physiological fact: while display technology races forward at Moore's Law pace, the human eyes responsible for receiving information have undergone no substantive upgrade across tens of thousands of years of evolution.

First, vision is by far the most important channel through which humans receive external information. Anatomically, the human optic nerve contains roughly 1.2 million nerve fibers, while the cochlear nerve responsible for hearing contains only about 30,000, a difference of roughly 40 to 1.

If we set the total volume of human sensory information reception at 100%, the visual channel accounts for roughly 70%, hearing about 20%, with the remaining 10% divided among smell, taste, touch, and proprioception.

This means our understanding of the world is built overwhelmingly on visual signal input.

Second, the brain needs protection against overload. Vision's enormous bandwidth does not mean unlimited processing capacity in the brain. The human pursuit of novelty and exploration depends on sustained stimulation of the nervous system, and that stimulation has physiological thresholds. If external information pours in without restraint, the brain easily slides into exhaustion.

For self-protection, we've developed a highly streamlined information filtering and retention mechanism: faced with massive audiovisual input, what we ultimately retain is usually not some specific, pixel-level image, but a core "feeling."

For example, you once scored perfectly on an exam in school. Years later, you probably can't recall the specific questions, or even which subject it was, but you can vividly remember the sense of accomplishment in that moment. Borrowing a concept from the large model era, this is essentially "distilling massive redundant information into a small number of Tokens."

We are so adept at simplifying complex information, yet we've charged headlong into an era of extreme expansion on the information supply side. This means that if content cannot seize attention within seconds, the brain will directly classify it as "junk information."

02

A Century-Long "War" for Attention

The evolution of audiovisual history is, in essence, creators' continuous struggle to claim human sensory channels. And if we trace back to the early days before technological explosion, we find all carriers were "square" and "focused."

First, from still to moving: murals to theater. The earliest image records trace back to ancient murals. Whether with brush or outline, humans have always sought excuses to record beauty.

And when murals "moved," they became theater. Theater's visual focus was also fixed, on the square central stage. At this stage, there were no camera pushes or pulls, no montage; creators guided audience gaze through actors' blocking within the center of the frame.

Second, from silent to sound: action and atmosphere take over. At the end of the 19th century, film was born. The film masters of this era were essentially making "recordings of theater." We can look back at early film works, such as 1936's Modern Times — within the silent, black-and-white 4:3 frame, lacking color and sound, creators were forced to maintain audience attention through extremely exaggerated physical staging and facial expressions. In this ultimate square composition, every subtle movement was magnified.

Modern Times poster. Source: WIKI.

With the addition of soundtracks, creators began exploring sound's capacity for atmosphere building.

1966's The Good, the Bad and the Ugly demonstrated how sound takes over when vision fails. In this segment of the film, camera movement is extremely slow, with almost no substantive narrative progression. Watching the visuals alone, a modern viewer's eye would start to wander after the first 7 seconds because of "insufficient information density."

However, Ennio Morricone's iconic score takes over in time. Tense environmental sounds, a distant yet menacing melody — when vision grows boring, they establish intense immersion through the auditory channel. Years later, you may not remember those characters' faces, but that score will instantly transport you back to that Western territory.

Third, from black-and-white to color, live-action to animation: the battle for control of imagination. By the late 1960s, color had largely replaced black-and-white.

In 2001: A Space Odyssey (1968), Kubrick used color to vastly expand the boundaries of human imagination. Color imagery was not merely a visual upgrade but a massive leap in information input efficiency: different color combinations conveyed different emotional tones and environmental information, making each frame far denser.

Thereafter, animation began to emerge. Compared to the uncontrollable natural environments in live-action film, animation represented humanity's deepened takeover of audiovisual sovereignty — every frame, every play of light and shadow, every composition was determined.

Take Nobody (the Yao-Chinese Folktales episode): despite lacking expensive effects, it is deeply moving because its colors and exquisite narrative rhythm capture the emotions of a mass audience.

This shift from "recording reality" toward "artificial generation" essentially laid the logical groundwork for content creation in the AIGC era.

03

If "Square" Is the Answer, Why Have Screens Grown Wider for a Century?

If "square" is the physiological optimum, why have screens grown ever wider and more elongated over the past century? Behind this lies a tug-of-war between industry interests and physical reality.

In the late 19th century, Thomas Edison's laboratory established the 4:3 (1.33:1) aspect ratio while developing 35mm film. This ratio wasn't arbitrary — it most closely approximates the focal range the human eyeball can cover in its natural state, without extensive saccades.

35mm silent film standard specifications established in the late 19th century. Note the 4 perforations per frame in the diagram — this engineering constraint gave birth to the precise 4:3 aspect ratio. Source: Film Atlas.

Then came the 1950s. As television entered millions of households, cinemas faced an existential crisis. To pull audiences back from their living room TVs to the silver screen, the film industry made one key change: since television had adopted film's 4:3 ratio, film had to become wider.

In 1953, 20th Century Fox introduced CinemaScope technology, stretching the aspect ratio to 2.35:1. Its core logic was: since the visual center was already saturated, fill the audience's "peripheral vision" with an extremely wide frame, simulating an immersive illusion.

In the 1980s, 16:9 gradually became the standard ratio for HDTV. In practice, though, this ratio's spread owed more to constraints of physical space: in modern buildings, limited ceiling height made it difficult for manufacturers to increase screen area by adding height, while horizontal expansion (making the screen wider) was cheaper and easier to achieve. The 16:9 ratio, and the subsequent 21:9, were essentially forced reshapings of visual habits.

This led to a rather awkward situation: human physiological vision tends toward an ellipse, or even a rounded square, yet we are placed in an extremely narrow and elongated visual environment. On 16:9 widescreens, our eyeballs must perform more frequent horizontal movements (saccades) to capture edge information, even as the pace of content production keeps accelerating.
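To put rough numbers on this mismatch, the sketch below computes what share of a wide frame lies outside a centered, full-height region of a narrower aspect ratio. The function name side_band_share and the choice of 4:3 as a stand-in for the focal region are illustrative assumptions, not figures from the report.

```python
from fractions import Fraction


def side_band_share(screen_ratio: Fraction, focal_ratio: Fraction) -> Fraction:
    """Share of a screen's area outside a centered, full-height focal region.

    Both arguments are width/height ratios. The focal region is assumed to
    span the full screen height, so only the left and right bands are lost.
    """
    if focal_ratio >= screen_ratio:
        return Fraction(0)  # the focal region already covers the full width
    return 1 - focal_ratio / screen_ratio


# Assumption: 4:3 stands in for the focal region discussed in the text.
FOCAL = Fraction(4, 3)

for name, ratio in [("16:9", Fraction(16, 9)),
                    ("21:9", Fraction(21, 9)),
                    ("2.35:1", Fraction("2.35"))]:
    share = side_band_share(ratio, FOCAL)
    print(f"{name}: {float(share):.1%} of the frame lies in the side bands")
```

Under these assumptions, exactly one quarter of a 16:9 frame falls in the side bands, and more than 40% of a 21:9 or 2.35:1 frame does. This is plain geometry and says nothing about how much attention those bands actually receive.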

Why Empresses in the Palace = EVA = Hamlet?

While screen proportions kept expanding horizontally, the content structure carried by audiovisual carriers had already completed a highly standardized evolution.

In film, television, and content industries, there exists a precise narrative template called "Save the Cat."

This is an exact narrative formula: when the protagonist appears, when they fall into crisis, when they achieve redemption — all have extremely clear temporal nodes. Strip away the surface artistic packaging, and one could say that whether it's Empresses in the Palace, Neon Genesis Evangelion, or Hamlet, their underlying story beats are fundamentally identical.

For the content industry, this "metronome" is the guarantee of production efficiency. It proves that most successful narratives are not random creations, but precise hits on human psychological stimulation points.

But precisely because of this "metronome," AIGC's emergence enables exponential growth in content creation speed.

From Sora's stunning debut, to the follow-up of video models like Kling AI and Runway, to Seedance 2.0's emergence at the start of this year, AI video generation has become an essential skill for virtually all professional content creators.

Hollywood director Charles Curran publicly stated after hands-on testing that he completed a cinematic trailer in just 20 minutes, spending only tens of dollars. Seedance 2.0 has the potential to disrupt traditional film and television industry workflows and cost structures.

As AI video model technology continues to advance, the barrier to executing the "metronome" will only grow lower. And the format that pushes this metronome to its extreme is short drama.

In short dramas, traditional minute-level beats are further compressed to the second level. To retain audiences within extremely short attention windows, creators are forced to pile the highest-intensity stimuli within the smallest visual space — each episode only one or two minutes, every plot twist must hit its mark with precision.

This highly concentrated, high-frequency-reversal content format is making "wide visual backgrounds" lose their original raison d'être.

In 16:9 or even wider aspect ratios, the sides of the frame are typically used for environmental details, architectural composition, or distant atmosphere — information that in traditional slow-paced narrative was responsible for building "mood." However, when narrative becomes "short-drama-style beats," the audience's visual focus locks extremely stably onto the conflict point at the center of the frame.

For the receiving end, those wide backgrounds existing merely to fill peripheral vision become redundant. The old model relying on "atmosphere building," as in The Good, the Bad and the Ugly, is being replaced by more direct, more energy-efficient visual focus.

This also foreshadows that audiovisual carriers are about to complete a century-spanning cycle: from expansive physical occupation back to the "square" best suited to focal vision.

Four Pathways to Reshaping Sensory Experience

The return of audiovisual logic to "square" is not merely a change in proportion, but the prelude to a trend of full sensory takeover. Based on this, we can freely imagine the future: how will technological pathways and consumer hardware forms evolve? More importantly, where do entrepreneurial and investment opportunities lie?

First, establish "direct trust" in emotion.

The endpoint of immersion lies not in simple planar pixel stacking, but in establishing "direct trust" at the physiological level. Though vision occupies 70% of bandwidth, its falsity is easily detected by the brain; by contrast, spatial audio's sense of "presence" can elevate user experience dramatically.

Take the spatial audio feature now common on recent iPhone generations: it allows users to freely choose in post-editing whether sound follows the camera, the environment, or exists off-screen. This function simulates real physical sound fields. New-generation headphones have also added stereo capabilities.

Consider a cat's meow suddenly sounding at the office doorway: even at extremely low stimulus intensity, its sense of spatial position directly establishes sensory "direct trust." This immersion is something simple planar pixel stacking cannot provide.

Second, capture everything first, define on demand later.

Capturing the most comprehensive data possible first, then defining on demand with AI assistance, may be an efficient pathway for content production.

First, changes in sensors: in the past we pursued extremely high sensor utilization rates, but now semiconductors are cheap enough to permit some extravagance. For example, both the iPhone 17 front camera and DJI use custom square sensors; Insta360 uses a circular imaging circle. These are all compatible with both horizontal and vertical shooting needs, completely freeing users from shooting orientation constraints, enabling full capture.

Zero Zero Robotics has innovated on the capture method itself. Its product is a camera capable of autonomous flight: press once, and it flies out on its own to record video, then flies back. The flying camera offers different modes, such as auto-follow, cycling, and skiing.

This type of hardware no longer demands that the shooter align with some specific ratio; instead it records the full space, with AI automatically distilling the content best suited to focal vision in post-production.
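The "capture first, define later" idea can be illustrated with simple crop arithmetic: given a sensor, find the largest rectangle of a target aspect ratio that fits inside it. This is a minimal sketch; the function largest_crop and both sensor sizes are hypothetical and do not describe any actual product.

```python
def largest_crop(sensor_w: int, sensor_h: int,
                 ratio_w: int, ratio_h: int) -> tuple[int, int]:
    """Largest ratio_w:ratio_h rectangle, in whole pixels, that fits the sensor."""
    if sensor_w * ratio_h <= sensor_h * ratio_w:
        # Width is the limiting dimension: use full width, derive height.
        return sensor_w, sensor_w * ratio_h // ratio_w
    # Height is the limiting dimension: use full height, derive width.
    return sensor_h * ratio_w // ratio_h, sensor_h


# Hypothetical sensor sizes in pixels, chosen only for illustration.
SQUARE = (4000, 4000)
WIDE = (4000, 2250)  # a 16:9 sensor with the same width

for label, (w, h) in [("square sensor", SQUARE), ("16:9 sensor", WIDE)]:
    landscape = largest_crop(w, h, 16, 9)  # horizontal video
    portrait = largest_crop(w, h, 9, 16)   # vertical video
    print(f"{label}: landscape {landscape}, portrait {portrait}")
```

On the hypothetical square sensor, the horizontal and vertical crops keep the same pixel count (4000 × 2250 and 2250 × 4000). On the 16:9 sensor, the vertical crop shrinks to 1265 × 2250. That is the trade described above: a square sensor leaves part of its area unused in any single crop, but it does not penalize either orientation.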

Third, haptic feedback.

After vision (70%) and hearing (20%) are pushed to their limits, as long as brain-computer interfaces have not arrived, the remaining 10% of "somatosensory bandwidth" still holds opportunity.

High-end cinema features (such as 3D imagery and panoramic sound) are gradually making their way into the home, for example La-Z-Boy recliners fitted with vibration motors, wind feedback devices, and so on.

Fourth, the ability to reconstruct "content narrative."

When hardware completes "full capture," AI's core task shifts to "distillation" and "dimensional elevation" of information. This shift is not merely about improving production efficiency, but about serving as an extension of human perception, bringing us into visual dimensions previously unreachable due to physical and physiological limitations.

Temporal dimension: The human retina has natural physiological limitations; extremely high-speed motion usually registers only as blurry afterimages. Pixboom, through AI computational imaging and optical encoding technology, broadens human perception of the "instant" — such as a bullet passing through an object, or a water droplet splashing.

Spatial dimension: Traditional imagery records a "cross-section" of the world. But increasingly, companies are attempting to give users "full-dimensional capture" capabilities. DJI and Insta360, respectively, broaden recording boundaries through altitude and omnidirectional angles.

Zero Zero Robotics achieves follow-shooting through its autonomous flying camera. Zhuma Innovation goes further, using 3D Gaussian Splatting (3DGS) technology to pull these recordings from 2D planes into 3D space. The angle from which you view a photo is then no longer fixed at the moment of capture; instead, you "enter the world" it recorded.

Celestial dimension: Astrophotography is a relatively niche domain, with high shooting difficulty and long cycles, but startups are also targeting this space. Exponential Starry Sky is an intelligent astrophotography device that, through chip-level hardware innovation combined with AI technology, upgrades the shooting experience — simplifying the cumbersome processes of star-finding, guiding, shooting, and post-processing.

Conclusion

The audiovisual industry's past century has been a cycle: from a "square" starting point, through the inflation of "horizontal expansion," and finally back to "squareness" in the era of information overload.

"Square" is not retro, but the form where sensory capture efficiency and hardware utilization rate reach consensus. From the latest tri-fold screens to Vision Pro, the endpoint of hardware is to completely break free from physical bezel constraints, returning to the most comfortable focal region of the human eye. When technology evolves to its endpoint, its purpose is no longer to occupy physical space, but to more precisely pluck that nerve called "resonance."