How Far Are We From "Ready Player One"? The Technical Challenges of XR Through the Lens of Apple Vision Pro | FreeS Report 33

FreeS Fund · October 19, 2023

What other innovation opportunities remain in the XR space?

Over 30 years ago, Qian Xuesen proposed translating "Virtual Reality" as lingjing (灵境, "spiritual realm"), predicting that VR technology would become a pathway for human-machine integration and the evolution of human society. At the turn of the century, The Matrix trilogy showed audiences the indistinguishable blend of illusion and reality within the Matrix's virtual space.

At WWDC (Apple's annual Worldwide Developers Conference) in June 2023, the Vision Pro made its debut after seven years of development. The Vision Pro is a mixed reality device that fuses AR (Augmented Reality) and VR (Virtual Reality). From human-computer interaction to hardware specifications to its operating system, it set new product standards for the XR (extended reality) industry, which had been in an exploratory phase in recent years. The industry hailed the launch as "the iPhone moment for XR." From the hyper-realistic virtual spaces of science fiction to Apple's breakthrough XR device, this report takes a deep dive into the XR industry and explores the following questions:

  • What do VR/AR/MR/XR mean, respectively?
  • How did the XR industry develop?
  • With the introduction of AI technology, why was the Vision Pro's launch hailed as "the iPhone moment for XR"?
  • What technical specs should you look at when choosing an XR device?
  • How far are we from the ideal XR experience?
  • What are the key technical challenges currently facing XR industry progress?
  • What entrepreneurial and investment opportunities exist in the XR industry?

If you're also following XR or building a startup at the frontier of technology, feel free to reach out to the author of this article: Qianhang Yan, Vice President at FreeS Fund (qianhang@freesvc.com).

Reader Giveaway

Once the technology matures, what do you think XR devices could help us do? What would VR have to achieve for you to open your wallet?

Share your thoughts in the comments. By 5:00 PM on October 25, the five most thoughtful commenters will each receive a copy of Klein Bottle by the legendary mystery-writing duo Futari Okajima. This suspense novel, written 30 years ago, envisioned a virtual reality gaming device. It currently holds a rating of 8.9 on Douban.

/ 01 / XR: From Fantasy to Reality

Over 30 years ago, in a letter to Wang Chengwei, then head of the intelligent computer expert group under China's "863 Program," Qian Xuesen proposed translating "Virtual Reality" as lingjing (灵境). In subsequent correspondence, he predicted that VR technology would become a pathway for human-machine integration and the evolution of human society.

▲ Image source: Qian Xuesen Library, Shanghai Jiao Tong University

The Matrix trilogy, released at the turn of the century, showed audiences the interweaving of illusion and reality within the Matrix's virtual space. In the films, people enter fully simulated virtual environments through brain-computer interfaces like the "jack in the back of the skull" — a vision that aligns remarkably closely with Qian's vision of lingjing as a future of "human-machine fusion."

From science fiction back to reality: in 2012, the startup Oculus completed its first crowdfunding campaign on Kickstarter, bringing VR devices into the public consciousness. Oculus was later acquired by Facebook (now Meta). In 2016, often called "VR year zero," the Oculus Rift launched, and its pairing of a head-mounted display with motion controllers became the dominant form factor in the VR industry. HTC, Sony, and Chinese player Pico all entered the market with their own products.

▲ The Oculus Rift CV1, released in 2016. Image source: Touchdesigner

When discussing virtual reality, we can't avoid the concepts of AR, VR, MR, and XR. What exactly do they mean, and how do they differ?

AR stands for Augmented Reality. Unlike VR, AR overlays virtual content onto the physical world. In 2012, Google released a prototype of Google Glass, an early commercial exploration of consumer-grade AR devices.

▲ At the 2012 Google I/O conference, a skydiver prepares to jump wearing Google Glass. Image source: ifanr

MR (Mixed Reality) was originally introduced through research by scientists Paul Milgram and Fumio Kishino. In 2015, Microsoft's unveiling of HoloLens made the MR concept more concrete, emphasizing the blending of, and interaction between, the virtual and real worlds.

▲ HoloLens 1. Image source: Microsoft

MR may sound similar to both AR and VR. The diagram below offers an intuitive comparison of the three.

▲ Image source: LinkedIn, "The Difference between AR, VR & MR"

In 2017, Qualcomm proposed the concept of XR (extended reality), encompassing the spectrum of VR, AR, and MR as an umbrella term for virtual reality interaction technologies. Qualcomm subsequently launched dedicated XR platforms and foundational hardware platforms.

▲ In September 2023, Qualcomm launched the second-generation Snapdragon XR2 platform. Image source: Qualcomm

/ 02 / Apple Vision Pro: Technology and Experience

The Apple Vision Pro's debut was hailed by the industry as "the iPhone moment for XR." Although the Vision Pro has not yet officially launched for consumers, both media reviews and officially disclosed technical specifications have set a benchmark for the entire industry, demonstrating formidable technical and product capabilities while building considerable anticipation.

▲ Image source: Apple

In appearance, the Vision Pro resembles a diving mask. The frame curves gently around the face for a better fit, improving comfort. The Vision Pro implements XR functionality through VST (video see-through). Rather than looking directly through "lenses," users see the external environment captured by cameras and displayed on screens in front of their eyes with low latency.

For its system, the Vision Pro runs visionOS, which Apple calls "the first operating system designed for spatial computing." visionOS resembles iOS on the iPhone in some ways. Unlike the iPhone, however, its interface presents with a sense of three-dimensional space, enabling spatial manipulation through eyes, hands, and voice.

▲ Spatial operating system visionOS. Image source: Apple

On the hardware side, the Vision Pro has been described by industry insiders as "sparing no expense, pushing specs to the limit." The device packs 12 cameras, a LiDAR sensor, plus dual processors — an M2 chip and an R1 chip. The M2 handles graphics computation, while the R1 manages environmental awareness and hand gesture recognition. It uses Micro OLED displays with a combined 23 million pixels for both eyes, representing the highest display specifications in the industry. Through innovations in both software and hardware, the Vision Pro elevates the user experience — whether in an office or at home, the display panels and digital content in front of users can match their surroundings.

▲ Image source: Apple

Specifically, the Vision Pro reconstructs the previous XR experience from three angles: spatial audiovisuals, sensory interaction, and spatial OS with content ecosystem.

First, in spatial audiovisuals, the Vision Pro's screen reaches a pixel density of 3,386 PPI, with an estimated spatial resolution (pixels per degree) of roughly 35 — meaning about 35 pixels fill each 1° angle of the field of view. That's more pixels per eye than a 4K TV. Building on Apple's spatial audio technology developed for AirPods, users can pinpoint where sounds are coming from, creating a more precise spatial audiovisual experience.

▲ Image source: Apple

Second, for interaction, the Vision Pro employs two human-computer interaction designs: "eye-tracking focus interaction" and "hand gesture interaction" — no controllers or other external devices needed. This interaction paradigm, designed around human instincts, is called sensory interaction, or natural interaction.

▲ Eye-tracking focus interaction. Video source: Apple

Finally, for spatial OS and content ecosystem, Apple has always excelled at integrating hardware and software development, using software system innovations to craft better product experiences. This time, Apple introduced the spatial operating system visionOS. Going forward, more applications may plug into the visionOS ecosystem, becoming integral to the Apple product experience.

▲ Hand gesture interaction. Video source: Apple

/ 03 / Key Technical Elements of the Ideal XR Experience

Based on the information Apple has released and current media reviews, the Vision Pro has pushed the user-facing XR experience to new heights — whether through the visual experience delivered by Micro OLED, or the smooth interaction and application experience enabled by organic hardware-software integration. Moreover, the Vision Pro achieves seamless switching between VR and AR, representing a breakthrough innovation compared to VR headsets and AR glasses in recent years.

Of course, this experience upgrade relies on "extreme spec-stacking" in technology.

If we step beyond current technical capabilities, what would an ideal XR experience that meets user expectations look like?

▲ Image source: The New York Times Chinese edition

VR and AR target different experiences. The former pursues complete replacement of the real world with a virtual one; the latter pursues seamless blending of virtual and real worlds.

From a science fiction perspective, future VR might resemble the film Ready Player One, immersing users in an entirely new world that completely substitutes for reality. This requires fully simulated sensory experiences dominated by vision and hearing. Future AR, meanwhile, might resemble Iron Man, with users interacting with virtual panels in the real world — emphasizing seamless fusion between virtual and physical.

▲ Image source: China Youth Daily

To reach this ideal state, what technical specifications must next-generation XR achieve?

Next, we'll break down the technical metrics for evaluating XR based on user experience: first, visual spatial dimension metrics — single-frame image quality; second, visual temporal dimension metrics — dynamic response speed; third, interaction and wearability experience.

Visual Spatial Dimension

1. Spatial Resolution

Spatial resolution reflects the fineness of VR display rendering.

Smartphone screen technology has matured considerably, with display effects approaching the limits of human vision. But VR devices differ from phones — the screen sits only 3-5 centimeters from the eye, an extremely near-field display. Therefore, when evaluating single-frame image quality in VR, we typically use spatial resolution (Pixels Per Degree, PPD).

Spatial resolution refers to the number of pixels filling each 1° angle of the field of view, determined jointly by display resolution and the optical system. Ideally, more pixels per degree is always better.

▲ Image source: Paoka AI

Normal human vision resolves roughly 50 to 60 PPD. Below 50 PPD, users can perceive a visible grid pattern, the so-called screen-door effect. Current mainstream VR products sit around 19-25 PPD; for instance, the PICO 4 achieves 20.6 PPD, still far from the ideal value.

Apple's Vision Pro is estimated at around 35 PPD, already far ahead of other manufacturers' products, yet its spatial resolution would still need to rise by roughly 70% (from about 35 PPD to 60 PPD) to meet simulation-grade demands. So far, no consumer product has exceeded 40 PPD, and exceeding 60 PPD would require roughly 8K resolution per eye, or the equivalent of a 16K conventional screen across both eyes. Display chip technology therefore still has substantial room to advance.
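The PPD figures above can be sanity-checked with simple arithmetic. The sketch below uses the common approximation that average PPD equals horizontal pixels per eye divided by horizontal field of view; real optics distribute pixels unevenly across the lens, so treat the results as rough estimates. The panel width (2160 pixels) and field of view (105°) used for the PICO 4 are assumptions not stated in this report, and the function names are illustrative.

```python
def pixels_per_degree(horizontal_pixels: int, fov_degrees: float) -> float:
    """Average PPD: pixels spread evenly over the horizontal field of view."""
    return horizontal_pixels / fov_degrees


def pixels_needed(target_ppd: float, fov_degrees: float) -> int:
    """Horizontal pixels per eye needed to hit a target PPD at a given FOV."""
    return round(target_ppd * fov_degrees)


# Assumed PICO 4 figures (not from this report): 2160 px per eye, 105 degree FOV.
pico4_ppd = pixels_per_degree(2160, 105)
print(f"PICO 4 estimate: {pico4_ppd:.1f} PPD")

# Pixels per eye required for the 60 PPD upper bound of human vision.
for fov in (105, 120):
    print(f"60 PPD at {fov} degree FOV: {pixels_needed(60, fov)} px per eye")
```

Under these assumptions the estimate matches the 20.6 PPD quoted for the PICO 4, and a 60 PPD target at a 120° field of view needs about 7,200 horizontal pixels per eye, close to the width of an 8K panel.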

▲ Top-left, top-right, bottom-left, and bottom-right show display effects at spatial resolutions (PPD) of 10, 20, 30, and 40 respectively; image source: ANSYS-HFSS

2. Rendering Quality

Gamers will surely find "rendering quality" familiar — it directly impacts the immersion of game visuals. For XR, which centers on immersive experiences, rendering quality matters equally.

The four elements of rendering include material textures, geometric models, lighting, and sampling precision of rendered objects. The higher these parameters, the better the rendering quality, and the more realistic the virtual environment users perceive — driving higher immersion.

Rendering of virtual objects and environments relies on real-time rendering pipelines in computer graphics. The most advanced real-time rendering technology in the industry is the hybrid ray-tracing rendering pipeline.

Compare the rendering Meta showcased for its metaverse with the 2020 game Cyberpunk 2077, which already used the industry's most advanced hybrid ray-tracing rendering pipeline.

▲ Mark Zuckerberg's selfie in Meta's metaverse. Image source: Meta

▲ Cyberpunk 2077 visual effects (2020). Image source: Cyberpunk

From the user's perspective, insufficient rendering quality creates a massive gap between virtual environment experiences and reality, producing significant cognitive dissonance.

If we can already see exquisitely crafted film animation, why does rendering quality remain a technical hurdle for XR?

Film effects use offline rendering — each frame is rendered in advance with extensive time investment before reaching the screen. A single film may require months of rendering time and hundreds of millions in funding. XR, by contrast, demands real-time rendering: each rendered frame is pushed directly to the display, imposing extreme demands on GPU compute power and requiring high-performance graphics cards. Yet high-spec cards remain scarce, and mobile devices cannot support them. In the near term, rendering quality remains constrained by available compute power.
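A rough way to see the load real-time rendering puts on a GPU is to count how many pixels must be produced each second. The sketch below is a back-of-the-envelope estimate, not a benchmark: it combines the 23 million pixels quoted earlier for the Vision Pro with an assumed 90 Hz refresh rate, and compares that against a 4K TV (3840 × 2160) at an assumed 60 Hz. The function name is illustrative.

```python
def pixels_per_second(pixels_per_frame: int, frames_per_second: int) -> int:
    """Pixels the renderer must produce every second."""
    return pixels_per_frame * frames_per_second


# 23 million pixels across both eyes (from the report); 90 Hz is an assumption.
headset = pixels_per_second(23_000_000, 90)

# A 4K TV at an assumed 60 Hz, for comparison.
tv_4k = pixels_per_second(3840 * 2160, 60)

print(f"Headset: {headset / 1e9:.2f} billion pixels/s")
print(f"4K TV:   {tv_4k / 1e9:.2f} billion pixels/s")
print(f"Ratio:   {headset / tv_4k:.1f}x")
```

Every one of those pixels has to be shaded within a single frame interval, whereas an offline film renderer can spend far longer on each frame.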

Visual Temporal Dimension

The temporal dimension of vision is dynamic response speed. Whether frame rates are high enough and latency low enough directly determines whether displayed imagery appears smooth and fluid, and whether users experience dizziness or disorientation while wearing the device.

Frame rate represents how many times the image redraws per second, measured in Hz. Higher frame rates mean smoother, clearer dynamic content. Current mainstream VR headsets operate around 90 Hz, while phones and monitors have already progressed above 120 Hz.

Motion-to-photon (MTP) latency is the delay between a user's movement and the corresponding update on the display. It accumulates across input command processing, data transmission, and rendering.

Excessive latency easily triggers motion sickness. If the screen changes only 100 milliseconds after a user's eye, head, or body movement, the user will distinctly feel a drunken, dizzying sensation. Ideally, latency must stay below 20 milliseconds so that the delay goes unnoticed.
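The relationship between frame rate and latency can be made concrete with a small budget calculation. The sketch below converts a refresh rate into the time available per frame and sums a set of pipeline stages against the 20 ms threshold mentioned above. The stage names and their millisecond values are hypothetical placeholders chosen for illustration, not measurements from any device.

```python
MTP_BUDGET_MS = 20.0  # threshold cited in the text


def frame_time_ms(refresh_hz: float) -> float:
    """Time available to produce one frame at a given refresh rate."""
    return 1000.0 / refresh_hz


def motion_to_photon_ms(stages: dict[str, float]) -> float:
    """Total latency as the sum of sequential pipeline stages."""
    return sum(stages.values())


# Hypothetical stage latencies in milliseconds (illustrative only).
stages = {
    "sensor_sampling": 2.0,
    "tracking_and_input": 3.0,
    "rendering": frame_time_ms(90),  # one full frame at 90 Hz
    "display_scanout": 3.0,
}

total = motion_to_photon_ms(stages)
print(f"Frame time at 90 Hz: {frame_time_ms(90):.1f} ms")
print(f"Motion-to-photon:    {total:.1f} ms")
print("Within budget" if total <= MTP_BUDGET_MS else "Over budget")
```

With these placeholder numbers, rendering alone consumes more than half of the 20 ms budget, which is why higher refresh rates and shorter pipelines tend to go hand in hand.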

▲ The image illustrates latency caused by computing in XR devices. Source: ResearchGate

Interaction and Wearability

1. Natural Interaction

Current mainstream XR products still rely on external controllers as the primary interaction method, leaving considerable room for improvement in operational convenience and intuitive matching. For instance, when a user puts on a VR or AR headset and sees an object in front of them, their instinct is to reach out and grab it. But limited by the technical capabilities of most devices, users often feel they can't quite reach what's in front of them, or their hand passes straight through the image.

Natural interaction refers to users interacting with devices in natural, intuition-based ways. The most natural approaches combine hand gestures with eye tracking, plus voice interaction, since no additional controllers are needed. Vision Pro follows this natural interaction path.

Haptic sensors are another natural interaction solution. However, haptic sensing equipment remains relatively expensive and not yet stable enough. So during development, Apple abandoned the haptic sensor approach for the Vision Pro, opting instead for natural interaction that combines hand gestures, eye tracking, and voice. The image below shows some of the gesture operations supported by visionOS. For example, spreading the fingers apart or pinching them together zooms an image in or out.

▲ Apple Vision Pro interaction methods. Source: Apple

2. Device Weight Reduction

Wearability is the result of comprehensive product integration, and weight reduction remains the direction of product design.

Current VR devices average around 500g in weight. Despite substantial ergonomic investment in product design, extended wear still causes discomfort. In contrast, AR glasses have already achieved device weights under 200g.

Notably, weight reduction should be pursued only after the aforementioned technical indicators are adequately addressed. If weight reduction is pursued blindly at current technical levels, the overall user experience cannot be guaranteed.

▲ Apple Vision Pro. Source: Apple

/ 04 / Key Technical Challenges for XR Advancement

The biggest challenge facing the XR industry today is that users expect better product experiences than present technology can deliver. Unlike desktop devices, XR devices must rely on the limited computing power and display capabilities of mobile platforms while still drawing relatively high power.

From the perspective of technology venture investment, FreeS Fund has consistently focused on breakthroughs at key technical points and the explosive commercial opportunities they bring. After a wave of enthusiasm in 2016, the XR industry gradually settled into a set of commercially valuable enterprise (B2B) use cases, such as education and industrial inspection. On the consumer side, better product experiences depend on iteration in key technologies.

Having deconstructed the relevant technical indicators, we believe the critical R&D challenge lies in improving XR visual experiences. This involves not only hardware-side display modules and optical system solutions, but also depends on further development of real-time CG rendering technology at the software level. Additionally, how to improve motion-to-photon latency and enhance immersion remains a major pain point for current XR devices.

Next-Generation Display Module: Micro LED

The screens people see in XR devices are primarily composed of display modules. Currently, there are three main technical approaches: Fast-LCD, Micro OLED, and Micro LED.

▲ Source: Essence Securities Research Center

Fast-LCD has low manufacturing costs and has achieved mass production, but its response speed is relatively slow. Micro OLED features high pixel density and high contrast. However, OLED is an organic material prone to aging, and its brightness is considerably lower than LED.

Compared to Fast-LCD and Micro OLED, Micro LED may be the more ideal next-generation display module solution. According to Zhidx reporting, Apple has invested heavily in Micro LED development. In May 2014, Apple acquired LuxVue, a company focused on Micro LED business with multiple patents applicable to Apple devices.

Micro LED combines the advantages of both LCD and OLED, featuring high pixel density, high contrast, high resolution, and other characteristics. Micro LED brightness can theoretically exceed 100,000 nits, two orders of magnitude higher than Micro OLED. Moreover, because Micro LED is less susceptible to contamination during production, it has higher yield rates and longer lifespan.

Currently, full-color Micro LED remains in the research phase and has not achieved mass production. To manufacture full-color Micro LED, there are mainly two paths: tri-color integration and quantum dot color conversion. Both approaches are in early technical stages with different challenges.

The tri-color integration scheme demands extremely high micro-fabrication precision, requiring all monochromatic devices to maintain good consistency. In this scheme, red light devices suffer far lower luminous efficiency than green and blue devices when scaled down to micron or smaller sizes. The quantum dot color conversion approach can circumvent these fabrication difficulties by using relatively mature blue light devices as the light source, then employing quantum dot materials to convert to red and green light.

However, sidestepping the fabrication difficulties brings material challenges instead. Current quantum dot materials are prone to aging and have relatively short service lives. In addition, the color conversion layers in quantum dot devices sometimes leak light, causing crosstalk between neighboring pixels.

Due to its relatively simpler fabrication, quantum dot color conversion is currently the mainstream choice for most startup teams. But from an investment perspective, both technical routes remain in competition, awaiting a relatively more mature inflection point. Whichever route ultimately prevails will drive Micro LED technology toward industrialization. We will continue monitoring technological innovations in Micro LED-related processes, materials, and equipment.

Optical Design

XR device screens sit very close to the human eye, making it difficult for the eye to focus. Therefore, a unique optical system must be designed to conduct light through multiple transmissions in a compact space, enabling the eye to clearly focus on content displayed on the panel.

▲ VR device optical design solutions. Source: Zhidx

▲ AR diffractive waveguide solution. Source: Digilens

Currently in the VR field, Pancake ultra-short-throw folded optical path solutions have become the mainstream choice for new head-mounted devices. Presently, both Apple Vision Pro and ByteDance's Pico 4 adopt this approach. According to CLS reporting, Pancake addresses the pain point of overly bulky VR modules, reducing device thickness by approximately 50%.

In the AR field, diffractive waveguides are expected to become the mainstream optical design, offering a thin and light structure, a wide field of view (FOV), and high resolution. However, the technology is not yet mature: it suffers from chromatic dispersion (color shift caused by light refraction and scattering) and from light loss (brightness lost as light propagates).

Looking toward the frontier, metalens and microlens arrays represent the industry's anticipated next-generation XR optical technology paths. Metalens is a planar lens using metasurfaces — a series of subwavelength-thickness planar two-dimensional structured materials — to focus light, enabling flexible control of incident light amplitude, phase, polarization, and other parameters.

Microlens arrays are conceptually similar to metalenses. They are composed of arrays of micron-scale lenses and can achieve complex functions that are impossible with a traditional single optical lens. Intel developed a compact VR headset with a 180° FOV based on heterogeneous microlens array technology.

The advantage of both metalens and microlens array approaches is higher integration, enabling smaller, thinner, lighter optical systems to meet XR device display requirements. Though both solutions remain in early R&D stages, they have already demonstrated very strong application potential.

▲ Metalens concept illustration. Source: 36Kr, MetalenX

Real-Time Cloud Rendering

As mentioned earlier, mobile devices have limited computing power, inevitably constraining XR devices. If rendering could be moved to the "cloud" — that is, "real-time cloud rendering" — it might break through computing constraints and achieve high-quality real-time rendering. Currently, the most intuitive application of real-time cloud rendering is cloud gaming.

But real-time cloud rendering faces two technical difficulties.

First, real-time cloud rendering is currently limited to single-GPU configurations and does not yet support distributed computing — unlike AI large models, which natively support distributed computation.

Currently, the industry is exploring multi-GPU distributed rendering in the XR space to boost computing power, but progress has been slow. If distributed rendering achieves a breakthrough in the future, we might use VR devices the way we use AI-powered apps on smartphones today — experiencing computing performance far beyond what the mobile hardware itself could deliver.

Real-time cloud rendering's other challenge is that it adds a network-based video streaming layer, making latency control even harder. For managing network latency in video, optimized video codecs and asynchronous client-cloud rendering are the main directions of development.

Video codec optimization means replacing CPU and GPU processing with a set of cloud-dedicated, custom encoding and decoding chips during video streaming to reduce latency. This can bring the entire encode-decode process down to under one millisecond, compressing time at every step as much as possible.

Asynchronous client-cloud rendering means running frame rendering and terminal display in parallel. The cloud renders a base frame, while the terminal performs secondary calculations based on the user's position and pose to generate the real-time frame, thereby reducing latency.
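One way to picture the terminal's "secondary calculation" is as a reprojection step: the cloud-rendered base frame is shifted to match the head pose at display time. The sketch below is a deliberately simplified, one-axis version that handles only a change in yaw and treats the frame as a grid of pixel values; production systems reproject in 3D. The function name, the pixels-per-degree parameter, and the fill value for newly exposed pixels are all assumptions made for illustration.

```python
def reproject_yaw(frame: list[list[int]],
                  rendered_yaw_deg: float,
                  current_yaw_deg: float,
                  ppd: float,
                  fill: int = 0) -> list[list[int]]:
    """Shift a base frame horizontally to compensate for head rotation.

    Turning the head to the right (positive yaw change) moves scene
    content to the left on screen, and vice versa.
    """
    shift = round((current_yaw_deg - rendered_yaw_deg) * ppd)
    width = len(frame[0])
    output = []
    for row in frame:
        if abs(shift) >= width:
            # The pose moved so far that none of the base frame is reusable.
            output.append([fill] * width)
        elif shift > 0:
            # Content slides left; newly exposed pixels appear on the right.
            output.append(row[shift:] + [fill] * shift)
        elif shift < 0:
            # Content slides right; newly exposed pixels appear on the left.
            output.append([fill] * (-shift) + row[:shift])
        else:
            output.append(list(row))
    return output


# Base frame rendered in the cloud at yaw 0.0; the head has since turned 0.1 degrees.
base_frame = [[1, 2, 3, 4, 5, 6]]
print(reproject_yaw(base_frame, 0.0, 0.1, ppd=20))
```

The local shift costs far less than rendering a new frame, so the terminal can apply it just before display while the cloud works on the next base frame. The filled-in edge pixels show the trade-off: the larger the pose change since the base frame was rendered, the less of it can be reused.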

So progress in rendering technology creates opportunities not just in mobile hardware, such as network and data communication at more advanced nodes, but also in software, where new rendering techniques better suited to cloud rendering may emerge.

XR-Dedicated Co-Processors

Since the first half of 2023, the dual-chip solution of a main compute chip plus a co-processor chip has become the mainstream processor architecture in the XR industry. By having a dedicated XR co-processor chip assist the main chip, the two chips working together can effectively improve performance and computational efficiency.

▲ Image source: Apple

As a dedicated chip responsible for spatial computing and perception, the XR co-processor is analogous to the M-series motion coprocessors previously used in iPhones. It can handle complex functions such as fusing data from front-facing cameras and other sensors, SLAM mapping, spatial mapping, and hand tracking. The main processor chip then only needs to handle core logic operations.

In Apple Vision Pro, the R1 chip serves as the co-processor, handling data processing from 12 vision cameras, 5 sensors, and 6 microphones. It enables core spatial computing functions including pose tracking, eye tracking, 3D environmental perception, and hand tracking, with a dynamic latency of just 12 milliseconds.

▲ Image source: Semiconductor Industry Review (Author: Chen Wei, Qianxin Technology)

Currently, the tasks executed by co-processors remain relatively simple. In the future, more complex functions such as foveated rendering and binocular focal point rendering may gradually shift to the co-processor. We believe that with Apple's demonstration effect, the definition and capabilities of co-processors still have room to expand, addressing more XR-specific needs. The dual-chip architecture for the XR industry will become a mainstream trend going forward.
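Foveated rendering, mentioned above as a candidate for the co-processor, renders at full detail only where the eye is looking and at reduced detail in the periphery. The sketch below estimates how much shading work that could save. The angular thresholds, the resolution scales, and the 100° square field of view are hypothetical values chosen for illustration, and angular distance is approximated on a flat grid.

```python
import math


def resolution_scale(eccentricity_deg: float,
                     inner_deg: float = 10.0,
                     outer_deg: float = 30.0) -> float:
    """Linear resolution scale at a given angular distance from the gaze point."""
    if eccentricity_deg <= inner_deg:
        return 1.0   # foveal region: full resolution
    if eccentricity_deg <= outer_deg:
        return 0.5   # mid-periphery: half resolution per axis
    return 0.25      # far periphery: quarter resolution per axis


def relative_shading_cost(fov_deg: int,
                          gaze_x: float = 0.0,
                          gaze_y: float = 0.0) -> float:
    """Shaded pixels with foveation divided by shaded pixels without it.

    Samples the field of view once per square degree; pixel count scales
    with the square of the linear resolution scale.
    """
    half = fov_deg // 2
    total = 0.0
    samples = 0
    for x in range(-half, half):
        for y in range(-half, half):
            distance = math.hypot(x + 0.5 - gaze_x, y + 0.5 - gaze_y)
            total += resolution_scale(distance) ** 2
            samples += 1
    return total / samples


print(f"Relative shading cost: {relative_shading_cost(100):.2f}")
```

Because the gaze point moves constantly, the scale map has to be recomputed from eye-tracking data every frame, which may be why the text sees this workload as a natural fit for a sensor-facing co-processor.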


Conclusion

From the perspective of technology venture investment, FreeS Fund has always paid close attention to breakthroughs in key technology points and the explosive commercial opportunities they bring. Apple's Vision Pro has both set an example for the XR industry and will help attract more attention and entrepreneurial resources to the XR space.

For XR devices to achieve better product experiences, they cannot do without iterative advances in key technologies. As technology matures, XR devices can be applied to more scenarios, delivering tangible core value to more users.

We believe the key challenge in technology R&D lies in improving the visual and interactive experience of XR. This involves not only hardware-side display modules, optical systems, and interaction modules, but also depends on further development of CG real-time rendering technology at the software level. Additionally, how to reduce motion-to-photon latency and enhance immersion remains a major pain point for current XR devices.

We will continue to monitor breakthroughs in related hardware and software technologies such as Micro LED, novel optical systems, and real-time CG rendering. These innovations will help XR devices become the next generation of mobile computing devices, and may even create transformation opportunities on the scale of smartphones replacing traditional PCs.

Looking ahead, perhaps within a few years, users putting on XR devices will be able to experience all sorts of interesting new applications that XR brings — like Iron Man. These new applications will include not just productivity enhancers, but entertainment applications too. Maybe by then, instead of scrambling for tickets to a Jay Chou concert, we'll be able to immerse ourselves from our living room sofa and instantly teleport to a seat at Beijing's Bird's Nest Stadium through an XR device.


Engagement

Once the technology matures, what do you think XR devices could help us do? What level of VR capability would make you willing to spend money?

Share your thoughts on these two questions in the comments. By 5:00 PM on October 25, the five readers with the most thoughtful responses will each receive a copy of Klein Bottle by the legendary mystery-writing duo Futari Okajima. This suspense novel, written 30 years ago, envisioned a virtual reality gaming device. It currently holds a rating of 8.9 on Douban.
