The Ultimate Core of Gaming + AI Is Fun | MonoX

Monolith砺思资本·July 25, 2025

What is the soul of a gaming experience?

In the current AI application layer, countless entrepreneurs have attempted or are currently pursuing the AI + gaming track. AI continues to push the boundaries of gaming — from endowing non-player characters (NPCs) with intelligence to automating the generation of massive amounts of content. Yet this track is no smooth road. Performance constraints, hallucination issues, player acceptance, and a series of other hurdles await those in the industry.

Recently, Monolith and Indie Light jointly organized the MonoX offline event "How AI Reshapes the Gaming Experience," bringing together over forty gaming industry guests in Shanghai to explore practical implementations of AI + gaming.

During the event, industry practitioners discussed:

Performance bottlenecks and solutions for integrating AI into games
Explorations of multimodal models in gaming
AI for game narrative creation
The current AI + gaming ecosystem: UGC platforms, players, and game companies

We've compiled the event's discussions and supplemented them with additional information, presenting them in article form. We hope you find it helpful.

Table of Contents:

Performance Challenges and Engineering Solutions
Multimodal Explorations in Gaming
AI in Narrative Design and Content Creation
The Current State and Future of the AI + Gaming Ecosystem
Making It For Fun

1. Performance Challenges and Engineering Solutions

The central tension in gaming + AI is performance.

Games require real-time responsiveness. Rendering and logic for each frame typically must complete within roughly 33 milliseconds (about 30 frames per second); otherwise, noticeable lag degrades the player experience. Current large language models (LLMs), however, often take several seconds for a single inference, far too slow for real-time NPC dialogue or decision-making. If an AI decision involves multi-step reasoning or multiple function calls, each additional step adds its own latency, so the total delay grows with the number of steps.
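
To make the arithmetic concrete, here is a minimal sketch. The 30 FPS figure follows from the roughly 33 ms budget above; the per-step latencies are made-up illustrative numbers, not measurements from any model.

```python
import math

FPS = 30  # matches the roughly 33 ms frame budget mentioned above

def total_latency_ms(step_latencies_ms):
    """Sequential reasoning steps or function calls add up."""
    return sum(step_latencies_ms)

def frames_spanned(latency_ms, fps=FPS):
    """Number of frames that elapse while waiting for the AI result."""
    return math.ceil(latency_ms * fps / 1000)

# Hypothetical three-step decision: plan, call a tool, phrase the reply.
steps_ms = [800, 1200, 900]
wait_ms = total_latency_ms(steps_ms)
print(wait_ms, "ms spans", frames_spanned(wait_ms), "frames")
```

Under these assumed numbers the wait covers dozens of frames, which suggests running such calls asynchronously rather than inside the frame loop.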

To address this, the industry's approach is to seek optimization from an engineering perspective rather than relying solely on hardware breakthroughs. Several practical strategies include:

Constrained generation: When an AI's decision space is limited and enumerable, the model can be forced to choose only from valid options. This method enhances performance while preventing unreasonable outputs.
Latency masking: For AI response delays that cannot be fully eliminated, game developers commonly use game mechanics to conceal them. Experience shows that delays under one second are barely perceptible; delays of one to 1.5 seconds can be masked through character transition animations, visual effects, or brief interactive moments.
New models: Adopt newly released high-performance models. For example, Google's Gemini 2.0 Flash reportedly achieves inference speeds of up to 600 tokens per second while maintaining solid intelligence levels, which makes previously infeasible scenarios possible. Such models are also often cost-effective: one practitioner reported sustaining continuous AI dialogue services for approximately 4 RMB per user per month.
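
The first two strategies can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular studio implements them: it assumes the model can return a score (for example a log-probability) per candidate action, and the function names, action names, and the "needs_other_handling" label are all hypothetical. The one-second and 1.5-second thresholds come from the text above.

```python
def pick_constrained(option_scores, valid_options, fallback):
    """Return the best-scoring option among the valid ones only."""
    candidates = {o: s for o, s in option_scores.items() if o in valid_options}
    if not candidates:
        return fallback
    return max(candidates, key=candidates.get)

def choose_mask(expected_delay_s):
    """Pick a masking technique from the thresholds quoted above."""
    if expected_delay_s < 1.0:
        return "none"                  # barely perceptible
    if expected_delay_s <= 1.5:
        return "transition_animation"  # or a visual effect / brief interaction
    return "needs_other_handling"      # beyond the range the text covers

# Hypothetical scores: the invalid action scores highest but is never chosen.
scores = {"attack": -0.5, "fly_to_moon": -0.1, "flee": -2.0}
print(pick_constrained(scores, {"attack", "flee", "talk"}, fallback="idle"))
print(choose_mask(1.2))
```

Restricting the choice to an enumerated set is what rules out the unreasonable output; the masking function only decides how to hide whatever delay remains.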


2. Multimodal Explorations in Gaming

Beyond text dialogue and decision-making, AI applications in games have expanded into multimodal generation, including images, voice, video, and sound effects.

2.1 Real-time scene generation remains unfeasible

One captivating vision in gaming involves using AI to generate game scenes in real time as players explore — employing diffusion models and similar generative algorithms to dynamically create the game world.

In practice, however, real-time diffusion-based scene generation is currently close to infeasible, largely because consistency is hard to maintain. If a player turns away and then returns to the same location, purely AI-generated scene details have often changed, making it hard to keep the world stable and physically coherent.

Real-time scene generation is therefore not yet viable for games in conventional genres. The only potentially applicable scenarios are those where "inconsistency" itself becomes a feature, such as dreamlike or surrealist-themed games, where unstable environments actually complement the atmosphere.

2.2 Latest advances in voice synthesis

High-quality voice synthesis technology is crucial for enhancing character immersion. Current state-of-the-art commercial TTS services (such as the latest version of ElevenLabs) already offer diverse voice options and allow emotional tone adjustment through prompts. These tools achieve excellent naturalness at the single-sentence level.


However, audio stability remains to be improved, particularly in multilingual support — sometimes performing perfectly in one language while degrading significantly in another. Additionally, when generating longer lines, models occasionally forget previous emotional settings, resulting in inconsistent emotional expression.

Current TTS still lacks deep contextual understanding and the ability to maintain emotional coherence. Yet relevant products are improving rapidly. Recently, some TTS services have added "context-aware" capabilities, automatically adjusting vocal emotions based on dialogue content or character state to better fit the scene.

For game development teams seeking customization, off-the-shelf products often fail to meet all requirements, leading many major studios to develop proprietary TTS solutions. While self-developed TTS requires substantial investment, it currently delivers better performance and fewer errors, satisfying the high demands for personalized voice acting in large-scale games.

2.3 Video generation cannot yet be directly used for game content creation

AI video generation has advanced rapidly in recent years and is approaching the threshold of being "usable" for game development. A wave of tools and models (such as Kling AI and Veo 3) can now generate high-quality short video clips from text or images.


However, current AI video generation still carries significant randomness. Creators often need repeated attempts, sometimes 10 or even 20 generations, before "drawing" a clip that is satisfactory in all respects. Different video models also excel at different subjects: some suit realistic styles, others animation, so creators need to understand the tools and select the appropriate one. In game development, these tools are currently used more for producing promotional trailers and prototyping cinematic cutscenes than for directly generating in-game content.

2.4 Sound effects and scoring: an awkward position in the near term

Compared to images and video, AI performance in sound effects and music lags considerably.

Current attempts to use AI for generating game sound effects often yield unsatisfactory results. Many auto-generated sound effects sound distorted and artificial, and they lack the polish of handcrafted work. On quantitative metrics, too, the gap between AI sound effects and human-made samples is significant. Most sound designers hold negative views toward AI sound effects: rather than spending time adjusting AI outputs, it's more efficient to select from existing libraries or have professionals record.

AI music scoring faces similar challenges. Tools like Suno can quickly generate music from text or humming, but these auto-composition tools lack fine-grained editing capabilities, making their output difficult to use as serious game scores. That lack of control is almost unacceptable for game music, which demands atmospheric and thematic coherence. Currently, AI scoring serves mostly entertainment or prototyping purposes. In professional game projects, music remains led by human composers, with AI unable to play a substantive role for now.

2.5 VLM applications in gaming

To help AI understand game environments, one could in theory use vision language models (VLMs) to analyze game screenshots or video frames and convert them into text descriptions for AI decision-making. In practice, however, this approach proves inefficient and costly.

Vision models need substantial compute to process high-resolution game footage, yet the information they extract is not necessarily accurate. Inferring spatial relationships and object states from raw pixels is extremely difficult and error-prone.

A more efficient approach leverages the game's own structured data. Developers can establish data tables and knowledge bases for important in-game elements, ensuring every NPC, item, and ability has clear descriptions. When AI needs to understand the current environment, it can directly call engine APIs to obtain structured information, then organize these query results into readable text or JSON format as prompts for the large model. This significantly improves decision accuracy and efficiency.
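
As a rough sketch of this idea, the snippet below turns structured entity records into a JSON prompt. The `Entity` fields, the `build_context_prompt` helper, and the sample data are all hypothetical; in a real project the records would come from the engine's own query APIs, which are not shown here.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Entity:
    id: str
    kind: str          # e.g. "npc", "item", "ability"
    description: str   # taken from the game's data tables
    distance_m: float  # distance from the acting character

def build_context_prompt(entities, task, max_entities=5):
    """Serialize the nearest entities as JSON and append the task."""
    nearest = sorted(entities, key=lambda e: e.distance_m)[:max_entities]
    payload = {"nearby": [asdict(e) for e in nearest]}
    state = json.dumps(payload, ensure_ascii=False, indent=2)
    return "Game state (JSON):\n" + state + "\n\nTask: " + task

# Hypothetical records standing in for engine query results.
world = [
    Entity("npc_1", "npc", "Blacksmith who sells swords", 12.0),
    Entity("item_9", "item", "Healing potion", 2.5),
]
print(build_context_prompt(world, "Decide what to do next."))
```

Capping the number of entities keeps token usage bounded, at the cost of hiding distant objects from the model.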

That said, VLMs do have their uses, for example in automated QA testing. Traditionally, game test scripts locate UI elements by screen coordinates and click them, which easily fails across different resolutions and devices. Visual recognition lets a testing AI understand on-screen text and icons and then operate according to instructions. This vision- and semantics-based approach overcomes the limitations of fixed coordinates, substantially improving testing stability and efficiency across devices and resolutions.
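
The difference from coordinate-based scripts can be illustrated with a small sketch. It assumes a vision model has already returned detections as dictionaries with a `text` label and a bounding box normalized to the 0..1 range; that output format and the `find_click_point` helper are assumptions for illustration, not the API of any specific tool.

```python
def find_click_point(detections, label, screen_w, screen_h):
    """Locate a UI element by its visible label and return a pixel position."""
    wanted = label.strip().lower()
    for det in detections:
        if det["text"].strip().lower() == wanted:
            x0, y0, x1, y1 = det["bbox"]  # normalized corners
            center_x = (x0 + x1) / 2 * screen_w
            center_y = (y0 + y1) / 2 * screen_h
            return round(center_x), round(center_y)
    return None  # element not visible on this screen

# Hypothetical detections for a title screen.
detections = [
    {"text": "Settings", "bbox": (0.0, 0.0, 0.25, 0.125)},
    {"text": "Start", "bbox": (0.25, 0.75, 0.75, 1.0)},
]
print(find_click_point(detections, "start", 1920, 1080))
print(find_click_point(detections, "start", 1280, 720))
```

Because the script asks for "Start" rather than a hard-coded pixel, the same test step works at both resolutions.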

3. AI in Narrative Design and Content Creation

Having AI participate in narrative design and content creation represents another major hotspot in gaming.

For instance, in open-world and sandbox games, people hope AI will bring richer, more varied narrative directions; in game production workflows, there's expectation that AI will automatically generate dialogue, quests, and other content to assist human designers. Yet the open-ended nature of AI-generated content introduces narrative uncontrollability that, if poorly managed, can invalidate a game's storytelling and undermine the player experience.

On this issue, the industry is exploring two main paths: system-level technical constraints and narrative design-level guidance.

3.1 System-level technical constraints

The first approach is rather radical but intriguing: making hallucinations reality.

That is, if the AI fabricates an element, the game engine simply generates the corresponding item or event in real time, incorporating the AI's unexpected creativity into the game world. The benefit is surprise and dynamism for players. The drawback is that, while occasional surprises are fun, letting AI continuously add new settings eventually makes the entire game world chaotic. Anyone taking this path must therefore set clear boundary conditions (e.g., important plot items cannot be AI-fabricated, or random events are limited to once per hour) to prevent the game from spiraling out of control.
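
The two example boundary conditions can be expressed as a simple gate in front of the engine. The class name, item names, and time handling below are hypothetical; this is only a sketch of the idea, with time passed in explicitly so the behavior is easy to test.

```python
class FabricationGate:
    """Decides whether an AI-invented element may become real in the game."""

    def __init__(self, protected_items, min_interval_s=3600.0):
        self.protected = set(protected_items)  # plot-critical, never fabricated
        self.min_interval_s = min_interval_s   # at most one acceptance per hour
        self._last_accepted_at = None

    def allow(self, item_name, now_s):
        if item_name in self.protected:
            return False
        if (self._last_accepted_at is not None
                and now_s - self._last_accepted_at < self.min_interval_s):
            return False
        self._last_accepted_at = now_s
        return True

gate = FabricationGate(protected_items={"royal_seal"})
print(gate.allow("royal_seal", now_s=0))    # protected plot item
print(gate.allow("rusty_key", now_s=10))    # first fabrication this hour
print(gate.allow("odd_hat", now_s=1000))    # too soon after the last one
```

Only fabrications that pass the gate would be handed to the engine for spawning; everything else is treated as ordinary dialogue.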

The second approach is more engineering-oriented: adopting a Mixture-of-Experts (MoE) architecture.

The MoE concept involves housing multiple "expert" sub-models within a large model framework, with each expert specializing in a different aspect. For example, one expert ensures protagonist dialogue aligns with character personality; another supervises worldbuilding consistency; yet another verifies accuracy in specific domain knowledge. During generation, a gating network dynamically adjusts the weight of each expert's contribution based on context. Through this collaboration, AI-generated content stays imaginative without producing obviously incongruous or immersion-breaking results.
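
A toy version of the gating step is sketched below. It works at the level of the description above: each hypothetical expert returns a score for a candidate line, and softmax-normalized gate logits weight those scores. In production MoE language models the routing happens inside the network, so treat this purely as an illustration; all numbers are invented.

```python
import math

def softmax(logits):
    """Turn raw gate logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_score(gate_logits, expert_scores):
    """Weighted combination of expert opinions on one candidate line."""
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))

# Hypothetical experts: personality, worldbuilding, domain knowledge.
gate_logits = [2.0, 0.5, 0.1]      # context says personality matters most
expert_scores = [0.9, 0.4, 0.7]    # each expert's rating of the candidate
print(round(gated_score(gate_logits, expert_scores), 3))
```

A candidate whose combined score falls below some threshold could be regenerated or rejected; choosing that threshold is a design decision outside this sketch.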

The third approach is using graph databases to assist with complex relationships.

In large open-world or role-playing games, NPCs may have intricate social relationships, while items and events form vast interconnected networks.

If all these relationships were crammed into the AI model as natural language for it to organize and understand, token consumption would be massive, and the model would easily "short-circuit" and hallucinate false associations. To address this, developers increasingly rely on specialized storage such as graph databases to manage relational data. When the AI needs relevant knowledge, the system first queries the graph database for precise structured answers, then passes the results to the AI in text form. This is more efficient and accurate than having the AI reason out all relationships from scratch, and it also prevents the model from arbitrarily fabricating connections.
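
The query-then-verbalize flow might look like the following. An in-memory dictionary stands in for the graph database here, so none of this reflects a real graph database API; the class, relation phrases, and character names are hypothetical.

```python
from collections import defaultdict

class RelationGraph:
    """Tiny in-memory stand-in for a graph database of game relationships."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, source, relation, target):
        self._edges[source].append((relation, target))

    def facts_about(self, entity):
        """Return only the stored facts, rendered as short sentences."""
        return [f"{entity} {relation} {target}."
                for relation, target in self._edges.get(entity, [])]

graph = RelationGraph()
graph.add("Mira", "is the sister of", "Tobin")
graph.add("Mira", "owes money to", "the Harbor Guild")

# Only these retrieved facts go into the prompt, not the whole world graph.
prompt_context = "\n".join(graph.facts_about("Mira"))
print(prompt_context)
```

An entity with no stored relations yields an empty list, which gives the model nothing to build false associations on.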

3.2 Narrative design-level guidance

The first narrative-level approach, one adopted by content creation teams like Disney and DreamWorks, is key node convergence.

That is, rather than strictly constraining the process, lock in several key events. Specifically, human writers pre-establish certain important nodes or climactic plot points in the story (e.g., "On July 7th, the protagonist must participate in a duel at a certain location"). Between these nodes, character and plot development can have considerable flexibility and freedom. But no matter how far the process diverges, each key event still triggers at its designated time, pulling the story back onto the main track. This keeps the overall structure and theme on course while preserving the creative surprises AI brings to details and side plots.
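
In code, key node convergence reduces to a schedule lookup that takes priority over free generation. The day numbering, the event text, and the `toy_generator` stand-in for an AI model are all hypothetical.

```python
def next_event(day, key_events, propose_free_event):
    """Key nodes override whatever the generator would have proposed."""
    if day in key_events:
        return key_events[day]
    return propose_free_event(day)

# Writer-authored node: whatever happened before, day 7 is the duel.
key_events = {7: "The protagonist fights the duel."}

def toy_generator(day):
    # Stand-in for an AI model producing free-form plot between nodes.
    return f"Free-form side plot for day {day}."

for day in (6, 7, 8):
    print(day, next_event(day, key_events, toy_generator))
```

The generator keeps full freedom on days 6 and 8, while day 7 always returns the authored event.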

A second narrative design approach is changing the "Why" rather than the "What."

That is, guiding AI to focus on generating the causes and emotional changes in a story rather than the event outcomes themselves. Players often care more about a sense of participation and narrative tension. Even before the AI era, Telltale Games (known for The Walking Dead) provided a model for interactive storytelling: regardless of player choices, the major plot events (What) remained largely unchanged, but those choices profoundly affected character relationships, motivations (Why), and emotional atmosphere.

Multiple games developed by Telltale Games

This gives players an experience of "my choices matter" without making the narrative tree infinitely branch and impossible to converge. When using AI for plot generation, emphasis can be placed on requiring AI to alter character attitudes, dialogue styles, emotional feedback, and other "soft changes" in response to player actions, while keeping key plot outcomes from derailing. This way, player experiences vary greatly (due to different emotional interactions), yet developers maintain control over the overall story direction.
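
A minimal sketch of such "soft changes" follows. The trust score, its thresholds, and the dialogue lines are invented for illustration; in practice the tone line would be where AI generation is applied, while the outcome stays authored.

```python
FIXED_OUTCOME = "The caravan leaves at dawn."  # the "What": never changes

def farewell_line(trust):
    """Vary the tone (the "Why") while the plot outcome stays fixed."""
    if trust >= 0.7:
        tone = "I wish you could come with us."
    elif trust >= 0.3:
        tone = "Take care of yourself."
    else:
        tone = "Don't expect us to wait for you."
    return f"{tone} {FIXED_OUTCOME}"

# Two players with different histories reach the same plot point.
print(farewell_line(0.9))
print(farewell_line(0.1))
```

Both players see the caravan leave, but the emotional framing reflects how they treated the character.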

3.3 Already viable directions

Industry practitioners have summarized several directions where AI-generated content has already proven viable in gaming:

1. Derivative and sandbox content: AI shows great promise for content expansion after the main storyline concludes, for example by opening an AI-driven sandbox world or releasing AI-generated mods once an RPG's campaign is complete. Even if the quality of such derivative content lags slightly behind the main campaign, its value lies in providing endless freshness.

2. Game development art assets: AI is already widely applied in game art asset production, particularly for repetitive, relatively standardized assets such as UI elements, small icons, and texture maps. Artists need only provide a few prompts for AI to batch-produce stylistically consistent icons, which then receive minor refinement before deployment.

3. Content marketing materials: This is one of the most important areas where game companies already use AI. Social media marketing and promotion often require massive quantities of short videos, posters, and copy. AI can generate creative short video scripts and art assets at scale based on game characters and narratives, which humans then curate and adjust before publishing on Douyin, Twitter, and other platforms. Many developers have already made AI copywriting and AI video generation standard pipeline components, using AI's high output to meet marketing's demands for content volume and frequency.

In summary, auxiliary content and repetitive assets are where AI creation shines. In these areas, human creators provide direction while AI handles "filling in" and "variations," effectively improving efficiency. For core narratives and character development — areas with extremely high creative demands — AI currently struggles to work independently but can serve in a supporting role.

4. The Current State and Future of the AI + Gaming Ecosystem

4.1 Game developers: Epic Games' practice

Epic Games recently showcased their latest achievements in integrating AI within the game engine ecosystem.

Most notably, they introduced a conversational AI character in Fortnite: first launching a speaking Darth Vader NPC that players could interact with through voice, with AI generating Darth Vader-style responses in real time based on player input.

Additionally, they announced a new tool called Persona Device, enabling all Fortnite creators to grant NPCs similar dialogue capabilities in custom maps. Through Unreal Editor for Fortnite (UEFN), creators can set personality backgrounds, knowledge scopes, dialogue styles, and more for NPCs, making them unique interactive characters.

Beyond in-game NPCs, Epic is also introducing AI at the developer tools level to improve content production efficiency. They demonstrated Epic Developer Assistant, an AI chatbot assistant integrated into UEFN. Developers writing Fortnite scripts can ask questions anytime, similar to Copilot. Through this tool, novice creators can more quickly get up to speed with complex game engine scripting, substantially lowering the barrier to creation.

Epic's moves represent a broader trend of game developers evolving toward the AI era: officially demonstrating AI applications in actual games (enhancing NPC experiences) and opening corresponding technologies and tools to the developer community, allowing AI to truly begin integrating into the existing ecosystem.

4.2 UGC platforms: The Roblox case

Roblox released a 3D generation model called "Cube" and open-sourced it.

Cube's first function is 3D mesh generation. Developers need only input a description for AI to automatically generate corresponding 3D models, which can then be imported into Roblox Studio for further refinement to meet game requirements. This greatly lowers the barrier for UGC creators to produce models. Moreover, Roblox released Cube's underlying model as open source, allowing any developer to fine-tune it on their own data or create plugins, integrating it into their own projects.

Beyond 3D model generation, Roblox announced three other AI features: text generation (for narrative dialogue, etc.), text-to-speech (TTS), and speech-to-text (STT). Text generation allows developers to add interactive dialogue to NPCs, enabling natural language communication between players and characters; TTS facilitates instant voice acting for games; STT allows players to control games by voice. These tools are planned for release within months, further enhancing UGC creation efficiency and expressiveness.

Roblox's strategy represents the open, embracing attitude of large UGC platforms toward AI. For such platforms, user creation activity is the lifeblood of the platform, and AI is precisely the tool to amplify creative energy.

4.3 Changes on the player side: personalization opportunities and new audiences

For players, the deeper application of AI in games will bring two impacts: upgraded experiences for existing core players, and the emergence of entirely new player demographics.

In the core player market, AI's most direct current value to players still manifests through developer cost reduction and efficiency gains — faster development cycles and lower costs promise richer content updates and more stable game quality. Truly compelling AI games that can open core players' wallets still await further improvements in AI content quality.

One highly anticipated direction is extreme personalization. Taking otome games as an example, under traditional models all players share the same male leads and storylines, inevitably limiting immersion. Through AI, each player could theoretically have a unique virtual lover. This would address the greatest pain point in such games, delivering unprecedented immersion. Of course, to achieve this, AI-generated content quality must be sufficiently high for players to genuinely recognize the character before them as vivid and fully realized, rather than a formulaic machine script.

That caveat matters because some player groups currently resist AI-generated content. Some core players actively search games for any trace of suspected AI creation and expose it on forums. They view games as artistic works; if any portion is AI-generated rather than made by designers, they consider it a deception of players. This reminds developers that transparency and quality are crucial for AI content acceptance. If AI traces are too heavy and the content is immediately recognizable as "machine-assembled," players easily lose immersion and resentment grows. Therefore, when presenting AI content to players, ensure its quality is at least convincing or sufficiently interesting; otherwise, it is better not to use it at all.

On the other hand, AI has the potential to open up entirely new markets for gaming, attracting people who previously did not play games much. Imagine a game-like app that users open when anxious, where an AI character guides breathing exercises, listens to their concerns, or even generates a small story to comfort them based on their emotional state. This hardly qualifies as a traditional game, yet it satisfies people's need for emotional value. Similar products are already emerging. Such products will reach broad user groups seeking emotional resonance, potentially becoming new growth points for the gaming industry.

4.4 Future outlook: From AI coding to dual-engine models

Looking ahead 3-5 years, industry consensus holds that truly scaled game content AIGC (AI Generated Content) won't be a dramatic scenario where an end-to-end super AI model suddenly emerges one day to replace existing engines and tools. More likely is gradual integration where AI becomes an assistant or even agent for game developers themselves.

One vision is "AIGC through AI coding": cultivating a super AI agent that learns to use various existing development tools and resources like human programmers to make games. This AI could execute logic through game engines, generate art assets through DCC software, deploy servers through API calls, and so on — essentially a full-stack engineer versed in development workflows.

Before fully automated creation, we may experience an AI-UGC phase. For example, game publishers could introduce AI to dynamically generate some content based on player input, while collecting extensive player preference data and interaction prompts to continuously fine-tune and cultivate an AI model specialized for that game. Over time, this model's understanding and generation capabilities for that game grow stronger, even reaching the point of taking over some design and art work.

Another noteworthy trend is the "dual-engine" AI architecture: future game AI systems will likely combine traditional structured logic (rule engines, physics engines, behavior trees, etc.) with large language models, each playing to its strengths. Structured logic ensures that the game world's consistency and hard rules remain unbroken, which is precisely where pure LLMs are weak, since they struggle to adhere strictly to rule systems. LLM generation, in turn, provides ever-changing dialogue, narrative, and other soft content. One can imagine future LLM frameworks gaining components or methods similar to ControlNet in image generation that impose constraints during generation, ensuring outputs meet game logic requirements.
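
One simple reading of the dual-engine idea is that the rule side vets every action the language model proposes. The sketch below assumes proposals arrive as dictionaries; the rule functions, field names, and fallback action are hypothetical, and it validates after generation rather than constraining the generation itself.

```python
def can_afford(action, state):
    """Hard rule: a character cannot spend gold it does not have."""
    return state["gold"] >= action.get("cost", 0)

def target_exists(action, state):
    """Hard rule: actions may only reference entities in the world."""
    return action.get("target") in state["entities"]

def resolve_action(proposal, state, rules, fallback):
    """Accept the LLM's proposal only if every structured rule passes."""
    if all(rule(proposal, state) for rule in rules):
        return proposal
    return fallback

state = {"gold": 10, "entities": {"merchant", "guard"}}
rules = [can_afford, target_exists]
fallback = {"type": "idle"}

# Hypothetical LLM output that breaks the gold rule.
too_expensive = {"type": "buy", "target": "merchant", "cost": 50}
print(resolve_action(too_expensive, state, rules, fallback))
```

The language model stays free to phrase dialogue however it likes, while anything that changes game state must pass the rule checks first.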

5. Making It For Fun

AI technology and the gaming industry are constantly evolving, with new frontier directions and experiments emerging all the time. But to judge whether a game AI project is worth pursuing, one entrepreneur suggested asking four questions:

1. Can it do what humans cannot do?

2. Can it be more reliable than humans (fewer errors, more stable)?

3. Can it do better than humans (higher quality, better experience)?

4. If none of the above, can it do it cheaper?

If an AI application shows significant advantage in any of the above aspects, it holds value; otherwise, it may be mere gimmickry.

Take AI-generated side quests as an example: humans can write these too, but AI may be cheaper and faster, which still has value. But if AI output is neither novel nor high-quality, and costs more, neither players nor the market will buy in. Entrepreneurs should guard against the tendency of "AI for AI's sake."

The core of games is fun, not realism or infinite possibility. Excessive pursuit of the latter easily makes games boring.

After all, players play games hoping to interact and resonate with carefully crafted worlds, narratives, and gameplay designed by creators. If everything is left to AI's random associations, content lacking careful polish struggles to touch people's hearts.

Design sensibility and pacing remain the soul of gaming experience; AI should serve them, not overwhelm them.