Text-to-Speech
TTS
Text-to-Speech (TTS) refers to AI systems that convert written text into spoken audio. In the Chinese tech landscape covered by elsewhere, TTS has emerged as a competitive frontier where several startups and platforms are pushing capabilities forward.
MiniMax, a Yunqi Capital portfolio company, has drawn particular attention with its **MiniMax Speech 02** model released in May 2025, which ranked first on both the Artificial Analysis Speech Arena and Hugging Face TTS Arena benchmarks, surpassing OpenAI and ElevenLabs. The model supports 32 languages with zero-shot voice cloning, and Yunqi's reporting notes it undercuts ElevenLabs' pricing by half to a quarter while achieving better voice similarity scores across all tested languages. MiniMax's technical approach centers on a "learnable speaker encoder" that converts audio snippets into fixed-size conditional vectors, enabling flexible voice expression and cross-lingual transfer.
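To make the idea of mapping variable-length audio to a fixed-size vector concrete, here is a minimal sketch. It is not MiniMax's architecture: the frame size, embedding size, random projection, and mean pooling are all illustrative assumptions, and the names `init_weights` and `encode_speaker` are hypothetical. In a real system the projection weights would be learned during training rather than drawn at random.

```python
import random

FRAME_DIM = 4  # hypothetical size of one acoustic feature frame
EMBED_DIM = 3  # hypothetical size of the fixed speaker vector


def init_weights(seed: int = 0) -> list[list[float]]:
    """Randomly initialised projection; a real encoder would learn these weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(FRAME_DIM)] for _ in range(EMBED_DIM)]


def encode_speaker(frames: list[list[float]], weights: list[list[float]]) -> list[float]:
    """Project every frame, then mean-pool over time into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one frame")
    pooled = [0.0] * len(weights)
    for frame in frames:
        for i, row in enumerate(weights):
            pooled[i] += sum(w * x for w, x in zip(row, frame))
    return [value / len(frames) for value in pooled]


weights = init_weights()
short_clip = [[0.1, 0.2, 0.3, 0.4]] * 5   # 5 frames of made-up features
long_clip = [[0.4, 0.3, 0.2, 0.1]] * 50   # 50 frames of made-up features

# Both clips map to vectors of the same length, regardless of duration.
print(len(encode_speaker(short_clip, weights)), len(encode_speaker(long_clip, weights)))
```

Because the output length does not depend on how long the reference clip is, a downstream synthesizer can be conditioned on the vector in the same way for any speaker sample.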
Beyond MiniMax, elsewhere's corpus points to TTS as a building block in broader product architectures. Vozo AI's founder Changyin Zhou described speech cloning and TTS as one of several converging technologies (alongside speech recognition, translation, and lip-sync) that matured enough in 2023-2024 to enable viable video translation products. Roblox has announced TTS and STT tools for its UGC platform to let developers voice NPCs and accept voice commands. RebyteAI bundles TTS with six voice options into its "lobster" AI agents as a native capability requiring no API configuration. Me.bot, the AI social network backed by Linear Capital, is deploying a "Talks" feature where users' identity models speak in their own voice through distributed H5 pages.
On the open-source side, Muyan-TTS from 沐言智语 was released with claims of the fastest inference among open models, at 0.33 seconds of compute per second of generated speech (a real-time factor of 0.33), alongside competitive word-error rates.
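The speed figure above is a ratio of compute time to audio duration, often called the real-time factor. The sketch below only illustrates the arithmetic: the function names are hypothetical, and the 60-second clip is an arbitrary example rather than a measured result.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to generate a clip at a given real-time factor."""
    return audio_seconds * rtf


# At the quoted 0.33 figure, a 60-second clip would take about 19.8 seconds.
estimate = synthesis_time(60.0, 0.33)
print(f"{estimate:.1f} s")
```

Any value below 1.0 means speech is generated faster than it plays back, which is the threshold that matters for streaming use.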
The gaming sector illustrates both TTS's traction and its limits. Monolith Capital's mid-2025 analysis noted that major studios still prefer self-developed TTS for personalized voice acting despite the cost, because off-the-shelf products fall short of bespoke requirements; meanwhile AI-generated sound effects and music remain largely unusable for professional game projects.
AI-generated — may contain errors, please verify.
Coverage
AI Will Change Education, But Education Won't Be Only AI | MonoX
The bottleneck in learning is no longer knowledge itself, but rather willingness and focused attention.
Monolith Capital (Monolith砺思资本)

Vol.07 LLMs on Wheels: How "Attuned" Could Cars Get in 2025?
Yunqi Capital (云启资本)

2025 New Year Conversation with Koji: AI's Pivotal Year, the Dawn of the Agent Era | Seriously Speaking EP35
Stay optimistic about AI.
ZhenFund (真格基金)

Live from OpenAI's Developer Conference: GPT Dead Last? | Yunqi Capital Tech Talk
What's more important is...
Yunqi Capital (云启资本)



