Product

Text-to-Speech

TTS

Text-to-Speech (TTS) refers to AI systems that convert written text into spoken audio. In the Chinese tech landscape covered by elsewhere, TTS has emerged as a competitive frontier where several startups and platforms compete on voice quality, voice cloning, price, and inference speed.

MiniMax, a Yunqi Capital portfolio company, has drawn particular attention with its **MiniMax Speech 02** model, released in May 2025, which ranked first on both the Artificial Analysis Speech Arena and the Hugging Face TTS Arena benchmarks, surpassing OpenAI and ElevenLabs. The model supports 32 languages with zero-shot voice cloning, and Yunqi's reporting notes it undercuts ElevenLabs' pricing by half to a quarter while achieving better voice similarity scores across all tested languages. MiniMax's technical approach centers on a "learnable speaker encoder" that converts audio snippets into fixed-size conditional vectors, enabling flexible voice expression and cross-lingual transfer.
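The idea of turning an audio snippet of any length into a fixed-size conditioning vector can be illustrated with a toy sketch. This is not MiniMax's architecture: the class `ToySpeakerEncoder`, the constants `EMBED_DIM` and `FRAME_LEN`, and the two hand-picked frame features are all hypothetical, and the projection weights are random rather than learned. The sketch only shows why pooling over time gives an output whose size is independent of clip length.

```python
import math
import random

EMBED_DIM = 8    # hypothetical size of the conditioning vector
FRAME_LEN = 160  # hypothetical frame length in samples


def frame_features(frame):
    """Return two simple per-frame features: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(frame) - 1)]


class ToySpeakerEncoder:
    """Maps audio of any length to a fixed-size vector (illustrative only)."""

    def __init__(self, embed_dim=EMBED_DIM, seed=0):
        rng = random.Random(seed)
        # A real encoder would learn these weights during training;
        # here they are random placeholders (assumption for illustration).
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(embed_dim)
        ]

    def encode(self, samples):
        # Split the clip into non-overlapping frames, dropping any short tail.
        frames = [
            samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)
        ]
        if not frames:
            raise ValueError("audio shorter than one frame")
        feats = [frame_features(f) for f in frames]
        # Mean-pool over time so the output size does not depend on clip length.
        pooled = [sum(col) / len(feats) for col in zip(*feats)]
        # Linear projection followed by tanh keeps each component in [-1, 1].
        return [
            math.tanh(sum(w * x for w, x in zip(row, pooled)))
            for row in self.weights
        ]
```

Encoding a 0.1-second clip and a 1-second clip with the same `ToySpeakerEncoder` instance yields two vectors of length `EMBED_DIM`, which is the property a downstream synthesizer would rely on when conditioning on a reference voice.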

Beyond MiniMax, elsewhere's corpus points to TTS as a building block in broader product architectures. Vozo AI's founder Changyin Zhou described speech cloning and TTS as one of several converging technologies (alongside speech recognition, translation, and lip-sync) that matured enough in 2023-2024 to enable viable video translation products. Roblox has announced TTS and speech-to-text (STT) tools for its UGC platform to let developers voice NPCs and accept voice commands. RebyteAI bundles TTS with six voice options into its "lobster" AI agents as a native capability requiring no API configuration. Me.bot, the AI social network backed by Linear Capital, is deploying a "Talks" feature where users' identity models speak in their own voice through distributed H5 pages.

On the open-source side, Muyan-TTS from 沐言智语 was released with claims of the fastest inference among open models, at 0.33 seconds of processing per second of generated speech, alongside competitive word-error rates.
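A figure quoted as processing time per second of generated audio is commonly called a real-time factor (RTF). The small sketch below works through what the claimed 0.33 would imply, assuming the ratio stays constant across clip lengths, which the source does not state. The function names are hypothetical.

```python
CLAIMED_RTF = 0.33  # seconds of processing per second of generated speech


def synthesis_time(audio_seconds, rtf=CLAIMED_RTF):
    """Estimated processing time for a clip, assuming a constant RTF."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf=CLAIMED_RTF):
    """How many times faster than playback speed the synthesis runs."""
    return 1.0 / rtf
```

Under that assumption, a 60-second clip would take about 19.8 seconds to generate, roughly three times faster than playback.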

The gaming sector illustrates both TTS's traction and its limits. Monolith Capital's mid-2025 analysis noted that major studios still prefer self-developed TTS for personalized voice acting despite the cost, because off-the-shelf products fall short of bespoke requirements. Meanwhile, AI-generated sound effects and music remain largely unusable for professional game projects.

AI-generated — may contain errors, please verify.
