MiniMax’s Speech-02 Voice Model Tops Two Leaderboards

MiniMax, a Shanghai-based AI unicorn, has taken first place on both Artificial Analysis and Hugging Face TTS Arena, two of the most authoritative international speech evaluation leaderboards, with its new-generation speech model, Speech-02.

[Image: Artificial Analysis speech evaluation leaderboard]

[Image: Hugging Face TTS Arena leaderboard]

On objective metrics, Speech-02 achieves state-of-the-art (SOTA) results in both Word Error Rate (WER) and Similarity (SIM). Subjective feedback from blind listening tests also indicates that listeners find the generated speech more natural and realistic. Specifically, in zero-shot voice cloning for both Chinese and English, Speech-02 achieves a lower WER than Seed-TTS, CosyVoice 2, and even real recorded audio, which indicates fewer pronunciation errors and clearer, more stable output. On SIM, Speech-02 significantly outperforms ElevenLabs’ multilingual_v2 model in all 24 tested languages, meaning its generated speech is closer to that of the real speaker.
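For readers unfamiliar with these two metrics, the sketch below shows how they are conventionally computed. It is not MiniMax's evaluation code, which the article does not describe. WER is taken here as the word-level edit distance between a reference transcript and the recognized transcript, divided by the number of reference words. SIM is assumed to be the cosine similarity between two speaker-embedding vectors, a common convention that the article does not confirm. The function names and the toy vectors are hypothetical, and the speaker-embedding model itself is out of scope. For Chinese, evaluations often count characters rather than whitespace-separated words, so the tokenization used here is also an assumption.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (assumed SIM metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Toy vectors standing in for real speaker embeddings.
    print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.65, 0.05]))
```

A lower WER means the synthesized audio is transcribed back to the intended text with fewer mistakes, and a higher SIM means the cloned voice sits closer to the reference speaker in embedding space.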

Notably, the commercial pricing of Speech-02 is only one-quarter that of ElevenLabs, the leading global provider. By combining high performance with strong cost-effectiveness, Speech-02 is helping drive the large-scale commercial adoption of Chinese AI speech technology.

This means that small and medium-sized companies no longer need to be held back by the high cost of speech technology. Industries such as intelligent customer service, voice interaction, and AI education can “take off on the spot”. MiniMax has already reached cooperation agreements with many Chinese enterprises, including China Reading Group and Gaotu Education, and its technology is also powering new applications in hardware scenarios such as AI toys and intelligent automotive cockpits.

In terms of “linguistic talent”, Speech-02 can switch seamlessly between 32 languages, including dialects and minority languages. Linda, the head of MiniMax’s overseas ecosystem, said, “The newly released Speech-02 can easily handle different accents and emotions in 32 languages. We believe that through AI, and with support for rare minority languages, multilingual voices will in the future be carried around the world with the most authentic local pronunciations, helping every language in the world to be heard and every culture to be understood.”

From Speech-01, which supported 17 languages at the beginning of the year, to Speech-02, which now covers 32, MiniMax has completed a double leap of “technical iteration + global implementation” in just a few months. Behind this “Chinese speed” are the twin engines of “technology + business” that drive Chinese AI enterprises: they can not only develop world-class technologies but also quickly turn them into tangible productivity.

From voice cloning to multilingual switching, and from technical leadership to affordable pricing, the arrival of Speech-02 is not only a victory for MiniMax but also another “breakthrough” for Chinese AI. Chinese companies are redefining industry rules through technical strength.

Chinatimenow