StyleTTS2
by Columbia University (Li et al.) · open-source · Last verified 2026-03-17
StyleTTS2 is an open-source text-to-speech model from Columbia University (Li et al.) that achieves human-level naturalness on the LJSpeech and VCTK benchmarks. It models speech styles as latent variables sampled with diffusion models, which enables zero-shot voice cloning and expressive synthesis. In several blind listening evaluations it surpassed commercial systems such as ElevenLabs, making it the highest-quality open-source TTS system at the time of publication.
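The style-diffusion idea above can be illustrated with a toy sketch. This is NOT the actual StyleTTS2 code: the dimensions, step schedule, and `fake_denoiser` stand-in are all hypothetical, chosen only to show the shape of the technique (sample a fixed-size style vector by iteratively denoising Gaussian noise, then condition the decoder on it).

```python
import random
import math

# Toy sketch of style diffusion (illustrative only, NOT StyleTTS2 itself).
# StyleTTS2 samples a fixed-size "style vector" with a diffusion model and
# conditions its speech decoder on it; here the learned score network is
# replaced by a stand-in that nudges noise toward a hypothetical target.

STYLE_DIM = 8    # hypothetical style-vector size
NUM_STEPS = 50   # hypothetical number of reverse-diffusion steps

def fake_denoiser(x, t, target):
    """Stand-in for the learned score network: points from x toward target."""
    return [tx - xi for xi, tx in zip(x, target)]

def sample_style(target, steps=NUM_STEPS, seed=0):
    """Run a crude reverse-diffusion loop from pure noise to a style vector."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(STYLE_DIM)]  # start from noise
    for t in range(steps, 0, -1):
        step = 1.0 / steps
        direction = fake_denoiser(x, t, target)
        noise_scale = math.sqrt(step) * (t / steps)       # anneal noise to 0
        x = [xi + step * di + noise_scale * rng.gauss(0.0, 1.0)
             for xi, di in zip(x, direction)]
    return x

target_style = [0.5] * STYLE_DIM  # stand-in for a reference-audio embedding
style = sample_style(target_style)
```

In the real model the denoiser is a network trained on speech, so sampling with different noise seeds (and no reference audio at all) yields diverse but natural prosody; conditioning it on a reference recording is what gives zero-shot voice cloning.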
https://styletts2.github.io

Overall grade: C+ (Average)
- Adoption: C
- Quality: A+
- Freshness: B+
- Citations: B+
- Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: text-to-speech, zero-shot-voice-cloning, style-diffusion, human-level-naturalness, prosody-control
- Integrations: huggingface, local-inference
- Use Cases: high-quality-tts, voice-cloning, research, audiobook-narration, creative-audio
- API Available: Yes
- Parameters: ~150M
- Context Window: N/A
- Modalities: text, audio
- Training Cutoff: 2023
- Tags: text-to-speech, style-diffusion, zero-shot, open-source, human-level
- Added: 2026-03-17
- Completeness: 100%
Index Score: 56.6
- Adoption: 48
- Quality: 93
- Freshness: 70
- Citations: 75
- Engagement: 0