StyleTTS2
by Columbia University (Li et al.) · open-source · Last verified 2026-03-17
StyleTTS2 is an open-source text-to-speech model from Columbia University (Li et al.) that achieves human-level naturalness on the LJSpeech and VCTK benchmarks. It models speech styles as latent variables sampled with diffusion models, which enables zero-shot voice cloning and expressive synthesis. In several blind listening evaluations it surpassed commercial systems such as ElevenLabs, making it the highest-quality open-source TTS system at the time of publication.
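The style-diffusion idea above can be illustrated with a toy sketch. This is NOT the actual StyleTTS2 code: the dimensions, step schedule, and `fake_denoiser` stand-in are all hypothetical, chosen only to show the shape of the technique (sample a fixed-size style vector by iteratively denoising Gaussian noise, then condition the decoder on it).

```python
import random
import math

# Toy sketch of style diffusion (illustrative only, NOT StyleTTS2 itself).
# StyleTTS2 samples a fixed-size "style vector" with a diffusion model and
# conditions its speech decoder on it; here the learned score network is
# replaced by a stand-in that nudges noise toward a hypothetical target.

STYLE_DIM = 8    # hypothetical style-vector size
NUM_STEPS = 50   # hypothetical number of reverse-diffusion steps

def fake_denoiser(x, t, target):
    """Stand-in for the learned score network: points from x toward target."""
    return [tx - xi for xi, tx in zip(x, target)]

def sample_style(target, steps=NUM_STEPS, seed=0):
    """Run a crude reverse-diffusion loop from pure noise to a style vector."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(STYLE_DIM)]  # start from noise
    for t in range(steps, 0, -1):
        step = 1.0 / steps
        direction = fake_denoiser(x, t, target)
        noise_scale = math.sqrt(step) * (t / steps)       # anneal noise to 0
        x = [xi + step * di + noise_scale * rng.gauss(0.0, 1.0)
             for xi, di in zip(x, direction)]
    return x

target_style = [0.5] * STYLE_DIM  # stand-in for a reference-audio embedding
style = sample_style(target_style)
```

In the real model the denoiser is a network trained on speech, so sampling with different noise seeds (and no reference audio at all) yields diverse but natural prosody; conditioning it on a reference recording is what gives zero-shot voice cloning.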
https://styletts2.github.io

Overall grade: C+ (Average)
- Adoption: C
- Quality: A+
- Freshness: B+
- Citations: B+
- Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: text-to-speech, zero-shot-voice-cloning, style-diffusion, human-level-naturalness, prosody-control
- Integrations: huggingface, local-inference
- Use Cases: high-quality-tts, voice-cloning, research, audiobook-narration, creative-audio
- API Available: Yes
- Parameters: ~150M
- Context Window: N/A
- Modalities: text, audio
- Training Cutoff: 2023
- Tags: text-to-speech, style-diffusion, zero-shot, open-source, human-level
- Added: 2026-03-17
- Completeness: 100%
Index Score: 56.6
- Adoption: 48
- Quality: 93
- Freshness: 70
- Citations: 75
- Engagement: 0