GigaSpeech
by Seasalt.ai / SpeechColab · open-source · Last verified 2026-03-17
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for automatic speech recognition (ASR), sourced from audiobooks, podcasts, and YouTube and covering a broad range of topics and recording conditions. Its scale and diversity make it well suited to training robust ASR models that generalize across domains.
https://github.com/SpeechColab/GigaSpeech
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: automatic-speech-recognition, multi-domain-asr, robust-asr
- Integrations: HuggingFace Datasets, ESPnet, Kaldi
- Use Cases: model-training, benchmark, domain-robust-asr
- API Available: No
- Tags: ASR, large-scale, english, multi-domain, podcasts, audiobooks, youtube
- Added: 2026-03-17
- Completeness: 100%
Index Score: 67.7
- Adoption: 76
- Quality: 89
- Freshness: 78
- Citations: 78
- Engagement: 0