Speech & Audio AI
TTS, STT, voice cloning, and audio processing
31 entities in this channel
LibriSpeech
by OpenSLR / Johns Hopkins University
LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech derived from LibriVox audiobooks, split into "clean" training subsets of 100 and 360 hours and an "other" subset of 500 hours, with matching development and test sets. It has become the de facto standard benchmark for English ASR systems.
LibriSpeech
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on its test-clean and more challenging test-other splits drives competitive progress in ASR research.
Whisper V3
by OpenAI
OpenAI's flagship open-source automatic speech recognition model. Whisper large-v3 was trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled multilingual audio, supports about 100 languages with accuracy approaching human transcribers on several benchmarks, and includes translation, timestamping, and language detection capabilities.
AudioSet
by Google
Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.
Common Voice
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
VoxCeleb2
by Oxford Visual Geometry Group (VGG)
VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.
Common Voice 15
by Mozilla
Mozilla's Common Voice 15.0 is one of the largest publicly available multilingual speech corpora, containing roughly 19,000 hours of validated speech across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.
ElevenLabs Turbo v2.5
by ElevenLabs
ElevenLabs Turbo v2.5 is a low-latency multilingual text-to-speech model optimized for real-time conversational AI. It delivers sub-400ms time-to-first-audio across 32 languages while retaining the voice cloning fidelity ElevenLabs is known for, and powers AI assistant, customer service, and interactive voice applications where natural-sounding, real-time speech is critical.
Speech Recognition
by AaaS
Teaches integration and optimization of automatic speech recognition (ASR) systems — from Whisper to streaming cloud APIs — for agentic voice pipelines. Covers language identification, word error rate reduction, punctuation restoration, and handling noisy audio environments.
Speech-to-Text Pipeline
by OpenAI
Production-grade ASR pipeline using OpenAI Whisper or faster-whisper with VAD-based chunking, speaker timestamp alignment, and SRT/VTT subtitle export. Handles long-form audio via sliding window segmentation and automatic language detection.
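A minimal sketch of the transcription core such a pipeline implies, using faster-whisper's built-in Silero VAD filter and a hand-rolled SRT writer; the model size, file names, and VAD settings below are illustrative assumptions, not values prescribed by the entry.

```python
from faster_whisper import WhisperModel

# Assumptions: CUDA is available; "podcast.mp3" is a placeholder input.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter=True runs Silero VAD to skip silence between speech chunks;
# leaving language unset triggers automatic language detection.
segments, info = model.transcribe(
    "podcast.mp3",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int((t % 1) * 1000):03d}"

with open("podcast.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
                f"{seg.text.strip()}\n\n")
```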
ElevenLabs Conversational Agent
by ElevenLabs
ElevenLabs' conversational AI agent platform combining industry-leading voice synthesis with real-time dialogue capabilities. Supports 29+ languages, custom voice creation, and ultra-low-latency responses for natural phone and web interactions.
Google WaveNet
by Google / DeepMind
Google WaveNet is DeepMind's pioneering generative model for raw audio waveforms. Published in 2016, it dramatically advanced the state of the art in text-to-speech naturalness and went on to power Google Assistant, Google Cloud TTS, and other Google products at scale. Its autoregressive waveform generation established the template for neural vocoder research and inspired a generation of TTS architectures.
Suno V3.5
by Suno AI
Suno V3.5 is a text-to-song AI model that generates complete music tracks with vocals, instrumentation, and song structure directly from natural language prompts or custom lyrics. It supports a wide range of genres and styles and is widely regarded as one of the most accessible, highest-quality text-to-music systems for non-musicians.
GigaSpeech
by Seasalt.ai / SpeechColab
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.
Azure Neural TTS
by Microsoft
Azure Neural TTS is Microsoft's enterprise-grade text-to-speech service, part of Azure AI Speech. It provides 400+ natural-sounding voices across 140+ languages, with detailed prosody control via SSML. The service is designed for scalable applications, from accessibility tools to customer service bots.
MusicGen
by Meta AI
MusicGen is an open-source text-to-music model from Meta AI, trained on 20K hours of licensed music, that generates high-quality instrumental music from text descriptions. It can also be conditioned on a melody reference, providing a strong, controllable baseline for both research and commercial applications.
MusicCaps
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
MusicNet
by University of Washington
MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.
Vapi AI
by Vapi
Vapi AI is a developer-first platform for building and deploying real-time, conversational voice agents. It provides low-latency streaming, interruptible speech, and seamless integrations with various LLM, TTS, and STT providers. The platform is designed for developers to create sophisticated voice experiences with features like function calling and call analytics.
Speaker Diarization Script
by pyannote
This script automates the process of creating turn-by-turn transcripts from multi-speaker audio files. It first uses the pyannote.audio library to perform speaker diarization, identifying who spoke and when. These speaker segments are then aligned and merged with a transcription generated by OpenAI's Whisper, producing a final text output that attributes each line of dialogue to a specific speaker.
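The merge step the script describes can be sketched in a few lines. Everything below is an illustrative assumption rather than the script's actual code: the file name, model sizes, and the midpoint heuristic for attributing ASR segments to speakers; the gated pyannote pipeline also requires accepting its terms and supplying a Hugging Face token.

```python
import whisper
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"  # placeholder input file

# Diarization: who spoke when (requires a Hugging Face access token).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_...")
turns = [(t.start, t.end, spk)
         for t, _, spk in diarizer(AUDIO).itertracks(yield_label=True)]

# Transcription with segment-level timestamps.
result = whisper.load_model("medium").transcribe(AUDIO)

def speaker_at(ts: float) -> str:
    """Attribute a timestamp to the speaker whose turn contains it."""
    for start, end, spk in turns:
        if start <= ts <= end:
            return spk
    return "UNKNOWN"

# Midpoint heuristic: label each ASR segment by its temporal midpoint.
for seg in result["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    print(f"[{speaker_at(mid)}] {seg['text'].strip()}")
```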
Hume AI
by Hume AI
Hume AI offers a toolkit and APIs for building emotionally intelligent applications. It analyzes human expression across voice, face, and language to measure nuanced emotions. Its Empathic Voice Interface (EVI) enables conversational agents to adapt their tone and prosody in real time for more natural, empathetic interactions.
Bland AI
by Bland AI
Bland AI is an enterprise-grade AI phone agent platform designed for scalable inbound and outbound call automation. It features human-like conversational abilities, custom voice cloning, and dynamic call flows. The platform supports live call transfers and sentiment analysis to enhance customer interactions.
Voice Cloning Setup
by Coqui
Sets up a zero-shot voice cloning pipeline using Coqui XTTS-v2 or Tortoise-TTS, requiring only a few seconds of reference audio to synthesize new speech in the target voice. Includes a FastAPI inference server, audio quality normalization, and speaker embedding export for reuse.
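The XTTS-v2 half of such a setup reduces to a few lines with the Coqui TTS Python API; the reference clip and output paths below are placeholders, and the FastAPI server and embedding export the entry mentions are omitted.

```python
from TTS.api import TTS

# Load the multilingual XTTS-v2 checkpoint (downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Zero-shot cloning: condition synthesis on a short reference recording.
tts.tts_to_file(
    text="This sentence is synthesized in the cloned voice.",
    speaker_wav="reference.wav",  # placeholder reference clip
    language="en",
    file_path="cloned.wav",
)
```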
Speaker Diarization
by AaaS
Enables agents to segment audio recordings by speaker identity, answering 'who spoke when' for downstream summarization and analysis tasks. Covers embedding-based clustering (pyannote.audio, NeMo), overlapping speech handling, and merging diarization with ASR transcripts.
Voice Cloning
by AaaS
Teaches agents to synthesize speech in a target speaker's voice using few-shot and zero-shot voice cloning models, enabling personalized TTS experiences. Covers consent and ethical frameworks, reference audio quality requirements, model selection (ElevenLabs, Coqui XTTS, Tortoise), and anti-spoofing safeguards.
Retell AI
by Retell AI
Conversational voice AI platform purpose-built for call center automation. Delivers sub-second latency, natural turn-taking, and enterprise-grade reliability for handling large volumes of concurrent voice interactions.
Music Generation Script
by Meta AI
Generates royalty-free music from text prompts using Meta's MusicGen or AudioCraft, with controls for tempo, key, duration, and genre conditioning. Provides a CLI for batch generation and a streaming mode that writes 30-second chunks to disk or an S3 bucket.
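A minimal sketch of the generation core using Meta's audiocraft package; the prompt, model size, and output naming are illustrative assumptions, and the CLI and S3 plumbing the entry describes are left out.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# The small checkpoint keeps the example light; medium/large trade speed
# for quality.
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=30)  # seconds per clip

# Tempo, key, and genre are steered purely through the text prompt.
prompts = ["lo-fi hip hop beat, 80 bpm, warm Rhodes keys, vinyl crackle"]
wavs = model.generate(prompts)  # tensor of shape [batch, channels, samples]

for i, wav in enumerate(wavs):
    # Loudness normalization on write, per audiocraft's recommended strategy.
    audio_write(f"track_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```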
Audio Classification
by AaaS
Trains agents to categorize audio clips into predefined classes — from environmental sound detection to music genre labeling and anomaly alerting. Covers mel-spectrogram feature extraction, audio-specific transformers (AST, Wav2Vec2), and zero-shot audio classification with CLAP.
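The CLAP-based zero-shot route this entry names is a few lines with the transformers pipeline; the checkpoint and candidate labels below are illustrative assumptions.

```python
from transformers import pipeline

# LAION's CLAP checkpoint scores audio against arbitrary text labels.
classifier = pipeline("zero-shot-audio-classification",
                      model="laion/clap-htsat-unfused")

# No fine-tuning needed: the label set is supplied at inference time.
preds = classifier("clip.wav",  # placeholder audio file
                   candidate_labels=["dog barking", "glass breaking",
                                     "jazz music", "silence"])
print(preds)  # list of {"label": ..., "score": ...}, highest score first
```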
Audio Classification Setup
by Community
Configures an audio classification system using Audio Spectrogram Transformer (AST) or YAMNet fine-tuned on AudioSet, with Mel spectrogram feature extraction and batch inference. Exports per-clip predictions with top-5 class probabilities and integrates with a streaming event bus for real-time use.
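For the supervised AST route this setup describes, the transformers audio-classification pipeline handles mel spectrogram extraction internally; the checkpoint is a real AudioSet fine-tune, while the file path is a placeholder.

```python
from transformers import pipeline

# AST fine-tuned on AudioSet; the feature extractor computes the
# mel spectrogram from the raw waveform before inference.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# Top-5 AudioSet class probabilities for a single clip.
for pred in classifier("clip.wav", top_k=5):
    print(f"{pred['label']}: {pred['score']:.3f}")
```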
Music Generation Prompting
by AaaS
Covers structured prompting strategies for text-to-music models (MusicGen, Suno, Udio) to generate on-brand, mood-appropriate audio tracks at scale. Teaches tempo, key, instrumentation, and style descriptors alongside iterative regeneration and stem separation workflows.
Audio-Visual Alignment
by AaaS
Covers techniques for synchronizing and jointly representing audio and visual streams — from automatic lip-sync scoring and AV correspondence learning to temporal grounding of spoken words in video frames. Enables agents to build richer video understanding, dubbing validation, and accessibility captioning workflows.