Speech & Audio AI
TTS, STT, voice cloning, and audio processing
31 entities in this channel
LibriSpeech
by OpenSLR / Johns Hopkins University
LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech derived from LibriVox audiobooks, split into "clean" training subsets of 100 and 360 hours and an "other" subset of 500 hours, with matching development and test sets. It has become the de facto standard benchmark for English ASR systems.
LibriSpeech
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on its test-clean and more challenging test-other splits drives competitive progress in ASR research.
Whisper V3
by OpenAI
OpenAI's flagship open-source automatic speech recognition model. Whisper large-v3 was trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled multilingual audio, supports about 100 languages with accuracy approaching human transcribers on several benchmarks, and includes translation, timestamping, and language detection capabilities.
AudioSet
by Google
Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.
Common Voice
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
VoxCeleb2
by Oxford Visual Geometry Group (VGG)
VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.
Common Voice 15
by Mozilla
Mozilla's Common Voice 15.0 is one of the largest publicly available multilingual speech corpora, containing roughly 19,000 hours of validated speech across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.
ElevenLabs Turbo v2.5
by ElevenLabs
ElevenLabs Turbo v2.5 is a low-latency multilingual text-to-speech model optimized for real-time conversational AI. It delivers sub-400ms time-to-first-audio across 32 languages while retaining the voice cloning fidelity ElevenLabs is known for, and powers AI assistant, customer service, and interactive voice applications where natural-sounding, real-time speech is critical.
Speech Recognition
by AaaS
Teaches integration and optimization of automatic speech recognition (ASR) systems — from Whisper to streaming cloud APIs — for agentic voice pipelines. Covers language identification, word error rate reduction, punctuation restoration, and handling noisy audio environments.
Speech-to-Text Pipeline
by OpenAI
Production-grade ASR pipeline using OpenAI Whisper or faster-whisper with VAD-based chunking, speaker timestamp alignment, and SRT/VTT subtitle export. Handles long-form audio via sliding window segmentation and automatic language detection.
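A minimal sketch of the transcription core such a pipeline implies, using faster-whisper's built-in Silero VAD filter and a hand-rolled SRT writer; the model size, file names, and VAD settings below are illustrative assumptions, not values prescribed by the entry.

```python
from faster_whisper import WhisperModel

# Assumptions: CUDA is available; "podcast.mp3" is a placeholder input.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter=True runs Silero VAD to skip silence between speech chunks;
# leaving language unset triggers automatic language detection.
segments, info = model.transcribe(
    "podcast.mp3",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int((t % 1) * 1000):03d}"

with open("podcast.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
                f"{seg.text.strip()}\n\n")
```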
ElevenLabs Conversational Agent
by ElevenLabs
ElevenLabs' conversational AI agent platform combining industry-leading voice synthesis with real-time dialogue capabilities. Supports 29+ languages, custom voice creation, and ultra-low-latency responses for natural phone and web interactions.
Google WaveNet
by Google / DeepMind
Google WaveNet is DeepMind's pioneering generative model for raw audio waveforms. Published in 2016, it dramatically advanced the state of the art in text-to-speech naturalness and went on to power Google Assistant, Google Cloud TTS, and other Google products at scale. Its autoregressive waveform generation established the template for neural vocoder research and inspired a generation of TTS architectures.
Suno V3.5
by Suno AI
Suno V3.5 is a text-to-song AI model that generates complete music tracks with vocals, instrumentation, and song structure directly from natural language prompts or custom lyrics. It supports a wide range of genres and styles and is widely regarded as one of the most accessible, highest-quality text-to-music systems for non-musicians.
GigaSpeech
by Seasalt.ai / SpeechColab
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.
Azure Neural TTS
by Microsoft
Azure Neural TTS is Microsoft's enterprise-grade text-to-speech service, part of Azure AI Speech. It provides 400+ natural-sounding voices across 140+ languages, with detailed prosody control via SSML. The service is designed for scalable applications, from accessibility tools to customer service bots.
MusicGen
by Meta AI
MusicGen is an open-source text-to-music model from Meta AI, trained on 20K hours of licensed music, that generates high-quality instrumental music from text descriptions. It can also be conditioned on a melody reference, providing a strong, controllable baseline for both research and commercial applications.
MusicCaps
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
MusicNet
by University of Washington
MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.
Vapi AI
by Vapi
Vapi AI is a developer-first platform for building and deploying real-time, conversational voice agents. It provides low-latency streaming, interruptible speech, and seamless integrations with various LLM, TTS, and STT providers. The platform is designed for developers to create sophisticated voice experiences with features like function calling and call analytics.
Speaker Diarization Script
by pyannote
This script automates the process of creating turn-by-turn transcripts from multi-speaker audio files. It first uses the pyannote.audio library to perform speaker diarization, identifying who spoke and when. These speaker segments are then aligned and merged with a transcription generated by OpenAI's Whisper, producing a final text output that attributes each line of dialogue to a specific speaker.
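The merge step the script describes can be sketched in a few lines. Everything below is an illustrative assumption rather than the script's actual code: the file name, model sizes, and the midpoint heuristic for attributing ASR segments to speakers; the gated pyannote pipeline also requires accepting its terms and supplying a Hugging Face token.

```python
import whisper
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"  # placeholder input file

# Diarization: who spoke when (requires a Hugging Face access token).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_...")
turns = [(t.start, t.end, spk)
         for t, _, spk in diarizer(AUDIO).itertracks(yield_label=True)]

# Transcription with segment-level timestamps.
result = whisper.load_model("medium").transcribe(AUDIO)

def speaker_at(ts: float) -> str:
    """Attribute a timestamp to the speaker whose turn contains it."""
    for start, end, spk in turns:
        if start <= ts <= end:
            return spk
    return "UNKNOWN"

# Midpoint heuristic: label each ASR segment by its temporal midpoint.
for seg in result["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    print(f"[{speaker_at(mid)}] {seg['text'].strip()}")
```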
Hume AI
by Hume AI
Hume AI offers a toolkit and APIs for building emotionally intelligent applications. It analyzes human expression across voice, face, and language to measure nuanced emotions. Its Empathic Voice Interface (EVI) enables conversational agents to adapt their tone and prosody in real time for more natural, empathetic interactions.
Bland AI
by Bland AI
Bland AI is an enterprise-grade AI phone agent platform designed for scalable inbound and outbound call automation. It features human-like conversational abilities, custom voice cloning, and dynamic call flows. The platform supports live call transfers and sentiment analysis to enhance customer interactions.
Voice Cloning Setup
by Coqui
Sets up a zero-shot voice cloning pipeline using Coqui XTTS-v2 or Tortoise-TTS, requiring only a few seconds of reference audio to synthesize new speech in the target voice. Includes a FastAPI inference server, audio quality normalization, and speaker embedding export for reuse.
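The XTTS-v2 half of such a setup reduces to a few lines with the Coqui TTS Python API; the reference clip and output paths below are placeholders, and the FastAPI server and embedding export the entry mentions are omitted.

```python
from TTS.api import TTS

# Load the multilingual XTTS-v2 checkpoint (downloaded on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Zero-shot cloning: condition synthesis on a short reference recording.
tts.tts_to_file(
    text="This sentence is synthesized in the cloned voice.",
    speaker_wav="reference.wav",  # placeholder reference clip
    language="en",
    file_path="cloned.wav",
)
```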
Speaker Diarization
by AaaS
Enables agents to segment audio recordings by speaker identity, answering 'who spoke when' for downstream summarization and analysis tasks. Covers embedding-based clustering (pyannote.audio, NeMo), overlapping speech handling, and merging diarization with ASR transcripts.
Voice Cloning
by AaaS
Teaches agents to synthesize speech in a target speaker's voice using few-shot and zero-shot voice cloning models, enabling personalized TTS experiences. Covers consent and ethical frameworks, reference audio quality requirements, model selection (ElevenLabs, Coqui XTTS, Tortoise), and anti-spoofing safeguards.
Retell AI
by Retell AI
Conversational voice AI platform purpose-built for call center automation. Delivers sub-second latency, natural turn-taking, and enterprise-grade reliability for handling large volumes of concurrent voice interactions.
Music Generation Script
by Meta AI
Generates royalty-free music from text prompts using Meta's MusicGen or AudioCraft, with controls for tempo, key, duration, and genre conditioning. Provides a CLI for batch generation and a streaming mode that writes 30-second chunks to disk or an S3 bucket.
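A minimal sketch of the generation core using Meta's audiocraft package; the prompt, model size, and output naming are illustrative assumptions, and the CLI and S3 plumbing the entry describes are left out.

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# The small checkpoint keeps the example light; medium/large trade speed
# for quality.
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=30)  # seconds per clip

# Tempo, key, and genre are steered purely through the text prompt.
prompts = ["lo-fi hip hop beat, 80 bpm, warm Rhodes keys, vinyl crackle"]
wavs = model.generate(prompts)  # tensor of shape [batch, channels, samples]

for i, wav in enumerate(wavs):
    # Loudness normalization on write, per audiocraft's recommended strategy.
    audio_write(f"track_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```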
Audio Classification
by AaaS
Trains agents to categorize audio clips into predefined classes — from environmental sound detection to music genre labeling and anomaly alerting. Covers mel-spectrogram feature extraction, audio-specific transformers (AST, Wav2Vec2), and zero-shot audio classification with CLAP.
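The CLAP-based zero-shot route this entry names is a few lines with the transformers pipeline; the checkpoint and candidate labels below are illustrative assumptions.

```python
from transformers import pipeline

# LAION's CLAP checkpoint scores audio against arbitrary text labels.
classifier = pipeline("zero-shot-audio-classification",
                      model="laion/clap-htsat-unfused")

# No fine-tuning needed: the label set is supplied at inference time.
preds = classifier("clip.wav",  # placeholder audio file
                   candidate_labels=["dog barking", "glass breaking",
                                     "jazz music", "silence"])
print(preds)  # list of {"label": ..., "score": ...}, highest score first
```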
Audio Classification Setup
by Community
Configures an audio classification system using Audio Spectrogram Transformer (AST) or YAMNet fine-tuned on AudioSet, with Mel spectrogram feature extraction and batch inference. Exports per-clip predictions with top-5 class probabilities and integrates with a streaming event bus for real-time use.
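For the supervised AST route this setup describes, the transformers audio-classification pipeline handles mel spectrogram extraction internally; the checkpoint is a real AudioSet fine-tune, while the file path is a placeholder.

```python
from transformers import pipeline

# AST fine-tuned on AudioSet; the feature extractor computes the
# mel spectrogram from the raw waveform before inference.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# Top-5 AudioSet class probabilities for a single clip.
for pred in classifier("clip.wav", top_k=5):
    print(f"{pred['label']}: {pred['score']:.3f}")
```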
Music Generation Prompting
by AaaS
Covers structured prompting strategies for text-to-music models (MusicGen, Suno, Udio) to generate on-brand, mood-appropriate audio tracks at scale. Teaches tempo, key, instrumentation, and style descriptors alongside iterative regeneration and stem separation workflows.
Audio-Visual Alignment
by AaaS
Covers techniques for synchronizing and jointly representing audio and visual streams — from automatic lip-sync scoring and AV correspondence learning to temporal grounding of spoken words in video frames. Enables agents to build richer video understanding, dubbing validation, and accessibility captioning workflows.