Speech & Audio AI

TTS, STT, voice cloning, and audio processing

31 entities in this channel

Dataset · Speech & Audio AI

LibriSpeech

by OpenSLR / Johns Hopkins University

LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech derived from LibriVox audiobooks. The training data is split into "clean" subsets of 100 and 360 hours and an "other" subset of 500 hours, with dedicated development and test sets for each condition. It has become the de facto standard benchmark for English ASR systems.

automatic-speech-recognition · ASR · english
80.2 (A)
Benchmark · Speech & Audio AI

LibriSpeech

by Panayotov et al. / Johns Hopkins

LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16 kHz. Word Error Rate (WER) on the "clean" and more challenging "other" test splits drives competitive progress in ASR research.

asr · speech-recognition · english
79 (B+)
Model · Speech & Audio AI

Whisper V3

by OpenAI

OpenAI's state-of-the-art open-source automatic speech recognition model. The large-v3 version was trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled multilingual audio (the original Whisper used 680K hours). Supports 99 languages with near-human accuracy and includes translation, timestamping, and language detection capabilities.

speech-to-text · transcription · multilingual
77 (B+)
Dataset · Speech & Audio AI

AudioSet

by Google

Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.

audio-classification · sound-events · large-scale
73.7 (B+)
Benchmark · Speech & Audio AI

Common Voice

by Mozilla Foundation

Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.

asr · multilingual · crowdsourced
73.5 (B+)
Dataset · Speech & Audio AI

VoxCeleb2

by Oxford Visual Geometry Group (VGG)

VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.

speaker-verification · speaker-recognition · in-the-wild
73 (B+)
Dataset · Speech & Audio AI

Common Voice 15

by Mozilla

Mozilla's Common Voice 15.0 is the world's largest publicly available multilingual speech corpus, containing over 30,000 hours of validated speech data across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.

ASR · multilingual · crowdsourced
72.6 (B+)
Model · Speech & Audio AI

ElevenLabs Turbo v2.5

by ElevenLabs

ElevenLabs Turbo v2.5 is a low-latency multilingual text-to-speech model optimized for real-time conversational AI applications. It supports 32 languages and delivers sub-400 ms first-audio latency while maintaining the high voice-cloning fidelity ElevenLabs is known for. It powers a wide range of AI assistant, customer service, and interactive voice applications where natural-sounding, real-time speech is critical.

text-to-speech · voice-cloning · low-latency
72.4 (B+)
Skill · Speech & Audio AI

Speech Recognition

by AaaS

Teaches integration and optimization of automatic speech recognition (ASR) systems — from Whisper to streaming cloud APIs — for agentic voice pipelines. Covers language identification, word error rate reduction, punctuation restoration, and handling noisy audio environments.

asr · whisper · transcription
71.9 (B+)
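The word-error-rate metric this skill targets reduces to a word-level edit distance; a minimal, dependency-free sketch (not tied to any particular ASR toolkit):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Production toolkits normalize text (casing, punctuation) before scoring, which this sketch omits.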
Script · Speech & Audio AI

Speech-to-Text Pipeline

by OpenAI

Production-grade ASR pipeline using OpenAI Whisper or faster-whisper with VAD-based chunking, speaker timestamp alignment, and SRT/VTT subtitle export. Handles long-form audio via sliding window segmentation and automatic language detection.

speech-to-text · whisper · transcription
71.4 (B+)
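The sliding-window segmentation and SRT export this pipeline describes can be sketched with two small helpers (the 30 s window and 5 s overlap are illustrative assumptions, not values taken from the script itself):

```python
def sliding_windows(total_s: float, window_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) chunk boundaries covering a long recording,
    with overlapping windows so no speech is cut at a hard boundary."""
    step = window_s - overlap_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + window_s, total_s))
        start += step

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 75.5 -> '00:01:15,500'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

Each chunk would then be fed to Whisper (or faster-whisper) and overlapping transcripts deduplicated before subtitle export.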
Agent · Speech & Audio AI

ElevenLabs Conversational Agent

by ElevenLabs

ElevenLabs' conversational AI agent platform combining industry-leading voice synthesis with real-time dialogue capabilities. Supports 29+ languages, custom voice creation, and ultra-low-latency responses for natural phone and web interactions.

voice-agent · text-to-speech · voice-cloning
71.1 (B+)
Model · Speech & Audio AI

Google WaveNet

by Google / DeepMind

Google WaveNet is DeepMind's pioneering generative model for raw audio waveforms. It dramatically advanced the state of the art in text-to-speech naturalness when published in 2016, and it continues to power Google Assistant, Google Cloud TTS, and other Google products at massive scale. Its autoregressive waveform generation approach established the template for neural vocoder research and inspired a generation of TTS architectures.

text-to-speech · wavenet · google
70.5 (B+)
Model · Speech & Audio AI

Suno V3.5

by Suno AI

Suno V3.5 is a text-to-song AI model that generates complete, radio-quality music tracks with vocals, instrumentation, and song structure directly from natural language prompts or custom lyrics. It supports an enormous range of genres and styles and is widely regarded as one of the most accessible, highest-quality text-to-music systems for non-musicians.

music-generation · text-to-music · vocals
69.4 (B)
Dataset · Speech & Audio AI

GigaSpeech

by Seasalt.ai / SpeechColab

GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.

ASR · large-scale · english
67.7 (B)
Model · Speech & Audio AI

Azure Neural TTS

by Microsoft

Azure Neural TTS is Microsoft's enterprise-grade text-to-speech service, part of Azure AI Speech. It provides 400+ natural-sounding voices across 140+ languages, with detailed prosody control via SSML. The service is designed for scalable applications, from accessibility tools to customer service bots.

text-to-speech · neural-tts · azure-ai
67.2 (B)
Model · Speech & Audio AI

MusicGen

by Meta AI

MusicGen is an open-source text-to-music model from Meta AI that generates high-quality instrumental music from text descriptions. Trained on 20K hours of licensed music, it can also be conditioned on a melody reference, providing a strong, controllable baseline for both research and commercial applications.

music-generation · text-to-music · open-source
64.5 (B)
Benchmark · Speech & Audio AI

MusicCaps

by Agostinelli et al. / Google DeepMind

MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.

music · audio-captioning · multimodal
64.3 (B)
Dataset · Speech & Audio AI

MusicNet

by University of Washington

MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.

music · instrument-recognition · note-annotations
63.2 (B)
Agent · Speech & Audio AI

Vapi AI

by Vapi

Vapi AI is a developer-first platform for building and deploying real-time, conversational voice agents. It provides low-latency streaming, interruptible speech, and seamless integrations with various LLM, TTS, and STT providers. The platform is designed for developers to create sophisticated voice experiences with features like function calling and call analytics.

voice-ai · voice-agent · voice-api
62.6 (B)
Script · Speech & Audio AI

Speaker Diarization Script

by pyannote

This script automates the process of creating turn-by-turn transcripts from multi-speaker audio files. It first uses the pyannote.audio library to perform speaker diarization, identifying who spoke and when. These speaker segments are then aligned and merged with a transcription generated by OpenAI's Whisper, producing a final text output that attributes each line of dialogue to a specific speaker.

speaker-diarization · audio-processing · transcription
60.4 (B)
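The alignment step this script performs — attaching a diarizer's speaker segments to an ASR transcript — can be sketched as pure merge logic (a simplified illustration; the tuple shapes are assumptions, not pyannote's or Whisper's actual output types):

```python
def assign_speakers(words, segments):
    """Attach a speaker label to each timestamped word.

    words:    list of (start, end, text) tuples from an ASR pass (e.g. Whisper)
    segments: list of (start, end, speaker) tuples from a diarizer (e.g. pyannote)
    Each word is labeled with the speaker whose segment covers its midpoint.
    """
    labeled = []
    for w_start, w_end, text in words:
        mid = (w_start + w_end) / 2
        speaker = next((spk for s, e, spk in segments if s <= mid < e), "UNKNOWN")
        labeled.append((speaker, text))
    return labeled
```

Real pipelines also merge consecutive same-speaker words into turns and handle overlapping speech, which this sketch leaves out.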
Agent · Speech & Audio AI

Hume AI

by Hume AI

Hume AI offers a toolkit and APIs for building emotionally intelligent applications. It analyzes human expression across voice, face, and language to measure nuanced emotions. Its Empathic Voice Interface (EVI) enables conversational agents to adapt their tone and prosody in real-time for more natural, empathetic interactions.

emotion-ai · voice-agent · empathic-computing
59.6 (C+)
Agent · Speech & Audio AI

Bland AI

by Bland AI

Bland AI is an enterprise-grade AI phone agent platform designed for scalable inbound and outbound call automation. It features human-like conversational abilities, custom voice cloning, and dynamic call flows. The platform supports live call transfers and sentiment analysis to enhance customer interactions.

voice-agent · phone-calls · conversational-ai
58.7 (C+)
Script · Speech & Audio AI

Voice Cloning Setup

by Coqui

Sets up a zero-shot voice cloning pipeline using Coqui XTTS-v2 or Tortoise-TTS, requiring only a 3-second reference audio clip to synthesize new speech in the target voice. Includes a FastAPI inference server, audio quality normalization, and speaker embedding export for reuse.

voice-cloning · tts · coqui
57.4 (C+)
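The reference-audio normalization step mentioned above can be sketched as a simple peak normalizer (the -3 dBFS default is an assumption; cloning quality is sensitive to the reference clip's level and clipping):

```python
import numpy as np

def peak_normalize(audio: np.ndarray, target_dbfs: float = -3.0) -> np.ndarray:
    """Scale a float waveform so its peak amplitude sits at target_dbfs."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio  # silent clip: nothing to scale
    target_amp = 10 ** (target_dbfs / 20)  # dBFS -> linear amplitude
    return audio * (target_amp / peak)
```

The normalized array would then be written back to WAV before being passed to XTTS-v2 or Tortoise as the speaker reference.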
Skill · Speech & Audio AI

Speaker Diarization

by AaaS

Enables agents to segment audio recordings by speaker identity, answering 'who spoke when' for downstream summarization and analysis tasks. Covers embedding-based clustering (pyannote.audio, NeMo), overlapping speech handling, and merging diarization with ASR transcripts.

diarization · speaker-id · audio
57.4 (C+)
Skill · Speech & Audio AI

Voice Cloning

by AaaS

Teaches agents to synthesize speech in a target speaker's voice using few-shot and zero-shot voice cloning models, enabling personalized TTS experiences. Covers consent and ethical frameworks, reference audio quality requirements, model selection (ElevenLabs, Coqui XTTS, Tortoise), and anti-spoofing safeguards.

tts · voice-cloning · eleven-labs
57 (C+)
Agent · Speech & Audio AI

Retell AI

by Retell AI

Conversational voice AI platform purpose-built for call center automation. Delivers sub-second latency, natural turn-taking, and enterprise-grade reliability for handling millions of concurrent voice interactions.

voice-agent · conversational-ai · call-center
56.5 (C+)
Script · Speech & Audio AI

Music Generation Script

by Meta AI

Generates royalty-free music from text prompts using Meta's MusicGen or AudioCraft, with controls for tempo, key, duration, and genre conditioning. Provides a CLI for batch generation and a streaming mode that writes 30-second chunks to disk or an S3 bucket.

music-generation · audiocraft · musicgen
53.8 (C+)
Skill · Speech & Audio AI

Audio Classification

by AaaS

Trains agents to categorize audio clips into predefined classes — from environmental sound detection to music genre labeling and anomaly alerting. Covers mel-spectrogram feature extraction, audio-specific transformers (AST, Wav2Vec2), and zero-shot audio classification with CLAP.

audio · classification · sound-event-detection
52.6 (C+)
Script · Speech & Audio AI

Audio Classification Setup

by Community

Configures an audio classification system using Audio Spectrogram Transformer (AST) or YAMNet fine-tuned on AudioSet, with Mel spectrogram feature extraction and batch inference. Exports per-clip predictions with top-5 class probabilities and integrates with a streaming event bus for real-time use.

audio-classification · sound-events · ast
50.8 (C+)
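The per-clip top-5 export this setup describes is a softmax-and-sort over the classifier's logits; a minimal sketch independent of the AST or YAMNet model itself:

```python
import numpy as np

def top_k_predictions(logits: np.ndarray, labels: list[str], k: int = 5):
    """Softmax over class logits, returning the k most probable
    (label, probability) pairs in descending order."""
    exp = np.exp(logits - np.max(logits))  # shift for numerical stability
    probs = exp / exp.sum()
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in order]
```

For AudioSet this would be applied to a 527-way output head, with the resulting pairs serialized onto the event bus per clip.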
Skill · Speech & Audio AI

Music Generation Prompting

by AaaS

Covers structured prompting strategies for text-to-music models (MusicGen, Suno, Udio) to generate on-brand, mood-appropriate audio tracks at scale. Teaches tempo, key, instrumentation, and style descriptors alongside iterative regeneration and stem separation workflows.

music · generation · audiocraft
47.8 (C)
Skill · Speech & Audio AI

Audio-Visual Alignment

by AaaS

Covers techniques for synchronizing and jointly representing audio and visual streams — from automatic lip-sync scoring and AV correspondence learning to temporal grounding of spoken words in video frames. Enables agents to build richer video understanding, dubbing validation, and accessibility captioning workflows.

multimodal · av-sync · lip-sync
43.6 (C)
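As a toy illustration of the temporal-alignment idea behind this skill — assuming two 1-D feature streams sampled at a common rate (say, audio energy versus a mouth-openness score per frame) — the relative offset can be estimated from the cross-correlation peak:

```python
import numpy as np

def estimate_offset(a: np.ndarray, b: np.ndarray, rate_hz: float) -> float:
    """Estimate the temporal offset (in seconds) between two feature streams
    by locating the peak of their full cross-correlation.
    Negative result: events in `a` occur earlier than matching events in `b`."""
    corr = np.correlate(a, b, mode="full")
    lag = int(np.argmax(corr)) - (len(b) - 1)
    return lag / rate_hz
```

Real lip-sync scorers use learned audio-visual embeddings (e.g. SyncNet-style models) rather than raw cross-correlation, but the lag-search structure is the same.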