Benchmark · Speech & Audio AI · v1.0

MusicCaps

by Agostinelli et al. / Google DeepMind · open-source · Last verified 2026-03-17

MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is used to evaluate text-to-music generation models and audio-language models on music captioning and retrieval tasks.

https://research.google/resources/datasets/musiccaps/
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
CC BY-SA 4.0
Pricing
open-source
Capabilities
evaluation, music-captioning, text-to-music-evaluation
Integrations
—
Use Cases
model-evaluation, audio-ai, generative-audio
API Available
No
Evaluated Models
musicgen-large, audioldm2, stable-audio
Metrics
fadref, clap-score, kl-divergence
Methodology
5,521 ten-second audio clips with expert captions. Text-to-music models generate audio from captions; evaluation uses Fréchet Audio Distance (FADref), CLAP similarity score, and KL divergence against reference audio features. Human evaluation supplements automated metrics.
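In practice, Fréchet Audio Distance is computed between multivariate Gaussians fit to embeddings of reference and generated audio. As a minimal sketch of the underlying Fréchet (Wasserstein-2) distance, here is the one-dimensional case in pure Python; the function and helper names are illustrative, not part of any official MusicCaps tooling:

```python
import math

def fit_gaussian(xs):
    """Fit a 1-D Gaussian (mean, variance) to a list of embedding values."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def frechet_distance_1d(mu_r, var_r, mu_g, var_g):
    """Fréchet distance between two 1-D Gaussians:
    (mu_r - mu_g)^2 + var_r + var_g - 2*sqrt(var_r * var_g).
    Lower is better; identical distributions give 0."""
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)

# A distribution compared against itself scores zero;
# shifting the mean by 1 with equal unit variances scores 1.
mu, var = fit_gaussian([0.1, 0.4, 0.35, 0.8])
print(frechet_distance_1d(mu, var, mu, var))
print(frechet_distance_1d(0.0, 1.0, 1.0, 1.0))
```

The full FADref metric replaces the scalar terms with a mean-difference norm and a matrix square root of the covariance product, but the structure of the score is the same.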
Last Run
2026-01-18
Tags
music, audio-captioning, multimodal, text-to-music, evaluation
Added
2026-03-17
Completeness
100%

Index Score: 64.3
Adoption: 72
Quality: 85
Freshness: 78
Citations: 74
Engagement: 0