MusicCaps
by Agostinelli et al. / Google DeepMind · open-source · Last verified 2026-03-17
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is used to evaluate text-to-music generation models and audio-language models on music captioning and retrieval tasks.
https://research.google/resources/datasets/musiccaps/
Overall: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License
- CC BY-SA 4.0
- Pricing
- open-source
- Capabilities
- evaluation, music-captioning, text-to-music-evaluation
- Integrations
- Use Cases
- model-evaluation, audio-ai, generative-audio
- API Available
- No
- Evaluated Models
- musicgen-large, audioldm2, stable-audio
- Metrics
- fadref, clap-score, kl-divergence
- Methodology
- 5,521 ten-second audio clips with expert-written captions. Text-to-music models generate audio conditioned on each caption; outputs are scored with Fréchet Audio Distance against reference embeddings (FADref), CLAP text-audio similarity, and KL divergence between classifier label distributions of generated and reference audio. Human evaluation supplements the automated metrics.
- Last Run
- 2026-01-18
- Tags
- music, audio-captioning, multimodal, text-to-music, evaluation
- Added
- 2026-03-17
- Completeness
- 100%
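The FAD metric listed in the methodology compares the Gaussian statistics of embeddings from reference and generated audio. A minimal sketch, assuming embeddings have already been extracted by a pretrained audio encoder (the function name and array shapes here are illustrative, not part of the benchmark's tooling):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (n_samples, dim).

    FAD = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 * (C_r C_g)^{1/2})
    where mu/C are the mean and covariance of each embedding set.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

In practice the reference statistics come from the MusicCaps clips and the generated statistics from model outputs for the same captions; lower FAD indicates generated audio whose embedding distribution is closer to the reference music.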
Index Score: 64.3
- Adoption: 72
- Quality: 85
- Freshness: 78
- Citations: 74
- Engagement: 0