MusicCaps
by Agostinelli et al. / Google DeepMind · free · Last verified 2026-03-17
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
https://research.google/resources/datasets/musiccaps/
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License
- CC BY-SA 4.0
- Pricing
- free
- Capabilities
- text-to-music-model-evaluation, music-captioning-model-training, music-retrieval-system-benchmarking, audio-language-model-finetuning, cross-modal-representation-learning, qualitative-analysis-of-music-perception, generative-audio-model-benchmarking
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- musicgen-large, audioldm2, stable-audio
- Metrics
- fadref, clap-score, kl-divergence
- Methodology
- 5,521 ten-second audio clips with expert-written captions. Text-to-music models generate audio conditioned on each caption; outputs are scored with Fréchet Audio Distance against reference audio embeddings (FADref), CLAP similarity between the generated audio and its caption, and KL divergence between classifier label distributions of generated and reference audio. Human listening studies supplement the automated metrics.
- Last Run
- 2026-01-18
- Tags
- music, audio-captioning, multimodal, text-to-music, evaluation, benchmark-dataset, audio-dataset, music-information-retrieval, audio-language-models, generative-audio
- Added
- 2026-03-17
- Completeness
- 0.9%
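The metrics named in the methodology can be illustrated with a small sketch. This is not the official evaluation code: real pipelines obtain embeddings from a pretrained CLAP model and label distributions from an audio classifier, both of which are stood in for by toy NumPy arrays here. The function names are illustrative, not from any released toolkit.

```python
import numpy as np
from scipy.linalg import sqrtm  # matrix square root, used in the Fréchet distance


def clap_style_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Cosine similarity between a caption embedding and an audio embedding.
    This is the core of a CLAP score (higher = audio matches the caption better)."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(t @ a)


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """KL(p || q) between classifier label distributions for reference and
    generated audio (lower = the two distributions agree more)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def frechet_distance(mu1, cov1, mu2, cov2) -> float:
    """Fréchet distance between two Gaussians fit to sets of audio embeddings,
    the quantity behind FAD (lower = generated audio statistics match the
    reference set)."""
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * covmean))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = rng.standard_normal(512)
    audio_emb = text_emb + 0.1 * rng.standard_normal(512)  # toy "well-matched" audio
    print("CLAP-style score:", round(clap_style_score(text_emb, audio_emb), 3))

    p = np.array([0.7, 0.2, 0.1])  # toy reference label distribution
    q = np.array([0.6, 0.3, 0.1])  # toy generated label distribution
    print("KL divergence:", round(kl_divergence(p, q), 4))
```

Identical inputs give the extreme values (similarity 1.0, divergences 0.0), which is a quick sanity check when wiring real CLAP embeddings or classifier outputs into these functions.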
Index Score: 64.3
- Adoption: 72
- Quality: 85
- Freshness: 78
- Citations: 74
- Engagement: 0