Benchmark · Speech & Audio AI · v1.0

MusicCaps

by Agostinelli et al. / Google DeepMind · free · Last verified 2026-03-17

MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.

https://research.google/resources/datasets/musiccaps/
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
CC BY-SA 4.0
Pricing
free
Capabilities
text-to-music-model-evaluation, music-captioning-model-training, music-retrieval-system-benchmarking, audio-language-model-finetuning, cross-modal-representation-learning, qualitative-analysis-of-music-perception, generative-audio-model-benchmarking
Integrations
Use Cases
API Available
No
Evaluated Models
musicgen-large, audioldm2, stable-audio
Metrics
fadref, clap-score, kl-divergence
Methodology
5,521 ten-second audio clips with expert captions. Text-to-music models generate audio from captions; evaluation uses Fréchet Audio Distance (FADref), CLAP similarity score, and KL divergence against reference audio features. Human evaluation supplements automated metrics.
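The Fréchet Audio Distance used in the methodology above fits a Gaussian to the embeddings of the reference clips and another to the embeddings of the generated clips, then measures the Fréchet distance between the two distributions. A minimal sketch, assuming embeddings have already been extracted by some audio encoder (e.g. VGGish or CLAP); the `frechet_audio_distance` helper name and array shapes are illustrative, not part of the MusicCaps release:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_ref, emb_gen: arrays of shape (n_clips, embedding_dim), one
    embedding per audio clip. Returns ||mu1 - mu2||^2 +
    Tr(S1 + S2 - 2 * sqrt(S1 @ S2)); lower is better, 0 for identical
    distributions.
    """
    mu1, mu2 = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    s1 = np.cov(emb_ref, rowvar=False)
    s2 = np.cov(emb_gen, rowvar=False)
    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

In a MusicCaps-style evaluation, `emb_ref` would come from the 5,521 reference clips and `emb_gen` from the model outputs generated from the same captions.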
Last Run
2026-01-18
Tags
music, audio-captioning, multimodal, text-to-music, evaluation, benchmark-dataset, audio-dataset, music-information-retrieval, audio-language-models, generative-audio
Added
2026-03-17
Completeness
0.9%

Index Score

64.3
Adoption
72
Quality
85
Freshness
78
Citations
74
Engagement
0
