Flickr30k
by Young et al. / University of Illinois · open-source · Last verified 2026-03-17
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images, each paired with five human-written captions. Models are evaluated on retrieval in both directions (image-to-text and text-to-image), with recall reported at ranks 1, 5, and 10.
http://shannon.cs.illinois.edu/DenotationGraph/
Grade: B+ (Good)
Adoption: A · Quality: A · Freshness: C+ · Citations: A · Engagement: F
Specifications
- License: Custom (research)
- Pricing: open-source
- Capabilities: evaluation, image-text-retrieval, visual-grounding
- Integrations: none listed
- Use Cases: model-evaluation, computer-vision, multimodal-ai
- API Available: No
- Evaluated Models: clip-vit-l-14, align, blip-2, internvl2
- Metrics: r1-image-to-text, r1-text-to-image, r5-image-to-text, r5-text-to-image
- Methodology: Evaluation uses a 1,000-image test set in which each image is paired with 5 captions. Retrieval is evaluated in both directions (image-to-text and text-to-image). R@K is the fraction of queries for which a correct item appears among the top-K retrieved results. The composite metric is the average recall across both directions.
- Last Run: 2025-12-05
- Tags: image-captioning, visual-grounding, retrieval, cross-modal, recall
- Added: 2026-03-17
- Completeness: 100%
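The R@K metric described under Methodology can be illustrated with a minimal sketch. The function name `recall_at_k`, the toy similarity scores, and the choice to average over every direction and K value are assumptions made for illustration; they are not taken from the benchmark's own tooling. The toy data uses 2 images and 4 captions rather than the 1,000-image test set, so K is limited to 1 and 2 here.

```python
from typing import Sequence, Set


def recall_at_k(scores: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Fraction of queries with at least one relevant candidate in the top k."""
    hits = 0
    for query_scores, query_relevant in zip(scores, relevant):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(query_scores)),
                        key=lambda j: query_scores[j], reverse=True)
        if any(j in query_relevant for j in ranked[:k]):
            hits += 1
    return hits / len(scores)


# Hypothetical toy data: rows are images, columns are captions.
# Captions 0-1 describe image 0; captions 2-3 describe image 1.
image_to_text = [
    [0.9, 0.2, 0.1, 0.8],
    [0.3, 0.4, 0.7, 0.6],
]
captions_of_image = [{0, 1}, {2, 3}]

# Text-to-image reuses the same scores, transposed so rows are captions.
text_to_image = [list(column) for column in zip(*image_to_text)]
image_of_caption = [{0}, {0}, {1}, {1}]

ks = (1, 2)
i2t = {k: recall_at_k(image_to_text, captions_of_image, k) for k in ks}
t2i = {k: recall_at_k(text_to_image, image_of_caption, k) for k in ks}

# Assumed composite: mean of all recall values across both directions.
composite = (sum(i2t.values()) + sum(t2i.values())) / (2 * len(ks))

print("image-to-text:", i2t)
print("text-to-image:", t2i)
print("composite:", composite)
```

In the image-to-text direction a query counts as a hit if any of the image's captions is retrieved, whereas each caption has exactly one correct image in the text-to-image direction. This is why image-to-text recall is typically the higher of the two at the same K.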
Index Score: 70.9
Adoption: 82
Quality: 83
Freshness: 56
Citations: 86
Engagement: 0