Benchmark · Computer Vision · v1.0

Flickr30k

by Young et al. / University of Illinois · open-source · Last verified 2026-03-17

Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images, each paired with five human-written captions. Models are evaluated on retrieval recall in both directions (image-to-text and text-to-image) at ranks 1, 5, and 10.

http://shannon.cs.illinois.edu/DenotationGraph/
B+ (Good)
Adoption: A · Quality: A · Freshness: C+ · Citations: A · Engagement: F

Specifications

License
Custom (research)
Pricing
open-source
Capabilities
evaluation, image-text-retrieval, visual-grounding
Integrations
Use Cases
model-evaluation, computer-vision, multimodal-ai
API Available
No
Evaluated Models
clip-vit-l-14, align, blip-2, internvl2
Metrics
r1-image-to-text, r1-text-to-image, r5-image-to-text, r5-text-to-image
Methodology
The test set contains 1,000 images, each paired with 5 captions (5,000 captions in total). Retrieval is evaluated in both directions: image-to-text and text-to-image. R@K is the fraction of queries for which a correct item appears among the top K retrieved results. The composite metric is the average recall across both directions.
Last Run
2025-12-05
Tags
image-captioning, visual-grounding, retrieval, cross-modal, recall
Added
2026-03-17
Completeness
100%
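
The R@K computation described under Methodology can be sketched as follows. This is a minimal illustration, not the benchmark's official evaluation script: the function name recall_at_k, the similarity-matrix layout (rows are images, columns are captions), and the toy scores are all assumptions made for the example. It also assumes the common convention that an image-to-text query counts as a hit when any one of the image's ground-truth captions is in the top K.

```python
from typing import List, Sequence, Tuple


def recall_at_k(
    sim: Sequence[Sequence[float]],
    caption_to_image: Sequence[int],
    k: int,
) -> Tuple[float, float]:
    """Return (image_to_text, text_to_image) recall@k as fractions in [0, 1].

    sim[i][j] is a (hypothetical) similarity score between image i and
    caption j; caption_to_image[j] is the index of the image that
    caption j describes.
    """
    n_images = len(sim)
    n_captions = len(caption_to_image)

    # Image-to-text: rank all captions for each image; a hit if ANY of the
    # image's own captions lands in the top k (assumed convention).
    i2t_hits = 0
    for i in range(n_images):
        ranked = sorted(range(n_captions), key=lambda j: sim[i][j], reverse=True)
        if any(caption_to_image[j] == i for j in ranked[:k]):
            i2t_hits += 1

    # Text-to-image: rank all images for each caption; a hit if the
    # caption's single ground-truth image lands in the top k.
    t2i_hits = 0
    for j in range(n_captions):
        ranked = sorted(range(n_images), key=lambda i: sim[i][j], reverse=True)
        if caption_to_image[j] in ranked[:k]:
            t2i_hits += 1

    return i2t_hits / n_images, t2i_hits / n_captions


# Toy data: 3 images with 2 captions each (Flickr30k uses 5 per image).
caption_to_image: List[int] = [0, 0, 1, 1, 2, 2]
sim: List[List[float]] = [
    [0.9, 0.2, 0.8, 0.1, 0.3, 0.0],  # image 0
    [0.1, 0.7, 0.6, 0.5, 0.2, 0.0],  # image 1
    [0.0, 0.1, 0.2, 0.3, 0.9, 0.8],  # image 2
]

i2t_r1, t2i_r1 = recall_at_k(sim, caption_to_image, k=1)
# Composite: average recall across the two directions at this K.
mean_r1 = (i2t_r1 + t2i_r1) / 2
print(f"R@1 image-to-text={i2t_r1:.3f} text-to-image={t2i_r1:.3f} mean={mean_r1:.3f}")
```

In practice the similarity matrix would come from a model's image and text embeddings; sorting every row is fine for a 1,000-image test set, though a top-K selection would be cheaper at larger scale.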

Index Score

70.9
Adoption
82
Quality
83
Freshness
56
Citations
86
Engagement
0
