Benchmark · Computer Vision · v1.0

Flickr30k

by Young et al. / University of Illinois · open-source · Last verified 2026-03-17

Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.

http://shannon.cs.illinois.edu/DenotationGraph/

C (Below Average)

Adoption: A · Quality: A · Freshness: C+ · Citations: F · Engagement: F

Specifications

License: Custom (research)
Pricing: open-source
Capabilities: evaluation, image-text-retrieval, visual-grounding
Integrations
Use Cases: model-evaluation, computer-vision, multimodal-ai
API Available: No
Evaluated Models: clip-vit-l-14, align, blip-2, internvl2
Metrics: r1-image-to-text, r1-text-to-image, r5-image-to-text, r5-text-to-image
Methodology: Uses the 1,000-image test set, with each image paired with 5 captions. Retrieval is evaluated in both directions (image-to-text and text-to-image). R@K is the fraction of queries for which a correct item appears in the top-K retrieved results. The composite metric is the average recall across the two directions.
Last Run: 2025-12-05
Tags: image-captioning, visual-grounding, retrieval, cross-modal, recall
Added: 2026-03-17
Completeness: 80%
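
The recall computation described under Methodology can be illustrated with a minimal sketch. It is not the benchmark's official scoring script. The function name recall_at_k, the similarity-matrix input, and the layout in which caption j belongs to image j // 5 are all assumptions made for illustration. The "mean-recall" entry averages over every K and both directions, which is one possible reading of the composite metric; the listing does not spell out the exact averaging.

```python
def recall_at_k(sim, caps_per_image=5, ks=(1, 5, 10)):
    """Bidirectional R@K from an image-by-caption similarity matrix.

    sim[i][j] is the model's score for image i and caption j.
    Assumed layout (hypothetical): caption j belongs to image
    j // caps_per_image.
    """
    n_images = len(sim)
    n_captions = len(sim[0])
    if n_captions != n_images * caps_per_image:
        raise ValueError("expected caps_per_image captions per image")
    owner = [j // caps_per_image for j in range(n_captions)]

    # Image-to-text: rank all captions for each image (best first) and
    # record the position of the best-placed correct caption.
    i2t_ranks = []
    for i, row in enumerate(sim):
        order = sorted(range(n_captions), key=lambda j: -row[j])
        i2t_ranks.append(min(r for r, j in enumerate(order) if owner[j] == i))

    # Text-to-image: rank all images for each caption and record the
    # position of the single correct image.
    t2i_ranks = []
    for j in range(n_captions):
        order = sorted(range(n_images), key=lambda i: -sim[i][j])
        t2i_ranks.append(order.index(owner[j]))

    results = {}
    for k in ks:
        # A query counts as a hit when a correct item sits at 0-based rank < k.
        results[f"r{k}-image-to-text"] = sum(r < k for r in i2t_ranks) / n_images
        results[f"r{k}-text-to-image"] = sum(r < k for r in t2i_ranks) / n_captions

    # Assumed composite: plain mean over all K values and both directions.
    results["mean-recall"] = sum(results.values()) / len(results)
    return results
```

With the 1,000-image test set, sim would have 1,000 rows and 5,000 columns. Image-to-text recall is averaged over 1,000 image queries and text-to-image recall over 5,000 caption queries.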

