Dataset · Computer Vision · v2.0

WebVid-10M

by University of Oxford · open-source · Last verified 2026-03-17

A large-scale dataset of 10.7 million video-text pairs scraped from the web, where stock video clips are paired with their descriptive alt-text captions. WebVid-10M is the primary pretraining corpus for video-language models such as Frozen-in-Time and VideoCLIP, enabling open research into video understanding, temporal reasoning, and video-to-text generation.
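Since the dataset is distributed as video-text pairs (a clip URL matched with its alt-text caption), a loader is little more than CSV parsing. The sketch below is a minimal illustration of that pair structure; the column names (`videoid`, `url`, `caption`) and sample rows are assumptions for demonstration, not the official WebVid schema.

```python
import csv
import io

# Hypothetical WebVid-style annotation CSV: each row pairs a stock-video
# clip with its descriptive alt-text caption. Column names are assumed,
# not the official release schema.
SAMPLE = """videoid,url,caption
1001,https://example.com/clip1.mp4,A dog runs across a sunny beach
1002,https://example.com/clip2.mp4,Time lapse of clouds over a city skyline
"""

def load_pairs(csv_text):
    """Parse annotation CSV text into a list of (url, caption) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["url"], row["caption"]) for row in reader]

pairs = load_pairs(SAMPLE)
print(len(pairs))    # 2
print(pairs[0][1])   # A dog runs across a sunny beach
```

In practice the caption side feeds a text encoder and the URL side is resolved to frames for a video encoder, which is how pretraining corpora of this kind are typically consumed.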

https://m-bain.github.io/webvid-dataset/
Overall grade: B (Above Average)
Adoption: B · Quality: A · Freshness: C+ · Citations: B+ · Engagement: F

Specifications

License
Custom
Pricing
open-source
Capabilities
video-language-pretraining, video-captioning, temporal-understanding
Integrations
hugging-face
Use Cases
video-language-pretraining, video-captioning, research
API Available
No
Tags
multimodal, video-text, video-captioning, large-scale, pretraining
Added
2026-03-17
Completeness
100%

Index Score

62.7
Adoption
68
Quality
80
Freshness
50
Citations
78
Engagement
0
