Dataset · Computer Vision · v2.0

WebVid-10M

by University of Oxford · open-source · Last verified 2026-03-17

A large-scale dataset of 10.7 million video-text pairs scraped from the web, where stock video clips are paired with their descriptive alt-text captions. WebVid-10M is the primary pretraining corpus for video-language models such as Frozen-in-Time and VideoCLIP, enabling open research into video understanding, temporal reasoning, and video-to-text generation.
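Since the dataset is distributed as video-text pairs (a clip URL matched with its alt-text caption), a loader is little more than CSV parsing. The sketch below is a minimal illustration of that pair structure; the column names (`videoid`, `url`, `caption`) and sample rows are assumptions for demonstration, not the official WebVid schema.

```python
import csv
import io

# Hypothetical WebVid-style annotation CSV: each row pairs a stock-video
# clip with its descriptive alt-text caption. Column names are assumed,
# not the official release schema.
SAMPLE = """videoid,url,caption
1001,https://example.com/clip1.mp4,A dog runs across a sunny beach
1002,https://example.com/clip2.mp4,Time lapse of clouds over a city skyline
"""

def load_pairs(csv_text):
    """Parse annotation CSV text into a list of (url, caption) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["url"], row["caption"]) for row in reader]

pairs = load_pairs(SAMPLE)
print(len(pairs))    # 2
print(pairs[0][1])   # A dog runs across a sunny beach
```

In practice the caption side feeds a text encoder and the URL side is resolved to frames for a video encoder, which is how pretraining corpora of this kind are typically consumed.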

https://m-bain.github.io/webvid-dataset/
Overall grade: B (Above Average)
Adoption: B · Quality: A · Freshness: C+ · Citations: B+ · Engagement: F

Specifications

License
Custom
Pricing
open-source
Capabilities
video-language-pretraining, video-captioning, temporal-understanding
Integrations
hugging-face
Use Cases
video-language-pretraining, video-captioning, research
API Available
No
Tags
multimodal, video-text, video-captioning, large-scale, pretraining
Added
2026-03-17
Completeness
100%

Index Score

62.7
Adoption
68
Quality
80
Freshness
50
Citations
78
Engagement
0
