WebVid-10M
by University of Oxford · open-source · Last verified 2026-03-17
A large-scale dataset of 10.7 million video-text pairs scraped from the web, pairing stock video clips with their descriptive alt-text captions. WebVid-10M is a widely used pretraining corpus for video-language models such as Frozen-in-Time, enabling open research into video understanding, temporal reasoning, and video-to-text generation.
https://m-bain.github.io/webvid-dataset/
Overall Grade: B (Above Average) · Adoption: B · Quality: A · Freshness: C+ · Citations: B+ · Engagement: F
Specifications
- License: Custom
- Pricing: open-source
- Capabilities: video-language-pretraining, video-captioning, temporal-understanding
- Integrations: hugging-face
- Use Cases: video-language-pretraining, video-captioning, research
- API Available: No
- Tags: multimodal, video-text, video-captioning, large-scale, pretraining
- Added: 2026-03-17
- Completeness: 100%
Index Score: 62.7
- Adoption: 68
- Quality: 80
- Freshness: 50
- Citations: 78
- Engagement: 0