WebVid-10M
by University of Oxford · free · Last verified 2026-03-17
WebVid-10M is a large-scale dataset of more than 10 million short video clips, each paired with a descriptive text caption. Scraped from stock video websites, it is widely used as a pretraining corpus for video-language models and supports research in video understanding, retrieval, and generation.
https://m-bain.github.io/webvid-dataset/
Grade: B (Above Average)
Adoption: B · Quality: A · Freshness: C+ · Citations: B+ · Engagement: F
Specifications
- License
- Custom
- Pricing
- free
- Capabilities
- video-language model pretraining, text-to-video retrieval, video-to-text generation (captioning), zero-shot video classification, temporal reasoning, action recognition, video question answering (VQA), multimodal representation learning
- API Available
- No
- Tags
- multimodal, video-text, video-captioning, large-scale, pretraining, video-understanding, computer-vision, nlp, text-to-video-retrieval, temporal-reasoning
- Added
- 2026-03-17
- Completeness
- 0.85%
Index Score
62.7
Adoption
68
Quality
80
Freshness
50
Citations
78
Engagement
0