Common Crawl
by Common Crawl Foundation · open-source · Last verified 2026-03-17
The largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation, which has published crawl data since 2011 and now releases new crawls roughly monthly. It serves as the raw data layer for many major language model pretraining pipelines, including those behind GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication.
https://commoncrawl.org
B+ (Good)
Adoption: A+ · Quality: B · Freshness: A+ · Citations: A+ · Engagement: F
Specifications
- License: Custom
- Pricing: open-source
- Capabilities: language-modeling, pretraining, multilingual-training
- Integrations: amazon-s3, apache-spark, ray
- Use Cases: llm-pretraining, web-analysis, research
- API Available: No
- Tags: nlp, web-crawl, massive-scale, multilingual, foundation
- Added: 2026-03-17
- Completeness: 100%
Index Score: 76.4
- Adoption: 97
- Quality: 68
- Freshness: 97
- Citations: 96
- Engagement: 0