Dataset · LLMs · v2024-10

Common Crawl

by Common Crawl Foundation · open-source · Last verified 2026-03-17

The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It serves as the raw data layer for many major language model pretraining pipelines, including those of GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication.
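
The quality filtering and deduplication mentioned above can be illustrated with a minimal sketch. It is not the pipeline used by any of the models named here: it assumes documents are already extracted as plain-text strings, uses a hypothetical `min_words` threshold as a stand-in for a real quality filter, and removes only exact duplicates (after whitespace and case normalization) rather than near-duplicates.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_and_dedupe(docs, min_words=5):
    """Yield documents that pass a length filter and are not exact duplicates.

    min_words is a hypothetical threshold chosen for illustration only.
    """
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality filter: drop very short pages
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield doc


# Illustrative sample pages, not real crawl data.
pages = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "Click here",
    "Common Crawl publishes raw web data for research use.",
]
kept = list(filter_and_dedupe(pages))
```

Here `kept` retains the first and last pages: the second is a duplicate of the first once normalized, and the third falls below the word threshold. Production pipelines typically layer language identification, heuristic or model-based quality scoring, and near-duplicate detection on top of a step like this.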

https://commoncrawl.org
B+ (Good)
Adoption: A+ · Quality: B · Freshness: A+ · Citations: A+ · Engagement: F

Specifications

License: Custom
Pricing: open-source
Capabilities: language-modeling, pretraining, multilingual-training
Integrations: amazon-s3, apache-spark, ray
Use Cases: llm-pretraining, web-analysis, research
API Available: No
Tags: nlp, web-crawl, massive-scale, multilingual, foundation
Added: 2026-03-17
Completeness: 100%

Index Score

76.4
Adoption: 97
Quality: 68
Freshness: 97
Citations: 96
Engagement: 0
