Dataset · LLMs · v2024-10

Common Crawl

by Common Crawl Foundation · open-source · Last verified 2026-03-17

The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It is the raw data layer for most major language model pretraining pipelines, including those behind GPT-3, LLaMA, PaLM, and Falcon, and is typically used only after quality filtering and deduplication.
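As a rough illustration of the quality filtering and deduplication mentioned above, here is a minimal sketch in Python. The function names, the thresholds (`min_words`, `min_alpha_ratio`), and the heuristic itself are assumptions chosen for illustration; they are not the method used by any of the pipelines named in this entry, which generally rely on more elaborate filters and near-duplicate detection.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def passes_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: enough words, and mostly alphabetic characters."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    alpha = sum(c.isalpha() for c in non_space)
    return alpha / len(non_space) >= min_alpha_ratio


def filter_and_dedup(docs):
    """Yield documents that pass the heuristic, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        if not passes_quality(doc):
            continue
        # Exact-match deduplication on the normalized text only.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc
```

Calling `list(filter_and_dedup(pages))` on an iterable of extracted page texts keeps the first copy of each passing document. Hashing normalized text only catches exact duplicates; near-duplicate detection would need a different technique.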

https://commoncrawl.org

Overall grade: C+ (Average)

Adoption: A+ · Quality: B · Freshness: A+ · Citations: F · Engagement: F

Specifications

License: Custom
Pricing: open-source
Capabilities: language-modeling, pretraining, multilingual-training
Integrations: amazon-s3, apache-spark, ray
Use Cases: llm-pretraining, web-analysis, research
API Available: No
Tags: nlp, web-crawl, massive-scale, multilingual, foundation
Added: 2026-03-17
Completeness: 100%
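Since the entry lists no API and names amazon-s3 as an integration, the data is normally fetched as files. The sketch below builds the URL of a crawl's file listing over HTTPS. The helper name `paths_listing_url` is hypothetical, and treating this entry's version tag `v2024-10` as the crawl ID `CC-MAIN-2024-10` is an assumption; check the crawl list on the Common Crawl site before relying on it.

```python
# Public HTTPS endpoint that mirrors the Common Crawl bucket.
CC_BASE_URL = "https://data.commoncrawl.org"


def paths_listing_url(crawl_id: str, kind: str = "warc") -> str:
    """Return the URL of the gzipped list of file paths for one crawl.

    kind is "warc" (raw responses), "wat" (metadata), or "wet" (extracted text).
    """
    if kind not in {"warc", "wat", "wet"}:
        raise ValueError(f"unknown file kind: {kind!r}")
    return f"{CC_BASE_URL}/crawl-data/{crawl_id}/{kind}.paths.gz"


# Assumed crawl ID, inferred from the version tag on this entry.
print(paths_listing_url("CC-MAIN-2024-10", "wet"))
```

Each line of the downloaded listing is a relative path that can be appended to the same base URL to fetch an individual file.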
