
C4 (Colossal Clean Crawled Corpus)

by Google · open-source · Last verified 2026-03-17

A cleaned version of Common Crawl comprising approximately 750 GB of English web text, created by Google as the pretraining corpus for the T5 model family. The aggressive heuristic cleaning pipeline removed boilerplate, offensive content, and non-English text, producing a high-quality corpus that remains widely used for language model training and fine-tuning.

https://huggingface.co/datasets/allenai/c4
Grade: B+ (Good)
Adoption: A · Quality: A · Freshness: C+ · Citations: A+ · Engagement: F

Specifications

License
ODC-BY
Pricing
open-source
Capabilities
language-modeling, pretraining, fine-tuning
Integrations
hugging-face, tensorflow-datasets
Use Cases
llm-pretraining, transfer-learning, research
API Available
Yes
Tags
nlp, pretraining, web-crawl, cleaned, t5
Added
2026-03-17
Completeness
100%

Index Score

74.2
Adoption
87
Quality
83
Freshness
52
Citations
91
Engagement
0
