C4 (Colossal Clean Crawled Corpus)
by Google · open-source · Last verified 2026-03-17
A cleaned version of Common Crawl comprising approximately 750 GB of English web text, created by Google as the pretraining corpus for the T5 model family. The aggressive heuristic cleaning pipeline removed boilerplate, offensive content, and non-English text, producing a high-quality corpus that remains widely used for language model training and fine-tuning.
https://huggingface.co/datasets/allenai/c4
Overall: B+ (Good) · Adoption: A · Quality: A · Freshness: C+ · Citations: A+ · Engagement: F
Specifications
- License
- ODC-BY
- Pricing
- open-source
- Capabilities
- language-modeling, pretraining, fine-tuning
- Integrations
- hugging-face, tensorflow-datasets
- Use Cases
- llm-pretraining, transfer-learning, research
- API Available
- Yes
- Tags
- nlp, pretraining, web-crawl, cleaned, t5
- Added
- 2026-03-17
- Completeness
- 100%
Index Score: 74.2
- Adoption: 87
- Quality: 83
- Freshness: 52
- Citations: 91
- Engagement: 0