Dataset · multilingual · v3.1.0

mC4

by Google · open-source · Last verified 2026-03-17

The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.

https://huggingface.co/datasets/mc4
B+ (Good)
Adoption: A · Quality: B+ · Freshness: B+ · Citations: A · Engagement: F

Specifications

License: ODC-BY
Pricing: open-source
Capabilities: multilingual-pre-training, language-modeling
Integrations: huggingface-datasets, tensorflow-datasets
Use Cases: pre-training, multilingual-nlp, language-modeling
API Available: No
Tags: multilingual, web-crawl, pre-training, 101-languages, google
Added: 2026-03-17
Completeness: 100%

Index Score

72
Adoption: 86
Quality: 79
Freshness: 72
Citations: 87
Engagement: 0
