mC4
by Google · open-source · Last verified 2026-03-17
The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.
https://huggingface.co/datasets/mc4
Overall grade: B+ (Good)
Adoption: A · Quality: B+ · Freshness: B+ · Citations: A · Engagement: F
Specifications
- License: ODC-BY
- Pricing: open-source
- Capabilities: multilingual-pre-training, language-modeling
- Integrations: huggingface-datasets, tensorflow-datasets
- Use Cases: pre-training, multilingual-nlp, language-modeling
- API Available: No
- Tags: multilingual, web-crawl, pre-training, 101-languages, google
- Added: 2026-03-17
- Completeness: 100%
Index Score: 72
- Adoption: 86
- Quality: 79
- Freshness: 72
- Citations: 87
- Engagement: 0