CulturaX
by University of Oregon · open-source · Last verified 2026-03-17
CulturaX is a massive cleaned multilingual corpus of 6.3 trillion tokens across 167 languages, combining and deduplicating mC4 and OSCAR with rigorous language-model-based quality filtering. It is currently one of the largest and cleanest publicly available multilingual pre-training datasets.
https://huggingface.co/datasets/uonlp/CulturaX
Overall grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: multilingual-pre-training, language-modeling, data-cleaning
- Integrations: huggingface-datasets
- Use Cases: pre-training, multilingual-nlp, large-scale-training
- API Available: No
- Tags: multilingual, pre-training, cleaned, 167-languages, mC4-oscar
- Added: 2026-03-17
- Completeness: 100%
Index Score: 63.2
- Adoption: 68
- Quality: 90
- Freshness: 78
- Citations: 72
- Engagement: 0