Dataset · multilingual · v1.0

CulturaX

by University of Oregon · open-source · Last verified 2026-03-17

CulturaX is a large cleaned multilingual corpus of 6.3 trillion tokens across 167 languages, built by combining and deduplicating mC4 and OSCAR and then applying language-model-based quality filtering. It is currently one of the largest and cleanest publicly available multilingual pre-training datasets.
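The cleaning described above combines deduplication with quality filtering. As a toy illustration only (the actual CulturaX pipeline performs large-scale deduplication and model-based filtering, not this), exact deduplication by hashing whitespace- and case-normalized text can be sketched as:

```python
import hashlib

def dedup_exact(docs):
    """Keep the first occurrence of each document, comparing
    whitespace- and case-normalized text (toy exact dedup)."""
    seen = set()
    out = []
    for doc in docs:
        # Collapse runs of whitespace and lowercase before hashing,
        # so trivially reformatted copies collide.
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(doc)
    return out
```

For example, `dedup_exact(["a b", "A  b", "c"])` keeps only `["a b", "c"]`, since the second document normalizes to the same text as the first.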

https://huggingface.co/datasets/uonlp/CulturaX
Overall: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
multilingual-pre-training, language-modeling, data-cleaning
Integrations
huggingface-datasets
Use Cases
pre-training, multilingual-nlp, large-scale-training
API Available
No
Tags
multilingual, pre-training, cleaned, 167-languages, mC4-oscar
Added
2026-03-17
Completeness
100%
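Since the dataset is distributed through the Hugging Face Hub (see Integrations above), it can be read in streaming mode rather than downloaded in full. A minimal sketch, assuming the `uonlp/CulturaX` Hub ID from the link above and per-language configs such as `"en"` (access may require accepting the dataset's terms on the Hub and authenticating with a token):

```python
def culturax_stream(language="en"):
    """Open one language split of CulturaX as a streaming iterable.

    Lazy import so this sketch stays importable without the
    `datasets` library installed; streaming avoids materializing
    the multi-terabyte corpus on disk.
    """
    from datasets import load_dataset
    return load_dataset("uonlp/CulturaX", language, streaming=True)
```

Iterating the returned object (e.g. `next(iter(culturax_stream("en")["train"]))`) yields documents one at a time over the network.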

Index Score

63.2
Adoption
68
Quality
90
Freshness
78
Citations
72
Engagement
0
