CulturaX
by University of Oregon · free · Last verified 2026-03-17
CulturaX is a large, cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was created by combining the mC4 and OSCAR datasets, then deduplicating and filtering the merged data using language model-based quality scoring. The result is one of the largest cleaned public datasets available for pre-training large language models.
https://huggingface.co/datasets/uonlp/CulturaX
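The combine, deduplicate, and filter steps mentioned in the description can be illustrated with a minimal sketch. This is not the CulturaX pipeline itself: the function names (`normalize`, `quality_score`, `merge_dedup_filter`), the exact-hash deduplication, the character-ratio score, and the `min_score` threshold are all assumptions chosen for illustration, standing in for whatever deduplication and model-based scoring the dataset authors actually used.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quality_score(text: str) -> float:
    """Toy stand-in for a quality score: share of alphabetic or whitespace characters.

    Assumption: a real pipeline would use a language-model-based score instead.
    """
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def merge_dedup_filter(sources, min_score=0.8):
    """Merge several document lists, drop exact duplicates, keep high-scoring text."""
    seen = set()
    kept = []
    for docs in sources:
        for doc in docs:
            # Hash the normalized text so duplicates across sources are detected.
            key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            if quality_score(doc) >= min_score:
                kept.append(doc)
    return kept


# Hypothetical sample documents standing in for two web-crawl sources.
mc4_like = ["Hello world, this is a clean sentence.", "$$$ 1234 #### 5678 @@@@"]
oscar_like = ["hello   world, this is a clean sentence.", "Bonjour tout le monde"]

result = merge_dedup_filter([mc4_like, oscar_like])
print(result)
```

In this toy run the second copy of the English sentence is dropped as a duplicate and the symbol-heavy line falls below the threshold, so two documents remain. Exact hashing only catches copies that are identical after normalization; near-duplicate detection would need a different technique.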
Grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: free
- Capabilities: multilingual language model pre-training, cross-lingual transfer learning, large-scale text data analysis, natural language processing research, corpus linguistics studies, benchmarking data cleaning techniques, training models for low-resource languages, text generation model development
- API Available: No
- Tags: multilingual-corpus, pre-training-dataset, llm-training, natural-language-processing, data-cleaning, web-corpus, mc4, oscar, hugging-face, big-data, text-data
- Added: 2026-03-17
- Completeness: 0.6%
Index Score: 63.2
- Adoption: 68
- Quality: 90
- Freshness: 78
- Citations: 72
- Engagement: 0