Dataset · multilingual · v1.0

CulturaX

by University of Oregon · free · Last verified 2026-03-17

CulturaX is a cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was built by combining the mC4 and OSCAR datasets, then deduplicating and filtering the result using language-model-based quality scoring. This makes it one of the largest cleaned public datasets for pre-training large language models.
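
The combine, deduplicate, and filter steps described above can be illustrated with a toy sketch. This is not the CulturaX pipeline itself: the listing does not specify the deduplication method or the scoring model, so exact-hash deduplication and a simple character-ratio score are swapped in here as stand-ins. The names `combine_dedup_filter`, `toy_quality_score`, and the sample documents are hypothetical.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def combine_dedup_filter(
    sources: Iterable[Iterable[str]],
    quality_score: Callable[[str], float],
    min_score: float,
) -> Iterator[str]:
    """Merge several corpora, drop exact duplicates, keep documents scoring >= min_score."""
    seen = set()
    for source in sources:
        for doc in source:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already seen in this or an earlier source
            seen.add(digest)
            if quality_score(doc) >= min_score:
                yield doc


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a language-model score: share of letters and spaces."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


# Two tiny stand-in corpora (assumed sample data, not real mC4 or OSCAR records).
mc4_like = ["The quick brown fox.", "$$$ 1234 #### 5678 %%%"]
oscar_like = ["the  quick brown FOX.", "A second clean sentence."]

kept = list(combine_dedup_filter([mc4_like, oscar_like], toy_quality_score, 0.8))
print(kept)
```

Here the symbol-heavy line falls below the threshold and the second copy of the fox sentence is dropped as a duplicate. A production pipeline at this scale would more likely use near-duplicate detection and a trained scoring model rather than these simplifications.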

https://huggingface.co/datasets/uonlp/CulturaX
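
A minimal sketch of reading a few documents from the dataset with the Hugging Face `datasets` library follows. Several details are assumptions not stated in this listing: that subsets are selected by language code (such as `"en"`), that each record exposes a `text` field, and that access may require accepting the dataset terms and logging in to the Hub first. The helper names `open_culturax` and `take_texts` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def open_culturax(lang: str = "en"):
    """Open one language subset as a stream (needs network access and `datasets`)."""
    # Imported lazily so take_texts below can be used without the package installed.
    from datasets import load_dataset

    # Assumption: the subset name is a language code and the split is "train".
    # streaming=True avoids downloading the whole corpus before iterating.
    return load_dataset("uonlp/CulturaX", lang, split="train", streaming=True)


def take_texts(records: Iterable[dict], n: int) -> Iterator[str]:
    """Yield the `text` field of the first n records (field name is an assumption)."""
    for record in islice(records, n):
        yield record["text"]
```

With network access and Hub credentials in place, `list(take_texts(open_culturax("en"), 3))` would pull the first three documents from the stream. Streaming is the practical choice here because a corpus of this size is unlikely to fit on a single workstation.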
Grade: B (Above Average)
Adoption: B · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
Apache-2.0
Pricing
free
Capabilities
multilingual language model pre-training, cross-lingual transfer learning, large-scale text data analysis, natural language processing research, corpus linguistics studies, benchmarking data cleaning techniques, training models for low-resource languages, text generation model development
API Available
No
Tags
multilingual-corpus, pre-training-dataset, llm-training, natural-language-processing, data-cleaning, web-corpus, mc4, oscar, hugging-face, big-data, text-data
Added
2026-03-17
Completeness
0.6%

Index Score

63.2
Adoption
68
Quality
90
Freshness
78
Citations
72
Engagement
0
