Cosmopedia
by Hugging Face · free · Last verified 2026-03-17
Cosmopedia is a massive synthetic dataset containing over 30 million documents styled as textbooks, blog posts, stories, and articles. Generated by Mixtral-8x7B-Instruct, it provides a vast corpus of high-quality English-language educational content designed for pretraining large language models at scale.
https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License
- ODC-By 1.0
- Pricing
- free
- Capabilities
- large-scale-llm-pretraining, synthetic-data-generation-research, multilingual-model-development, domain-specific-model-tuning, instruction-following-training, data-augmentation-for-text-corpora, knowledge-base-creation, educational-content-analysis
- Integrations
- API Available
- Yes
- Tags
- synthetic-data, text-corpus, llm-pretraining, multilingual, educational-content, mixtral, knowledge-base, instruction-data, open-data
- Added
- 2026-03-17
- Completeness
- 0.9%
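Since the dataset is distributed through the Hugging Face Hub, it can be sampled without downloading the full corpus by using the `datasets` library's streaming mode. A minimal sketch, assuming the `datasets` package is installed and using the dataset's published subset names ("stories" is chosen here only as an illustrative default):

```python
# Minimal sketch: stream Cosmopedia from the Hugging Face Hub instead of
# downloading the whole corpus up front.

DATASET_ID = "HuggingFaceTB/cosmopedia"

def stream_kwargs(subset="stories"):
    # Build the arguments for datasets.load_dataset().
    # streaming=True yields documents lazily over the network
    # rather than materializing the full split on disk.
    return {
        "path": DATASET_ID,
        "name": subset,      # one of the dataset's configs, e.g. "stories"
        "split": "train",
        "streaming": True,
    }

# Usage (requires `pip install datasets` and network access):
# from datasets import load_dataset
# ds = load_dataset(**stream_kwargs())
# doc = next(iter(ds))       # one synthetic document as a dict
# print(doc["text"][:200])   # documents carry their generated text
```

Streaming is the practical way to inspect or subsample a corpus of this size before committing to a full pretraining download.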
Index Score
65.1
- Adoption: 78
- Quality: 82
- Freshness: 88
- Citations: 70
- Engagement: 0