Cosmopedia
by Hugging Face · open-source · Last verified 2026-03-17
Cosmopedia is a large synthetic dataset of over 30 million textbooks, blog posts, stories, and WikiHow articles (roughly 25 billion tokens) generated by Mixtral-8x7B-Instruct-v0.1. It was designed to reproduce, at web scale, the style of high-quality educational content associated with Microsoft's Phi-1.5, covering a wide range of topics to support large-scale language-model pretraining.
https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: A
- Citations: B+
- Engagement: F
Specifications
- License
- ODC-By 1.0
- Pricing
- open-source
- Capabilities
- pretraining, synthetic-data-generation, instruction-following
- Integrations
- huggingface-datasets
- Use Cases
- language-model-pretraining, educational-content-generation, curriculum-learning
- API Available
- Yes
- Tags
- synthetic, textbooks, web-crawl, pretraining
- Added
- 2026-03-17
- Completeness
- 100%
Index Score: 65.1
- Adoption: 78
- Quality: 82
- Freshness: 88
- Citations: 70
- Engagement: 0