Dataset · synthetic · v1.0

Cosmopedia

by Hugging Face · open-source · Last verified 2026-03-17

Cosmopedia is a massive synthetic dataset of 30 million textbooks, blog posts, stories, and WikiHow-style articles generated by Mixtral-8x7B-Instruct. It was designed to replicate, at web scale, the style of the high-quality educational content used to train Phi-1, covering hundreds of topics across multiple languages to support large-scale pretraining.

https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A · Citations: B+ · Engagement: F

Specifications

License
ODC-By 1.0
Pricing
open-source
Capabilities
pretraining, synthetic-data-generation, instruction-following
Integrations
huggingface-datasets
Use Cases
language-model-pretraining, educational-content-generation, curriculum-learning
API Available
Yes
Tags
synthetic, textbooks, web-crawl, pretraining, multilingual
Added
2026-03-17
Completeness
100%

Index Score

65.1
Adoption
78
Quality
82
Freshness
88
Citations
70
Engagement
0
