Cosmopedia
by Hugging Face · free · Last verified 2026-03-17
Cosmopedia is a massive synthetic dataset containing over 30 million documents styled as textbooks, blog posts, stories, and articles. Generated by Mixtral-8x7B-Instruct, it provides a vast corpus of high-quality English-language educational content designed for pretraining large language models at scale.
https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License
- ODC-By 1.0
- Pricing
- free
- Capabilities
- large-scale-llm-pretraining, synthetic-data-generation-research, multilingual-model-development, domain-specific-model-tuning, instruction-following-training, data-augmentation-for-text-corpora, knowledge-base-creation, educational-content-analysis
- Integrations
- API Available
- Yes
- Tags
- synthetic-data, text-corpus, llm-pretraining, multilingual, educational-content, mixtral, knowledge-base, instruction-data, open-data
- Added
- 2026-03-17
- Completeness
- 0.9%
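Since the dataset is distributed through the Hugging Face Hub, it can be sampled without downloading the full corpus by using the `datasets` library's streaming mode. A minimal sketch, assuming the `datasets` package is installed and using the dataset's published subset names ("stories" is chosen here only as an illustrative default):

```python
# Minimal sketch: stream Cosmopedia from the Hugging Face Hub instead of
# downloading the whole corpus up front.

DATASET_ID = "HuggingFaceTB/cosmopedia"

def stream_kwargs(subset="stories"):
    # Build the arguments for datasets.load_dataset().
    # streaming=True yields documents lazily over the network
    # rather than materializing the full split on disk.
    return {
        "path": DATASET_ID,
        "name": subset,      # one of the dataset's configs, e.g. "stories"
        "split": "train",
        "streaming": True,
    }

# Usage (requires `pip install datasets` and network access):
# from datasets import load_dataset
# ds = load_dataset(**stream_kwargs())
# doc = next(iter(ds))       # one synthetic document as a dict
# print(doc["text"][:200])   # documents carry their generated text
```

Streaming is the practical way to inspect or subsample a corpus of this size before committing to a full pretraining download.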
Index Score
65.1
- Adoption: 78
- Quality: 82
- Freshness: 88
- Citations: 70
- Engagement: 0