SlimPajama
by Cerebras · free · Last verified 2026-03-17
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.
https://huggingface.co/datasets/cerebras/SlimPajama-627B ↗
Grade: B (Above Average)
Scores: Adoption B+ · Quality A+ · Freshness B+ · Citations B+ · Engagement F
Specifications
- License
- Apache-2.0
- Pricing
- free
- Capabilities
- Large scale language model pre-training, Research on data quality and deduplication impact, Reproducing LLaMA-style model training, Benchmarking data processing pipelines, Training efficiency studies, General-purpose text generation, Comparative analysis of model performance, Developing data filtering techniques
- API Available
- Yes
- Tags
- nlp, pretraining, deduplicated, llama, open-source, large-language-model, text-corpus, data-quality, cerebras, redpajama, english-language
- Added
- 2026-03-17
- Completeness
- 95%
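The deduplication noted under Capabilities can be illustrated with a minimal sketch. This is not the SlimPajama pipeline itself (which relied on more sophisticated near-duplicate detection); the function names here are hypothetical, and the example shows only exact-duplicate removal via content hashing of normalized text.

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting
    # differences do not hide duplicates.
    return " ".join(text.lower().split())

def dedup_exact(docs):
    """Keep the first occurrence of each distinct document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick brown FOX.",   # duplicate after normalization
    "An entirely different document.",
]
print(dedup_exact(corpus))  # keeps 2 of the 3 documents
```

At corpus scale this would be run over streamed shards rather than an in-memory list, but the core idea, hashing a normalized form and keeping first occurrences, is the same.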
Index Score
- Overall: 65.5
- Adoption: 72
- Quality: 90
- Freshness: 70
- Citations: 75
- Engagement: 0