Dataset · LLMs · v1.0

SlimPajama

by Cerebras · free · Last verified 2026-03-17

SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.

https://huggingface.co/datasets/cerebras/SlimPajama-627B
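
To make "cleaned and deduplicated" concrete, here is a minimal sketch of near-duplicate removal using Jaccard similarity over word n-grams. It is a toy illustration only and not Cerebras's actual pipeline: the function names (`normalize`, `shingles`, `jaccard`, `deduplicate`), the threshold of 0.8, and the n-gram size of 3 are all assumptions chosen for the example, and the pairwise comparison would not scale to a corpus of this size without an approximate method such as MinHash with locality-sensitive hashing.

```python
import re


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, n=3):
    """Return the set of word n-grams found in the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Short documents are represented by a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8, n=3):
    """Keep a document only if it is not too similar to any kept one.

    threshold and n are hypothetical defaults for illustration.
    This is O(len(docs)^2) and only suitable for small inputs.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, k) >= threshold for k in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(s)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick brown fox jumps over the lazy dog",
    "A completely different sentence about datasets.",
]
unique_docs = deduplicate(docs)
```

In this example the second document differs from the first only in case and punctuation, so it is dropped after normalization, while the third is kept.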
B · Above Average
Adoption: B+ · Quality: A+ · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License
Apache-2.0
Pricing
free
Capabilities
Large-scale language model pre-training, Research on data quality and deduplication impact, Reproducing LLaMA-style model training, Benchmarking data processing pipelines, Training efficiency studies, General-purpose text generation, Comparative analysis of model performance, Developing data filtering techniques
API Available
Yes
Tags
nlp, pretraining, deduplicated, llama, open-source, large-language-model, text-corpus, data-quality, cerebras, redpajama, english-language
Added
2026-03-17
Completeness
95%

Index Score

65.5
Adoption
72
Quality
90
Freshness
70
Citations
75
Engagement
0
