DatasetLLMsv1.0

SlimPajama

by Cerebras · free · Last verified 2026-03-17

SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.

https://huggingface.co/datasets/cerebras/SlimPajama-627B ↗

B—Above Average

Adoption: B+Quality: A+Freshness: B+Citations: B+Engagement: F

Specifications

License: Apache-2.0
Pricing: free
Capabilities: Large scale language model pre-training, Research on data quality and deduplication impact, Reproducing LLaMA-style model training, Benchmarking data processing pipelines, Training efficiency studies, General-purpose text generation, Comparative analysis of model performance, Developing data filtering techniques
Integrations: [object Object], [object Object], [object Object], [object Object], [object Object]
Use Cases: [object Object], [object Object], [object Object], [object Object], [object Object]
API Available: Yes
Tags: nlp, pretraining, deduplicated, llama, open-source, large-language-model, text-corpus, data-quality, cerebras, redpajama, english-language
Added: 2026-03-17
Completeness: 0.95%

Index Score

65.5

Adoption

Quality

Freshness

Citations

Engagement

Need this tool deployed for your team?

Get a Custom Setup

Explore the full AI ecosystem on Agents as a Service