
RedPajama v2

by Together AI · open-source · Last verified 2026-03-17

A massive open-source dataset containing over 30 trillion deduplicated tokens drawn from 84 CommonCrawl snapshots, released as the successor to RedPajama v1 (which reproduced the LLaMA training-data recipe). Each document ships with dozens of precomputed quality signals and deduplication metadata, enabling researchers to filter and curate custom subsets for their own pretraining runs.
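The quality signals are what make custom curation practical: rather than re-scoring raw web text, you threshold the precomputed metadata. A minimal sketch of such a filter, assuming per-document signals exposed as a flat dict — `ccnet_perplexity` and `rps_doc_word_count` follow the dataset's published signal naming, but the thresholds are illustrative, and many real signals are stored as `[start, end, value]` spans rather than scalars:

```python
# Illustrative quality-signal filter for RedPajama v2-style records.
# Thresholds are example values, not recommendations from the dataset authors.

def passes_quality_filter(signals: dict,
                          max_perplexity: float = 500.0,
                          min_words: int = 50) -> bool:
    """Keep a document only if it clears both thresholds."""
    ppl = signals.get("ccnet_perplexity")
    words = signals.get("rps_doc_word_count")
    if ppl is None or words is None:
        return False  # missing metadata: drop rather than guess
    return ppl <= max_perplexity and words >= min_words

def curate(records):
    """Yield only records whose quality signals pass the filter."""
    for rec in records:
        if passes_quality_filter(rec.get("quality_signals", {})):
            yield rec
```

The same predicate can be pushed down into a Spark or Ray pipeline when filtering at full-corpus scale.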

https://together.ai/blog/redpajama-data-v2
Overall grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: A · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
language-modeling, pretraining, data-curation
Integrations
hugging-face, apache-spark, ray
Use Cases
llm-pretraining, research, data-curation
API Available
Yes
Tags
nlp, pretraining, large-scale, web-crawl, open-source
Added
2026-03-17
Completeness
100%

Index Score

68.5
Adoption
78
Quality
84
Freshness
78
Citations
82
Engagement
0
