RedPajama-V2
by Together AI · open-source · Last verified 2026-04-24
A 30-trillion-token multilingual web dataset with quality annotations, intended for pretraining.
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
Grade: D (Poor)
Adoption: C+ · Quality: B+ · Freshness: A · Citations: F · Engagement: F
Specifications
- License: Open Source
- Pricing: open-source
- Capabilities: (none listed)
- Integrations: (none listed)
- Use Cases: (none listed)
- API Available: No
- Tags: pretraining, multilingual, web, large-scale
- Added: 2026-04-24
- Completeness: 60%
Index Score: 34
- Adoption: 50
- Quality: 70
- Freshness: 80
- Citations: 0
- Engagement: 0
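Example: filtering on quality annotations
The description notes that documents come with quality annotations. The sketch below shows one way such annotations might be used to filter documents before pretraining. It assumes each record carries a `quality_signals` field holding a JSON string in which every signal maps to a list of `[start, end, score]` spans, and that `rps_doc_word_count` is a document-level signal; check both against the dataset card before relying on them. The helper names (`make_record`, `doc_signal`, `passes_filter`), the sample records, and the thresholds are hypothetical and chosen only for illustration.

```python
import json


def make_record(text, word_count):
    """Build a made-up record shaped like the assumed annotation layout."""
    signals = {
        # Assumed layout: one [start, end, score] span covering the document.
        "rps_doc_word_count": [[0, len(text), word_count]],
    }
    return {"raw_content": text, "quality_signals": json.dumps(signals)}


def doc_signal(record, name):
    """Return the score of a document-level signal, or None if it is absent."""
    signals = json.loads(record["quality_signals"])
    spans = signals.get(name)
    if not spans:
        return None
    # A document-level signal is assumed to have a single span; take its score.
    return spans[0][2]


def passes_filter(record, min_words=50, max_words=100_000):
    """Keep documents whose word count lies within illustrative bounds."""
    count = doc_signal(record, "rps_doc_word_count")
    return count is not None and min_words <= count <= max_words


# Two hypothetical records: one too short, one within bounds.
records = [
    make_record("short page", 2),
    make_record("x" * 600, 120),
]
kept = [r for r in records if passes_filter(r)]
```

Only the second record survives the filter here. In practice the bounds would be tuned per language and combined with other signals rather than used alone.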