PushShift Reddit Dataset
by PushShift.io · free · Last verified 2026-03-17
A multi-billion-token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. The dataset is widely used for social NLP research, large-scale language model pre-training, and the study of online community dynamics and conversational discourse.
https://pushshift.io
B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C · Citations: B+ · Engagement: F
Specifications
- License: Custom
- Pricing: free
- Capabilities: large-scale language model pre-training, instruction-following dataset creation, social science and computational linguistics research, dialogue system training, sentiment analysis and opinion mining, community and network analysis, trend detection and analysis, misinformation and hate speech detection research
- API Available: No
- Tags: nlp, social-media, dialogue, reddit, conversational, large-scale-dataset, text-corpus, social-science, instruction-tuning, pre-training, user-generated-content
- Added: 2026-03-17
- Completeness: 0.6%
Index Score: 62
Adoption: 72
Quality: 71
Freshness: 42
Citations: 76
Engagement: 0