PushShift Reddit Dataset
by PushShift.io · open-source · Last verified 2026-03-17
A large-scale archive of Reddit submissions and comments collected by the PushShift project, covering hundreds of billions of tokens of conversational text across thousands of subreddit communities from 2005 through 2023. It is widely used for training dialogue systems, studying online discourse, conducting social NLP research, and constructing instruction-following datasets via post-processing.
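The post-processing step mentioned above can be illustrated with a minimal sketch that pairs each reply with the text it responds to. It assumes the records arrive as newline-delimited JSON and use the field names `id`, `parent_id`, `title`, `selftext`, and `body`, with parent references prefixed `t3_` for submissions and `t1_` for comments. These names, the `build_pairs` helper, and the sample records are illustrative assumptions, not something this listing specifies.

```python
import json


def build_pairs(lines):
    """Pair each reply with the submission or comment it answers.

    Assumes one JSON object per line; the field names are assumptions.
    """
    texts = {}    # fullname (e.g. "t3_abc") -> text of that post
    replies = []  # (parent fullname, reply text)
    for line in lines:
        rec = json.loads(line)
        if "title" in rec:
            # Treat records with a title as submissions.
            prompt = (rec["title"] + "\n" + rec.get("selftext", "")).strip()
            texts["t3_" + rec["id"]] = prompt
        else:
            # Everything else is treated as a comment.
            texts["t1_" + rec["id"]] = rec["body"]
            replies.append((rec["parent_id"], rec["body"]))

    pairs = []
    for parent, body in replies:
        # Skip orphaned replies and deleted or removed bodies.
        if parent in texts and body not in ("[deleted]", "[removed]"):
            pairs.append({"prompt": texts[parent], "response": body})
    return pairs


# Hypothetical sample records, made up for illustration.
sample = [
    '{"id": "abc", "title": "How do I reverse a list in Python?", "selftext": ""}',
    '{"id": "c1", "parent_id": "t3_abc", "body": "Use reversed(xs) or xs[::-1]."}',
    '{"id": "c2", "parent_id": "t1_c1", "body": "[deleted]"}',
    '{"id": "c3", "parent_id": "t1_zzz", "body": "Orphaned reply."}',
]
pairs = build_pairs(sample)
```

A real pipeline would likely add score thresholds, length limits, and content filtering before using such pairs for instruction tuning; the sketch only shows the parent-to-reply join.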
https://pushshift.io
Grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C · Citations: B+ · Engagement: F
Specifications
- License: Custom
- Pricing: open-source
- Capabilities: dialogue-modeling, conversational-ai, social-analysis
- Integrations: hugging-face, apache-spark
- Use Cases: dialogue-system-training, social-research, instruction-tuning
- API Available: No
- Tags: nlp, social-media, dialogue, reddit, conversational
- Added: 2026-03-17
- Completeness: 100%
Index Score: 62
- Adoption: 72
- Quality: 71
- Freshness: 42
- Citations: 76
- Engagement: 0