Dataset · LLMs · v2023-06

PushShift Reddit Dataset

by PushShift.io · open-source · Last verified 2026-03-17

A large-scale archive of Reddit submissions and comments collected by the PushShift project, covering hundreds of billions of tokens of conversational text across thousands of subreddit communities from 2005 through 2023. It is widely used for training dialogue systems, studying online discourse, conducting social NLP research, and constructing instruction-following datasets via post-processing.
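
As a rough illustration of the post-processing mentioned above, the sketch below pairs each comment with the comment it replies to, which is one simple way to derive prompt/response examples for dialogue or instruction tuning. It assumes the input is newline-delimited JSON with `id`, `parent_id`, and `body` fields, and that a `t1_` prefix on `parent_id` marks a parent that is itself a comment; these field names and the helper `comment_reply_pairs` are assumptions for illustration, not something this listing specifies.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple

# Placeholder bodies that carry no usable text (assumed convention).
PLACEHOLDERS = {"", "[deleted]", "[removed]"}


def comment_reply_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (parent_body, reply_body) pairs from newline-delimited JSON comments.

    Hypothetical helper: assumes each record has "id", "parent_id", and "body".
    """
    bodies: Dict[str, str] = {}
    records: List[dict] = []

    # First pass: index usable comment bodies by comment id.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        body = rec.get("body", "")
        if body in PLACEHOLDERS:
            continue
        bodies[rec["id"]] = body
        records.append(rec)

    # Second pass: keep replies whose parent is a comment we have indexed.
    for rec in records:
        parent = rec.get("parent_id", "")
        if parent.startswith("t1_") and parent[3:] in bodies:
            yield bodies[parent[3:]], rec["body"]
```

This version holds every comment in memory, so it only suits small samples; a full-archive job would more plausibly do the same join in a distributed engine such as Apache Spark, which the listing names as an integration.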

https://pushshift.io
Grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C · Citations: B+ · Engagement: F

Specifications

License: Custom
Pricing: open-source
Capabilities: dialogue-modeling, conversational-ai, social-analysis
Integrations: hugging-face, apache-spark
Use Cases: dialogue-system-training, social-research, instruction-tuning
API Available: No
Tags: nlp, social-media, dialogue, reddit, conversational
Added: 2026-03-17
Completeness: 100%

Index Score

Overall: 62
Adoption: 72
Quality: 71
Freshness: 42
Citations: 76
Engagement: 0
