Dataset · LLMs · v2023-06

PushShift Reddit Dataset

by PushShift.io · free · Last verified 2026-03-17

A multi-billion-token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. The dataset is a cornerstone of social NLP research and large-scale language model pre-training, and is widely used to study the dynamics of online communities and conversational discourse.

https://pushshift.io
B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C · Citations: B+ · Engagement: F

Specifications

License
Custom
Pricing
free
Capabilities
large-scale language model pre-training, instruction-following dataset creation, social science and computational linguistics research, dialogue system training, sentiment analysis and opinion mining, community and network analysis, trend detection and analysis, misinformation and hate speech detection research
API Available
No
Tags
nlp, social-media, dialogue, reddit, conversational, large-scale-dataset, text-corpus, social-science, instruction-tuning, pre-training, user-generated-content
Added
2026-03-17
Completeness
0.6%

Index Score

62
Adoption
72
Quality
71
Freshness
42
Citations
76
Engagement
0
