Dataset · LLMs · v2023-06

PushShift Reddit Dataset

by PushShift.io · free · Last verified 2026-03-17

A massive, multi-billion-token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. The dataset is widely used for social NLP research, large-scale language model pre-training, and the study of online communities and conversational discourse.

https://pushshift.io

C (Below Average)

Adoption: B+ · Quality: B+ · Freshness: C · Citations: F · Engagement: F

Specifications

License: Custom
Pricing: free
Capabilities: large-scale language model pre-training, instruction-following dataset creation, social science and computational linguistics research, dialogue system training, sentiment analysis and opinion mining, community and network analysis, trend detection and analysis, misinformation and hate speech detection research
API Available: No
Tags: nlp, social-media, dialogue, reddit, conversational, large-scale-dataset, text-corpus, social-science, instruction-tuning, pre-training, user-generated-content
Added: 2026-03-17
Completeness: 0.6%
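
As an illustration of the community-analysis capability listed above, here is a minimal Python sketch that counts records per subreddit. It assumes the records arrive as newline-delimited JSON and that each record carries a `subreddit` field; this listing does not document the file format or schema, so both are assumptions, and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line, skipping blanks and lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines in a large archive
        if isinstance(record, dict):
            yield record


def records_per_subreddit(lines: Iterable[str]) -> Counter:
    """Count records per value of the `subreddit` field (assumed field name)."""
    counts = Counter()
    for record in iter_records(lines):
        subreddit = record.get("subreddit")
        if subreddit:
            counts[subreddit] += 1
    return counts


# Hypothetical sample records, not taken from the real dataset.
sample = [
    '{"subreddit": "askscience", "body": "Why is the sky blue?"}',
    '{"subreddit": "python", "body": "Use a generator here."}',
    '{"subreddit": "python", "body": "Agreed."}',
    'not json',
    '',
]

print(records_per_subreddit(sample).most_common())
# prints [('python', 2), ('askscience', 1)]
```

Because `iter_records` is a generator, the same functions can be passed an open text file handle instead of a list, so an archive of this size can be processed one line at a time without loading it into memory.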

