OpenWebText
by Aaron Gokaslan & Vanya Cohen · free · Last verified 2026-03-17
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's unreleased WebText dataset used to train GPT-2, it contains approximately 38 GB of text across roughly 8 million documents. Following the WebText recipe, only pages whose Reddit submissions received at least 3 karma are kept, using upvotes as a proxy for quality and relevance.
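The karma-based curation step described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the OpenWebText pipeline; the 3-karma threshold matches the WebText recipe, while the `filter_urls` helper and sample data are hypothetical.

```python
# Illustrative sketch of WebText-style curation: keep only URLs whose Reddit
# submissions earned at least 3 karma, deduplicating along the way.
# The threshold follows the WebText recipe; everything else is made up.
KARMA_THRESHOLD = 3

def filter_urls(submissions, threshold=KARMA_THRESHOLD):
    """Return deduplicated URLs from (url, karma) pairs meeting the karma bar."""
    kept = []
    seen = set()
    for url, karma in submissions:
        if karma >= threshold and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

submissions = [
    ("https://example.com/a", 5),
    ("https://example.com/b", 1),   # below threshold, dropped
    ("https://example.com/a", 7),   # duplicate URL, dropped
    ("https://example.com/c", 3),
]
print(filter_urls(submissions))  # ['https://example.com/a', 'https://example.com/c']
```

The real pipeline additionally fetches each page, extracts plain text, and removes near-duplicate documents before the corpus is assembled.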
https://huggingface.co/datasets/openwebtext
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: C
- Citations: B+
- Engagement: F
Specifications
- License
- CC0-1.0
- Pricing
- free
- Capabilities
- Unsupervised language model pretraining, Text generation research and benchmarking, Corpus linguistics analysis, Transfer learning for downstream NLP tasks, Replication studies of large language models, Development of text data filtering techniques, Studying biases in social media-curated content
- API Available
- Yes
- Tags
- nlp, web-text, reddit, open-source, gpt-2, language-modeling, pretraining-corpus, text-generation, unsupervised-learning, english-corpus, data-curation
- Added
- 2026-03-17
- Completeness
- 0.6%
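The pretraining capability listed above usually means packing the corpus into fixed-length context windows. The sketch below shows the idea with a naive whitespace tokenizer and a tiny window size; real pipelines use a BPE tokenizer and windows of 1024+ tokens, and the `pack_into_windows` helper is hypothetical. The `<|endoftext|>` separator does match GPT-2's convention.

```python
# Illustrative only: pack documents into fixed-length context windows for
# unsupervised LM pretraining. Tokenization here is a naive whitespace split;
# window=8 is just for demonstration.
def pack_into_windows(documents, window=8, eot="<|endoftext|>"):
    """Concatenate documents (separated by an end-of-text token) and slice
    the resulting token stream into non-overlapping windows."""
    stream = []
    for doc in documents:
        stream.extend(doc.split())
        stream.append(eot)  # document separator, as in GPT-2 training
    # drop the trailing partial window
    return [stream[i:i + window] for i in range(0, len(stream) - window + 1, window)]

docs = ["one two three", "four five"]
print(pack_into_windows(docs, window=3))
```

Packing across document boundaries like this keeps every window full, at the cost of occasionally mixing two documents in one context; the separator token lets the model learn where one ends.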
Index Score
- Overall: 66.4
- Adoption: 76
- Quality: 81
- Freshness: 40
- Citations: 79
- Engagement: 0