Dataset · LLMs · v1.0

OpenWebText

by EleutherAI · free · Last verified 2026-03-17

OpenWebText is a large-scale, open-source English text corpus built by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's WebText dataset, which was used to train GPT-2, it contains approximately 38 GB of text filtered by Reddit upvote counts on the submitting posts as a proxy for quality and relevance.

https://huggingface.co/datasets/openwebtext
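Since the corpus is hosted on the Hugging Face Hub (link above), it can be sampled without downloading the full ~38 GB archive. A minimal sketch, assuming the `datasets` library (`pip install datasets`) and network access to the Hub; the helper name `sample_openwebtext` is hypothetical:

```python
# Hypothetical sketch: stream a handful of OpenWebText documents rather
# than materializing the whole ~38 GB corpus on disk. Assumes the
# `datasets` library is installed; recent versions may also require
# trust_remote_code=True for script-based datasets like this one.
from itertools import islice


def sample_openwebtext(n=3):
    """Yield the raw text of the first n documents from OpenWebText."""
    # Deferred import so the (heavy) dependency is only loaded on use.
    from datasets import load_dataset

    # OpenWebText exposes a single "train" split; each record has a
    # single "text" field containing one scraped web page.
    ds = load_dataset("openwebtext", split="train", streaming=True)
    for record in islice(ds, n):
        yield record["text"]


if __name__ == "__main__":
    for i, text in enumerate(sample_openwebtext()):
        print(i, text[:80])
```

Streaming mode iterates over shards lazily, which is the usual choice for pretraining-corpus inspection or filtering experiments where a full local copy is unnecessary.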
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: C · Citations: B+ · Engagement: F

Specifications

License
CC0-1.0
Pricing
free
Capabilities
Unsupervised language model pretraining, Text generation research and benchmarking, Corpus linguistics analysis, Transfer learning for downstream NLP tasks, Replication studies of large language models, Development of text data filtering techniques, Studying biases in social media-curated content
API Available
Yes
Tags
nlp, web-text, reddit, open-source, gpt-2, language-modeling, pretraining-corpus, text-generation, unsupervised-learning, english-corpus, data-curation
Added
2026-03-17
Completeness
0.6%

Index Score: 66.4
Adoption: 76 · Quality: 81 · Freshness: 40 · Citations: 79 · Engagement: 0
