OpenWebText
by Aaron Gokaslan & Vanya Cohen · free · Last verified 2026-03-17
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's unreleased WebText dataset used to train GPT-2, it contains approximately 38 GB of text across roughly 8 million documents. Following the WebText recipe, only pages whose Reddit submissions received at least 3 karma are kept, using upvotes as a proxy for quality and relevance.
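The karma-based curation step described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the OpenWebText pipeline; the 3-karma threshold matches the WebText recipe, while the `filter_urls` helper and sample data are hypothetical.

```python
# Illustrative sketch of WebText-style curation: keep only URLs whose Reddit
# submissions earned at least 3 karma, deduplicating along the way.
# The threshold follows the WebText recipe; everything else is made up.
KARMA_THRESHOLD = 3

def filter_urls(submissions, threshold=KARMA_THRESHOLD):
    """Return deduplicated URLs from (url, karma) pairs meeting the karma bar."""
    kept = []
    seen = set()
    for url, karma in submissions:
        if karma >= threshold and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

submissions = [
    ("https://example.com/a", 5),
    ("https://example.com/b", 1),   # below threshold, dropped
    ("https://example.com/a", 7),   # duplicate URL, dropped
    ("https://example.com/c", 3),
]
print(filter_urls(submissions))  # ['https://example.com/a', 'https://example.com/c']
```

The real pipeline additionally fetches each page, extracts plain text, and removes near-duplicate documents before the corpus is assembled.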
https://huggingface.co/datasets/openwebtext
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: C
- Citations: B+
- Engagement: F
Specifications
- License
- CC0-1.0
- Pricing
- free
- Capabilities
- Unsupervised language model pretraining, Text generation research and benchmarking, Corpus linguistics analysis, Transfer learning for downstream NLP tasks, Replication studies of large language models, Development of text data filtering techniques, Studying biases in social media-curated content
- API Available
- Yes
- Tags
- nlp, web-text, reddit, open-source, gpt-2, language-modeling, pretraining-corpus, text-generation, unsupervised-learning, english-corpus, data-curation
- Added
- 2026-03-17
- Completeness
- 0.6%
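The pretraining capability listed above usually means packing the corpus into fixed-length context windows. The sketch below shows the idea with a naive whitespace tokenizer and a tiny window size; real pipelines use a BPE tokenizer and windows of 1024+ tokens, and the `pack_into_windows` helper is hypothetical. The `<|endoftext|>` separator does match GPT-2's convention.

```python
# Illustrative only: pack documents into fixed-length context windows for
# unsupervised LM pretraining. Tokenization here is a naive whitespace split;
# window=8 is just for demonstration.
def pack_into_windows(documents, window=8, eot="<|endoftext|>"):
    """Concatenate documents (separated by an end-of-text token) and slice
    the resulting token stream into non-overlapping windows."""
    stream = []
    for doc in documents:
        stream.extend(doc.split())
        stream.append(eot)  # document separator, as in GPT-2 training
    # drop the trailing partial window
    return [stream[i:i + window] for i in range(0, len(stream) - window + 1, window)]

docs = ["one two three", "four five"]
print(pack_into_windows(docs, window=3))
```

Packing across document boundaries like this keeps every window full, at the cost of occasionally mixing two documents in one context; the separator token lets the model learn where one ends.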
Index Score
- Overall: 66.4
- Adoption: 76
- Quality: 81
- Freshness: 40
- Citations: 79
- Engagement: 0