
OpenWebText

by Aaron Gokaslan & Vanya Cohen · open-source · Last verified 2026-03-17

An open-source recreation of OpenAI's WebText dataset, constructed by scraping the content of URLs shared on Reddit in submissions with at least 3 upvotes. Created to replicate the training data of GPT-2 without direct access to the original corpus, it contains approximately 38 GB of English web text curated through this social-signal filter.
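The karma-threshold curation described above can be sketched as follows. This is a minimal illustration of the idea only; the `filter_urls` helper and its input format are hypothetical and not part of any released OpenWebText tooling, which also performed deduplication and content extraction beyond what is shown here.

```python
def filter_urls(submissions, min_karma=3):
    """Keep each URL whose Reddit submission earned at least `min_karma`
    upvotes, deduplicating while preserving first-seen order."""
    seen = set()
    kept = []
    for url, karma in submissions:
        if karma >= min_karma and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

# Example: only sufficiently upvoted, previously unseen URLs survive.
posts = [
    ("https://example.org/a", 5),
    ("https://example.org/b", 1),   # below threshold, dropped
    ("https://example.org/a", 12),  # duplicate URL, dropped
]
print(filter_urls(posts))  # ['https://example.org/a']
```

The surviving URLs would then be fetched and their page text extracted to form the corpus.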

https://huggingface.co/datasets/openwebtext
Overall Grade: B (Above Average)

Adoption: B+ · Quality: A · Freshness: C · Citations: B+ · Engagement: F

Specifications

License
CC0-1.0
Pricing
open-source
Capabilities
language-modeling, pretraining
Integrations
hugging-face
Use Cases
llm-pretraining, research, baseline-training
API Available
Yes
Tags
nlp, web-text, reddit, open-source, gpt-2
Added
2026-03-17
Completeness
100%

Index Score
66.4
Adoption
76
Quality
81
Freshness
40
Citations
79
Engagement
0
