Dataset · LLMs · v1.0

FineWeb

by Hugging Face · open-source · Last verified 2026-03-17

A 15-trillion-token English web dataset curated by Hugging Face from 96 Common Crawl snapshots using a multi-stage filtering and deduplication pipeline. Hugging Face reports that models pretrained on FineWeb outperform models trained on other open web datasets in its benchmark ablations. FineWeb ships alongside FineWeb-Edu, a 1.3-trillion-token subset of educational content selected with an LLM-based quality classifier.

https://huggingface.co/datasets/HuggingFaceFW/fineweb
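
Because the full dataset is far too large to download casually, a common way to inspect it is to stream a sample. The sketch below assumes the Hugging Face `datasets` library is installed, that the repository exposes a sampled config named `sample-10BT`, and that each record carries `url` and `token_count` fields; verify all three against the dataset card before relying on them. The helper `take_long_docs` is a hypothetical name introduced here for illustration.

```python
from itertools import islice
from typing import Iterable


def stream_fineweb(config: str = "sample-10BT") -> Iterable[dict]:
    """Return a streaming iterator over one FineWeb config.

    Assumption: `sample-10BT` is a valid config name on the dataset card.
    """
    # Imported lazily so the pure-Python helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb", name=config, split="train", streaming=True
    )


def take_long_docs(records: Iterable[dict], min_tokens: int, limit: int) -> list:
    """Collect up to `limit` records whose `token_count` is at least `min_tokens`.

    Assumption: records expose an integer `token_count` field; records
    without it are treated as having zero tokens and are skipped.
    """
    long_enough = (r for r in records if r.get("token_count", 0) >= min_tokens)
    return list(islice(long_enough, limit))


# Example usage (requires network access and `pip install datasets`):
#   for doc in take_long_docs(stream_fineweb(), min_tokens=500, limit=3):
#       print(doc["url"], doc["token_count"])
```

Streaming keeps memory use flat because records are fetched lazily, and `islice` stops the iteration as soon as enough documents have been collected.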
Overall grade: B+ (Good)
Adoption: A · Quality: A+ · Freshness: A+ · Citations: B+ · Engagement: F

Specifications

License: ODC-BY
Pricing: open-source
Capabilities: language-modeling, pretraining, educational-content
Integrations: hugging-face, datatrove
Use Cases: llm-pretraining, research, educational-model-training
API Available: Yes
Tags: nlp, pretraining, web-crawl, quality-filtered, hugging-face
Added: 2026-03-17
Completeness: 100%

Index Score

Overall: 70.1
Adoption: 80
Quality: 93
Freshness: 92
Citations: 78
Engagement: 0
