Wikipedia (Processed)
by Wikimedia Foundation / Hugging Face · open-source · Last verified 2026-03-17
The processed Wikipedia dataset is a cleaned and tokenized version of Wikipedia dumps covering 20+ languages, available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most widely used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain QA systems.
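As a rough illustration of why preserved paragraph structure matters for RAG, the sketch below splits article text into paragraph-level passages and ranks them by simple word overlap with a query. It is a toy baseline, not the method of any particular system. The record fields `title` and `text`, the blank-line paragraph separator, and the helper names (`split_paragraphs`, `build_passages`, `retrieve`) are assumptions made for this example; in practice the records would be loaded through Hugging Face Datasets rather than written inline.

```python
import re
from collections import Counter


def split_paragraphs(article_text):
    """Split an article body into non-empty paragraphs on blank lines (assumed separator)."""
    return [p.strip() for p in re.split(r"\n\s*\n", article_text) if p.strip()]


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_passages(records):
    """Turn article records into paragraph-level passages for retrieval."""
    passages = []
    for rec in records:
        for i, para in enumerate(split_paragraphs(rec["text"])):
            passages.append({"title": rec["title"], "para_id": i, "text": para})
    return passages


def retrieve(query, passages, k=2):
    """Return up to k passages ranked by word overlap with the query."""
    query_counts = Counter(tokenize(query))
    scored = []
    for passage in passages:
        # Multiset intersection counts the words shared by query and passage.
        overlap = sum((query_counts & Counter(tokenize(passage["text"]))).values())
        if overlap:
            scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


# Hypothetical toy records standing in for rows of the dataset.
records = [
    {
        "title": "Alan Turing",
        "text": "Alan Turing was an English mathematician.\n\n"
                "He proposed the Turing machine in 1936.",
    },
    {
        "title": "Ada Lovelace",
        "text": "Ada Lovelace was an English writer.\n\n"
                "She worked on the Analytical Engine.",
    },
]

passages = build_passages(records)
top = retrieve("who proposed the Turing machine", passages, k=1)
print(top[0]["title"], top[0]["para_id"])
```

A real baseline would typically replace the overlap score with BM25 or dense embeddings; the paragraph-level chunking step stays the same.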
https://huggingface.co/datasets/wikipedia
Overall grade: A (Great)
Adoption: A+ · Quality: A · Freshness: A · Citations: A+ · Engagement: F
Specifications
- License: CC BY-SA 4.0
- Pricing: open-source
- Capabilities: pretraining, rag-knowledge-base, open-domain-qa
- Integrations: huggingface-datasets, langchain
- Use Cases: language-model-pretraining, rag-retrieval, knowledge-grounding
- API Available: Yes
- Tags: wikipedia, encyclopedic, pretraining, multilingual, text
- Added: 2026-03-17
- Completeness: 100%
Index Score: 80.2
- Adoption: 97
- Quality: 88
- Freshness: 80
- Citations: 95
- Engagement: 0