Dataset · knowledge · v20231101

Wikipedia (Processed)

by Wikimedia Foundation / Hugging Face · open-source · Last verified 2026-03-17

The processed Wikipedia dataset is a cleaned and tokenized version of the Wikipedia dumps, covering 20+ languages and available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most widely used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain question-answering (QA) systems.
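A minimal sketch of how this dataset is typically used for RAG knowledge grounding: splitting an article's preserved paragraph structure into retrieval-sized passages. The helper below is illustrative, not part of the dataset card; the 1000-character passage limit is an arbitrary assumption.

```python
# Hypothetical helper: group an article's newline-separated paragraphs
# into passages of bounded length for a RAG retriever.

def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Pack consecutive paragraphs into passages of at most max_chars."""
    passages: list[str] = []
    current = ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue  # skip blank lines between paragraphs
        # Start a new passage if adding this paragraph would exceed the cap.
        if current and len(current) + len(para) + 1 > max_chars:
            passages.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        passages.append(current)
    return passages
```

With the `datasets` library installed, records can be pulled lazily via `load_dataset("wikipedia", "20220301.en", split="train", streaming=True)` (the snapshot/config name varies by dump date and language) and each record's `text` field fed to `chunk_paragraphs` before indexing.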

https://huggingface.co/datasets/wikipedia
Overall grade: A (Great)
Adoption: A+ · Quality: A · Freshness: A · Citations: A+ · Engagement: F

Specifications

License
CC BY-SA 4.0
Pricing
open-source
Capabilities
pretraining, rag-knowledge-base, open-domain-qa
Integrations
huggingface-datasets, langchain
Use Cases
language-model-pretraining, rag-retrieval, knowledge-grounding
API Available
Yes
Tags
wikipedia, encyclopedic, pretraining, multilingual, text
Added
2026-03-17
Completeness
100%

Index Score

80.2
Adoption
97
Quality
88
Freshness
80
Citations
95
Engagement
0
