Dataset · knowledge · v20231101

Wikipedia (Processed)

by Wikimedia Foundation / Hugging Face · open-source · Last verified 2026-03-17

The processed Wikipedia dataset is a cleaned and tokenized version of the Wikipedia dumps, covering 20+ languages and available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most widely used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain question-answering (QA) systems.
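A minimal sketch of how this dataset is typically used for RAG knowledge grounding: splitting an article's preserved paragraph structure into retrieval-sized passages. The helper below is illustrative, not part of the dataset card; the 1000-character passage limit is an arbitrary assumption.

```python
# Hypothetical helper: group an article's newline-separated paragraphs
# into passages of bounded length for a RAG retriever.

def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Pack consecutive paragraphs into passages of at most max_chars."""
    passages: list[str] = []
    current = ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue  # skip blank lines between paragraphs
        # Start a new passage if adding this paragraph would exceed the cap.
        if current and len(current) + len(para) + 1 > max_chars:
            passages.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        passages.append(current)
    return passages
```

With the `datasets` library installed, records can be pulled lazily via `load_dataset("wikipedia", "20220301.en", split="train", streaming=True)` (the snapshot/config name varies by dump date and language) and each record's `text` field fed to `chunk_paragraphs` before indexing.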

https://huggingface.co/datasets/wikipedia
Overall grade: A (Great)
Adoption: A+ · Quality: A · Freshness: A · Citations: A+ · Engagement: F

Specifications

License
CC BY-SA 4.0
Pricing
open-source
Capabilities
pretraining, rag-knowledge-base, open-domain-qa
Integrations
huggingface-datasets, langchain
Use Cases
language-model-pretraining, rag-retrieval, knowledge-grounding
API Available
Yes
Tags
wikipedia, encyclopedic, pretraining, multilingual, text
Added
2026-03-17
Completeness
100%

Index Score

80.2
Adoption
97
Quality
88
Freshness
80
Citations
95
Engagement
0
