
Wikipedia Dump

by Wikimedia Foundation · open-source · Last verified 2026-03-17

The full-text dump of Wikipedia articles, available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most widely used components in language-model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.

https://dumps.wikimedia.org
Overall grade: A (Great)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A+ · Engagement: F

Specifications

License
CC-BY-SA-4.0
Pricing
open-source
Capabilities
language-modeling, question-answering, fact-checking, pretraining
Integrations
hugging-face, tensorflow-datasets
Use Cases
llm-pretraining, qa-systems, knowledge-grounding, rag
API Available
Yes
Tags
nlp, encyclopedic, factual, multilingual, pretraining
Added
2026-03-17
Completeness
100%
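The dumps published at dumps.wikimedia.org follow a predictable directory layout, so download URLs can be built from the wiki's database name and a dump date. A minimal sketch, assuming the standard `pages-articles` naming convention (the `dump_url` helper is illustrative, and which date directories exist varies per wiki):

```python
def dump_url(wiki: str, date: str) -> str:
    """Build the URL of the pages-articles dump for a given wiki and date.

    `wiki` is the database name (e.g. "enwiki" for English Wikipedia);
    `date` is a dump date in YYYYMMDD form.
    """
    # Standard layout: /<wiki>/<date>/<wiki>-<date>-pages-articles.xml.bz2
    return (
        f"https://dumps.wikimedia.org/{wiki}/{date}/"
        f"{wiki}-{date}-pages-articles.xml.bz2"
    )

print(dump_url("enwiki", "20240301"))
```

The raw dumps are bz2-compressed XML; for the pretraining and RAG use cases listed above, preprocessed plain-text versions distributed through the Hugging Face hub are often more convenient than parsing the XML directly.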

Index Score

80.2
Adoption
95
Quality
90
Freshness
88
Citations
97
Engagement
0
