
Wikipedia Dump

by Wikimedia Foundation · open-source · Last verified 2026-03-17

The full-text dump of Wikipedia articles, available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most widely used components in language-model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.

https://dumps.wikimedia.org
Overall grade: A (Great)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A+ · Engagement: F

Specifications

License
CC-BY-SA-4.0
Pricing
open-source
Capabilities
language-modeling, question-answering, fact-checking, pretraining
Integrations
hugging-face, tensorflow-datasets
Use Cases
llm-pretraining, qa-systems, knowledge-grounding, rag
API Available
Yes
Tags
nlp, encyclopedic, factual, multilingual, pretraining
Added
2026-03-17
Completeness
100%
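The dumps published at dumps.wikimedia.org follow a predictable directory layout, so download URLs can be built from the wiki's database name and a dump date. A minimal sketch, assuming the standard `pages-articles` naming convention (the `dump_url` helper is illustrative, and which date directories exist varies per wiki):

```python
def dump_url(wiki: str, date: str) -> str:
    """Build the URL of the pages-articles dump for a given wiki and date.

    `wiki` is the database name (e.g. "enwiki" for English Wikipedia);
    `date` is a dump date in YYYYMMDD form.
    """
    # Standard layout: /<wiki>/<date>/<wiki>-<date>-pages-articles.xml.bz2
    return (
        f"https://dumps.wikimedia.org/{wiki}/{date}/"
        f"{wiki}-{date}-pages-articles.xml.bz2"
    )

print(dump_url("enwiki", "20240301"))
```

The raw dumps are bz2-compressed XML; for the pretraining and RAG use cases listed above, preprocessed plain-text versions distributed through the Hugging Face hub are often more convenient than parsing the XML directly.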

Index Score

80.2
Adoption
95
Quality
90
Freshness
88
Citations
97
Engagement
0
