Knowledge Index

Explore.

7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.

100 entities · dataset

Dataset · ai-datasets

AI2 Reasoning Challenge (ARC)

by Allen Institute for AI (AI2)

The AI2 Reasoning Challenge (ARC) is a dataset designed to encourage research in advanced question answering. It consists of grade-school science questions specifically crafted to require reasoning beyond simple fact retrieval, posing a significant challenge for AI models.

question answering · reasoning · science
84.2 · A

Dataset · Computer Vision

ImageNet-1K

by ImageNet / Stanford Vision Lab

The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.

image-classification · object-recognition · benchmark
83.3 · A

Dataset · Computer Vision

COCO 2017

by Microsoft

Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
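
COCO annotations ship as a single JSON file whose layout most detection loaders mirror; a minimal sketch with made-up ids, file name, and values (illustrative only, not real COCO data):

```python
import json

# Hypothetical miniature annotation file following the COCO layout:
# top-level "images", "annotations", and "categories" arrays, with each
# annotation linking an image to a category via integer ids and a
# [x, y, width, height] bounding box.
coco = {
    "images": [{"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}],
    "annotations": [{
        "id": 101,
        "image_id": 1,
        "category_id": 18,
        "bbox": [10.0, 20.0, 200.0, 150.0],
        "area": 30000.0,
        "iscrowd": 0,
        "segmentation": [[10.0, 20.0, 210.0, 20.0, 210.0, 170.0, 10.0, 170.0]],
    }],
    "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}],
}

# Round-trip through JSON, then index annotations by image,
# as most detection data loaders do.
coco = json.loads(json.dumps(coco))
by_image = {}
for ann in coco["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(ann)
```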

object-detection · segmentation · keypoints
82.5 · A

Dataset · scientific

Protein Data Bank

by RCSB PDB / wwPDB Consortium

The RCSB Protein Data Bank (PDB) is the single worldwide archive of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies, currently containing over 220,000 biological macromolecular structures determined by X-ray crystallography, NMR, and cryo-EM. It is the foundational structural dataset for computational biology and was used to train and validate AlphaFold2 and other structure-prediction models.

proteins · structures · biology
81.9 · A

Dataset · scientific

UniProt

by UniProt Consortium (EMBL-EBI / SIB / PIR)

UniProt (Universal Protein Resource) is the world's comprehensive, freely accessible protein sequence and functional information database, maintained by a consortium of EMBL-EBI, SIB, and PIR. It contains over 250 million protein sequences in UniParc, with 570,000+ manually reviewed entries in SwissProt providing expert-curated functional annotations, and serves as the gold-standard training source for protein language models.

proteins · biology · sequences
80.9 · A

Dataset · benchmarks

MMLU Dataset

by UC Berkeley

Massive Multitask Language Understanding (MMLU) is a benchmark covering 57 academic subjects from STEM to humanities, with 14,000+ multiple-choice questions at undergraduate and professional level. It has become the de facto standard for measuring broad world knowledge and academic reasoning in LLMs.

benchmark · multiple-choice · knowledge
80.9 · A

Dataset · knowledge

Wikipedia (Processed)

by Wikimedia Foundation / Hugging Face

The processed Wikipedia dataset is a cleaned and tokenized version of Wikipedia dumps covering 20+ languages, available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most universally used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain QA systems.

wikipedia · encyclopedic · pretraining
80.2 · A

Dataset · LLMs

Wikipedia Dump

by Wikimedia Foundation

The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.

nlp · encyclopedic · factual
80.2 · A

Dataset · Speech & Audio AI

LibriSpeech

by OpenSLR / Johns Hopkins University

LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech derived from LibriVox audiobooks. Its training data is split into "clean" subsets of 100 and 360 hours and a harder 500-hour "other" subset, with dedicated development and test sets for both conditions. It has become the de facto standard benchmark for English ASR systems.

automatic-speech-recognition · ASR · english
80.2 · A

Dataset · benchmarks

GSM8K Dataset

by OpenAI

Grade School Math 8K (GSM8K) is a dataset of 8,500 high-quality, linguistically diverse grade-school math word problems requiring two to eight reasoning steps. Created by OpenAI, it is widely used for evaluating multi-step arithmetic reasoning and the effectiveness of chain-of-thought prompting.

benchmark · math · grade-school
79.8 · B+

Dataset · scientific

PubChem

by NCBI / NIH

PubChem is the world's largest open chemical database maintained by the NCBI, containing information on over 115 million compounds, 295 million substances, and 270 million bioactivity outcomes from more than 1.2 million assays. It provides standardized molecular structures, properties, and biological activity data freely accessible via REST API and bulk download, making it the canonical resource for cheminformatics and drug discovery research.
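
The REST access mentioned above goes through PubChem's PUG REST interface; a minimal sketch that only builds a request URL (the helper function name is ours, the property list is illustrative, and no network call is made):

```python
from urllib.parse import quote

# Base of PubChem's PUG REST interface.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def compound_property_url(name: str, props: list) -> str:
    """Build a PUG REST URL that looks up a compound by name and
    requests the given computed properties as JSON. Fetching the URL
    (e.g. with urllib.request) would return the JSON payload."""
    return f"{BASE}/compound/name/{quote(name)}/property/{','.join(props)}/JSON"

url = compound_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
```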

chemistry · molecules · bioassay
79.6 · B+

Dataset · ai-datasets

GENIE Benchmark

by Stanford University

The GENIE Benchmark is a comprehensive dataset for evaluating the performance of text-to-SQL models. It includes a diverse set of SQL queries and corresponding natural language questions across multiple domains, designed to assess the generalization capabilities of these models.

text-to-sql · natural language processing · database
79.2 · B+

Dataset · AI for Code

HumanEval Dataset

by OpenAI

A curated set of 164 handwritten Python programming problems released by OpenAI, each consisting of a function signature, docstring, reference solution, and unit tests. HumanEval introduced the pass@k metric for functional code correctness evaluation and has become the de facto standard benchmark reported in virtually every code generation model paper.
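
The pass@k metric can be computed with the unbiased estimator from the HumanEval paper; a minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples are
    generated per problem, c of them pass the unit tests, and we
    estimate the probability that at least one of k drawn samples
    is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 2 samples of which c = 1 passes, pass@1 is 0.5, matching the intuition that a random single draw succeeds half the time.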

code · evaluation · python
79 · B+

Dataset · medical

MIMIC-IV

by MIT Laboratory for Computational Physiology / Beth Israel Deaconess Medical Center

MIMIC-IV (Medical Information Mart for Intensive Care) is a comprehensive de-identified electronic health record database covering over 300,000 patients admitted to the emergency department or ICU of Beth Israel Deaconess Medical Center between 2008 and 2019. It contains detailed clinical data including diagnoses, procedures, medications, and laboratory values, enabling a wide range of clinical AI research.

ehr · clinical · icu
78.8 · B+

Dataset · benchmarks

MATH Dataset

by UC Berkeley

A challenging benchmark of 12,500 competition mathematics problems from AMC, AIME, and similar competitions across 5 difficulty levels and 7 subjects. Each problem includes a full step-by-step solution in LaTeX, making it suitable for both evaluation and training of mathematical reasoning.

benchmark · competition-math · hard-math
77.3 · B+

Dataset · Computer Vision

SA-1B (Segment Anything)

by Meta AI

SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest segmentation dataset ever created and enables training of generalist vision models with strong zero-shot transfer capabilities.

segmentation · SAM · foundation-model
77.2 · B+

Dataset · benchmarks

HellaSwag Dataset

by University of Washington

HellaSwag is an adversarially filtered commonsense NLI benchmark where models must pick the most plausible sentence completion from 4 options. Humans score 95%+ while early LLMs struggled below 50%, making it a robust test of grounded language understanding and commonsense reasoning.

benchmark · commonsense · sentence-completion
77 · B+

Dataset · LLMs

Common Crawl

by Common Crawl Foundation

The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It forms the foundational raw data layer for virtually every major language model pretraining pipeline including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication steps.

nlp · web-crawl · massive-scale
76.4 · B+

Dataset · benchmarks

ARC Dataset

by Allen Institute for AI

The AI2 Reasoning Challenge (ARC) dataset contains 7,787 grade 3–9 science exam questions split into Easy and Challenge partitions. The Challenge set contains questions that require deeper reasoning and world knowledge, making it a reliable signal for advanced language understanding.

benchmark · science-questions · multiple-choice
76.2 · B+

Dataset · Computer Vision

Open Images V7

by Google

Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.

object-detection · segmentation · visual-relationships
76.1 · B+

Dataset · benchmarks

TruthfulQA Dataset

by University of Oxford

TruthfulQA measures the truthfulness of LLMs across 817 adversarially crafted questions spanning 38 categories where humans are commonly misled by false beliefs. Models are scored on generating truthful AND informative answers, revealing how larger models can paradoxically become more confidently wrong.

benchmark · truthfulness · hallucination
75.1 · B+

Dataset · knowledge

Stack Exchange Dump

by Stack Exchange

The Stack Exchange Data Dump is a quarterly XML export of all public questions, answers, comments, and votes across the entire Stack Exchange network of 170+ Q&A communities including Stack Overflow. Containing hundreds of millions of high-quality technical and domain-specific Q&A pairs, it is a critical pretraining source for code and reasoning capabilities and a standard retrieval benchmark for dense passage retrieval.
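
The dump's Posts.xml stores one self-closing `<row>` element per post, with all fields as XML attributes (PostTypeId 1 = question, 2 = answer, linked via ParentId); a minimal parsing sketch over a made-up two-post sample:

```python
import xml.etree.ElementTree as ET

# Tiny inline sample mimicking the Posts.xml layout of the dump
# (the ids, scores, and title here are invented for illustration).
sample = """<posts>
  <row Id="1" PostTypeId="1" Score="12" Title="How do I parse XML in Python?" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="7" />
</posts>"""

root = ET.fromstring(sample)
# Split rows into questions (keyed by Id) and answers (linked by ParentId).
questions = {r.get("Id"): r.get("Title") for r in root if r.get("PostTypeId") == "1"}
answers = [(r.get("ParentId"), r.get("Score")) for r in root if r.get("PostTypeId") == "2"]
```

At dump scale the same attribute-based rows are typically streamed with `ET.iterparse` rather than loaded whole.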

qa · community · code
75 · B+

Dataset · benchmarks

SuperGLUE

by New York University

SuperGLUE is a benchmark suite of 8 challenging NLU tasks including question answering, coreference resolution, causal reasoning, and word-sense disambiguation, designed as a harder successor to GLUE. It includes human baselines and has driven significant progress in pre-trained language model capabilities.

benchmark · nlp-benchmark · natural-language-understanding
74.5 · B+

Dataset · Computer Vision

LAION-5B

by LAION

The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.

multimodal · image-text · large-scale
74.2 · B+

Dataset · Computer Vision

ADE20K Dataset

by MIT CSAIL

ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.

semantic-segmentation · scene-parsing · scene-understanding
74.2 · B+

Dataset · benchmarks

WinoGrande Dataset

by Allen Institute for AI

WinoGrande is a large-scale crowdsourced dataset of 44,000 Winograd-style fill-in-the-blank commonsense problems, debiased using the AFLITE algorithm to minimize spurious statistical cues. It is significantly harder than the original Winograd Schema Challenge for contemporary NLP models.

benchmark · commonsense · winograd-schema
73.8 · B+

Dataset · Speech & Audio AI

AudioSet

by Google

Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.

audio-classification · sound-events · large-scale
73.7 · B+

Dataset · medical

CheXpert

by Stanford ML Group

CheXpert is a large chest X-ray dataset from Stanford containing 224,316 chest radiographs from 65,240 patients with labels for 14 observations mined from radiology reports using an automated labeler. It uniquely addresses label uncertainty with positive, negative, and uncertain labels, making it a challenging and realistic benchmark for automated chest X-ray interpretation.
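
The uncertainty-aware labels are usually resolved before training; a minimal sketch of the "U-Ones" and "U-Zeros" policies discussed in the CheXpert paper (the function name and encoding convention below are ours):

```python
# CheXpert encodes each of the 14 observations per study as positive (1),
# negative (0), or uncertain (-1); unmentioned observations are blank.
# A common baseline maps uncertain labels to a fixed value before
# training a standard multi-label classifier.
def resolve_uncertain(label: int, policy: str = "U-Ones") -> int:
    """Map an uncertain (-1) label to 1 (U-Ones) or 0 (U-Zeros);
    definite labels pass through unchanged."""
    if label == -1:
        return 1 if policy == "U-Ones" else 0
    return label

study = [1, -1, 0, -1]  # hypothetical labels for four observations
resolved = [resolve_uncertain(x, policy="U-Ones") for x in study]
```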

chest-x-ray · radiology · multi-label
73.2 · B+

Dataset · medical

PubMedCentral OA

by National Institutes of Health / National Library of Medicine

PubMedCentral Open Access (PMC OA) is a subset of the PMC literature archive made freely available for text mining and NLP research, containing over 4 million full-text biomedical and life science articles. It is the primary corpus used for pretraining biomedical language models such as BioBERT, PubMedBERT, and BioGPT.

biomedical-nlp · scientific-literature · full-text
73.1 · B+

Dataset · Speech & Audio AI

VoxCeleb2

by Oxford Visual Geometry Group (VGG)

VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.

speaker-verification · speaker-recognition · in-the-wild
73 · B+

Dataset · multilingual

FLORES-200 Dataset

by Meta AI

FLORES-200 is Meta's translation evaluation benchmark spanning 200 languages, including many low-resource and endangered ones. Each language contains 1,012 parallel sentences translated from English Wikipedia articles, released as dev and devtest splits for systematic MT evaluation at scale.

evaluation · machine-translation · 200-languages
73 · B+

Dataset · instruction-tuning

Alpaca Dataset

by Stanford University

Stanford Alpaca is a dataset of 52,000 instruction-following examples generated by applying the self-instruct technique to GPT-3.5 (text-davinci-003). This foundational dataset enabled the creation of the Alpaca 7B model and popularized cost-effective instruction-tuning approaches.
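
At fine-tuning time, Alpaca examples are rendered with a fixed prompt template; a sketch using the no-input variant from the Stanford Alpaca repository (the example instruction is made up):

```python
# No-input prompt template from the Stanford Alpaca repository; examples
# with an input field use a slightly longer variant with an "### Input:"
# section.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = TEMPLATE.format(instruction="Name three primary colors.")
```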

instruction-following · self-instruct · stanford
73 · B+

Dataset · Speech & Audio AI

Common Voice 15

by Mozilla

Mozilla's Common Voice 15.0 is the world's largest publicly available multilingual speech corpus, containing over 30,000 hours of validated speech data across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.

ASR · multilingual · crowdsourced
72.6 · B+

Dataset · financial

SEC-EDGAR Filings

by U.S. Securities and Exchange Commission

The SEC-EDGAR Filings dataset encompasses over 20 million full-text regulatory filings submitted to the US Securities and Exchange Commission since 1993, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, and proxy statements from all US public companies. It is the foundational corpus for financial NLP research, sentiment analysis, and financial document AI.

financial-nlp · 10-K · 10-Q
72.5 · B+

Dataset · AI for Code

MBPP (Mostly Basic Python Problems)

by Google

A dataset of 974 crowd-sourced Python programming problems suitable for entry-level programmers, each with a problem description, code solution, and three automated test cases. MBPP complements HumanEval by covering a broader variety of programming concepts and is widely used alongside it for comprehensive evaluation of code generation capabilities across model families.

code · evaluation · python
72.5 · B+

Dataset · AI for Code

The Stack v2

by BigCode

An expanded code pretraining dataset containing 3 trillion tokens of source code in 619 programming languages, curated by BigCode from GitHub repositories with permissive SPDX licenses. Version 2 triples the size of the original Stack and includes improved deduplication, opt-out mechanisms for authors, and structured data from GitHub issues and pull requests alongside raw source files.

code · pretraining · permissive-license
72.3 · B+

Dataset · Computer Vision

CelebA-HQ

by NVIDIA / CUHK

CelebA-HQ is a high-quality version of the CelebA face dataset containing 30,000 celebrity images at 1024×1024 resolution with 40 binary attribute annotations. It was introduced alongside Progressive GAN and has become the standard benchmark for high-fidelity face generation and synthesis research.

face-generation · GAN · high-resolution
72.3 · B+

Dataset · scientific

ArXiv Papers Dataset

by Cornell University / arXiv

The ArXiv Papers Dataset is a bulk export of over 2.3 million scientific preprints from arXiv spanning physics, mathematics, computer science, biology, finance, and economics, provided by Cornell University and hosted on Kaggle and AWS S3. The full-text LaTeX source and parsed metadata make it a primary pretraining corpus for scientific language models and citation-network research.

scientific-papers · preprints · nlp
72.2 · B+

Dataset · instruction-tuning

OpenAssistant Conversations

by LAION

A large-scale, human-annotated dataset of assistant-style conversations collected through the OpenAssistant crowdsourcing platform. Contains over 161,000 messages across 66,000+ conversation trees, with ranked responses for RLHF training.

rlhf · instruction-following · conversations
72 · B+

Dataset · multilingual

mC4

by Google

The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.

multilingual · web-crawl · pre-training
72 · B+

Dataset · scientific

Semantic Scholar ORC

by Allen Institute for AI (AI2)

The Semantic Scholar Open Research Corpus (S2ORC) is a large English-language corpus of 136 million academic papers with structured metadata, abstracts, citation graphs, and full-text body paragraphs where licensing allows. Maintained by the Allen Institute for AI, it covers 19 scientific fields and is widely used for scientific NLP tasks including citation prediction, claim verification, and scientific QA.

scientific-papers · open-research · full-text
71.7 · B+

Dataset · LLMs

BookCorpus

by University of Toronto

A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used as the primary pretraining corpus for BERT alongside Wikipedia. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.

nlp · books · long-form
71.3 · B+

Dataset · Computer Vision

Places365

by MIT CSAIL

Places365 is a scene-centric database with 1.8 million training images across 365 scene categories, designed to train and evaluate scene recognition models. The dataset enables models to understand the semantic meaning of places and environments, making it ideal for applications in autonomous driving, robotics, and image retrieval.

scene-recognition · scene-classification · transfer-learning
70.7 · B+

Dataset · AI for Code

CodeSearchNet

by GitHub / Microsoft Research

A dataset and benchmark challenge for code retrieval and search containing 2 million (code, documentation) pairs in six programming languages — Python, Java, JavaScript, PHP, Ruby, and Go — curated by GitHub and Microsoft Research. It is the canonical benchmark for code-to-natural-language and natural-language-to-code retrieval tasks and is widely used to evaluate code embedding models.

code · code-search · documentation
70.4 · B+

Dataset · AI for Code

APPS (Automated Programming Progress Standard)

by UC Berkeley

A benchmark of 10,000 programming problems at introductory, interview, and competitive programming difficulty levels, each with problem statements, test cases, and human-written solutions. APPS is the standard dataset for evaluating code generation models on realistic programming tasks ranging from simple loops to complex algorithmic challenges drawn from competitive programming platforms.

code · competitive-programming · evaluation
70.3 · B+

Dataset · instruction-tuning

UltraFeedback

by Tsinghua University

A large-scale, high-quality preference dataset with 64,000 instructions each answered by 4 LLMs and rated by GPT-4 on instruction-following, truthfulness, honesty, and helpfulness. UltraFeedback is the backbone of the Zephyr and Tulu 2 DPO models.

rlhf · preference-data · gpt-4-annotated
70.2 · B+

Dataset · financial

Financial PhraseBank

by Pekka Malo et al. / Aalto University

Financial PhraseBank is a sentiment analysis dataset containing 4,845 sentences from English-language financial news annotated by 16 financial domain experts with positive, negative, or neutral sentiment labels. It is the most widely used benchmark for financial sentiment analysis and has been used to fine-tune FinBERT and numerous other financial NLP models.

financial-sentiment · NLP · sentiment-analysis
70.1 · B+

Dataset · alignment

Self-Instruct

by University of Washington

Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), where 175 human-written seed tasks are iteratively expanded into 52,000 instruction-input-output triplets using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
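
The bootstrapping loop can be sketched in a few lines, with a stub standing in for the GPT-3 generator and a crude membership check standing in for the paper's ROUGE-based filtering (all names and the task format here are illustrative, not the paper's exact prompts):

```python
import random

def fake_llm(prompt_tasks):
    # Stand-in generator: real Self-Instruct prompts GPT-3 with a few
    # sampled in-context tasks and asks it to write new ones.
    return [f"variant of: {t}" for t in prompt_tasks]

def self_instruct(seed_tasks, rounds=3, sample_size=2, seed=0):
    """Iteratively grow a task pool from human-written seeds."""
    rng = random.Random(seed)
    pool = list(seed_tasks)
    for _ in range(rounds):
        demo = rng.sample(pool, min(sample_size, len(pool)))
        for task in fake_llm(demo):
            if task not in pool:   # crude dedup; the paper filters
                pool.append(task)  # near-duplicates by ROUGE overlap
    return pool

pool = self_instruct(["Write a haiku about rain.", "Summarize a paragraph."])
```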

instruction-tuning · self-play · seed-tasks
69.8 · B

Dataset · AI for Code

StarCoderData

by BigCode

The roughly 250-billion-token (783 GB) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1 spanning 86 programming languages with permissive licenses. It includes GitHub issues, Git commits, and Jupyter notebook data alongside source files, enabling models to learn from developer workflows and not just static code.

code · pretraining · github
69.7 · B

Dataset · mathematics

DM Mathematics

by Google DeepMind

DeepMind Mathematics (DM Mathematics) is a dataset of 2 million mathematical question-answer pairs covering algebra, arithmetic, calculus, comparisons, measurement, numbers, polynomials, and probability, procedurally generated to test mathematical reasoning capabilities of language models. The symbolic and step-structured nature of the dataset makes it a standard benchmark for evaluating compositional generalization and multi-step arithmetic reasoning.

mathematics · reasoning · symbolic
69.2 · B

Dataset · instruction-tuning

LIMA

by Meta AI

LIMA (Less Is More for Alignment) is a carefully curated dataset of 1,000 high-quality instruction-response pairs demonstrating that alignment quality matters more than quantity. Sourced from Stack Exchange, wikiHow, and manually written prompts, LIMA-tuned models produce responses that human raters judge competitive with those of much larger instruction-tuned models.

quality-over-quantity · instruction-following · meta
69.1 · B

Dataset · instruction-tuning

OpenHermes 2.5

by Nous Research

A large curated synthetic instruction dataset with ~1 million entries sourced from multiple high-quality open datasets including Airoboros, Camel, GPT4-LLM, and others. OpenHermes 2.5 powers the Nous Hermes model family and is widely regarded as one of the best open instruction datasets.

synthetic · gpt-4 · instruction-following
68.7 · B

Dataset · alignment

OASST2

by LAION / OpenAssistant

OpenAssistant Conversations 2 (OASST2) is a crowd-sourced human-annotated dataset of 100,000+ assistant-style conversations in 35 languages, where human contributors created and ranked message trees to produce preference labels for RLHF training. It is the largest open multilingual human-feedback dataset and is widely used for training preference models and reward functions in open-source alignment pipelines.

rlhf · human-feedback · chat
68.5 · B

Dataset · multilingual

NLLB Training Data

by Meta AI

The No Language Left Behind (NLLB) training corpus released by Meta AI contains high-quality parallel data across 200+ language pairs, including newly mined bitext for dozens of low-resource languages. It was used to train the NLLB-200 model achieving state-of-the-art translation on low-resource language pairs.

machine-translation · 200-languages · parallel-corpus
68.5 · B

Dataset · instruction-tuning

ShareGPT

by Community

A community-collected dataset of real ChatGPT and GPT-4 conversation logs shared by users, covering a broad range of tasks and domains. Available in multiple filtered and cleaned versions including ShareGPT52K and ShareGPT90K used by Vicuna and other open models.

conversations · gpt-4 · chatgpt
68.4 · B

Dataset · Computer Vision

LSUN

by Princeton / Columbia University

The Large-Scale Scene Understanding (LSUN) dataset is a massive collection of nearly one million labeled images for each of 10 scene and 20 object categories. It is a key benchmark for advancing research in scene understanding, particularly for generative modeling, classification, and reconstruction tasks.

scene-classification · scene-understanding · large-scale
68.3 · B

Dataset · instruction-tuning

Dolly-15K

by Databricks

Dolly-15K is a high-quality, open-source dataset of 15,000 human-generated instruction-following records. Created by Databricks employees, it is designed for fine-tuning large language models to exhibit ChatGPT-style instruction-following capabilities using a relatively small, targeted dataset.

instruction-tuning · supervised-fine-tuning · human-generated-data
68.3 · B

Dataset · multilingual

OPUS-100

by University of Helsinki

OPUS-100 is a large-scale multilingual parallel corpus for machine translation, featuring 100 languages pivoted through English. Sampled from the OPUS collection, it provides up to 1 million sentence pairs per language pair, making it a standard benchmark for training and evaluating multilingual models.

parallel-corpus · machine-translation · multilingual-nlp
68.1 · B

Dataset · synthetic

Phi-1 TextBooks

by Microsoft

Phi-1 TextBooks is a synthetic dataset of Python coding textbooks and exercises generated by GPT-3.5 and GPT-4. It was created to pretrain Microsoft's Phi-1 small language model, demonstrating that high-quality, curriculum-style data can significantly boost the coding abilities of smaller models compared to training on general web data.

synthetic-data · textbooks · coding
67.7 · B

Dataset · Speech & Audio AI

GigaSpeech

by Seasalt.ai / SpeechColab

GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.

ASR · large-scale · english
67.7 · B

Dataset · instruction-tuning

WizardLM Evol-Instruct

by Microsoft Research

WizardLM Evol-Instruct is a synthetic dataset created by Microsoft Research for fine-tuning large language models. It uses an LLM-based evolutionary process to iteratively rewrite and complicate a seed set of instructions, progressively increasing their complexity and diversity. The dataset is designed to enhance a model's ability to follow intricate, multi-step commands across various domains like coding, math, and reasoning.

evol-instruct · complexity-evolution · synthetic
67.2 · B

Dataset · multilingual

TyDi QA Dataset

by Google Research

TyDi QA is a benchmark for question answering across 11 typologically diverse languages. It features information-seeking questions written by native speakers who have not seen the answer, ensuring real-world applicability. This design challenges models to generalize beyond high-resource, typologically similar languages.

question-answering · multilingual · typologically-diverse
66.9 · B

Dataset · Computer Vision

DataComp-1B

by DataComp Consortium

A curated 1.4 billion image-text pair dataset produced through the DataComp benchmark competition, which challenged participants to filter a 12.8 billion pair candidate pool to produce the best downstream CLIP model. DataComp-1B represents the winning filtering strategy and achieves state-of-the-art zero-shot classification performance among datasets of its size.

multimodal · image-text · benchmark
66.6 · B

Dataset · LLMs

OpenWebText

by Aaron Gokaslan & Vanya Cohen (Brown University)

OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.

nlp · web-text · reddit
66.4 · B

Dataset · LLMs

LAION-400M Text Captions

by LAION

The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.

nlp · captions · image-text
66.3 · B

Dataset · medical

BioASQ Dataset

by BioASQ Consortium

The BioASQ dataset is a benchmark for biomedical semantic indexing and question answering. It contains thousands of expert-annotated questions (factoid, list, yes/no, summary) paired with relevant PubMed articles, concepts, and ideal answers, designed to train and evaluate advanced NLP systems in the medical domain.

biomedical-qa · question-answering · semantic-indexing
66.2 · B

Dataset · robotics

Open X-Embodiment

by Google DeepMind / Consortium

Open X-Embodiment (OXE) is a massive robotics dataset combining over 1 million demonstration episodes from 22 distinct robot embodiments. It covers 527 skills and is designed to train generalist robot policies that can transfer skills across diverse hardware, serving as a key resource for vision-language-action models.

robotics · manipulation · multi-robot
66.1 · B

Dataset · legal

Legal-BERT Training Data

by Ilias Chalkidis et al. / Athens University of Economics and Business

The Legal-BERT training corpus is a large collection of English legal text assembled from UK legislation, EU legislation, ECHR/ECLI court decisions, and US contracts specifically curated to pretrain domain-adapted BERT models. It has enabled a family of Legal-BERT models that significantly outperform general-domain language models on legal NLP tasks.

legal-nlp · pretraining · contracts
65.9 · B

Dataset · ai-datasets

GenLaw: A Legal Reasoning Dataset

by Stanford Center for Legal Informatics

GenLaw is a comprehensive dataset designed for evaluating legal reasoning capabilities of large language models. It contains a diverse set of legal questions, case summaries, and relevant statutes, enabling researchers to assess a model's ability to understand and apply legal principles.

legal · reasoning · law
65.8 · B

Dataset · LLMs

SlimPajama

by Cerebras

SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.

nlp · pretraining · deduplicated
65.5 · B

Dataset · legal

EU Court Decisions

by European Court of Human Rights / CJEU

The EU Court Decisions dataset aggregates judgments from the European Court of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU), covering tens of thousands of decisions in multiple EU languages with structured metadata. It is widely used for multilingual legal NLP research, legal judgment prediction, and cross-lingual information retrieval.

european-law · court-decisions · multilingual
65.5 · B

Dataset · code

Evol-CodeAlpaca

by Microsoft Research

Evol-CodeAlpaca is a dataset of 110,000 instruction-solution pairs for code generation, created by applying the EvolInstruct method to Code Alpaca seeds. Using GPT-4, it progressively increases the complexity and diversity of programming problems, serving as the primary training data for the WizardCoder models.

code-generation · instruction-tuning · evol-instruct
65.3 · B

Dataset · Computer Vision

ShareGPT4V

by Shanghai AI Lab

ShareGPT4V is a large-scale, high-quality dataset containing 100,000 image-text pairs generated by GPT-4V. It is specifically designed for the instruction-tuning of open-source large vision-language models (LVLMs). The dataset's detailed captions and conversational QA pairs significantly enhance a model's ability to perform complex scene understanding, OCR, and visual reasoning.

dataset · multimodal · instruction-tuning
65.1 · B
Dataset · financial

FinQA Dataset

by Zhiyu Chen et al. / University of California Santa Barbara

FinQA is a large-scale dataset for numerical reasoning over financial data, containing over 8,000 question-answer pairs from S&P 500 earnings reports. Each question requires multi-step reasoning across both unstructured text and structured tables, making it a challenging benchmark for financial AI systems.
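FinQA expresses each answer as a small program of arithmetic operations over numbers drawn from the report's text and tables, where a `#i` argument refers to the result of step `i`. A minimal interpreter for that style of program might look like this (the op names follow FinQA's DSL; the revenue figures below are made up for illustration):

```python
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

def run_program(steps):
    """Execute a FinQA-style op list; '#i' refers to step i's result."""
    results = []
    for op, x, y in steps:
        resolve = lambda v: results[int(v[1:])] if isinstance(v, str) and v.startswith("#") else v
        results.append(OPS[op](resolve(x), resolve(y)))
    return results[-1]

# Year-over-year revenue growth: (120.0 - 100.0) / 100.0 = 0.2
growth = run_program([("subtract", 120.0, 100.0), ("divide", "#0", 100.0)])
```

Grading the predicted program rather than only the final number is what makes the benchmark a test of multi-step reasoning instead of answer matching.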

financial-qa · numerical-reasoning · table-qa
65.1 · B
Dataset · synthetic

Cosmopedia

by Hugging Face

Cosmopedia is a massive synthetic dataset containing 30 million documents styled as textbooks, blog posts, and articles. Generated by Mixtral-8x7B-Instruct, it provides a vast corpus of high-quality synthetic educational content designed for pretraining large language models at scale.

synthetic-data · text-corpus · llm-pretraining
65.1 · B
Dataset · Computer Vision

CC12M (Conceptual 12M)

by Google

CC12M is a large-scale dataset by Google containing 12 million image-text pairs from the web. It was created with a less restrictive filtering process than its predecessor, CC3M, to achieve greater scale and diversity. This makes it a foundational resource for pretraining vision-language models in the style of CLIP and ALIGN.

multimodal · image-text · web-crawl
65.1 · B
Dataset · multilingual

XL-Sum Dataset

by BUET (Bangladesh University of Engineering and Technology)

XL-Sum is a massive multilingual dataset for abstractive summarization. It consists of over 1 million article-summary pairs scraped from BBC News, covering 44 different languages. This diversity makes it a crucial resource for developing and evaluating cross-lingual and multilingual summarization models.

summarization · multilingual · news
64.9 · B
Dataset · legal

CaseText Corpus

by Casetext (acquired by Thomson Reuters)

The CaseText Corpus is a large-scale dataset of US federal and state court decisions. It includes full text, structured metadata, and citation networks, designed for legal research and the development of AI applications like legal language models and case retrieval systems, spanning decades of US jurisprudence.

case-law · legal-research · case-retrieval
64.7 · B
Dataset · robotics

RLBench

by Dyson Robotics Lab / Imperial College London

RLBench is a large-scale robot learning benchmark and dataset built on the CoppeliaSim simulator, providing 100 unique manipulation tasks with demonstrations, observations, and reward functions. It offers RGB, depth, and point-cloud observations for a Franka Panda arm across diverse household tasks, widely used for evaluating imitation learning, reinforcement learning, and multi-task robot policies.

robotics · manipulation · benchmark
64.2 · B
Dataset · synthetic

OpenMathInstruct

by NVIDIA

OpenMathInstruct is a large-scale, synthetic dataset by NVIDIA featuring 1.8M+ math problem-solution pairs. Generated by Mixtral models and verified for correctness, it provides reliable, step-by-step reasoning chains for training and fine-tuning language models on diverse mathematical topics, from arithmetic to competition math.

synthetic-data · mathematics · instruction-tuning
64.2 · B
Dataset · AI for Code

GitHub Code Dataset

by Hugging Face / BigCode

The GitHub Code Dataset is a massive, multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via Hugging Face under the BigCode project, it provides a foundational resource for pretraining large language models on diverse code-related tasks, from generation to analysis.

code · multilingual-code · github
63.6 · B
Dataset · LLMs

CC-News

by CommonCrawl Foundation

CC-News is a large-scale dataset of over 700,000 English news articles from the CommonCrawl archive, collected between 2016 and 2019. It serves as a key pretraining corpus, notably for the RoBERTa model, providing a rich source of journalistic text for developing models that understand news language and current events.

nlp · news · web-crawl
63.3 · B
Dataset · Speech & Audio AI

MusicNet

by University of Washington

MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.

music · instrument-recognition · note-annotations
63.2 · B
Dataset · multilingual

CulturaX

by University of Oregon

CulturaX is a massive, cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was created by combining, deduplicating, and filtering the mC4 and OSCAR datasets using language model-based quality scoring. This makes it one of the largest and cleanest public datasets for pre-training large language models.

multilingual-corpus · pre-training-dataset · llm-training
63.2 · B
Dataset · alignment

Tulu V2 Mix

by Allen Institute for AI (AI2)

Tulu V2 Mix is a curated 326,000-sample mixture of instruction-tuning datasets from AI2. It blends diverse sources like FLAN, Open Assistant, and Code Alpaca to train the Tulu 2 model family. The dataset serves as a benchmark for analyzing the impact of different data sources on model performance and quality.

instruction-tuning · sft · data-mixture
63.1 · B
Dataset · medical

MedNLI

by University of Massachusetts / Partners Healthcare

MedNLI is a benchmark dataset for Natural Language Inference (NLI) in the clinical domain. Derived from the MIMIC-III database, it contains over 14,000 sentence pairs from clinical notes, each annotated by a clinician as representing entailment, contradiction, or a neutral relationship, enabling the evaluation of clinical text reasoning.

natural-language-inference · clinical-nlp · entailment
62.8 · B
Dataset · Computer Vision

WebVid-10M

by University of Oxford

WebVid-10M is a massive dataset containing over 10 million video clips paired with descriptive text captions. Scraped from stock video websites, it serves as a foundational pretraining corpus for state-of-the-art video-language models, facilitating research in video understanding, retrieval, and generation.

multimodal · video-text · video-captioning
62.7 · B
Dataset · LLMs

PushShift Reddit Dataset

by PushShift.io

A massive, multi-billion token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. This dataset is a cornerstone for social NLP research, large-scale language model pre-training, and studying the dynamics of online communities and conversational discourse.

nlp · social-media · dialogue
62 · B
Dataset · instruction-tuning

Nectar

by UC Berkeley

Nectar is a large-scale, high-quality preference dataset from Berkeley AI Research (BAIR). It contains 183,000 prompts, each with seven ranked responses from diverse models like GPT-4, ChatGPT, and open-source LLMs. It is designed for training robust reward models for RLHF and DPO.
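Because each Nectar prompt carries a full 7-way ranking rather than a single chosen/rejected pair, a common preprocessing step is to expand every ranking into all pairwise preferences for reward-model training. A sketch of that expansion (field names are illustrative, not Nectar's exact schema):

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (prompt, chosen, rejected)
    triples: every higher-ranked response beats every lower-ranked one."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]
```

A 7-response ranking yields C(7, 2) = 21 training pairs per prompt, which is why ranked datasets like Nectar are far denser in preference signal than their prompt count suggests.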

rlhf · preference-data · reward-model
61.6 · B
Dataset · robotics

RoboNet

by Berkeley AI Research (BAIR)

RoboNet is a large-scale dataset for robot learning, featuring 15 million video frames from diverse robot arms across multiple labs. It is designed to train and benchmark self-supervised visual models, aiming to achieve generalization across different robot morphologies and workspaces without task-specific labels.

robotics · video · manipulation
60.3 · B
Dataset · alignment

Orca DPO Pairs

by Intel Labs / Community

Orca DPO Pairs is a synthetic dataset containing 12,000 instruction-following examples. Each example includes a prompt, a high-quality response from GPT-4 (chosen), and a lower-quality response from GPT-3.5 (rejected). It is designed for efficiently aligning language models using Direct Preference Optimization (DPO) without a reward model.
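The chosen/rejected structure maps directly onto the DPO objective, which needs only four sequence log-probabilities per example: chosen and rejected under the policy being trained and under a frozen reference model. A scalar sketch of the per-example loss (the log-probability values in the test are placeholders, not real model outputs):

```python
import math

def dpo_loss(lp_chosen_pol, lp_rejected_pol,
             lp_chosen_ref, lp_rejected_ref, beta=0.1):
    """Per-example DPO loss: negative log-sigmoid of the beta-scaled
    margin between the policy's and the reference's preference for
    the chosen response."""
    margin = (lp_chosen_pol - lp_chosen_ref) - (lp_rejected_pol - lp_rejected_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one faster than the reference model does, which is what lets DPO skip training an explicit reward model.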

dpo · preference · alignment
60.2 · B
Dataset · robotics

CALVIN

by Albert-Ludwigs-Universität Freiburg

CALVIN is a large-scale dataset and benchmark for long-horizon, language-conditioned robot manipulation. It features over 24 hours of teleoperated demonstration data in a tabletop environment, encompassing 34 distinct skills that can be composed to solve complex, multi-step tasks from natural language instructions.

robotics · language-conditioned · manipulation
59.5 · C+
Dataset · alignment

Deita 6K

by HKUST / Community

Deita 6K is an ultra-compact, high-quality instruction-tuning dataset of 6,000 carefully selected samples produced by the Data-Efficient Instruction Tuning for Alignment (DEITA) framework, which scores and filters instruction data by complexity and quality using LLM judges. Despite its small size, models trained on Deita 6K match or outperform those trained on datasets 10-100x larger, demonstrating the power of principled data selection over scale.
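The selection recipe described above, score each sample for complexity and quality and keep the top slice, can be sketched in a few lines. The scores themselves come from LLM judges in DEITA; here they are taken as given, the combined metric is a simple product, and DEITA's additional diversity filter is omitted:

```python
def select_top_k(samples, k):
    """Rank instruction samples by complexity x quality and keep the
    top k (DEITA additionally applies a diversity filter, omitted here)."""
    scored = sorted(samples,
                    key=lambda s: s["complexity"] * s["quality"],
                    reverse=True)
    return scored[:k]
```

The point of the framework is that this kind of principled pruning, applied to a large instruction pool, can leave a 6K-sample subset that trains as well as the full pool.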

instruction-tuning · data-selection · quality-filtering
58.6 · C+
Dataset · synthetic

CAMEL-AI Datasets

by CAMEL-AI

The CAMEL-AI Datasets are a collection of synthetic multi-agent conversation datasets generated through the Communicative Agents framework, where AI assistants and user agents collaborate via role-playing to solve tasks. The collection covers coding, math, science, and open-ended reasoning domains, providing diverse instruction-following dialogues useful for SFT and alignment research.

synthetic · multi-agent · role-playing
58.2 · C+
Dataset · AI for Code

CodeParrot GitHub Code

by Hugging Face

A 50 GB dataset of Python code scraped from GitHub, originally created to train the CodeParrot model as a demonstration of code-focused language model pretraining. It filters repositories for Python files only and applies basic deduplication, making it a lightweight starting point for Python-specific code generation research and experimentation.

code · github · python
57.4 · C+
Dataset · alignment

Capybara

by Argilla / LDJnr

Capybara is a high-quality instruction-tuning dataset of 15,000 diverse, long-form single- and multi-turn conversations synthesized to cover a wide range of topics and response styles, designed to improve model coherence and verbosity on open-ended tasks. It emphasizes narrative quality and conceptual depth over simple factual responses, making it particularly effective for improving chat model fluency and reasoning.

instruction-tuning · long-form · diverse
57.4 · C+
Dataset · synthetic

Genstruct

by NousResearch

Genstruct is a synthetic instruction dataset generated by the Genstruct-7B model, which converts raw documents into structured instruction-response pairs. Unlike typical self-instruct approaches, Genstruct grounds every instruction in a source document, ensuring factual consistency and enabling controllable synthetic data generation from any text corpus.

synthetic · instruction-tuning · document-grounded
53.4 · C+
Dataset · datasets

UltraChat

by Tsinghua University

1.5M high-quality multi-turn dialogues for instruction fine-tuning.

alignment · dialogue · sft
44 · C
Dataset · datasets

The Pile

by EleutherAI

825GB diverse English pretraining corpus from 22 high-quality data sources.

pretraining · english · diverse
44 · C
Dataset · datasets

SWE-bench

by Princeton NLP

2.3K real GitHub issues that require AI agents to generate code fixes verified against the repository's tests.

benchmark · coding · agents
44 · C