Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
100 entities · dataset
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to encourage research in advanced question-answering. It consists of grade-school science questions specifically crafted to require reasoning beyond simple fact retrieval, posing a significant challenge for AI models.
ImageNet-1K
by ImageNet / Stanford Vision Lab
The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.
COCO 2017
by Microsoft
Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
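For readers who want to see how these annotations are typically consumed, here is a minimal sketch using the pycocotools library; the annotation file path is illustrative and assumes the 2017 validation annotations have been downloaded locally.

```python
# Minimal sketch: list the annotated object instances of one COCO 2017 image.
# Assumes pycocotools is installed and instances_val2017.json is downloaded.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

img_id = coco.getImgIds()[0]                      # first image in the index
ann_ids = coco.getAnnIds(imgIds=[img_id])
for ann in coco.loadAnns(ann_ids):
    category = coco.loadCats([ann["category_id"]])[0]["name"]
    print(category, ann["bbox"])                  # bbox is [x, y, width, height]
```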
Protein Data Bank
by RCSB PDB / wwPDB Consortium
The RCSB Protein Data Bank (PDB) is the single worldwide archive of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies, currently containing over 220,000 biological macromolecular structures determined by X-ray crystallography, NMR, and cryo-EM. It is the foundational structural dataset for computational biology and was used to train and validate AlphaFold2 and other structure-prediction models.
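As an illustration of how individual entries are retrieved programmatically, the sketch below downloads one structure file from the RCSB download service and parses it with Biopython; the entry ID 1CRN is just an example.

```python
# Minimal sketch: fetch a PDB entry over HTTPS and count residues per chain.
# Assumes the requests and biopython packages are installed.
import requests
from io import StringIO
from Bio.PDB import PDBParser

pdb_id = "1CRN"  # example entry (crambin)
text = requests.get(f"https://files.rcsb.org/download/{pdb_id}.pdb", timeout=30).text

structure = PDBParser(QUIET=True).get_structure(pdb_id, StringIO(text))
for model in structure:
    for chain in model:
        n_residues = sum(1 for _ in chain.get_residues())
        print(f"chain {chain.id}: {n_residues} residues")
```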
UniProt
by UniProt Consortium (EMBL-EBI / SIB / PIR)
UniProt (Universal Protein Resource) is the world's comprehensive, freely accessible protein sequence and functional information database, maintained by a consortium of EMBL-EBI, SIB, and PIR. It contains over 250 million protein sequences in UniParc, with 570,000+ manually reviewed entries in SwissProt providing expert-curated functional annotations, and serves as the gold-standard training source for protein language models.
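A minimal sketch of pulling one reviewed entry through the UniProt REST interface is shown below; the accession P69905 (human hemoglobin subunit alpha) is only an example, and the endpoint shape should be checked against the current API documentation.

```python
# Minimal sketch: retrieve a single UniProtKB entry as FASTA via the REST API.
import requests

accession = "P69905"  # example accession (human hemoglobin subunit alpha)
resp = requests.get(f"https://rest.uniprot.org/uniprotkb/{accession}.fasta", timeout=30)
resp.raise_for_status()

header, *seq_lines = resp.text.strip().splitlines()
sequence = "".join(seq_lines)
print(header)
print(f"{len(sequence)} amino acids")
```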
MMLU Dataset
by UC Berkeley
Massive Multitask Language Understanding (MMLU) is a benchmark covering 57 academic subjects from STEM to humanities, with 14,000+ multiple-choice questions at undergraduate and professional level. It has become the de facto standard for measuring broad world knowledge and academic reasoning in LLMs.
Wikipedia (Processed)
by Wikimedia Foundation / Hugging Face
The processed Wikipedia dataset is a cleaned and tokenized version of Wikipedia dumps covering 20+ languages, available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most universally used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain QA systems.
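A minimal sketch of loading one language snapshot through Hugging Face Datasets follows; the config name is one published dump date, and depending on your library version you may need the parquet-based wikimedia/wikipedia mirror instead.

```python
# Minimal sketch: load an English Wikipedia snapshot and inspect one article.
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train")  # example snapshot
print(wiki.num_rows, "articles")
print(wiki[0]["title"])
print(wiki[0]["text"][:300])
```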
Wikipedia Dump
by Wikimedia Foundation
The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.
LibriSpeech
by OpenSLR / Johns Hopkins University
LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech derived from LibriVox audiobooks, with training subsets of 100 and 360 hours of "clean" speech plus 500 hours of more challenging "other" speech, and dedicated development and test sets for each condition. It has become the de facto standard benchmark for English ASR systems.
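For reference, the sketch below loads the 100-hour clean training subset with torchaudio's built-in dataset class, which downloads the archive on first use; the root path and subset choice are illustrative.

```python
# Minimal sketch: iterate LibriSpeech train-clean-100 via torchaudio.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True
)

waveform, sample_rate, transcript, speaker_id, chapter_id, utt_id = dataset[0]
print(sample_rate, waveform.shape)
print(transcript[:80])
```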
GSM8K Dataset
by OpenAI
Grade School Math 8K is a dataset of 8,500 high-quality linguistically diverse grade school math word problems requiring 2-8 step reasoning. Created by OpenAI, GSM8K is widely used for evaluating multi-step arithmetic reasoning and the effectiveness of chain-of-thought prompting.
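GSM8K gold answers end with a "#### <number>" marker, and evaluation is usually exact match on the final number; the sketch below shows that convention, with the extraction heuristic for model output being a common assumption rather than an official scorer.

```python
# Minimal sketch of GSM8K-style scoring: compare the final extracted numbers.
import re

def extract_final_number(text):
    # Gold labels end with "#### <answer>"; for model output, fall back to the
    # last number mentioned in the completion (a common but unofficial heuristic).
    m = re.search(r"####\s*(-?[\d,\.]+)", text)
    value = m.group(1) if m else (re.findall(r"-?\d[\d,]*\.?\d*", text) or [None])[-1]
    return value.replace(",", "").rstrip(".") if value else None

gold = "Natalia sold 48/2 = 24 clips in May, so 48 + 24 = 72 in total. #### 72"
pred = "She sold 24 clips in May, giving a total of 72."
print(extract_final_number(pred) == extract_final_number(gold))  # True
```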
PubChem
by NCBI / NIH
PubChem is the world's largest open chemical database maintained by the NCBI, containing information on over 115 million compounds, 295 million substances, and 270 million bioactivity outcomes from more than 1.2 million assays. It provides standardized molecular structures, properties, and biological activity data freely accessible via REST API and bulk download, making it the canonical resource for cheminformatics and drug discovery research.
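Since the entry mentions the REST API, here is a minimal sketch against the documented PUG REST interface; the compound name and requested properties are examples only.

```python
# Minimal sketch: look up basic properties of a compound by name via PUG REST.
import requests

url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/"
       "property/MolecularFormula,MolecularWeight,CanonicalSMILES/JSON")
resp = requests.get(url, timeout=30)
resp.raise_for_status()

props = resp.json()["PropertyTable"]["Properties"][0]
print(props["MolecularFormula"], props["MolecularWeight"])
print(props["CanonicalSMILES"])
```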
GENIE Benchmark
by Stanford University
The GENIE Benchmark is a comprehensive dataset for evaluating the performance of text-to-SQL models. It includes a diverse set of SQL queries and corresponding natural language questions across multiple domains, designed to assess the generalization capabilities of these models.
HumanEval Dataset
by OpenAI
A curated set of 164 handwritten Python programming problems released by OpenAI, each consisting of a function signature, docstring, reference solution, and unit tests. HumanEval introduced the pass@k metric for functional code correctness evaluation and has become the de facto standard benchmark reported in virtually every code generation model paper.
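The pass@k metric mentioned above is usually computed with the unbiased estimator from the Codex paper: sample n completions per problem, count the c that pass the tests, and estimate pass@k = 1 - C(n-c, k)/C(n, k). A numerically stable sketch:

```python
# Numerically stable per-problem pass@k estimator, in the style of the Codex paper.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples passing the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 37 of which pass.
print(round(pass_at_k(200, 37, 1), 3), round(pass_at_k(200, 37, 10), 3))
```

Dataset-level pass@k is simply the mean of this quantity over all 164 problems.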
MIMIC-IV
by MIT Laboratory for Computational Physiology / Beth Israel Deaconess Medical Center
MIMIC-IV (Medical Information Mart for Intensive Care) is a comprehensive de-identified electronic health record database covering over 300,000 patients admitted to Beth Israel Deaconess Medical Center, including its intensive care units, between 2008 and 2019. It contains detailed clinical data including diagnoses, procedures, medications, and laboratory values, with linked waveform data available in companion modules, enabling a wide range of clinical AI research.
MATH Dataset
by UC Berkeley
A challenging benchmark of 12,500 competition mathematics problems from AMC, AIME, and similar competitions across 5 difficulty levels and 7 subjects. Each problem includes a full step-by-step solution in LaTeX, making it suitable for both evaluation and training of mathematical reasoning.
SA-1B (Segment Anything)
by Meta AI
SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest segmentation dataset ever created and enables training of generalist vision models with strong zero-shot transfer capabilities.
HellaSwag Dataset
by University of Washington
HellaSwag is an adversarially filtered commonsense NLI benchmark where models must pick the most plausible sentence completion from 4 options. Humans score 95%+ while early LLMs struggled below 50%, making it a robust test of grounded language understanding and commonsense reasoning.
Common Crawl
by Common Crawl Foundation
The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It forms the foundational raw data layer for virtually every major language model pretraining pipeline including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication steps.
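Crawl segments are distributed as WARC files; the sketch below streams records out of one downloaded segment with the warcio library. The file name is illustrative (real segment paths are listed in each crawl's warc.paths.gz index).

```python
# Minimal sketch: iterate HTTP response records in a Common Crawl WARC file.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:   # illustrative filename
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        payload = record.content_stream().read()        # raw HTTP body (HTML, etc.)
        print(url, len(payload), "bytes")
        break  # only the first page, for the sketch
```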
ARC Dataset
by Allen Institute for AI
The AI2 Reasoning Challenge (ARC) dataset contains 7,787 grade 3–9 science exam questions split into Easy and Challenge partitions. The Challenge set contains questions that require deeper reasoning and world knowledge, making it a reliable signal for advanced language understanding.
Open Images V7
by Google
Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.
TruthfulQA Dataset
by University of Oxford
TruthfulQA measures the truthfulness of LLMs across 817 adversarially crafted questions spanning 38 categories where humans are commonly misled by false beliefs. Models are scored on generating truthful AND informative answers, revealing how larger models can paradoxically become more confidently wrong.
Stack Exchange Dump
by Stack Exchange
The Stack Exchange Data Dump is a quarterly XML export of all public questions, answers, comments, and votes across the entire Stack Exchange network of 170+ Q&A communities including Stack Overflow. Containing hundreds of millions of high-quality technical and domain-specific Q&A pairs, it is a critical pretraining source for code and reasoning capabilities and a standard retrieval benchmark for dense passage retrieval.
SuperGLUE
by New York University
SuperGLUE is a benchmark suite of 8 challenging NLU tasks including question answering, coreference resolution, causal reasoning, and word-sense disambiguation, designed as a harder successor to GLUE. It includes human baselines and has driven significant progress in pre-trained language model capabilities.
LAION-5B
by LAION
The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and language-undetected subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.
ADE20K Dataset
by MIT CSAIL
ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.
WinoGrande Dataset
by Allen Institute for AI
WinoGrande is a large-scale crowdsourced dataset of 44,000 Winograd-style fill-in-the-blank commonsense problems, debiased using the AFLITE algorithm to minimize spurious statistical cues. It is significantly harder than the original Winograd Schema Challenge for contemporary NLP models.
AudioSet
by Google
Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.
CheXpert
by Stanford ML Group
CheXpert is a large chest X-ray dataset from Stanford containing 224,316 chest radiographs from 65,240 patients with labels for 14 observations mined from radiology reports using an automated labeler. It uniquely addresses label uncertainty with positive, negative, and uncertain labels, making it a challenging and realistic benchmark for automated chest X-ray interpretation.
PubMedCentral OA
by National Institutes of Health / National Library of Medicine
PubMedCentral Open Access (PMC OA) is a subset of the PMC literature archive made freely available for text mining and NLP research, containing over 4 million full-text biomedical and life science articles. It is the primary corpus used for pretraining biomedical language models such as BioBERT, PubMedBERT, and BioGPT.
VoxCeleb2
by Oxford Visual Geometry Group (VGG)
VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.
FLORES-200 Dataset
by Meta AI
FLORES-200 is Meta's machine translation evaluation benchmark spanning 200 languages, including many low-resource and endangered ones. Every language shares the same professionally translated sentences drawn from English-language Wikimedia content, with 1,012 sentences in the public devtest split plus separate dev and hidden test splits, enabling systematic many-to-many MT evaluation at scale.
Alpaca Dataset
by Stanford University
The Stanford Alpaca dataset contains 52,000 instruction-following examples generated with the self-instruct technique using OpenAI's text-davinci-003 model. This foundational dataset enabled the creation of the Alpaca 7B model and popularized cost-effective instruction-tuning approaches.
Common Voice 15
by Mozilla
Mozilla's Common Voice 15.0 is the world's largest publicly available multilingual speech corpus, containing nearly 30,000 hours of recorded speech across 114 languages, the majority of it contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.
SEC-EDGAR Filings
by U.S. Securities and Exchange Commission
The SEC-EDGAR Filings dataset encompasses over 20 million full-text regulatory filings submitted to the US Securities and Exchange Commission since 1993, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, and proxy statements from all US public companies. It is the foundational corpus for financial NLP research, sentiment analysis, and financial document AI.
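Filings metadata can be pulled from the public EDGAR submissions endpoint, which requires a descriptive User-Agent header; the sketch below lists recent 10-K filings for one example CIK (0000320193, Apple Inc.), with field names following the documented submissions JSON schema.

```python
# Minimal sketch: list recent 10-K filings for one company from EDGAR.
import requests

headers = {"User-Agent": "research-example contact@example.org"}  # SEC requires this
url = "https://data.sec.gov/submissions/CIK0000320193.json"       # example CIK
data = requests.get(url, headers=headers, timeout=30).json()

recent = data["filings"]["recent"]
for form, date, accession in zip(recent["form"], recent["filingDate"],
                                 recent["accessionNumber"]):
    if form == "10-K":
        print(date, accession)  # the accession number locates the full-text filing
```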
MBPP (Mostly Basic Python Problems)
by Google
A dataset of 974 crowd-sourced Python programming problems suitable for entry-level programmers, each with a problem description, code solution, and three automated test cases. MBPP complements HumanEval by covering a broader variety of programming concepts and is widely used alongside it for comprehensive evaluation of code generation capabilities across model families.
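Because each MBPP problem ships its tests as plain assert statements, evaluation typically reduces to executing the candidate code and then the asserts; the sketch below shows that convention (real harnesses run this inside a sandbox).

```python
# Minimal sketch: a candidate MBPP solution passes if all provided asserts hold.
def passes_tests(candidate_code, test_cases):
    env = {}
    try:
        exec(candidate_code, env)      # define the candidate function(s)
        for test in test_cases:
            exec(test, env)            # each test case is an assert statement
        return True
    except Exception:
        return False

code = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0", "assert add(0, 0) == 0"]
print(passes_tests(code, tests))  # True
```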
The Stack v2
by BigCode
An expanded code pretraining dataset containing 3 trillion tokens of source code in 619 programming languages, curated by BigCode from GitHub repositories with permissive SPDX licenses. Version 2 triples the size of the original Stack and includes improved deduplication, opt-out mechanisms for authors, and structured data from GitHub issues and pull requests alongside raw source files.
CelebA-HQ
by NVIDIA / CUHK
CelebA-HQ is a high-quality version of the CelebA face dataset containing 30,000 celebrity images at 1024×1024 resolution with 40 binary attribute annotations. It was introduced alongside Progressive GAN and has become the standard benchmark for high-fidelity face generation and synthesis research.
ArXiv Papers Dataset
by Cornell University / arXiv
The ArXiv Papers Dataset is a bulk export of over 2.3 million scientific preprints from arXiv spanning physics, mathematics, computer science, biology, finance, and economics, provided by Cornell University and hosted on Kaggle and AWS S3. The full-text LaTeX source and parsed metadata make it a primary pretraining corpus for scientific language models and citation-network research.
OpenAssistant Conversations
by LAION
A large-scale, human-annotated dataset of assistant-style conversations collected through the OpenAssistant crowdsourcing platform. Contains over 161,000 messages across 66,000+ conversation trees, with ranked responses for RLHF training.
mC4
by Google
The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.
Semantic Scholar ORC
by Allen Institute for AI (AI2)
The Semantic Scholar Open Research Corpus (S2ORC) is a large English-language corpus of 136 million academic papers with structured metadata, abstracts, citation graphs, and full-text body paragraphs where licensing allows. Maintained by the Allen Institute for AI, it covers 19 scientific fields and is widely used for scientific NLP tasks including citation prediction, claim verification, and scientific QA.
BookCorpus
by University of Toronto
A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used as the primary pretraining corpus for BERT alongside Wikipedia. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.
Places365
by MIT CSAIL
Places365 is a scene-centric database with 1.8 million training images across 365 scene categories, designed to train and evaluate scene recognition models. The dataset enables models to understand the semantic meaning of places and environments, making it ideal for applications in autonomous driving, robotics, and image retrieval.
CodeSearchNet
by GitHub / Microsoft Research
A dataset and benchmark challenge for code retrieval and search containing 2 million (code, documentation) pairs in six programming languages — Python, Java, JavaScript, PHP, Ruby, and Go — curated by GitHub and Microsoft Research. It is the canonical benchmark for code-to-natural-language and natural-language-to-code retrieval tasks and is widely used to evaluate code embedding models.
APPS (Automated Programming Progress Standard)
by UC Berkeley
A benchmark of 10,000 programming problems at introductory, interview, and competitive programming difficulty levels, each with problem statements, test cases, and human-written solutions. APPS is the standard dataset for evaluating code generation models on realistic programming tasks ranging from simple loops to complex algorithmic challenges drawn from competitive programming platforms.
UltraFeedback
by Tsinghua University
A large-scale, high-quality preference dataset with 64,000 instructions each answered by 4 LLMs and rated by GPT-4 on instruction-following, truthfulness, honesty, and helpfulness. UltraFeedback is the backbone of the Zephyr and Tulu 2 DPO models.
Financial PhraseBank
by Pekka Malo et al. / Aalto University
Financial PhraseBank is a sentiment analysis dataset containing 4,845 sentences from English-language financial news annotated by 16 financial domain experts with positive, negative, or neutral sentiment labels. It is the most widely used benchmark for financial sentiment analysis and has been used to fine-tune FinBERT and numerous other financial NLP models.
Self-Instruct
by University of Washington
Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), where 175 human-written seed tasks are iteratively expanded into 52,000 instruction-input-output triplets using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
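A schematic sketch of the bootstrapping loop is given below; generate() stands in for a call to the generator LLM and the novelty filter is simplified to exact-duplicate removal (the paper uses a ROUGE-L similarity threshold), so this illustrates the idea rather than the authors' exact pipeline.

```python
# Schematic sketch of the Self-Instruct bootstrapping loop.
import random

def self_instruct(seed_tasks, generate, target_size):
    pool = list(seed_tasks)
    while len(pool) < target_size:
        demos = random.sample(pool, k=min(8, len(pool)))  # in-context demonstrations
        candidate = generate(demos)                       # placeholder LLM call
        # Keep only sufficiently novel instructions (simplified novelty check).
        if all(candidate.strip().lower() != p.strip().lower() for p in pool):
            pool.append(candidate)
    return pool
```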
StarCoderData
by BigCode
The roughly 250-billion-token (about 780 GB) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1 and spanning 86 programming languages with permissive licenses. It includes GitHub issues, Git commits, and Jupyter notebook data alongside source files, enabling models to learn from developer workflows and not just static code.
DM Mathematics
by Google DeepMind
DeepMind Mathematics (DM Mathematics) is a dataset of 2 million mathematical question-answer pairs covering algebra, arithmetic, calculus, comparisons, measurement, numbers, polynomials, and probability, procedurally generated to test mathematical reasoning capabilities of language models. The symbolic and step-structured nature of the dataset makes it a standard benchmark for evaluating compositional generalization and multi-step arithmetic reasoning.
LIMA
by Meta AI
LIMA (Less Is More for Alignment) is a carefully curated dataset of 1,000 high-quality instruction-response pairs demonstrating that alignment data quality matters more than quantity. Sourced from Stack Exchange, wikiHow, and manually written prompts, it produced LIMA-tuned models whose responses were judged equivalent to or better than GPT-4's in a notable fraction of human evaluations.
OpenHermes 2.5
by Nous Research
A large curated synthetic instruction dataset with ~1 million entries sourced from multiple high-quality open datasets including Airoboros, Camel, GPT4-LLM, and others. OpenHermes 2.5 powers the Nous Hermes model family and is widely regarded as one of the best open instruction datasets.
OASST2
by LAION / OpenAssistant
OpenAssistant Conversations 2 (OASST2) is a crowd-sourced, human-annotated dataset of 100,000+ assistant-style messages in 35 languages, organized into conversation trees that human contributors wrote and ranked to produce preference labels for RLHF training. It is among the largest open multilingual human-feedback datasets and is widely used for training preference models and reward functions in open-source alignment pipelines.
NLLB Training Data
by Meta AI
The No Language Left Behind (NLLB) training corpus released by Meta AI contains high-quality parallel data across 200+ language pairs, including newly mined bitext for dozens of low-resource languages. It was used to train the NLLB-200 model achieving state-of-the-art translation on low-resource language pairs.
ShareGPT
by Community
A community-collected dataset of real ChatGPT and GPT-4 conversation logs shared by users, covering a broad range of tasks and domains. Available in multiple filtered and cleaned versions including ShareGPT52K and ShareGPT90K used by Vicuna and other open models.
LSUN
by Princeton / Columbia University
The Large-Scale Scene Understanding (LSUN) dataset is a massive collection of nearly one million labeled images for each of 10 scene and 20 object categories. It is a key benchmark for advancing research in scene understanding, particularly for generative modeling, classification, and reconstruction tasks.
Dolly-15K
by Databricks
Dolly-15K is a high-quality, open-source dataset of 15,000 instruction-following records generated by humans. Created by Databricks employees, it's designed for fine-tuning large language models to exhibit instruction-following capabilities, such as those seen in ChatGPT, using a relatively small, targeted dataset.
OPUS-100
by University of Helsinki
OPUS-100 is a large-scale multilingual parallel corpus for machine translation, featuring 100 languages pivoted through English. Sampled from the OPUS collection, it provides up to 1 million sentence pairs per language pair, making it a standard benchmark for training and evaluating multilingual models.
Phi-1 TextBooks
by Microsoft
Phi-1 TextBooks is a synthetic dataset of Python coding textbooks and exercises generated by GPT-3.5 and GPT-4. It was created to pretrain Microsoft's Phi-1 small language model, demonstrating that high-quality, curriculum-style data can significantly boost the coding abilities of smaller models compared to training on general web data.
GigaSpeech
by Seasalt.ai / SpeechColab
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.
WizardLM Evol-Instruct
by Microsoft Research
WizardLM Evol-Instruct is a synthetic dataset created by Microsoft Research for fine-tuning large language models. It uses an LLM-based evolutionary process to iteratively rewrite and complicate a seed set of instructions, progressively increasing their complexity and diversity. The dataset is designed to enhance a model's ability to follow intricate, multi-step commands across various domains like coding, math, and reasoning.
TyDi QA Dataset
by Google Research
TyDi QA is a benchmark for question answering across 11 typologically diverse languages. It features information-seeking questions written by native speakers who have not seen the answer, ensuring real-world applicability. This design challenges models to generalize beyond high-resource, typologically similar languages.
DataComp-1B
by DataComp Consortium
A curated 1.4 billion image-text pair dataset produced through the DataComp benchmark competition, which challenged participants to filter a 12.8 billion pair candidate pool to produce the best downstream CLIP model. DataComp-1B represents the winning filtering strategy and achieves state-of-the-art zero-shot classification performance among datasets of its size.
OpenWebText
by Aaron Gokaslan & Vanya Cohen (Brown University)
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.
LAION-400M Text Captions
by LAION
The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.
BioASQ Dataset
by BioASQ Consortium
The BioASQ dataset is a benchmark for biomedical semantic indexing and question answering. It contains thousands of expert-annotated questions (factoid, list, yes/no, summary) paired with relevant PubMed articles, concepts, and ideal answers, designed to train and evaluate advanced NLP systems in the medical domain.
Open X-Embodiment
by Google DeepMind / Consortium
Open X-Embodiment (OXE) is a massive robotics dataset combining over 1 million demonstration episodes from 22 distinct robot embodiments. It covers 527 skills and is designed to train generalist robot policies that can transfer skills across diverse hardware, serving as a key resource for vision-language-action models.
Legal-BERT Training Data
by Ilias Chalkidis et al. / AUEB NLP Group
The Legal-BERT training corpus is a large collection of English legal text assembled from UK legislation, EU legislation, ECHR/ECLI court decisions, and US contracts specifically curated to pretrain domain-adapted BERT models. It has enabled a family of Legal-BERT models that significantly outperform general-domain language models on legal NLP tasks.
GenLaw: A Legal Reasoning Dataset
by Stanford Center for Legal Informatics
GenLaw is a comprehensive dataset designed for evaluating legal reasoning capabilities of large language models. It contains a diverse set of legal questions, case summaries, and relevant statutes, enabling researchers to assess a model's ability to understand and apply legal principles.
SlimPajama
by Cerebras
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.
EU Court Decisions
by European Court of Human Rights / CJEU
The EU Court Decisions dataset aggregates judgments from the European Court of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU), covering tens of thousands of decisions in multiple EU languages with structured metadata. It is widely used for multilingual legal NLP research, legal judgment prediction, and cross-lingual information retrieval.
Evol-CodeAlpaca
by Microsoft Research
Evol-CodeAlpaca is a dataset of 110,000 instruction-solution pairs for code generation, created by applying the EvolInstruct method to Code Alpaca seeds. Using GPT-4, it progressively increases the complexity and diversity of programming problems, serving as the primary training data for the WizardCoder models.
ShareGPT4V
by Shanghai AI Lab
ShareGPT4V is a large-scale, high-quality dataset containing 100,000 image-text pairs generated by GPT-4V. It is specifically designed for the instruction-tuning of open-source large vision-language models (LVLMs). The dataset's detailed captions and conversational QA pairs significantly enhance a model's ability to perform complex scene understanding, OCR, and visual reasoning.
FinQA Dataset
by Zhiyu Chen et al. / University of California Santa Barbara
FinQA is a large-scale dataset for numerical reasoning over financial data, containing over 8,000 question-answer pairs from S&P 500 earnings reports. Each question requires multi-step reasoning across both unstructured text and structured tables, making it a challenging benchmark for financial AI systems.
Cosmopedia
by Hugging Face
Cosmopedia is a massive synthetic dataset containing over 30 million documents styled as textbooks, blog posts, and articles. Generated by Mixtral-8x7B-Instruct, it provides a vast corpus of high-quality synthetic educational text designed for pretraining large language models at scale.
CC12M (Conceptual 12M)
by Google
CC12M is a large-scale dataset by Google containing 12 million image-text pairs from the web. It was created with a less restrictive filtering process than its predecessor, CC3M, to achieve greater scale and diversity, making it a widely used resource for pretraining large vision-language models in the style of CLIP and ALIGN.
XL-Sum Dataset
by BUET (Bangladesh University of Engineering and Technology)
XL-Sum is a massive multilingual dataset for abstractive summarization. It consists of over 1 million article-summary pairs scraped from BBC News, covering 44 different languages. This diversity makes it a crucial resource for developing and evaluating cross-lingual and multilingual summarization models.
CaseText Corpus
by Casetext (acquired by Thomson Reuters)
The CaseText Corpus is a large-scale dataset of US federal and state court decisions. It includes full text, structured metadata, and citation networks, designed for legal research and the development of AI applications like legal language models and case retrieval systems, spanning decades of US jurisprudence.
RLBench
by Dyson Robotics Lab / Imperial College London
RLBench is a large-scale robot learning benchmark and dataset built on the CoppeliaSim simulator, providing 100 unique manipulation tasks with demonstrations, observations, and reward functions. It offers RGB, depth, and point-cloud observations for a Franka Panda arm across diverse household tasks, widely used for evaluating imitation learning, reinforcement learning, and multi-task robot policies.
OpenMathInstruct
by NVIDIA
OpenMathInstruct is a large-scale, synthetic dataset by NVIDIA featuring 1.8M+ math problem-solution pairs. Generated by Mixtral models and verified for correctness, it provides reliable, step-by-step reasoning chains for training and fine-tuning language models on diverse mathematical topics, from arithmetic to competition math.
GitHub Code Dataset
by Hugging Face / BigCode
The GitHub Code Dataset is a massive, multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via Hugging Face under the BigCode project, it provides a foundational resource for pretraining large language models on diverse code-related tasks, from generation to analysis.
CC-News
by Common Crawl Foundation
CC-News is a large-scale dataset of over 700,000 English news articles from the CommonCrawl archive, collected between 2016 and 2019. It serves as a key pretraining corpus, notably for the RoBERTa model, providing a rich source of journalistic text for developing models that understand news language and current events.
MusicNet
by University of Washington
MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.
CulturaX
by University of Oregon
CulturaX is a massive, cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was created by combining, deduplicating, and filtering the mC4 and OSCAR datasets using language model-based quality scoring. This makes it one of the largest and cleanest public datasets for pre-training large language models.
Tulu V2 Mix
by Allen Institute for AI (AI2)
Tulu V2 Mix is a curated 326,000-sample mixture of instruction-tuning datasets from AI2. It blends diverse sources like FLAN, Open Assistant, and Code Alpaca to train the Tulu 2 model family. The dataset serves as a benchmark for analyzing the impact of different data sources on model performance and quality.
MedNLI
by University of Massachusetts / Partners Healthcare
MedNLI is a benchmark dataset for Natural Language Inference (NLI) in the clinical domain. Derived from the MIMIC-III database, it contains over 14,000 sentence pairs from clinical notes, each annotated by a clinician as representing entailment, contradiction, or a neutral relationship, enabling the evaluation of clinical text reasoning.
WebVid-10M
by University of Oxford
WebVid-10M is a massive dataset containing over 10 million video clips paired with descriptive text captions. Scraped from stock video websites, it serves as a foundational pretraining corpus for state-of-the-art video-language models, facilitating research in video understanding, retrieval, and generation.
PushShift Reddit Dataset
by PushShift.io
A massive, multi-billion token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. This dataset is a cornerstone for social NLP research, large-scale language model pre-training, and studying the dynamics of online communities and conversational discourse.
Nectar
by UC Berkeley
Nectar is a large-scale, high-quality preference dataset from Berkeley AI Research (BAIR). It contains 183,000 prompts, each with seven ranked responses from diverse models like GPT-4, ChatGPT, and open-source LLMs. It is designed for training robust reward models for RLHF and DPO.
RoboNet
by Berkeley AI Research (BAIR)
RoboNet is a large-scale dataset for robot learning, featuring 15 million video frames from diverse robot arms across multiple labs. It is designed to train and benchmark self-supervised visual models, aiming to achieve generalization across different robot morphologies and workspaces without task-specific labels.
Orca DPO Pairs
by Intel Labs / Community
Orca DPO Pairs is a synthetic dataset containing 12,000 instruction-following examples. Each example includes a prompt, a high-quality response from GPT-4 (chosen), and a lower-quality response from GPT-3.5 (rejected). It is designed for efficiently aligning language models using Direct Preference Optimization (DPO) without a reward model.
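The chosen/rejected pairs feed the Direct Preference Optimization objective, which for one pair is -log sigmoid(beta * [(log p_chosen - ref log p_chosen) - (log p_rejected - ref log p_rejected)]); the sketch below computes that loss on dummy log-probabilities purely for illustration.

```python
# Minimal sketch of the per-pair DPO loss on dummy sequence log-probabilities.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy prefers the chosen response more strongly than the reference does,
# so the loss is small; these numbers are illustrative only.
print(round(dpo_loss(-42.0, -55.0, -45.0, -50.0), 4))
```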
CALVIN
by Albert-Ludwigs-Universität Freiburg
CALVIN is a large-scale dataset and benchmark for long-horizon, language-conditioned robot manipulation. It features over 24 hours of teleoperated demonstration data in a tabletop environment, encompassing 34 distinct skills that can be composed to solve complex, multi-step tasks from natural language instructions.
Deita 6K
by HKUST / Community
Deita 6K is an ultra-compact, high-quality instruction-tuning dataset of 6,000 carefully selected samples produced by the Data-Efficient Instruction Tuning for Alignment (DEITA) framework, which scores and filters instruction data by complexity and quality using LLM judges. Despite its small size, models trained on Deita 6K match or outperform those trained on datasets 10-100x larger, demonstrating the power of principled data selection over scale.
CAMEL-AI Datasets
by CAMEL-AI
The CAMEL-AI Datasets are a collection of synthetic multi-agent conversation datasets generated through the Communicative Agents framework, where AI assistants and user agents collaborate via role-playing to solve tasks. The collection covers coding, math, science, and open-ended reasoning domains, providing diverse instruction-following dialogues useful for SFT and alignment research.
CodeParrot GitHub Code
by Hugging Face
A 50 GB dataset of Python code scraped from GitHub, originally created to train the CodeParrot model as a demonstration of code-focused language model pretraining. It filters repositories for Python files only and applies basic deduplication, making it a lightweight starting point for Python-specific code generation research and experimentation.
Capybara
by Argilla / LDJnr
Capybara is a high-quality instruction-tuning dataset of 15,000 diverse, long-form single- and multi-turn conversations synthesized to cover a wide range of topics and response styles, designed to improve model coherence and verbosity on open-ended tasks. It emphasizes narrative quality and conceptual depth over simple factual responses, making it particularly effective for improving chat model fluency and reasoning.
Genstruct
by NousResearch
Genstruct is a synthetic instruction dataset generated by the Genstruct-7B model, which converts raw documents into structured instruction-response pairs. Unlike typical self-instruct approaches, Genstruct grounds every instruction in a source document, ensuring factual consistency and enabling controllable synthetic data generation from any text corpus.
UltraChat
by Tsinghua University
UltraChat is a large-scale synthetic dataset of roughly 1.5 million high-quality multi-turn dialogues generated with ChatGPT, widely used (often in filtered form) for instruction fine-tuning of open chat models.
The Pile
by EleutherAI
The Pile is an 825 GB English pretraining corpus assembled by EleutherAI from 22 diverse, high-quality data sources spanning academic papers, books, code, web text, and dialogue. It served as the pretraining corpus for GPT-Neo, GPT-J, and many other early open language models.
SWE-bench
by Princeton NLP
SWE-bench is a benchmark of roughly 2,300 real GitHub issues drawn from popular Python repositories, requiring AI agents to generate code patches that are then verified against the repositories' own test suites. It has become a standard benchmark for evaluating autonomous software-engineering agents.