Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
91 entities · benchmark
MLPerf Training
by MLCommons
MLPerf Training is a suite of benchmarks that measure the time it takes to train various machine learning models on different hardware and software platforms. It provides a standardized way to compare the performance of different AI training systems, driving innovation in hardware and software optimization for AI workloads.
HELM: Holistic Evaluation of Language Models
by Stanford Center for Research on Foundation Models (CRFM)
HELM is a living benchmark designed to provide a comprehensive and holistic evaluation of language models across a wide range of scenarios and metrics. It aims to move beyond single-number evaluations by assessing models on factors like truthfulness, calibration, fairness, robustness, and efficiency, providing a more nuanced understanding of their capabilities and limitations.
ImageNet
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.2 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measures of progress in image classification for over a decade.
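Top-k accuracy is simple to compute from raw logits; a minimal NumPy sketch, with random toy data standing in for real predictions:

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]          # top-k class indices per row
    return float((topk == labels[:, None]).any(axis=1).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1000))                    # 4 samples, 1,000 classes
labels = rng.integers(0, 1000, size=4)
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```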
RoboSuite
by Stanford AI Lab
RoboSuite is a simulation framework and benchmark suite for robot learning. It provides a standardized set of environments and tasks for training and evaluating reinforcement learning algorithms in robotics, focusing on manipulation tasks with realistic physics and sensor models.
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to evaluate advanced reasoning capabilities in AI systems. It consists of elementary-level science questions specifically crafted to be difficult for retrieval-based methods and require deeper understanding and reasoning to answer correctly.
COCO Detection
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
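The IoU thresholds at the heart of mAP reduce to an area ratio; a minimal sketch (representing boxes as (x1, y1, x2, y2) tuples is an assumption of this illustration, not COCO's storage format):

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive at threshold t when IoU >= t;
# COCO mAP averages AP over t in {0.50, 0.55, ..., 0.95}.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```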
LibriSpeech
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on the "clean" and more challenging "other" test splits drives competitive progress in ASR research.
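WER is Levenshtein distance over words, normalized by reference length; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 ≈ 0.167
```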
ADE20K Segmentation
by Zhou et al. / MIT CSAIL
ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.
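mIoU averages per-class intersection-over-union over the label set; a minimal NumPy sketch on toy flat label maps:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union across classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2])
gt   = np.array([0, 1, 1, 1, 2])
print(mean_iou(pred, gt, num_classes=3))  # (1/2 + 2/3 + 1/1) / 3 ≈ 0.722
```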
GSM8K
by OpenAI
Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.
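GSM8K reference solutions end with a "#### <number>" line, so grading is exact match on extracted final answers; a minimal grader sketch (the last-number heuristic for model outputs is a common harness convention, not part of the dataset itself):

```python
import re

NUM = r"-?\d+(?:[.,]\d+)*"

def gold_answer(solution: str) -> str:
    """GSM8K reference solutions put the final answer after '####'."""
    return re.search(rf"####\s*({NUM})", solution).group(1).replace(",", "")

def last_number(text: str) -> str | None:
    """Common harness heuristic: grade the last number in the model's output."""
    nums = re.findall(NUM, text)
    return nums[-1].replace(",", "") if nums else None

gold = "Natalia sold 48 + 24 = 72 clips altogether.\n#### 72"
pred = "Adding April and May sales gives 48 + 24 = 72."
print(last_number(pred) == gold_answer(gold))  # True
```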
SWE-bench Verified
by OpenAI / Princeton NLP
Human-validated subset of SWE-bench containing 500 problems verified by software engineers for correctness, clarity, and solvability. Provides a more reliable signal than the full SWE-bench by filtering out ambiguous or under-specified issues.
MATH
by UC Berkeley
Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.
ARC-AGI
by Chollet / ARC Prize Foundation
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer transformation rules from a few demonstration examples and apply them to a test grid — a task trivially solved by humans but historically extremely difficult for AI systems.
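ARC tasks are distributed as JSON grids of integers 0-9 (colors), with "train" demonstration pairs and held-out "test" inputs; a toy example of the format with a trivial mirror-left-right rule (the specific task is illustrative, not from the corpus):

```python
task = {
    "train": [  # demonstration pairs: the solver must induce the rule
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [{"input": [[5, 0], [6, 0]]}],  # expected output: [[0, 5], [0, 6]]
}

def mirror(grid):
    """The rule hidden in the demonstrations above: mirror each row left-right."""
    return [row[::-1] for row in grid]

assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # [[0, 5], [0, 6]]
```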
HellaSwag
by Allen AI
Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.
Common Voice
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
MLPerf Inference
by MLCommons
MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.
ARC Challenge
by Allen AI
AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.
MedQA
by Jin et al. / UC San Diego
MedQA tests medical knowledge using multiple-choice questions drawn from the US Medical Licensing Examination (USMLE). It evaluates whether language models can reason through complex clinical scenarios requiring deep biomedical knowledge.
MT-Bench
by LMSYS
Multi-turn conversation benchmark with 80 high-quality questions across 8 categories including writing, reasoning, math, coding, and extraction. Uses GPT-4 as an automated judge to evaluate response quality on a 1-10 scale across two conversation turns.
FLORES-200
by NLLB Team / Meta AI
FLORES-200 is a many-to-many multilingual translation benchmark covering 200 languages, including many low-resource ones. It evaluates machine translation systems across 40,000 language direction pairs, making it the most comprehensive translation benchmark for assessing cross-lingual generalization.
GPQA
by NYU
Graduate-level Google-Proof Question Answering benchmark featuring questions written by domain experts in physics, chemistry, and biology. Questions are designed to be "Google-proof": not answerable through a quick web search, requiring genuine reasoning rather than memorization.
TruthfulQA
by University of Oxford
Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.
MBPP
by Google Research
Mostly Basic Programming Problems — a collection of 974 crowd-sourced Python programming tasks with natural language descriptions and test cases. Tests foundational programming ability including string manipulation, list processing, and basic algorithms.
Flickr30k
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
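Recall@K falls out of a query-candidate similarity matrix in a few lines; this sketch assumes one correct candidate per query (the diagonal), a simplification of Flickr30k's five captions per image:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of query i to candidate j; the correct match
    for query i is candidate i. Returns the fraction ranked in the top k."""
    # Rank of the correct candidate = number of candidates scored strictly higher.
    ranks = (sim > sim.diagonal()[:, None]).sum(axis=1)
    return float((ranks < k).mean())

rng = np.random.default_rng(1)
sim = rng.normal(size=(100, 100))
np.fill_diagonal(sim, sim.diagonal() + 2.0)  # make true pairs score higher
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```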
Needle-in-a-Haystack
by Greg Kamradt (community)
Needle-in-a-Haystack is a pressure test for long-context language models that places a single fact (the needle) at a specific position within a long document (the haystack) and asks the model to retrieve it. It systematically varies both context length and needle depth to reveal performance degradation patterns.
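The harness is easy to reproduce: sweep context length and needle depth, then score whether the answer recovers the fact. A sketch under assumed details (`ask_model` is a hypothetical client call; the filler and needle strings are illustrative):

```python
FILLER = "The sky was clear and the market was quiet that morning. " * 200
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(context_words: int, depth: float) -> str:
    """Insert NEEDLE at fractional `depth` (0.0 = start, 1.0 = end) of the filler."""
    words = (FILLER.split() * (context_words // len(FILLER.split()) + 1))[:context_words]
    pos = int(depth * len(words))
    return " ".join(words[:pos] + [NEEDLE] + words[pos:])

for n_words in (1_000, 8_000, 32_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(n_words, depth) + "\n\n" + QUESTION
        # answer = ask_model(prompt)              # hypothetical API call
        # hit = "dolores park" in answer.lower()  # score this (length, depth) cell
```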
VQA v2
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
BIG-Bench Hard
by Google DeepMind
Curated subset of 23 challenging BIG-Bench tasks where prior language models performed below average human raters. Specifically designed to test tasks that benefit significantly from chain-of-thought prompting and multi-step reasoning.
WinoGrande
by Allen AI
Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.
RealToxicityPrompts
by Gehman et al. / Allen Institute for AI
RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.
PubMedQA
by Jin et al. / Carnegie Mellon University
PubMedQA is a biomedical question-answering dataset sourced from PubMed abstracts. Models must answer yes/no/maybe questions about biomedical research findings, testing the ability to reason over scientific literature.
LegalBench
by Guha et al. / Stanford CodeX
LegalBench is a collaboratively built benchmark measuring the legal reasoning ability of large language models across 162 tasks spanning issue spotting, rule recall, rule application, and legal interpretation. It provides a comprehensive evaluation of whether models can reason like lawyers.
ScienceQA
by Lu et al. / UCLA
ScienceQA is a large-scale multimodal benchmark featuring 21,208 science questions for grades 3-12. It uniquely combines visual diagrams and textual contexts, requiring models to perform complex reasoning. Each question includes multiple-choice options, a detailed lecture, and a step-by-step explanation for the correct answer.
AlpacaEval
by Stanford
Automated evaluation framework comparing model outputs against a reference model on 805 instructions. Uses LLM judges to determine win rates, with length-controlled metrics to avoid rewarding verbosity over quality.
BioASQ
by Tsatsaronis et al. / BioASQ Challenge
BioASQ is a large-scale benchmark for biomedical semantic question answering. It challenges systems to perform document retrieval, concept mapping, and answer extraction from PubMed literature. The benchmark includes diverse question types like yes/no, factoid, list, and summary, with gold-standard answers curated by experts.
MMLU-Pro
by TIGER-Lab
MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally vetted questions, expanding answer choices from 4 to 10, and reducing sensitivity to prompt formatting for a more robust and discriminative assessment.
ToolBench
by Qin et al. / Tsinghua University
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
MMMU
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
DROP
by Allen AI
DROP (Discrete Reasoning Over Paragraphs) is a challenging benchmark designed to evaluate a model's numerical reasoning capabilities within textual contexts. It requires systems to read paragraphs and answer questions that involve discrete operations like addition, counting, sorting, or comparison. Unlike simpler QA datasets, DROP necessitates multi-step reasoning processes, pushing models beyond basic information retrieval.
ToxiGen
by Hartvigsen et al. / MIT
ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.
BigCodeBench
by Zhuo et al. / BigCode / Hugging Face
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
TyDi QA
by Clark et al. / Google Research
TyDi QA is a multilingual question-answering benchmark featuring 11 typologically diverse languages. Questions are written natively by speakers of each language, ensuring genuine linguistic challenges and avoiding translation artifacts. It is designed to evaluate reading comprehension across a wide range of language structures.
MedMCQA
by Pal et al. / IIT Kanpur
MedMCQA is a massive multiple-choice question dataset sourced from Indian medical entrance examinations like AIIMS and NEET-PG. It contains over 194,000 questions covering 2,400 healthcare topics, designed to rigorously test a model's breadth of medical knowledge and reasoning abilities across multiple subjects.
RULER
by Hsieh et al. / NVIDIA
RULER is a synthetic benchmark for evaluating large language models in long-context scenarios, scaling from 4K to 128K tokens. It assesses complex skills like multi-hop retrieval, aggregation, and coreference resolution, offering a more nuanced analysis than simple 'needle-in-a-haystack' tests.
AIME 2024
by MAA
A highly challenging benchmark for evaluating the mathematical reasoning of frontier AI models. It uses 30 problems from the 2024 American Invitational Mathematics Examination (AIME), which are designed to test creative problem-solving, multi-step deduction, and knowledge across number theory, geometry, algebra, and combinatorics.
BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU
BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.
LongBench
by Bai et al. / Tsinghua University
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single and multi-document QA, summarization, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.
MusicCaps
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
IFEval
by Google Research
Instruction-Following Evaluation benchmark testing models' ability to precisely follow verifiable formatting instructions. Includes constraints like word count limits, specific formatting requirements, keyword inclusion/exclusion, and structural rules that can be programmatically verified.
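Checks in the spirit of IFEval's verifiable constraints are plain predicates over the response text; the constraint set below is illustrative, not IFEval's actual rubric:

```python
import re

def check_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def check_keywords(response: str, required: list[str], forbidden: list[str]) -> bool:
    text = response.lower()
    return all(k in text for k in required) and not any(k in text for k in forbidden)

def check_bullet_count(response: str, n: int) -> bool:
    # Count markdown bullets at the start of a line.
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

resp = "- apples\n- bananas\n- cherries"
print(check_word_limit(resp, 50),
      check_keywords(resp, ["apples"], ["kiwi"]),
      check_bullet_count(resp, 3))  # True True True
```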
Chatbot Arena Hard
by LMSYS
Chatbot Arena Hard is a static benchmark composed of 500 challenging prompts curated from Chatbot Arena. It is designed to rigorously evaluate and differentiate the capabilities of large language models. The benchmark utilizes an automated judging system, typically employing a powerful model like GPT-4, to provide a quick, reproducible proxy for human preference.
HumanEval+
by Liu et al. / EvalPlus (UIUC)
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
CyberSecEval
by Meta AI
CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.
DocVQA
by CVC Barcelona
DocVQA is a large-scale dataset and benchmark for Visual Question Answering on document images. It challenges models to answer questions by reading and interpreting text, understanding layouts, and reasoning about information within complex documents like forms, invoices, and reports. It serves as a standard for evaluating document intelligence systems.
FinanceBench
by Islam et al. / Patronus AI
FinanceBench is a benchmark designed to evaluate the financial question-answering capabilities of Large Language Models. It uses publicly available corporate documents like 10-K filings and earnings reports to test models on information retrieval, numerical reasoning, and multi-step financial calculations, providing a standardized testbed for financial AI.
WebArena
by CMU
WebArena is a realistic and reproducible benchmark environment designed to evaluate autonomous language agents. It tests an agent's ability to perform complex, multi-step tasks across a diverse set of self-hosted websites, including e-commerce, forums, and content management systems, using real web interfaces.
XL-Sum
by Hasan et al. / BUET
XL-Sum is a large-scale benchmark dataset for multilingual abstractive summarization. It contains 1.35 million article-summary pairs from BBC News across 44 languages, designed to evaluate a model's ability to generate concise summaries across diverse linguistic families and writing systems.
GAIA Benchmark
by Meta / Hugging Face
GAIA (General AI Assistants) is a benchmark for evaluating AI models on complex, real-world tasks. It features questions with unambiguous factual answers that require sophisticated capabilities like multi-step reasoning, web browsing, and tool use. GAIA is designed to test the practical limits of general-purpose AI assistants.
CrowS-Pairs
by Nangia et al. / NYU
CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.
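Pseudo-log-likelihood (Salazar et al.) masks one token at a time and sums the log-probability of each true token; a sketch using Hugging Face transformers, with bert-base-uncased as an example scorer:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of each token's log-probability when that token alone is masked."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Bias is flagged when the stereotypical sentence of a pair outscores its
# minimally different anti-stereotypical counterpart.
```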
MGSM
by Google Research
MGSM (Multilingual Grade School Math) is a benchmark for evaluating the mathematical reasoning of large language models across multiple languages. It consists of 250 grade-school math problems from the GSM8K dataset, professionally translated into ten typologically diverse languages, including low-resource ones like Swahili and Telugu.
AgentBoard
by Ma et al. / Shanghai AI Lab
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM) based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
ContractNLI
by Koreeda & Manning / Stanford NLP
ContractNLI is a dataset for natural language inference (NLI) focused on contract understanding. It challenges models to determine if a hypothesis about a contract is entailed, contradicted, or not mentioned by the contract text. This simulates real-world legal document review, testing a model's ability to reason over complex legal language.
SimpleQA
by OpenAI
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
Humanity's Last Exam
by CAIS / Scale AI
Humanity's Last Exam is a crowdsourced benchmark designed to rigorously test the limits of advanced AI systems. It comprises extremely difficult questions contributed by domain experts across diverse fields like science, math, and philosophy, serving as a public evaluation for frontier model capabilities in complex reasoning and specialized knowledge.
WinoBias
by Zhao et al. / UCLA
WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.
InfiniteBench
by Zhang et al. / Tsinghua University
InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
AgentBench
by Tsinghua University
Comprehensive benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. Tests generalization of agent capabilities across diverse interaction paradigms.
Minerva Math
by Google Research
Minerva Math is a quantitative reasoning benchmark designed to evaluate large language models on complex STEM problems. Sourced from web pages with LaTeX and arXiv preprints, it covers subjects like math, physics, and chemistry, requiring multi-step computation, symbolic manipulation, and deep scientific understanding to solve.
CaseHOLD
by Zheng et al. / Stanford / LexGLUE
CaseHOLD is a legal NLP benchmark for evaluating a model's ability to identify the correct holding statement for a US court case. Given a citing context, the model must choose the correct holding from a list of candidates. Sourced from over 53,000 cases, it is a core component of the LexGLUE benchmark suite for legal AI.
API-Bank
by Li et al. / Alibaba DAMO Academy
API-Bank is a comprehensive benchmark for evaluating tool-augmented LLMs. It features 73 diverse APIs and assesses models on three levels: API retrieval, API calling, and complex planning. The benchmark measures both the correctness of tool selection and the accuracy of execution, providing a thorough test of an agent's capabilities.
MathVista
by UCLA
Mathematical reasoning benchmark requiring visual understanding of charts, plots, geometry diagrams, and infographics. Tests the intersection of visual perception and mathematical reasoning with 6,141 problems from 28 existing datasets and 3 newly collected ones.
Aider Polyglot
by Aider
Multi-language code editing benchmark testing models' ability to make targeted code changes across Python, JavaScript, TypeScript, Java, C++, and other languages. Evaluates real-world code modification tasks rather than generation from scratch.
MLAgentBench
by Huang et al. / Stanford
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously — reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build upon real ML research across 13 diverse ML tasks.
FrontierMath
by Epoch AI
Benchmark of original, research-level mathematics problems created by professional mathematicians. Tests capabilities at the frontier of mathematical reasoning including novel proofs, advanced computation, and multi-domain mathematical synthesis.
ClinicalCamel Benchmark
by Toma et al. / University of Toronto
ClinicalCamel Benchmark evaluates open-source language models on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on safety, accuracy, and appropriateness of clinical advice generation.
Codeforces Benchmark
by Codeforces / Community
Evaluates models on competitive programming problems from the Codeforces platform across difficulty ratings. Tests algorithmic thinking, data structure knowledge, and the ability to produce correct and efficient solutions under competitive constraints.
TAU-bench
by Sierra AI
Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.
MLE-bench
by OpenAI
Benchmark evaluating AI agents on real Kaggle machine learning competitions. Tests the full ML engineering pipeline including data exploration, feature engineering, model selection, training, and submission formatting against actual competition leaderboards.
OSWorld
by University of Hong Kong
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
RealWorldQA
by xAI
Benchmark testing multimodal models on practical real-world visual understanding tasks. Features questions about real photographs requiring spatial reasoning, object recognition, scene understanding, and practical knowledge that goes beyond simple object detection.
EnergyBench
by Lannelongue et al. / EMBL-EBI
EnergyBench quantifies the energy consumption and carbon footprint of AI inference across hardware and software configurations. It correlates task accuracy with joules consumed, enabling practitioners to make informed accuracy-efficiency trade-offs for sustainable AI deployment.
GreenAI Benchmark
by Schwartz et al. / AI2 / University of Washington
GreenAI Benchmark evaluates the efficiency of AI training and inference by reporting accuracy alongside FLOPs, parameters, and CO2 emissions. It promotes the efficiency metric paradigm where reporting results without computational cost is considered incomplete science.
SWE-bench
by Princeton NLP
SWE-bench is a benchmark for evaluating AI systems' ability to resolve real GitHub issues from popular Python repositories. Each instance requires understanding a codebase, identifying the bug, and producing a correct patch. SWE-bench Verified is the curated subset accepted as the standard for coding agent evaluation by the AI industry.
MTEB
by Hugging Face / MTEB Team
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating text embedding models across 8 task types (retrieval, clustering, classification, etc.), 58 datasets, and 112 languages. The MTEB leaderboard on Hugging Face is the primary reference for selecting embedding models and is updated continuously as new models are released.
MMLU
by UC Berkeley
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark covering 57 academic subjects from elementary to professional level, including STEM, law, medicine, and social sciences. It became the standard for measuring general knowledge breadth in LLMs and is included in virtually every model evaluation suite.
LiveBench
by LiveBench OSS
LiveBench is a contamination-resistant benchmark that continuously updates with new questions sourced from recent math competitions, research papers, and news. By using only data post-dating model training cutoffs, LiveBench mitigates test-set contamination and provides more reliable capability assessments of frontier models.
HumanEval
by OpenAI
HumanEval is OpenAI's code generation benchmark consisting of 164 hand-written Python programming problems with unit tests. It measures a model's ability to generate syntactically correct and functionally complete code from docstring descriptions. HumanEval is the foundational coding benchmark that all subsequent code benchmarks build upon.
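Results are reported as pass@k. The HumanEval paper's unbiased estimator, given n samples per problem of which c pass all tests, is 1 - C(n-c, k) / C(n, k); directly in code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
    n = samples generated per problem, c = samples that pass all unit tests."""
    if n - c < k:
        return 1.0  # too few failures left for any size-k draw to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=30, k=1))   # 0.15
print(pass_at_k(n=200, c=30, k=10))  # ≈ 0.81
```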
HELM
by Stanford CRFM
HELM (Holistic Evaluation of Language Models) from Stanford CRFM provides a multi-dimensional evaluation framework that measures LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It evaluates models on 42 scenarios and 59 metrics, providing the most comprehensive public assessment of LLM capabilities and risks.
GPQA Diamond
by NYU / Cohere
GPQA Diamond (Graduate-Level Google-Proof Q&A) is a challenging multiple-choice benchmark requiring expert-level knowledge in biology, chemistry, and physics. Questions are designed to be answerable by domain PhD students but not by web search. GPQA Diamond is the standard for measuring frontier scientific reasoning capability.
Chatbot Arena
by LMSYS
Chatbot Arena is a crowdsourced human evaluation platform from LMSYS where users anonymously compare responses from two random LLMs and vote for the better one. The resulting Elo-based leaderboard (LMSYS Leaderboard) is widely regarded as the most reliable measure of real-world LLM preference across diverse user tasks.
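The classic online Elo update shows how pairwise votes become ratings; note that current Arena leaderboards actually fit a Bradley-Terry model over the full vote set, so this sketch conveys the intuition rather than the production method:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise vote: score_a = 1 if model A wins, 0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

print(elo_update(1200, 1000, score_a=0.0))  # upset: lower-rated B beats A
```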
ARC-AGI-2
by ARC Prize Foundation
ARC-AGI-2 is the second iteration of François Chollet's Abstraction and Reasoning Corpus benchmark, designed to measure fluid intelligence and generalization in AI systems. Tasks require identifying abstract visual patterns that cannot be solved by memorization, targeting a capability gap that separates current LLMs from human-level reasoning.
AIME 2025
by MAA / Community Eval
AIME (American Invitational Mathematics Examination) 2025 is used as a frontier math reasoning benchmark for LLMs. The competition-level math problems require multi-step reasoning without lookup, making AIME scores a direct indicator of a model's mathematical problem-solving depth. Frontier models are evaluated on the 2025 problem set to avoid training data contamination.
MATH-500
by OpenAI / Hendrycks et al.
A 500-problem subset of the MATH dataset, popularized by OpenAI's "Let's Verify Step by Step" work, testing advanced problem-solving from algebra through competition mathematics.
Arena-Hard Auto
by LMSYS
Automated benchmark of challenging prompts derived from Chatbot Arena, using an LLM judge to evaluate instruction-following and open-ended generation against a baseline model's answers.