Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
91 entities · benchmark
MLPerf Training
by MLCommons
MLPerf Training is a suite of benchmarks that measure the time it takes to train various machine learning models on different hardware and software platforms. It provides a standardized way to compare the performance of different AI training systems, driving innovation in hardware and software optimization for AI workloads.
HELM: Holistic Evaluation of Language Models
by Stanford Center for Research on Foundation Models (CRFM)
HELM is a living benchmark designed to provide a comprehensive and holistic evaluation of language models across a wide range of scenarios and metrics. It aims to move beyond single-number evaluations by assessing models on factors like truthfulness, calibration, fairness, robustness, and efficiency, providing a more nuanced understanding of their capabilities and limitations.
ImageNet
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.2 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measures of progress in image classification for over a decade.
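Top-k accuracy is simple to compute from raw logits; a minimal NumPy sketch, with random toy data standing in for real predictions:

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]          # top-k class indices per row
    return float((topk == labels[:, None]).any(axis=1).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1000))                    # 4 samples, 1,000 classes
labels = rng.integers(0, 1000, size=4)
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```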
RoboSuite
by Stanford AI Lab
RoboSuite is a simulation framework and benchmark suite for robot learning. It provides a standardized set of environments and tasks for training and evaluating reinforcement learning algorithms in robotics, focusing on manipulation tasks with realistic physics and sensor models.
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to evaluate advanced reasoning capabilities in AI systems. It consists of elementary-level science questions specifically crafted to be difficult for retrieval-based methods and require deeper understanding and reasoning to answer correctly.
COCO Detection
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
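The IoU thresholds at the heart of mAP reduce to an area ratio; a minimal sketch (representing boxes as (x1, y1, x2, y2) tuples is an assumption of this illustration, not COCO's storage format):

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive at threshold t when IoU >= t;
# COCO mAP averages AP over t in {0.50, 0.55, ..., 0.95}.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```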
LibriSpeech
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on the "clean" and more challenging "other" test splits drives competitive progress in ASR research.
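WER is Levenshtein distance over words, normalized by reference length; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 ≈ 0.167
```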
ADE20K Segmentation
by Zhou et al. / MIT CSAIL
ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.
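mIoU averages per-class intersection-over-union over the label set; a minimal NumPy sketch on toy flat label maps:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union across classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2])
gt   = np.array([0, 1, 1, 1, 2])
print(mean_iou(pred, gt, num_classes=3))  # (1/2 + 2/3 + 1/1) / 3 ≈ 0.722
```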
GSM8K
by OpenAI
Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.
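GSM8K reference solutions end with a "#### <number>" line, so grading is exact match on extracted final answers; a minimal grader sketch (the last-number heuristic for model outputs is a common harness convention, not part of the dataset itself):

```python
import re

NUM = r"-?\d+(?:[.,]\d+)*"

def gold_answer(solution: str) -> str:
    """GSM8K reference solutions put the final answer after '####'."""
    return re.search(rf"####\s*({NUM})", solution).group(1).replace(",", "")

def last_number(text: str) -> str | None:
    """Common harness heuristic: grade the last number in the model's output."""
    nums = re.findall(NUM, text)
    return nums[-1].replace(",", "") if nums else None

gold = "Natalia sold 48 + 24 = 72 clips altogether.\n#### 72"
pred = "Adding April and May sales gives 48 + 24 = 72."
print(last_number(pred) == gold_answer(gold))  # True
```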
SWE-bench Verified
by OpenAI / Princeton NLP
Human-validated subset of SWE-bench containing 500 problems verified by software engineers for correctness, clarity, and solvability. Provides a more reliable signal than the full SWE-bench by filtering out ambiguous or under-specified issues.
MATH
by UC Berkeley
Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.
ARC-AGI
by Chollet / ARC Prize Foundation
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer transformation rules from a few demonstration examples and apply them to a test grid — a task trivially solved by humans but historically extremely difficult for AI systems.
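ARC tasks are distributed as JSON grids of integers 0-9 (colors), with "train" demonstration pairs and held-out "test" inputs; a toy example of the format with a trivial mirror-left-right rule (the specific task is illustrative, not from the corpus):

```python
task = {
    "train": [  # demonstration pairs: the solver must induce the rule
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [{"input": [[5, 0], [6, 0]]}],  # expected output: [[0, 5], [0, 6]]
}

def mirror(grid):
    """The rule hidden in the demonstrations above: mirror each row left-right."""
    return [row[::-1] for row in grid]

assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # [[0, 5], [0, 6]]
```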
HellaSwag
by Allen AI
Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.
Common Voice
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
MLPerf Inference
by MLCommons
MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.
ARC Challenge
by Allen AI
AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.
MedQA
by Jin et al. / UC San Diego
MedQA tests medical knowledge using multiple-choice questions drawn from the US Medical Licensing Examination (USMLE). It evaluates whether language models can reason through complex clinical scenarios requiring deep biomedical knowledge.
MT-Bench
by LMSYS
Multi-turn conversation benchmark with 80 high-quality questions across 8 categories including writing, reasoning, math, coding, and extraction. Uses GPT-4 as an automated judge to evaluate response quality on a 1-10 scale across two conversation turns.
FLORES-200
by NLLB Team / Meta AI
FLORES-200 is a many-to-many multilingual translation benchmark covering 200 languages, including many low-resource ones. It evaluates machine translation systems across 40,000 language direction pairs, making it the most comprehensive translation benchmark for assessing cross-lingual generalization.
GPQA
by NYU
Graduate-level Google-Proof Question Answering benchmark featuring questions written by domain experts in physics, chemistry, and biology. Questions are designed to be "Google-proof": not answerable through a quick web search, requiring genuine reasoning rather than memorization.
TruthfulQA
by University of Oxford
Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.
MBPP
by Google Research
Mostly Basic Programming Problems — a collection of 974 crowd-sourced Python programming tasks with natural language descriptions and test cases. Tests foundational programming ability including string manipulation, list processing, and basic algorithms.
Flickr30k
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
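Recall@K falls out of a query-candidate similarity matrix in a few lines; this sketch assumes one correct candidate per query (the diagonal), a simplification of Flickr30k's five captions per image:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of query i to candidate j; the correct match
    for query i is candidate i. Returns the fraction ranked in the top k."""
    # Rank of the correct candidate = number of candidates scored strictly higher.
    ranks = (sim > sim.diagonal()[:, None]).sum(axis=1)
    return float((ranks < k).mean())

rng = np.random.default_rng(1)
sim = rng.normal(size=(100, 100))
np.fill_diagonal(sim, sim.diagonal() + 2.0)  # make true pairs score higher
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```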
Needle-in-a-Haystack
by Greg Kamradt (community)
Needle-in-a-Haystack is a pressure test for long-context language models that places a single fact (the needle) at a specific position within a long document (the haystack) and asks the model to retrieve it. It systematically varies both context length and needle depth to reveal performance degradation patterns.
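The harness is easy to reproduce: sweep context length and needle depth, then score whether the answer recovers the fact. A sketch under assumed details (`ask_model` is a hypothetical client call; the filler and needle strings are illustrative):

```python
FILLER = "The sky was clear and the market was quiet that morning. " * 200
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(context_words: int, depth: float) -> str:
    """Insert NEEDLE at fractional `depth` (0.0 = start, 1.0 = end) of the filler."""
    words = (FILLER.split() * (context_words // len(FILLER.split()) + 1))[:context_words]
    pos = int(depth * len(words))
    return " ".join(words[:pos] + [NEEDLE] + words[pos:])

for n_words in (1_000, 8_000, 32_000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(n_words, depth) + "\n\n" + QUESTION
        # answer = ask_model(prompt)              # hypothetical API call
        # hit = "dolores park" in answer.lower()  # score this (length, depth) cell
```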
VQA v2
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
BIG-Bench Hard
by Google DeepMind
Curated subset of 23 challenging BIG-Bench tasks where prior language models performed below average human raters. Specifically designed to test tasks that benefit significantly from chain-of-thought prompting and multi-step reasoning.
WinoGrande
by Allen AI
Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.
RealToxicityPrompts
by Gehman et al. / Allen Institute for AI
RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.
PubMedQA
by Jin et al. / Carnegie Mellon University
PubMedQA is a biomedical question-answering dataset sourced from PubMed abstracts. Models must answer yes/no/maybe questions about biomedical research findings, testing the ability to reason over scientific literature.
LegalBench
by Guha et al. / Stanford CodeX
LegalBench is a collaboratively built benchmark measuring the legal reasoning ability of large language models across 162 tasks spanning issue spotting, rule recall, rule application, and legal interpretation. It provides a comprehensive evaluation of whether models can reason like lawyers.
ScienceQA
by Lu et al. / UCLA
ScienceQA is a large-scale multimodal benchmark featuring 21,208 science questions for grades 3-12. It uniquely combines visual diagrams and textual contexts, requiring models to perform complex reasoning. Each question includes multiple-choice options, a detailed lecture, and a step-by-step explanation for the correct answer.
AlpacaEval
by Stanford
Automated evaluation framework comparing model outputs against a reference model on 805 instructions. Uses LLM judges to determine win rates, with length-controlled metrics to avoid rewarding verbosity over quality.
BioASQ
by Tsatsaronis et al. / BioASQ Challenge
BioASQ is a large-scale benchmark for biomedical semantic question answering. It challenges systems to perform document retrieval, concept mapping, and answer extraction from PubMed literature. The benchmark includes diverse question types like yes/no, factoid, list, and summary, with gold-standard answers curated by experts.
MMLU-Pro
by TIGER-Lab
MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally vetted questions, expanding answer choices from 4 to 10, and reducing sensitivity to prompt formatting for a more robust and discriminative assessment.
ToolBench
by Qin et al. / Tsinghua University
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
MMMU
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
DROP
by Allen AI
DROP (Discrete Reasoning Over Paragraphs) is a challenging benchmark designed to evaluate a model's numerical reasoning capabilities within textual contexts. It requires systems to read paragraphs and answer questions that involve discrete operations like addition, counting, sorting, or comparison. Unlike simpler QA datasets, DROP necessitates multi-step reasoning processes, pushing models beyond basic information retrieval.
ToxiGen
by Hartvigsen et al. / MIT
ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.
BigCodeBench
by Zhuo et al. / BigCode / Hugging Face
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
TyDi QA
by Clark et al. / Google Research
TyDi QA is a multilingual question-answering benchmark featuring 11 typologically diverse languages. Questions are written natively by speakers of each language, ensuring genuine linguistic challenges and avoiding translation artifacts. It is designed to evaluate reading comprehension across a wide range of language structures.
MedMCQA
by Pal et al. / IIT Kanpur
MedMCQA is a massive multiple-choice question dataset sourced from Indian medical entrance examinations like AIIMS and NEET-PG. It contains over 194,000 questions covering 2,400 healthcare topics, designed to rigorously test a model's breadth of medical knowledge and reasoning abilities across multiple subjects.
RULER
by Hsieh et al. / NVIDIA
RULER is a synthetic benchmark for evaluating large language models in long-context scenarios, scaling from 4K to 128K tokens. It assesses complex skills like multi-hop retrieval, aggregation, and coreference resolution, offering a more nuanced analysis than simple 'needle-in-a-haystack' tests.
AIME 2024
by MAA
A highly challenging benchmark for evaluating the mathematical reasoning of frontier AI models. It uses 30 problems from the 2024 American Invitational Mathematics Examination (AIME), which are designed to test creative problem-solving, multi-step deduction, and knowledge across number theory, geometry, algebra, and combinatorics.
BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU
BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.
LongBench
by Bai et al. / Tsinghua University
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single and multi-document QA, summarization, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.
MusicCaps
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
IFEval
by Google Research
Instruction-Following Evaluation benchmark testing models' ability to precisely follow verifiable formatting instructions. Includes constraints like word count limits, specific formatting requirements, keyword inclusion/exclusion, and structural rules that can be programmatically verified.
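Checks in the spirit of IFEval's verifiable constraints are plain predicates over the response text; the constraint set below is illustrative, not IFEval's actual rubric:

```python
import re

def check_word_limit(response: str, max_words: int) -> bool:
    return len(response.split()) <= max_words

def check_keywords(response: str, required: list[str], forbidden: list[str]) -> bool:
    text = response.lower()
    return all(k in text for k in required) and not any(k in text for k in forbidden)

def check_bullet_count(response: str, n: int) -> bool:
    # Count markdown bullets at the start of a line.
    return len(re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)) == n

resp = "- apples\n- bananas\n- cherries"
print(check_word_limit(resp, 50),
      check_keywords(resp, ["apples"], ["kiwi"]),
      check_bullet_count(resp, 3))  # True True True
```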
Chatbot Arena Hard
by LMSYS
Chatbot Arena Hard is a static benchmark composed of 500 challenging prompts curated from Chatbot Arena. It is designed to rigorously evaluate and differentiate the capabilities of large language models. The benchmark utilizes an automated judging system, typically employing a powerful model like GPT-4, to provide a quick, reproducible proxy for human preference.
HumanEval+
by Liu et al. / EvalPlus (UIUC)
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
CyberSecEval
by Meta AI
CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.
DocVQA
by CVC Barcelona
DocVQA is a large-scale dataset and benchmark for Visual Question Answering on document images. It challenges models to answer questions by reading and interpreting text, understanding layouts, and reasoning about information within complex documents like forms, invoices, and reports. It serves as a standard for evaluating document intelligence systems.
FinanceBench
by Islam et al. / Patronus AI
FinanceBench is a benchmark designed to evaluate the financial question-answering capabilities of Large Language Models. It uses publicly available corporate documents like 10-K filings and earnings reports to test models on information retrieval, numerical reasoning, and multi-step financial calculations, providing a standardized testbed for financial AI.
WebArena
by CMU
WebArena is a realistic and reproducible benchmark environment designed to evaluate autonomous language agents. It tests an agent's ability to perform complex, multi-step tasks across a diverse set of self-hosted websites, including e-commerce, forums, and content management systems, using real web interfaces.
XL-Sum
by Hasan et al. / BUET
XL-Sum is a large-scale benchmark dataset for multilingual abstractive summarization. It contains 1.35 million article-summary pairs from BBC News across 44 languages, designed to evaluate a model's ability to generate concise summaries across diverse linguistic families and writing systems.
GAIA Benchmark
by Meta / Hugging Face
GAIA (General AI Assistants) is a benchmark for evaluating AI models on complex, real-world tasks. It features questions with unambiguous factual answers that require sophisticated capabilities like multi-step reasoning, web browsing, and tool use. GAIA is designed to test the practical limits of general-purpose AI assistants.
CrowS-Pairs
by Nangia et al. / NYU
CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.
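Pseudo-log-likelihood (Salazar et al.) masks one token at a time and sums the log-probability of each true token; a sketch using Hugging Face transformers, with bert-base-uncased as an example scorer:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of each token's log-probability when that token alone is masked."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Bias is flagged when the stereotypical sentence of a pair outscores its
# minimally different anti-stereotypical counterpart.
```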
MGSM
by Google Research
MGSM (Multilingual Grade School Math) is a benchmark for evaluating the mathematical reasoning of large language models across multiple languages. It consists of 250 grade-school math problems from the GSM8K dataset, professionally translated into ten typologically diverse languages, including low-resource ones like Swahili and Telugu.
AgentBoard
by Ma et al. / Shanghai AI Lab
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM) based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
ContractNLI
by Koreeda & Manning / Stanford NLP
ContractNLI is a dataset for natural language inference (NLI) focused on contract understanding. It challenges models to determine if a hypothesis about a contract is entailed, contradicted, or not mentioned by the contract text. This simulates real-world legal document review, testing a model's ability to reason over complex legal language.
SimpleQA
by OpenAI
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
Humanity's Last Exam
by CAIS / Scale AI
Humanity's Last Exam is a crowdsourced benchmark designed to rigorously test the limits of advanced AI systems. It comprises extremely difficult questions contributed by domain experts across diverse fields like science, math, and philosophy, serving as a public evaluation for frontier model capabilities in complex reasoning and specialized knowledge.
WinoBias
by Zhao et al. / UCLA
WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.
InfiniteBench
by Zhang et al. / Tsinghua University
InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
AgentBench
by Tsinghua University
Comprehensive benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. Tests generalization of agent capabilities across diverse interaction paradigms.
Minerva Math
by Google Research
Minerva Math is a quantitative reasoning benchmark designed to evaluate large language models on complex STEM problems. Sourced from web pages with LaTeX and arXiv preprints, it covers subjects like math, physics, and chemistry, requiring multi-step computation, symbolic manipulation, and deep scientific understanding to solve.
CaseHOLD
by Zheng et al. / Stanford / LexGLUE
CaseHOLD is a legal NLP benchmark for evaluating a model's ability to identify the correct holding statement for a US court case. Given a citing context, the model must choose the correct holding from a list of candidates. Sourced from over 53,000 cases, it is a core component of the LexGLUE benchmark suite for legal AI.
API-Bank
by Li et al. / Alibaba DAMO Academy
API-Bank is a comprehensive benchmark for evaluating tool-augmented LLMs. It features 73 diverse APIs and assesses models on three levels: API retrieval, API calling, and complex planning. The benchmark measures both the correctness of tool selection and the accuracy of execution, providing a thorough test of an agent's capabilities.
MathVista
by UCLA
Mathematical reasoning benchmark requiring visual understanding of charts, plots, geometry diagrams, and infographics. Tests the intersection of visual perception and mathematical reasoning with 6,141 problems from 28 existing datasets and 3 newly collected ones.
Aider Polyglot
by Aider
Multi-language code editing benchmark testing models' ability to make targeted code changes across Python, JavaScript, TypeScript, Java, C++, and other languages. Evaluates real-world code modification tasks rather than generation from scratch.
MLAgentBench
by Huang et al. / Stanford
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously — reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build upon real ML research across 13 diverse ML tasks.
FrontierMath
by Epoch AI
Benchmark of original, research-level mathematics problems created by professional mathematicians. Tests capabilities at the frontier of mathematical reasoning including novel proofs, advanced computation, and multi-domain mathematical synthesis.
ClinicalCamel Benchmark
by Toma et al. / University of Toronto
ClinicalCamel Benchmark evaluates open-source language models on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on safety, accuracy, and appropriateness of clinical advice generation.
Codeforces Benchmark
by Codeforces / Community
Evaluates models on competitive programming problems from the Codeforces platform across difficulty ratings. Tests algorithmic thinking, data structure knowledge, and the ability to produce correct and efficient solutions under competitive constraints.
TAU-bench
by Sierra AI
Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.
MLE-bench
by OpenAI
Benchmark evaluating AI agents on real Kaggle machine learning competitions. Tests the full ML engineering pipeline including data exploration, feature engineering, model selection, training, and submission formatting against actual competition leaderboards.
OSWorld
by University of Hong Kong
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
RealWorldQA
by xAI
Benchmark testing multimodal models on practical real-world visual understanding tasks. Features questions about real photographs requiring spatial reasoning, object recognition, scene understanding, and practical knowledge that goes beyond simple object detection.
EnergyBench
by Lannelongue et al. / EMBL-EBI
EnergyBench quantifies the energy consumption and carbon footprint of AI inference across hardware and software configurations. It correlates task accuracy with joules consumed, enabling practitioners to make informed accuracy-efficiency trade-offs for sustainable AI deployment.
GreenAI Benchmark
by Schwartz et al. / AI2 / University of Washington
GreenAI Benchmark evaluates the efficiency of AI training and inference by reporting accuracy alongside FLOPs, parameters, and CO2 emissions. It promotes the efficiency metric paradigm where reporting results without computational cost is considered incomplete science.
SWE-bench
by Princeton NLP
SWE-bench is a benchmark for evaluating AI systems' ability to resolve real GitHub issues from popular Python repositories. Each instance requires understanding a codebase, identifying the bug, and producing a correct patch. SWE-bench Verified is the curated subset accepted as the standard for coding agent evaluation by the AI industry.
MTEB
by Hugging Face / MTEB Team
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating text embedding models across 8 task types (retrieval, clustering, classification, etc.), 58 datasets, and 112 languages. The MTEB leaderboard on Hugging Face is the primary reference for selecting embedding models and is updated continuously as new models are released.
MMLU
by UC Berkeley
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark covering 57 academic subjects from elementary to professional level, including STEM, law, medicine, and social sciences. It became the standard for measuring general knowledge breadth in LLMs and is included in virtually every model evaluation suite.
LiveBench
by LiveBench OSS
LiveBench is a contamination-resistant benchmark that continuously updates with new questions sourced from recent math competitions, research papers, and news. By using only data post-dating model training cutoffs, LiveBench mitigates test-set contamination and provides more reliable capability assessments of frontier models.
HumanEval
by OpenAI
HumanEval is OpenAI's code generation benchmark consisting of 164 hand-written Python programming problems with unit tests. It measures a model's ability to generate syntactically correct and functionally complete code from docstring descriptions. HumanEval is the foundational coding benchmark that all subsequent code benchmarks build upon.
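Results are reported as pass@k. The HumanEval paper's unbiased estimator, given n samples per problem of which c pass all tests, is 1 - C(n-c, k) / C(n, k); directly in code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
    n = samples generated per problem, c = samples that pass all unit tests."""
    if n - c < k:
        return 1.0  # too few failures left for any size-k draw to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=30, k=1))   # 0.15
print(pass_at_k(n=200, c=30, k=10))  # ≈ 0.81
```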
HELM
by Stanford CRFM
HELM (Holistic Evaluation of Language Models) from Stanford CRFM provides a multi-dimensional evaluation framework that measures LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It evaluates models on 42 scenarios and 59 metrics, providing the most comprehensive public assessment of LLM capabilities and risks.
GPQA Diamond
by NYU / Cohere
GPQA Diamond (Graduate-Level Google-Proof Q&A) is a challenging multiple-choice benchmark requiring expert-level knowledge in biology, chemistry, and physics. Questions are designed to be answerable by domain PhD students but not by web search. GPQA Diamond is the standard for measuring frontier scientific reasoning capability.
Chatbot Arena
by LMSYS
Chatbot Arena is a crowdsourced human evaluation platform from LMSYS where users anonymously compare responses from two random LLMs and vote for the better one. The resulting Elo-based leaderboard (LMSYS Leaderboard) is widely regarded as the most reliable measure of real-world LLM preference across diverse user tasks.
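The classic online Elo update shows how pairwise votes become ratings; note that current Arena leaderboards actually fit a Bradley-Terry model over the full vote set, so this sketch conveys the intuition rather than the production method:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise vote: score_a = 1 if model A wins, 0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

print(elo_update(1200, 1000, score_a=0.0))  # upset: lower-rated B beats A
```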
ARC-AGI-2
by ARC Prize Foundation
ARC-AGI-2 is the second iteration of François Chollet's Abstraction and Reasoning Corpus benchmark, designed to measure fluid intelligence and generalization in AI systems. Tasks require identifying abstract visual patterns that cannot be solved by memorization, targeting a capability gap that separates current LLMs from human-level reasoning.
AIME 2025
by MAA / Community Eval
AIME (American Invitational Mathematics Examination) 2025 is used as a frontier math reasoning benchmark for LLMs. The competition-level math problems require multi-step reasoning without lookup, making AIME scores a direct indicator of a model's mathematical problem-solving depth. Frontier models are evaluated on the 2025 problem set to avoid training data contamination.
MATH-500
by OpenAI / Hendrycks et al.
A 500-problem subset of the MATH dataset, popularized by OpenAI's "Let's Verify Step by Step" work, testing advanced problem-solving from algebra through competition mathematics.
Arena-Hard Auto
by LMSYS
Automated benchmark of challenging prompts derived from Chatbot Arena, using an LLM judge to evaluate instruction-following and open-ended generation against a baseline model's answers.