BenchmarkLLMs v1.0

MMLU

by Dan Hendrycks et al. (UC Berkeley) · open-source · Last verified 2026-03-01

Massive Multitask Language Understanding (MMLU) benchmark covering 57 subjects spanning STEM, the humanities, the social sciences, and professional fields. Measures broad knowledge and reasoning ability through multiple-choice questions, with difficulty ranging from elementary to professional level.

https://github.com/hendrycks/test
Overall grade: A (Great)
Adoption: A+ · Quality: A · Freshness: B+ · Citations: A+ · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, knowledge-testing, multi-domain-assessment, reasoning-evaluation
Integrations
lm-eval-harness, helm
Use Cases
model-comparison, knowledge-assessment, training-evaluation, research
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
accuracy, 5-shot-accuracy, per-subject-accuracy
Methodology
Multiple-choice questions across 57 subjects, evaluated with 0-shot and 5-shot prompting; models select one of four answer options per question (see the sketch after these specifications).
Last Run
2026-02-15
Tags
benchmark, evaluation, knowledge, reasoning, multitask
Added
2026-03-17
Completeness
100%
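
Illustrative only: a minimal Python sketch of how the 5-shot prompting and per-subject accuracy described in the methodology above could be computed. The field names (question, options, answer, subject) and letter-based scoring are assumptions for the sake of the example, not the official MMLU or lm-eval-harness implementation.

```python
from collections import defaultdict

CHOICES = "ABCD"  # each MMLU item has exactly four answer options

def format_question(q):
    """Render one MMLU-style item as a prompt block (field names are assumed)."""
    lines = [q["question"]]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, q["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def build_5_shot_prompt(dev_examples, test_question):
    """Prepend five solved dev-set examples, then the unanswered test item."""
    shots = [
        format_question(ex) + f" {CHOICES[ex['answer']]}\n"
        for ex in dev_examples[:5]
    ]
    return "\n".join(shots) + "\n" + format_question(test_question)

def per_subject_accuracy(predictions, questions):
    """Aggregate exact-match accuracy per subject and overall."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, q in zip(predictions, questions):
        total[q["subject"]] += 1
        if pred == CHOICES[q["answer"]]:
            correct[q["subject"]] += 1
    by_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return by_subject, overall
```

In practice the reported 0-shot and 5-shot accuracies come from harnesses such as lm-eval-harness or HELM (listed under Integrations), which handle prompt templating and answer scoring at the token level.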

Index Score
80.5
Adoption
96
Quality
88
Freshness
74
Citations
98
Engagement
0
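
For reference, the index score is not a plain mean of the five subscores: an unweighted average gives 71.2, not the listed 80.5, so the directory presumably applies its own unpublished weighting. A hypothetical sketch of such a composite (the weighting scheme is an assumption):

```python
SUBSCORES = {"adoption": 96, "quality": 88, "freshness": 74,
             "citations": 98, "engagement": 0}

def weighted_index(scores, weights):
    """Generic weighted composite; the site's actual weights are unknown."""
    return sum(scores[k] * weights[k] for k in weights) / sum(weights.values())

# An equal weighting yields 71.2, below the listed 80.5, which suggests the
# zero Engagement subscore is down-weighted in the published formula.
print(round(weighted_index(SUBSCORES, {k: 1.0 for k in SUBSCORES}), 1))  # 71.2
```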
