BenchmarkLLMs v1.0

MMLU

by Dan Hendrycks et al. (UC Berkeley) · open-source · Last verified 2026-03-01

Massive Multitask Language Understanding (MMLU) benchmark covering 57 subjects spanning STEM, the humanities, the social sciences, and professional fields. Measures broad knowledge and reasoning ability through multiple-choice questions, with difficulty ranging from elementary to professional level.

https://github.com/hendrycks/test
Overall grade: A (Great)
Adoption: A+ · Quality: A · Freshness: B+ · Citations: A+ · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, knowledge-testing, multi-domain-assessment, reasoning-evaluation
Integrations
lm-eval-harness, helm
Use Cases
model-comparison, knowledge-assessment, training-evaluation, research
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
accuracy, 5-shot-accuracy, per-subject-accuracy
Methodology
Multiple-choice questions across 57 subjects, evaluated with 0-shot and 5-shot prompting; models select one of four answer options per question (see the sketch after these specifications).
Last Run
2026-02-15
Tags
benchmark, evaluation, knowledge, reasoning, multitask
Added
2026-03-17
Completeness
100%
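
Illustrative only: a minimal Python sketch of how the 5-shot prompting and per-subject accuracy described in the methodology above could be computed. The field names (question, options, answer, subject) and letter-based scoring are assumptions for the sake of the example, not the official MMLU or lm-eval-harness implementation.

```python
from collections import defaultdict

CHOICES = "ABCD"  # each MMLU item has exactly four answer options

def format_question(q):
    """Render one MMLU-style item as a prompt block (field names are assumed)."""
    lines = [q["question"]]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, q["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def build_5_shot_prompt(dev_examples, test_question):
    """Prepend five solved dev-set examples, then the unanswered test item."""
    shots = [
        format_question(ex) + f" {CHOICES[ex['answer']]}\n"
        for ex in dev_examples[:5]
    ]
    return "\n".join(shots) + "\n" + format_question(test_question)

def per_subject_accuracy(predictions, questions):
    """Aggregate exact-match accuracy per subject and overall."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, q in zip(predictions, questions):
        total[q["subject"]] += 1
        if pred == CHOICES[q["answer"]]:
            correct[q["subject"]] += 1
    by_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return by_subject, overall
```

In practice the reported 0-shot and 5-shot accuracies come from harnesses such as lm-eval-harness or HELM (listed under Integrations), which handle prompt templating and answer scoring at the token level.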

Index Score
80.5
Adoption
96
Quality
88
Freshness
74
Citations
98
Engagement
0
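
For reference, the index score is not a plain mean of the five subscores: an unweighted average gives 71.2, not the listed 80.5, so the directory presumably applies its own unpublished weighting. A hypothetical sketch of such a composite (the weighting scheme is an assumption):

```python
SUBSCORES = {"adoption": 96, "quality": 88, "freshness": 74,
             "citations": 98, "engagement": 0}

def weighted_index(scores, weights):
    """Generic weighted composite; the site's actual weights are unknown."""
    return sum(scores[k] * weights[k] for k in weights) / sum(weights.values())

# An equal weighting yields 71.2, below the listed 80.5, which suggests the
# zero Engagement subscore is down-weighted in the published formula.
print(round(weighted_index(SUBSCORES, {k: 1.0 for k in SUBSCORES}), 1))  # 71.2
```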
