BenchmarkLLMs v1.0

MMLU-Pro

by TIGER-Lab · open-source · Last verified 2026-03-01

Enhanced version of MMLU with harder questions, 10 answer choices instead of 4, and reduced sensitivity to prompt formatting. Provides more discriminative evaluation of frontier model capabilities with a focus on reasoning-intensive questions.

https://github.com/TIGER-AI-Lab/MMLU-Pro
Grade: B (Above Average)

Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, advanced-knowledge-testing, reasoning-assessment
Integrations
lm-eval-harness
Use Cases
frontier-model-comparison, reasoning-evaluation, research
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
accuracy, 5-shot-accuracy
Methodology
Multiple-choice questions with 10 options across augmented MMLU subjects. Evaluates with chain-of-thought prompting to test reasoning depth.
Last Run
2026-02-20
Tags
benchmark, evaluation, knowledge, reasoning, advanced
Added
2026-03-17
Completeness
100%

Index Score

67.2
Adoption
78
Quality
90
Freshness
88
Citations
72
Engagement
0
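The index score above is presumably some aggregate of the five sub-scores. An unweighted mean of the listed values gives 65.6, not the published 67.2, so the index likely weights the components unevenly. The weights in this sketch are purely hypothetical, for illustration only:

```python
# Sub-scores as shown in the listing (Engagement is 0).
subscores = {
    "adoption": 78,
    "quality": 90,
    "freshness": 88,
    "citations": 72,
    "engagement": 0,
}

# Hypothetical weights for illustration; the real formula behind the
# published 67.2 index score is not documented in the listing.
weights = {
    "adoption": 0.25,
    "quality": 0.25,
    "freshness": 0.20,
    "citations": 0.20,
    "engagement": 0.10,
}

unweighted = sum(subscores.values()) / len(subscores)
weighted = sum(weights[k] * subscores[k] for k in subscores)
print(f"unweighted mean: {unweighted:.1f}")        # 65.6
print(f"weighted (hypothetical): {weighted:.1f}")  # 74.0
```

Neither aggregate reproduces 67.2 exactly, which suggests the site applies its own weighting or normalization before publishing the index.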
