
MMLU-Pro

by TIGER-Lab · free · Last verified 2026-03-01

MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally vetted questions, expanding answer choices from 4 to 10, and reducing sensitivity to prompt formatting, yielding a more robust and discriminative assessment.

https://github.com/TIGER-AI-Lab/MMLU-Pro
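For reference, the dataset is typically pulled from the Hugging Face Hub. The sketch below loads a split and prints one 10-option question; the dataset id "TIGER-Lab/MMLU-Pro" matches the public release, but the exact field names (question, options, answer, category) are assumptions to verify against the repository.

```python
# Minimal sketch: load MMLU-Pro and inspect one question.
# Assumed field names: question / options / answer / category.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

sample = ds[0]
print(sample["category"])             # subject, e.g. "math"
print(sample["question"])             # question stem
for letter, option in zip("ABCDEFGHIJ", sample["options"]):
    print(f"({letter}) {option}")     # up to 10 answer choices
print("gold:", sample["answer"])      # gold letter, e.g. "B"
```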
Overall grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F

Specifications

License
MIT
Pricing
free
Capabilities
Discriminative evaluation of frontier language models, Advanced reasoning and problem-solving assessment, Multi-disciplinary knowledge testing across STEM, humanities, and social sciences, Robustness testing against prompt sensitivity, Chain-of-thought and complex reasoning evaluation, Reduced likelihood of answer leakage and superficial pattern matching, Comparative analysis of state-of-the-art model performance
Integrations
Use Cases
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
accuracy, 5-shot-accuracy
Methodology
Multiple-choice questions with 10 options across augmented MMLU subjects. Models are evaluated with chain-of-thought prompting to test reasoning depth (see the scoring sketch below the specifications).
Last Run
2026-02-20
Tags
benchmark, model-evaluation, llm-testing, knowledge-assessment, reasoning-benchmark, natural-language-understanding, frontier-models, academic-research, ai-robustness
Added
2026-03-17
Completeness
90%
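As a rough illustration of how a harness like this scores a model, the sketch below builds a chain-of-thought prompt for a 10-option question, extracts the final letter with a lenient regex, and accumulates plain accuracy. The query_model callable and the "The answer is (X)" extraction pattern are assumptions for illustration, not the official TIGER-Lab harness.

```python
# Hedged sketch of a CoT multiple-choice scorer; query_model() is a
# placeholder for whatever client calls the model under test.
import re

CHOICES = "ABCDEFGHIJ"

def build_prompt(item):
    # Ask for step-by-step reasoning that ends in a parseable verdict.
    lines = [f"Question: {item['question']}"]
    lines += [f"({l}) {opt}" for l, opt in zip(CHOICES, item["options"])]
    lines.append('Think step by step, then finish with "The answer is (X)".')
    return "\n".join(lines)

def extract_answer(completion):
    # Lenient pattern: take the last "answer is (X)"-style verdict.
    hits = re.findall(r"answer is \(?([A-J])\)?", completion)
    return hits[-1] if hits else None

def score(items, query_model):
    correct = 0
    for item in items:
        pred = extract_answer(query_model(build_prompt(item)))
        correct += pred == item["answer"]   # bools sum as 0/1
    return correct / len(items)             # accuracy over the split
```

The same scaffold covers the 5-shot-accuracy metric by prepending worked example questions to each prompt before the target question.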

Index Score

67.2
Adoption
78
Quality
90
Freshness
88
Citations
72
Engagement
0
