MMLU-Pro
by TIGER-Lab · free · Last verified 2026-03-01
MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally vetted questions, expanding the answer choices from four to ten, and reducing sensitivity to prompt formatting, yielding a more robust and discriminative assessment.
https://github.com/TIGER-AI-Lab/MMLU-Pro
Overall grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- Discriminative evaluation of frontier language models, Advanced reasoning and problem-solving assessment, Multi-disciplinary knowledge testing across STEM, humanities, and social sciences, Robustness testing against prompt sensitivity, Chain-of-thought and complex reasoning evaluation, Reduced likelihood of answer leakage and superficial pattern matching, Comparative analysis of state-of-the-art model performance
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics
- accuracy, 5-shot-accuracy
- Methodology
- Multiple-choice questions with 10 options across augmented MMLU subjects. Evaluates with chain-of-thought prompting to test reasoning depth.
- Last Run
- 2026-02-20
- Tags
- benchmark, model-evaluation, llm-testing, knowledge-assessment, reasoning-benchmark, natural-language-understanding, frontier-models, academic-research, ai-robustness
- Added
- 2026-03-17
- Completeness
- 0.9%
Index Score
- Overall: 67.2
- Adoption: 78
- Quality: 90
- Freshness: 88
- Citations: 72
- Engagement: 0