
Minerva Math

by Google Research · free · Last verified 2026-03-01

Minerva Math is a quantitative reasoning benchmark designed to evaluate large language models on complex STEM problems. Drawn from LaTeX-formatted web pages and arXiv preprints, it covers subjects such as mathematics, physics, and chemistry, and its problems demand multi-step computation, symbolic manipulation, and deep scientific understanding.

https://github.com/google-research/minerva
Overall: C+ (Average) · Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F

Specifications

License
Apache-2.0
Pricing
free
Capabilities
large-language-model-evaluation, quantitative-reasoning-assessment, stem-problem-solving-benchmarking, mathematical-computation-testing, symbolic-reasoning-evaluation, scientific-knowledge-application, multi-step-reasoning-analysis
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
Metrics
accuracy, stem-accuracy
Methodology
Problems span STEM subjects and require mathematical computation and scientific reasoning. Models generate step-by-step solutions, and each final answer is checked for correctness; a minimal scoring sketch follows these specifications.
Last Run
2026-01-25
Tags
benchmark, evaluation, mathematics, stem, quantitative-reasoning, llm-evaluation, dataset, scientific-reasoning, natural-language-processing, ai-capability-testing
Added
2026-03-17
Completeness
95%
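
The methodology above implies a standard final-answer evaluation loop: generate a step-by-step solution, pull out the final answer, and compare it to the reference. Below is a minimal sketch of what that scoring could look like, assuming solutions end in a LaTeX \boxed{...} answer; the function names and normalization rules are illustrative, not the benchmark's published harness.

```python
import re


def extract_boxed_answer(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model solution.

    Tracks brace depth so nested expressions like \\boxed{\\frac{1}{2}}
    are recovered intact.
    """
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, out = 1, []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)


def normalize(answer: str) -> str:
    """Crude canonicalization: drop \\text{} wrappers, spaces, trailing periods."""
    answer = re.sub(r"\\text\{([^}]*)\}", r"\1", answer)
    return answer.replace(" ", "").rstrip(".")


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy over extracted final answers."""
    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if (ans := extract_boxed_answer(pred)) is not None
        and normalize(ans) == normalize(ref)
    )
    return correct / len(predictions)
```

Exact string match understates correctness for mathematically equal answers (1/2 versus 0.5), so published Minerva-style evaluations typically layer a symbolic equivalence check (e.g., via SymPy) on top of normalization like this.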

Index Score

58.9
Adoption
64
Quality
84
Freshness
74
Citations
66
Engagement
0
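
The 58.9 index presumably aggregates the five subscores above. Here is a minimal sketch of one plausible aggregation, a weighted average; the equal-weights default is an assumption (it yields 57.6, slightly below the listed 58.9, so the site's actual weighting evidently favors the stronger categories).

```python
# Subscores as listed for Minerva Math on this page.
SUBSCORES = {
    "adoption": 64,
    "quality": 84,
    "freshness": 74,
    "citations": 66,
    "engagement": 0,
}


def index_score(subscores: dict[str, int],
                weights: dict[str, float] | None = None) -> float:
    """Weighted average of 0-100 subscores.

    The listing does not publish its formula; with weights=None this
    falls back to equal weighting, which is only an approximation.
    """
    if weights is None:
        weights = {k: 1 / len(subscores) for k in subscores}
    return sum(weights[k] * subscores[k] for k in subscores)


print(round(index_score(SUBSCORES), 1))  # 57.6, vs. the listed 58.9
```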
