Benchmark · LLMs · v1.0

GSM8K

by OpenAI · open-source · Last verified 2026-03-01

GSM8K (Grade School Math 8K) is a benchmark of roughly 8,500 linguistically diverse grade school math word problems, each requiring 2 to 8 reasoning steps. It tests basic mathematical reasoning and arithmetic through problems that need sequential, multi-step solutions.

https://github.com/openai/grade-school-math

Index Score: C+ (Average)

Adoption: A+ · Quality: A · Freshness: B+ · Citations: F · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: model-evaluation, math-reasoning-testing, step-by-step-evaluation
Integrations: lm-eval-harness
Use Cases: math-ability-testing, reasoning-evaluation, model-comparison
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics: accuracy, 8-shot-accuracy
Methodology: Grade school math word problems requiring 2 to 8 solution steps. Models show their work and give a final numerical answer, which is scored by exact match against the reference answer.
Last Run: 2026-01-15
Tags: benchmark, evaluation, math, grade-school, reasoning
Added: 2026-03-17
Completeness: 80%
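
The exact-match scoring described under Methodology can be sketched roughly as follows. This is a minimal illustration, not the scoring code of lm-eval-harness or of the benchmark authors. It relies on the dataset convention that each reference solution ends with `#### <answer>`. Taking the last number in the model output as its final answer is an assumption made here, and all function names are hypothetical.

```python
import re
from typing import Optional, Sequence

# Integers or decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(num: str) -> str:
    """Drop thousands separators and trailing decimal zeros ('1,200.0' -> '1200')."""
    num = num.replace(",", "")
    if "." in num:
        num = num.rstrip("0").rstrip(".")
    return num


def gold_answer(solution: str) -> str:
    """Reference solutions end with '#### <answer>'; return the normalized answer."""
    return normalize(solution.split("####")[-1].strip())


def predicted_answer(completion: str) -> Optional[str]:
    """Assumption: the last number in the model output is its final answer."""
    matches = _NUMBER.findall(completion)
    return normalize(matches[-1]) if matches else None


def accuracy(completions: Sequence[str], solutions: Sequence[str]) -> float:
    """Fraction of completions whose final number exactly matches the reference."""
    correct = sum(
        predicted_answer(c) == gold_answer(s)
        for c, s in zip(completions, solutions)
    )
    return correct / len(solutions)


if __name__ == "__main__":
    # Hypothetical toy examples, not items from the dataset.
    outputs = ["3 + 4 = 7, so the answer is 7.", "I think it is 5"]
    references = ["3 + 4 = 7\n#### 7", "2 * 3 = 6\n#### 6"]
    print(accuracy(outputs, references))
```

For the 8-shot-accuracy metric, the same scoring would apply; the only difference is that eight worked examples are placed in the prompt before each test question. Stricter answer extraction (for example, requiring a fixed "The answer is" phrase) would change reported scores, so the extraction rule should be stated alongside any result.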
