GPQA

by NYU · open-source · Last verified 2026-03-01

Graduate-level Google-Proof Question Answering benchmark featuring questions written by domain experts in physics, chemistry, and biology. Questions are designed to be "Google-proof": even skilled non-experts with unrestricted web access cannot reliably answer them, so high scores require genuine reasoning rather than search or memorization.

https://github.com/idavidrein/gpqa
Overall grade: B+ (Good)
Adoption: A · Quality: A+ · Freshness: A · Citations: A · Engagement: F
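
The benchmark's data is distributed through the GitHub repository above and as a gated dataset on Hugging Face. A minimal inspection sketch, assuming the Hugging Face dataset id Idavidrein/gpqa, the gpqa_diamond config, and the column names from its dataset card (all of which may change between releases):

    # A minimal sketch: load the GPQA Diamond questions and look at one item.
    # Requires accepting the dataset's terms on Hugging Face and an auth token.
    # Config and column names ("Question", "Correct Answer", ...) are taken
    # from the published dataset card and are assumptions here.
    from datasets import load_dataset

    diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
    item = diamond[0]
    print(item["Question"])
    print("Correct:", item["Correct Answer"])
    for k in ("Incorrect Answer 1", "Incorrect Answer 2", "Incorrect Answer 3"):
        print("Distractor:", item[k])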

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, expert-knowledge-testing, reasoning-assessment
Integrations
lm-eval-harness (see the evaluation sketch below)
Use Cases
frontier-model-evaluation, reasoning-benchmarking, expert-level-assessment
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
Metrics
accuracy, diamond-accuracy
Methodology
Expert-written multiple-choice questions in STEM fields. The Diamond subset contains the hardest questions, each validated by multiple domain experts.
Last Run
2026-02-25
Tags
benchmark, evaluation, graduate-level, reasoning, expert
Added
2026-03-17
Completeness
100%
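
The lm-eval-harness integration listed above exposes GPQA as ready-made tasks. A minimal sketch of a zero-shot Diamond run through the harness's Python API, assuming a recent lm-evaluation-harness with a task named gpqa_diamond_zeroshot (task names vary across harness versions, the model id below is only an example, and the underlying dataset is gated on Hugging Face):

    # A minimal sketch: evaluate a Hugging Face model on GPQA Diamond via
    # lm-evaluation-harness. The task name "gpqa_diamond_zeroshot" and the
    # model id are assumptions; run `lm_eval --tasks list` to see the task
    # names your installed version actually ships.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
        tasks=["gpqa_diamond_zeroshot"],
    )
    # Per-task metrics (e.g. accuracy) live under results["results"].
    print(results["results"]["gpqa_diamond_zeroshot"])

The accuracy and diamond-accuracy metrics listed above presumably correspond to plain multiple-choice accuracy over the full question set and the Diamond subset, respectively.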

Index Score
71.6
Adoption
82
Quality
94
Freshness
86
Citations
80
Engagement
0
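
The site does not publish the weighting behind the Index Score, and a plain average of the five sub-scores gives (82 + 94 + 86 + 80 + 0) / 5 = 68.4, not 71.6, so the weights must be non-uniform. A minimal sketch of a weighted index of this shape, with purely hypothetical weights:

    # A minimal sketch of a weighted index score. The weights below are
    # hypothetical placeholders, not the site's published formula, so the
    # result (77.3) does not reproduce the listed 71.6.
    SUB_SCORES = {"adoption": 82, "quality": 94, "freshness": 86,
                  "citations": 80, "engagement": 0}
    WEIGHTS = {"adoption": 0.30, "quality": 0.25, "freshness": 0.20,
               "citations": 0.15, "engagement": 0.10}  # hypothetical

    def index_score(scores, weights):
        assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights sum to 1
        return sum(scores[k] * weights[k] for k in scores)

    print(round(index_score(SUB_SCORES, WEIGHTS), 1))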
