Benchmark · LLMs · v1.1

WinoGrande

by Allen AI · open-source · Last verified 2026-03-01

Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.

https://winogrande.allenai.org
Overall Grade: B (Above Average)
Adoption: A · Quality: B+ · Freshness: B · Citations: A · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
model-evaluation, coreference-testing, commonsense-assessment
Integrations
lm-eval-harness, helm
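Since the listing names lm-eval-harness as an integration, a typical invocation might look like the sketch below. The model checkpoint is a placeholder, and the flags follow the harness's common CLI usage; check the installed version's documentation before relying on them.

```shell
# Hedged sketch: evaluate a Hugging Face model on the WinoGrande task
# with lm-evaluation-harness. The pretrained= value is illustrative.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size 8
```

The `--num_fewshot 5` setting corresponds to the 5-shot-accuracy metric reported for this benchmark.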
Use Cases
model-comparison, commonsense-evaluation, language-understanding
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
accuracy, 5-shot-accuracy
Methodology
Binary-choice coreference resolution tasks. Models select which of two entities a pronoun refers to based on contextual and commonsense cues.
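The binary-choice protocol described above can be sketched in a few lines: fill the blank with each candidate entity, score both filled-in sentences, and pick the higher-scoring one. The function names and the toy scorer below are illustrative, not the actual evaluation code; a real harness scores each filled sentence with the model's log-likelihood.

```python
def fill(sentence: str, option: str) -> str:
    """Substitute one candidate entity into the blank ('_')."""
    return sentence.replace("_", option)

def predict(item: dict, score_fn) -> int:
    """Score the sentence with each option filled in; keep the higher one."""
    s1 = score_fn(fill(item["sentence"], item["option1"]))
    s2 = score_fn(fill(item["sentence"], item["option2"]))
    return 1 if s1 >= s2 else 2

def accuracy(items: list, score_fn) -> float:
    """Fraction of items where the chosen option matches the gold answer."""
    correct = sum(predict(it, score_fn) == it["answer"] for it in items)
    return correct / len(items)

# Toy stand-in scorer: raw sentence length. A real evaluation replaces
# this with the model's log-likelihood of the filled-in sentence.
toy_score = len

items = [
    {"sentence": "The trophy doesn't fit into the brown suitcase "
                 "because _ is too large.",
     "option1": "the trophy", "option2": "the suitcase", "answer": 1},
    {"sentence": "The trophy doesn't fit into the brown suitcase "
                 "because _ is too small.",
     "option1": "the trophy", "option2": "the suitcase", "answer": 2},
]

print(accuracy(items, toy_score))  # length scorer gets 1 of 2 right -> 0.5
```

The sentence pair above illustrates why the benchmark uses paired items: the two sentences differ by a single word ("large" vs "small") but flip the correct referent, so a scorer keyed to surface features rather than meaning cannot get both right.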
Last Run
2026-01-15
Tags
benchmark, evaluation, commonsense, coreference, reasoning
Added
2026-03-17
Completeness
100%

Index Score

69.7
Adoption
84
Quality
78
Freshness
66
Citations
82
Engagement
0
