Benchmark · LLMs · v1.1

WinoGrande

by Allen AI · open-source · Last verified 2026-03-01

Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.

https://winogrande.allenai.org
Overall Grade: B (Above Average)
Adoption: A · Quality: B+ · Freshness: B · Citations: A · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
model-evaluation, coreference-testing, commonsense-assessment
Integrations
lm-eval-harness, helm
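Since the listing names lm-eval-harness as an integration, a typical invocation might look like the sketch below. The model checkpoint is a placeholder, and the flags follow the harness's common CLI usage; check the installed version's documentation before relying on them.

```shell
# Hedged sketch: evaluate a Hugging Face model on the WinoGrande task
# with lm-evaluation-harness. The pretrained= value is illustrative.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size 8
```

The `--num_fewshot 5` setting corresponds to the 5-shot-accuracy metric reported for this benchmark.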
Use Cases
model-comparison, commonsense-evaluation, language-understanding
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
accuracy, 5-shot-accuracy
Methodology
Binary-choice coreference resolution tasks. Models select which of two entities a pronoun refers to based on contextual and commonsense cues.
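The binary-choice protocol described above can be sketched in a few lines: fill the blank with each candidate entity, score both filled-in sentences, and pick the higher-scoring one. The function names and the toy scorer below are illustrative, not the actual evaluation code; a real harness scores each filled sentence with the model's log-likelihood.

```python
def fill(sentence: str, option: str) -> str:
    """Substitute one candidate entity into the blank ('_')."""
    return sentence.replace("_", option)

def predict(item: dict, score_fn) -> int:
    """Score the sentence with each option filled in; keep the higher one."""
    s1 = score_fn(fill(item["sentence"], item["option1"]))
    s2 = score_fn(fill(item["sentence"], item["option2"]))
    return 1 if s1 >= s2 else 2

def accuracy(items: list, score_fn) -> float:
    """Fraction of items where the chosen option matches the gold answer."""
    correct = sum(predict(it, score_fn) == it["answer"] for it in items)
    return correct / len(items)

# Toy stand-in scorer: raw sentence length. A real evaluation replaces
# this with the model's log-likelihood of the filled-in sentence.
toy_score = len

items = [
    {"sentence": "The trophy doesn't fit into the brown suitcase "
                 "because _ is too large.",
     "option1": "the trophy", "option2": "the suitcase", "answer": 1},
    {"sentence": "The trophy doesn't fit into the brown suitcase "
                 "because _ is too small.",
     "option1": "the trophy", "option2": "the suitcase", "answer": 2},
]

print(accuracy(items, toy_score))  # length scorer gets 1 of 2 right -> 0.5
```

The sentence pair above illustrates why the benchmark uses paired items: the two sentences differ by a single word ("large" vs "small") but flip the correct referent, so a scorer keyed to surface features rather than meaning cannot get both right.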
Last Run
2026-01-15
Tags
benchmark, evaluation, commonsense, coreference, reasoning
Added
2026-03-17
Completeness
100%

Index Score

69.7
Adoption
84
Quality
78
Freshness
66
Citations
82
Engagement
0
