Benchmark LLMs v1.0

HellaSwag

by Allen AI · open-source · Last verified 2026-03-01

Evaluates commonsense natural language inference by asking models to select the most plausible continuation of an everyday scenario. The incorrect endings are machine-generated and adversarially filtered, which keeps the task challenging for models while remaining trivial for humans.

https://rowanzellers.com/hellaswag/
Overall grade: B+ (Good)
Adoption: A+ · Quality: A · Freshness: B · Citations: A · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: model-evaluation, commonsense-testing, completion-assessment
Integrations: lm-eval-harness, helm
Use Cases: model-comparison, commonsense-evaluation, pre-training-assessment
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics: accuracy, 10-shot-accuracy
Methodology: Scenario completion task with adversarially filtered, machine-generated wrong endings. Models choose among four candidate continuations; reported results use 10-shot evaluation (see the scoring sketch after this list).
Last Run: 2026-01-15
Tags: benchmark, evaluation, commonsense, completion, reasoning
Added: 2026-03-17
Completeness: 100%
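
The methodology above boils down to a likelihood comparison: the model scores each of the four candidate endings conditioned on the context, and the highest-scoring ending counts as its answer. Below is a minimal sketch of that scoring loop, assuming the Hugging Face transformers library and a small causal LM; the context and endings are invented stand-ins, not real HellaSwag items.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for illustration; gpt2 is just a placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def ending_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to `ending` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i + 1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Simple boundary slicing; assumes context + ending tokenizes cleanly at the
    # join (true for these whitespace-separated strings). Harnesses handle this
    # more carefully, and often also report a length-normalized variant (acc_norm).
    ending_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item() for pos in ending_positions
    )

# Invented example item with four candidate endings, as in HellaSwag.
context = "A man is standing on a ladder next to a house. He"
endings = [
    " begins painting the window trim.",
    " dives into a swimming pool.",
    " starts reciting a poem to the ladder.",
    " flies away holding the gutter.",
]
scores = [ending_logprob(context, e) for e in endings]
prediction = max(range(len(endings)), key=lambda i: scores[i])
print(f"Model picks ending {prediction}: {endings[prediction]!r}")

In practice, the lm-eval-harness integration listed above wraps this kind of loop end to end; a typical invocation matching the 10-shot metric is lm_eval --model hf --model_args pretrained=<model> --tasks hellaswag --num_fewshot 10 (exact flags depend on the harness version).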

Index Score: 74

Adoption: 90
Quality: 80
Freshness: 68
Citations: 88
Engagement: 0
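
The letter grades in the header appear to be derived from these numeric sub-scores, but the listing does not document the cutoffs. The sketch below uses hypothetical thresholds reverse-engineered to be consistent with the values shown on this page (90 maps to A+, 88 and 80 to A, 74 to B+, 68 to B, 0 to F); treat them as a guess, not the index's actual methodology.

# Hypothetical grade cutoffs, reverse-engineered from this page's scores;
# the real index methodology is not published here. A full scale would
# presumably include C and D bands that no score on this page exercises.
CUTOFFS = [
    (90, "A+"),
    (80, "A"),
    (73, "B+"),
    (65, "B"),
]

def grade(score: int) -> str:
    for floor, letter in CUTOFFS:
        if score >= floor:
            return letter
    return "F"  # everything below the B floor collapses to F in this sketch

scores = {"Index": 74, "Adoption": 90, "Quality": 80,
          "Freshness": 68, "Citations": 88, "Engagement": 0}
for name, s in scores.items():
    print(f"{name}: {s} -> {grade(s)}")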
