Benchmark LLMs v1.0

HellaSwag

by Allen AI · open-source · Last verified 2026-03-01

Evaluates commonsense natural language inference by asking models to select the most plausible continuation of an everyday scenario. The incorrect endings are machine-generated and adversarially filtered, which keeps the task challenging for models while remaining trivial for humans.

https://rowanzellers.com/hellaswag/
Overall grade: B+ (Good)
Adoption: A+ · Quality: A · Freshness: B · Citations: A · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: model-evaluation, commonsense-testing, completion-assessment
Integrations: lm-eval-harness, helm
Use Cases: model-comparison, commonsense-evaluation, pre-training-assessment
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics: accuracy, 10-shot-accuracy
Methodology: Scenario completion task with adversarially filtered, machine-generated wrong endings. Models choose among four candidate continuations; reported results use 10-shot evaluation (see the scoring sketch after this list).
Last Run: 2026-01-15
Tags: benchmark, evaluation, commonsense, completion, reasoning
Added: 2026-03-17
Completeness: 100%
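
The methodology above boils down to a likelihood comparison: the model scores each of the four candidate endings conditioned on the context, and the highest-scoring ending counts as its answer. Below is a minimal sketch of that scoring loop, assuming the Hugging Face transformers library and a small causal LM; the context and endings are invented stand-ins, not real HellaSwag items.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for illustration; gpt2 is just a placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def ending_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to `ending` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i + 1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Simple boundary slicing; assumes context + ending tokenizes cleanly at the
    # join (true for these whitespace-separated strings). Harnesses handle this
    # more carefully, and often also report a length-normalized variant (acc_norm).
    ending_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item() for pos in ending_positions
    )

# Invented example item with four candidate endings, as in HellaSwag.
context = "A man is standing on a ladder next to a house. He"
endings = [
    " begins painting the window trim.",
    " dives into a swimming pool.",
    " starts reciting a poem to the ladder.",
    " flies away holding the gutter.",
]
scores = [ending_logprob(context, e) for e in endings]
prediction = max(range(len(endings)), key=lambda i: scores[i])
print(f"Model picks ending {prediction}: {endings[prediction]!r}")

In practice, the lm-eval-harness integration listed above wraps this kind of loop end to end; a typical invocation matching the 10-shot metric is lm_eval --model hf --model_args pretrained=<model> --tasks hellaswag --num_fewshot 10 (exact flags depend on the harness version).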

Index Score: 74

Adoption: 90
Quality: 80
Freshness: 68
Citations: 88
Engagement: 0
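
The letter grades in the header appear to be derived from these numeric sub-scores, but the listing does not document the cutoffs. The sketch below uses hypothetical thresholds reverse-engineered to be consistent with the values shown on this page (90 maps to A+, 88 and 80 to A, 74 to B+, 68 to B, 0 to F); treat them as a guess, not the index's actual methodology.

# Hypothetical grade cutoffs, reverse-engineered from this page's scores;
# the real index methodology is not published here. A full scale would
# presumably include C and D bands that no score on this page exercises.
CUTOFFS = [
    (90, "A+"),
    (80, "A"),
    (73, "B+"),
    (65, "B"),
]

def grade(score: int) -> str:
    for floor, letter in CUTOFFS:
        if score >= floor:
            return letter
    return "F"  # everything below the B floor collapses to F in this sketch

scores = {"Index": 74, "Adoption": 90, "Quality": 80,
          "Freshness": 68, "Citations": 88, "Engagement": 0}
for name, s in scores.items():
    print(f"{name}: {s} -> {grade(s)}")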
