BenchmarkLLMs v1.0

Chatbot Arena Hard

by LMSYS · open-source · Last verified 2026-03-01

Curated subset of 500 challenging user prompts from Chatbot Arena that most effectively discriminate between model capabilities. Provides a static, reproducible version of the Arena methodology with automated judging for quick model evaluation on hard tasks.

https://github.com/lm-sys/arena-hard-auto
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A+ · Citations: B+ · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
model-evaluation, hard-prompt-testing, automated-judging
Integrations
arena-hard-auto
Use Cases
quick-model-comparison, hard-prompt-evaluation, reproducible-benchmarking
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
win-rate, elo-rating
Methodology
500 challenging prompts selected from the Arena for maximum discriminative power. An automated GPT-4-based judge compares each model's outputs pairwise against a fixed baseline model; see the win-rate sketch after this list.
Last Run
2026-03-10
Tags
benchmark, evaluation, chat, hard-prompts, human-preference
Added
2026-03-17
Completeness
100%
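
The listed metrics (win-rate, elo-rating) are both derived from the pairwise judge verdicts described under Methodology. The sketch below is a generic illustration of that derivation, not the arena-hard-auto implementation (which may aggregate verdicts differently): the verdict list, function names, and the 400-point Elo scale are assumptions made for illustration.

```python
import math

# Hypothetical judge verdicts for one candidate model vs. the baseline,
# one entry per prompt: "win", "tie", or "loss" (values are illustrative).
verdicts = ["win", "win", "tie", "loss", "win"]

def win_rate(verdicts):
    # Ties are conventionally counted as half a win in pairwise evaluation.
    score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0
                for v in verdicts)
    return score / len(verdicts)

def elo_gap(p, scale=400):
    # Standard Elo relation: a win probability p against the baseline
    # corresponds to a rating difference of scale * log10(p / (1 - p)).
    return scale * math.log10(p / (1 - p))

p = win_rate(verdicts)
print(f"win rate vs. baseline: {p:.2%}")   # 70.00% for the sample verdicts
print(f"implied Elo gap: {elo_gap(p):+.0f}")  # roughly +147
```

A win rate of exactly 50% maps to an Elo gap of 0, i.e. the candidate is judged no better or worse than the baseline on the hard prompts.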

Index Score

63.9
Adoption
72
Quality
88
Freshness
90
Citations
70
Engagement
0
