Chatbot Arena Hard
by LMSYS · open-source · Last verified 2026-03-01
Curated subset of 500 challenging user prompts from Chatbot Arena that most effectively discriminate between model capabilities. Provides a static, reproducible version of the Arena methodology with automated judging for quick model evaluation on hard tasks.
https://github.com/lm-sys/arena-hard-auto
Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A+ · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: model-evaluation, hard-prompt-testing, automated-judging
- Integrations: arena-hard-auto
- Use Cases: quick-model-comparison, hard-prompt-evaluation, reproducible-benchmarking
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics: win-rate, elo-rating
- Methodology: 500 challenging prompts selected from Chatbot Arena for maximum discriminative power; automated GPT-4-based judging compares each model's outputs pairwise against a fixed baseline (a toy scoring sketch follows this list).
- Last Run: 2026-03-10
- Tags: benchmark, evaluation, chat, hard-prompts, human-preference
- Added: 2026-03-17
- Completeness: 100%
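
The reported metrics (win-rate, elo-rating) follow directly from the pairwise methodology. Below is a minimal, illustrative Python sketch of how per-model win rates against a fixed baseline can be aggregated, and how a win rate maps to an Elo-style rating gap via the standard logistic relationship. The judgment records and model names are hypothetical; this is not the arena-hard-auto scoring code.

```python
import math
from collections import defaultdict

# Hypothetical pairwise judgments: each record says whether a candidate
# model's answer beat the fixed baseline answer on one hard prompt.
# Illustrative data only -- not actual Arena-Hard results.
judgments = [
    {"model": "model-a", "result": "win"},
    {"model": "model-a", "result": "loss"},
    {"model": "model-a", "result": "tie"},
    {"model": "model-b", "result": "win"},
    {"model": "model-b", "result": "win"},
    {"model": "model-b", "result": "loss"},
]

def win_rates(records):
    """Win rate vs. the baseline, counting a tie as half a win."""
    wins = defaultdict(float)
    totals = defaultdict(int)
    for r in records:
        totals[r["model"]] += 1
        if r["result"] == "win":
            wins[r["model"]] += 1.0
        elif r["result"] == "tie":
            wins[r["model"]] += 0.5
    return {m: wins[m] / totals[m] for m in totals}

def elo_gap(win_rate):
    """Elo-style rating difference implied by a win rate (400 * log10 odds)."""
    return 400 * math.log10(win_rate / (1 - win_rate))

for model, rate in win_rates(judgments).items():
    print(f"{model}: win-rate={rate:.3f}, elo-gap={elo_gap(rate):+.1f}")
```

Under this mapping a 50% win rate corresponds to parity with the baseline (Elo gap of 0), while a two-thirds win rate implies a gap of roughly +120 Elo points.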
Index Score: 63.9
- Adoption: 72
- Quality: 88
- Freshness: 90
- Citations: 70
- Engagement: 0