Chatbot Arena Hard
by LMSYS · free · Last verified 2026-03-01
Chatbot Arena Hard is a static benchmark of 500 challenging prompts curated from Chatbot Arena, designed to rigorously evaluate and differentiate the capabilities of large language models. In place of live human voting, it uses an automated judge, typically a strong model such as GPT-4, as a fast, reproducible proxy for human preference.
https://github.com/lm-sys/arena-hard-auto
Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A+ · Citations: B+ · Engagement: F
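The pairwise, judge-based protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the Arena-Hard-Auto implementation: the prompt template, the `judge_pair` helper, and the `gpt-4` default are assumptions made for the sketch.

```python
# Minimal sketch of pairwise LLM-as-judge scoring in the style described
# above. Prompt wording, helper names, and the judge model default are
# illustrative assumptions, not the actual Arena-Hard-Auto code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user prompt below and answer with exactly one token: A, B, or TIE.

[Prompt]
{prompt}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""

def judge_pair(prompt: str, answer_a: str, answer_b: str,
               judge_model: str = "gpt-4") -> str:
    """Return 'A', 'B', or 'TIE' according to the judge model's verdict."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic judging keeps reruns reproducible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                prompt=prompt, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

Real judging pipelines typically also swap the A/B order of the candidate and baseline answers across runs to control for position bias; that detail is omitted here for brevity.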
Specifications
- License
- Apache-2.0
- Pricing
- free
- Capabilities
- comparative-model-analysis, leaderboard-ranking, reproducible-evaluation, automated-judging-with-gpt4, hard-prompt-testing, instruction-following-assessment, coding-and-reasoning-evaluation, writing-style-assessment
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics
- win-rate, elo-rating
- Methodology
- 500 challenging prompts selected from Chatbot Arena for maximum discriminative power between models. An automated GPT-4-based judge compares each candidate model's outputs pairwise against a fixed baseline (see the sketch after this list).
- Last Run
- 2026-03-10
- Tags
- benchmark, evaluation, chat, hard-prompts, human-preference, llm-evaluation, model-comparison, automated-judging, leaderboard, lmsys, reproducible-research
- Added
- 2026-03-17
- Completeness
- 85%
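The two listed metrics, win-rate and elo-rating, can be computed from such pairwise verdicts. Below is a minimal sketch of that roll-up; the tie weight of 0.5, the Elo K-factor of 32, and the initial rating of 1000 are conventional assumptions, not values documented for this benchmark.

```python
# Minimal sketch: aggregate pairwise verdicts ("A" = candidate wins,
# "B" = baseline wins, "TIE") into a win rate and online Elo updates.
# Tie weight (0.5) and K-factor (32) are assumptions, not benchmark values.
from collections import defaultdict

def win_rate(verdicts: list[str]) -> float:
    """Candidate's win rate vs. the baseline; a tie counts as half a win."""
    score = sum(1.0 if v == "A" else 0.5 if v == "TIE" else 0.0
                for v in verdicts)
    return score / len(verdicts)

def update_elo(ratings: dict[str, float], a: str, b: str,
               outcome: float, k: float = 32.0) -> None:
    """One Elo update; outcome is 1.0 if `a` wins, 0.5 for a tie, 0.0 if `b` wins."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    delta = k * (outcome - expected_a)
    ratings[a] += delta
    ratings[b] -= delta

# New models start at 1000 (assumed initial rating).
ratings: dict[str, float] = defaultdict(lambda: 1000.0)
update_elo(ratings, "candidate", "baseline", outcome=1.0)
print(win_rate(["A", "TIE", "B"]))  # 0.5
```

Note that leaderboards such as Chatbot Arena typically fit a Bradley-Terry model over all pairwise outcomes rather than running sequential Elo updates; the online form above is just the simplest illustration of the elo-rating metric.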
Index Score
63.9
Adoption: 72 · Quality: 88 · Freshness: 90 · Citations: 70 · Engagement: 0