
Chatbot Arena Hard

by LMSYS · free · Last verified 2026-03-01

Chatbot Arena Hard is a static benchmark of 500 challenging prompts curated from Chatbot Arena, designed to rigorously evaluate and differentiate the capabilities of large language models. It uses an automated judging pipeline, typically a strong judge model such as GPT-4, as a fast, reproducible proxy for human preference.

https://github.com/lm-sys/arena-hard-auto
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A+ · Citations: B+ · Engagement: F
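The automated judging noted in the description follows the LLM-as-judge pattern: a judge model sees the original prompt alongside two candidate answers and states which it prefers. The sketch below is a minimal illustration of that pattern, assuming the openai Python client (>= 1.0); the prompt template and function are hypothetical and are not the actual judge prompt or harness used by arena-hard-auto.

# Minimal LLM-as-judge sketch (illustrative only; not the arena-hard-auto harness).
# Assumes the `openai` Python client >= 1.0 and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; the real benchmark uses its own template and scoring scale.
JUDGE_PROMPT = """You are comparing two assistant responses to the same prompt.

Prompt:
{prompt}

Response A:
{answer_a}

Response B:
{answer_b}

Which response is better? Reply with exactly "A", "B", or "tie"."""


def judge_pair(prompt: str, answer_a: str, answer_b: str, judge_model: str = "gpt-4") -> str:
    """Ask the judge model which of two answers it prefers; returns "A", "B", or "tie"."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()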

Specifications

License
Apache-2.0
Pricing
free
Capabilities
comparative-model-analysis, leaderboard-ranking, reproducible-evaluation, automated-judging-with-gpt4, hard-prompt-testing, instruction-following-assessment, coding-and-reasoning-evaluation, writing-style-assessment
Integrations
Use Cases
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
win-rate, elo-rating
Methodology
500 challenging prompts selected from Chatbot Arena for maximum discriminative power. Automated GPT-4-based judging compares each model's outputs pairwise against a fixed baseline model (see the aggregation sketch after this specification list).
Last Run
2026-03-10
Tags
benchmark, evaluation, chat, hard-prompts, human-preference, llm-evaluation, model-comparison, automated-judging, leaderboard, lmsys, reproducible-research
Added
2026-03-17
Completeness
85%
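The win-rate and elo-rating metrics listed above are derived from these pairwise judgments against the baseline model. The sketch below shows one minimal way such judgments could be aggregated; the judgment record format, function names, and the simple online Elo update are assumptions for illustration, not the scoring code shipped with arena-hard-auto.

# Aggregating pairwise judgments into a win rate and a simple Elo-style rating.
# The `judgments` record format is a hypothetical example, not the repo's output schema.
from collections import defaultdict


def win_rate(judgments: list[dict], model: str, baseline: str) -> float:
    """Fraction of prompts on which `model` beats `baseline`; ties count as half a win."""
    wins, total = 0.0, 0
    for j in judgments:
        if {j["model_a"], j["model_b"]} != {model, baseline}:
            continue
        total += 1
        if j["winner"] == model:
            wins += 1.0
        elif j["winner"] == "tie":
            wins += 0.5
    return wins / total if total else 0.0


def elo_ratings(judgments: list[dict], k: float = 4.0, base: float = 1000.0) -> dict[str, float]:
    """Online Elo update over a stream of pairwise outcomes."""
    ratings: dict[str, float] = defaultdict(lambda: base)
    for j in judgments:
        a, b, winner = j["model_a"], j["model_b"], j["winner"]
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == a else 0.5 if winner == "tie" else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


# Example with hypothetical records:
# judgments = [{"model_a": "gpt-5", "model_b": "baseline", "winner": "gpt-5"}, ...]
# print(win_rate(judgments, "gpt-5", "baseline"))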

Index Score

63.9
Adoption
72
Quality
88
Freshness
90
Citations
70
Engagement
0
