
Chatbot Arena Hard

by LMSYS · free · Last verified 2026-03-01

Chatbot Arena Hard is a static benchmark of 500 challenging prompts curated from Chatbot Arena, designed to rigorously evaluate and differentiate the capabilities of large language models. It uses an automated judging pipeline, typically a strong judge model such as GPT-4, as a fast, reproducible proxy for human preference.

https://github.com/lm-sys/arena-hard-auto
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: A+ · Citations: B+ · Engagement: F
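The automated judging noted in the description follows the LLM-as-judge pattern: a judge model sees the original prompt alongside two candidate answers and states which it prefers. The sketch below is a minimal illustration of that pattern, assuming the openai Python client (>= 1.0); the prompt template and function are hypothetical and are not the actual judge prompt or harness used by arena-hard-auto.

# Minimal LLM-as-judge sketch (illustrative only; not the arena-hard-auto harness).
# Assumes the `openai` Python client >= 1.0 and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical judge prompt; the real benchmark uses its own template and scoring scale.
JUDGE_PROMPT = """You are comparing two assistant responses to the same prompt.

Prompt:
{prompt}

Response A:
{answer_a}

Response B:
{answer_b}

Which response is better? Reply with exactly "A", "B", or "tie"."""


def judge_pair(prompt: str, answer_a: str, answer_b: str, judge_model: str = "gpt-4") -> str:
    """Ask the judge model which of two answers it prefers; returns "A", "B", or "tie"."""
    completion = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()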

Specifications

License
Apache-2.0
Pricing
free
Capabilities
comparative-model-analysis, leaderboard-ranking, reproducible-evaluation, automated-judging-with-gpt4, hard-prompt-testing, instruction-following-assessment, coding-and-reasoning-evaluation, writing-style-assessment
Integrations
Use Cases
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
win-rate, elo-rating
Methodology
500 challenging prompts selected from Chatbot Arena for maximum discriminative power. Automated GPT-4-based judging compares each model's outputs pairwise against a fixed baseline model (see the aggregation sketch after this specification list).
Last Run
2026-03-10
Tags
benchmark, evaluation, chat, hard-prompts, human-preference, llm-evaluation, model-comparison, automated-judging, leaderboard, lmsys, reproducible-research
Added
2026-03-17
Completeness
85%
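The win-rate and elo-rating metrics listed above are derived from these pairwise judgments against the baseline model. The sketch below shows one minimal way such judgments could be aggregated; the judgment record format, function names, and the simple online Elo update are assumptions for illustration, not the scoring code shipped with arena-hard-auto.

# Aggregating pairwise judgments into a win rate and a simple Elo-style rating.
# The `judgments` record format is a hypothetical example, not the repo's output schema.
from collections import defaultdict


def win_rate(judgments: list[dict], model: str, baseline: str) -> float:
    """Fraction of prompts on which `model` beats `baseline`; ties count as half a win."""
    wins, total = 0.0, 0
    for j in judgments:
        if {j["model_a"], j["model_b"]} != {model, baseline}:
            continue
        total += 1
        if j["winner"] == model:
            wins += 1.0
        elif j["winner"] == "tie":
            wins += 0.5
    return wins / total if total else 0.0


def elo_ratings(judgments: list[dict], k: float = 4.0, base: float = 1000.0) -> dict[str, float]:
    """Online Elo update over a stream of pairwise outcomes."""
    ratings: dict[str, float] = defaultdict(lambda: base)
    for j in judgments:
        a, b, winner = j["model_a"], j["model_b"], j["winner"]
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == a else 0.5 if winner == "tie" else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


# Example with hypothetical records:
# judgments = [{"model_a": "gpt-5", "model_b": "baseline", "winner": "gpt-5"}, ...]
# print(win_rate(judgments, "gpt-5", "baseline"))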

Index Score

63.9
Adoption
72
Quality
88
Freshness
90
Citations
70
Engagement
0
