MT-Bench
by LMSYS · open-source · Last verified 2026-03-01
Multi-turn conversation benchmark with 80 high-quality questions across 8 categories including writing, reasoning, math, coding, and extraction. Uses GPT-4 as an automated judge to evaluate response quality on a 1-10 scale across two conversation turns.
https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
Grade: B+ (Good)
Adoption: A · Quality: A · Freshness: B+ · Citations: A · Engagement: F
Specifications
- License
- Apache-2.0
- Pricing
- open-source
- Capabilities
- model-evaluation, multi-turn-testing, automated-judging
- Integrations
- fastchat
- Use Cases
- chat-model-comparison, instruction-following-evaluation, multi-turn-assessment
- API Available
- No
- Evaluated Models
- claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics
- average-score, turn-1-score, turn-2-score
- Methodology
- 80 multi-turn questions across 8 categories. Responses judged by GPT-4 on a 1-10 scale for each turn. Average score computed across all categories.
- Last Run
- 2026-02-10
- Tags
- benchmark, evaluation, multi-turn, chat, instruction-following
- Added
- 2026-03-17
- Completeness
- 100%
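The metrics above (average-score, turn-1-score, turn-2-score) follow directly from the methodology: each judged turn yields a 1-10 score, and per-model results are means over those scores. A minimal sketch of that aggregation, assuming a hypothetical flat list of judgment records (the field names and `aggregate_scores` helper are illustrative, not FastChat's actual API):

```python
# Hypothetical sketch of MT-Bench-style score aggregation.
# Each judgment record holds: model, category, turn (1 or 2), and a 1-10
# score from the automated judge. Record shape is an assumption, not
# FastChat's internal format.
from collections import defaultdict
from statistics import mean

def aggregate_scores(judgments):
    """Compute per-model average, turn-1, and turn-2 mean scores."""
    by_model = defaultdict(lambda: {"turn_1": [], "turn_2": [], "all": []})
    for j in judgments:
        bucket = by_model[j["model"]]
        bucket[f"turn_{j['turn']}"].append(j["score"])
        bucket["all"].append(j["score"])
    return {
        model: {
            "average": round(mean(b["all"]), 2),
            "turn_1": round(mean(b["turn_1"]), 2),
            "turn_2": round(mean(b["turn_2"]), 2),
        }
        for model, b in by_model.items()
    }

judgments = [
    {"model": "model-a", "category": "writing", "turn": 1, "score": 9},
    {"model": "model-a", "category": "writing", "turn": 2, "score": 8},
    {"model": "model-a", "category": "math", "turn": 1, "score": 6},
    {"model": "model-a", "category": "math", "turn": 2, "score": 5},
]
print(aggregate_scores(judgments))
# → {'model-a': {'average': 7.0, 'turn_1': 7.5, 'turn_2': 6.5}}
```

The turn-2 mean typically trails turn-1, since the second turn tests whether a model stays coherent after a follow-up; reporting both separately is what distinguishes MT-Bench from single-turn benchmarks.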
Index Score: 72.2
- Adoption: 86
- Quality: 84
- Freshness: 78
- Citations: 84
- Engagement: 0