AlpacaEval
by Stanford · open-source · Last verified 2026-03-01
Automated evaluation framework that compares model outputs against a reference model on a fixed set of 805 instructions. An LLM judge picks the preferred response in each pair, producing a win rate; a length-controlled variant corrects for the judge's tendency to reward verbosity over quality (see the sketch below).
https://github.com/tatsu-lab/alpaca_eval
Overall grade: B (Above Average)
Adoption: A · Quality: A · Freshness: A · Citations: B+ · Engagement: F
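To make the comparison loop concrete, below is a minimal Python sketch of the pairwise judging that produces a raw win rate. The `judge_prefers_model` callable is a hypothetical stand-in for an LLM-judge call; the real framework additionally handles judge prompting, annotator configs, and caching.

```python
from typing import Callable

def win_rate(
    instructions: list[str],
    model_outputs: list[str],
    baseline_outputs: list[str],
    judge_prefers_model: Callable[[str, str, str], float],
) -> float:
    """Raw win rate: fraction of instructions where the judge prefers
    the candidate model's output over the baseline's.

    judge_prefers_model(instruction, model_out, baseline_out) returns
    1.0 (model wins), 0.0 (baseline wins), or 0.5 (tie).
    """
    scores = [
        judge_prefers_model(inst, out, base)
        for inst, out, base in zip(instructions, model_outputs, baseline_outputs)
    ]
    return sum(scores) / len(scores)
```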
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: model-evaluation, automated-comparison, instruction-following-assessment
- Integrations: alpaca-eval
- Use Cases: model-comparison, instruction-following-evaluation, chat-model-ranking
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics: win-rate, lc-win-rate, avg-length
- Methodology: 805 instructions drawn from diverse categories. Each model output is compared against a GPT-4-Turbo baseline response by an automated LLM judge; the length-controlled win rate corrects for verbosity bias (see the sketch after this list).
- Last Run: 2026-02-20
- Tags: benchmark, evaluation, instruction-following, automated, comparison
- Added: 2026-03-17
- Completeness: 100%
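Below is a simplified sketch of one way to length-control a win rate, assuming the approach of regressing judge preferences on the normalized length difference and reading off the predicted win probability at zero difference. This illustrates the idea only; AlpacaEval's actual length-controlled metric is a more careful regression-based debiasing described in the repository.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lc_win_rate(
    prefs: np.ndarray,       # 0/1 judge preferences (1 = model preferred)
    model_lens: np.ndarray,  # candidate output lengths, per instruction
    base_lens: np.ndarray,   # baseline output lengths, per instruction
) -> float:
    """Counterfactual win rate with the length advantage regressed out.

    Assumes prefs contains both wins and losses so the regression can fit.
    """
    # Normalized length difference serves as the verbosity covariate.
    len_diff = (model_lens - base_lens) / (model_lens + base_lens)
    model = LogisticRegression().fit(len_diff.reshape(-1, 1), prefs)
    # Predicted preference probability at zero length difference.
    return float(model.predict_proba(np.zeros((1, 1)))[0, 1])
```

Under a scheme like this, a model that wins mainly by producing longer answers sees its length-controlled win rate pulled back toward the preference rate its content alone would earn.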
Index Score
- Overall: 67.9
- Adoption: 80
- Quality: 82
- Freshness: 80
- Citations: 78
- Engagement: 0