AgentBench
by Tsinghua University · open-source · Last verified 2026-03-01
Comprehensive benchmark that evaluates LLM agents across eight distinct environments, including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. It tests how well agent capabilities generalize across diverse interaction paradigms.
https://github.com/THUDM/AgentBench
Grade: C+ (Average)
Adoption: B · Quality: A · Freshness: A · Citations: B · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: agent-evaluation, multi-environment-testing, generalization-assessment
- Integrations: docker
- Use Cases: agent-comparison, generalization-evaluation, interactive-agent-testing
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics: overall-score, os-score, db-score, web-score
- Methodology: Eight interactive environments, each testing a different agent capability. The overall score is a weighted average across environments, based on task completion and action accuracy (see the sketch after this list).
- Last Run: 2026-02-20
- Tags: benchmark, evaluation, agents, multi-environment, interactive
- Added: 2026-03-17
- Completeness: 100%
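A minimal sketch of the overall-score aggregation named in the Methodology entry, assuming equal environment weights; the environment names come from the description above, and every weight and score below is hypothetical rather than AgentBench's own.

```python
# Weighted average of per-environment scores, as the Methodology
# entry describes. Only the six environments named in the
# description are listed; weights and scores are assumptions.

ENV_WEIGHTS: dict[str, float] = {
    "os": 1.0,            # operating-system interaction
    "db": 1.0,            # database querying
    "kg": 1.0,            # knowledge-graph reasoning
    "card_game": 1.0,     # digital card game
    "lateral": 1.0,       # lateral thinking puzzles
    "web_shopping": 1.0,  # web shopping
}

def overall_score(env_scores: dict[str, float],
                  weights: dict[str, float] = ENV_WEIGHTS) -> float:
    """Weighted average over the environments in env_scores, where
    each per-environment score already folds in task completion and
    action accuracy."""
    total = sum(weights[env] for env in env_scores)
    return sum(env_scores[env] * weights[env] for env in env_scores) / total

# Made-up per-environment results, purely illustrative:
print(overall_score({
    "os": 0.42, "db": 0.61, "kg": 0.33,
    "card_game": 0.25, "lateral": 0.18, "web_shopping": 0.55,
}))  # -> 0.39 under equal weights
```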
Index Score
Overall: 59.3
- Adoption: 64
- Quality: 86
- Freshness: 82
- Citations: 66
- Engagement: 0
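For reference, a hedged sketch of how the index score could be rolled up from the five sub-scores above. Equal weighting is an assumption: it yields 59.6 rather than the listed 59.3, so the directory's actual weights presumably differ slightly and are not documented here.

```python
# Index-score roll-up from the sub-scores listed above. The equal
# weights are an assumption, not the directory's published formula.

SUB_SCORES = {
    "adoption": 64,
    "quality": 86,
    "freshness": 82,
    "citations": 66,
    "engagement": 0,
}

def index_score(scores: dict[str, float],
                weights: dict[str, float] | None = None) -> float:
    """Weighted average of sub-scores; defaults to equal weights."""
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in scores) / total

print(round(index_score(SUB_SCORES), 1))  # 59.6 with equal weights
```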