
AgentBench

by Tsinghua University · open-source · Last verified 2026-03-01

Comprehensive benchmark that evaluates LLM agents across 8 distinct environments, including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. It tests how well agent capabilities generalize across diverse interaction paradigms.

https://github.com/THUDM/AgentBench
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: A · Citations: B · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
agent-evaluation, multi-environment-testing, generalization-assessment
Integrations
docker
Use Cases
agent-comparison, generalization-evaluation, interactive-agent-testing
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
overall-score, os-score, db-score, web-score
Methodology
8 interactive environments testing different agent capabilities. Overall score computed as weighted average across environments based on task completion and action accuracy.
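The aggregation described above can be sketched as follows. This is a minimal illustration, not AgentBench's actual scoring code: the environment names, the 0.7/0.3 blend of completion and accuracy, and the equal weights are all assumptions for the example.

```python
# Sketch of the methodology above: each environment yields a score that
# blends task completion with action accuracy, and the overall score is
# a weighted average across environments.
# All names, weights, and numbers here are illustrative assumptions.

def environment_score(completion_rate, action_accuracy, completion_weight=0.7):
    """Blend task-completion rate with action accuracy for one environment."""
    return completion_weight * completion_rate + (1 - completion_weight) * action_accuracy

def overall_score(env_scores, env_weights):
    """Weighted average across environments; weights need not sum to 1."""
    total = sum(env_weights.values())
    return sum(env_scores[env] * w for env, w in env_weights.items()) / total

# Hypothetical per-environment results (completion rate, action accuracy):
scores = {
    "os":  environment_score(0.60, 0.80),
    "db":  environment_score(0.45, 0.70),
    "web": environment_score(0.30, 0.55),
}
weights = {"os": 1.0, "db": 1.0, "web": 1.0}  # equal weighting assumed

print(round(overall_score(scores, weights), 3))
```

With the equal weights assumed here this reduces to a plain mean of the per-environment scores; the real benchmark may weight environments unequally, which only changes the `weights` mapping.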
Last Run
2026-02-20
Tags
benchmark, evaluation, agents, multi-environment, interactive
Added
2026-03-17
Completeness
100%

Index Score

59.3
Adoption
64
Quality
86
Freshness
82
Citations
66
Engagement
0
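The index score appears to be a composite of the five category subscores listed above. A minimal sketch with equal weights (an assumption; the listed overall of 59.3 suggests the site's actual weighting differs slightly from a plain mean):

```python
# Sketch: recombining the listed category subscores into a composite index.
# Equal weighting is an assumption; a plain mean of the five subscores
# gives 59.6, while the site lists 59.3, so its real weights differ slightly.

subscores = {
    "adoption": 64,
    "quality": 86,
    "freshness": 82,
    "citations": 66,
    "engagement": 0,
}

index = sum(subscores.values()) / len(subscores)
print(index)  # 59.6
```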
