AgentBoard
by Ma et al. / Shanghai AI Lab · free · Last verified 2026-03-17
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM)-based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
https://hkust-nlp.github.io/agentboard/
Overall grade: B (Above Average)
Adoption: B · Quality: A · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- multi-task-agent-evaluation, sub-goal-progress-tracking, embodied-ai-benchmarking, web-browsing-agent-testing, tool-use-capability-assessment, database-operation-evaluation, os-interaction-simulation, code-execution-verification, comparative-agent-analysis
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro, llama-3-70b
- Metrics
- success-rate, progress-rate
- Methodology
- Nine task environments; each task decomposes into multiple sub-goals. Success rate = fraction of tasks fully solved. Progress rate = average fraction of sub-goals completed per task, a partial-credit measure that discriminates between agent capability levels even when no task is fully solved.
- Last Run
- 2026-02-22
- Tags
- agent-evaluation, llm-benchmark, multi-task-evaluation, embodied-ai, web-browsing, tool-use, gaming-ai, database-ops, os-interaction, code-execution, puzzle-solving
- Added
- 2026-03-17
- Completeness
- 80%
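The two metrics in the methodology above can be sketched in a few lines. This is an illustrative reimplementation, not AgentBoard's actual code; the example runs and sub-goal flags are made up.

```python
# Sketch of AgentBoard-style metrics: success rate (all-or-nothing)
# and progress rate (partial credit over sub-goals).

def success_rate(results):
    """Fraction of task episodes in which every sub-goal was completed."""
    return sum(all(r) for r in results) / len(results)

def progress_rate(results):
    """Average fraction of sub-goals completed per episode."""
    return sum(sum(r) / len(r) for r in results) / len(results)

# Each inner list holds per-sub-goal completion flags for one task episode
# (hypothetical data for illustration).
runs = [
    [True, True, True],     # fully solved
    [True, True, False],    # 2 of 3 sub-goals reached
    [False, False, False],  # no progress
]

print(success_rate(runs))   # 1/3: only one episode fully solved
print(progress_rate(runs))  # 5/9: (3/3 + 2/3 + 0/3) / 3
```

Note how the progress rate separates the second agent run from the third even though both count as failures under the success rate; this is the partial-credit discrimination the methodology describes.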
Index Score: 61.1
- Adoption: 65
- Quality: 88
- Freshness: 82
- Citations: 70
- Engagement: 0