Benchmark · AI Agents · v1.0

AgentBoard

by Ma et al. / Shanghai AI Lab · free · Last verified 2026-03-17

AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM)-based agents. It assesses agent performance across nine diverse task environments, including embodied AI, gaming, web browsing, and tool use. Unlike benchmarks that report only final outcomes, it measures both complete task success and partial progress through a fine-grained sub-goal metric.

https://hkust-nlp.github.io/agentboard/
Overall Grade: B (Above Average)
Adoption: B · Quality: A · Freshness: A · Citations: B+ · Engagement: F

Specifications

License
MIT
Pricing
free
Capabilities
multi-task-agent-evaluation, sub-goal-progress-tracking, embodied-ai-benchmarking, web-browsing-agent-testing, tool-use-capability-assessment, database-operation-evaluation, os-interaction-simulation, code-execution-verification, comparative-agent-analysis
Integrations
Use Cases
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, gemini-2-5-pro, llama-3-70b
Metrics
success-rate, progress-rate
Methodology
Nine task environments; each task has multiple sub-goals. Success rate = fraction of complete task resolutions. Progress rate = average fraction of sub-goals completed, enabling partial-credit evaluation that discriminates between agent capability levels.
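The two metrics above can be sketched in a few lines. This is an illustrative reconstruction from the definitions given here, not code from the AgentBoard repository; the `TaskResult` record and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one evaluation task, tracked at sub-goal granularity."""
    subgoals_total: int      # number of sub-goals defined for the task
    subgoals_completed: int  # how many the agent actually achieved

    @property
    def solved(self) -> bool:
        # A task counts as a success only if every sub-goal is completed.
        return self.subgoals_completed == self.subgoals_total


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks fully resolved (all-or-nothing)."""
    return sum(r.solved for r in results) / len(results)


def progress_rate(results: list[TaskResult]) -> float:
    """Average fraction of sub-goals completed per task (partial credit)."""
    return sum(r.subgoals_completed / r.subgoals_total for r in results) / len(results)


# Example: one solved task, one half-done, one failed outright.
results = [TaskResult(4, 4), TaskResult(4, 2), TaskResult(5, 0)]
# success_rate -> 1/3; progress_rate -> (1.0 + 0.5 + 0.0) / 3 = 0.5
```

The partial-credit progress rate is what lets the benchmark separate an agent that nearly finishes every task from one that fails immediately, even when both have the same success rate.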
Last Run
2026-02-22
Tags
agent-evaluation, llm-benchmark, multi-task-evaluation, embodied-ai, web-browsing, tool-use, gaming-ai, database-ops, os-interaction, code-execution, puzzle-solving
Added
2026-03-17
Completeness
80%

Index Score

61.1
Adoption
65
Quality
88
Freshness
82
Citations
70
Engagement
0
