AgentBoard
by Ma et al. / Shanghai AI Lab · free · Last verified 2026-03-17
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM)-based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
https://hkust-nlp.github.io/agentboard/
Overall grade: B (Above Average)
Adoption: B · Quality: A · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- multi-task-agent-evaluation, sub-goal-progress-tracking, embodied-ai-benchmarking, web-browsing-agent-testing, tool-use-capability-assessment, database-operation-evaluation, os-interaction-simulation, code-execution-verification, comparative-agent-analysis
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, gemini-2-5-pro, llama-3-70b
- Metrics
- success-rate, progress-rate
- Methodology
- Nine task environments; each task decomposes into multiple sub-goals. Success rate = fraction of tasks fully solved. Progress rate = average fraction of sub-goals completed per task, a partial-credit measure that discriminates between agent capability levels even when no task is fully solved.
- Last Run
- 2026-02-22
- Tags
- agent-evaluation, llm-benchmark, multi-task-evaluation, embodied-ai, web-browsing, tool-use, gaming-ai, database-ops, os-interaction, code-execution, puzzle-solving
- Added
- 2026-03-17
- Completeness
- 80%
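The two metrics in the methodology above can be sketched in a few lines. This is an illustrative reimplementation, not AgentBoard's actual code; the example runs and sub-goal flags are made up.

```python
# Sketch of AgentBoard-style metrics: success rate (all-or-nothing)
# and progress rate (partial credit over sub-goals).

def success_rate(results):
    """Fraction of task episodes in which every sub-goal was completed."""
    return sum(all(r) for r in results) / len(results)

def progress_rate(results):
    """Average fraction of sub-goals completed per episode."""
    return sum(sum(r) / len(r) for r in results) / len(results)

# Each inner list holds per-sub-goal completion flags for one task episode
# (hypothetical data for illustration).
runs = [
    [True, True, True],     # fully solved
    [True, True, False],    # 2 of 3 sub-goals reached
    [False, False, False],  # no progress
]

print(success_rate(runs))   # 1/3: only one episode fully solved
print(progress_rate(runs))  # 5/9: (3/3 + 2/3 + 0/3) / 3
```

Note how the progress rate separates the second agent run from the third even though both count as failures under the success rate; this is the partial-credit discrimination the methodology describes.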
Index Score: 61.1
- Adoption: 65
- Quality: 88
- Freshness: 82
- Citations: 70
- Engagement: 0