Benchmark · AI Agents · v1.0

ToolBench

by Qin et al. / Tsinghua University · open-source · Last verified 2026-03-17

ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
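
To make "multi-step API call sequence" concrete, here is a minimal sketch of a plan in which a later call consumes the output of an earlier one. Everything in it is an assumption for illustration: `search_city` and `get_forecast` are hypothetical local stubs standing in for REST endpoints, and the `"$N.field"` reference syntax and `run_plan` helper are invented here. It is not ToolBench's actual harness or data format.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins for two REST endpoints (no network access).
def search_city(name: str) -> Dict[str, Any]:
    return {"city_id": 42, "name": name}

def get_forecast(city_id: int) -> Dict[str, Any]:
    return {"city_id": city_id, "forecast": "sunny"}

TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "search_city": search_city,
    "get_forecast": get_forecast,
}

def run_plan(plan: List[Dict[str, Any]],
             tools: Dict[str, Callable[..., Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Execute steps in order; "$N.field" pulls `field` from step N's result."""
    results: List[Dict[str, Any]] = []
    for step in plan:
        args: Dict[str, Any] = {}
        for key, value in step["args"].items():
            if isinstance(value, str) and value.startswith("$"):
                # Resolve a reference to an earlier step's output.
                index, field = value[1:].split(".", 1)
                args[key] = results[int(index)][field]
            else:
                args[key] = value
        results.append(tools[step["tool"]](**args))
    return results

# A two-step plan: look up a city, then fetch its forecast by ID.
plan = [
    {"tool": "search_city", "args": {"name": "Beijing"}},
    {"tool": "get_forecast", "args": {"city_id": "$0.city_id"}},
]
trace = run_plan(plan, TOOLS)
```

A benchmark of this kind would judge whether the model chose the right tools, ordered them correctly, and passed the right values between calls; the stubs above only show the data flow.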

https://github.com/OpenBMB/ToolBench

Overall grade: B (Above Average)

Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F

Specifications

License: Apache-2.0
Pricing: open-source
Capabilities: evaluation, tool-use, api-integration, agent-planning
Integrations: rapidapi
Use Cases: model-evaluation, ai-agents, tool-augmented-llm
API Available: No
Evaluated Models: gpt-4o, claude-opus-4, toolllama, llama-3-70b
Metrics: pass-rate, win-rate, solvable-pass-rate
Methodology: Each instruction requires a single-tool or multi-tool sequence of API calls. Models interact with live or cached APIs. Solutions are scored by ChatGPT preference judgments (win rate) and by functional correctness (pass rate). Solvable-pass-rate is the pass rate restricted to instructions that have a valid API solution.
Last Run: 2026-02-14
Tags: tool-use, api, agents, rest, planning
Added: 2026-03-17
Completeness: 100%
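
The three metrics named in the Metrics and Methodology entries can be sketched as simple ratios over per-instruction results. This is a minimal illustration under stated assumptions: the `EvalRecord` type and its fields (`solvable`, `passed`, `preferred`) are hypothetical names chosen here, and the verdicts are assumed to have been produced already by whatever judge is in use. It is not the benchmark's own scoring code.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalRecord:
    instruction_id: str
    solvable: bool             # assumed flag: a valid API solution exists
    passed: bool               # functional-correctness verdict
    preferred: Optional[bool]  # judge preferred this answer over the reference; None if not compared

def pass_rate(records: List[EvalRecord]) -> float:
    """Fraction of instructions whose solution passed."""
    return sum(r.passed for r in records) / len(records) if records else 0.0

def solvable_pass_rate(records: List[EvalRecord]) -> float:
    """Pass rate restricted to instructions flagged as solvable."""
    return pass_rate([r for r in records if r.solvable])

def win_rate(records: List[EvalRecord]) -> float:
    """Fraction of judged comparisons in which the candidate was preferred."""
    judged = [r for r in records if r.preferred is not None]
    return sum(bool(r.preferred) for r in judged) / len(judged) if judged else 0.0

# Made-up records for illustration only; not benchmark results.
records = [
    EvalRecord("a", solvable=True, passed=True, preferred=True),
    EvalRecord("b", solvable=True, passed=False, preferred=False),
    EvalRecord("c", solvable=False, passed=False, preferred=None),
    EvalRecord("d", solvable=True, passed=True, preferred=True),
]
```

Filtering on `solvable` before computing the ratio keeps instructions with no valid solution from counting against the model, which is why the solvable pass rate is never lower than the plain pass rate on the same records.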
