ToolBench
by Qin et al. / Tsinghua University · open-source · Last verified 2026-03-17
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
https://github.com/OpenBMB/ToolBench
Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: evaluation, tool-use, api-integration, agent-planning
- Integrations: rapidapi
- Use Cases: model-evaluation, ai-agents, tool-augmented-llm
- API Available: No
- Evaluated Models: gpt-4o, claude-opus-4, toolllama, llama-3-70b
- Metrics: pass-rate, win-rate, solvable-pass-rate
- Methodology: Instructions require single-tool or multi-tool API call sequences. Models interact with live or cached APIs; solutions are evaluated by ChatGPT preference scoring (win rate) and functional correctness (pass rate). Solvable-pass-rate filters to instructions with valid API solutions.
- Last Run: 2026-02-14
- Tags: tool-use, api, agents, rest, planning
- Added: 2026-03-17
- Completeness: 100%
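The three metrics in the Methodology entry can be illustrated with a minimal sketch. It assumes each evaluated instruction has already been reduced to three boolean outcomes; the `EvalRecord` schema, its field names, and the helper functions are hypothetical and are not taken from the ToolBench codebase. The sample records are invented purely to show the arithmetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRecord:
    """One evaluated instruction (hypothetical schema, not ToolBench's own)."""
    passed: bool     # the model's API call sequence completed the instruction
    solvable: bool   # the instruction has a valid API solution
    preferred: bool  # the judge preferred this solution over the reference


def _rate(flags: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def pass_rate(records: List[EvalRecord]) -> float:
    """Share of all instructions the model completed."""
    return _rate([r.passed for r in records])


def solvable_pass_rate(records: List[EvalRecord]) -> float:
    """Pass rate restricted to instructions that have a valid API solution."""
    return _rate([r.passed for r in records if r.solvable])


def win_rate(records: List[EvalRecord]) -> float:
    """Share of instructions where the judge preferred the model's solution."""
    return _rate([r.preferred for r in records])


# Illustrative records only; these are not benchmark results.
records = [
    EvalRecord(passed=True, solvable=True, preferred=True),
    EvalRecord(passed=False, solvable=True, preferred=False),
    EvalRecord(passed=False, solvable=False, preferred=False),
    EvalRecord(passed=True, solvable=True, preferred=False),
]

print(f"pass rate:          {pass_rate(records):.2f}")
print(f"solvable pass rate: {solvable_pass_rate(records):.2f}")
print(f"win rate:           {win_rate(records):.2f}")
```

Filtering out unsolvable instructions shrinks the denominator, so the solvable pass rate is never lower than the plain pass rate as long as no unsolvable instruction is counted as passed.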
Index Score: 67
- Adoption: 74
- Quality: 88
- Freshness: 79
- Citations: 79
- Engagement: 0