TAU-bench
by Sierra AI · open-source · Last verified 2026-03-01
Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.
https://github.com/sierra-research/tau-bench
Overall grade: C+ (Average)
Adoption: C+ · Quality: A · Freshness: A+ · Citations: C+ · Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: agent-evaluation, tool-use-testing, workflow-assessment
- Integrations: docker
- Use Cases: agent-benchmarking, customer-service-evaluation, tool-use-assessment
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
- Metrics: success-rate, tool-accuracy, policy-compliance
- Methodology: Realistic customer service scenarios requiring multi-step tool interactions. Agents converse with simulated users, call simulated tool APIs, and are evaluated on task completion and policy adherence (see the sketch after this list).
- Last Run: 2026-03-10
- Tags: benchmark, evaluation, agents, tool-use, real-world
- Added: 2026-03-17
- Completeness: 100%
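The methodology above reduces to an episodic loop: the agent exchanges messages with a simulated user, invokes domain tools, and the episode is scored on task completion and policy compliance. The sketch below illustrates that loop only; it is not tau-bench's actual API, and every name in it (`run_episode`, `EpisodeResult`, the callables passed in) is a hypothetical placeholder.

```python
"""Illustrative sketch of a TAU-bench-style evaluation loop.

NOT the tau-bench API: all names here are hypothetical placeholders
mirroring the methodology described above (agent <-> simulated user
<-> simulated tool APIs, scored on completion and policy adherence).
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class EpisodeResult:
    success: bool            # did the agent reach the task goal?
    tool_calls: int          # number of tool invocations made
    policy_violations: int   # actions taken outside domain policy


def run_episode(agent_step: Callable[[str], dict],
                user_reply: Callable[[str], str],
                tools: dict[str, Callable[..., str]],
                check_goal: Callable[[], bool],
                check_policy: Callable[[str, dict], bool],
                max_turns: int = 30) -> EpisodeResult:
    """Drive one agent / simulated-user conversation to completion."""
    message = "Hi, I need help with my order."   # opening user utterance
    tool_calls = 0
    violations = 0
    for _ in range(max_turns):
        action = agent_step(message)             # agent decides its next move
        if action["type"] == "tool_call":
            name, args = action["name"], action["args"]
            if not check_policy(name, args):     # e.g. a refund the policy forbids
                violations += 1
            message = tools[name](**args)        # tool output fed back to the agent
            tool_calls += 1
        elif action["type"] == "respond":
            message = user_reply(action["text"]) # simulated user answers the agent
        else:                                    # agent signals it is finished
            break
    return EpisodeResult(success=check_goal(),
                         tool_calls=tool_calls,
                         policy_violations=violations)
```

The reported metrics (success-rate, tool-accuracy, policy-compliance) would be aggregated over many such episodes per domain.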
Index Score: 54.8
- Adoption: 58
- Quality: 88
- Freshness: 92
- Citations: 56
- Engagement: 0
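Note that 54.8 is not the unweighted mean of the five sub-scores (that would be 58.8), so the index presumably applies category weights that are not published here. The snippet below shows a generic weighted-mean aggregation with purely hypothetical equal weights, just to make the arithmetic explicit.

```python
# Generic weighted-mean aggregation for a composite index score.
# The weights are HYPOTHETICAL placeholders; the actual weighting behind
# the 54.8 composite is not disclosed on this page.
sub_scores = {"adoption": 58, "quality": 88, "freshness": 92,
              "citations": 56, "engagement": 0}
weights = {k: 0.2 for k in sub_scores}  # placeholder: equal weights

index = sum(sub_scores[k] * weights[k] for k in sub_scores) / sum(weights.values())
print(round(index, 1))  # 58.8 with equal weights, not the 54.8 shown above
```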