
TAU-bench

by Sierra AI · open-source · Last verified 2026-03-01

Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.

https://github.com/sierra-research/tau-bench
Grade: C+ (Average)
Adoption: C+, Quality: A, Freshness: A+, Citations: C+, Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: agent-evaluation, tool-use-testing, workflow-assessment
Integrations: docker
Use Cases: agent-benchmarking, customer-service-evaluation, tool-use-assessment
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
Metrics: success-rate, tool-accuracy, policy-compliance
Methodology: Realistic customer service scenarios requiring multi-step tool interactions. Agents interact with simulated users and tool APIs, and are evaluated on task completion and policy adherence.
Last Run: 2026-03-10
Tags: benchmark, evaluation, agents, tool-use, real-world
Added: 2026-03-17
Completeness: 100%
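
To make the three reported metrics concrete, here is a minimal sketch of how per-episode outcomes could be aggregated into success-rate, tool-accuracy, and policy-compliance figures. The `Episode` record and its field names are illustrative assumptions, not the actual tau-bench data format.

```python
# Hypothetical aggregation sketch -- the Episode fields below are
# assumptions for illustration, not tau-bench's real output schema.
from dataclasses import dataclass

@dataclass
class Episode:
    task_id: str
    completed: bool          # did the agent reach the goal state?
    tool_calls: int          # total tool invocations made
    correct_tool_calls: int  # invocations with a valid name and arguments
    policy_violations: int   # e.g. refunds issued against stated policy

def summarize(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate per-episode outcomes into the three reported metrics."""
    n = len(episodes)
    success_rate = sum(e.completed for e in episodes) / n
    total_calls = sum(e.tool_calls for e in episodes)
    tool_accuracy = sum(e.correct_tool_calls for e in episodes) / total_calls
    policy_compliance = sum(e.policy_violations == 0 for e in episodes) / n
    return {
        "success-rate": success_rate,
        "tool-accuracy": tool_accuracy,
        "policy-compliance": policy_compliance,
    }

episodes = [
    Episode("retail-001", True, 5, 5, 0),
    Episode("retail-002", False, 8, 6, 1),
    Episode("airline-001", True, 4, 3, 0),
]
print(summarize(episodes))
```

Success-rate and policy-compliance are fractions of episodes, while tool-accuracy is a fraction of individual tool calls, so a single failed episode with many bad calls drags tool-accuracy down more than the other two metrics.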

Index Score: 54.8
Adoption: 58
Quality: 88
Freshness: 92
Citations: 56
Engagement: 0
