Agent Evaluation
by AaaS · open-source · Last verified 2026-03-17
Provides a comprehensive framework for evaluating agentic systems across task completion rate, trajectory efficiency, tool use correctness, and safety violations. Covers trajectory-based evaluation with LLM judges, automated regression test harnesses, and leaderboard design for comparing agent versions.
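The core metrics named above (task completion rate, trajectory efficiency, safety violations) can be sketched in a few lines. This is a minimal illustration, not the skill's actual API; the `Trajectory` class and all function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One recorded agent run (hypothetical shape, for illustration only)."""
    task_id: str
    steps: list          # tool calls / messages the agent emitted
    completed: bool      # did the agent achieve the task goal?
    optimal_steps: int   # reference step count for this task
    violations: int = 0  # number of flagged safety violations

def task_completion_rate(trajs):
    """Fraction of tasks the agent finished successfully."""
    return sum(t.completed for t in trajs) / len(trajs)

def trajectory_efficiency(traj):
    """Reference steps over actual steps; 1.0 means the agent matched the reference."""
    return traj.optimal_steps / max(len(traj.steps), 1)

def safety_violation_rate(trajs):
    """Fraction of runs with at least one flagged safety violation."""
    return sum(t.violations > 0 for t in trajs) / len(trajs)
```

In a regression harness, these numbers would be computed per agent version over a fixed task suite and compared against the previous release's scores.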
https://aaas.blog/skill/agent-evaluation
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: A+ · Citations: C+ · Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: trajectory-evaluation, llm-as-judge, task-completion-measurement, safety-violation-detection, regression-testing
- Integrations: langsmith, braintrust, agentops, inspect-ai
- Use Cases: agent-benchmarking, ci-cd-for-agents, safety-testing, capability-regression-tracking
- API Available: No
- Difficulty: advanced
- Prerequisites: reflection, planning
- Supported Agents: claude-code
- Tags: evaluation, benchmarking, agent-testing, trajectory-eval, evals
- Added: 2026-03-17
- Completeness: 100%
Index Score: 56.9
- Adoption: 62
- Quality: 88
- Freshness: 90
- Citations: 58
- Engagement: 0