SWE-bench Verified
by Princeton NLP · open-source · Last verified 2026-03-01
Human-validated subset of SWE-bench containing 500 problems verified by software engineers for correctness, clarity, and solvability. Provides a more reliable signal than the full SWE-bench by filtering out ambiguous or under-specified issues.
https://www.swebench.com
Rating: B+ (Good)
Grades — Adoption: A · Quality: A+ · Freshness: A+ · Citations: A · Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: model-evaluation, agent-evaluation, software-engineering-assessment
- Integrations: docker, github
- Use Cases: agent-benchmarking, coding-evaluation, software-engineering-assessment
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
- Metrics: resolve-rate, pass@1
- Methodology: 500 human-verified GitHub issues from real open-source projects. Models must produce patches that pass all repository tests in isolated Docker environments.
- Last Run: 2026-03-01
- Tags: benchmark, evaluation, software-engineering, agents, verified
- Added: 2026-03-17
- Completeness: 100%
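The two reported metrics are straightforward to compute from per-instance results. A minimal sketch, assuming a hypothetical results mapping of instance ID to whether the model's patch passed all repository tests (the real SWE-bench harness emits its own report format):

```python
# Hedged sketch: computing resolve-rate and pass@1 from per-instance
# outcomes. The results format below is hypothetical, not the actual
# SWE-bench harness report schema.

def resolve_rate(results: dict) -> float:
    """Fraction of issues whose generated patch passed all repository tests."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

def pass_at_1(resolved: int, total: int) -> float:
    """With one sampled patch per issue, pass@1 reduces to resolved/total."""
    return resolved / total if total else 0.0

# Example with five hypothetical instance IDs, four resolved.
results = {
    "astropy__astropy-12907": True,
    "django__django-11001": True,
    "django__django-11019": False,
    "sympy__sympy-20590": True,
    "requests__requests-2317": True,
}
print(f"resolve-rate: {resolve_rate(results):.1%}")  # 80.0%
```

With a single patch attempt per issue, resolve-rate and pass@1 coincide; they diverge only when multiple patches are sampled per instance.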
Index Score: 74.4
- Adoption: 84
- Quality: 94
- Freshness: 90
- Citations: 88
- Engagement: 0