
SWE-bench

by Princeton NLP · open-source · Last verified 2026-03-01

Benchmark for evaluating LLMs and AI agents on real-world software engineering tasks drawn from GitHub issues. Tests the ability to understand codebases, diagnose bugs, and produce working patches.

https://www.swebench.com
B+ (Good)
Adoption: A · Quality: A+ · Freshness: A+ · Citations: A+ · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, agent-evaluation, code-generation-testing, regression-testing
Integrations
github, docker
Use Cases
model-comparison, agent-benchmarking, coding-ability-assessment, research
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3
Metrics
resolve-rate, pass@1, patch-accuracy
Methodology
Real GitHub issues paired with validated test patches. Models must produce code patches that pass each repository's test suite, executed in isolated Docker environments (see the sketch after these specifications).
Last Run
2026-02-28
Tags
benchmark, coding, software-engineering, evaluation, agents
Added
2026-03-10
Completeness
100%
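
As context for the methodology above, the sketch below shows how predictions are typically prepared for the SWE-bench harness. It is a minimal illustration under stated assumptions, not the project's official workflow verbatim: it assumes the datasets and swebench Python packages are installed, that the Verified split is published on Hugging Face as princeton-nlp/SWE-bench_Verified, and that generate_patch is a hypothetical stub standing in for whatever model or agent produces the candidate diff.

# Minimal sketch: build a predictions file in the schema the SWE-bench
# harness expects. Assumes `pip install datasets swebench` and network
# access to Hugging Face.
import json
from datasets import load_dataset

def generate_patch(instance: dict) -> str:
    # Hypothetical stub: a real system would run a model or agent against
    # the repository checked out at instance["base_commit"] and return a
    # unified diff that resolves instance["problem_statement"].
    return ""

# Each SWE-bench Verified instance is a real GitHub issue plus the tests
# that validate a fix.
dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

with open("predictions.jsonl", "w") as f:
    for instance in dataset.select(range(3)):  # a few instances, for illustration
        f.write(json.dumps({
            "instance_id": instance["instance_id"],
            "model_name_or_path": "my-model",         # identifies the run
            "model_patch": generate_patch(instance),  # the candidate diff
        }) + "\n")

The harness then applies each patch inside an isolated Docker container and runs the repository's test suite; resolve rate is the fraction of instances whose tests pass. At the time of writing the evaluation entry point is python -m swebench.harness.run_evaluation, though the exact flags vary across harness versions.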

Index Score

Overall
77.4
Adoption
88
Quality
92
Freshness
90
Citations
95
Engagement
0
