HumanEval+
by the EvalPlus team · open-source · Last verified 2026-03-01
Augmented version of HumanEval with 80x more test cases per problem, designed to catch subtle bugs that slip past the original, limited tests. Reveals that models scoring high on HumanEval often fail on edge cases, boundary conditions, and uncommon inputs.
https://github.com/evalplus/evalplus
Overall grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B · Engagement: F
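To make the failure mode concrete, here is a minimal, hypothetical sketch (the task, solution, and tests below are illustrative, not drawn from the benchmark): a solution that satisfies a sparse HumanEval-style test suite yet breaks on a boundary input of the kind the augmented suite adds.

```python
# Hypothetical HumanEval-style task (not an actual benchmark problem):
# return the largest element of a list.
def max_element(nums: list[int]) -> int:
    result = 0                      # bug: wrong seed value for all-negative inputs
    for n in nums:
        if n > result:
            result = n
    return result

# Sparse base-style tests: both pass, so the bug goes unnoticed.
assert max_element([1, 2, 3]) == 3
assert max_element([5, 3, 9, 0]) == 9

# Augmented edge-case test: an all-negative input exposes the bad seed,
# raising AssertionError (max_element returns 0 instead of -1).
assert max_element([-5, -1, -3]) == -1
```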
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: model-evaluation, rigorous-code-testing, edge-case-assessment
- Integrations: evalplus
- Use Cases: code-model-evaluation, robustness-testing, edge-case-analysis
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics: pass@1, pass@1-base-tests
- Methodology: The same 164 HumanEval problems with 80x more tests, including edge cases. Compares pass rates on the original tests vs. the augmented tests to measure robustness (see the sketch after this list).
- Last Run: 2026-02-20
- Tags: benchmark, evaluation, coding, rigorous-testing, edge-cases
- Added: 2026-03-17
- Completeness: 100%
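A hedged sketch of how a model is typically scored with the evalplus toolkit: generate one solution per task from the HumanEval+ problem set, write the samples to a JSONL file, and let the evaluator report pass@1 on the base tests and on the base-plus-extra tests. The data helpers below follow the EvalPlus README at the time of writing; `generate_one_completion` is a placeholder for the model under evaluation, and the exact interface may have changed, so check the repository.

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    """Placeholder: call the model under evaluation with the task prompt
    and return its candidate solution as Python source code."""
    raise NotImplementedError

# One sample per task; greedy decoding corresponds to the pass@1 metric.
samples = [
    {"task_id": task_id, "solution": generate_one_completion(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```

Running `evalplus.evaluate --dataset humaneval --samples samples.jsonl` then reports two pass@1 numbers, one for the original HumanEval tests (base) and one including the augmented tests (base + extra); the gap between them is the robustness signal this benchmark measures.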
Index Score: 63.8
- Adoption: 72
- Quality: 90
- Freshness: 84
- Citations: 68
- Engagement: 0