BenchmarkAI for Code v0.2

HumanEval+

by EvalPlus · open-source · Last verified 2026-03-01

Augmented version of HumanEval with 80x more test cases per problem, designed to catch subtle bugs that slip past the original, limited tests. It shows that models scoring high on HumanEval often fail on edge cases, boundary conditions, and uncommon inputs.

https://github.com/evalplus/evalplus
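
To illustrate what the augmented tests catch, here is a minimal, hypothetical HumanEval-style example. The tests below are invented for illustration and are not the benchmark's actual test cases: a flawed solution passes a sparse base-style test set yet fails an added edge case.

def below_zero(operations):
    # Return True if a running balance of deposits/withdrawals ever drops below zero.
    balance = 0
    for op in operations:
        balance += op
        if balance <= 0:  # bug: treats a balance of exactly zero as "below zero"
            return True
    return False

# Sparse base-style tests: too few to expose the off-by-one comparison.
print(below_zero([1, 2, 3]) is False and below_zero([1, 2, -4, 5]) is True)  # True -> looks correct

# Augmented-style edge case: the balance touches zero but never goes negative.
print(below_zero([1, -1, 2]))  # prints True, but the correct answer is False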
Grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, rigorous-code-testing, edge-case-assessment
Integrations
evalplus
Use Cases
code-model-evaluation, robustness-testing, edge-case-analysis
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
pass@1, pass@1-base-tests
Methodology
Runs the same 164 HumanEval problems with 80x more tests, including edge cases, and compares pass rates on the original tests vs. the augmented tests to measure robustness (see the sketch below this list).
Last Run
2026-02-20
Tags
benchmark, evaluation, coding, rigorous-testing, edge-cases
Added
2026-03-17
Completeness
100%
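
As context for the Methodology row above, here is a minimal sketch of how pass@1 on the base vs. augmented ("plus") tests is typically produced with the evalplus package. Function and CLI names follow the public EvalPlus README; generate_solution is a placeholder for your own model call, and exact flags may vary between versions.

from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Placeholder: call your code model here and return its completion for the prompt.
    raise NotImplementedError

# Generate one solution per HumanEval+ problem and write them to a JSONL file.
samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Score with the EvalPlus harness, which reports pass@1 on both the original
# HumanEval tests (base) and the 80x-augmented HumanEval+ tests (plus):
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl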

Index Score

63.8
Adoption
72
Quality
90
Freshness
84
Citations
68
Engagement
0
