HumanEval+
by EvalPlus · free · Last verified 2026-03-01
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
https://github.com/evalplus/evalplus
B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B · Engagement: F
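For a quick local run, the sketch below shows the typical EvalPlus workflow, assuming the `evalplus` package from the linked repository is installed (`pip install evalplus`); `generate_solution` is a hypothetical placeholder for the model under evaluation.

```python
# Minimal sketch of the EvalPlus workflow: fetch the 164 HumanEval+ problems,
# generate one candidate solution per task, write them to JSONL, then score
# against both the base and the 80x-augmented test suites.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical placeholder: call the model under test and return a
    # complete implementation of the prompted function.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run from the command line:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```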
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- functional correctness verification, robustness analysis of code generation models, edge case and boundary condition testing, bug detection in LLM-generated code, comparative model benchmarking, identifying false positives from standard evaluations, regression testing for code model updates
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics
- pass@1, pass@1-base-tests
- Methodology
- Uses the same 164 HumanEval problems with 80x more tests per problem, including edge cases and boundary conditions. Compares pass@1 on the original tests against pass@1 on the augmented tests to measure robustness (see the pass@1 sketch after this list).
- Last Run
- 2026-02-20
- Tags
- benchmark, evaluation, coding, rigorous-testing, edge-cases, code-generation, python, dataset, llm-evaluation, functional-correctness, robustness-testing
- Added
- 2026-03-17
- Completeness
- 90%
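The two metrics above are conventionally computed with the unbiased pass@k estimator from the original Codex evaluation. The sketch below, using hypothetical per-problem counts, shows how pass@1 on the original (base) tests can be compared against pass@1 on the augmented HumanEval+ tests; the gap between the two is the robustness signal this benchmark targets.

```python
# Unbiased pass@k estimator (Chen et al., 2021), specialised to k = 1,
# applied to hypothetical per-problem sample counts.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: (samples generated, passing base tests, passing plus tests)
results = {
    "HumanEval/0": (10, 9, 7),
    "HumanEval/1": (10, 4, 4),
}

base_pass1 = sum(pass_at_k(n, c_base, 1) for n, c_base, _ in results.values()) / len(results)
plus_pass1 = sum(pass_at_k(n, _, c_plus := c, 1) if False else pass_at_k(n, c, 1) for n, _, c in results.values()) / len(results)
print(f"pass@1 (base tests): {base_pass1:.3f}")
print(f"pass@1 (plus tests): {plus_pass1:.3f}")  # the gap indicates overfitting to weak tests
```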
Index Score: 63.8
- Adoption: 72
- Quality: 90
- Freshness: 84
- Citations: 68
- Engagement: 0