BenchmarkAI for Code v0.2

HumanEval+

by EvalPlus · free · Last verified 2026-03-01

HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset, expanding the test suite for each of its 164 problems roughly 80-fold. The extra tests uncover subtle bugs and edge-case failures that simpler benchmarks miss, giving a more accurate measure of a model's true coding ability.

https://github.com/evalplus/evalplus
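For teams wiring this benchmark into their own pipeline, the EvalPlus package exposes the augmented problems programmatically. The sketch below is a minimal harness, assuming the get_human_eval_plus and write_jsonl entry points described in the repository README; generate_solution is a hypothetical placeholder for whatever model client you actually call. Verify the names against the release you pin before relying on them.

# Minimal EvalPlus harness sketch; generate_solution() is a hypothetical
# stand-in for your code model (API client, local checkpoint, etc.).
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    """Return a completed solution for one HumanEval+ prompt."""
    raise NotImplementedError("call your code generation model here")

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring (shell), reporting pass@1 on both the base and augmented tests:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
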
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B · Engagement: F

Specifications

License
MIT
Pricing
free
Capabilities
functional correctness verification, robustness analysis of code generation models, edge case and boundary condition testing, bug detection in LLM-generated code, comparative model benchmarking, identifying false positives from standard evaluations, regression testing for code model updates
Integrations
Use Cases
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
pass@1, pass@1-base-tests
Methodology
Same 164 HumanEval problems with 80x more tests covering edge cases; pass rates on the original tests are compared against the augmented tests to measure robustness (see the metric sketch after this specifications list).
Last Run
2026-02-20
Tags
benchmark, evaluation, coding, rigorous-testing, edge-cases, code-generation, python, dataset, llm-evaluation, functional-correctness, robustness-testing
Added
2026-03-17
Completeness
90%
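
As a quick illustration of how the two reported metrics relate, the sketch below computes pass@1 on the base tests, pass@1 on the augmented tests, and the gap between them from per-task outcomes. The records and field names (base_pass, plus_pass) are hypothetical stand-ins for illustration, not the EvalPlus output format.

# Hypothetical per-task outcomes; field names are illustrative only.
results = [
    {"task_id": "HumanEval/0", "base_pass": True,  "plus_pass": True},
    {"task_id": "HumanEval/1", "base_pass": True,  "plus_pass": False},
    {"task_id": "HumanEval/2", "base_pass": False, "plus_pass": False},
]

n = len(results)
pass1_base = sum(r["base_pass"] for r in results) / n  # pass@1-base-tests
pass1_plus = sum(r["plus_pass"] for r in results) / n  # pass@1 on augmented tests

# Solutions that pass the original tests but fail the added edge-case tests
# are exactly the false positives HumanEval+ is designed to expose.
robustness_gap = pass1_base - pass1_plus
print(f"pass@1 (base tests):      {pass1_base:.2%}")
print(f"pass@1 (augmented tests): {pass1_plus:.2%}")
print(f"robustness gap:           {robustness_gap:.2%}")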

Index Score: 63.8
Adoption: 72 · Quality: 90 · Freshness: 84 · Citations: 68 · Engagement: 0
