HumanEval+
by EvalPlus · free · Last verified 2026-03-01
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
https://github.com/evalplus/evalplus
B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B · Engagement: F
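For a quick local run, the sketch below shows the typical EvalPlus workflow, assuming the `evalplus` package from the linked repository is installed (`pip install evalplus`); `generate_solution` is a hypothetical placeholder for the model under evaluation.

```python
# Minimal sketch of the EvalPlus workflow: fetch the 164 HumanEval+ problems,
# generate one candidate solution per task, write them to JSONL, then score
# against both the base and the 80x-augmented test suites.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_solution(prompt: str) -> str:
    # Hypothetical placeholder: call the model under test and return a
    # complete implementation of the prompted function.
    raise NotImplementedError

samples = [
    {"task_id": task_id, "solution": generate_solution(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Scoring is then run from the command line:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```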
Specifications
- License
- MIT
- Pricing
- free
- Capabilities
- functional correctness verification, robustness analysis of code generation models, edge case and boundary condition testing, bug detection in LLM-generated code, comparative model benchmarking, identifying false positives from standard evaluations, regression testing for code model updates
- Integrations
- Use Cases
- API Available
- No
- Evaluated Models
- claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics
- pass@1, pass@1-base-tests
- Methodology
- Uses the same 164 HumanEval problems with 80x more tests per problem, including edge cases and boundary conditions. Compares pass@1 on the original tests against pass@1 on the augmented tests to measure robustness (see the pass@1 sketch after this list).
- Last Run
- 2026-02-20
- Tags
- benchmark, evaluation, coding, rigorous-testing, edge-cases, code-generation, python, dataset, llm-evaluation, functional-correctness, robustness-testing
- Added
- 2026-03-17
- Completeness
- 90%
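The two metrics above are conventionally computed with the unbiased pass@k estimator from the original Codex evaluation. The sketch below, using hypothetical per-problem counts, shows how pass@1 on the original (base) tests can be compared against pass@1 on the augmented HumanEval+ tests; the gap between the two is the robustness signal this benchmark targets.

```python
# Unbiased pass@k estimator (Chen et al., 2021), specialised to k = 1,
# applied to hypothetical per-problem sample counts.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: (samples generated, passing base tests, passing plus tests)
results = {
    "HumanEval/0": (10, 9, 7),
    "HumanEval/1": (10, 4, 4),
}

base_pass1 = sum(pass_at_k(n, c_base, 1) for n, c_base, _ in results.values()) / len(results)
plus_pass1 = sum(pass_at_k(n, _, c_plus := c, 1) if False else pass_at_k(n, c, 1) for n, _, c in results.values()) / len(results)
print(f"pass@1 (base tests): {base_pass1:.3f}")
print(f"pass@1 (plus tests): {plus_pass1:.3f}")  # the gap indicates overfitting to weak tests
```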
Index Score: 63.8
- Adoption: 72
- Quality: 90
- Freshness: 84
- Citations: 68
- Engagement: 0