
HumanEval

by OpenAI · open-source · Last verified 2026-03-01

164 hand-written Python programming problems, each with a function signature, docstring, and test cases, for evaluating code generation. A model must implement each function so that it passes a set of unit tests, measuring functional correctness rather than textual similarity.
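For illustration, the dataset's opening task (HumanEval/0) pairs a typed function stub and docstring with unit tests; the body shown below is one passing implementation, not necessarily the dataset's canonical solution:

    from typing import List

    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        """Check if any two numbers in the list are closer to each other
        than the given threshold.
        >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
        False
        >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
        True
        """
        # One passing implementation: compare every ordered pair of values.
        for i, a in enumerate(numbers):
            for j, b in enumerate(numbers):
                if i != j and abs(a - b) < threshold:
                    return True
        return False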

https://github.com/openai/human-eval
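Based on the repository's documented workflow, evaluation is a two-step process: write model completions to a JSONL file, then score them with the bundled evaluator. In the sketch below, generate_one_completion is a placeholder for your own model call:

    from human_eval.data import read_problems, write_jsonl

    # Each problem maps task_id -> {"prompt", "entry_point", "test", ...}
    problems = read_problems()

    samples = [
        dict(task_id=task_id,
             completion=generate_one_completion(problems[task_id]["prompt"]))
        for task_id in problems
    ]
    write_jsonl("samples.jsonl", samples)

    # Then score the file (this executes untrusted model code; sandbox it):
    #   $ evaluate_functional_correctness samples.jsonl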
Overall Grade: B+ (Good)
Adoption: A+ · Quality: A · Freshness: B+ · Citations: A+ · Engagement: F

Specifications

License
MIT
Pricing
open-source
Capabilities
model-evaluation, code-generation-testing, functional-correctness-assessment
Integrations
lm-eval-harness
Use Cases
code-model-comparison, coding-ability-assessment, research
API Available
No
Evaluated Models
claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
Metrics
pass@1, pass@10
Methodology
164 hand-written Python programming problems. Models generate function implementations, which are scored by executing the accompanying test cases. pass@k estimates the probability that at least one of k generated samples passes all tests (see the estimator sketch after these specifications).
Last Run
2026-02-15
Tags
benchmark, evaluation, coding, python, function-generation
Added
2026-03-17
Completeness
100%
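The pass@k metric uses the benchmark's unbiased estimator: with n samples per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over all 164 tasks. A minimal sketch of the per-task estimator, assuming numpy is available:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased per-task pass@k: n samples generated, c passed all tests.
        Computes 1 - C(n-c, k) / C(n, k) without forming large binomials."""
        if n - c < k:
            return 1.0  # fewer than k failures, so any k draws include a pass
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 200 samples per task, 53 of which are correct
    print(pass_at_k(200, 53, 1))   # 0.265
    print(pass_at_k(200, 53, 10))  # ~0.96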

Index Score

Overall
78.4
Adoption
94
Quality
84
Freshness
72
Citations
96
Engagement
0
