HumanEval
by OpenAI · open-source · Last verified 2026-03-01
Hand-written Python programming problems with function signatures, docstrings, and test cases for evaluating code generation. Each problem requires implementing a function that passes a set of unit tests, measuring functional correctness rather than textual similarity.
https://github.com/openai/human-eval
B+ (Good)
Adoption: A+ · Quality: A · Freshness: B+ · Citations: A+ · Engagement: F
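To make the task format concrete, the sketch below shows an illustrative HumanEval-style problem; the function, tests, and completion are invented for illustration and are not taken from the benchmark. The model is prompted with the signature and docstring, and its completion counts as correct only if the accompanying unit tests run without raising an assertion.

```python
# Illustrative HumanEval-style problem (not an actual benchmark item).
# The prompt the model sees: a function signature plus a docstring.
PROMPT = '''
def running_max(numbers: list) -> list:
    """Return a list where element i is the maximum of numbers[: i + 1]."""
'''

# A candidate completion produced by the model under evaluation.
COMPLETION = '''
    result, current = [], float("-inf")
    for x in numbers:
        current = max(current, x)
        result.append(current)
    return result
'''

# Unit tests: plain asserts executed against the candidate function.
CHECK = '''
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([7]) == [7]
    assert candidate([]) == []

check(running_max)
'''

namespace: dict = {}
exec(PROMPT + COMPLETION, namespace)  # define the generated function
exec(CHECK, namespace)                # raises AssertionError on any failed test
print("all tests passed")
```

Grading is by execution rather than string similarity: two completions with very different source text score identically as long as both pass the tests. The real harness additionally runs completions in isolated processes with timeouts, since it executes untrusted model-generated code.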
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: model-evaluation, code-generation-testing, functional-correctness-assessment
- Integrations: lm-eval-harness
- Use Cases: code-model-comparison, coding-ability-assessment, research
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro, deepseek-v3, llama-4-405b
- Metrics: pass@1, pass@10
- Methodology: 164 hand-written Python programming problems. Models generate function implementations that are scored by executing the accompanying unit tests. Pass@k estimates the probability that at least one of k sampled solutions is correct (see the estimator sketch after this list).
- Last Run: 2026-02-15
- Tags: benchmark, evaluation, coding, python, function-generation
- Added: 2026-03-17
- Completeness: 100%
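The pass@1 and pass@10 numbers are normally computed with the unbiased estimator introduced in the HumanEval paper: generate n ≥ k samples per problem, count the c samples that pass the unit tests, estimate the chance that at least one of k randomly drawn samples is correct, and average over problems. A minimal sketch (the sample counts in the example are invented):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimator of pass@k:
    1 - C(n - c, k) / C(n, k), evaluated as a product for numerical stability."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k draw contains a correct one.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example with invented counts: 200 samples generated for a problem, 37 pass its tests.
print(pass_at_k(200, 37, 1))   # 0.185, i.e. c / n when k == 1
print(pass_at_k(200, 37, 10))  # probability that at least one of 10 draws is correct
```

The benchmark-level score is the mean of this quantity across all 164 problems.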
Index Score: 78.4
- Adoption: 94
- Quality: 84
- Freshness: 72
- Citations: 96
- Engagement: 0