
BigCodeBench

by Zhuo et al. / BigCode / Hugging Face · open-source · Last verified 2026-03-17

BigCodeBench evaluates code-generation models on 1,140 practical programming tasks that require real Python libraries such as NumPy, Pandas, scikit-learn, and Matplotlib. Unlike HumanEval's self-contained problems, its tasks demand multi-library integration and complex function-level reasoning, better reflecting real-world software engineering.
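For illustration only, a hypothetical task in the BigCodeBench style (not one drawn from the benchmark; the `task_func` name and the task itself are invented here) asks for a single function that composes several libraries:

```python
import pandas as pd
import numpy as np

def task_func(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Normalize `column` to zero mean and unit variance,
    then drop rows more than 3 standard deviations away."""
    values = df[column].to_numpy(dtype=float)
    std = values.std()
    # Guard against a zero-variance column (all values identical).
    if std == 0:
        normalized = np.zeros_like(values)
    else:
        normalized = (values - values.mean()) / std
    out = df.copy()
    out[column] = normalized
    return out[np.abs(out[column]) <= 3]
```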

https://bigcode-bench.github.io
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F

Specifications

License
Apache-2.0
Pricing
open-source
Capabilities
evaluation, code-generation, library-use
Integrations
huggingface
Use Cases
model-evaluation, code-ai, software-engineering
API Available
No
Evaluated Models
gpt-4o, claude-opus-4, deepseek-coder-v2, qwen2-5-coder-32b
Metrics
pass@1, pass@5
Methodology
1,140 function-level tasks with an average of 5.6 test cases each. Models generate a single Python function; execution-based pass@1 and pass@5 are computed over 10 generations per task. The instruction-following variant (BigCodeBench-Instruct) uses natural-language docstrings only.
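The pass@k figures can be computed with the standard unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch, assuming the harness records how many of the n generations per task pass their tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n total generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: a task with 10 generations, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))  # 0.30
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```

The benchmark score averages this per-task estimate over all 1,140 tasks.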
Last Run
2026-02-28
Tags
coding, function-level, libraries, pragmatic, python
Added
2026-03-17
Completeness
100%

Index Score

Overall
66.3
Adoption
74
Quality
91
Freshness
85
Citations
74
Engagement
0
