BigCodeBench
by Zhuo et al. / BigCode / Hugging Face · open-source · Last verified 2026-03-17
BigCodeBench evaluates code generation models on 1,140 practical programming tasks that require using real Python libraries such as NumPy, Pandas, Scikit-learn, and Matplotlib. Unlike HumanEval's self-contained problems, its tasks demand multi-library integration and complex function-level reasoning, better reflecting real-world software engineering.
https://bigcode-bench.github.io
Overall Grade: B (Above Average)
- Adoption: B+
- Quality: A+
- Freshness: A
- Citations: B+
- Engagement: F
Specifications
- License
- Apache-2.0
- Pricing
- open-source
- Capabilities
- evaluation, code-generation, library-use
- Integrations
- huggingface
- Use Cases
- model-evaluation, code-ai, software-engineering
- API Available
- No
- Evaluated Models
- gpt-4o, claude-opus-4, deepseek-coder-v2, qwen2-5-coder-32b
- Metrics
- pass-at-1, pass-at-5
- Methodology
- 1,140 function-level tasks with an average of 5.6 test cases each. Models generate a single Python function; execution-based pass@1 and pass@5 are computed over 10 generations per task. The instruction-following variant (BigCodeBench-Instruct) uses natural language docstrings only.
- Last Run
- 2026-02-28
- Tags
- coding, function-level, libraries, pragmatic, python
- Added
- 2026-03-17
- Completeness
- 100%
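The pass@1 and pass@5 metrics in the methodology above are typically computed with the standard unbiased estimator from the HumanEval line of work: given n generations per task of which c pass all tests, pass@k is the probability that at least one of k randomly drawn samples passes. A minimal sketch, assuming BigCodeBench follows this standard estimator (the function name and example numbers here are illustrative, not taken from the benchmark's code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generations per task (10 in the methodology above)
    c: generations that pass all of the task's test cases
    k: sampling budget (1 or 5 here)
    """
    if n - c < k:
        # Fewer failures than the budget: at least one draw must pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 3 of 10 generations pass its tests.
print(round(pass_at_k(10, 3, 1), 3))  # → 0.3 (reduces to c/n for k=1)
print(round(pass_at_k(10, 3, 5), 3))  # → 0.917
```

The benchmark-wide pass@k score is then the mean of these per-task values over all 1,140 tasks; the complement form avoids the numerical instability of multiplying many small probabilities directly.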
Index Score: 66.3
- Adoption: 74
- Quality: 91
- Freshness: 85
- Citations: 74
- Engagement: 0