BigCodeBench
by Zhuo et al. / BigCode / Hugging Face · free · Last verified 2026-03-17
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
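To illustrate the style of task the benchmark targets, here is a minimal sketch of a function-level, multi-library problem with execution-based checks. This is an invented example for illustration only, not an actual BigCodeBench task; the function name `task_func` and its signature are assumptions.

```python
# Illustrative sketch (hypothetical task, not from the benchmark):
# write one function that integrates Pandas and NumPy, then verify
# its behavior by executing assertions, as BigCodeBench's harness does.
import numpy as np
import pandas as pd

def task_func(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Standardize the named numeric columns to zero mean and unit variance."""
    out = df.copy()
    for c in cols:
        vals = out[c].to_numpy(dtype=float)
        out[c] = (vals - vals.mean()) / vals.std()  # population std (ddof=0)
    return out

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10, 20, 30]})
res = task_func(df, ["a"])
```

Unlike algorithmic puzzles, grading here hinges on correct library usage (copy semantics, dtype handling), which is the kind of practical skill the benchmark measures.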
https://bigcode-bench.github.io
B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: free
- Capabilities: Evaluating LLM code generation proficiency; assessing multi-library code integration; testing complex function-level reasoning; benchmarking performance on data science tasks; measuring practical Python programming skills; validating model usage of NumPy, Pandas, and Scikit-learn; providing a standardized testbed for code models
- Integrations: none listed
- API Available: No
- Evaluated Models: gpt-4o, claude-opus-4, deepseek-coder-v2, qwen2-5-coder-32b
- Metrics: pass@1, pass@5
- Methodology: 1,140 function-level tasks with an average of 5.6 test cases each. Models generate a single Python function; execution-based pass@1 and pass@5 are computed over 10 generations per task. The instruction-following variant (BigCodeBench-Instruct) provides natural-language docstrings only.
- Last Run: 2026-02-28
- Tags: benchmark, code-generation, llm-evaluation, python, data-science, function-level, library-usage, numpy, pandas, scikit-learn, pragmatic-coding
- Added: 2026-03-17
- Completeness: 1%
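The metrics above can be sketched with the standard unbiased pass@k estimator (Chen et al., 2021), assuming n = 10 generations per task as the methodology states. The helper name `pass_at_k` is an assumption for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations (c correct) passes."""
    if n - c < k:
        # fewer than k failures exist, so any k-sample contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations for one task, 3 of which pass the test cases
p1 = pass_at_k(10, 3, 1)  # pass@1 = 1 - C(7,1)/C(10,1) = 0.3
p5 = pass_at_k(10, 3, 5)  # pass@5 = 1 - C(7,5)/C(10,5)
```

Per-benchmark scores are then the mean of these per-task estimates over all 1,140 tasks.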
Index Score: 66.3
- Adoption: 74
- Quality: 91
- Freshness: 85
- Citations: 74
- Engagement: 0