BigCodeBench
by Zhuo et al. / BigCode / Hugging Face · free · Last verified 2026-03-17
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
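To illustrate the style of task the benchmark targets, here is a minimal sketch of a function-level, multi-library problem with execution-based checks. This is an invented example for illustration only, not an actual BigCodeBench task; the function name `task_func` and its signature are assumptions.

```python
# Illustrative sketch (hypothetical task, not from the benchmark):
# write one function that integrates Pandas and NumPy, then verify
# its behavior by executing assertions, as BigCodeBench's harness does.
import numpy as np
import pandas as pd

def task_func(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Standardize the named numeric columns to zero mean and unit variance."""
    out = df.copy()
    for c in cols:
        vals = out[c].to_numpy(dtype=float)
        out[c] = (vals - vals.mean()) / vals.std()  # population std (ddof=0)
    return out

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10, 20, 30]})
res = task_func(df, ["a"])
```

Unlike algorithmic puzzles, grading here hinges on correct library usage (copy semantics, dtype handling), which is the kind of practical skill the benchmark measures.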
https://bigcode-bench.github.io
B (Above Average)
Adoption: B+ · Quality: A+ · Freshness: A · Citations: B+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: free
- Capabilities: Evaluating LLM code generation proficiency; assessing multi-library code integration; testing complex function-level reasoning; benchmarking performance on data science tasks; measuring practical Python programming skills; validating model usage of NumPy, Pandas, and Scikit-learn; providing a standardized testbed for code models
- Integrations: none listed
- API Available: No
- Evaluated Models: gpt-4o, claude-opus-4, deepseek-coder-v2, qwen2-5-coder-32b
- Metrics: pass@1, pass@5
- Methodology: 1,140 function-level tasks with an average of 5.6 test cases each. Models generate a single Python function; execution-based pass@1 and pass@5 are computed over 10 generations per task. The instruction-following variant (BigCodeBench-Instruct) provides natural-language docstrings only.
- Last Run: 2026-02-28
- Tags: benchmark, code-generation, llm-evaluation, python, data-science, function-level, library-usage, numpy, pandas, scikit-learn, pragmatic-coding
- Added: 2026-03-17
- Completeness: 1%
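The metrics above can be sketched with the standard unbiased pass@k estimator (Chen et al., 2021), assuming n = 10 generations per task as the methodology states. The helper name `pass_at_k` is an assumption for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations (c correct) passes."""
    if n - c < k:
        # fewer than k failures exist, so any k-sample contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations for one task, 3 of which pass the test cases
p1 = pass_at_k(10, 3, 1)  # pass@1 = 1 - C(7,1)/C(10,1) = 0.3
p5 = pass_at_k(10, 3, 5)  # pass@5 = 1 - C(7,5)/C(10,5)
```

Per-benchmark scores are then the mean of these per-task estimates over all 1,140 tasks.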
Index Score: 66.3
- Adoption: 74
- Quality: 91
- Freshness: 85
- Citations: 74
- Engagement: 0