HumanEval+

Evaluate your code generation models using HumanEval+, an extended version of OpenAI's HumanEval benchmark. This Action Pack guides you through setting up the benchmark, generating code solutions, and running the enhanced evaluation with additional test cases.

Tags: code, evaluation, testing, code-generation, llm-evaluation, benchmark, python

5 Steps

1. Clone the HumanEval+ Repository: Obtain the HumanEval+ benchmark by cloning its official GitHub repository to your local machine.
2. Set Up Your Environment: Navigate into the cloned directory and install the required Python dependencies to prepare your evaluation environment.
3. Generate Code Completions: Integrate your code generation LLM to produce solutions for the problems defined in HumanEval+. Save these completions in the expected format (e.g., JSONL) for evaluation.
4. Run the Evaluation Script: Execute the HumanEval+ evaluation script against your generated code completions. This script will run the original and extended test cases.
5. Analyze Evaluation Results: Review the output from the evaluation script, focusing on pass@k metrics and detailed results for both original and additional test cases to understand your model's performance.
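
As a rough illustration of steps 3 and 5, the sketch below writes completions to a JSONL file and computes the standard unbiased pass@k estimate, 1 - C(n - c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that passed. The helper names write_samples and pass_at_k and the record fields "task_id" and "solution" are assumptions made for this example, not taken from the HumanEval+ repository. Check the repository's documentation for the exact schema its evaluation script expects.

```python
import json
import math
from pathlib import Path


def write_samples(samples, path):
    """Write one JSON object per line (JSONL).

    The field names inside each sample are an assumption for this
    sketch; use whatever schema the evaluation script documents.
    """
    with Path(path).open("w", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(sample) + "\n")


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated for the problem
    c: number of those samples that passed every test
    k: sample budget being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0
    # whenever fewer than k samples failed.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    # Hypothetical completion record; replace with real model output.
    demo = [{"task_id": "HumanEval/0", "solution": "def f():\n    pass\n"}]
    write_samples(demo, "samples.jsonl")
    # Example: 10 samples for one problem, 3 of which passed.
    print(pass_at_k(n=10, c=3, k=1))
```

The benchmark score is the mean of the per-problem estimates. Computing it once from the original test results and once from the extended ones shows how many solutions pass only the weaker tests.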
