HumanEval+
Use HumanEval+ to rigorously evaluate AI code generation models. This extended benchmark augments each original HumanEval problem with many additional test cases, catching solutions that pass the original tests but fail on edge cases, for a more robust measure of model generalization.
4 Steps
1. Install EvalPlus: Install the EvalPlus library using pip. This package provides the extended HumanEval+ benchmark and its evaluation tools.
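Assuming EvalPlus is distributed on PyPI under the package name `evalplus` (as its public documentation describes), the install is a single pip command:

```shell
# Install EvalPlus; the package name `evalplus` is assumed from its public docs.
pip install evalplus
```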
2. Prepare Model Code Samples: Format your LLM's generated code outputs into a JSONL file, one JSON object per line with a 'task_id' (e.g., 'HumanEval/0') and a 'completion' (the generated code string).
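As a sketch, the JSONL file from step 2 can be written with Python's standard library. The `samples` list below uses placeholder completions where your model's real outputs would go:

```python
import json

# Placeholder generations; in practice these come from your LLM.
samples = [
    {"task_id": "HumanEval/0", "completion": "    return sorted(numbers)\n"},
    {"task_id": "HumanEval/1", "completion": "    return []\n"},
]

# Write one JSON object per line (the JSONL format EvalPlus expects).
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```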
3. Run Evaluation: Execute EvalPlus against your model's prepared code samples. Replace 'your_model_name' with a unique identifier for the model you are evaluating.
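A typical invocation, assuming the CLI entry point and flags that recent EvalPlus releases document (`evalplus.evaluate` with `--dataset` and `--samples`), might look like:

```shell
# Evaluate the prepared samples on the extended HumanEval+ test suite.
# "samples.jsonl" is a placeholder for your own samples path.
evalplus.evaluate --dataset humaneval --samples samples.jsonl
```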
4. Analyze Results: Review the detailed evaluation report generated by EvalPlus. It includes pass@k metrics and identifies specific test case failures, providing insight into your model's generalization and robustness.
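The pass@k numbers in the report follow the standard unbiased estimator from the original HumanEval paper (Chen et al., 2021). A minimal reimplementation for sanity-checking reported numbers, using hypothetical per-task counts, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: number of those samples that pass all tests
    k: evaluation budget
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 4 samples per task, 2 correct -> pass@1 = 0.5.
print(pass_at_k(4, 2, 1))
```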