Model Evaluation Harness
by AaaS · open-source · Last verified 2026-03-01
Comprehensive model evaluation script that runs models against standard benchmarks such as MMLU, HumanEval, and GSM8K, as well as custom evaluation sets. Produces detailed reports with per-category breakdowns, confidence intervals, and comparison charts.
https://aaas.blog/script/model-evaluation-harness
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: A · Citations: C+ · Engagement: F
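The confidence intervals and comparison charts described above can be illustrated with a short, self-contained sketch. Everything below is hypothetical: the result rows, model names, and output path are placeholders, and a Wilson score interval stands in for whatever interval method the script actually uses; only pandas, numpy, and matplotlib from the listed dependencies are involved.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-benchmark results; the real harness would collect these
# from lm-eval / API runs. Columns: model, benchmark, correct, total.
results = pd.DataFrame([
    ("model-a", "mmlu",  4120, 5700),
    ("model-a", "gsm8k",  980, 1319),
    ("model-b", "mmlu",  4510, 5700),
    ("model-b", "gsm8k", 1105, 1319),
], columns=["model", "benchmark", "correct", "total"])

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial accuracy estimate (illustrative choice)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * np.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

results["accuracy"] = results["correct"] / results["total"]
lows, highs = zip(*(wilson_interval(c, t)
                    for c, t in zip(results["correct"], results["total"])))
results["ci_low"], results["ci_high"] = lows, highs

# Comparison chart: one bar group per benchmark, error bars from the interval.
benchmarks = sorted(results["benchmark"].unique())
width = 0.35
fig, ax = plt.subplots()
for i, (model, grp) in enumerate(results.groupby("model")):
    grp = grp.set_index("benchmark").loc[benchmarks]
    x = np.arange(len(benchmarks)) + i * width
    yerr = np.vstack([grp["accuracy"] - grp["ci_low"], grp["ci_high"] - grp["accuracy"]])
    ax.bar(x, grp["accuracy"], width=width, yerr=yerr, capsize=4, label=model)
ax.set_xticks(np.arange(len(benchmarks)) + width / 2)
ax.set_xticklabels(benchmarks)
ax.set_ylabel("accuracy")
ax.legend()
fig.savefig("comparison.png", dpi=150)  # placeholder output path
```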
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: multi-benchmark-evaluation, report-generation, comparison-charts, confidence-intervals, custom-eval-support
- Integrations: lm-eval, openai, anthropic, datasets, pandas
- Use Cases: model-selection, fine-tuning-evaluation, regression-testing, capability-assessment
- API Available: No
- Language: python
- Dependencies: lm-eval, openai, anthropic, datasets, pandas, matplotlib (see the sketch after this list)
- Environment: Python 3.11+ with CUDA 12 for local models
- Est. Runtime: 30-120 minutes, depending on benchmark count and model size
- Tags: script, automation, evaluation, benchmarking, testing
- Added: 2026-03-17
- Completeness: 100%
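Given the listed dependencies and environment, a plausible (but unverified) way such a harness could drive the standard benchmarks is through lm-eval's Python entry point. The call below assumes the EleutherAI lm-evaluation-harness v0.4 `simple_evaluate` API and a local Hugging Face model; the model id, task selection, few-shot setting, and output file are assumptions, not this script's documented interface.

```python
import json
import lm_eval

# Assumed lm-eval v0.4 Python API; a large local model needs the CUDA setup
# noted under Environment. Model id and settings are illustrative only.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",  # hypothetical model
    tasks=["mmlu", "gsm8k"],  # HumanEval and custom task configs would be added similarly
    num_fewshot=5,
    batch_size=8,
)

# results["results"] maps each task to its metric dict (accuracy, stderr, ...);
# a harness like this one would feed that into the report/chart stage.
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```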
Index Score: 53.9
- Adoption: 60
- Quality: 82
- Freshness: 80
- Citations: 54
- Engagement: 0