
Model Evaluation Harness

by AaaS · open-source · Last verified 2026-03-01

A comprehensive model-evaluation script that runs models against standard benchmarks (MMLU, HumanEval, GSM8K) as well as custom evaluation sets, producing detailed reports with per-category breakdowns, confidence intervals, and comparison charts.

https://aaas.blog/script/model-evaluation-harness
Overall Grade: C+ (Average)
Adoption: B · Quality: A · Freshness: A · Citations: C+ · Engagement: F
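
As a rough sketch of the kind of run described above: the lm-eval library listed under Integrations below exposes a public `simple_evaluate` entry point that covers two of the cited benchmarks. The model choice, task names, and results handling here are illustrative assumptions, not this script's documented interface.

```python
# Sketch: driving the listed lm-eval integration for two of the cited
# benchmarks. The harness's real entry point is not documented in this
# listing; model_args and task names below are illustrative assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # local HuggingFace backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # hypothetical model
    tasks=["mmlu", "gsm8k"],                       # two benchmarks from the listing
    num_fewshot=5,
    batch_size=8,
)

# Per-task metrics (accuracy, stderr, ...) keyed by task name.
for task, metrics in results["results"].items():
    print(task, metrics)
```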

Specifications

License: MIT
Pricing: open-source
Capabilities: multi-benchmark-evaluation, report-generation, comparison-charts, confidence-intervals, custom-eval-support (see sketches below)
Integrations: lm-eval, openai, anthropic, datasets, pandas
Use Cases: model-selection, fine-tuning-evaluation, regression-testing, capability-assessment
API Available: No
Language: python
Dependencies: lm-eval, openai, anthropic, datasets, pandas, matplotlib
Environment: Python 3.11+ with CUDA 12 for local models
Est. Runtime: 30-120 minutes depending on benchmark count and model size
Tags: script, automation, evaluation, benchmarking, testing
Added: 2026-03-17
Completeness: 100%
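
Of the capabilities above, confidence-intervals has the most reusable logic. A common approach for benchmark accuracy (and a plausible one for a harness like this, though the listing does not say which method it uses) is a percentile bootstrap over per-item correctness. A minimal sketch, with hypothetical function and variable names:

```python
# Sketch: percentile-bootstrap confidence interval over per-item scores.
# A plausible way to produce the listing's "confidence-intervals" output;
# the harness's actual method is not documented. Names are hypothetical.
import numpy as np  # installed transitively as a pandas dependency

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Return (mean, lo, hi) for a (1 - alpha) percentile bootstrap CI."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample item-level scores with replacement and take each mean.
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

# Example: 0/1 correctness over a small eval set (illustrative data only).
mean, lo, hi = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print(f"accuracy {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```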

Index Score: 53.9
Adoption: 60 · Quality: 82 · Freshness: 80 · Citations: 54 · Engagement: 0
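
The listing also advertises comparison-charts and per-category breakdowns via the pandas and matplotlib dependencies. A minimal sketch of that kind of output, with made-up model names and scores purely for illustration:

```python
# Sketch: per-category comparison chart using the listed pandas/matplotlib
# dependencies. Models, categories, and scores are made up for illustration;
# the harness's real report format is not documented in the listing.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.DataFrame(
    {
        "model-a": {"mmlu": 0.71, "gsm8k": 0.58, "humaneval": 0.44},
        "model-b": {"mmlu": 0.68, "gsm8k": 0.63, "humaneval": 0.49},
    }
)

# Grouped bar chart: one bar group per benchmark, one bar per model.
ax = scores.plot.bar(rot=0, ylabel="score", title="Benchmark comparison")
ax.set_ylim(0, 1)
plt.tight_layout()
plt.savefig("comparison_chart.png")
```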
