Model Evaluation Harness
by AaaS · open-source · Last verified 2026-03-01
Comprehensive model evaluation script that runs models against standard benchmarks such as MMLU, HumanEval, and GSM8K, as well as custom evaluation sets. Produces detailed reports with per-category breakdowns, confidence intervals, and comparison charts.
https://aaas.blog/script/model-evaluation-harness
Overall grade: C+ (Average)
Adoption: B · Quality: A · Freshness: A · Citations: C+ · Engagement: F
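The confidence intervals and comparison charts described above can be illustrated with a short, self-contained sketch. Everything below is hypothetical: the result rows, model names, and output path are placeholders, and a Wilson score interval stands in for whatever interval method the script actually uses; only pandas, numpy, and matplotlib from the listed dependencies are involved.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-benchmark results; the real harness would collect these
# from lm-eval / API runs. Columns: model, benchmark, correct, total.
results = pd.DataFrame([
    ("model-a", "mmlu",  4120, 5700),
    ("model-a", "gsm8k",  980, 1319),
    ("model-b", "mmlu",  4510, 5700),
    ("model-b", "gsm8k", 1105, 1319),
], columns=["model", "benchmark", "correct", "total"])

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial accuracy estimate (illustrative choice)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * np.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

results["accuracy"] = results["correct"] / results["total"]
lows, highs = zip(*(wilson_interval(c, t)
                    for c, t in zip(results["correct"], results["total"])))
results["ci_low"], results["ci_high"] = lows, highs

# Comparison chart: one bar group per benchmark, error bars from the interval.
benchmarks = sorted(results["benchmark"].unique())
width = 0.35
fig, ax = plt.subplots()
for i, (model, grp) in enumerate(results.groupby("model")):
    grp = grp.set_index("benchmark").loc[benchmarks]
    x = np.arange(len(benchmarks)) + i * width
    yerr = np.vstack([grp["accuracy"] - grp["ci_low"], grp["ci_high"] - grp["accuracy"]])
    ax.bar(x, grp["accuracy"], width=width, yerr=yerr, capsize=4, label=model)
ax.set_xticks(np.arange(len(benchmarks)) + width / 2)
ax.set_xticklabels(benchmarks)
ax.set_ylabel("accuracy")
ax.legend()
fig.savefig("comparison.png", dpi=150)  # placeholder output path
```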
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: multi-benchmark-evaluation, report-generation, comparison-charts, confidence-intervals, custom-eval-support
- Integrations: lm-eval, openai, anthropic, datasets, pandas
- Use Cases: model-selection, fine-tuning-evaluation, regression-testing, capability-assessment
- API Available: No
- Language: python
- Dependencies: lm-eval, openai, anthropic, datasets, pandas, matplotlib (see the sketch after this list)
- Environment: Python 3.11+ with CUDA 12 for local models
- Est. Runtime: 30-120 minutes, depending on benchmark count and model size
- Tags: script, automation, evaluation, benchmarking, testing
- Added: 2026-03-17
- Completeness: 100%
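Given the listed dependencies and environment, a plausible (but unverified) way such a harness could drive the standard benchmarks is through lm-eval's Python entry point. The call below assumes the EleutherAI lm-evaluation-harness v0.4 `simple_evaluate` API and a local Hugging Face model; the model id, task selection, few-shot setting, and output file are assumptions, not this script's documented interface.

```python
import json
import lm_eval

# Assumed lm-eval v0.4 Python API; a large local model needs the CUDA setup
# noted under Environment. Model id and settings are illustrative only.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",  # hypothetical model
    tasks=["mmlu", "gsm8k"],  # HumanEval and custom task configs would be added similarly
    num_fewshot=5,
    batch_size=8,
)

# results["results"] maps each task to its metric dict (accuracy, stderr, ...);
# a harness like this one would feed that into the report/chart stage.
with open("eval_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```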
Index Score: 53.9
- Adoption: 60
- Quality: 82
- Freshness: 80
- Citations: 54
- Engagement: 0