Benchmark · benchmarks-evaluation · v1.0

HELM

by Stanford CRFM · free · Last verified 2026-04-24

HELM (Holistic Evaluation of Language Models), from Stanford CRFM, is a multi-dimensional evaluation framework that measures LLMs across seven categories: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It evaluates models on 42 scenarios and 59 metrics, providing one of the most comprehensive public assessments of LLM capabilities and risks.
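Calibration, one of the metric categories listed above, is commonly measured with expected calibration error (ECE). The sketch below is a minimal, generic ECE implementation for illustration only, not HELM's actual code: it bins predictions by confidence and averages the per-bin gap between confidence and accuracy.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean gap between average confidence and
    accuracy, computed over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()    # accuracy within this bin
        conf = confidences[mask].mean()  # mean confidence within it
        ece += (mask.sum() / n) * abs(conf - acc)
    return ece

# A model that is 90% confident but right only half the time
# is miscalibrated: ECE = |0.9 - 0.5| = 0.4.
print(expected_calibration_error([0.9, 0.9], [1, 0]))
```

Lower is better: a perfectly calibrated model (confidence always matching empirical accuracy) scores 0.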

https://crfm.stanford.edu/helm/
Index Grade: C (Below Average)
Adoption: C+ · Quality: B+ · Freshness: A · Citations: C · Engagement: F

Specifications

License: Apache-2.0 (open source)
Pricing: Free
Capabilities: (not listed)
Integrations: (not listed)
Use Cases: (not listed)
API Available: No
Tags: benchmark, holistic, fairness, robustness, calibration, stanford, comprehensive
Added: 2026-04-24
Completeness: 60%

Index Score: 44

Adoption: 50
Quality: 70
Freshness: 80
Citations: 40
Engagement: 0
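The composite index presumably rolls the five sub-scores into one number, but the directory does not publish its formula. The sketch below shows one plausible scheme, a weighted average with purely illustrative weights; note these weights yield 52, not the listed 44, precisely because the real weighting is unknown.

```python
# Hypothetical weights -- the directory's actual formula is not published.
WEIGHTS = {
    "adoption": 0.25,
    "quality": 0.25,
    "freshness": 0.20,
    "citations": 0.15,
    "engagement": 0.15,
}

def index_score(subscores):
    """Weighted average of 0-100 sub-scores, rounded to an integer."""
    return round(sum(WEIGHTS[k] * subscores[k] for k in WEIGHTS))

# HELM's listed sub-scores from this page.
helm = {"adoption": 50, "quality": 70, "freshness": 80,
        "citations": 40, "engagement": 0}
print(index_score(helm))  # 52 under these illustrative weights
```

The zero Engagement sub-score drags the composite down under any weighting, which is consistent with the below-average overall grade.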
