MLAgentBench
by Huang et al. / Stanford · open-source · Last verified 2026-03-17
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously: reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build on real ML research across 13 diverse tasks.
https://github.com/snap-stanford/MLAgentBench
Overall Grade: C+ (Average)
Adoption: B · Quality: A · Freshness: B+ · Citations: B · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: evaluation, ml-research-agent, autonomous-experimentation
- Integrations: none listed
- Use Cases: model-evaluation, ai-agents, research-automation
- API Available: No
- Evaluated Models: gpt-4o, claude-opus-4, llama-3-70b
- Metrics: success-rate, final-performance-gain
- Methodology: 13 ML tasks drawn from Kaggle competitions and research benchmarks. Agents operate under a 24-hour wall-clock budget per task, iterating on code, training, and evaluation. A run counts as a success when it reaches a predefined performance threshold above the task baseline (see the sketch after this list).
- Last Run: 2026-01-30
- Tags: agents, ml-research, coding, experimentation, autonomous
- Added: 2026-03-17
- Completeness: 100%
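The Methodology entry implies a simple harness loop: give the agent a fixed wall-clock budget, let it iterate on edit/train/evaluate cycles, and score its best result against a per-task threshold. Below is a minimal sketch of that protocol in Python. The `agent_step` callable and `TaskResult` record are hypothetical names introduced here for illustration; the real MLAgentBench harness defines its own interfaces. The two metrics listed above, success-rate and final-performance-gain, fall out of the per-task records:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    task_id: str
    baseline: float        # baseline score shipped with the task
    threshold: float       # predefined "success" score above the baseline
    best_score: float      # best score the agent reached within budget

    @property
    def success(self) -> bool:
        return self.best_score >= self.threshold

    @property
    def performance_gain(self) -> float:
        # relative improvement of the agent's best model over the baseline
        return (self.best_score - self.baseline) / abs(self.baseline)

def run_task(task_id: str,
             baseline: float,
             threshold: float,
             agent_step: Callable[[], float],
             budget_s: float = 24 * 3600) -> TaskResult:
    """Let the agent iterate (edit code, train, evaluate) until the
    24-hour wall-clock budget is exhausted, keeping the best score seen."""
    deadline = time.monotonic() + budget_s
    best = baseline
    while time.monotonic() < deadline:
        score = agent_step()   # one edit/train/evaluate cycle
        best = max(best, score)
    return TaskResult(task_id, baseline, threshold, best)

def summarize(results: list[TaskResult]) -> dict[str, float]:
    # the two headline metrics from the spec, averaged over all 13 tasks
    return {
        "success_rate": sum(r.success for r in results) / len(results),
        "mean_final_performance_gain":
            sum(r.performance_gain for r in results) / len(results),
    }
```

Keeping the best score seen (rather than the last) matches the stated success criterion: the agent only needs to cross the threshold at some point within its budget.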
Index Score
Overall: 57.9
- Adoption: 60
- Quality: 87
- Freshness: 78
- Citations: 66
- Engagement: 0
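The listing does not say how the 57.9 composite is derived from the five subscores; a plain mean would be 58.2, so the index presumably weights the dimensions unevenly. A minimal sketch of such a weighted composite, with purely hypothetical weights (the index's actual weighting is not published here):

```python
# Hypothetical weights -- the real index weighting is not published here.
WEIGHTS = {
    "adoption": 0.20,
    "quality": 0.25,
    "freshness": 0.20,
    "citations": 0.20,
    "engagement": 0.15,
}

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 0-100 subscores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(scores[k] * weights[k] for k in weights)

mlagentbench = {"adoption": 60, "quality": 87, "freshness": 78,
                "citations": 66, "engagement": 0}
print(f"{composite(mlagentbench, WEIGHTS):.2f}")  # 62.55 -- not 57.9; the real weights differ
```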