BenchmarkAI for Code v1.0

MLE-bench

by OpenAI · open-source · Last verified 2026-03-01

Benchmark that evaluates AI agents on real Kaggle machine-learning competitions. It tests the full ML engineering pipeline, including data exploration, feature engineering, model selection, training, and submission formatting, with results scored against actual competition leaderboards.
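To make the pipeline stages concrete, here is a minimal sketch of the end-to-end workflow an agent is expected to complete. The synthetic data, column names, and `submission.csv` path are hypothetical stand-ins; in a real MLE-bench run the agent reads each competition's own files from its sandbox.

```python
# Minimal sketch of a Kaggle-style pipeline: explore, engineer, select,
# train, and format a submission. All assets here are hypothetical.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# 1. Data exploration: a real agent would read the competition's
#    train/test CSVs; here we synthesize a stand-in dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

# 2-3. Feature engineering and model selection (kept trivial here).
model = GradientBoostingClassifier(random_state=0)

# 4. Training.
model.fit(X_train, y_train)

# 5. Submission formatting: the grader scores a submission file against
#    the competition's own metric, so the format must match exactly.
submission = pd.DataFrame({
    "id": range(len(X_test)),
    "target": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```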

https://github.com/openai/mle-bench
Index Grade: C+ (Average)

Adoption: C+ · Quality: A · Freshness: A+ · Citations: C+ · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: agent-evaluation, ml-pipeline-testing, competition-benchmarking
Integrations: docker, kaggle
Use Cases: ml-agent-evaluation, data-science-capability-testing, research
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro
Metrics: medal-rate, above-median-rate, competition-score
Methodology: 75 real Kaggle competitions. Agents work in sandboxed environments with dataset access, generating submissions scored against the actual competition metrics.
Last Run: 2026-03-05
Tags: benchmark, evaluation, machine-learning, kaggle, data-science
Added: 2026-03-17
Completeness: 100%
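The listed metrics aggregate per-competition outcomes across the suite. The sketch below shows one plausible way to compute medal-rate and above-median-rate; the field names and flags are assumptions, since the real grader in openai/mle-bench derives medal and median thresholds from each leaderboard's rules.

```python
# Hypothetical sketch of aggregating per-competition results into the
# medal-rate and above-median-rate metrics listed above.
from dataclasses import dataclass

@dataclass
class CompetitionResult:
    competition: str
    score: float          # raw competition-score on the official metric
    won_medal: bool       # would this score earn any Kaggle medal?
    above_median: bool    # does it beat the leaderboard median?

def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions where the agent's submission medals."""
    return sum(r.won_medal for r in results) / len(results)

def above_median_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions where the agent beats the median entrant."""
    return sum(r.above_median for r in results) / len(results)

# Example data for two competitions (values invented for illustration).
results = [
    CompetitionResult("spaceship-titanic", 0.79,
                      won_medal=False, above_median=True),
    CompetitionResult("denoising-dirty-documents", 0.12,
                      won_medal=True, above_median=True),
]
print(f"medal-rate: {medal_rate(results):.2f}")              # 0.50
print(f"above-median-rate: {above_median_rate(results):.2f}")  # 1.00
```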

Index Score: 54.8

Adoption: 58
Quality: 88
Freshness: 90
Citations: 56
Engagement: 0
