Benchmark · AI Agents · v1.0

OSWorld

by University of Hong Kong · open-source · Last verified 2026-03-01

A benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. It tests an agent's ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.

https://os-world.github.io
C+ (Average)
Adoption: C+ · Quality: A · Freshness: A+ · Citations: C+ · Engagement: F

Specifications

License: Apache-2.0
Pricing: open-source
Capabilities: agent-evaluation, os-interaction-testing, gui-automation-assessment
Integrations: docker, vnc
Use Cases: desktop-agent-benchmarking, os-automation-evaluation, gui-agent-testing
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro
Metrics: success-rate, screenshot-accuracy
Methodology: 369 real computer tasks across Ubuntu, Windows, and macOS. Agents observe the environment through screenshots and act with keyboard and mouse input. Each task is scored by checking the resulting OS state after the agent finishes.
Last Run: 2026-03-05
Tags: benchmark, evaluation, agents, os, desktop-automation
Added: 2026-03-17
Completeness: 100%
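
The methodology above (act in an environment, then check the resulting state) can be illustrated with a toy sketch. This is not OSWorld's harness or API: every name here (`Task`, `apply_action`, `run_task`, `success_rate`, `scripted_agent`) is hypothetical, the "OS state" is reduced to a dictionary of file paths, and the agent reads that dictionary directly where a real agent would receive a screenshot. It only shows how a success rate could be derived from per-task checks on the final state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; none of these names come from OSWorld.
State = Dict[str, str]    # toy stand-in for OS state: file path -> contents
Action = Dict[str, str]   # toy stand-in for a keyboard/mouse action


@dataclass
class Task:
    instruction: str
    initial_state: State
    checker: Callable[[State], bool]  # inspects only the final state
    max_steps: int = 15               # assumed step budget, not a documented value


def apply_action(state: State, action: Action) -> None:
    """Toy environment step: only a 'write_file' action is modelled."""
    if action.get("type") == "write_file":
        state[action["path"]] = action["text"]


def run_task(task: Task, agent: Callable[[str, State], Action]) -> bool:
    state = dict(task.initial_state)  # fresh copy for each episode
    for _ in range(task.max_steps):
        action = agent(task.instruction, state)
        if action.get("type") == "done":
            break
        apply_action(state, action)
    return task.checker(state)        # check the state after execution


def success_rate(tasks: List[Task], agent: Callable[[str, State], Action]) -> float:
    results = [run_task(task, agent) for task in tasks]
    return sum(results) / len(results) if results else 0.0


def scripted_agent(instruction: str, state: State) -> Action:
    # Writes notes.txt once, then stops; it ignores what the instruction asks.
    if "notes.txt" not in state:
        return {"type": "write_file", "path": "notes.txt", "text": "hello"}
    return {"type": "done"}


tasks = [
    Task("Create notes.txt containing 'hello'", {},
         lambda s: s.get("notes.txt") == "hello"),
    Task("Create todo.txt", {}, lambda s: "todo.txt" in s),
]
rate = success_rate(tasks, scripted_agent)
print(f"success rate: {rate:.2f}")
```

The scripted agent completes the first task and fails the second, so the checker-based score is one task out of two. Scoring the final state, not the sequence of actions, means any route to a correct end state counts as a success.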

Index Score

53.7
Adoption: 54
Quality: 88
Freshness: 90
Citations: 58
Engagement: 0
