OSWorld
by University of Hong Kong · open-source · Last verified 2026-03-01
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
https://os-world.github.io
Grade: C+ (Average)
Adoption: C+ · Quality: A · Freshness: A+ · Citations: C+ · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: agent-evaluation, os-interaction-testing, gui-automation-assessment
- Integrations: docker, vnc
- Use Cases: desktop-agent-benchmarking, os-automation-evaluation, gui-agent-testing
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro
- Metrics: success-rate, screenshot-accuracy
- Methodology: 369 real computer tasks across Ubuntu, Windows, and macOS. Agents observe screenshots and act through keyboard and mouse input; each task is scored by checking the OS state after execution.
- Last Run: 2026-03-05
- Tags: benchmark, evaluation, agents, os, desktop-automation
- Added: 2026-03-17
- Completeness: 100%
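The methodology entry describes an observe, act, then check-final-state loop, with success rate as the headline metric. The sketch below illustrates that idea with a toy in-memory "desktop" instead of a real virtual machine. Every name in it (ToyDesktop, Task, run_task, scripted_agent, the write_file action) is hypothetical and is not taken from the OSWorld codebase; a text summary stands in for a screenshot.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical illustrations, not the OSWorld API.


@dataclass
class ToyDesktop:
    """Stand-in for a real VM: tracks files as a dict of path -> contents."""
    files: Dict[str, str] = field(default_factory=dict)

    def screenshot(self) -> str:
        # A real environment would return pixels; a text summary stands in.
        return "files: " + ", ".join(sorted(self.files))

    def apply(self, action: dict) -> None:
        # The toy supports a single action type: writing text to a file.
        if action["type"] == "write_file":
            self.files[action["path"]] = action["text"]


@dataclass
class Task:
    instruction: str
    # The checker inspects the final state, not the sequence of actions taken.
    check: Callable[[ToyDesktop], bool]


def run_task(task: Task, agent: Callable[[str, str], dict], max_steps: int = 5) -> bool:
    """Run one episode: observe, act, repeat, then check the final state."""
    env = ToyDesktop()
    for _ in range(max_steps):
        action = agent(task.instruction, env.screenshot())
        if action["type"] == "done":
            break
        env.apply(action)
    return task.check(env)


def success_rate(results: List[bool]) -> float:
    """Fraction of tasks whose final-state check passed."""
    return sum(results) / len(results) if results else 0.0


def scripted_agent(instruction: str, observation: str) -> dict:
    # Trivial policy: create notes.txt once, then stop.
    if "notes.txt" in observation:
        return {"type": "done"}
    return {"type": "write_file", "path": "notes.txt", "text": "hello"}


tasks = [
    Task("Create notes.txt containing 'hello'",
         lambda env: env.files.get("notes.txt") == "hello"),
    Task("Create todo.txt", lambda env: "todo.txt" in env.files),
]
results = [run_task(t, scripted_agent) for t in tasks]
rate = success_rate(results)
print(results, rate)
```

Scoring on the final state rather than on the action trace means any sequence of actions that reaches the goal counts as a success, which is why the checker receives only the environment.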
Index Score: 53.7
- Adoption: 54
- Quality: 88
- Freshness: 90
- Citations: 58
- Engagement: 0