OSWorld
by University of Hong Kong · open-source · Last verified 2026-03-01
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
https://os-world.github.io
Overall grade: D (Poor)
Adoption: C+ · Quality: A · Freshness: A+ · Citations: F · Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: agent-evaluation, os-interaction-testing, gui-automation-assessment
- Integrations: docker, vnc
- Use Cases: desktop-agent-benchmarking, os-automation-evaluation, gui-agent-testing
- API Available: No
- Evaluated Models: claude-4, gpt-5, gemini-2.5-pro
- Metrics: success-rate, screenshot-accuracy
- Methodology: 369 real computer tasks across Ubuntu, Windows, and macOS. Agents observe screenshots and act through keyboard and mouse input; each run is scored by checking the OS state after execution.
- Last Run: 2026-03-05
- Tags: benchmark, evaluation, agents, os, desktop-automation
- Added: 2026-03-17
- Completeness: 80%
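The methodology above (observe, act, then check the resulting state) and the success-rate metric can be illustrated with a toy sketch. This is not OSWorld's code or API: the `Task` type, the `apply_action` executor, the dictionary standing in for OS state, and the `max_steps` limit are all hypothetical, chosen only to show how execution-based scoring might work in principle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# All names here are hypothetical illustrations, not OSWorld's API.
State = Dict[str, str]    # toy stand-in for OS state: file name -> contents
Action = Dict[str, str]   # toy stand-in for a keyboard/mouse action
Agent = Callable[[State], Optional[Action]]  # observation -> action, or None when done


@dataclass
class Task:
    task_id: str
    initial_state: State
    check: Callable[[State], bool]  # looks only at the final state


def apply_action(state: State, action: Action) -> None:
    """Toy executor supporting 'write' and 'delete' on the fake file system."""
    if action["type"] == "write":
        state[action["path"]] = action["content"]
    elif action["type"] == "delete":
        state.pop(action["path"], None)


def run_task(task: Task, agent: Agent, max_steps: int = 15) -> bool:
    """Run the observe/act loop, then score by inspecting the final state."""
    state = dict(task.initial_state)
    for _ in range(max_steps):
        action = agent(dict(state))  # the copy stands in for a screenshot
        if action is None:           # agent signals that it is finished
            break
        apply_action(state, action)
    return task.check(state)


def success_rate(tasks: List[Task], agent: Agent) -> float:
    """Fraction of tasks whose post-execution check passes."""
    if not tasks:
        return 0.0
    return sum(run_task(t, agent) for t in tasks) / len(tasks)


def rename_agent(obs: State) -> Optional[Action]:
    """Toy agent that renames draft.txt to report.txt in two steps."""
    if "report.txt" not in obs and "draft.txt" in obs:
        return {"type": "write", "path": "report.txt", "content": obs["draft.txt"]}
    if "draft.txt" in obs:
        return {"type": "delete", "path": "draft.txt"}
    return None


tasks = [
    Task("rename", {"draft.txt": "hello"}, lambda s: s == {"report.txt": "hello"}),
    Task("summarize", {"draft.txt": "x"}, lambda s: "summary.txt" in s),
]

if __name__ == "__main__":
    print(f"success rate: {success_rate(tasks, rename_agent):.2f}")
```

In this sketch the checker never looks at the agent's actions, only at the state they leave behind, so any sequence of steps that produces the right end state counts as a success.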
Index Score: 39
- Adoption: 54
- Quality: 88
- Freshness: 90
- Citations: 0
- Engagement: 0