Benchmark · AI Agents · v1.0

OSWorld

by University of Hong Kong · open-source · Last verified 2026-03-01

Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.

https://os-world.github.io

Index Score: D (Poor)

Adoption: C+ · Quality: A · Freshness: A+ · Citations: F · Engagement: F

Specifications

License: Apache-2.0
Pricing: open-source
Capabilities: agent-evaluation, os-interaction-testing, gui-automation-assessment
Integrations: docker, vnc
Use Cases: desktop-agent-benchmarking, os-automation-evaluation, gui-agent-testing
API Available: No
Evaluated Models: claude-4, gpt-5, gemini-2.5-pro
Metrics: success-rate, screenshot-accuracy
Methodology: 369 real computer tasks across Ubuntu, Windows, and macOS. Agents interact via screenshots and keyboard/mouse actions, evaluated by checking OS state post-execution.
Last Run: 2026-03-05
Tags: benchmark, evaluation, agents, os, desktop-automation
Added: 2026-03-17
Completeness: 80%
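
The Methodology and Metrics entries above describe execution-based scoring: the agent acts, then the resulting system state is checked, and success-rate is the fraction of tasks whose check passes. The following is a minimal sketch of that idea only, not OSWorld's actual harness. Every name in it (`Task`, `evaluate`, `success_rate`, `toy_agent`, the two sample tasks) is hypothetical, and a temporary directory stands in for the virtual machine, screenshots, and keyboard/mouse actions the real benchmark uses.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List
import tempfile


@dataclass
class Task:
    """Hypothetical task record: an instruction plus a post-execution state check."""
    task_id: str
    instruction: str
    check: Callable[[Path], bool]  # inspects the final state, not the agent's actions


def evaluate(tasks: List[Task],
             agent: Callable[[Task, Path], None],
             root: Path) -> Dict[str, bool]:
    """Run the agent on each task in its own sandbox, then check the resulting state."""
    results: Dict[str, bool] = {}
    for task in tasks:
        sandbox = root / task.task_id
        sandbox.mkdir(parents=True, exist_ok=True)
        agent(task, sandbox)                        # the agent acts on the environment
        results[task.task_id] = task.check(sandbox)  # scoring looks only at what is left behind
    return results


def success_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks whose state check passed (0.0 for an empty run)."""
    return sum(results.values()) / len(results) if results else 0.0


def toy_agent(task: Task, sandbox: Path) -> None:
    """Stand-in agent (assumption): it only knows how to solve one of the tasks."""
    if task.task_id == "create-notes":
        (sandbox / "notes.txt").write_text("hello\n")


# Two illustrative tasks; these are invented examples, not taken from the benchmark.
TASKS = [
    Task(
        "create-notes",
        "Create notes.txt containing 'hello'",
        lambda d: (d / "notes.txt").is_file()
        and (d / "notes.txt").read_text().strip() == "hello",
    ),
    Task(
        "make-archive-dir",
        "Create a directory named archive",
        lambda d: (d / "archive").is_dir(),
    ),
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = evaluate(TASKS, toy_agent, Path(tmp))
        print(results, success_rate(results))
```

Because the check inspects only the final state, any sequence of actions that leaves the environment in the expected condition counts as a success, which is what distinguishes this style of evaluation from comparing an agent's actions against a reference trajectory.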

