Xuetian Chen and colleagues at Shanghai AI Lab built a reality check for computer-using agents:
Experiments show that even state-of-the-art agents struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agent research and deployment.
Current agent benchmarks are like judging a chef by how fast they can chop onions without ever asking them to cook a meal. They test narrow skills in sanitized environments and completely miss the messy reality of how humans actually use computers.
There's a massive gap between lab performance and real-world utility.
OS-MAP evaluates agents on two dimensions: depth (how autonomous they are) and breadth (how well skills transfer across domains). Depth runs from simple command execution up to proactive assistance; breadth spans everyday scenarios like work, study, and entertainment.
Instead of one-dimensional leaderboards, you get a two-dimensional capability map.
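To make that concrete, here is a minimal sketch of what a two-dimensional capability map could look like as a data structure: per-cell success rates indexed by (autonomy level, domain). The level and domain names are illustrative placeholders, not OS-MAP's exact taxonomy, and the aggregation is just averaging for simplicity.

```python
from dataclasses import dataclass, field
from statistics import mean

# Illustrative axes; OS-MAP's actual taxonomy may differ in naming and granularity.
DEPTH_LEVELS = ["command_execution", "task_completion", "workflow_coordination",
                "long_horizon_planning", "proactive_assistance"]
DOMAINS = ["work", "study", "entertainment"]


@dataclass
class CapabilityMap:
    """Per-cell task outcomes for one agent: (depth level, domain) -> list of 0/1 results."""
    results: dict = field(default_factory=dict)

    def record(self, level: str, domain: str, success: bool) -> None:
        """Log one task outcome into its cell."""
        self.results.setdefault((level, domain), []).append(1.0 if success else 0.0)

    def cell_rate(self, level: str, domain: str):
        """Success rate for one cell, or None if the agent was never tested there."""
        outcomes = self.results.get((level, domain))
        return mean(outcomes) if outcomes else None

    def depth_profile(self) -> dict:
        """Average success rate at each autonomy level, pooled across domains."""
        return {lvl: mean([r for (l, _dom), rs in self.results.items() if l == lvl for r in rs] or [0.0])
                for lvl in DEPTH_LEVELS}

    def breadth_profile(self) -> dict:
        """Average success rate in each domain, pooled across autonomy levels."""
        return {dom: mean([r for (_lvl, d), rs in self.results.items() if d == dom for r in rs] or [0.0])
                for dom in DOMAINS}
```

The point of the two profiles is that an agent can look strong on one axis and weak on the other, which a single leaderboard number hides.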
The results are sobering: even the most advanced agents achieve only an 11.4% success rate, nowhere close to human performance. The challenge isn't building more powerful models. It's understanding where agents actually live on this capability map and designing systems that know their own limitations.
How do you design for an agent that's reliable at simple tasks but fails completely at complex orchestration? Maybe the future isn't one agent conquering everything, but systems that understand their boundaries and know when to hand off to humans.
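One way to read that idea is a capability-aware dispatcher: consult the measured map before acting, and escalate to a human whenever the task falls in a cell where the agent is unproven or unreliable. This is a sketch of the pattern, not anything from the paper; the Task fields, the 0.8 threshold, and the run_agent/ask_human callables are all assumed placeholders, and cell_rate comes from the CapabilityMap sketch above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    depth_level: str  # e.g. "command_execution"
    domain: str       # e.g. "work"


HANDOFF_THRESHOLD = 0.8  # assumed cutoff, not from the paper


def dispatch(task: Task, capability_map, run_agent, ask_human):
    """Route a task to the agent only where its measured reliability supports it."""
    rate = capability_map.cell_rate(task.depth_level, task.domain)
    if rate is None or rate < HANDOFF_THRESHOLD:
        # Unknown or historically weak cell: hand off to a human instead of guessing.
        return ask_human(task)
    return run_agent(task)
```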