Vision agents vs. structured APIs on the same internal tool task

5 points by FirestarAlpha 7 days ago · 0 comments · 1 min read

Vision agents (browser-use, computer-use) are the default for letting AI agents operate web apps that have no API. The alternative is writing an MCP server or REST API per app, but each app needs its own, and enterprise teams have 20+ internal tools.

We ran both agents on a Reflex port of a React demo (a small business's admin panel). The task: find the "Smith" with the most orders, accept their pending reviews, and mark their most recent order as delivered.
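To make the three steps of the task concrete, here is a minimal sketch of the same logic over in-memory data. The `Customer`, `Order`, and `Review` classes, their fields, and `run_task` are hypothetical names invented for illustration; they are not the demo's schema or the endpoints the API agent called.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical data model: names and fields are illustrative only,
# not the schema of the admin panel used in the benchmark.
@dataclass
class Customer:
    id: int
    last_name: str

@dataclass
class Order:
    id: int
    customer_id: int
    placed_on: date
    status: str = "pending"

@dataclass
class Review:
    id: int
    customer_id: int
    status: str = "pending"

def run_task(customers, orders, reviews, last_name="Smith"):
    """Apply the three task steps in place and return the chosen customer id."""
    matches = [c for c in customers if c.last_name == last_name]
    if not matches:
        raise LookupError(f"no customer named {last_name}")

    # Step 1: among the matching customers, pick the one with the most orders.
    def order_count(customer):
        return sum(1 for o in orders if o.customer_id == customer.id)
    target = max(matches, key=order_count)

    # Step 2: accept only that customer's pending reviews.
    for r in reviews:
        if r.customer_id == target.id and r.status == "pending":
            r.status = "accepted"

    # Step 3: mark that customer's most recent order as delivered.
    own_orders = [o for o in orders if o.customer_id == target.id]
    if own_orders:
        max(own_orders, key=lambda o: o.placed_on).status = "delivered"
    return target.id

# Made-up sample data: two Smiths, one with more orders than the other.
customers = [Customer(1, "Smith"), Customer(2, "Smith"), Customer(3, "Jones")]
orders = [
    Order(10, 1, date(2024, 1, 5)),
    Order(11, 2, date(2024, 1, 3)),
    Order(12, 2, date(2024, 2, 9)),
    Order(13, 3, date(2024, 3, 1)),
]
reviews = [Review(20, 2), Review(21, 1), Review(22, 2, "rejected")]

chosen = run_task(customers, orders, reviews)
```

Each step is a lookup or a state change that a structured interface can express directly, whereas a vision agent has to find the same information by navigating and reading screens.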

Results (medians, n=5 API / n=3 vision):

- Vision agent: 47 steps, 495k tokens, ~14 min
- API agent: 8 calls, 12k tokens, 19.7s
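For a sense of scale, the medians above can be turned into ratios. This is only arithmetic on the reported numbers, with "~14 min" taken as 840 s (an approximation); the dictionary names are made up for the example.

```python
# Reported medians; "~14 min" is approximated as 14 * 60 seconds.
vision = {"round_trips": 47, "tokens": 495_000, "seconds": 14 * 60}
api = {"round_trips": 8, "tokens": 12_000, "seconds": 19.7}

# Ratio of vision agent to API agent for each metric.
ratios = {metric: vision[metric] / api[metric] for metric in vision}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
```

That works out to roughly 6x the round-trips, 41x the tokens, and 43x the wall-clock time.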

The vision agent failed on the abstract task and needed a 14-step UI walkthrough before completing it. Even with the walkthrough, it made 47 round-trips, each carrying a full-page screenshot.
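A rough per-round-trip cost follows from the same medians. Dividing one median by another is only an approximation of the true per-step figure, so treat this as a back-of-the-envelope sketch rather than a measured value.

```python
# Reported medians from the results above.
vision_tokens, vision_steps = 495_000, 47
api_tokens, api_calls = 12_000, 8

# Approximate tokens consumed per round-trip for each agent.
vision_per_step = vision_tokens / vision_steps
api_per_call = api_tokens / api_calls

print(f"vision: ~{vision_per_step:,.0f} tokens per step")
print(f"api:    ~{api_per_call:,.0f} tokens per call")
```

That is on the order of 10.5k tokens per vision step against 1.5k per API call, so the gap comes from both more round-trips and heavier ones.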

Vision-run variance was wide enough (853-1296 s, 407k-751k tokens) that a single run isn't representative, while API runs were tightly clustered. This is the cost of not building an agent-friendly interface.
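The spread of the vision runs can be expressed as a max/min ratio using only the ranges quoted above. The post does not give the API ranges, so no comparison figure is computed for them.

```python
# Reported min and max across the three vision runs.
vision_seconds = (853, 1296)
vision_tokens = (407_000, 751_000)

def spread(low_high):
    """Return max/min for a (min, max) pair."""
    low, high = low_high
    return high / low

time_spread = spread(vision_seconds)
token_spread = spread(vision_tokens)

print(f"time:   slowest run took {time_spread:.2f}x the fastest")
print(f"tokens: costliest run used {token_spread:.2f}x the cheapest")
```

The slowest vision run took about 1.5x as long as the fastest, and the most expensive used about 1.8x the tokens of the cheapest.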

The API agent's endpoints (Path B) were auto-generated by a plugin shipped in Reflex 0.9 this week. The full methodology is here: https://reflex.dev/blog/vision-agents-vs-api-calls/
