The Multi-Model Mirage
Starting Where Everyone Starts
My first instinct was to wire together existing models. It is the obvious move when you have a problem that spans multiple domains: vision, reasoning, execution. Gemini Flash handles visual perception: you send it a screenshot and it returns bounding boxes. Cerebras handles text reasoning, since it is fast and can decide what to do given some structured screen state. GPT-4.1 sits alongside as an alternative multimodal reasoner, in case you want to send the raw screenshot plus a task and get an action back directly.
I think most people start here because it feels like the right abstraction. Each model does what it is best at. You just need good glue in between.
What I learned is that the glue is the problem.
Latency Adds Up in Ways You Don’t Expect
The fastest configuration I could build was a three-tier sense-plan-act loop. SSIM-based screen change detection at 3ms to know when the page has settled. Local OCR at 150ms to pull text and bounding boxes from pixels without hitting an API. Cerebras batch planning at 200ms to decide on 1 to 5 actions. Gemini as a fallback only when OCR comes back empty. Around 550ms per action total.
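The settle-detection tier is the simplest to sketch. The real version used SSIM; a minimal stand-in using mean absolute pixel difference shows the shape of the check (the function name and threshold here are illustrative, not the actual implementation):

```python
def frames_settled(prev, curr, threshold=0.01):
    """Return True when two grayscale frames (flat lists of 0-255
    pixel values) differ by less than `threshold` mean absolute
    difference. A stand-in for the SSIM comparison; the threshold
    is a hypothetical value."""
    if len(prev) != len(curr):
        return False
    diff = sum(abs(a - b) for a, b in zip(prev, curr)) / (255 * len(prev))
    return diff < threshold
```

The point of this tier is that it runs in milliseconds on raw pixels, so the loop never waits on an API just to learn that nothing has changed yet.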
That sounds fast until you do the math. 300 actions at 550ms each is 165 seconds. The benchmark gives you 5 minutes. That is barely enough with zero margin for error, and there will always be errors.
But the deeper lesson was about failure modes at the seams. The agent spun for 200 iterations when a page closed because there was no exit condition for TargetClosedError. I had a coordinate truthiness bug where not (x and y) evaluates to True when x is 0, silently rejecting valid clicks at the left edge of the screen. The reasoning model's JSON got truncated because the output schema put the action fields last and the token limit cut them off. Gemini consistently estimated y coordinates 50 to 150px too low, which meant every click drifted down the page.
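The truthiness bug is worth spelling out, because it is so easy to write. In Python, 0 is falsy, so a guard like not (x and y) cannot distinguish "coordinate is missing" from "coordinate is zero". A sketch of the bug and the fix (the function name is illustrative):

```python
def validate_click(x, y):
    """Return (x, y) if both coordinates are present, else None."""
    # Buggy version: `not (x and y)` is True whenever x == 0 or
    # y == 0, so valid clicks along the left or top screen edge
    # were silently rejected.
    # if not (x and y): return None

    # Correct version: only reject when a coordinate is actually absent.
    if x is None or y is None:
        return None
    return (x, y)
```

The fix is one line, but finding it required noticing that clicks near the left edge never landed, which the pipeline gave no signal about.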
Each of these bugs lives at the boundary between two systems. The perception model does not know that the reasoning model’s output format will get truncated. The reasoning model does not know the execution layer rejects zero coordinates. Every junction is a place where assumptions from one system do not hold in the next.
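The missing exit condition is another guard that belongs at exactly such a boundary. A sketch of what the loop should have had from the start, with a hypothetical step function and iteration cap, and a stand-in class for Playwright's TargetClosedError so the example is self-contained:

```python
class TargetClosedError(Exception):
    """Stand-in for Playwright's TargetClosedError."""

MAX_ITERATIONS = 50  # hypothetical cap; the real agent had none

def run_loop(step):
    """Run `step()` until the target closes or the cap is hit,
    instead of spinning indefinitely against a dead page."""
    for i in range(MAX_ITERATIONS):
        try:
            step()
        except TargetClosedError:
            return f"page closed after {i} actions"
    return "iteration cap reached"
```

Neither branch existed in the original agent, which is how a closed page turned into 200 wasted iterations.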
The Real Hard Problem
The popups were annoying but manageable. The real challenge that exposed the architecture’s limits was something deeper. The benchmark requires you to reason and act at the same time. You need to look at a scrollable modal full of radio buttons, figure out which one is correct, scroll to find it, click it, and submit. You need to solve visual puzzles to reveal a code and then type that code into a form. These are not just perception or just reasoning tasks, they require both simultaneously in a tight loop.
Current computer-use agents (CUAs) do not operate this way. They either reason well but act slowly, or they act fast but cannot reason about what they are seeing. My pipeline had the same problem. The perception model could find elements on screen but had no understanding of task context. The reasoning model could plan a strategy but had no way to verify whether its plan was actually working visually. Trying to bolt reasoning onto a small action model through orchestration was a losing battle, because the two capabilities need to be unified, not stitched together through JSON.
That realization is what killed the pipeline. I was spending more engineering effort on the orchestration between models than on making any single model better. The complexity was in the plumbing, not the intelligence.
What The Architecture Taught Me
The fundamental issue is not any single model being bad. It is the assumption that perception, reasoning, and execution should be separate steps. Each step adds latency and each boundary between steps is a failure point.
A human does not perceive, then reason, then execute as distinct phases. You see a popup and your hand is already moving toward the close button. The perception and the action are fluid. Splitting them apart and serializing them through JSON over API calls is fighting against how intelligent behavior actually works.
The multi-model pipeline looked like intelligence from the outside: three different AI models collaborating on a task. Inside, it was three systems that could not share context fast enough to be useful. The architecture had to go. What I needed was a single model going directly from pixels to actions with no intermediary.
“The impediment to action advances action. What stands in the way becomes the way.” —Marcus Aurelius