Training VLM for CUA — Tzafon


Contributors: Nikita Khomich*, Leopold Pluto Hermansson*, David Dinucu Jianu, Ido Hakimi, Yerniyaz Nurgabylov, Noga Bregman, Simon Koser, Noah Löfquist, Mark Rogers

*Core contributors

Many have tried to solve the task of getting an LLM to use a computer. Until recently, these attempts have mostly relied on SFT. The appeal is that it is easy to set up: pick the task you want to solve, collect data, and you get fast gains on targeted benchmarks. However, what you quickly realize is that without heavy regularization, performance saturates after 100 to 1,000 examples per task, and beyond that other abilities start to degrade, turning the process into a game of whack-a-mole.

Even more problematic, the improvements do not generalize. Without generalization we lose the most important ability of an LLM – being able to work across many environments – which real computer-use applications demand.

The reasons SFT fails to generalize are several; a non-exhaustive list:

  1. The first-order learning mechanism is memorization: the model doesn't learn why an action should be taken, only that it should be taken when it observes a given state.
  2. The penalty for incorrect actions is often incompatible with how language is modeled. Take the simplest example, clicking a button: say the ground-truth action is to click at position (x, y). We would expect the model to learn, in some sense, that it should click the button sitting under (x, y). What actually happens is that every coordinate except the exact one given in the example is penalized uniformly, so clicking 1 pixel away costs the same as clicking on the other side of the screen.
  3. These problems exist in all ML tasks, but what makes them especially hard for vision-language models is the extremely large amount of input data relative to the prediction we want to make – think of the number of calculations and decisions the network must make to answer "where do I click?", compared with language-to-language models whose input and output are of similar size.
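Point 2 can be seen directly in a toy cross-entropy calculation (a minimal sketch, not the training code): with one vocabulary token per x-coordinate, the loss for a 1-pixel miss is identical to the loss for a miss on the far side of the screen.

```python
import math

def token_nll(logits, target):
    """Negative log-likelihood of the target coordinate token
    under a softmax over all coordinate tokens."""
    z = max(logits)
    log_norm = z + math.log(sum(math.exp(l - z) for l in logits))
    return log_norm - logits[target]

# Toy vocabulary: one token per x-coordinate, 0..999; ground truth x=500.
# One model puts all its mass on x=501 (1 px off)...
logits_near = [10.0 if x == 501 else 0.0 for x in range(1000)]
# ...another puts all its mass on x=900 (400 px off).
logits_far = [10.0 if x == 900 else 0.0 for x in range(1000)]

# Cross-entropy only asks "did you emit token 500?", never "how far off?"
loss_near = token_nll(logits_near, target=500)
loss_far = token_nll(logits_far, target=500)
print(abs(loss_near - loss_far) < 1e-9)  # True: 1 px off costs the same as 400 px off
```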


[Figure: loss from training Qwen3-VL on SFT UI data]

The first-order issue we address is the model's ability to perceive: how we can improve the vision encoders and produce a strong signal that improves downstream capabilities. Second is generalization instead of memorization, which comes down to the model learning to robustly navigate states not encountered in the training data, especially states that stem from environmental noise or changes.

Side-Quest: Internal Representation
The perception problem is a deep rabbit hole, because of all the layers of complexity and ambiguity. In general we don’t want to think of the image encoder as a separate component, as we can’t really say what we want it to do. We can reason about the fact that we expect that the embedding/outputs should contain some semantic and positional information that can be used later by the decoder, but to what extent the image encoder should transform the image or retain a lossless representation is unclear.

What we can carefully do instead is run experiments and make observations that give us insight into what the model might or might not be doing.

Image encoders, particularly models like Qwen3-VL, function by transforming an input image into a set of patches. These patches are then processed through an image transformer, where the information in each patch is treated as a token. This process involves multiple layers where each token (representing a patch) attends to all other tokens in a non-causal 2D (or 3D for video) manner.
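As a rough illustration of the patchify step (not the actual Qwen3-VL code; patch size 14 is an assumption matching common ViT-style encoders):

```python
import numpy as np

def patchify(image, patch=14):
    """Split an (H, W, C) image into a sequence of flattened patches,
    one token per patch. Assumed patch size 14; real encoders may differ."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch]
    x = x.reshape(gh, patch, gw, patch, C).swapaxes(1, 2)
    return x.reshape(gh * gw, patch * patch * C)  # (num_tokens, patch_dim)

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (256, 588): a 16x16 grid of 14x14x3 patches
```

Inside the image transformer, each of these 256 tokens then attends to all the others, non-causally, at every layer.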
Qwen3-VL implementation for reference.

By observing the intermediate and final image patch embeddings we can understand how information is distributed and how it evolves. This can be done by comparing the cosine similarity of pairs of embeddings, which tells us how the representation changes with depth. What we see directly is that even though the model has ~30 layers, the embeddings are already very similar to the last layer after layer 2, hinting that a strong representation of the image content has formed early.
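A sketch of the cosine-similarity probe, assuming we already have per-layer patch embeddings (here simulated with a toy residual stream rather than real model activations):

```python
import numpy as np

def depth_similarity(hidden_states):
    """Mean cosine similarity of each layer's patch embeddings to the
    final layer's. hidden_states: list of (num_patches, dim) arrays."""
    final = hidden_states[-1]
    final_n = final / np.linalg.norm(final, axis=-1, keepdims=True)
    sims = []
    for h in hidden_states:
        h_n = h / np.linalg.norm(h, axis=-1, keepdims=True)
        sims.append(float((h_n * final_n).sum(-1).mean()))
    return sims

# Toy residual stream: each "layer" adds a small update to the previous
# state, mimicking how information accumulates through depth.
rng = np.random.default_rng(0)
h = rng.normal(size=(64, 32))
states = []
for _ in range(30):
    h = h + 0.05 * rng.normal(size=h.shape)
    states.append(h.copy())

sims = depth_similarity(states)
print(sims[0] < sims[-1])  # True: similarity to the last layer grows with depth
```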

Next, we can look at how the representation changes across the image dimensions. To do this, we first pick a steering vector that describes the semantic meaning of an object. We then compare this vector against the image patches to measure how strongly each patch represents, say, an apple. Below are examples of patch representations, looking at how information about the apple is distributed across different images.

What these experiments indicate is that from any image patch we can probe which object is present, and where – so each patch carries globally important information. It also suggests that the image token stream is closer to a look-up table: in contrast to the initial patch embedding, which holds only very local information, the final representation contains compressed data about the entire image.
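The steering-vector probe can be sketched as follows; the embeddings and the "apple" direction are synthetic stand-ins, not real model activations:

```python
import numpy as np

def concept_heatmap(patch_emb, steering_vec, grid_hw):
    """Cosine similarity of every patch embedding with a concept
    ('steering') vector, reshaped to the patch grid."""
    v = steering_vec / np.linalg.norm(steering_vec)
    e = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    return (e @ v).reshape(grid_hw)

# Synthetic demo: random patch embeddings, with the "apple" direction
# planted into the patch at grid position (3, 5).
rng = np.random.default_rng(0)
dim, gh, gw = 64, 8, 8
apple = rng.normal(size=dim)
emb = rng.normal(size=(gh * gw, dim))
emb[3 * gw + 5] += 5.0 * apple

heat = concept_heatmap(emb, apple, (gh, gw))
peak = tuple(int(i) for i in np.unravel_index(heat.argmax(), heat.shape))
print(peak)  # (3, 5): the probe recovers where the apple is
```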

Side-Quest: Positional Decay
Positional decay in the image encoder. To explain this we need to understand that the model's ability to know where things are in the image comes from two positional mechanisms. The first is 2D-RoPE, which is applied at every attention layer by rotating the query and key vectors based on the relative offset between patches. This helps the model understand spatial relationships between patches, but it only encodes relative position – it can't tell the model where a patch sits in absolute image coordinates. To fix this we train an additive patch embedding: before a patch passes through the attention layers, we add extra values unique to its location. This means each token knows where it is and can query and infer information about specific locations in the image.

This matters because CUA tasks require outputting absolute coordinates, and the only signal for that is the additive patch embedding. The problem is that this embedding is added once at the input, while at each layer (there are 20-30 in Qwen3-VL depending on size) new information from the other tokens is mixed in. The reason the model can't simply learn to overcome this is numerical stability: to keep training stable we have to normalize the activation vector at each layer, which means the original positional signal is reduced with exponential decay. Since 2D-RoPE doesn't carry absolute position, it can't compensate for this loss.
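The decay argument can be made concrete with a toy simulation (an illustration of the mechanism, not the real architecture): inject a positional vector once, then repeatedly add new unit-norm information and renormalize, as a per-layer norm would.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_layers = 128, 30

pos = rng.normal(size=dim)
pos /= np.linalg.norm(pos)
state = pos.copy()                  # positional signal injected once at the input

frac = []
for _ in range(n_layers):
    update = rng.normal(size=dim)   # fresh information mixed in by attention
    update /= np.linalg.norm(update)
    state = state + update
    state /= np.linalg.norm(state)  # per-layer normalization for stability
    frac.append(abs(float(state @ pos)))

print(frac[0] > frac[-1])  # True: the positional component shrinks with depth
```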

This is something that Qwen3-VL tries to mitigate with DeepStack, which takes data from earlier layers of the image encoder and adds it to later stages. The original paper does show some improvements, but what we saw was that completely removing it, without any retraining, has only a minor impact – which likely means this isn't enough to solve the core issue.

Another experiment we ran to convince ourselves that this is actually a problem was to simply scale the positional embedding, with no other changes. What we see is that in images with very few objects, accuracy improves dramatically (going from 40% to 80% click accuracy on a benchmark that consists of clicking a red ball). To reproduce the experiment, we found that scaling the positional embedding by 3 works well on Qwen/Qwen3-VL-4B-Instruct. With this evidence, we believe VL models would gain from having a stronger absolute positional embedding.
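A hypothetical reproduction sketch of the scaling trick. The `DummyEncoder` and its `pos_embed` attribute are assumptions standing in for the real vision tower; we have not verified the module names in the actual Transformers implementation.

```python
import torch
import torch.nn as nn

SCALE = 3.0  # the factor that worked in the experiment described above

class DummyEncoder(nn.Module):
    """Stand-in for a vision tower with an additive absolute
    positional embedding (attribute path is hypothetical)."""
    def __init__(self, num_patches=256, dim=32):
        super().__init__()
        self.pos_embed = nn.Embedding(num_patches, dim)

def scale_pos_embed(encoder, scale=SCALE):
    """Multiply the additive positional embedding weights in place."""
    with torch.no_grad():
        encoder.pos_embed.weight.mul_(scale)
    return encoder

enc = DummyEncoder()
before = enc.pos_embed.weight.clone()
scale_pos_embed(enc)
print(torch.allclose(enc.pos_embed.weight, SCALE * before))  # True
```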

Reinforcement Learning
Given the issues we faced with memorization and training on discrete coordinate tokens, we are pushed in the direction of RL. The RL setup offers many benefits over what we have looked at before, the most obvious being how the loss is applied. Using a GRPO loss, we can give a reward depending on whether the model actually achieves the task, and we can add secondary rewards to enforce more robust behavior, like clicking near the center of buttons.
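The group-relative part of GRPO can be sketched in a few lines (a simplified illustration of the advantage computation, not the full loss with its ratio and clipping terms):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: sample several rollouts of the same
    task, then normalize each reward against its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# One task, 4 rollouts: two clicks hit the target, two miss.
adv = grpo_advantages([1.0, 1.0, 0.0, 0.0])
print(adv)  # hits get positive advantage, misses negative
```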

By adapting prime-rl (https://github.com/PrimeIntellect-ai/prime-rl) to support multi-modal inputs (supported out of the box as of Feb 2026), we can convert synthetic or annotated datasets with bounding boxes into RL environments, where the reward is statically calculated from the click coordinates and the ground-truth bounding box. Even more surprising, we do not really need a realistic environment: training only on limited, fabricated test environments, the model not only learns nicely but, more importantly, generalizes to other benchmarks. On an aggregated UI benchmark, we see a 0.39 -> 0.53 improvement, even though we only train on generated, simplified environments. More impressive still, this is better than the performance we see from SFT on UI datasets.
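A minimal sketch of such a statically computed reward, assuming a hit/miss term plus a small center-shaping bonus (the exact shaping used in training is an assumption here):

```python
def click_reward(x, y, bbox, center_bonus=0.2):
    """Reward for a predicted click against a ground-truth bounding box
    (x0, y0, x1, y1): 0.0 on a miss, 1.0 on a hit, plus a small bonus
    that grows as the click approaches the center of the box."""
    x0, y0, x1, y1 = bbox
    if not (x0 <= x <= x1 and y0 <= y <= y1):
        return 0.0
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Normalized distance from the center: 0 at center, 1 at the edge.
    dx = abs(x - cx) / max((x1 - x0) / 2, 1e-6)
    dy = abs(y - cy) / max((y1 - y0) / 2, 1e-6)
    return 1.0 + center_bonus * (1.0 - max(dx, dy))

print(click_reward(50, 50, (0, 0, 100, 100)))   # 1.2: dead-center hit
print(click_reward(99, 50, (0, 0, 100, 100)))   # near-edge hit, smaller bonus
print(click_reward(150, 50, (0, 0, 100, 100)))  # 0.0: miss
```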

Multi-turn
Given that we can apparently teach the model robustly on a single turn, we now move to multi-turn environments. The model interacts with an application in a loop: an action is executed, a new state is generated, and this repeats until it runs out of steps or completes the task. Though we are happy to see the reward go up, what we really care about is why and how it goes up.

For context, our training uses 100 environments that require 3-15 click interactions to succeed. The environments mainly test abstract capabilities rather than replicating real apps.

Beyond improving on the training environment, we see an improvement in accuracy and in the ability to execute long-horizon tasks on benchmarks like OS-World, where we get a 20% absolute improvement on the Chrome category – even though our training environment bears no resemblance to OS-World's environments or tasks.
Some key observations:

  • Reward goes up 🚀
  • The model learns to solve new tasks, which is a good sign that the range of difficulty is right and that there is some amount of generalization in what it learns.
  • From the entropy and the actual rollouts, we see some interesting behaviors. The entropy increases quite steadily, which we attribute to the model exploring more in text space before generating an action – as in most RL setups, we let the model think for a limited number of tokens before answering. Specifically, the traces become more varied and more informative: for example, the model notices errors or unintended outcomes by comparing what it planned to do with what actually happened. One key emerging change is that the model becomes less likely to repeat itself. Models have a strong bias toward repeating what is already in context, which is problematic for actions. What we expect from a robust system, and what we see after multi-turn training, is that the model recognizes from the history that an interaction failed and decides to either try something entirely different, or at least adjust the action slightly and try again. These behaviors cannot be derived systematically from SFT because they depend on the model's own capabilities: whether retrying or switching strategies is the right call depends on what the model can do, so simply mimicking a human does not train it.

These observations tie into the final point: robustness is key for solving long-horizon tasks. The success of an agentic model is only partially about accuracy and more about robustness. To show this, we tabulate the per-step accuracy required to achieve a given success rate over a certain horizon, assuming each step is independent and every step must succeed.

Recovery from failures > Click accuracy

| Trajectory length | 50% success rate | 80% success rate | 95% success rate |
|---|---|---|---|
| 1 | 0.5000 | 0.8000 | 0.9500 |
| 2 | 0.7071 | 0.8944 | 0.9747 |
| 4 | 0.8409 | 0.9457 | 0.9873 |
| 8 | 0.9170 | 0.9725 | 0.9936 |
| 16 | 0.9576 | 0.9862 | 0.9968 |
| 32 | 0.9786 | 0.9931 | 0.9984 |
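The table values follow directly from p = rate^(1/n), the per-step accuracy whose n-th power equals the target trajectory success rate; a few lines reproduce them:

```python
def required_step_accuracy(n_steps, target):
    """Per-step accuracy p with p**n_steps == target trajectory success,
    assuming independent steps that must all succeed."""
    return target ** (1.0 / n_steps)

for n in (1, 2, 4, 8, 16, 32):
    print(n, [round(required_step_accuracy(n, t), 4) for t in (0.5, 0.8, 0.95)])
```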

Even if you have a high failure tolerance – e.g. the workflow can be retried – the accuracy required becomes impossible to achieve as the number of steps grows. Instead, what we want is to improve the model's ability to recover from failures and to adapt to the slight out-of-distribution variation between training and deployment.

OS-World – 50-step comparative results, using the EVOCUA agent code and compared to the current open-source SOTA at a similar size: EVOCUA-8B averages 32.5% vs. 37.0% for Northstar CUA Fast (RL).