Trying to Solve General Computer Use
My journey into computer use models and what I've learned about training VLAs for digital agents.
When Anthropic first released their Claude Computer Use model, I was blown away by it. However, I kept running into the limits of the model’s capability and speed, and would lose interest.
My background is mostly in deep learning for robotics. After getting to tour the Figure AI factory and watch their Helix VLA perform with human-level dexterity, it occurred to me that many of Claude Computer Use’s limitations could be solved by borrowing ideas from robotic VLAs.
That kicked off my obsession with, and journey into, computer use models in the fall of 2025. I read every computer use paper I could find, from Karpathy et al.’s World of Bits paper to GTA-1 and UI-TARS, trying to come up with a new architecture that could work.

I initially tried training this model architecture on AgentNet, but the MSE loss taught the model to always click toward the center of the screen (MSE rewards predicting the average of all plausible targets), and the text decoder was also a big failure.
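The center-click failure can be reproduced in a toy setting. The sketch below is not the actual training setup and uses no AgentNet data. It assumes a hypothetical screen with two equally valid buttons in opposite corners (normalized coordinates) and shows that the constant prediction minimizing MSE is their average, which lands on neither button.

```python
def mse(pred, targets):
    """Mean squared error between one predicted (x, y) click and many target clicks."""
    px, py = pred
    return sum((px - x) ** 2 + (py - y) ** 2 for x, y in targets) / len(targets)


def best_constant_prediction(targets):
    """The single point that minimizes MSE over the targets: their mean."""
    n = len(targets)
    return (sum(x for x, _ in targets) / n, sum(y for _, y in targets) / n)


# Hypothetical data: the same screenshot, where half of the demonstrations
# click a top-left button and the other half click a bottom-right button.
left_button = (0.1, 0.1)
right_button = (0.9, 0.9)
targets = [left_button] * 500 + [right_button] * 500

center = best_constant_prediction(targets)

print("MSE-optimal click:", center)
print("MSE at that point:", mse(center, targets))
print("MSE when clicking a real button:", mse(left_button, targets))
```

The averaged click scores a lower loss than clicking either real button, so a regression head trained this way is pulled toward the middle of the screen whenever the targets are multimodal. Losses that model a distribution over locations (for example, classification over discretized coordinates) are one common way around this, though whether that is the right fix here is an open question.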
The low-effort approach would be to reuse an existing text decoder for typing, so that all our existing infra keeps working, but this immediately fails to drive even a simple WASD-controlled car in an online game. Likewise, many applications, such as SolidWorks or graphic design tools, rely on continuous mouse trajectories, something not a single SOTA CUA model can do.
These are all very easy for kids to learn and do, and there’s something far deeper about embodied AI that I still haven’t figured out. A human can easily adapt to typing with one hand, but our models would fail outright if you “cut off” some of their output heads in computer use.
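To make the WASD point concrete, here is a minimal sketch of why character-by-character typing cannot express held or combined keys. The `FrameAction` type and its fields are hypothetical and invented for illustration. They are not taken from any existing CUA model or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameAction:
    """Hypothetical per-frame action: keys currently held plus a relative mouse move."""
    keys_down: frozenset
    dx: float = 0.0
    dy: float = 0.0


def typed_text_to_frames(text):
    """A text decoder emits characters; each one becomes a single-key, one-frame tap."""
    return [FrameAction(frozenset({ch})) for ch in text]


def max_simultaneous_keys(frames):
    """Largest number of keys held together in any single frame."""
    return max((len(f.keys_down) for f in frames), default=0)


# Typing "wwwd" yields four separate taps, never two keys at once.
typed = typed_text_to_frames("wwwd")

# Driving needs W held throughout, with D added to steer and a small mouse drift.
driving = [
    FrameAction(frozenset({"w"})),
    FrameAction(frozenset({"w"})),
    FrameAction(frozenset({"w", "d"}), dx=0.5),
    FrameAction(frozenset({"w", "d"}), dx=0.5),
]

print(max_simultaneous_keys(typed))
print(max_simultaneous_keys(driving))
```

A per-frame representation like this can also carry continuous mouse deltas, which is what trajectory-heavy tools would need. The cost is that the model has to emit an action every frame rather than an occasional string.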
I then pivoted to games, replicating the Lumine paper’s recipe on Minecraft as well as CS:GO/Atari, using DIAMOND’s world models as an eval for these new models.
I’ve learned a lot about what it takes to train computer use models, with each project teaching me valuable lessons along the way.
I’m excited to explore stronger System 1 models that can be attached to the great VLMs we already have. If successful, this could transform the way we use computers forever!