Active
TESS Models
VLA models for computer use, trained to control desktop applications via mouse and keyboard.
TESS Models are Vision-Language-Action models trained to control computers the way humans do: they take screenshots as input and output mouse and keyboard actions.
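Concretely, the screenshot-in / action-out loop can be sketched as a small schema. The action kinds, field names, and `step` function below are illustrative assumptions, not the models' actual interface:

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical action schema -- the real TESS action space and field
# names are not specified here; this is only a sketch of the idea.
@dataclass
class Action:
    kind: Literal["click", "move", "type", "key", "scroll"]
    x: Optional[int] = None     # screen coordinates for mouse actions
    y: Optional[int] = None
    text: Optional[str] = None  # text payload for "type" actions
    key: Optional[str] = None   # key name for "key" actions, e.g. "Enter"

def step(screenshot_png: bytes) -> Action:
    """One agent step: screenshot in, action out.

    Placeholder for model inference -- this stub always clicks the
    center of a 1920x1080 display.
    """
    return Action(kind="click", x=960, y=540)

action = step(b"\x89PNG...")
```

An agent loop would repeat this step: capture a screenshot, predict an action, execute it with an OS-level input driver, and observe the new screen.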
Current Focus
- Training VLA models on computer-use datasets
- Building infrastructure for large-scale data collection
- Benchmarking against existing computer-use agents
Architecture
TESS uses a vision encoder to process screenshots, a language model to reason over them, and an action head that predicts mouse coordinates and keyboard inputs.