Computer-Use Agents

Agents that operate desktop applications by reading screenshots and driving the mouse and keyboard. Examples include the Anthropic Computer Use API, OS-Atlas, and ShowUI.

Setup walkthrough

Install Ollama, then run ollama pull qwen2.5vl:7b (~5 GB; a vision model is required to read desktop screenshots).
pip install ollama pyautogui pillow (Ollama Python client, screenshot capture, and mouse/keyboard control).
Basic computer-use agent loop:

import io
import time

import ollama
import pyautogui

def computer_use_agent(task, max_steps=10):
    for step in range(max_steps):
        # Capture the screen and encode it as PNG. Ollama expects an encoded
        # image (PNG/JPEG bytes), not the raw pixel buffer from tobytes().
        screenshot = pyautogui.screenshot()
        buf = io.BytesIO()
        screenshot.save(buf, format="PNG")
        resp = ollama.chat(model="qwen2.5vl:7b", messages=[{
            "role": "user",
            "content": f"Task: {task}\nDescribe what you see and what action to take next. Format: ACTION: click(x,y) or ACTION: type('text') or ACTION: done",
            "images": [buf.getvalue()],
        }])
        action = resp["message"]["content"]
        # Parse action and execute via pyautogui
        print(f"Step {step}: {action}")
        # Stop only on an explicit "ACTION: done", not on any mention of "done"
        if "action: done" in action.lower():
            break
        time.sleep(2)  # give the UI time to settle before the next screenshot

computer_use_agent("Open Notepad, type 'Hello World', save to Desktop as hello.txt")

Expect the first agent run to take 30-90 seconds for a 3-5 step task on a 12 GB GPU. The VLM analyzes each screenshot and decides the next action.
For production, use OS-Copilot or UFO (Microsoft's Windows agent framework), which add accessibility-tree reading and grounding for higher reliability than a screenshot-only approach.
Reality: as of early 2026, computer-use agents are research-grade. They succeed on simple tasks ~70% of the time and fail on complex UIs (modal dialogs, drag-and-drop, multi-monitor).
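
The loop in this walkthrough leaves action parsing as a comment. Below is a minimal sketch of that step, assuming the model actually follows the ACTION format given in the prompt. The names parse_action and execute are hypothetical helpers, and taking the last ACTION line when a reply contains several is a design assumption. The pyautogui module is passed in as a parameter so the parsing logic can be tried without a display.

```python
import re

# Matches the three directives named in the prompt:
# click(x,y), type('text'), done.
_ACTION_RE = re.compile(
    r"ACTION:\s*(?:"
    r"(?P<click>click)\(\s*(?P<x>\d+)\s*,\s*(?P<y>\d+)\s*\)"
    r"|(?P<type>type)\(\s*(?P<q>['\"])(?P<text>.*?)(?P=q)\s*\)"
    r"|(?P<done>done)\b"
    r")",
    re.IGNORECASE | re.DOTALL,
)

def parse_action(reply):
    """Return ("click", x, y), ("type", text), ("done",) or None."""
    matches = list(_ACTION_RE.finditer(reply))
    if not matches:
        return None  # model ignored the format; caller should retry or stop
    m = matches[-1]  # assumption: if several directives appear, use the last
    if m.group("click"):
        return ("click", int(m.group("x")), int(m.group("y")))
    if m.group("type"):
        return ("type", m.group("text"))
    return ("done",)

def execute(action, gui):
    """Run a parsed action; `gui` is the pyautogui module (or a stand-in).

    Returns False when the loop should stop.
    """
    if action[0] == "click":
        gui.click(action[1], action[2])
    elif action[0] == "type":
        gui.write(action[1], interval=0.05)
    return action[0] != "done"
```

Inside the loop this would replace the parsing comment: call parse_action on the reply, stop if it returns None or execute returns False.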

The cheap setup

Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb), paired with a Ryzen 5 5600, 16 GB DDR4, and a 512 GB NVMe drive. Total: ~$360-405. It runs Qwen2.5-VL 7B at 5-10 seconds per screenshot analysis, so a 5-step task completes in 1-2 minutes. For automating repetitive desktop tasks (file organization, data entry, screenshot annotation), a ~$400 build is viable for tasks you'd otherwise spend 30+ minutes on. At this price, computer-use agents work for simple, well-defined, repeatable tasks. They fail at novel tasks and complex multi-app workflows.

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Total: ~$1,800-2,200. Runs Qwen2.5-VL 72B at 10-20 seconds per screenshot. A 4-bit 72B model needs roughly 36 GB for weights alone, so on a single 24 GB card part of the model has to be offloaded to system RAM. The 72B offers dramatically better UI understanding, element grounding, and error recovery. For replacing RPA (robotic process automation) with AI agents, the 72B correctly navigates enterprise apps (SAP, Salesforce, Oracle) that confuse 7B models. Computer-use agents are one of the few tasks where the jump from 7B to 72B is transformative: the larger model correctly reads error dialogs, dropdown menus, and nested tabs that the 7B misidentifies.

Common beginner mistake

The mistake: Running a computer-use agent on your main desktop while you're also using it. The agent randomly clicks on your browser, closes your tabs, and moves your files.
Why it fails: The agent sees screenshots of the entire screen. It doesn't know which windows are "yours" and which are its workspace. If you open Slack while the agent is running, it might click on a message, type garbage, or send a message.
The fix: Run the agent in a VM or a dedicated workspace. Windows Sandbox (built into Windows Pro) or a VirtualBox VM provides an isolated desktop. The agent can do whatever it wants in the VM because it can't touch your real files. For tasks that need your real desktop, quit all other apps before running the agent. Alternatively, use a dedicated computer (such as an old laptop) as the agent's workspace. The agent has the impulse control of a toddler, so don't give it access to anything you care about.
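
Isolation in a VM or on a separate machine remains the main fix. As an additional and much weaker safeguard inside the agent itself, one could refuse any click that lands outside a rectangle reserved for the agent. A minimal sketch of that idea follows; Workspace and guarded_click are hypothetical names, and the click function is passed in (it would be pyautogui.click in a real agent) so the guard can be exercised without moving the real mouse.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workspace:
    """Screen rectangle (in pixels) the agent is allowed to click in."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x, y):
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)

def guarded_click(x, y, workspace, click):
    """Call click(x, y) only if the point lies inside the workspace.

    Returns True if the click was performed, False if it was dropped.
    """
    if not workspace.contains(x, y):
        return False  # drop the action instead of touching other windows
    click(x, y)
    return True
```

This only filters mouse clicks. Typed text still goes to whichever window has focus, which is why a guard like this complements a VM and does not replace it.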

Recommended setup for computer-use agents

Recommended hardware

Best GPU for local AI →

All workloads ranked across VRAM tiers.

Recommended runtimes

Browse all tools for runtimes that fit this workload.

Budget build

AI PC under $1,000 →

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

Buying for spec-sheet VRAM without modeling KV cache + activation overhead
Underestimating quantization quality loss below Q4
Skipping flash-attention support (real perf gap on long context)
Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
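
The first item can be made concrete with a rough memory budget: quantized weights, plus KV cache, plus a margin for activations and runtime buffers. The sketch below is back-of-the-envelope arithmetic only. The layer and head counts are illustrative placeholders for 7B-class and 72B-class models with grouped-query attention, the 1.5 GB overhead is an assumed figure and not a measurement, and the vision encoder is ignored.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Keys and values for every layer and token (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

def fits(vram_gb, params_billion, bits_per_weight, layers, kv_heads, head_dim,
         context_tokens, overhead_gb=1.5):
    """Return (fits_in_vram, estimated_total_gb)."""
    total = (weights_gb(params_billion, bits_per_weight)
             + kv_cache_gb(layers, kv_heads, head_dim, context_tokens)
             + overhead_gb)
    return total <= vram_gb, round(total, 2)

# Illustrative 7B-class model, 4-bit weights, 8k-token context, 12 GB card.
small = fits(12, 7, 4, layers=28, kv_heads=4, head_dim=128, context_tokens=8192)
# Illustrative 72B-class model, 4-bit weights, 8k-token context, 24 GB card.
large = fits(24, 72, 4, layers=80, kv_heads=8, head_dim=128, context_tokens=8192)
print(small, large)
```

Under these assumptions the 7B-class model leaves headroom on 12 GB, while the 72B-class estimate exceeds 24 GB, so part of it would spill to system RAM.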

Before you buy

Verify your specific hardware can handle computer-use agents before committing money.

Hardware buying guidance for Computer-Use Agents

Agent workflows run multiple tool calls in sequence — sustained tok/s matters more than peak. The guides below frame the buyer decision.

best GPU for AI agents — covers sustained-throughput vs peak, multi-tool-call latency, agent loop economics.
best GPU for Qwen
best GPU for Llama
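
Why sustained throughput dominates can be seen from a rough per-step latency model: every step pays prefill for the prompt and screenshot tokens, then decode for the reply, and the loop repeats that cost for each step. The sketch below uses made-up token counts and throughput figures; all numbers are assumptions for illustration, not benchmarks, and the 2-second pause mirrors the sleep in the basic agent loop.

```python
def step_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Latency of one agent step: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

def task_seconds(steps, pause_s, **step_kwargs):
    """Total loop time: per-step model latency plus a fixed pause per step."""
    return steps * (step_seconds(**step_kwargs) + pause_s)

# Hypothetical workload: 1,500 prompt+image tokens and 120 output tokens per step.
peak = task_seconds(5, 2, prompt_tokens=1500, output_tokens=120,
                    prefill_tps=1000, decode_tps=40)
# Same workload after the GPU has throttled to lower sustained speeds.
throttled = task_seconds(5, 2, prompt_tokens=1500, output_tokens=120,
                         prefill_tps=600, decode_tps=20)
print(f"peak: {peak:.1f}s, sustained (throttled): {throttled:.1f}s")
```

Because the per-step cost is multiplied by the number of steps, a drop in sustained decode speed lengthens the whole task, not just one response.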

Related tasks

Browser Agents
UI / Screenshot Analysis
