Agents & agentic AI

Browser Agent

A browser agent is an AI-driven program that controls a web browser to automate tasks such as form filling, data extraction, or navigation. It uses a local or remote LLM to interpret instructions, process page content, and generate actions (e.g., click, type, scroll). Operators typically encounter browser agents when pairing a browser automation library like Playwright or Puppeteer with a local LLM served by Ollama or vLLM. The agent captures the page state as a screenshot or a DOM snapshot, sends it to the LLM for reasoning, and executes the returned action. Screenshots require a vision-capable model; a text-only model can only work from the DOM or extracted text. Latency depends on model size and hardware. As a rough guide, a 7B Q4 model on an RTX 4090 yields ~2-5 seconds per action, while a 70B model may take 10-30 seconds.
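To illustrate what a DOM snapshot sent to the LLM might look like, here is a minimal sketch that reduces raw HTML to a short numbered list of interactive elements. It uses only the Python standard library. The tag list, the attribute list, and the names InteractiveElements and summarize_dom are assumptions made for this example, not the behavior of any particular framework.

```python
from html.parser import HTMLParser

# Illustrative choices: which tags count as "interactive" and which
# attributes are worth showing to the model.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}
KEEP_ATTRS = ("id", "name", "type", "href", "placeholder")


class InteractiveElements(HTMLParser):
    """Collects one short line per interactive element, in page order."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE:
            return
        attr_text = "".join(
            f' {key}="{value}"' for key, value in attrs
            if key in KEEP_ATTRS and value
        )
        self.lines.append(f"[{len(self.lines)}] <{tag}{attr_text}>")


def summarize_dom(html: str) -> str:
    """Return a compact, numbered summary of the page's interactive elements."""
    parser = InteractiveElements()
    parser.feed(html)
    return "\n".join(parser.lines)
```

A summary like this is far shorter than the full page source, which helps on long pages, at the cost of dropping visible text the model might need for some tasks.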

Deeper dive

Browser agents extend LLM capabilities to interact with web interfaces. The workflow: 1) the agent loads a target URL, 2) it captures the current page state (screenshot or HTML DOM), 3) it sends that state with a task prompt to the LLM, 4) the LLM outputs a structured action (e.g., click(button#submit)), 5) the agent executes the action and repeats from step 2 until the task is done. Key challenges: grounding (mapping the LLM's output to the right page element), context window limits (a long page can exceed a 4K-32K token window), and latency (each action requires a full inference pass). Operators often use smaller quantized models (e.g., Qwen2.5 7B Q4) for speed, or larger models (e.g., Llama 3.1 70B) for complex reasoning. Agent frameworks like Browser-Use and LangChain can connect to local LLM backends, with a library such as Playwright typically serving as the browser automation layer.
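The loop described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any specific framework implements it: the JSON action format, the Action and FakePage classes, and the run_agent function are all hypothetical. FakePage stands in for a real browser so the loop can run without one, and llm is any callable that maps a prompt string to a reply string.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    name: str            # "click", "type", or "done"
    selector: str = ""
    text: str = ""


@dataclass
class FakePage:
    """Stand-in for a real browser page; it only records what was done."""
    html: str
    log: list = field(default_factory=list)

    def snapshot(self) -> str:
        return self.html

    def click(self, selector: str) -> None:
        self.log.append(("click", selector))

    def type_text(self, selector: str, text: str) -> None:
        self.log.append(("type", selector, text))


def parse_action(raw: str) -> Action:
    """Parse a JSON reply such as {"name": "click", "selector": "#submit"}."""
    data = json.loads(raw)
    if data.get("name") not in {"click", "type", "done"}:
        raise ValueError(f"unsupported action: {data!r}")
    return Action(data["name"], data.get("selector", ""), data.get("text", ""))


def run_agent(page, llm: Callable[[str], str], task: str,
              max_steps: int = 10, max_chars: int = 8000) -> list:
    """Observe, ask the model, act; stop on "done" or after max_steps."""
    history = []
    for _ in range(max_steps):
        state = page.snapshot()[:max_chars]   # crude guard against long pages
        prompt = f"Task: {task}\nPage:\n{state}\nReply with one JSON action."
        action = parse_action(llm(prompt))    # one inference pass per step
        history.append(action)
        if action.name == "done":
            break
        if action.name == "click":
            page.click(action.selector)
        else:
            page.type_text(action.selector, action.text)
    return history
```

Capping the number of steps and truncating the page state are the simplest possible guards against runaway loops and context overflow; a real agent would likely use something smarter for both.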

Practical example

An operator runs a browser agent to automate logging into a web app. The agent uses Playwright with an Ollama-served Qwen2.5 7B Q4 model on an RTX 3090. The agent navigates to the login page and captures the page state. Because Qwen2.5 7B is a text-only model, this should be a DOM snapshot rather than a screenshot. The LLM then outputs type('#username', 'admin'), then type('#password', 'pass123'), then click('#login-btn'). Each action takes ~3 seconds. If the page has a CAPTCHA, the agent will likely fail: a text-only LLM cannot see the challenge, and a vision-capable model would be needed even to attempt it.
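Action strings like the ones in this example should be validated before anything touches the browser, since model output is untrusted. The sketch below parses them with Python's ast module instead of eval and accepts only an allowlist of actions. The names ALLOWED, parse_call, and execute are hypothetical. The execute function assumes a page object with fill(selector, value) and click(selector) methods, which is how Playwright's page object would typically be used here; mapping the agent's type action onto fill is a choice made for this example.

```python
import ast

# Action name -> expected number of string arguments (illustrative allowlist).
ALLOWED = {"click": 1, "type": 2}


def parse_call(text: str):
    """Parse "type('#username', 'admin')" into ("type", ["#username", "admin"])."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not valid call syntax: {text!r}") from exc
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"not a simple call: {text!r}")
    name = node.func.id
    if name not in ALLOWED or node.keywords:
        raise ValueError(f"unsupported action: {text!r}")
    # literal_eval only accepts literals, so nested calls or names are rejected.
    args = [ast.literal_eval(arg) for arg in node.args]
    if len(args) != ALLOWED[name] or not all(isinstance(a, str) for a in args):
        raise ValueError(f"bad arguments: {text!r}")
    return name, args


def execute(page, text: str) -> None:
    """Run one validated action against a page with fill() and click() methods."""
    name, args = parse_call(text)
    if name == "type":
        page.fill(args[0], args[1])
    else:
        page.click(args[0])
```

Anything outside the allowlist, such as an attribute call or a non-string argument, raises ValueError instead of being executed.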

Workflow example

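In a typical setup, an operator installs browser-use and Ollama, pulls qwen2.5:7b, and runs their own script, for example python agent.py --task "book a flight on kayak.com". The script launches a Chromium window, iteratively captures the page state (screenshots only if the model is vision-capable), sends it to Ollama's API, and executes the returned actions. The operator monitors tokens/sec, for example the eval rate printed by ollama run --verbose, and raises the context length if the page is large. In Ollama this is the num_ctx parameter (e.g., num_ctx 8192), set in a Modelfile or in the options field of an API request rather than through a command-line flag. If the agent stalls, the operator may switch to a larger model or reduce task complexity.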
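As a rough sketch of the Ollama side of this setup, the snippet below builds a request for Ollama's /api/generate endpoint with a larger context window and derives tokens/sec from the response. The endpoint, the options.num_ctx field, and the eval_count and eval_duration response fields (the latter in nanoseconds) come from Ollama's REST API; the function names and the default port in OLLAMA_URL are choices made for this example.

```python
import json
import urllib.request

# Assumes Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Request body for a single non-streaming completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }


def tokens_per_second(response: dict) -> float:
    """Generation speed; eval_duration is reported in nanoseconds."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def generate(model: str, prompt: str, num_ctx: int = 8192,
             url: str = OLLAMA_URL) -> dict:
    """Send the request to a running Ollama server and return the parsed JSON."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

With a server running, generate("qwen2.5:7b", "...") returns a dict whose response field holds the model's text, and passing that dict to tokens_per_second gives the generation rate for that call.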

Reviewed by Fredoline Eruo. See our editorial policy.
