How to build a research agent that browses the web
Web search API, browser automation tool, LLM
What this does
A research agent automates the task of gathering information from the web, extracting key details from multiple sources, and generating a structured report with citations. It searches for relevant pages, scrapes their content, synthesizes findings using an LLM, and outputs a formatted Markdown report.
Steps
Create a search.py module that wraps a web search API. Implement a function search(query, num_results=10) that returns a list of dictionaries with the keys title, url, snippet, and published_date. Read the API key from the SEARCH_API_KEY environment variable, and rate-limit calls with a token bucket capped at 10 requests per minute.
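A minimal sketch of what search.py could look like, assuming nothing about the search provider. The fetch callable is hypothetical and stands in for the real API client; TokenBucket and normalize are illustrative names, not part of any library.

```python
import os
import threading
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled continuously at capacity/per_seconds."""

    def __init__(self, capacity=10, per_seconds=60.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = capacity / per_seconds  # tokens added per second
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()
        self.lock = threading.Lock()

    def try_acquire(self):
        """Take one token if available; return False rather than blocking."""
        with self.lock:
            now = self.clock()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def acquire(self, poll=0.1):
        """Block until a token is available."""
        while not self.try_acquire():
            time.sleep(poll)


_bucket = TokenBucket(capacity=10, per_seconds=60.0)


def normalize(item):
    """Map one raw result onto the four keys the guide specifies."""
    return {
        "title": item.get("title", ""),
        "url": item.get("url", ""),
        "snippet": item.get("snippet", ""),
        "published_date": item.get("published_date"),
    }


def search(query, num_results=10, fetch=None):
    """Return up to num_results normalized results.

    `fetch` is a hypothetical callable (query, num_results, api_key) -> list
    of dicts; replace it with your provider's client.
    """
    api_key = os.environ.get("SEARCH_API_KEY")
    if not api_key:
        raise RuntimeError("SEARCH_API_KEY is not set")
    if fetch is None:
        raise NotImplementedError("plug in a search provider client")
    _bucket.acquire()  # wait for a rate-limit token before calling out
    return [normalize(r) for r in fetch(query, num_results, api_key)][:num_results]
```

Injecting the clock into TokenBucket keeps the limiter testable without real sleeps. With these settings, a burst of 10 calls passes immediately and later calls are spaced about 6 seconds apart.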
Create a scraper.py module using Playwright to load each returned URL and extract visible text content. For each URL, launch a headless Chromium instance, navigate to the page, and wait for network idle. Strip boilerplate such as navigation, footers, and ads by removing HTML elements that match a blocklist, then retrieve the remaining text with page.inner_text("body"). Chunk the resulting text into segments of up to 2000 tokens and store them alongside the source URL.
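One way the scraper could be sketched with Playwright's synchronous API. Several details here are assumptions: the BLOCKLIST selectors are illustrative, tokens are approximated by whitespace-separated words instead of a real tokenizer, and a single browser is reused across URLs instead of launching one per URL, which is usually cheaper.

```python
# Illustrative blocklist; tune the selectors for the sites you scrape.
BLOCKLIST = ["nav", "header", "footer", "aside", "script", "style",
             "[class*='advert']"]

# Runs in the page: removes every element matching a blocklisted selector.
REMOVE_JS = """(selectors) => {
  for (const sel of selectors) {
    document.querySelectorAll(sel).forEach((el) => el.remove());
  }
}"""


def chunk_text(text, url, max_tokens=2000):
    """Split text into chunks of at most max_tokens words (a rough token proxy)."""
    words = text.split()
    return [
        {"url": url, "text": " ".join(words[i:i + max_tokens])}
        for i in range(0, len(words), max_tokens)
    ]


def scrape(urls, timeout_ms=30000):
    """Load each URL in headless Chromium and return a list of text chunks."""
    # Imported lazily so chunk_text stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    chunks = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            for url in urls:
                page = browser.new_page()
                try:
                    page.goto(url, wait_until="networkidle", timeout=timeout_ms)
                    page.evaluate(REMOVE_JS, BLOCKLIST)  # strip before reading text
                    chunks.extend(chunk_text(page.inner_text("body"), url))
                except Exception as exc:  # keep going if one page fails
                    print(f"skipping {url}: {exc}")
                finally:
                    page.close()
        finally:
            browser.close()
    return chunks
```

The blocklisted elements are removed before inner_text is called, because text that has already been extracted no longer carries the element boundaries needed to strip it.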
Create a synthesizer.py module that receives a research query and the list of scraped chunks. Build a prompt that includes the query, the chunk list formatted with source URLs as headings, and instructions to produce a structured report with sections for Executive Summary, Key Findings, and References, citing sources inline using bracket notation. Send the prompt to the LLM and stream the response.
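The prompt-building part of this step might look like the sketch below. It assumes each chunk is a dictionary with url and text keys and that bracket notation means numbered citations such as [1]; both are assumptions made for illustration. The LLM call and streaming are left out because they depend on the provider.

```python
def build_prompt(query, chunks):
    """Return (prompt, index), where index maps each source URL to its number."""
    index = {}
    for chunk in chunks:
        # Number sources in first-seen order; repeated URLs keep their number.
        index.setdefault(chunk["url"], len(index) + 1)

    lines = [
        f"Research query: {query}",
        "",
        "Write a structured report with the sections Executive Summary, "
        "Key Findings, and References.",
        "Cite sources inline with bracket notation such as [1], using the "
        "source numbers below. Use only the sources provided.",
        "",
    ]
    for chunk in chunks:
        # Each chunk sits under a heading that carries its source URL.
        lines.append(f"## [{index[chunk['url']]}] {chunk['url']}")
        lines.append(chunk["text"])
        lines.append("")
    return "\n".join(lines), index
```

Returning the index alongside the prompt lets a later step map citation numbers in the response back to URLs.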
Format the LLM response into a Markdown document by appending a References section listing all cited URLs in order of appearance. Include metadata at the top of the file: research query, date, and number of sources. Save the report as research_report_{timestamp}.md in the reports/ directory. Implement a post-processing step that verifies every citation in the body appears in the References section, logging a warning if orphan citations are found.
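A sketch of the formatting and orphan-citation check, under stated assumptions: citations are numeric like [1], index is a hypothetical mapping from source URL to citation number, and body is the model output without a model-written References section.

```python
import logging
import re
from datetime import datetime, timezone

log = logging.getLogger("research_agent")
CITATION = re.compile(r"\[(\d+)\]")


def cited_numbers(body):
    """Citation numbers found in the body, in order of first appearance."""
    seen = []
    for match in CITATION.finditer(body):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def format_report(query, body, index, now=None):
    """Return (markdown_text, orphans) for a report body and URL->number index."""
    now = now or datetime.now(timezone.utc)
    by_number = {n: url for url, n in index.items()}
    cited = cited_numbers(body)

    # An orphan is a citation in the body with no matching source.
    orphans = [n for n in cited if n not in by_number]
    if orphans:
        log.warning("orphan citations with no matching source: %s", orphans)

    refs = [f"[{n}] {by_number[n]}" for n in cited if n in by_number]
    header = [f"Query: {query}", f"Date: {now:%Y-%m-%d}", f"Sources: {len(refs)}"]
    parts = header + ["", body.strip(), "", "## References", ""] + refs
    return "\n".join(parts) + "\n", orphans
```

The returned text can then be written to the reports/ directory. Returning the orphan list as well as logging it makes the check easy to assert on in tests.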
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
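Run evidence could be captured with a small helper like the one below. This is a sketch: the field names and the default package list are assumptions, and the model name is passed in by the caller because the guide does not fix a provider.

```python
import json
import platform
import sys
from importlib import metadata


def run_evidence(command, model_name=None, packages=("playwright",)):
    """Collect the details needed to reproduce a run as a JSON-friendly dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record the absence rather than failing
    return {
        "command": command,
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": versions,
        "model": model_name,
    }


if __name__ == "__main__":
    evidence = run_evidence(sys.argv)
    print(json.dumps(evidence, indent=2))
```

Storing the JSON next to the generated report keeps the command, versions, and output together.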
Verification
Run the research agent with a test query: python -m research_agent --query "impact of transformer architecture on NLP benchmarks 2023". Expected output:
- A report file is generated in the reports/ directory with a filename matching research_report_*.md.
- The report contains Executive Summary, Key Findings, and References sections.
- Each Key Finding contains at least one inline citation.
- The References section lists all cited URLs as valid HTTP links.
- The post-processing log confirms all citations resolved.
- The headless browser launches without console errors.
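Part of this checklist can be automated. The sketch below is a coarse check that assumes numeric bracket citations; it confirms the three sections exist, that the text before References has at least one citation, and that References contains at least one URL. It does not verify each Key Finding individually.

```python
import re
from pathlib import Path

REQUIRED_SECTIONS = ("Executive Summary", "Key Findings", "References")


def check_report(text):
    """Return a list of problems found in a report; empty means it passed."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in text]

    # Treat everything from the last "References" onward as the reference list.
    refs_at = text.rfind("References")
    body = text[:refs_at] if refs_at != -1 else text
    refs = text[refs_at:] if refs_at != -1 else ""

    if not re.search(r"\[\d+\]", body):
        problems.append("no inline citations in body")
    if not re.search(r"https?://\S+", refs):
        problems.append("no URLs in References")
    return problems


def latest_report(directory="reports"):
    """Return the newest report by filename order, or None if there is none."""
    files = sorted(Path(directory).glob("research_report_*.md"))
    return files[-1] if files else None


if __name__ == "__main__":
    path = latest_report()
    if path is None:
        print("no report found")
    else:
        print(check_report(path.read_text(encoding="utf-8")) or "report looks OK")
```

Filename order matches creation order only while the timestamps have the same number of digits, which holds for Unix timestamps in practice.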
Common failures
- Search API returning empty results: The query is overly specific or the API key has no remaining quota. Validate the API key on startup and add a fallback that rewrites overly specific queries with broader terms.
- Page content blocked by JavaScript rendering: Static HTTP requests to single-page applications return empty bodies. Enforce the use of Playwright for all scrapes and increase the network idle timeout to 5 seconds.
- Rate limiting on target sites: Scraping the same domain repeatedly triggers HTTP 403 Forbidden responses or CAPTCHA challenges. Add a domain-level cooldown so no more than 1 request per domain is made within 60 seconds.
- LLM omitting required citations: The model produces a report without citing specific sources. Include a one-shot example in the synthesis prompt demonstrating the exact citation format.
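- Report filename collision: Multiple runs within the same second overwrite the previous report. Use a Unix timestamp suffix that includes milliseconds to make collisions very unlikely, and open the file in exclusive-create mode so any remaining collision fails loudly instead of overwriting.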
- Boilerplate stripping too aggressive: Legitimate content inside article or section tags is removed. Check the length of the text after stripping and fall back to the unstripped body text if it drops below 200 characters.
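Two of the mitigations above, the domain-level cooldown and the millisecond filename suffix, can be sketched as follows. DomainCooldown and report_name are hypothetical helper names, and the cooldown keys on the URL's host, which is an assumption about what counts as a domain.

```python
import time
from urllib.parse import urlparse


class DomainCooldown:
    """Tracks the last request time per host to enforce a minimum interval."""

    def __init__(self, min_interval=60.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last = {}  # host -> time of the most recent request

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be requested again."""
        last = self.last.get(urlparse(url).netloc.lower())
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def mark(self, url):
        """Record that a request to this URL's host was just made."""
        self.last[urlparse(url).netloc.lower()] = self.clock()


def report_name(now=None):
    """Build a report filename with a millisecond Unix timestamp suffix."""
    ms = time.time_ns() // 1_000_000 if now is None else int(now * 1000)
    return f"research_report_{ms}.md"
```

A scraper loop would sleep for wait_time(url) before each request and call mark(url) right after it. Writing the report with open(path, "x") raises FileExistsError if a file with that name already exists, so a rare collision is detected instead of silently overwriting.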
Related guides
- Implement agent planning and task decomposition — search, scrape, and synthesis sub-tasks align with a decomposition graph.
- Set up agent-human collaboration workflows — the final synthesized report can be routed through human review before being shared externally.