System guide · Architecture

How agent execution systems actually work

The architecture engineers need to understand before betting a deployment on an agent system. Planning loops, tool dispatch, sandbox isolation, MCP integration, memory integration — explained at protocol depth, along with the different architectural choices OpenHands / OpenClaw / Goose / Aider / Cline / Continue make.

By Fredoline Eruo·Last reviewed 2026-05-06·~21 min read

What an agent execution system actually is

An agent execution system is a runtime that takes a task description and produces real actions in the world — files edited, tests run, commits made, queries executed — by orchestrating an LLM's output into a loop of plan-act-observe steps. The defining property is autonomy across multiple tool calls: a chat interface with a tool-calling LLM is not an agent execution system; an agent execution system explicitly drives a multi-step task to completion (or to a checkpoint where the user is asked to confirm).

The category emerged through 2023-2024 and matured through 2025-2026. The 2026 leaders — OpenHands, OpenClaw, Goose, Aider, Cline, Continue — make different architectural choices across four axes: planning loop shape, tool dispatch model, sandbox isolation, and memory + MCP integration. This page explains all four axes and shows where each tool sits.

What it isn't

  • A chat interface — even one with tool calling. Open WebUI, Claude Desktop chat, ChatGPT — these are conversational UIs that may invoke tools, but the user drives each turn. An agent execution system drives the turns itself.
  • An autocomplete in your IDE — Continue's autocomplete feature, GitHub Copilot inline suggestions. Agent execution is multi-step task completion; autocomplete is single-position prediction.
  • A long-running script that calls an LLM — even one with retries and tool calls. Without a planning primitive (a place where the agent reasons about what to do next, not just what to predict), it's a workflow engine with an LLM step, not an agent.

Architecture: planning loop, tool dispatch, sandbox executor

Every modern agent execution system has the same three components; systems differ in how each one is implemented:

The planning loop is where the agent decides what to do next. Two architectural shapes dominate: ReAct (think → act → observe → repeat) and Anthropic-style (thinking → planning → execution decomposition with explicit reasoning blocks). ReAct is older, lighter, faster per turn; Anthropic-style is heavier but produces higher-quality plans on complex tasks. OpenHands and Goose use ReAct-style; OpenClaw uses Anthropic-style; Aider uses a custom git-aware variant.

The tool dispatcher is what the agent calls to take action. Two patterns: monolithic built-in tools (shell, file edit, browser exposed as agent-native functions) and MCP-first dispatch (every tool is an MCP server, including built-ins, dispatched through the same protocol). Goose is the most aggressive MCP-first design; OpenHands and OpenClaw are MCP-friendly but mix in built-ins; Aider is monolithic.

The sandbox executor determines what isolation boundary the agent operates within. Modes: Docker container (default for OpenHands, OpenClaw — isolation per agent run), chroot (lighter weight), native fork (no isolation; trust the user has configured permissions correctly), and none (the agent runs in the same process as the IDE — Cline, Continue, Aider). Pick the strongest sandbox you can tolerate; drop to a weaker one only when the agent needs direct filesystem access to your live working tree (which is most of the time for coding agents).
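A minimal sketch of how the three components compose. The interface names (PlanningLoop, ToolDispatcher, SandboxExecutor) are illustrative, not any tool's actual API:

```python
# Hypothetical interfaces -- real systems name these differently, but the
# composition is the same: the loop plans, the dispatcher routes actions,
# the executor bounds their blast radius.
from typing import Optional, Protocol

class ToolDispatcher(Protocol):
    def dispatch(self, tool: str, args: dict) -> str: ...

class SandboxExecutor(Protocol):
    def run(self, command: str) -> str: ...  # executes inside the isolation boundary

class PlanningLoop(Protocol):
    # Returns the next (tool, args) action, or None when the task is done.
    def step(self, observation: str) -> Optional[tuple[str, dict]]: ...

def run_agent(loop: PlanningLoop, dispatcher: ToolDispatcher) -> None:
    observation = ""
    while (action := loop.step(observation)) is not None:
        tool, args = action
        # Shell-type tools route through the SandboxExecutor internally.
        observation = dispatcher.dispatch(tool, args)
```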

ReAct vs Anthropic-style reasoning loops

The most operationally significant architectural choice. ReAct loops are simpler, faster per turn, and produce shorter outputs. Each turn is a decision: think (one to three sentences of reasoning), act (one tool call), observe (the tool result). Repeat until done.
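A minimal sketch of the ReAct shape, assuming an OpenAI-compatible endpoint (a local vLLM server here); react_loop, execute_tool, and the model name are illustrative, not any agent's actual API:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def react_loop(task: str, tools: list[dict], execute_tool, max_turns: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="qwen2.5-coder-32b", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no action requested: the agent considers itself done
            return msg.content or ""
        messages.append(msg)  # keep the think/act turn in context
        for call in msg.tool_calls:  # act, then observe
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
    raise RuntimeError("max turns exceeded without completion")
```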

Anthropic-style loops emit explicit <think> blocks before any action — often 200-2000 tokens of intermediate reasoning the user never sees. Inside the thinking block the agent decomposes the task into subgoals, considers alternatives, picks an approach, then emits the actual action. The cost: 10-20% more tokens per task. The benefit: planning quality on complex tasks (refactoring, multi-file changes, ambiguous requirements) is meaningfully higher.
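The mechanical difference shows up in output parsing. A hedged sketch of handling one Anthropic-style turn (the tag name and framing vary by model and runtime):

```python
import re

def split_turn(raw: str) -> tuple[str, str]:
    """Split one model turn into (hidden reasoning, visible action)."""
    m = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    if m is None:
        return "", raw.strip()  # no reasoning block: treat the whole turn as action
    return m.group(1).strip(), raw[m.end():].strip()

thinking, action = split_turn(
    "<think>Subgoals: locate the expiry check; compare it to the token TTL; "
    "fix the off-by-one.</think>\n"
    'filesystem.read_file("auth/validation.ts")'
)
# 'thinking' is logged but never shown; 'action' is dispatched as a tool call.
```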

The honest decision rule: ReAct for surgical tasks, Anthropic-style for autonomous multi-step work. Don't pay the reasoning-block tax on simple edits; don't skip it on tasks where the agent will go off the rails without explicit planning.

Workflow: how a task flows through the agent

Concrete example — a task arrives at OpenHands with vLLM as the runtime + Mem0 as the memory layer + filesystem/git MCP servers wired in:

  1. Task ingest. User submits “fix the failing auth-token validation tests.” OpenHands attaches the system prompt + tool schemas + (if Mem0 is enabled) retrieved memory chunks from prior sessions.
  2. Planning step. The model emits a plan: read related files, identify the failure cause, propose the fix, run tests. With Planning Mode enabled, this plan is shown to the user for approval before execution.
  3. Tool dispatch. The agent calls filesystem.read_file("auth/validation.ts") via MCP. The tool dispatcher routes this to the running filesystem MCP server (stdio transport), gets the response, returns it to the agent. (The wire-level shape of this call is sketched after this list.)
  4. Iteration. The agent reads more files, calls git.diff(), identifies the issue, calls filesystem.write_file() to apply the fix, then shell.run("npm test") to verify.
  5. Memory write-back. At session close, Mem0 consolidates this session's episode into the memory store. The next session can retrieve “we fixed auth-token expiry handling on May 6” if the user asks.
  6. Sandbox isolation. All file reads and shell commands run inside the configured Docker container; if the agent goes off the rails (deletes everything in /tmp, for instance), the blast radius is the container.
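What step 3 looks like on the wire, assuming a stdio MCP server speaking newline-delimited JSON-RPC 2.0. A real client performs the initialize handshake before any tools/call; this sketch shows only the shape of the round trip:

```python
import json
import subprocess

# The official filesystem MCP server, rooted at the mounted workspace.
server = subprocess.Popen(
    ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/workspace"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

# NOTE: initialize / initialized handshake omitted for brevity.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "auth/validation.ts"}},
}
server.stdin.write(json.dumps(request) + "\n")  # newline-delimited framing
server.stdin.flush()
response = json.loads(server.stdout.readline())  # tool result, back to the agent loop
```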

Tool dispatch — built-ins vs MCP

The split between “the agent has built-in tools that ship with the runtime” and “every tool is an external MCP server” matters for two reasons:

Extensibility. Agents with MCP dispatch (Goose most aggressively; OpenHands and OpenClaw too) can add new capabilities by installing MCP servers; monolithic agents (Aider) can't. If your team needs to add “query our internal Postgres” or “read our Notion workspace,” MCP-capable agents handle this with a config change (example below). Monolithic agents need a fork.
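What that config change can look like: a hypothetical snippet following the mcpServers convention several MCP clients use. The exact file location, schema, and connection string vary by tool:

```json
{
  "mcpServers": {
    "internal-postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://readonly@db.internal:5432/app"
      ]
    }
  }
}
```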

Failure surface. A bad MCP server can hang the agent loop. Monolithic agents are immune to this class of failure but pay for it with a smaller capability surface. MCP-first agents need a watchdog pattern — either at the stdio-process level (kill MCP servers that don't respond within a timeout) or at the dispatcher level (skip tools whose servers have failed health checks).
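One way to implement the stdio-process-level watchdog: give every MCP round trip a hard deadline and kill the server process on breach. A minimal sketch; call_with_watchdog and send_request are illustrative names:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as WatchdogTimeout

def call_with_watchdog(server: subprocess.Popen, send_request, timeout_s: float = 30.0):
    """Run one blocking JSON-RPC round trip; kill the server if it stalls."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(send_request, server)  # blocking round trip in a worker
        try:
            return future.result(timeout=timeout_s)
        except WatchdogTimeout:
            server.kill()  # unresponsive server: kill it so the agent loop can continue
            raise RuntimeError("MCP server breached watchdog deadline; marking tool failed")
```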

The trend: MCP-first is winning. The 2026 leaders all support MCP; the question is whether MCP is the primary dispatch mechanism (Goose) or one of several (OpenHands, OpenClaw). For teams building seriously, default to the MCP-first architecture.

Sandbox isolation

Autonomous agents do unexpected things. Filesystem deletes, unexpected shell commands, accidentally pushing to main — the failure stories are real and frequent. The sandbox boundary determines blast radius:

  • Docker container per session (OpenHands default). Blast radius = the container. Filesystem access is to a mounted working directory; shell commands run inside the container. If the agent goes off the rails, you can throw the container away. The right default for production deployments; a launch sketch follows at the end of this section.
  • chroot. Lighter than Docker; harder to escape than “native”; doesn't isolate network. Used by some research-grade agent runtimes. Rarely seen in production tools today.
  • Native fork. The agent runs as a user process on the host; relies on filesystem-level permissions for isolation. OpenHands supports this for performance-sensitive workflows. Acceptable for trusted tasks; risky otherwise.
  • None. The agent operates in the same process as the IDE — Cline, Continue, Aider. The agent can do anything the user can do. Acceptable when the user is supervising in real-time; risky for autonomous overnight work.

The honest rule: match sandbox strength to supervision. Autonomous agents running while you sleep deserve Docker; agents you're actively watching can use lighter sandboxes.
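A hedged sketch of the Docker mode with hard resource limits, using the docker Python SDK (pip install docker); the image name and mount paths are illustrative:

```python
import docker

client = docker.from_env()

# One disposable container per agent session, with hard limits so a
# runaway loop exhausts the container, not the host.
sandbox = client.containers.run(
    "agent-runtime:latest",  # hypothetical image with the toolchain baked in
    command="sleep infinity",
    detach=True,
    mem_limit="4g",
    nano_cpus=2_000_000_000,  # 2 CPUs
    network_mode="none",  # opt in to network only when the task needs it
    volumes={"/home/me/project": {"bind": "/workspace", "mode": "rw"}},
)

exit_code, output = sandbox.exec_run("npm test", workdir="/workspace")
sandbox.remove(force=True)  # blast radius ends with the container
```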

MCP integration patterns

The three MCP integration patterns we see across the 2026 leaders:

MCP-first (Goose). Every tool is an MCP server. Built-ins (shell, file edit) are MCP servers shipped with the agent. Adding a new tool means installing an MCP server. Strongest extensibility; biggest failure surface (bad MCP server can hang everything).

MCP-friendly (OpenHands, OpenClaw). Built-in tools coexist with MCP servers. The agent dispatches both through the same interface, but the implementations differ. Best of both worlds: built-ins for the stable core, MCP for extensibility. Most production deployments end here.

MCP-as-extension (Cline, Continue). MCP is one of several plugin mechanisms; not the canonical way to add capabilities. Lighter integration; less of an extension ecosystem.

See /systems/mcp for protocol-engineering depth on what MCP actually is and how the three patterns interact with it.

Memory integration patterns

Cross-session memory is what separates “agent that forgets at every restart” from “agent that remembers what we tried last week.” The integration patterns:

Provider abstraction (OpenHands). Configure a memory provider in config; the agent queries memory at the planning step and writes summaries at session end. Mem0, Letta, and custom providers all plug in via the same interface. The cleanest pattern.
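The shape of the provider abstraction, sketched with hypothetical names. The real OpenHands and Mem0 interfaces differ, but the contract (retrieve at plan time, consolidate at session end) is the same:

```python
from typing import Protocol

class MemoryProvider(Protocol):
    def search(self, query: str, limit: int = 5) -> list[str]: ...
    def add(self, summary: str, metadata: dict) -> None: ...

def plan_context(task: str, memory: MemoryProvider) -> str:
    # Queried once, before the first planning step.
    prior = memory.search(task)
    return "Relevant history:\n" + "\n".join(prior) + f"\n\nTask: {task}"

def close_session(memory: MemoryProvider, episode_summary: str) -> None:
    # Write-back at session end; the provider handles consolidation.
    memory.add(episode_summary, {"kind": "episodic"})
```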

MCP-as-memory (Goose, parts of OpenClaw). Memory is just another MCP server. The agent calls memory.search() like any other tool. More uniform and lower-complexity than a dedicated integration, but it loses some of the cross-session consolidation patterns memory frameworks provide.

No memory (Aider, Continue, Cline). The agent runs single-session-only. Each invocation starts fresh. For surgical-edit tools this is correct — you don't want yesterday's plan lingering. For autonomous multi-step tools it's a real limitation.

See /systems/agent-memory for the protocol-engineering depth on memory itself — vector vs graph vs OS-style; episodic vs semantic vs structured retrieval; consolidation failure modes.

Failure modes specific to autonomous execution

  1. Plan-revision loops. The agent keeps re-planning instead of executing. Symptom: 10 minutes of back-and-forth before a single tool call. Cause: usually Planning Mode enabled in a headless run (the plan waits on an approval that never arrives), or an underspecified task. Fix: explicit plan approval (UI) or plan_first = false (headless).
  2. Memory drift between sessions. Episodic memory says one thing; the actual repo / database state says another. The agent confidently reasons against stale knowledge. Mitigation: query MCP-git or MCP-postgres for ground truth before destructive actions.
  3. MCP server hang. A malformed MCP server stalls the agent loop indefinitely. The MCP protocol doesn't enforce hard timeouts at the dispatcher level in most implementations. Wrap MCP server processes in a watchdog.
  4. Sandbox container resource exhaustion. Long agent loops consume disk + memory inside the container. Set Docker resource limits; otherwise the container OOMs and the agent loop dies mid-task.
  5. Tool-call format mismatch. Some local runtimes (older llama.cpp, some Ollama versions) emit tool calls in slightly non-standard JSON. The agent's parser is forgiving but breaks occasionally. Pin known-working runtime versions.
  6. Token-cost runaway. An agent in a re-planning loop on a cloud API can burn $10-50 of tokens in 30 minutes. Set per-task budgets; alert on overruns (budget-guard sketch after this list).
  7. Concurrent-agent crosstalk. Two agents sharing the same memory store or MCP server hit each other. Per-agent isolation is non-optional in multi-agent deployments.
  8. Silent test-suite hang. The agent runs a test suite that takes 10 minutes; the tool-call timeout kills it at 30 seconds; the agent reports “tests passed” based on the truncated output. Set per-tool timeouts that match real test runtime.
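A minimal shape for the per-task budget guard in failure mode 6. The rates and class names are illustrative, and a real deployment would also alert before aborting:

```python
class BudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    """Accumulates spend from each model response; aborts the loop on breach."""

    def __init__(self, max_usd: float, usd_per_1k_prompt: float, usd_per_1k_completion: float):
        self.max_usd = max_usd
        self.rates = (usd_per_1k_prompt, usd_per_1k_completion)
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.spent_usd += (
            prompt_tokens * self.rates[0] + completion_tokens * self.rates[1]
        ) / 1000
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"task spent ${self.spent_usd:.2f}, budget ${self.max_usd:.2f}"
            )

# In the agent loop, after every model response:
#   budget.charge(resp.usage.prompt_tokens, resp.usage.completion_tokens)
```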

OpenHands / OpenClaw / Goose / Aider / Cline / Continue compared

The six 2026 leaders, with the architectural choices each one makes. See the dedicated /maps/coding-agents-2026 for the structured-landscape view; this section is the architecture-axis comparison.

OpenHands. ReAct planning loop with Planning Mode (v1.6+); MCP-friendly tool dispatcher; Docker / chroot / native sandbox modes; provider-abstraction memory (Mem0 / Letta first-class). The longest-track-record open-source agent; production default for self-hosted. See the OpenHands operational review.

OpenClaw. Anthropic-style reasoning loop; MCP-friendly dispatcher mixing built-in tools with MCP servers; Docker / WSL / native sandbox; foundation governance post-April 2026. The velocity leader (350k+ stars). See the OpenClaw operational review.

Goose. ReAct loops; MCP-first as the primary architectural choice; lightweight extension platform. Block's product. The right pick when MCP-heaviness is core to the workflow.

Aider. Custom git-aware loop; monolithic tool dispatcher; native (no real sandbox); no cross-session memory. Surgical-edit-and-commit paradigm. Different category from the autonomous-agent leaders.

Cline. ReAct loops; MCP-as-extension; runs inside VS Code (no separate sandbox); session-only. Picked up Roo Code's users after that project shut down in May. The current frontrunner in IDE-resident autonomous-task agents.

Continue. Hybrid: chat + autocomplete + commands. MCP-as-extension; runs inside VS Code. Broader surface area than Cline but less aggressive on autonomous-task quality. Pick Continue for everyday IDE help; Cline for autonomous work.

Local vs hosted implications

Agent execution systems are runtime-agnostic — they speak the OpenAI-compatible API and work against any provider. The local-vs-hosted choice is about the model, not the agent. But it interacts with agent architecture in three ways:

Token cost on cloud APIs. Anthropic-style reasoning loops (OpenClaw) cost 10-20% more tokens per task. On a $0.50-task workload, the difference is negligible; on a $20-task workload, it's real money. Self-hosted = effectively zero marginal per-token cost once hardware is amortized.

Latency budget. Local models on consumer hardware are 30-100% slower than cloud APIs. For autonomous agents this matters less than for chat (the user isn't waiting for each token); but a 3-minute local task can become a 6-minute task. Plan accordingly.

Privacy + data residency. Cloud APIs see every prompt + every tool call + every file content. For coding agents on private codebases, this is the trump card for self-hosted. Pair with /stacks/local-coding-agent for the canonical recipe.

Reference stacks

The four canonical agent-execution deployments we recommend in May 2026:

Single-developer autonomous coding. /stacks/local-coding-agent — OpenHands + Qwen 2.5 Coder 32B + vLLM + Mem0 + MCP fs/git + RTX 4090. The canonical local-coding-agent recipe.

Memory-enabled multi-session agent. /stacks/memory-enabled-agent — adds Mem0 + MCP-postgres for episodic/semantic/structured memory across sessions.

Mac-native agent. /stacks/apple-silicon-ai — MLX-LM backend; same agent (OpenHands or OpenClaw) on top.

Cloud-Claude hybrid. OpenClaw locally as the agent harness; Claude Sonnet via Anthropic API as the model. Highest capability ceiling, local agent state, per-token rather than per-seat economics.

When agents earn their keep

  • Multi-step tasks with 5-30 tool calls (refactoring, bug triage + fix + verify, test suite stabilization). Agents save real wall-clock time here.
  • Stable codebases where the agent can ground in actual code structure. Agents on greenfield projects with unclear architecture struggle.
  • Tasks the developer would otherwise context-switch on — the value isn't the agent's speed, it's your freed attention.
  • Repetitive grunt work with clear success criteria. “Add tests for these 50 functions” is an agent task; “design the API” is not.

When agents make things worse

  • Tasks under 5 minutes of work. The planning + dispatch overhead exceeds the work itself. Just do it manually.
  • Ambiguous requirements. Agents will confidently build the wrong thing if the spec is unclear. Clarify before delegating.
  • High-stakes irreversible actions. Database migrations, production deploys, anything where the recovery cost of a mistake is material. Use the agent to draft; review and execute manually.
  • Workflows where you're going to have to review every line anyway. The agent hasn't saved time — it's shifted attention from writing to reviewing. Sometimes worth it; often not.

Companion reading: coding-agents ecosystem map for the structured landscape; /systems/mcp and /systems/agent-memory for the protocol layers most agents wire in; /stacks/local-coding-agent for the canonical recipe.