
Build a local coding-agent stack (May 2026)

A coding agent that drafts diffs, runs tests, and edits files autonomously — entirely on your hardware, with persistent memory of the codebase.

By Fredoline Eruo · Last reviewed 2026-05-06 · ~12 min read
The stack
  1. Tool · Coding agent (the planning + execution loop)
    openhands

    OpenHands v1.6 ships Planning Mode (drafts a plan before execution) and has the longest production track record in the open-source category. Pick OpenHands over Aider when you want autonomous task execution; pick Aider for surgical git-integrated edits.

  2. Model · Coding model (the actual brain)
    qwen-2.5-coder-32b-instruct

    Qwen 2.5 Coder 32B Instruct is the strongest open coding model in the 32B class as of May 2026 — beats DeepSeek Coder V2 Lite on HumanEval+ and SWE-Bench Lite at the same VRAM footprint. AWQ-INT4 fits on a 24GB card with headroom for a 32K context window.

  3. Tool · Inference engine (production-grade serving)
    vllm

    vLLM over Ollama for this stack: continuous batching means an agent making 5-10 concurrent tool calls per task doesn't queue, prefix caching keeps the system prompt resident across iterations, and the OpenAI-compatible API plugs into OpenHands with zero adapter code. Use Ollama only for single-user laptop chat.

  4. Tool · File access (the agent's hands on the codebase)
    mcp-server-filesystem

    The Anthropic reference filesystem MCP server with strict directory allowlisting. Required for OpenHands to read and write project files; allowlist limits blast radius when the agent goes off the rails.

  5. Tool · Repository state (status, diff, blame, history)
    mcp-server-git

    Pairs with mcp-server-filesystem to give the agent full repo awareness — read-side operations only by default. Lets OpenHands reason about what changed and why before proposing new edits.

  6. Tool · Persistent memory (codebase context across sessions)
    mem0

    Mem0 over Letta or Zep for this stack: dropping a memory layer into OpenHands takes 20 lines of config; Letta's OS-style explicit memory management is overkill for a single-user coding agent; Zep's temporal knowledge graph is strong but slower to wire.

  7. Hardware · GPU (where the model runs)
    rtx-4090

    RTX 4090 24GB is the sweet spot for this stack: enough VRAM for Qwen 32B AWQ-INT4 + 32K context, enough memory bandwidth (1 TB/s) for sub-second TTFT, and consumer-grade thermals. The 5090 helps but isn't required; the 4080 16GB doesn't have headroom for the context window the agent actually needs.

Step-by-step setup

The four steps that take this stack from zero to a working agent. Run them in order on a Linux box with CUDA 12.x already installed; on macOS, swap the GPU step for MLX-LM (see the Apple Silicon variation below).

1. Bring up vLLM with the coding model

# Pull the AWQ-INT4 quant — fits a 24GB card with 32K context
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.17.1 \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --enable-chunked-prefill

The --enable-chunked-prefill flag is not optional: without it, long-context prefills (the agent will routinely read 1000+ line files) stall every other request for 1-3 seconds. --gpu-memory-utilization 0.9 leaves ~2GB of VRAM headroom; lower it to 0.85 if you OOM on first inference.
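Once the container reports the model loaded, a one-line sanity check (assumes jq is installed; /v1/models is part of vLLM's OpenAI-compatible surface):

# Should print the model ID once weights finish loading
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'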

2. Install the MCP servers

# Filesystem — strict allowlist limits agent blast radius
npx -y @modelcontextprotocol/server-filesystem ~/projects/myrepo

# Git — read-side repo metadata (a Python server; run it via uvx, not npx)
uvx mcp-server-git --repository ~/projects/myrepo

Both run as stdio MCP servers — OpenHands launches them on demand and tears them down between sessions. Pin the allowlisted directory to one repo at a time; never point filesystem MCP at ~/ or your blast radius is your entire home directory.
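If a task genuinely spans more than one directory, the reference filesystem server accepts multiple allowlisted roots as positional arguments; widen scope deliberately, one root at a time (the second path here is illustrative):

# Two explicit roots beat one broad one
npx -y @modelcontextprotocol/server-filesystem \
  ~/projects/myrepo \
  ~/projects/myrepo-docs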

3. Wire OpenHands to the stack

# config.toml
[llm]
model = "openai/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"
api_base = "http://localhost:8000/v1"
api_key = "anything"  # vLLM doesn't check it

[mcp]
servers = [
  { command = "npx", args = ["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/myrepo"] },
  { command = "npx", args = ["-y", "@modelcontextprotocol/server-git", "--repository", "/home/you/projects/myrepo"] }
]

[memory]
provider = "mem0"
config = { api_key = "local", host = "http://localhost:11434" }
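The [memory] block points Mem0 at port 11434, Ollama's default, which implies a local Ollama instance is serving embeddings. If that's your setup, pull an embedding model before the first session (nomic-embed-text is an assumption, not a requirement of this stack):

# Assumed setup: Mem0 embeddings served by a local Ollama instance
ollama pull nomic-embed-text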

4. Run a real task

# Drop OpenHands into Planning Mode for the first run
openhands run --plan-first \
  --task "Find the bug causing the auth_token validation to fail \
          on expired tokens and write a regression test"

The agent should: read the relevant files via filesystem MCP, examine recent commits via git MCP, draft a plan (with Planning Mode this is shown to you for approval), then make the edit and run the test suite. End-to-end on a real bugfix: 60-180 seconds. If your first run takes 10+ minutes, you have a configuration problem — see Failure Modes below.
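Whatever the agent reports, verify its work against the repo itself before trusting it; plain git is enough:

# Inspect what the agent actually changed
git -C ~/projects/myrepo log --oneline -3   # new commits, if it committed
git -C ~/projects/myrepo diff               # uncommitted edits left in the tree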

Failure modes you'll hit

The list of things that go wrong with this stack, in rough order of how often we've seen them:

  1. vLLM OOM on first inference (not on load). The model loaded fine but the first request crashes. Lower --gpu-memory-utilization from 0.9 to 0.85, or drop --max-model-len from 32768 to 16384.
  2. Agent loops on plan revision. OpenHands keeps re-planning instead of executing — usually means the model isn't getting a clear enough “ok, plan approved, execute” signal. With Planning Mode, this is fixed by explicitly approving the plan in the UI; in headless mode, set plan_first = false after the first session.
  3. Filesystem MCP path-escape attempt. The allowlist is enforced; symptom is the agent reporting “permission denied” on files outside your repo. That's correct behaviour. If you need a wider scope, widen the allowlist deliberately rather than disabling it.
  4. Mem0 retrieves stale codebase context. The memory layer learned the codebase as it was 3 weeks ago; the agent now reasons against stale knowledge. Re-ingest after major refactors with mem0 reindex --workspace myrepo.
  5. vLLM prefix cache invalidation on every request. If your TTFT is 200-500ms instead of <50ms after the first call, your system prompt is templating variable user data. Move the variable parts to the user message; the system prompt should be byte-identical across the agent loop. A probe for this is sketched after this list.
  6. Test suite hangs the agent indefinitely. Long-running tests (integration suites that boot a database) blow past OpenHands' default tool-call timeout silently. Set per-tool timeouts in the MCP config or wrap your test runner in a hard deadline; see the example after this list.
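For failure mode #5, a minimal TTFT probe, assuming the vLLM server from step 1 and curl's built-in timing; the 2nd and 3rd runs should be dramatically faster than the 1st if the prefix cache is hitting:

# Three identical requests: compare time-to-first-byte across runs
for i in 1 2 3; do
  curl -s -o /dev/null -w "run $i TTFT: %{time_starttransfer}s\n" \
    http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
         "prompt": "You are a coding agent. Reply with: pong",
         "max_tokens": 1}'
done

For failure mode #6, the bluntest hard deadline is coreutils timeout around the test runner (pytest here is an assumption; substitute whatever the agent invokes):

# Kill the suite after 10 minutes; SIGKILL 30s later if it ignores SIGTERM
timeout --kill-after=30 600 pytest -x tests/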

Variations and alternatives

Where this stack is wrong for your situation, the swap-in alternatives:

Apple Silicon variation. Replace vLLM + RTX 4090 with MLX-LM + M3 Max 64GB. The rest of the stack is unchanged. Throughput drops ~30-40% vs a 4090 but you trade GPU heat for a battery-powered laptop. See the Apple Silicon AI stack for the dedicated path.
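A sketch of that swap, assuming mlx-lm's OpenAI-compatible server and a 4-bit community quant (the exact repo name is an assumption; check mlx-community on Hugging Face for the current one):

pip install mlx-lm
# Serves an OpenAI-compatible API on the port config.toml already expects
mlx_lm.server --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit --port 8000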

Surgical-edits variation. If you want git-integrated CLI editing rather than autonomous task execution, swap OpenHands for Aider. Same model + runtime + MCP layer; different agent paradigm.

Higher-throughput agent-loop variation. Replace vLLM with SGLang if your stack does >10 tool calls per task on a stable system prompt. RadixAttention's tree-structured KV cache makes shared-prefix workloads ~1.3-1.7x faster. See the SGLang operational review for when this swap pays off.
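The swap itself is one command; SGLang exposes the same OpenAI-compatible surface, so keeping the port means the OpenHands config above is untouched:

# Drop-in replacement for the vLLM container on the same port
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --port 8000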

Larger-codebase variation. For repos >1M tokens of context, swap Mem0 for Zep or Graphiti — temporal knowledge-graph memory holds long-horizon context better than flat vector retrieval.

How to verify the stack is healthy

The smoke tests we run on this stack:

  • Throughput: curl -X POST http://localhost:8000/v1/completions ... with a 100-token prompt should sustain >30 tok/s on a 4090 (a runnable version is sketched after this list). If you're below 20 tok/s, vLLM picked the wrong kernels — check NCCL_DEBUG=INFO output and pin the Docker image rather than running a pip-installed build.
  • TTFT: first-token latency for a cache-hit prefix should be <50ms. Repeat the same system prompt three times; the 2nd and 3rd should be much faster than the 1st. If they aren't, your prefix cache isn't hitting — see failure mode #5.
  • End-to-end: ask the agent to fix a deliberately broken test in a small repo. It should complete in <3 minutes; if it takes >5, something in the loop is wrong.
  • Memory: close the agent, restart, ask “what did we change last session?” — Mem0 should surface the prior session's changes. If it doesn't, your memory provider isn't persisting.
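A runnable version of the throughput check, assuming jq and bc are installed; it times one completion and divides generated tokens by wall-clock seconds (prefill is included, so this slightly understates pure decode speed):

# One timed completion; vLLM reports token counts in the usage field
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
       "prompt": "Write a quicksort in Python.",
       "max_tokens": 256, "temperature": 0}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "tok/s: $(echo "$TOKENS / ($END - $START)" | bc -l)"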

Going deeper

The reference reading that backs every component pick in this stack: