Local AI for coding agents — what works, what breaks, and what to run
Building autonomous coding agents on local infrastructure in 2026. Agent loop architecture for Aider, Cline, OpenCode, Continue, and Claude Code with a local backend. Model class minimums, context window economics, realistic tool-use accuracy vs Claude/GPT, runtime picks, sandboxing, and where local agents earn their keep vs where they don't.
This is the agents-specific guide. The broader developer-tooling framing — IDE integrations, single-shot completions, daily-driver chat — lives at /guides/local-ai-for-developers. The end-to-end workflow recipe is /workflows/local-coding-agent-system.
Answer first
For autonomous coding agents — the kind that read your repo, edit files, run tests, iterate on errors, and produce a finished diff — the local-model floor in 2026 is Qwen2.5-Coder-32B at Q4 or DeepSeek-Coder-V2.5 at AWQ-INT4 on a 24 GB card. Below that you get a chatbot that can answer questions, not an agent that can finish a non-trivial task. The realistic ceiling on consumer hardware is Qwen2.5-Coder-32B at Q6 or 70B-class general models like Llama 3.3 70B or Qwen2.5-72B at Q4 on a 48 GB single-card or dual-3090 setup. None of these match Claude Sonnet 4.x or GPT-class frontier on tool-use reliability and end-to-end task completion in 2026 — they close enough of the gap that they earn a place in the stack for well-scoped tasks.
If you want the explicit hardware decision: /guides/best-gpu-for-local-ai-2026; for the dual-card vs single-flagship math: /guides/dual-3090-vs-single-5090.
What a coding agent loop actually is
A coding agent isn't a smarter chatbot. It's a loop:
- Read. The agent receives a task and reads enough of the repo to form a plan — file listings, target files, test files, related modules.
- Plan. It produces a structured plan: which files to edit, in what order, what tests to run.
- Act. It calls a tool — read file, edit file, run command, run tests — and gets a structured response back.
- Observe. It receives the tool output (file contents, diff, test result, error trace) and folds it back into context.
- Iterate. Until the task is complete, the loop repeats. Each iteration grows the context window with prior tool calls and observations.
This loop has three distinct ways to fail on local models. (1) The model picks the wrong tool or formats the call incorrectly — tool-use breakage. (2) The context window fills up after 6-15 iterations and the model starts dropping earlier reasoning. (3) The model reasons fine for one step but loses the thread across many steps — long-horizon reasoning collapse. Frontier cloud models fail less often on all three; that's the gap you're paying for when you use Claude or GPT.
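Here is a minimal sketch of that loop against a local OpenAI-compatible server (vLLM- or Ollama-style endpoint). The tool surface, model name, endpoint, and step budget are illustrative assumptions, not any of these agents' real implementations; real harnesses add planning prompts, edit formats, and retry logic on top of exactly this skeleton.

```python
# Minimal agent loop against a local OpenAI-compatible endpoint.
# Illustrative only: tool surface, model name, and stop condition are placeholders.
import json
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # vLLM/Ollama-style server
MODEL = "qwen2.5-coder-32b-instruct"  # whatever name the server exposes

TOOLS = [
    {"type": "function", "function": {
        "name": "read_file",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "run_tests",
        "parameters": {"type": "object", "properties": {}}}},
]

def call_tool(name: str, args: dict) -> str:
    # Act: execute the tool and return its observation as plain text.
    if name == "read_file":
        return Path(args["path"]).read_text()
    if name == "run_tests":
        proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return proc.stdout[-4000:] + proc.stderr[-4000:]
    return f"unknown tool: {name}"

def run(task: str, max_steps: int = 15) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                 # Iterate: every step grows the context
        resp = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:                 # no tool call: the model considers itself done
            return msg.content or ""
        for tc in msg.tool_calls:              # Observe: fold each tool result back into context
            result = call_tool(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return "step budget exhausted"
```

Every failure mode above is visible in this skeleton: a malformed tool call breaks the `json.loads`, the growing `messages` list is the context pressure, and a model that forgets its plan simply stops calling the right tools.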
The five agents in 2026
The agent landscape concentrated around five tools by 2026, all of which support a local OpenAI-compatible backend.
- Aider — terminal-first, opinionated, mature. Works well with local models because the prompts are tight and the file-edit format is forgiving. The honest first pick for solo developers running local.
- Cline (formerly Claude Dev) — VS Code extension, originally Claude-shaped, now happily speaks to any OpenAI-compatible endpoint. Tool-call heavy, so model tool-use accuracy matters here more than with Aider.
- OpenCode — open-source coding agent designed for local-first deployment. The cleanest local-native developer experience in 2026.
- Continue — IDE-integrated assistant that started as autocomplete and grew agent capabilities. Lower-friction adoption inside JetBrains and VS Code shops.
- Claude Code with a local backend — Anthropic's CLI agent supports custom OpenAI-compatible providers in 2026 builds, so you can point it at a local vLLM server. Caveat: the agent was tuned for Claude's tool-use behavior, and 32B local models will miss tool calls more often than the cloud product.
For each, the same three things matter: the model, the runtime, and the harness. The agent is the harness — the wrapping that defines the loop and the tool surface. None of these agents make a 14B model competent for non-trivial work; they expose, with painful clarity, exactly how much capability the underlying model has.
Model class — what you actually need
Concrete tiers based on what each model class can credibly accomplish in an agent loop in 2026.
- 7B and below. Autocomplete and single-shot completion only. Not viable for agentic loops; tool-use accuracy collapses below useful thresholds.
- 14B (Qwen2.5-Coder-14B, Phi-4-Coder). Marginal. Handles single-step tasks (“refactor this function”) but breaks on multi-file edits. Useful as a fast autocomplete sidecar, not as a primary agent driver.
- 32B (Qwen2.5-Coder-32B, DeepSeek-Coder-V2.5). The realistic floor for agentic work. Handles multi-file edits within a small repo, gets tool-call format right most of the time, recovers from errors at a useful rate.
- 70B-class general (Llama 3.3 70B, Qwen2.5-72B). Stronger reasoning across long traces, better at planning. The tradeoff is that they're generalists; on pure code-completion benchmarks the 32B coding-specialized models often match or exceed them.
- 100B+ MoE (DeepSeek-V3, Llama 4 Scout, Qwen3-235B-A22B). Closest open-weight thing to frontier in 2026. The catch is that you need 192-512 GB of memory to run them — Mac Studio M3 Ultra territory or serious multi-GPU. For most operators these are out of reach without rented hardware.
See /glossary/quantization for what Q4, AWQ, and EXL2 mean and how they interact with hardware.
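A back-of-envelope check on how those tiers map to VRAM. The effective bits-per-weight figures below are rough assumptions for common quant formats, and real footprints vary by runtime and overhead, so treat this as a sanity check rather than a sizing tool.

```python
# Rough weight-memory estimate: params * effective_bits_per_weight / 8.
# The bpw values are approximations; check what your runtime actually reports.
BPW = {"fp16": 16.0, "q6_k": 6.6, "q4_k_m": 4.85, "awq_int4": 4.25}

def weight_gb(params_billion: float, fmt: str) -> float:
    return params_billion * 1e9 * BPW[fmt] / 8 / 1e9  # bytes -> GB

for name, params in [("14B", 14), ("32B", 32), ("70B", 70)]:
    print(name, {f: round(weight_gb(params, f), 1) for f in ("q4_k_m", "q6_k")})
# 32B at Q4_K_M lands near ~19 GB of weights alone, which is why a 24 GB card
# leaves only a few GB for KV cache and activations -- and why 32B at Q6 or
# 70B at Q4 need the 48 GB tier.
```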
Context window pressure
Coding agents are context monsters. A 30-step agent loop on a 32B model can produce 25-50K tokens of context by the end — file contents, diffs, tool calls, observations, the original plan. The published context window is a ceiling, not a comfort zone.
The practical numbers in 2026: Qwen2.5-Coder-32B advertises 128K context. On a 24 GB card running Q4_K_M, the comfortable working window is roughly 16-24K tokens before KV cache growth squeezes the model itself. A 70B Q4 model on dual-3090 (48 GB pooled) gives you 32-48K of comfortable agent context. To get to 64K+ comfortably you want 48 GB on one card (RTX A6000, RTX 6000 Ada) or a multi-GPU setup with FP8 KV-cache quantization.
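The arithmetic behind those numbers is worth doing once. Per-token KV cache is roughly 2 × layers × KV heads × head dim × bytes per value; the shape below approximates Qwen2.5-Coder-32B (64 layers, 8 KV heads via GQA, head dim 128) and is an assumption to check against the model's actual config.

```python
# Rough KV-cache budget for an agent context, assuming a GQA model shaped
# roughly like Qwen2.5-Coder-32B (64 layers, 8 KV heads, head dim 128).
# Check the real config.json before trusting these numbers.
def kv_cache_gb(tokens: int, layers=64, kv_heads=8, head_dim=128, bytes_per=2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K and V
    return tokens * per_token / 1e9

for ctx in (16_000, 24_000, 48_000):
    print(ctx, "tokens ->", round(kv_cache_gb(ctx), 1), "GB at FP16,",
          round(kv_cache_gb(ctx, bytes_per=1), 1), "GB at FP8")
# ~19 GB of Q4 weights plus ~4-6 GB of FP16 KV cache is already brushing
# against a 24 GB card, which is where the 16-24K "comfortable window" comes from.
```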
Why this matters: the moment the agent loop is forced to truncate or summarize earlier turns, the model starts losing the plot. A long task that succeeded at iteration 5 fails at iteration 15 because the context that made iteration 5 work was paged out. This is a memory-shaped failure mode that doesn't exist on cloud APIs (where the published context window is the available context window). Plan accordingly.
Tool-use accuracy reality check
The single number that matters most for an agent, and the one nobody benchmarks honestly, is the fraction of the time the model outputs a syntactically valid, semantically correct tool call when it should. On Claude Sonnet 4.x in 2026 this is essentially 100% on the common tool surface — file read, file edit, bash, search. On GPT-class frontier it's within a hair of that.
On local models the picture is rougher. Qwen2.5-Coder-32B at Q4_K_M, with a well-tuned tool-use prompt, lands tool calls correctly an estimated 88-94% of the time across the common surface. DeepSeek-Coder-V2.5 lands closer to 92-96%. Llama 3.3 70B at Q4 lands around 90-93%. These are working numbers from operator reports rather than published benchmarks (the local-agent tool-use bench isn't standardized in 2026), and they vary noticeably by harness — some harnesses retry failed tool calls automatically and surface a higher effective success rate; others don't.
The gap from 100% to ~92% sounds small. It is not small. A 30-step agent loop with a 92% tool-call success rate has a 0.92^30 ≈ 8% chance of completing the task without a tool-call retry; a frontier model at 99.5% success has a 0.995^30 ≈ 86% chance. The retry cost — the agent thrashes, recovers, eventually gets the right call — is what you feel as “local agents are flaky.” They aren't broken; they have a worse compounding probability across long traces.
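The compounding is easy to verify, and the same arithmetic shows why harness-level retries matter so much: under the simplifying assumption that tool-call failures are independent across attempts, a single automatic retry per step recovers most of the lost end-to-end success rate.

```python
# Why per-step tool-call accuracy compounds so brutally over a long loop.
# Assumes failures are independent across steps and attempts, which is a
# simplification -- real failures cluster on the hard steps.
def task_success(per_step: float, steps: int = 30, retries: int = 0) -> float:
    step_ok = 1 - (1 - per_step) ** (retries + 1)   # step succeeds on any attempt
    return step_ok ** steps

for p in (0.92, 0.995):
    print(p, "no retry:", round(task_success(p), 2),
             "one retry:", round(task_success(p, retries=1), 2))
# 0.92 -> ~0.08 clean runs with no retries, ~0.82 with one retry per step;
# 0.995 -> ~0.86 with no retries. That is the "flakiness" gap in one line.
```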
When local agents work
- Well-scoped, repetitive tasks. “Add a logging line to every public method in this module,” “migrate these 14 imports from path A to path B,” “update the test fixtures after the schema change.” Bounded scope, predictable structure, short loops.
- Fast feedback loops on a known codebase. When the model has seen the repo conventions and the task is incremental, the compounding-failure problem matters less because the loops are short.
- Privacy-required workflows. Code under NDA, regulated environments, classified codebases. Local is the only viable path; you take the quality hit because you have no choice.
- Bulk refactoring with human review. The agent does the mechanical work, the human reviews the diff. Tool-use occasional failures are caught at review time.
- As a sidecar for autocomplete. A 14B-class model runs locally for fast inline completions while the heavier work routes to a frontier API (a routing sketch follows this list). This hybrid is genuinely the right answer for most working developers in 2026.
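A sketch of that hybrid split, assuming an Ollama-style local endpoint for the sidecar and a cloud API for the heavy work; the endpoints, model names, and routing rule are placeholders, since in practice the harness exposes this as configuration.

```python
# Hybrid routing sketch: local 14B handles inline completions, a frontier
# API handles agentic edits. Endpoints, model names, and the routing rule
# are placeholders, not a specific harness's config.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="local")  # e.g. Ollama's OpenAI endpoint
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str, agentic: bool) -> str:
    client, model = (cloud, "frontier-model-name") if agentic else (local, "qwen2.5-coder:14b")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content or ""
```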
When local agents fail
- Open-ended product work. “Build the auth flow,” “design the new admin dashboard,” “ship the migration.” Long horizons, high uncertainty, branching decisions. Local models lose the thread; frontier models barely hold it.
- Cross-repo or microservice work. When the agent has to reason about behavior split across multiple repos, the context budget collapses fast.
- Subtle bug investigation. Tracking down a heisenbug across an agent loop requires the kind of careful, branching investigation that 32B models reliably mishandle.
- Production-critical changes without review. The long failure tail on a 30-step loop is fine when a human reviews the diff; it is not fine when the agent auto-merges to main.
Runtime picks for serving
The runtime you choose shapes the agent experience as much as the model. Three honest recommendations for 2026:
- vLLM for serving a coding agent to one or many users. Continuous batching, OpenAI-compatible API, mature, fast on NVIDIA. The default production pick. vLLM vs SGLang is the live debate; SGLang wins on radix-attention prefix caching for repeated agent loops, which actually matters for codebase-heavy contexts. As for whether vLLM is overkill for solo use: usually yes, which is what the next pick is for.
- llama.cpp via Ollama for solo developers on a single machine. Single-stream throughput is fine for one user; the operational surface is dramatically smaller. The right pick for a developer running an agent on their workstation rather than serving a team.
- ExLlamaV2 or TensorRT-LLM for the maximum-speed solo pick on NVIDIA. EXL2 quantization with ExLlamaV2 produces the highest single-stream tok/s on a 24 GB card; TensorRT-LLM is the production-grade enterprise NVIDIA path. Both add operational complexity.
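For concreteness, here is what launching a serving backend for an agent looks like with vLLM's OpenAI-compatible server. The flag names are from recent vLLM releases, and the model ID, context cap, and memory fraction are assumptions to adjust for your card; verify everything against `vllm serve --help` on your installed version.

```python
# Launch vLLM's OpenAI-compatible server as an agent backend (sketch).
# Flags reflect recent vLLM releases; confirm against `vllm serve --help`.
import subprocess

cmd = [
    "vllm", "serve", "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # assumed model ID
    "--port", "8000",
    "--max-model-len", "24576",          # cap context to protect KV-cache headroom on 24 GB
    "--gpu-memory-utilization", "0.92",  # leave a little VRAM for the desktop/driver
    "--kv-cache-dtype", "fp8",           # roughly halves KV-cache memory on supported GPUs
    "--served-model-name", "qwen2.5-coder-32b-instruct",  # the name the harness will request
]
subprocess.run(cmd, check=True)  # blocks; run under a process manager in practice
```

The agent harness then points at http://localhost:8000/v1 with the served model name above, exactly as in the loop sketch earlier.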
Sandboxing and security
A coding agent that executes commands on your machine is a security surface you have to take seriously, especially when the model can be tricked by malicious content in retrieved files, search results, or untrusted PRs. Local agents inherit this problem from cloud agents and arguably worsen it because the rig is on your network rather than in a vendor sandbox.
Practical defenses for 2026: (1) Run the agent in a Docker container or a dedicated VM with a shared volume mount, not directly on your host. (2) Use a non-privileged user account that can't reach the rest of the network. (3) Limit the tool surface — file read/write within the project tree, no arbitrary bash, no network access except to the local model server. (4) Treat agent-edited code as if it came from a stranger's pull request: review every diff. (5) Pin model versions and runtime versions; a model swap can change tool-use behavior in ways that surprise the harness.
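Defense (3) is the cheapest to enforce in code. A minimal sketch of a path guard for the file tools, assuming a known project root inside the sandbox: any path that resolves outside the tree, whether via `..` or a symlink, is rejected before the tool runs.

```python
# Confine file tools to the project tree: reject anything that resolves
# outside PROJECT_ROOT, including "../" tricks and symlink escapes.
from pathlib import Path

PROJECT_ROOT = Path("/work/repo").resolve()  # assumed mount point inside the sandbox

def safe_path(user_path: str) -> Path:
    candidate = (PROJECT_ROOT / user_path).resolve()
    if not candidate.is_relative_to(PROJECT_ROOT):   # Python 3.9+
        raise PermissionError(f"tool tried to escape the project tree: {user_path}")
    return candidate

def read_file(path: str) -> str:
    return safe_path(path).read_text()

def write_file(path: str, content: str) -> None:
    target = safe_path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
```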
See /paths/local-coding-agent for the full deployment recipe and /workflows/local-coding-agent-system for the validated end-to-end workflow.
Closing
Local coding agents in 2026 are real, useful, and not yet at parity with frontier cloud agents on hard tasks. The honest position: use them for what they're good at — bounded, repetitive, privacy-constrained, fast-feedback work — and use cloud agents for the rest until the gap closes further. The hardware floor (24 GB VRAM, 32B-class model, vLLM or Ollama runtime) is reachable for a few hundred dollars of used hardware; the operational ceiling (48 GB+, 70B-class, multi-user serving, sandboxed) is a workstation project, not an afternoon. Pick the tier that matches the task you actually have, not the one the marketing page suggested.
Next recommended step
The deployment recipe at /paths/local-coding-agent covers model picks, sandboxing, and observability end to end.
Coding agents demand sustained high-throughput inference across long context windows — a single debugging session might push tens of thousands of tokens through the model in under a minute. That kind of workload punishes GPUs that are fine for casual chat. The hardware tier that reliably drives coding agents sits a notch above what most people budget for, and knowing which GPU actually hits the latency target saves you from buying twice.
For the GPU tier where coding agents become practical, see the best GPU for Qwen guide and the custom GPU comparison.