Local coding agent: ship code with a local model
For: Developers who want autonomous or surgical-edit coding agents driven by a local model. By the end: A coding agent integrated with your editor or terminal, pointed at a local model, with realistic context and tool boundaries.
A local coding agent is the most operationally satisfying use of a discrete GPU on your desk — when it works. When it doesn't, you'll spend a Saturday on a configuration problem that has nothing to do with the model. This path walks you through agent choice, model fit, context budgets, and the boundaries that keep an autonomous loop from rewriting your repo by accident.
Decide what kind of agent you actually want
These are three different products in three different shapes. Aider is git-aware terminal pair-programming: describe a change, it edits files and commits. Continue.dev is an in-editor sidekick that completes and chats with your code. Cline and OpenHands are autonomous task runners that plan, execute, test, and iterate.
Pick one. Don't run all three at once on the same repo until you've used one for a week. Each has its own context-budget assumptions and its own failure modes.
Match the model to your hardware honestly
Surgical-edit agents work fine with 7B-14B coders on a single 16GB card. Autonomous agents need bigger context and stronger reasoning: Qwen 2.5 Coder 32B in AWQ-INT4 is the modern reference point for a 24GB card, larger if you have the VRAM. Don't try to run Cline against a 7B model with a 4,096-token context window; you will hate the result.
Cross-check the model's memory footprint at your chosen quantization against your VRAM, with headroom for the KV cache. If you don't know the numbers, the VRAM calculator is the right tool.
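A back-of-envelope version of that check, sketched below for Qwen 2.5 Coder 32B. The layer count, KV-head count, and head dimension are taken from the published model config and are assumptions you should verify for your own model; real runtimes add activation and framework overhead on top.

```shell
# Rough VRAM estimate: INT4 weights plus fp16 KV cache.
PARAMS_B=32            # model size, billions of parameters
BYTES_PER_PARAM_X10=5  # INT4 weights ~= 0.5 bytes/param (x10 to keep integer math)
LAYERS=64              # Qwen 2.5 32B transformer layers (assumption: check config.json)
KV_HEADS=8             # grouped-query attention KV heads
HEAD_DIM=128           # per-head dimension
CTX=16384              # context length you plan to serve

WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM_X10 / 10 ))

# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16) * ctx
KV_BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * 2 * CTX ))
KV_GB=$(( KV_BYTES / 1024 / 1024 / 1024 ))

echo "weights ~${WEIGHTS_GB} GB, KV cache ~${KV_GB} GB at ${CTX} ctx"
```

With these assumptions the weights alone land around 16 GB and a 16K KV cache adds roughly 4 GB more, which is why the 24GB-card recommendation above has headroom but not much of it.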
Stand up an OpenAI-compatible endpoint
All three agent options speak OpenAI's chat-completions schema. vLLM is the production-grade choice on Linux + a single 24GB+ NVIDIA card. llama.cpp is the universal fallback. LM Studio is the easy mode if you don't want to think about it. The agent doesn't care which one you use.
Verify with curl before pointing the agent at it. If curl doesn't return JSON, the agent never will.
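A minimal smoke test, assuming your runtime is serving on localhost:8000 and the model name below matches your deployment (both are placeholders; substitute your own):

```shell
BASE_URL=http://localhost:8000/v1

# 1. List models: confirms the server is up and shows the exact model id.
curl -s "$BASE_URL/models"

# 2. One short chat completion: exercises the route every agent will use.
curl -s "$BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Reply with OK"}],
        "max_tokens": 4
      }'
```

If the second call returns a JSON body with a `choices` array, the endpoint is agent-ready.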
Wire the agent to your endpoint
All three agents will accept localhost:8000 (or whatever port your runtime serves) as their OpenAI base URL. The "API key" can literally be the string "anything" — vLLM and llama.cpp don't validate it. The model name should match what your runtime advertises in /v1/models.
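As a concrete sketch for aider (environment-variable and flag names as of recent aider releases; check `aider --help` if they've moved), the wiring looks like:

```shell
# Point aider at the local runtime instead of OpenAI.
# vLLM and llama.cpp accept any non-empty API key.
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=anything

# The name after "openai/" must match what your runtime's /v1/models advertises.
aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ
```

Continue.dev and Cline take the same three values (base URL, dummy key, model name) through their settings UIs rather than environment variables.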
First task: ask the agent to add a docstring to one function in one file. If that round-trip works, the wiring is correct. If it fails, the wiring is wrong; don't blame the model yet.
Get context budgets right
The most common cause of bad agent behavior on local models is silent context truncation. Your runtime is set to 8K, the agent thinks it has 32K, the agent dumps a big file in, the runtime drops the system prompt, the model goes off the rails. Fix: make the runtime's limit, the agent's configured context, and your VRAM budget agree on one number.
Set --max-model-len on vLLM (or --ctx-size on llama.cpp) to a number you can support in VRAM. Tell the agent the same number. Watch your VRAM during a real task. If KV cache is eating your headroom, drop the context.
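For example (CLI shapes as of recent vLLM and llama.cpp releases; the model and file names are placeholders, and the context length should be one your VRAM can actually hold):

```shell
# vLLM: cap the context window explicitly.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 16384

# llama.cpp equivalent.
llama-server -m qwen2.5-coder-32b-instruct-q4_k_m.gguf --ctx-size 16384

# Watch KV-cache growth during a real agent task.
watch -n 1 nvidia-smi
```

Whatever number you pick here is the same number you put in the agent's context setting; a mismatch in either direction causes the truncation failure described above.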
Set tool boundaries the agent can't escape
Autonomous agents will refactor your home directory if you let them. Don't let them. Filesystem MCP servers take an allowlist; pin it to one repo at a time. Git operations should be explicit (the agent says "I will commit X with message Y") rather than implicit. The agent should run your tests before declaring done, not after.
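With the reference filesystem MCP server, the allowlist is simply the directories you pass on the command line (a sketch; the repo path is a placeholder):

```shell
# The server can only see the directories listed here: pin it to one repo.
npx -y @modelcontextprotocol/server-filesystem ~/projects/one-repo
```

Anything outside that path is invisible to the agent, which is exactly the boundary you want for an autonomous loop.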
Local coding agent stack on /stacks/local-coding-agent documents the four MCP setup commands and the failure modes you'll hit; read it before pointing an autonomous loop at a real codebase.
Build the muscle memory of when to stop the loop
Local coding agents on local models are not as patient or as smart as their cloud cousins. The single biggest skill you can develop: knowing when to interrupt, give better context, and re-prompt rather than letting the agent iterate badly for ten minutes. This is an operator skill, not a configuration problem.
Next recommended step
The reference recipe: OpenHands + Qwen 2.5 Coder 32B + vLLM + Mem0 + MCP, with setup commands and failure modes.