Operator path
Operator-reviewed

Local coding agent: ship code with a local model

For: Developers who want autonomous or surgical-edit coding agents driven by a local model.
By the end: A coding agent integrated with your editor or terminal, pointed at a local model, with realistic context and tool boundaries.

By Fredoline Eruo · 7 milestones · Last reviewed 2026-05-07

A local coding agent is the most operationally satisfying use of a discrete GPU on your desk — when it works. When it doesn't, you'll spend a Saturday on a configuration problem that has nothing to do with the model. This path walks you through agent choice, model fit, context budgets, and the boundaries that keep an autonomous loop from rewriting your repo by accident.

Decide what kind of agent you actually want

Aider, Continue.dev, and Cline are three different products in three different shapes. Aider is git-aware terminal pair-programming: describe a change, it edits files and commits. Continue.dev is an in-editor sidekick that completes and chats with your code. Cline (and OpenHands) are autonomous task runners that plan, execute, test, and iterate.

Pick one. Don't run all three at once on the same repo until you've used one for a week. Each has its own context-budget assumptions and its own failure modes.

When this is done you should have
A clear answer: surgical edits in the terminal (Aider), conversational in-editor work in VS Code (Continue.dev), or an autonomous task loop (Cline / OpenHands).

Match the model to your hardware honestly

Surgical-edit agents work fine with 7B-14B coders on a single 16GB card. Autonomous agents need bigger context and stronger reasoning — Qwen 2.5 Coder 32B in AWQ-INT4 is the modern reference for a 24GB card, larger if you have it. Don't try to run Cline against a 7B model and a 4096 context window; you will hate the result.

Cross-check the model's parameter count against your VRAM in the right quantization, with headroom for the KV cache. If you don't know, the VRAM calculator is the right tool.
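As a rough first pass before the calculator, you can estimate weight memory from parameter count and quantization width. A minimal sketch follows; every number in it is an illustrative assumption to replace with your own.

    # Back-of-envelope VRAM estimate: weights ~= params (billions) * bits per weight / 8,
    # plus KV cache and runtime overhead. All values below are illustrative assumptions.
    PARAMS_B=32        # e.g. Qwen 2.5 Coder 32B
    BITS=4             # AWQ-INT4
    KV_CACHE_GB=4      # grows with context length and concurrent requests
    OVERHEAD_GB=2      # CUDA context, activations, fragmentation
    WEIGHTS_GB=$(( PARAMS_B * BITS / 8 ))
    echo "weights ~${WEIGHTS_GB} GB, total ~$(( WEIGHTS_GB + KV_CACHE_GB + OVERHEAD_GB )) GB VRAM"

If the total lands within a couple of gigabytes of your card's capacity, use the VRAM calculator before committing.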

When this is done you should have
A coding model loaded that fits your VRAM with 2-4GB headroom and runs at usable tok/s for the agent loop you picked.

Stand up an OpenAI-compatible endpoint

All three agent options speak OpenAI's chat-completions schema. vLLM is the production-grade choice on Linux + a single 24GB+ NVIDIA card. llama.cpp is the universal fallback. LM Studio is the easy mode if you don't want to think about it. The agent doesn't care which one you use.

Verify with curl before pointing the agent at it. If curl doesn't return JSON, the agent never will.
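A minimal smoke test, assuming vLLM on its default port 8000 and the AWQ build of Qwen 2.5 Coder 32B; swap in your own model name and port, or the llama-server / LM Studio equivalent if that's what you run.

    # Serve the model (example invocation; adjust model name and context to your hardware).
    vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ --max-model-len 16384

    # 1) Does the runtime advertise the model?
    curl http://localhost:8000/v1/models

    # 2) Does a chat completion come back as JSON?
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
           "messages": [{"role": "user", "content": "Reply with the single word: ok"}]}'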

When this is done you should have
vLLM, llama.cpp, or LM Studio serving on localhost with the chat-completions API. A curl test that returns a sensible completion.
Read next: vLLM · llama.cpp

Wire the agent to your endpoint

All three agents will accept localhost:8000 (or whatever port your runtime serves) as their OpenAI base URL. The "API key" can literally be the string "anything"; by default, neither vLLM nor llama.cpp validates it. The model name should match what your runtime advertises in /v1/models.

First task: ask the agent to add a docstring to one function in one file. If that round-trip works, the wiring is correct. If it fails, the wiring is wrong; don't blame the model yet.
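As one concrete example, Aider takes the base URL and key from environment variables; the values below are placeholders, and Continue.dev and Cline want the same three pieces of information in their own config files.

    # Point Aider at the local endpoint (placeholder values; match your runtime and /v1/models).
    export OPENAI_API_BASE=http://localhost:8000/v1
    export OPENAI_API_KEY=anything    # not checked by default
    aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ

    # First round-trip: ask for one docstring on one function, e.g.
    #   "Add a docstring to load_settings() in settings.py"   (hypothetical file and function)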

When this is done you should have
The agent of your choice configured with the local endpoint, the right model name, and a fake API key. A first task completed end-to-end.

Get context budgets right

The most common cause of bad agent behavior on local models is silent context truncation. Your runtime is set to 8K, the agent thinks it has 32K, the agent dumps a big file in, the runtime drops the system prompt, the model goes off the rails. The fix is to align all three numbers: the runtime's limit, the agent's configured window, and the model's training context.

Set --max-model-len on vLLM (or --ctx-size on llama.cpp) to a number you can support in VRAM. Tell the agent the same number. Watch your VRAM during a real task. If KV cache is eating your headroom, drop the context.
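A sketch of keeping the numbers aligned, assuming vLLM (or llama.cpp) and a 16K window as the example figure; the agent-side setting is named differently in each tool, so check its model config rather than trusting a default.

    # Runtime side: cap the context at what your VRAM can actually hold (16384 is an example).
    vllm serve Qwen/Qwen2.5-Coder-32B-Instruct-AWQ --max-model-len 16384
    # llama.cpp equivalent:
    #   llama-server -m qwen2.5-coder-32b-q4.gguf --ctx-size 16384

    # Agent side: configure the same 16384 in the agent's context-window setting for this model.

    # Watch KV cache growth during a real task; if headroom disappears, lower the context.
    watch -n 2 nvidia-smi --query-gpu=memory.used,memory.total --format=csv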

When this is done you should have
Agent configured with a context window that matches your runtime's max-model-len setting and your model's training context. No silent truncation.

Set tool boundaries the agent can't escape

Autonomous agents will refactor your home directory if you let them. Don't let them. Filesystem MCP servers take an allowlist; pin it to one repo at a time. Git operations should be explicit (the agent says "I will commit X with message Y") rather than implicit. The agent should run your tests before declaring done, not after.
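One way to enforce the allowlist is the reference MCP filesystem server, which only exposes the directories you name on its command line; the repo path and the client-config shape below are placeholders, since each agent registers MCP servers in its own file.

    # Reference MCP filesystem server, scoped to a single repo (path is a placeholder).
    npx -y @modelcontextprotocol/server-filesystem /home/you/projects/one-repo

    # Most MCP clients register it with an entry along these lines (exact file varies by agent):
    #   "mcpServers": {
    #     "filesystem": {
    #       "command": "npx",
    #       "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/you/projects/one-repo"]
    #     }
    #   }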

The local coding agent stack at /stacks/local-coding-agent documents the four MCP setup commands and the failure modes you'll hit; read it before pointing an autonomous loop at a real codebase.

When this is done you should have
Filesystem allowlist scoped to one project at a time. Git operations explicit, not implicit. A test command the agent runs before claiming success.

Build the muscle memory of when to stop the loop

Local coding agents on local models are not as patient or as smart as their cloud cousins. The single biggest skill you can develop: knowing when to interrupt, give better context, and re-prompt rather than letting the agent iterate badly for ten minutes. This is an operator skill, not a configuration problem.

When this is done you should have
A working sense of when the agent is making progress vs spinning. A documented practice for restarting a task with better context when it isn't.

Next recommended step

The reference recipe: OpenHands + Qwen 2.5 Coder 32B + vLLM + Mem0 + MCP, with setup commands and failure modes.