RUNLOCALAI

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP · Fredoline Eruo


Agentic Coding

Multi-step autonomous coding agents that read repos, edit files, and run tests. Aider, Cline, and OpenHands are the open-weight tooling leaders.

Capability notes

Coding agents in 2026 — [Aider](/tools/aider), [Cline](/tools/cline), and OpenHands — operate in a loop: read repo context → plan edit → apply diff → run tests → evaluate results → iterate. SWE-bench Verified scores: Aider + Claude 3.7 Sonnet = 48.5%, OpenHands + DeepSeek V4 = 51.2% — autonomously resolving roughly half of real-world GitHub issues requiring multi-file edits. This is up from 15–25% in early 2024, driven by frontier model improvements and better agent architectures.

Agentic coding differs from code generation in scope: a completion tool suggests one function; an agent modifies 3–15 files, writes tests, runs them, and iterates on failures. The agent maintains state across tool calls — reading files, executing commands, parsing errors, applying fixes.

Model quality drives agent performance more than agent architecture. The same framework with [DeepSeek V4](/models/deepseek-v4) scores 2–3× higher on SWE-bench than with [Llama 3.3 70B](/models/llama-3-3-70b). Frontier MoE models handle multi-step reasoning, tool-use sequencing, and error recovery better. Serious agentic coding requires 70B-class or frontier MoE. 32B models handle single-file edits and simple test-fix loops. 7B models cannot reliably complete agentic workflows — they lose context after 3–5 tool calls.

The architecture that works: architect mode (planning model) + edit mode (execution model). The architect plans the multi-file change; the editor applies specific diffs. This separation reduces context pollution — the architect reasons about the full repo while the editor works on one file at a time. [Aider](/tools/aider) implements this with `--architect`; [Cline](/tools/cline) via "Plan" vs "Act" mode.
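The loop above can be sketched in a few lines. This is an illustrative skeleton, not any framework's real API — `plan_edits`, `apply_diff`, and `run_tests` are hypothetical stand-ins for the model call, the diff applier, and the shell step:

```python
# Minimal sketch of the agentic loop: plan -> edit -> test -> iterate.
# plan_edits, apply_diff, and run_tests are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class TestResult:
    passed: bool
    failures: list

def agent_loop(issue, repo, plan_edits, apply_diff, run_tests, max_iters=5):
    history = []
    for i in range(max_iters):
        plan = plan_edits(issue, repo, history)   # architect: full-repo reasoning
        for diff in plan:
            apply_diff(repo, diff)                # editor: one file at a time
        result = run_tests(repo)
        history.append(result)                    # state carried across iterations
        if result.passed:
            return {"status": "resolved", "iterations": i + 1}
    return {"status": "escalate-to-human", "iterations": max_iters}
```

The `max_iters` cap matters: an agent that has not converged after five test runs rarely converges at all, and the history list is what lets the planner see previous failures instead of re-proposing the same edit.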

If you just want to try this

Lowest-friction path to a working setup.

Start with [Aider](/tools/aider) using [DeepSeek V3](/models/deepseek-v3) via OpenRouter — zero hardware setup:

```bash
pip install aider-chat
export OPENROUTER_API_KEY="sk-or-v1-..."
aider --model openrouter/deepseek/deepseek-chat-v3
```

This gives you a terminal-based agent that reads your repo, makes multi-file edits, stages git commits, and iterates on test failures. [DeepSeek V3](/models/deepseek-v3) via OpenRouter costs ~$0.89/M input tokens and ~$1.10/M output — a typical hour-long session costs $3–8.

Move to local for zero-cost agentic coding once comfortable. Install [Ollama](/tools/ollama):

```bash
ollama pull llama3.3:70b-instruct-q4_K_M
aider --model ollama_chat/llama3.3:70b-instruct-q4_K_M
```

Hardware: 24 GB+ VRAM for 70B Q4 — [RTX 4090](/hardware/rtx-4090), [RTX 5090](/hardware/rtx-5090), or [RTX 3090](/hardware/rtx-3090). On [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) 64 GB+, the model runs on the SoC. For [Cline](/tools/cline), install the VS Code extension, configure the API provider to "Ollama" at `http://localhost:11434`, and select [Llama 3.3 70B](/models/llama-3-3-70b) or [DeepSeek V4](/models/deepseek-v4). Cline's VS Code integration gives inline diff previews missing from Aider's terminal.

Start with small tasks: "fix typo in README," "add unit test for parse_config," "refactor this 80-line function into two helpers." Graduate to multi-file refactors once you know when the agent succeeds vs needs human guidance. Rule of thumb: if the change can be described in 2–3 sentences with clear file paths and function names, the agent can do it. If design decisions span ambiguous requirements, supervise manually.

For production deployment

Operator-grade recommendation.

Production agentic pipelines combine an orchestrator, model backend, sandboxed execution, and monitoring.

**Planning model (architect):** frontier-tier reasoning ([DeepSeek V4](/models/deepseek-v4), Claude 3.7 Sonnet via API). Reads the issue description, explores the repo, selects relevant files, drafts the multi-file edit plan. A single plan invocation costs 20K–50K input + 2K–5K output tokens. On [DeepSeek V4](/models/deepseek-v4) via [vLLM](/tools/vllm) on [H100 PCIe](/hardware/nvidia-h100-pcie): 15–30 seconds.

**Editing model (executor):** 70B-class ([Llama 3.3 70B](/models/llama-3-3-70b), [Qwen 3 32B](/models/qwen-3-32b)). Applies individual file diffs from the plan, receiving file content + specific edit instruction. Each diff costs 3K–10K tokens. Local execution avoids per-token API charges on the high-volume editing step.

**Sandbox:** every agent-generated command executes in an isolated container (Docker/Podman) with restricted network egress and filesystem limited to a repo clone. [Aider](/tools/aider)'s `--no-auto-commits` + `--yes` with a Docker wrapper provide basic sandboxing. [Cline](/tools/cline)'s "require approval" mode flags dangerous commands (`rm -rf`, `git push --force`, `curl` to unknown hosts, anything touching `/etc` or `/home` outside the project dir).

**Test-fix loop budget:** each iteration costs 30–90 seconds with local inference. Cap at 5 iterations per issue. Beyond 5 without passing tests, escalate to a human. Track "fix ratio" — the % of iterations that improve test results. Below 40% means the model is thrashing — terminate and restart with a revised plan.

**When agents break.** Agents fail when: (1) the issue requires understanding undocumented architecture decisions, (2) the fix touches 10+ files with interdependencies (context window exceeds 128K), (3) the test suite takes 60+ seconds, or (4) the agent encounters a novel error never seen in training. Implement a circuit breaker: 3 consecutive no-improvement iterations → halt. Total runtime exceeds 15 minutes → halt.
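The halt policy above is small enough to state precisely. A hedged sketch — thresholds mirror the text (5-iteration cap, 40% fix ratio, 3 no-improvement strikes, 15-minute wall clock), but the `CircuitBreaker` class itself is illustrative, not a real framework API:

```python
# Circuit-breaker policy sketch for an agent test-fix loop (illustrative).
import time

class CircuitBreaker:
    def __init__(self, max_iters=5, min_fix_ratio=0.4,
                 max_no_improve=3, max_seconds=15 * 60):
        self.max_iters = max_iters
        self.min_fix_ratio = min_fix_ratio
        self.max_no_improve = max_no_improve
        self.deadline = time.monotonic() + max_seconds
        self.iters = 0
        self.improved = 0
        self.no_improve_streak = 0

    def record(self, improved):
        """Record one iteration; return a halt reason, or None to continue."""
        self.iters += 1
        if improved:
            self.improved += 1
            self.no_improve_streak = 0
        else:
            self.no_improve_streak += 1
        if self.iters >= self.max_iters:
            return "iteration-cap"
        if self.no_improve_streak >= self.max_no_improve:
            return "no-improvement"
        # Fix-ratio check only kicks in after 3 iterations, so one early
        # failure does not trip the breaker immediately.
        if self.iters >= 3 and self.improved / self.iters < self.min_fix_ratio:
            return "thrashing"
        if time.monotonic() > self.deadline:
            return "timeout"
        return None
```

The orchestrator calls `record()` after each test run and aborts the session whenever it returns a reason string instead of `None`.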
**Cost.** Local inference on [H100 PCIe](/hardware/nvidia-h100-pcie) at ~$2.50/hr processes ~200–400 iterations/hour. API-based agents cost $0.15–0.75/iteration. For 50 issues/day at 4 iterations each = 200 iterations/day, local saves $28–148/day vs API — a $10,000–54,000/year differential.
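The savings figures follow directly from the rates above; a quick back-of-envelope check, using the article's numbers as inputs:

```python
# Daily savings of local inference vs per-iteration API pricing.
# Inputs are the article's figures: $0.15-0.75/iteration API,
# $2.50/hr GPU at ~400 iterations/hour, 200 iterations/day.
def daily_savings(iters_per_day, api_cost_low, api_cost_high,
                  gpu_hourly, iters_per_hour):
    gpu_cost = iters_per_day / iters_per_hour * gpu_hourly
    return (iters_per_day * api_cost_low - gpu_cost,
            iters_per_day * api_cost_high - gpu_cost)

low, high = daily_savings(200, 0.15, 0.75, 2.50, 400)
# API runs $30-150/day against ~$1.25 of H100 time, i.e. ~$28.75-148.75 saved
```

Multiplied out over ~365 days, that is roughly the $10K–54K/year differential quoted above.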

What breaks

Failure modes operators see in the wild.

**Agent loop divergence.** Symptom: agent repeatedly edits same files, never converging — codebase worse than baseline after 5+ iterations. Cause: model lacks global understanding of side effects — fixes A, breaks B. Mitigation: cap at 5 iterations, require new planning step before each iteration above 3, alert when any file touched more than twice.

**Infinite repair cycles.** Symptom: "apply edit → tests fail → apply same fix → same failure → repeat." Cause: error-recovery reasoning generates the same fix because it cannot identify root cause from test output alone. Mitigation: deduplication check — if the same diff is proposed twice, stop and inject the actual error message with directive to "explain root cause before generating fix."

**Context pollution.** Symptom: after 10+ tool calls, agent hallucinates file contents, references nonexistent variables, proposes edits to wrong files. Cause: full conversation history including tool outputs accumulates — by iteration 8 at 128K context, 70% is tool-output history. Mitigation: architect-editor pattern — architect gets full repo context; editor gets only current file + specific edit instruction. Archive tool outputs after each iteration; summarize rather than retaining raw stdout.

**Git state corruption.** Symptom: agent commits broken changes, pushes to main directly, force-pushes, or creates merge conflicts. Cause: agent tool-use permissions grant git commands without guardrails. Mitigation: never grant `git push`. Restrict `git commit` to a dedicated agent branch. Require a `.agent-guardrails` file blocking force push, checkout of main, hard reset. Validate git state before and after each invocation.

**Unsafe command execution.** Symptom: agent runs `rm -rf`, `chmod -R 777`, unchecked SQL migrations, downloads remote scripts. Cause: model treats all shell commands as equally valid with no risk model. Mitigation: run in container with read-only rootfs except project clone, block network egress during agent runs, maintain blocklist (`rm -rf`, `chmod -R`, `git push --force`, `sudo`, `pip install`, `curl | bash`) requiring explicit human approval.

**Test blindness.** Symptom: agent claims "all tests pass" but invocation was a no-op (wrong runner, wrong directory, tests skipped). Cause: model conflates "test exit code 0" with "tests actually ran." Mitigation: require agent to report exact command, number of tests run, and duration. Parse output — "0 tests ran" = failure.
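The deduplication check for infinite repair cycles is the simplest of these mitigations to implement. A minimal sketch, assuming diffs arrive as plain strings — `DiffDeduper` is a hypothetical helper, not part of any agent framework:

```python
# Halt an agent when it proposes the same diff twice (infinite repair cycle).
import hashlib

class DiffDeduper:
    def __init__(self):
        self.seen = set()

    def is_repeat(self, diff):
        # Normalize whitespace so trivially reflowed diffs still match.
        key = hashlib.sha256(" ".join(diff.split()).encode()).hexdigest()
        if key in self.seen:
            return True
        self.seen.add(key)
        return False
```

On a repeat, the orchestrator stops applying edits and re-prompts with the raw error message plus the "explain root cause before generating fix" directive.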

Hardware guidance

**Hobbyist tier ($1,500–2,500 GPU).** Agents are the most demanding local AI workload. [RTX 4090](/hardware/rtx-4090) at 24 GB runs 70B Q4 with 16K–32K context — the minimum viable consumer card for unattended agentic work. Expect 22–28 tok/s, iterations of 45–90 seconds. [RTX 5090](/hardware/rtx-5090) at 32 GB: 40–55 tok/s at 32K context — iterations drop to 25–50 seconds. Dual [RTX 3090](/hardware/rtx-3090) (48 GB): 25–35 tok/s at 64K context — the used-market sweet spot. [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 128 GB: 20–30 tok/s with 64K+ context — the laptop pick.

**SMB tier ($6,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB with 960 GB/s: 80–100 tok/s — iterations drop to 15–25 seconds. Makes agentic coding feel interactive. [L40S](/hardware/nvidia-l40s) at 48 GB: similar datacenter performance. A single card handles 5–8 concurrent agent instances via vLLM continuous batching.

**Enterprise tier ($25,000+).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB with 2.0 TB/s: 140–170 tok/s — iterations drop to 8–15 seconds. 80 GB fits 70B FP8 with 64K context plus KV cache for concurrent sessions. [H200](/hardware/nvidia-h200) at 141 GB: fits [DeepSeek V4](/models/deepseek-v4) at FP8 with 128K context — 50+ concurrent agent instances. [AMD MI300X](/hardware/amd-mi300x) at 192 GB: fits DeepSeek V4 at FP16 with 128K context and 100+ concurrent sessions.

**Context window scaling.** Each 1K tokens of context consumes ~0.8–1.2 GB of KV cache for 70B models. A 24 GB GPU running 70B Q4 (~40 GB) with partial offload: ~8 GB for KV = ~8K context — borderline. A 48 GB GPU running 70B FP8 (~35 GB): ~13 GB for KV = ~16K context — adequate. An 80 GB GPU: ~45 GB for KV = ~64K context — comfortable. Choose hardware by the context window needed: 16K minimum for basic agentic work, 32K for multi-file refactors, 64K for repository-scale.

Runtime guidance

**Aider vs Cline vs OpenHands — agent architecture determines model compatibility.** [Aider](/tools/aider) is a terminal-based agent: repo-map → prompt assembly → model response → search/replace block extraction → file edit → git commit → test run → iterate. Aider uses search/replace blocks as its edit format, making it the most model-agnostic agent — it works with 70B local models, frontier cloud APIs, and everything in between. Architect mode (`--architect`) splits planning and editing across two models. Tradeoff: terminal-only, no GUI diff preview (rely on `git diff`), and repo-map generation adds latency.

[Cline](/tools/cline) is a VS Code extension acting as an autonomous agent with file read/write, terminal execution, browser access, and MCP tool integration. Architecture: user prompt → plan → execute tool → observe → replan → repeat. Works with any OpenAI-compatible API including local [Ollama](/tools/ollama) and [vLLM](/tools/vllm). Advantage: deep VS Code integration — inline diffs, per-hunk accept/reject, real-time observation. Tradeoff: the tool-use event loop consumes 2–4× more context per iteration than Aider's structured edit format.

**OpenHands** (formerly OpenDevin) is a web-based platform running in a sandboxed Docker container with full shell, filesystem, and browser access. CodeAct architecture: task description → write code → execute → iterate. Achieves the highest SWE-bench Verified scores (51.2% with [DeepSeek V4](/models/deepseek-v4)) among open-weight agent frameworks. Tradeoff: web UI deployment complexity, Docker sandbox infrastructure, and local model support depends on vLLM integration.

**Architect vs unified mode.** Architect mode: the planning model receives full repo context (file tree + docstrings via repo-map) and produces a change plan; the editing model receives individual files + specific instructions. Reduces context consumption 40–60%. Essential for local models with limited context. Unified mode: a single model plans and edits — simpler, but requires larger context and stronger models. Works with [DeepSeek V4](/models/deepseek-v4); degrades with [Llama 3.3 70B](/models/llama-3-3-70b) after 3–4 iterations.

**Claude API vs local.** Claude 3.7 Sonnet scores 60–80% higher on SWE-bench than the best open-weight 70B, recovers from tool-use errors 3× more reliably, and handles 200K context natively. Tradeoff: $3/M input, $15/M output, and code goes to Anthropic.
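The 40–60% context saving of architect mode is easy to see with rough token arithmetic. All per-iteration token counts below are assumed for illustration, not measured:

```python
# Context-budget comparison: unified vs architect-editor, 5-file change.
def unified_tokens(repo_ctx, file_ctx, n_files):
    # One model carries full repo context through every file edit.
    return n_files * (repo_ctx + file_ctx)

def architect_editor_tokens(repo_ctx, file_ctx, n_files, plan_tokens=2_000):
    # Architect reads the repo once; each editor call sees one file + the plan.
    return repo_ctx + n_files * (file_ctx + plan_tokens)

unified = unified_tokens(20_000, 4_000, 5)         # 120,000 tokens
split = architect_editor_tokens(20_000, 4_000, 5)  # 50,000 tokens, ~58% less
```

The saving grows with repo size: the repo-map context is paid once instead of once per edited file, which is why the pattern matters most for local models capped at 16K–32K context.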

Setup walkthrough

  1. Install [Ollama](/tools/ollama) → `ollama pull qwen2.5-coder:14b` (~9 GB).
  2. `pip install aider-chat` (requires Python 3.10+).
  3. Create a small test repo: `mkdir test-agent && cd test-agent && git init`.
  4. `aider --model ollama_chat/qwen2.5-coder:14b` → opens the aider TUI.
  5. Ask: "Create a Python CLI app that takes a filename and counts lines, words, and characters." Aider reads the repo, writes the file, and asks to run it.
  6. First useful output in 1–2 minutes end-to-end.

For VS Code users: install the Cline or Continue extension → point it to Ollama → same models. Cline with DeepSeek Coder V3 on a 24 GB GPU handles multi-file edits, runs tests, and reads linter output.

The cheap setup

Used [RTX 3060 12 GB](/hardware/rtx-3060-12gb) (~$200–250). Runs Qwen 2.5 Coder 14B Q4_K_M at 25–35 tok/s — viable for aider on small-to-medium repos (<50K lines). Response latency ~3–8 seconds per tool-use round. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1 TB NVMe. Total: ~$400–480. Expect 5–15 second turnaround per edit cycle on repos under 100 files.

The serious setup

Used [RTX 3090](/hardware/rtx-3090) 24 GB (~$700–900). Runs DeepSeek Coder V3 at 15–20 tok/s — viable for multi-file agentic workflows. Qwen 2.5 Coder 32B Q6_K at 35–50 tok/s for faster iteration. Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2 TB NVMe. Total: ~$1,800–2,200. For heavy agentic use (Cline-style with tool-use loops), a second RTX 3090 raises VRAM to 48 GB and doubles throughput on DeepSeek V3.

Common beginner mistake

**The mistake:** running aider/Cline with a 7B coding model and getting frustrated when it can't handle multi-file refactors. **Why it fails:** 7B models lack the context-reasoning depth needed for agentic tool use — they forget instructions mid-task, lose track of which file they're editing, and fail at multi-step planning. **The fix:** minimum viable is 14B (Qwen 2.5 Coder 14B). For production agentic coding, use 32B+ (Qwen 2.5 Coder 32B or DeepSeek Coder V3). The jump from 7B to 14B is the single biggest quality improvement in agentic coding.

Recommended setup for agentic coding

Recommended hardware
Best GPU for Ollama (coding workflows) →
Code models work great on Ollama; 16 GB minimum.
Recommended runtimes

Browse all tools for runtimes that fit this workload.

Budget build
AI PC under $1,000 →

Reality check

Code models are LLM workloads — the same VRAM math applies. 16 GB runs 13–32B Q4 (Qwen 2.5 Coder, DeepSeek Coder); 24 GB unlocks 70B-class code models. The killer detail is context window — code review wants 32K+, which pushes KV cache beyond 16 GB on 70B.

Common mistakes

  • Skipping context-window math (KV cache eats VRAM at scale)
  • Using base instruct models for code (specialized code models are 30–50% better)
  • Running coding agent loops on 8 GB (works for 7B chat, but agent loops compound context)
  • Skipping flash attention (it matters more for long-context code workloads than for chat)

What breaks first

The errors most operators hit when running agentic coding locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • Tokenizer mismatch →

Before you buy

Verify your specific hardware can handle agentic coding before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →
Hardware buying guidance for Agentic Coding

Local coding workflows live or die on time-to-first-token and 32K+ context. The guides below cover the developer-specific hardware decision.

  • best GPU for Qwen
  • AI PC build for developers

Featured models

DeepSeek Coder V3

Related tasks

Coding Agents