Homelab · Week build-out

Local coding-agent system

End-to-end local autonomous coding agent. vLLM serving Qwen 2.5 Coder 32B, OpenHands as the agent controller, Open WebUI for chat, Qdrant + nomic-embed for code RAG, bge-reranker for retrieval, Redis for the agent queue, Docker sandbox for code execution, Caddy reverse proxy, Tailscale for remote access. The whole system on one workstation.

By Fredoline Eruo · Reviewed 2026-05-07 · ~2,200 words

Build summary

Hardware footprint: RTX 4090 24 GB · 64 GB DDR5 · 1 TB NVMe Gen4 · Ubuntu 24.04 LTS
Concurrency: 1-2 concurrent users (one human + one agent loop); anything higher needs SGLang.
Power: ~450 W under sustained load; budget a 1000 W Gold PSU.

Goal: An autonomous coding agent that learns the codebase, executes multi-file edits, runs tests, and never sends a token to a cloud.

Operator card

Best for
  • ✓Solo developer who wants an autonomous coding agent
  • ✓Privacy-sensitive teams that can't ship code to cloud LLMs
  • ✓Engineers who want to learn how vLLM + agent loops actually run
  • ✓Homelab operators with a single 4090 already
Avoid if
  • ⚠You need >2 concurrent users (move to SGLang or production tier)
  • ⚠Your codebase is >10M LoC (the embeddings index becomes the bottleneck)
  • ⚠You can't dedicate a workstation 24/7 to this
  • ⚠Apple Silicon — see /workflows/private-chatgpt-replacement instead
Stability: stable
Maintenance: weekly attention
Skill: advanced
Long-session reliability: reliable

Service ledger

11 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
vLLM · Inference · 8000/tcp (OpenAI-compatible)
Inference engine. The AWQ-INT4 path lets Qwen 2.5 Coder 32B fit on a single 4090 with 32K context plus headroom. Continuous batching handles the agent's tool-call burst pattern.
Runs: Docker container, GPU 0 dedicated
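A minimal launch sketch, assuming the official vllm/vllm-openai image and Qwen's published AWQ checkpoint on Hugging Face; the image tag is illustrative, so pin a SHA you've tested (see What breaks first):

# single GPU, AWQ weights, 32K context; loopback-bound per the Networking section
docker run -d --name vllm --gpus '"device=0"' \
  -p 127.0.0.1:8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.6.3 \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq --max-model-len 32768 \
  --gpu-memory-utilization 0.90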
Qwen 2.5 Coder 32B Instruct (AWQ-INT4) · Model
Coding LLM. SWE-Bench-Lite-competitive in the 32B size class, Apache 2.0 license, strong tool-calling discipline. AWQ-INT4 fits in 24 GB with 32K context.
Runs: loaded into vLLM at boot
nomic-embed-text-v1.5 · Embeddings
Code embeddings. Open-weights bi-encoder, MTEB-competitive at 137M params; runs comfortably on the same GPU alongside the LLM via a second vLLM container or llama.cpp.
Runs: llama.cpp server, GPU 0 shared
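A sidecar sketch, assuming a recent llama.cpp build and a GGUF conversion of nomic-embed-text-v1.5 (the quant filename is illustrative):

# embeddings endpoint on 8081; -ngl 99 offloads all layers to the shared GPU
llama-server -m models/nomic-embed-text-v1.5.Q8_0.gguf \
  --embeddings --port 8081 -ngl 99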
bge-reranker-v2-m3 · Reranker
Retrieval reranker. Cross-encoder rerank lifts top-10 retrieval precision ~25% over bi-encoder alone; small enough to run on CPU when the GPU is busy.
Runs: FastAPI sidecar on CPU
Surface
Open WebUI · Frontend · 8080/tcp
Chat frontend. Points at vLLM's OpenAI-compatible endpoint; supports per-conversation memory and multi-model switching. LibreChat is the alternative if you need MS365/AAD auth.
Runs: Docker container
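A wiring sketch: OPENAI_API_BASE_URL points Open WebUI at vLLM, which accepts any key unless started with --api-key. The host-gateway alias makes host.docker.internal resolve on Linux:

docker run -d --name open-webui \
  -p 127.0.0.1:8080:8080 \
  --add-host host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=local-placeholder \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main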
Data
Qdrant · Vector DB · 6333/tcp
Vector DB. Production-quality HNSW + payload filtering; Rust performance; single-binary easy to back up. pgvector is the alternative when you already operate Postgres.
Runs: Docker container, named volume
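A single-node sketch, loopback-bound per the Networking section; the named volume is what the nightly backup in Storage targets:

docker run -d --name qdrant \
  -p 127.0.0.1:6333:6333 \
  -v qdrant_data:/qdrant/storage \
  qdrant/qdrant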
Redis 7 · Queue · 6379/tcp (loopback only)
Job queue. OpenHands queues long-running tasks via Redis; lighter than Postgres-backed queues for this workload.
Runs: Docker container
Operations
Docker (rootless) · Sandbox
Code-execution sandbox. OpenHands spawns ephemeral containers per task; rootless mode prevents agent-driven privilege escalation if a model misbehaves.
Runs: host docker daemon
Caddy · Proxy / TLS · 443/tcp
Reverse proxy + TLS. Auto-Let's-Encrypt; sane defaults; tiny config file. Traefik is the alternative when you need Docker-discovery routing.
Runs: host systemd unit
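A Caddyfile sketch; the hostname is illustrative. With tailscaled running, Caddy can fetch certificates for ts.net names from it automatically:

# write the config, then reload without dropping connections
sudo tee /etc/caddy/Caddyfile <<'EOF'
workstation.tailnet-name.ts.net {
    reverse_proxy 127.0.0.1:8080
}
EOF
sudo systemctl reload caddy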
Tailscale · Remote access
Remote access. WireGuard-based mesh; joins the workstation to the same private network as your devices without exposing 443 publicly. Cloudflare Tunnel is the alternative for browser-only access.
Runs: host service
Prometheus + Grafana + nvidia-dcgm-exporter · Observability
Metrics + dashboards. vLLM exposes Prometheus metrics natively (queue depth, decode tok/s, KV-cache utilization); dcgm-exporter adds GPU temp/power/memory. The Loki addon is optional for log aggregation.
Runs: Docker compose stack

Hardware

RTX 4090 24 GB is the floor. The 32B AWQ-INT4 model takes ~16 GB; KV cache for 32K context takes ~3-4 GB; OS + driver overhead takes ~2 GB. That totals ~21-22 GB, which leaves comfortable headroom for the embeddings model (~300 MB) and the dcgm-exporter footprint.

CPU should be modern high-end (Ryzen 7 7950X or i7-14700K class). 64 GB DDR5 is the practical floor — Qdrant's HNSW indices, Redis, the Docker daemon, and the OS all want RAM. NVMe Gen4 because Qdrant snapshots and OpenHands sandbox spinup hit storage hard.

PSU must be Gold-rated 1000 W minimum. Transient spikes on a 4090 hit 600 W+ for milliseconds; cheaper PSUs trip OCP under sustained inference + agent loop bursts.
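If you would rather cap the card than upsize the PSU, the stock nvidia-smi knob buys a hard power ceiling for what is typically a small decode-throughput cost (re-apply after reboot, e.g. from a systemd unit):

sudo nvidia-smi -pm 1    # persistence mode so the limit sticks
sudo nvidia-smi -pl 400  # board power limit in watts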

Storage

Plan ~150 GB just for the model weights and runtime images. The vector DB grows ~1 KB per code chunk × ~10K chunks per repo = ~10 MB per repo, but the model snapshots and OpenHands workspaces dominate.

Embeddings re-ingestion is the SSD-wear surface. A daily full re-ingest of a 100K-file monorepo writes ~1-3 GB. Consumer NVMe drives are rated 600-1200 TBW; you won't hit that in years of normal use, but plan for incremental ingestion (watcher-based) instead of nightly full rebuilds.
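A watcher sketch, assuming inotify-tools is installed; ingest_file.sh is a hypothetical hook that chunks, embeds, and upserts one path into Qdrant:

# re-embed only what changed instead of nightly full rebuilds
inotifywait -m -r -e close_write,create,delete --format '%w%f' /srv/repos |
while read -r path; do
  ./ingest_file.sh "$path"   # hypothetical: chunk, embed, upsert
done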

Back up Qdrant volumes + OpenHands state nightly. The lazy operator pattern:

docker run --rm -v qdrant_data:/data -v "$(pwd)/backup:/backup" alpine \
  tar czf /backup/qdrant-$(date +%F).tgz /data
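The matching restore, assuming the same volume name; the snapshot filename is illustrative. Stop Qdrant first, and dry-run the whole loop before you need it:

SNAP=qdrant-2026-05-01.tgz   # whichever snapshot you're rolling back to
docker run --rm -v qdrant_data:/data -v "$(pwd)/backup:/backup" alpine \
  sh -c "rm -rf /data/* && tar xzf /backup/$SNAP -C /"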

Networking

Bind every service to localhost EXCEPT Caddy on 443 and Tailscale's interface. The agent's Docker sandbox containers should run with --network=none unless they explicitly need network egress (e.g. a tool that hits a stub HTTP API in tests).
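The same rule as a one-liner; a zero-egress sandbox run sketch where the image and test command are illustrative:

# the container sees only the mounted workspace and no network at all
docker run --rm --network=none --cpus 2 --memory 4g \
  -v "$PWD":/workspace -w /workspace \
  python:3.12-slim python -m unittest discover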

Open WebUI behind Caddy + Tailscale is the recommended remote-access pattern — never expose Open WebUI directly to the public internet. The auth surface inside Open WebUI is fine for a single user but does not stand up to internet-facing scrutiny.

Inside Tailscale: enable MagicDNS so you can hit workstation/openwebui from your laptop without remembering the tailnet IP.
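MagicDNS itself is toggled in the Tailscale admin console; the CLI only names the node. A sketch (the tailnet domain is illustrative):

sudo tailscale up --hostname=workstation            # the name MagicDNS will serve
curl -sI https://workstation.tailnet-name.ts.net/   # lands on Caddy, then Open WebUI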

Observability

vLLM's Prometheus endpoint is the single most important metric source. Watch these (spot-check sketch after the list):

  • vllm:e2e_request_latency_seconds (p99)
  • vllm:gpu_cache_usage_perc (KV-cache pressure)
  • vllm:num_requests_running (concurrency)
  • DCGM_FI_DEV_GPU_TEMP (sustained ≥ 80 °C means more airflow needed)
  • DCGM_FI_DEV_POWER_USAGE (sustained ≥ 420 W means PSU under stress)
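The spot-check mentioned above; vLLM serves Prometheus text on its main port at /metrics:

# eyeball KV-cache pressure and live concurrency from the workstation
curl -s http://127.0.0.1:8000/metrics | grep -E 'vllm:(gpu_cache_usage_perc|num_requests_running)'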

Set Grafana alerts on (a) GPU temp ≥ 84 °C for ≥ 5 min, (b) KV-cache utilization ≥ 90 % for ≥ 60 s (context likely OOM-bound), (c) OpenHands queue depth ≥ 5 (agent loop falling behind).
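Those thresholds map one-to-one onto Prometheus alerting rules. A sketch of (a) and (b); the file path is illustrative, and you'd wire (c) the same way once you've confirmed the queue-depth metric name OpenHands exposes:

# rules file, referenced from prometheus.yml under rule_files:
cat > alerts.yml <<'EOF'
groups:
  - name: local-coding-agent
    rules:
      - alert: GpuTooHot
        expr: DCGM_FI_DEV_GPU_TEMP >= 84
        for: 5m
      - alert: KvCacheSaturated
        expr: vllm:gpu_cache_usage_perc >= 0.90   # gauge runs 0-1
        for: 60s
EOF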

Logs: Open WebUI and OpenHands log to stdout; pipe through Loki if you want grep-able history. Without Loki, Docker's default 10 MB / file × 3 files cap is enough for solo use.

Security

Auth. Open WebUI's signup-disabled mode + a strong owner password is the floor. Add Caddy's basic-auth on top if you ever expose past Tailscale. Never commit API keys to the workspace the agent has access to.

Sandbox. Run OpenHands in rootless Docker. The agent is non-deterministic — if it ever ships a rm -rf / to a tool call, only the sandbox container should be at risk.

Network exposure. Tailscale + MagicDNS keeps the system on your private mesh. If you must use Cloudflare Tunnel for browser-only access, gate it behind Cloudflare Access (Google SSO is fine for solo use). Never expose vLLM, Qdrant, or Redis directly — they have no auth.

Secrets. Mount Tailscale auth-key, Caddy TLS cert, OpenHands GitHub token via Docker secrets, not env vars. Env vars leak into docker inspect output.
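A file-backed secrets sketch for Compose (no swarm required). The secret surfaces at /run/secrets/<name> inside the container instead of in the env block; the service shown is illustrative, and whether an image reads a *_FILE path is image-specific, so check yours:

# compose.yaml fragment
cat >> compose.yaml <<'EOF'
secrets:
  ts_authkey:
    file: ./secrets/ts_authkey   # chmod 600, git-ignored
services:
  tailscale:
    image: tailscale/tailscale:stable
    secrets:
      - ts_authkey               # mounted at /run/secrets/ts_authkey
EOF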

Upgrade path

More users (3-10 humans): swap vLLM → SGLang. RadixAttention's prefix-cache pays for itself the moment 3+ users share a system prompt.

Bigger model (70B): add a second 4090 or 3090 (NVLink optional but useful). Move to vLLM tensor-parallel or ExLlamaV2 for solo throughput. See /stacks/dual-3090-workstation.

Production (multi-tenant, SLA): move vLLM → Ray Serve + vLLM with replicas, add per-user API keys via a thin gateway (Kong / KrakenD), formalize the observability stack with Loki + alerts.

Lower latency (interactive coding): switch quant from AWQ-INT4 → EXL2 5.0bpw on ExLlamaV2 for ~20% throughput uplift on solo decode; you give back vLLM's multi-user batching wins.

What breaks first

  1. KV cache OOM mid-task. A long agent session at 32K context can OOM the GPU when the agent loops on its own output. Cap OpenHands' max-context-tokens at 24K and clear conversations between tasks.
  2. Docker out-of-disk. OpenHands sandbox containers leak filesystem layers. Run docker system prune -af --volumes weekly, or set up a cron job (sketch after this list).
  3. Driver / CUDA drift. Ubuntu auto-updates can swap the NVIDIA driver mid-week. Pin the driver version with apt-mark hold nvidia-driver-XXX and only upgrade deliberately.
  4. vLLM upgrade breaks AWQ. vLLM minor versions occasionally drop AWQ kernel compat. Pin the vLLM image SHA in your compose file; only bump after testing.
  5. Tailscale rate-limiting. Free-tier Tailscale caps at 100 devices and relayed traffic at roughly 5 Mbps. That's fine for solo workloads, but if you start syncing 100K-file monorepos through the tailnet, you'll hit those caps.
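The cron line for item 2. Note that --volumes also removes any named volume no container references, so run it while the stack is up:

# weekly cleanup of leaked sandbox layers, Sundays 03:00 (crontab -e)
0 3 * * 0 /usr/bin/docker system prune -af --volumes >> /var/log/docker-prune.log 2>&1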

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/local-coding-agent →
  • /stacks/memory-enabled-agent →
  • /stacks/fully-offline-coding-stack →

Workflow validation · evidence only

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 1 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 0 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • qwen-2.5-coder-32b-instruct via vllm · cohort: low · 2 benchmarks on this triple, not yet reproduced. Submit a fresh reproduction →
Validate this workflow → · See benchmark roadmap → · How validation works →