Local coding-agent system
End-to-end local autonomous coding agent. vLLM serving Qwen 2.5 Coder 32B, OpenHands as the agent controller, Open WebUI for chat, Qdrant + nomic-embed for code RAG, bge-reranker for reranking retrieved chunks, Redis for the agent queue, Docker sandbox for code execution, Caddy reverse proxy, Tailscale for remote access. The whole system runs on one workstation.
Build summary
Goal: An autonomous coding agent that learns the codebase, executes multi-file edits, runs tests, and never sends a token to the cloud.
Operator card
- ✓ Solo developer who wants an autonomous coding agent
- ✓ Privacy-sensitive teams that can't ship code to cloud LLMs
- ✓ Engineers who want to learn how vLLM + agent loops actually run
- ✓ Homelab operators who already have a single 4090
- ⚠ You need >2 concurrent users (move to SGLang or the production tier)
- ⚠ Your codebase is >10M LoC (the embeddings index becomes the bottleneck)
- ⚠ You can't dedicate a workstation to this 24/7
- ⚠ You're on Apple Silicon (see /workflows/private-chatgpt-replacement instead)
Service ledger
11 services across 4 layers. Each entry includes a one-line operator note explaining why that service was picked over its alternatives.
Hardware
RTX 4090 24 GB is the floor. The 32B AWQ-INT4 model takes 16 GB; KV cache for 32K context takes ~3-4 GB; OS + driver overhead ~2 GB. Headroom for the embeddings model (300 MB) and the dcgm-exporter footprint is comfortable.
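To check that this budget holds on your card once everything is up, query the driver directly; the expected numbers in the comments come from the estimate above.

```bash
# Overall VRAM picture: expect ~16 GB weights + 3-4 GB KV cache + ~2 GB OS/driver,
# plus ~300 MB for the embeddings model, leaving a small but workable margin on 24 GB.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# Per-process breakdown, to see which service owns what:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```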
CPU should be modern high-end (Ryzen 7 7950X or i7-14700K class). 64 GB DDR5 is the practical floor — Qdrant's HNSW indices, Redis, the Docker daemon, and the OS all want RAM. NVMe Gen4 because Qdrant snapshots and OpenHands sandbox spinup hit storage hard.
PSU must be Gold-rated 1000 W minimum. Transient spikes on a 4090 hit 600 W+ for milliseconds; cheaper PSUs trip OCP under sustained inference + agent loop bursts.
Storage
Plan ~150 GB just for the model weights and runtime images. The vector DB grows ~1 KB per code chunk × ~10K chunks per repo = ~10 MB per repo, but the model snapshots and OpenHands workspaces dominate.
Embeddings re-ingestion is the SSD-wear surface. A daily full re-ingest of a 100K-file monorepo writes ~1-3 GB. Consumer NVMe drives are rated 600-1200 TBW; you won't hit that in years of normal use, but plan for incremental ingestion (watcher-based) instead of nightly full rebuilds.
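A minimal sketch of the watcher-based approach, assuming inotify-tools is installed and a hypothetical ingest_file.py that embeds one file and upserts its chunks into Qdrant; the point is to re-embed only what changed rather than rebuilding the whole index.

```bash
#!/usr/bin/env bash
# Incremental re-ingestion: watch the repo and re-embed only files that change.
# ingest_file.py is a placeholder for whatever script pushes one file's chunks to Qdrant.
REPO=/srv/code/monorepo
inotifywait -m -r -q \
  -e close_write -e moved_to -e create \
  --format '%w%f' "$REPO" \
| grep -E '\.(py|ts|go|rs|java)$' \
| while read -r changed_file; do
    python ingest_file.py --file "$changed_file" --collection code_chunks
  done
```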
Back up Qdrant volumes + OpenHands state nightly. The lazy operator pattern is a throwaway container that tars the volume:
docker run --rm -v qdrant_data:/data -v $(pwd)/backup:/backup alpine tar czf /backup/qdrant-$(date +%F).tgz /data
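Expanding that one-liner into the nightly pattern the text calls for is a short script; the volume names (qdrant_data, openhands_state) and paths below are assumptions, so match them to whatever your compose file actually creates.

```bash
#!/usr/bin/env bash
# Nightly backup of Qdrant + OpenHands state, keeping the last 14 archives.
# Run from cron, e.g.: 30 2 * * * /opt/agent/backup.sh
set -euo pipefail
BACKUP_DIR=/srv/backups/agent
mkdir -p "$BACKUP_DIR"
for vol in qdrant_data openhands_state; do
  docker run --rm -v "${vol}:/data:ro" -v "${BACKUP_DIR}:/backup" alpine \
    tar czf "/backup/${vol}-$(date +%F).tgz" /data
done
# Simple retention: delete archives older than 14 days.
find "$BACKUP_DIR" -name '*.tgz' -mtime +14 -delete
```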
Networking
Bind every service to localhost EXCEPT Caddy on 443 and Tailscale's interface. The agent's Docker sandbox containers should run with --network=none unless they explicitly need network egress (e.g. a tool that hits a stub HTTP API in tests).
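A sketch of both rules in docker run form; image tags, ports, and the sandbox image name are illustrative, and in compose the same binding is a 127.0.0.1: prefix on each ports: entry.

```bash
# Internal services bind to loopback only; only Caddy (and the Tailscale interface) listen beyond it.
docker run -d --name qdrant -p 127.0.0.1:6333:6333 qdrant/qdrant
docker run -d --name vllm --gpus all -p 127.0.0.1:8000:8000 \
  vllm/vllm-openai:latest --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ

# Agent sandbox containers get no network at all unless a task explicitly needs egress.
docker run --rm --network=none -v /srv/workspaces/task-42:/workspace my-sandbox-image \
  bash -c "cd /workspace && pytest -q"
```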
Open WebUI behind Caddy + Tailscale is the recommended remote-access pattern — never expose Open WebUI directly to the public internet. The auth surface inside Open WebUI is fine for a single user but does not stand up to internet-facing scrutiny.
Inside Tailscale: enable MagicDNS so you can hit workstation/openwebui from your laptop without remembering the tailnet IP.
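One way to wire this up, assuming Caddy can reach the local tailscaled (Caddy can fetch certificates for *.ts.net names from the Tailscale daemon); the tailnet name, upstream port, and password hash are placeholders.

```bash
cat > Caddyfile <<'EOF'
# Serve Open WebUI on the MagicDNS hostname, TLS cert fetched via Tailscale.
workstation.your-tailnet.ts.net {
    # Optional second auth layer if you ever expose beyond the tailnet:
    # basic_auth {
    #     operator <hash from `caddy hash-password`>
    # }
    reverse_proxy open-webui:8080
}
EOF
```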
Observability
vLLM's Prometheus endpoint is the single most important metric source. Watch:
- vllm:e2e_request_latency_seconds (p99)
- vllm:gpu_cache_usage_perc (KV-cache pressure)
- vllm:num_requests_running (concurrency)
- DCGM_FI_DEV_GPU_TEMP (sustained ≥ 80 °C means more airflow needed)
- DCGM_FI_DEV_POWER_USAGE (sustained ≥ 420 W means the PSU is under stress)
Set Grafana alerts on (a) GPU temp ≥ 84 °C for ≥ 5 min, (b) KV-cache utilization ≥ 90 % for ≥ 60 s (context likely OOM-bound), (c) OpenHands queue depth ≥ 5 (agent loop falling behind).
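As Prometheus rules, alerts (a) and (b) look roughly like the sketch below, assuming the default DCGM and vLLM metric names (vllm:gpu_cache_usage_perc is exported as a 0-1 fraction); alert (c) depends on whatever gauge you export for the OpenHands/Redis queue, since there is no standard one.

```bash
cat > /etc/prometheus/rules/coding-agent.yml <<'EOF'
groups:
  - name: coding-agent
    rules:
      - alert: GpuRunningHot
        expr: DCGM_FI_DEV_GPU_TEMP >= 84
        for: 5m
      - alert: KvCacheNearFull
        expr: vllm:gpu_cache_usage_perc >= 0.9
        for: 1m
EOF
```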
Logs: Open WebUI and OpenHands log to stdout; pipe them through Loki if you want grep-able history. Without Loki, capping Docker's json-file logs at 10 MB per file × 3 files per container is enough for solo use (see the sketch below).
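A sketch of that cap applied daemon-wide; it only affects containers created after the restart, and the same options can be set per service in compose under logging:.

```bash
# Cap json-file logs at 10 MB x 3 files for every new container.
# Merge by hand if /etc/docker/daemon.json already has other settings.
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
sudo systemctl restart docker
```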
Security
Auth. Open WebUI's signup-disabled mode + a strong owner password is the floor. Add Caddy's basic-auth on top if you ever expose past Tailscale. Never commit API keys to the workspace the agent has access to.
Sandbox. Run OpenHands in rootless Docker. The agent is non-deterministic — if it ever ships a rm -rf / to a tool call, only the sandbox container should be at risk.
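Rootless setup on Ubuntu is roughly the following sketch; Docker's rootless-mode docs cover the prerequisites (subuid/subgid ranges, lingering user services).

```bash
# Run the Docker daemon as your user so a sandbox escape doesn't land as root.
sudo apt-get install -y uidmap docker-ce-rootless-extras
dockerd-rootless-setuptool.sh install
systemctl --user enable --now docker
sudo loginctl enable-linger "$USER"   # keep the user daemon running after logout
# Point clients (including OpenHands) at the rootless socket:
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
```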
Network exposure. Tailscale + MagicDNS keeps the system on your private mesh. If you must use Cloudflare Tunnel for browser-only access, gate it behind Cloudflare Access (Google SSO is fine for solo use). Never expose vLLM, Qdrant, or Redis directly — they have no auth.
Secrets. Mount Tailscale auth-key, Caddy TLS cert, OpenHands GitHub token via Docker secrets, not env vars. Env vars leak into docker inspect output.
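The leak is easy to demonstrate, and the compose-level fix is small; the container and secret names below (openhands, gh_token) are illustrative.

```bash
# Anything passed as an env var is readable by anyone who can reach the Docker socket:
docker inspect --format '{{json .Config.Env}}' openhands

# File-backed compose secret instead: the token lands at /run/secrets/gh_token
# inside the container and never shows up in `docker inspect`.
#   secrets:
#     gh_token:
#       file: ./secrets/github_token
#   services:
#     openhands:
#       secrets: [gh_token]
```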
Upgrade path
More users (3-10 humans): swap vLLM → SGLang. RadixAttention's prefix-cache pays for itself the moment 3+ users share a system prompt.
Bigger model (70B): add a second 4090 or 3090 (NVLink exists only on the 3090, and is optional but useful). Move to vLLM tensor-parallel, or to ExLlamaV2 for solo throughput. See /stacks/dual-3090-workstation.
Production (multi-tenant, SLA): move vLLM → Ray Serve + vLLM with replicas, add per-user API keys via a thin gateway (Kong / KrakenD), formalize the observability stack with Loki + alerts.
Lower latency (interactive coding): switch quantization from AWQ-INT4 to EXL2 5.0bpw on ExLlamaV2 for a ~20% throughput uplift on solo decode; you give back the multi-user batching wins in exchange.
What breaks first
- KV cache OOM mid-task. A long agent session at 32K context can OOM the GPU when the agent loops on its own output. Cap OpenHands' max-context-tokens at 24K and clear conversations between tasks.
- Docker out-of-disk. OpenHands sandbox containers leak filesystem layers. Run docker system prune -af --volumes weekly or set up a cron job (see the sketch after this list).
- Driver / CUDA drift. Ubuntu auto-updates can swap the NVIDIA driver mid-week. Pin the driver version with apt-mark hold nvidia-driver-XXX and only upgrade deliberately.
- vLLM upgrade breaks AWQ. vLLM minor versions occasionally drop AWQ kernel compat. Pin the vLLM image SHA in your compose file; only bump after testing.
- Tailscale rate-limiting. Free-tier Tailscale caps at 100 devices; ~5 Mbps of tunnel throughput is fine for solo workloads, but if you start syncing 100K-file monorepos through it you'll hit that ceiling.
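The cron sketch referenced in the Docker out-of-disk item above; the schedule is arbitrary.

```bash
# Append a weekly prune to the current crontab (Sunday 04:00).
( crontab -l 2>/dev/null
  echo '0 4 * * 0 docker system prune -af --volumes' ) | crontab -
```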
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 1 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 0 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- 2 benchmarks on this triple, not yet reproduced.