Operations guide · Anti-patterns

Common local AI setup mistakes — the operator's don't-do-this list

10 mistakes that consume the most operator time and produce the most r/LocalLLaMA confusion threads. From sizing 70B onto a 12 GB card to assuming NVLink pools memory, with the actual fix for each.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,900 words

How to read this list

These ten mistakes account for an outsized fraction of every “why is my local model so slow / broken / bad?” thread on r/LocalLLaMA and r/LocalLLM. They're not exotic. They're mostly the result of taking a default that worked on someone else's hardware and assuming it works on yours, or skipping the math because the runtime appeared to start without an error.

The fix in every case is short. The cost of not applying the fix is hours of debugging that produces no insight, plus a vague feeling that “local AI is just bad.” If you're currently in the middle of a frustrating setup, scan this list — there's a 70% chance one of these is the answer.

1. Trying to run 70B on a 12-16 GB card

The mistake: downloading a 70B Q4 model (~40 GB on disk) onto a card with 12-16 GB VRAM. The runtime starts. The model partially loads. It silently spills the rest into system RAM via CPU offload. You see 0.5-2 tok/s and assume the model is broken.

The reality: the model is fine. Your hardware is too small. CPU-offloaded portions of large models are 30-100× slower than the GPU portion. The math is simple: a 70B model needs roughly 38-42 GB at Q4 just for weights, plus 2-8 GB for KV cache at moderate context. There is no way to fit this on 12-16 GB without offload.

The fix: at 12-16 GB, run 14B-class models. At 24 GB, 32B Q4 fits; 32 GB (RTX 5090) adds headroom for longer context. 70B Q4 needs roughly 48 GB of combined VRAM, so two 24 GB cards, a 48 GB workstation card, or a unified-memory Mac with 64 GB+ is where 70B becomes comfortable. Use /will-it-run/custom to verify before downloading.
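
A back-of-the-envelope version of that check, as a minimal Python sketch. The bits-per-weight figures, KV-cache allowance, and headroom below are rough assumptions, not authoritative numbers; /will-it-run/custom does the same arithmetic with more care.

  # Rough will-it-fit check: weights + KV cache + headroom vs. available VRAM.
  # Bits-per-weight values are approximate GGUF averages (assumption, not gospel).
  BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

  def fits(params_b: float, quant: str, vram_gb: float,
           kv_cache_gb: float = 4.0, headroom_gb: float = 1.5) -> bool:
      """Return True if the model plausibly fits without CPU offload."""
      weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8
      needed = weights_gb + kv_cache_gb + headroom_gb
      print(f"{params_b:.0f}B {quant}: ~{weights_gb:.1f} GB weights, "
            f"~{needed:.1f} GB total vs {vram_gb} GB VRAM")
      return needed <= vram_gb

  fits(70, "Q4_K_M", 16)   # ~42 GB of weights alone; nowhere near 16 GB
  fits(14, "Q4_K_M", 16)   # ~8.5 GB of weights; fits with room for KV cache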

2. Using Q2_K or Q3 quants for coding tasks

The mistake: picking the smallest quant that “fits” — say, a 32B Q2_K to squeeze onto a 12 GB card. The output is incoherent on hard prompts, mixes up variable names, hallucinates API signatures. You blame the model.

The reality: Q2 quants drop quality non-linearly on reasoning and code. The cliff is steep — Q4_K_M loses 2-3% on benchmarks vs FP16; Q2_K can lose 15-20%+. For chat, the loss is sometimes tolerable. For code generation, structured output, and tool calling, Q2 is rarely usable. See /systems/quantization-formats for the full quality-vs-VRAM matrix.

The fix: stay at Q4_K_M or higher whenever VRAM allows. If you're tempted to drop to Q2 to fit a bigger model, pick a smaller model at Q4 instead. A 14B Q4_K_M almost always beats a 32B Q2_K on real coding tasks.

3. Ignoring KV cache memory budget

The mistake: sizing your model to fill 95% of VRAM, then enabling a 32K-token context window. The first long prompt OOMs. You don't understand why — the model fit yesterday on the same card.

The reality: the KV cache grows with context length. For a 14B model at 32K context with FP16 KV, you need ~3-5 GB just for the cache, on top of the ~8-9 GB for the model weights. Long context isn't free. Tools rarely warn you about this in advance.

The fix: budget VRAM as weights + KV cache + headroom. KV cache size scales with batch size × sequence length × layers × 2 (for K and V) × KV width × bytes per element, where KV width is KV heads × head dim (much smaller than the full hidden dim on grouped-query-attention models). Use /will-it-run/custom's effective-VRAM math, or enable quantized KV cache (Q4/Q8 in llama.cpp, FP8 in vLLM) if your runtime supports it.
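
A worked version of that formula, as a sketch. The layer count, KV-head count, and head dimension below are typical for a 14B-class model but are assumptions; read the real values from your model's config.json.

  # KV cache bytes = batch x seq_len x layers x 2 (K and V) x kv_heads x head_dim x bytes/elem.
  def kv_cache_gib(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2, batch=1):
      return batch * seq_len * layers * 2 * kv_heads * head_dim * bytes_per_elem / 2**30

  # Illustrative 14B-class config (assumption): 40 layers, 8 KV heads, head_dim 128.
  print(kv_cache_gib(32_768, 40, 8, 128))                    # ~5.0 GiB at FP16
  print(kv_cache_gib(32_768, 40, 8, 128, bytes_per_elem=1))  # ~2.5 GiB with Q8 KV cache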

4. Assuming NVLink pools VRAM

The mistake: buying two RTX 3090s, plugging in an NVLink bridge, and expecting Ollama to see them as a single 48 GB GPU. Loading a 70B Q5 model fails with OOM on each card individually.

The reality: NVLink is a high-bandwidth interconnect, not a memory pool. The two cards remain logically separate to the runtime. To use combined VRAM, you need software that explicitly splits the model across GPUs: vLLM with --tensor-parallel-size 2, ExLlamaV2 with TabbyAPI's GPU-split settings, or llama.cpp's --split-mode plus --tensor-split to apportion layers across cards. NVLink only accelerates the communication between them; it doesn't change the topology.

The fix: read running local AI on multiple GPUs. Pick a runtime that supports tensor-parallel. NVLink is a 10-20% speedup on top of working multi-GPU; it is not the thing that makes multi-GPU work.
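
As a concrete example, a minimal vLLM launch that actually shards a model across two cards via its offline Python API. The model name is a placeholder, not a real repo; the point being illustrated is tensor_parallel_size.

  # Split one model across two GPUs with vLLM tensor parallelism.
  # NVLink (if present) speeds up inter-GPU traffic; the splitting itself
  # comes from tensor_parallel_size, not from the bridge.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="your-org/your-70b-awq-model",  # placeholder, not a real repo
      tensor_parallel_size=2,               # shard weights across 2 GPUs
  )
  out = llm.generate(["Say hello."], SamplingParams(max_tokens=32))
  print(out[0].outputs[0].text)

In recent vLLM versions the server-mode equivalent is vllm serve <model> --tensor-parallel-size 2.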

5. Letting Docker fill the disk

The mistake: running Open WebUI + Ollama in Docker for six months. One day inference fails with cryptic errors. df -h shows the disk is 100% full. Investigation reveals 200 GB of orphaned Docker layers from agent sandboxes and image rebuilds.

The reality: Docker's overlay2 storage driver accumulates layers from image pulls, container exits, and volume churn. A long-running setup with periodic image bumps and ephemeral agent containers (OpenHands, etc.) easily generates 50-200 GB of stale data per year. There is no automatic GC.

The fix: docker system prune -af --volumes as a weekly cron. Pin Docker images by SHA digest, not :latest, so re-pulls don't leave abandoned layers. See /systems/local-ai-maintenance.
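
If you'd rather schedule a script than remember the command, a small sketch you could run from a weekly cron job. It only wraps the two docker invocations named above; adjust the prune flags to your own tolerance for aggressive cleanup.

  # Weekly Docker cleanup: report usage, then prune unused data.
  # --volumes also removes unused volumes; drop it if you store models in Docker volumes.
  import subprocess

  subprocess.run(["docker", "system", "df"], check=True)  # show what's reclaimable
  subprocess.run(["docker", "system", "prune", "-af", "--volumes"], check=True)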

6. Running Ollama on Windows + AMD without ROCm

The mistake: installing Ollama on Windows with an AMD card. The runtime starts. Inference works but at CPU-tier tok/s. You assume your 7900 XTX is “just slow.”

The reality: ROCm support on Windows in 2026 has improved significantly but is still patchy. Some Ollama Windows builds default to a CPU or Vulkan path on AMD because ROCm Windows isn't fully detected. The card works fine; the runtime isn't using it correctly.

The fix: verify the runtime is actually using ROCm by checking ollama serve startup logs for “ROCm” or “HIP” references. If absent, install the ROCm Windows components explicitly, or move to Linux (where AMD support is much stronger). See /errors for the AMD-specific failure modes. The honest answer for AMD users in 2026: Linux is significantly easier than Windows.

7. Not pinning your driver / CUDA / runtime versions

The mistake: letting Linux apt auto-upgrade your NVIDIA driver, or letting Windows Update bump the WSL2 kernel. A previously-working vLLM or ExLlamaV2 setup fails to load with cryptic libcuda.so version mismatch errors.

The reality: the driver, CUDA toolkit, and inference engine all have ABI compatibility constraints. When the driver moves and the engine doesn't, things break. The opposite (engine moves, driver stays) usually works because of forward compatibility, but the driver-moves case routinely breaks builds.

The fix: apt-mark hold nvidia-driver-XXX on Ubuntu. Defer Windows feature updates 90 days. Pin Docker images by SHA. Bump the whole stack deliberately on a quiet weekend, not by accident at 3 AM. See /systems/local-ai-maintenance for the full driver-pinning workflow.

8. Mixing FP16 and quantized weights without checking

The mistake: downloading a quantized model from one repo and trying to load its tokenizer or chat template from another, or using a fine-tune's LoRA adapter against a base model that's a different quantization than the LoRA was trained on. Output is garbled or off-topic.

The reality: the model card, tokenizer config, and quantization need to match. A LoRA trained on FP16 weights doesn't apply cleanly to a Q4 base; tokenizer mismatches between repos can produce subtle but real degradation. New AWQ checkpoints sometimes ship with subtly different chat templates than their FP16 sources, breaking tool calling.

The fix: use the model card from the same repo as the weights you downloaded. If using LoRA, check that the adapter's training-time quantization matches your runtime quantization. Test with a simple known prompt before assuming the model is broken.
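
A minimal smoke test against a local Ollama endpoint, assuming the default port and a placeholder model name; any OpenAI-compatible endpoint works the same way. If a trivially easy prompt comes back garbled, suspect the template/tokenizer pairing before blaming the model.

  # Sanity-check a freshly downloaded model with a known-easy prompt.
  # Assumes Ollama's default endpoint; swap in the model you actually pulled.
  import json, urllib.request

  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps({
          "model": "qwen2.5:14b",          # placeholder model name
          "prompt": "Reply with exactly: OK",
          "stream": False,
      }).encode(),
      headers={"Content-Type": "application/json"},
  )
  resp = json.load(urllib.request.urlopen(req))
  print(resp["response"])  # garbled output here points at a template/tokenizer mismatch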

9. Enabling 128K context on hardware that can't hold it

The mistake: the model card says “128K context window!” You set num_ctx 131072 in Ollama. The runtime allocates 30+ GB for the KV cache, which doesn't fit. Inference fails with OOM, or worse, succeeds but at 1-2 tok/s because the KV cache is paged.

The reality: the model architecture supports 128K context. Your hardware probably doesn't. KV cache for 128K on a 70B model can exceed 40 GB at FP16, roughly as large as the Q4 weights themselves. Most consumer setups should run 8-32K context, not the model's nominal max.

The fix: set context length to what fits comfortably, not what the model card lists. Use Q4 or Q8 KV cache compression to halve or quarter the cost. Measure: nvidia-smi right after loading shows you the real footprint.
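
The same arithmetic as in mistake 3, applied to a 70B-class config at 128K. The 80 layers, 8 KV heads, and head_dim 128 below are typical assumptions, not any specific model's card.

  # KV cache for an assumed 70B-class config (80 layers, 8 KV heads, head_dim 128) at 128K context.
  per_token_bytes = 2 * 80 * 8 * 128 * 2         # K and V, FP16
  print(per_token_bytes * 131_072 / 2**30)       # ~40 GiB; bigger than most consumer cards' VRAM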

10. Trusting tok/s benchmarks without sample size

The mistake: reading a single Reddit post saying “I get 75 tok/s on Llama 3 70B Q4 with my 4090” and assuming you'll get the same. You buy the 4090. You get 30 tok/s. You feel deceived.

The reality: single-number benchmarks are almost always best-case. They depend on prompt length, batch size, sampling temperature, KV cache settings, runtime version, driver version, OS, ambient temperature, and whether anything else is using the GPU. Real-world tok/s is typically a 30-50% range, not a single number.

The fix: trust ranges, not single benchmarks. Cross-check at least three sources for any tok/s claim before making a buying decision. Run your own benchmark on your own prompts before declaring a build a success or a failure. /hardware/rtx-3090 and /hardware/rtx-4090 publish ranges, not single numbers, for this reason.
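
A quick way to get your own number instead of Reddit's, using the eval stats Ollama returns in its generate response (eval_count is generated tokens, eval_duration is nanoseconds). This is a rough sketch with a placeholder model name; run it over several of your own prompts and look at the spread, not the single best run.

  # Measure your own decode tok/s from Ollama's reported eval stats.
  import json, urllib.request

  def toks_per_sec(model, prompt):
      req = urllib.request.Request(
          "http://localhost:11434/api/generate",
          data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
          headers={"Content-Type": "application/json"},
      )
      r = json.load(urllib.request.urlopen(req))
      return r["eval_count"] / (r["eval_duration"] / 1e9)

  # Run a handful of your own prompts; expect a range, not one number.
  for p in ["Summarize the rules of chess.",
            "Write a Python function that merges two sorted lists."]:
      print(round(toks_per_sec("llama3.1:70b", p), 1), "tok/s")  # placeholder model name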

What actually goes right when you avoid these

Operators who avoid these ten mistakes report dramatically smoother experiences. The pattern that works:

  • Verify hardware fit at /will-it-run/custom before downloading.
  • Stay at Q4_K_M or higher quantization.
  • Pin driver, CUDA, and runtime versions deliberately.
  • Run docker system prune weekly if using Docker.
  • Set context length to what fits, not what the model claims.
  • Read multiple data points before believing any single benchmark.

That checklist removes the bottom 90% of operator pain. The remaining 10% — driver bumps, ROCm cycles, thermal creep — is covered in /systems/local-ai-maintenance.

Adjacent reading: /errors for the full taxonomy of specific error messages and their fixes; can I run AI locally? for the foundation if you're still pre-purchase; the hardware buying ladder if you're upgrading.