Build a 16GB VRAM local AI stack (May 2026)
A useful local AI workstation on a 16GB VRAM card (RTX 4060 Ti 16GB, RTX 4080 Super, RTX 5070 Ti, or M-class Apple Silicon with 24GB+ unified memory). Daily-driver quality at the budget tier without trying to pretend it's a 4090.
- 01 · Hardware · Reference GPU (the constraint that defines this stack): rtx-4060-ti-16gb
RTX 4060 Ti 16GB is the budget consumer card that justifies its premium specifically for 13-14B class models. ~165W TDP, a bit over a third of a 4090's 450W. The architectural anchor: 16GB lets you run 14B class models comfortably, but rules out 32B AWQ (which needs ~22GB).
- 02 · Tool · Inference engine: ollama
Ollama over vLLM at this tier: zero-config setup, fits the single-user pattern, and the Q4_K_M quants Ollama defaults to are exactly what fits 16GB. vLLM's continuous-batching wins don't apply to a single-user box.
- 03 · Model · Primary chat / lightweight coding model: phi-4-14b
Phi-4 14B over Qwen 2.5 14B for the 16GB tier: Phi-4 has stronger reasoning per parameter and fits Q4_K_M comfortably (~9.5GB) with KV-cache headroom for 8K context. Qwen 2.5 14B is the alternative when reasoning matters less than coding-specific quality.
- 04 · Model · Fast iteration model (chat + tool calls): qwen-2.5-7b-instruct
Qwen 2.5 7B Q5_K_M for the 'I want a response in 1-2 seconds' workflow. ~60-90 tok/s on a 4060 Ti — fast enough for interactive iteration and tool-call-heavy agent loops at this hardware tier.
- 05 · Tool · Unified frontend (chat + RAG): openwebui
Open WebUI as the multi-model frontend. The model switcher lets you flip between Phi-4 14B (when reasoning matters) and Qwen 2.5 7B (when speed matters) in the same conversation. RAG is functional out of the box.
Why this stack
The 16GB VRAM tier is where most homelab readers actually land — the 4090 / 5090 / MI300X tier is aspirational; the 16GB tier is the budget reality. The honest framing this stack takes: don't pretend a 16GB card can run a 32B model, but understand exactly what it can run, and build the stack to make those models useful.
The architectural anchor: 16GB VRAM fits 14B-class models in Q4_K_M quant with comfortable KV-cache headroom for 8K context. It does NOT comfortably fit 32B AWQ (which needs ~22GB). The two model picks in this stack — Phi-4 14B and Qwen 2.5 7B — give you reasoning quality at 14B and fast iteration at 7B, both well within the VRAM budget.
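The KV-cache half of that budget is easy to sanity-check yourself. A minimal back-of-the-envelope sketch in shell; the attention-layout numbers (layers, KV heads, head dim) are illustrative assumptions for a Phi-4-class 14B, not read from the model card:
# Rough KV-cache sizing, assuming 40 layers, 10 KV heads, head dim 128,
# and an fp16 cache (2 bytes per value).
# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * context_tokens
CTX=8192
echo "$(( 2 * 40 * 10 * 128 * 2 * CTX / 1024 / 1024 )) MiB of KV cache at $CTX tokens"
# ~1600 MiB at 8K context, ~6400 MiB at 32K; that is where the 14B-at-8K sizing in this stack comes from.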
Step-by-step setup
1. Install Ollama and pull the models
# Native Ollama install — the simplest path at this tier
curl -fsSL https://ollama.com/install.sh | sh
# Primary model — Phi-4 14B for chat + lightweight coding
ollama pull phi4:14b
# Fast iteration model — Qwen 2.5 7B for speed
ollama pull qwen2.5:7b
# Embedding model — for Open WebUI's RAG
ollama pull mxbai-embed-large
# Verify all three models downloaded
ollama list
Q4_K_M quants are Ollama's default for both chat-model tags, so no flag is needed; if you want the Q5_K_M 7B variant quoted above, pull its explicit quant tag from the Ollama library page for qwen2.5 instead. The three pulls together use ~15GB of disk; allow headroom for additional model experiments.
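Before wiring up a frontend, confirm the Ollama API itself answers on its default port (11434). A quick smoke test; the first generate call includes model load time, so expect a pause:
# List pulled models via the HTTP API (same inventory as ollama list)
curl -s http://localhost:11434/api/tags
# One-shot generation against the primary model
curl -s http://localhost:11434/api/generate -d '{
  "model": "phi4:14b",
  "prompt": "Reply with the single word: ready",
  "stream": false
}'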
2. Install Open WebUI as the unified frontend
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL="http://host.docker.internal:11434" \
-e RAG_EMBEDDING_ENGINE="ollama" \
-e RAG_EMBEDDING_MODEL="mxbai-embed-large" \
ghcr.io/open-webui/open-webui:latest
Open WebUI auto-discovers Ollama's models. After first-run setup at http://localhost:3000, the model switcher shows both Phi-4 14B and Qwen 2.5 7B as siblings. Use the switcher freely mid-conversation — chat history persists across model switches. The RAG_EMBEDDING_ENGINE=ollama line points document embeddings at the mxbai-embed-large model pulled in step 1 rather than the built-in embedder.
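If the model switcher comes up empty, check connectivity before digging into settings. Two checks that catch most first-run problems:
# Is Ollama reachable and serving the expected models?
curl -s http://localhost:11434/api/tags
# Did the Open WebUI container start cleanly and find the backend?
docker logs open-webui --tail 50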
3. Configure VRAM management
# Ollama keeps models in VRAM for 5 minutes after last use by default.
# On 16GB this matters: switching from Phi-4 14B (~11GB with KV cache) to
# Qwen 2.5 7B (~6.5GB) can OOM or spill to system RAM if both stay resident.
# Set a tight keep-alive via a systemd drop-in:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_KEEP_ALIVE=60s"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify only one model resident at a time
ollama ps
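If you'd rather not touch the systemd unit, the same effect is available on demand: the CLI can evict a resident model, and the API accepts a per-request keep_alive. A sketch using the models from step 1:
# Evict the 14B immediately before switching to the 7B
ollama stop phi4:14b
# Or unload via the API: an empty request with keep_alive set to 0
curl -s http://localhost:11434/api/generate -d '{"model": "phi4:14b", "keep_alive": 0}'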
4. Optional — add RAG via Open WebUI
# Open WebUI's RAG works out of the box once the embedding model is
# pulled. Upload documents via the UI; the system uses ChromaDB by
# default for vector storage.
# To upgrade to LanceDB or Qdrant for >100K chunk corpora, see
# Open WebUI's RAG provider config — but on a 16GB card you'll likely
# never reach the LanceDB ceiling for personal-scale workspaces.
What actually fits in 16GB
The honest VRAM math for the 16GB tier (a quick way to verify these numbers on your own card follows the two lists):
- Phi-4 14B Q4_K_M — ~9.5GB weights + 1.5GB KV cache (8K) = ~11GB. Fits comfortably. 25-35 tok/s on RTX 4060 Ti.
- Qwen 2.5 14B Q4_K_M — same VRAM footprint as Phi-4 14B. Stronger on coding-specific tasks; weaker on general reasoning.
- Qwen 2.5 7B Q5_K_M — ~5.5GB weights + 1GB KV cache (8K) = ~6.5GB. Fast iteration, headroom for 32K context. 60-90 tok/s.
- Llama 3.1 8B Q5_K_M — ~6GB weights + KV. Comparable to Qwen 2.5 7B; pick by which model produces output you prefer.
- Mistral Nemo 12B Q4_K_M — ~7.5GB. Tight fit with 32K context; the right pick for long-context workloads at this tier.
What does NOT fit:
- Qwen 2.5 Coder 32B AWQ — needs ~22GB. Doesn't fit. Drop to the 14B variant.
- Llama 3.3 70B Q4_K_M — needs ~42GB; would require ~75% CPU offload. Throughput drops to single-digit tok/s. Don't.
- DeepSeek V3 / DeepSeek R1 — frontier-tier models. Not viable on 16GB even with aggressive quants.
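To run that verification, load a model and compare what Ollama reports against what the card reports; ollama ps shows the loaded size and the GPU/CPU split, and nvidia-smi shows what the GPU itself is holding:
# Load the primary model with a one-shot prompt, then inspect residency
ollama run phi4:14b "hello" > /dev/null
ollama ps
# PROCESSOR should read 100% GPU; any CPU percentage means the model spilled
nvidia-smi --query-gpu=memory.used,memory.total --format=csv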
Failure modes you'll hit
- OOM when switching models in Open WebUI. Default Ollama keep-alive of 5 minutes leaves the previous model resident. Switching from 14B to 7B causes OOM. Tighten keep-alive (above) or accept manual ollama stop calls.
- Context-window truncation on large prompts. Ollama defaults to a 4K context for some models. Set an explicit context length via the num_ctx option (per request or in a Modelfile; see the sketch after this list) or in Open WebUI workspace settings.
- Open WebUI loses connection to Ollama on Linux. host.docker.internal doesn't resolve by default; the --add-host flag (above) fixes it.
- Q3_K_M tempts you to fit 32B. Q3_K_M is dramatically lower quality than Q4_K_M. The 32B at Q3 will often lose to 14B at Q4. Resist the temptation.
- Long-context VRAM blow-up. KV cache scales linearly with context. An 8K window on Phi-4 14B Q4_K_M costs ~1.5GB of KV cache; a 32K window on a 14B model costs ~6GB of KV on top of ~9.5GB of weights, roughly 15.5GB total. That fits a 16GB card but with no headroom (and note Phi-4 itself tops out at a 16K window).
- Power-supply tripping in compact builds. The RTX 4060 Ti 16GB is rated at 165W and typically draws less than that under inference load, well within most PSUs. But if you're running it in an SFF chassis with a 450W PSU that also powers a CPU + drives, headroom matters. Verify the PSU rating against the whole system.
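The context-length fix referenced above, sketched two ways with the phi4:14b tag from step 1: per request via the API's options field, or baked into a derived model with a Modelfile.
# Per request: ask for an 8K context on a single API call
curl -s http://localhost:11434/api/generate -d '{
  "model": "phi4:14b",
  "prompt": "Summarize this report: ...",
  "options": { "num_ctx": 8192 },
  "stream": false
}'
# Or permanently, as a derived model every frontend sees:
cat > Modelfile <<'EOF'
FROM phi4:14b
PARAMETER num_ctx 8192
EOF
ollama create phi4-8k -f Modelfile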
Variations and alternatives
Apple Silicon M3 / M3 Pro variation. Apple M3 Pro 18GB unified memory is comparable to a 16GB VRAM card for these models. Swap Ollama for MLX-LM; the rest of the stack is the same. Throughput is ~30% lower than RTX 4060 Ti but power draw is dramatically lower.
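A minimal sketch of that swap; the Hugging Face repo name for the 4-bit Phi-4 conversion is an assumption (check mlx-community's listings for the exact name). mlx_lm.server exposes an OpenAI-compatible endpoint that Open WebUI can use in place of OLLAMA_BASE_URL:
# Install the MLX LLM tooling (Apple Silicon only)
pip install mlx-lm
# One-shot generation to confirm the model runs
# (repo name assumed; substitute the actual mlx-community Phi-4 4-bit conversion)
mlx_lm.generate --model mlx-community/phi-4-4bit --prompt "Reply with: ready"
# Serve an OpenAI-compatible API for Open WebUI to connect to
mlx_lm.server --model mlx-community/phi-4-4bit --port 8080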
RX 7800 XT 16GB variation. AMD's 16GB consumer card. Ollama on ROCm 6.2+ works; throughput ~17% lower than RTX 4060 Ti. Pick AMD when you're committed to the ROCm stack or just prefer AMD.
RAG-heavy variation. If your primary workflow is “chat with my documents,” replace Open WebUI with AnythingLLM. Same Ollama backend; AnythingLLM's workspace + ingestion ergonomics are stronger. See /stacks/offline-rag-workstation for the dedicated path.
Coding-heavy variation. Replace Phi-4 14B with Qwen 2.5 Coder 14B for coding-first workflows. Same VRAM math. For autonomous coding agents, see /stacks/local-coding-agent — but note that recipe assumes 24GB VRAM for the 32B model.
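The swap is a single pull; assuming the standard Ollama library naming for the coder variant:
ollama pull qwen2.5-coder:14b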
Who should avoid this stack
- Anyone serving multiple concurrent users. Ollama's sequential request handling at this hardware tier means user 2 waits for user 1. Single-user only at this tier; upgrade to vLLM + RTX 4090+ for multi-user.
- Anyone running 32B+ class models. They don't fit. Either upgrade to 24GB+ VRAM or use API calls; don't torture-fit with Q3 quants.
- Anyone running autonomous coding agents. 14B models are at the lower edge of viable for autonomous tasks. The agent will work but quality drops noticeably vs 32B class. See /stacks/local-coding-agent for what 24GB unlocks.
- Anyone whose workflow depends on long-running sessions with consolidated memory. The /stacks/memory-enabled-agent recipe assumes 24GB + DeepSeek Coder V2. The 16GB tier can run a memory-enabled setup, but quality drops.
Going deeper
- RTX 4060 Ti 16GB catalog entry — VRAM math, thermal characteristics, the budget-tier reasoning.
- Ollama catalog entry — the runtime characteristics and the keep-alive behavior.
- Open WebUI operational review — the L1.5 review covering provider abstraction and multi-model patterns.
- Inference runtime ecosystem map — full landscape, with the next-tier-up alternatives.
- RTX 4090 workstation stack — what 24GB unlocks vs this 16GB tier.