Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
Single NVIDIA GeForce RTX 5060 Ti 16GB — 16 GB VRAM minus ~1.5 GB runtime overhead ≈ 14.5 GB usable for weights + KV cache + activations. The ~9% headroom we reserve covers the typical OS/driver footprint and leaves the KV cache room for an 8K-32K context.
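A minimal sketch of that headroom arithmetic. The flat 1.5 GB reserve below is an illustrative assumption for this card, not the calculator's exact reserve model:

```python
# Back-of-the-envelope effective-VRAM math for a single 16 GB card.
# The 1.5 GB reserve is an assumed figure, not the calculator's internal model.
TOTAL_VRAM_GB = 16.0
RUNTIME_OVERHEAD_GB = 1.5  # CUDA context, allocator slack, display/driver footprint (assumed)

usable_gb = TOTAL_VRAM_GB - RUNTIME_OVERHEAD_GB      # ~14.5 GB for weights + KV cache + activations
reserve_frac = RUNTIME_OVERHEAD_GB / TOTAL_VRAM_GB   # ~9% of the card held back

print(f"usable ≈ {usable_gb:.1f} GB (reserve ≈ {reserve_frac:.0%})")
```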
Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.
Coding agents emit 5-15 tool calls per task. Each call carries the full agent system prompt + context. KV-cache budget for that prompt × concurrent requests is the limit. The decode side is well-served by any modern card; the prefill side bottlenecks first.
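To see why the agent prompt dominates that budget, here is a minimal KV-cache sketch. The layer/head shapes are illustrative (roughly a 7B-class model with grouped-query attention and an FP16 cache), not numbers pulled from this tool:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_tokens, concurrent, bytes_per_elem=2):
    """Approximate KV-cache footprint: K and V, per layer, per token, per in-flight request."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * ctx_tokens * concurrent / 1024**3

# Assumed 7B-class GQA shapes: 28 layers, 4 KV heads, head_dim 128, FP16 cache.
# A 16K agent prompt/context with 4 concurrent tool calls in flight:
print(f"{kv_cache_gb(28, 4, 128, 16_384, 4):.1f} GB of KV cache")  # ≈ 3.5 GB
```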
Best engine for this topology + skill level + use case.
AWQ-INT4 path fits 32B-class models on a 24 GB card with concurrent users. The production default for self-hosted coding agents and multi-user serving.
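A concrete sketch of what that path often looks like: vLLM's offline Python API loading an AWQ checkpoint with a capped context. The engine isn't named above, so treat the engine choice, the model ID, and the settings here as illustrative assumptions rather than this tool's recommendation:

```python
# Hedged sketch: serving an AWQ-INT4 coding model with vLLM's Python API.
# Model ID and settings are examples, not this calculator's output.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # example 32B-class AWQ checkpoint
    quantization="awq",            # load the 4-bit AWQ weights
    max_model_len=16_384,          # cap context so the KV cache fits alongside the weights
    gpu_memory_utilization=0.90,   # leave ~10% of VRAM for runtime overhead
)

outputs = llm.generate(
    ["Write a Python function that deduplicates a list while preserving order."],
    SamplingParams(temperature=0.2, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```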
Single-stream throughput king on consumer NVIDIA. EXL2 4.65bpw on a 4090 hits the highest tok/s in this class.
44 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.
No model fits comfortably on this build.
No borderline models — clean fit ladder.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 1.5B | 2B | Q4_K_M | 38.5 GB | 32,768 | Overshoots effective VRAM by 157%; drop quant or move to a larger build. |
| Qwen 2.5 Coder 3B | 3B | Q4_K_M | 42.5 GB | 32,768 | Overshoots effective VRAM by 183%; drop quant or move to a larger build. |
| StarCoder 2 3B | 3B | Q4_K_M | 23.1 GB | 16,384 | Overshoots effective VRAM by 54%; drop quant or move to a larger build. |
| CodeGemma 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | Overshoots effective VRAM by 20%; drop quant or move to a larger build. |
| Qwen 2.5 Coder 7B Instruct | 7B | Q4_K_M | 53.0 GB | 32,768 | Overshoots effective VRAM by 253%; drop quant or move to a larger build. |
| Qwen 2.5 7B Instruct | 7B | Q4_K_M | 40.9 GB | 32,768 | Overshoots effective VRAM by 173%; drop quant or move to a larger build. |
| Codestral Mamba 7B | 7B | Q4_K_M | 53.0 GB | 32,768 | Overshoots effective VRAM by 253%; drop quant or move to a larger build. |
| StarCoder 2 7B | 7B | Q4_K_M | 29.6 GB | 16,384 | Overshoots effective VRAM by 97%; drop quant or move to a larger build. |
| CodeQwen 1.5 7B | 7B | Q4_K_M | 53.0 GB | 32,768 | Overshoots effective VRAM by 253%; drop quant or move to a larger build. |
| DeepSeek R1 Distill Llama 8B | 8B | Q4_K_M | 55.6 GB | 32,768 | Overshoots effective VRAM by 271%; drop quant or move to a larger build. |
| OpenCoder 8B | 8B | Q4_K_M | 55.6 GB | 32,768 | Overshoots effective VRAM by 271%; drop quant or move to a larger build. |
| Qwen 3 8B | 8B | Q4_K_M | 55.6 GB | 32,768 | Overshoots effective VRAM by 271%; drop quant or move to a larger build. |
| Llama 3.1 8B Instruct | 8B | Q4_K_M | 43.9 GB | 32,768 | Overshoots effective VRAM by 193%; drop quant or move to a larger build. |
| Yi Coder 9B | 9B | Q4_K_M | 58.3 GB | 32,768 | Overshoots effective VRAM by 288%; drop quant or move to a larger build. |
| Qwen 2.5 14B Instruct | 14B | Q4_K_M | 71.4 GB | 32,768 | Overshoots effective VRAM by 376%; drop quant or move to a larger build. |
| Qwen 2.5 Coder 14B Instruct | 14B | Q4_K_M | 71.4 GB | 32,768 | Overshoots effective VRAM by 376%; drop quant or move to a larger build. |
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.