Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective VRAM is lower than the total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
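A sketch of how that state-in-URL sharing might work (the host and query parameter names here are hypothetical, for illustration only, not the tool's actual schema):

```python
from urllib.parse import urlencode

# Hypothetical field names -- not the tool's real query schema.
build = {"gpu": "mi300a", "ram_gb": 128, "os": "linux", "use": "coding"}
share_url = "https://example.com/calc?" + urlencode(build)
print(share_url)
# https://example.com/calc?gpu=mi300a&ram_gb=128&os=linux&use=coding
```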
Single AMD Instinct MI300A (APU) — 128 GB VRAM minus ~1.8 GB runtime overhead = ~126 GB usable for weights + KV cache + activations. That ~1.8 GB reservation covers the typical OS/driver footprint and leaves KV-cache room for an 8K-32K context.
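That arithmetic as a minimal sketch (the ~1.8 GB reservation is this build's figure; treat it as an assumption and adjust for your own OS/driver stack):

```python
def effective_vram_gb(total_gb: float, runtime_overhead_gb: float = 1.8) -> float:
    """Usable VRAM for weights + KV cache + activations.

    A flat runtime/driver reservation is subtracted; VRAM is never
    summed across cards unless the silicon truly pools into one
    address space (Apple unified memory, or a single APU like the MI300A).
    """
    return total_gb - runtime_overhead_gb

print(effective_vram_gb(128.0))  # ~126.2 GB usable on a single MI300A
```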
Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.
Coding agents emit 5-15 tool calls per task. Each call carries the full agent system prompt + context. KV-cache budget for that prompt × concurrent requests is the limit. The decode side is well-served by any modern card; the prefill side bottlenecks first.
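A back-of-envelope sketch of that KV-cache budget. The layer and head counts below are illustrative of a 14B-class dense model with grouped-query attention; check your model's actual config before trusting the number:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, concurrent: int,
                bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: one K and one V tensor per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens * concurrent / 1024**3

# Illustrative GQA config: 48 layers, 8 KV heads, head_dim 128
print(kv_cache_gb(48, 8, 128, ctx_tokens=32_768, concurrent=4))
# -> 24.0 GB of KV cache before a single weight is loaded
```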
Best engine for this topology + skill level + use case.
47 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.
Fits comfortably.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Phind CodeLlama 34B v2 | 34B | Q4_K_M | 73.9 GB | 16,384 | Comfortable fit with 41% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Coder 32B Instruct | 32B | Q8_0 | 79.1 GB | 32,768 | Fits cleanly at Q8_0 + 32,768 ctx with 37% headroom. |
| Devstral Small 2 24B | 24B | Q4_K_M | 98.0 GB | 32,768 | Fits cleanly at Q4_K_M + 32,768 ctx with 22% headroom. |
| Codestral 22B | 22B | Q8_0 | 103.3 GB | 32,768 | Fits cleanly at Q8_0 + 32,768 ctx with 18% headroom. |
| DeepSeek V3 Lite (16B MoE) | 16B | Q4_K_M | 76.9 GB | 32,768 | Fits cleanly at Q4_K_M + 32,768 ctx with 39% headroom. |
| DeepSeek Coder V2 Lite (16B) | 16B | Q4_K_M | 76.9 GB | 32,768 | Fits cleanly at Q4_K_M + 32,768 ctx with 39% headroom. |
| StarCoder 2 15B | 15B | Q4_K_M | 42.9 GB | 16,384 | Comfortable fit with 66% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 14B Instruct | 14B | Q8_0 | 78.4 GB | 32,768 | Fits cleanly at Q8_0 + 32,768 ctx with 38% headroom. |
| Qwen 3 14B | 14B | Q8_0 | 78.4 GB | 32,768 | Fits cleanly at Q8_0 + 32,768 ctx with 38% headroom. |
| Qwen 2.5 Coder 14B Instruct | 14B | Q4_K_M | 71.6 GB | 32,768 | Comfortable fit with 43% headroom — room to extend context or run alongside other workloads. |
| Yi Coder 9B | 9B | Q4_K_M | 58.5 GB | 32,768 | Comfortable fit with 54% headroom — room to extend context or run alongside other workloads. |
| OpenCoder 8B | 8B | Q4_K_M | 55.8 GB | 32,768 | Comfortable fit with 56% headroom — room to extend context or run alongside other workloads. |
| Qwen 3 8B | 8B | Q8_0 | 59.7 GB | 32,768 | Comfortable fit with 53% headroom — room to extend context or run alongside other workloads. |
| Llama 3.1 8B Instruct | 8B | FP16 | 55.9 GB | 32,768 | Comfortable fit with 56% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Llama 8B | 8B | Q4_K_M | 55.8 GB | 32,768 | Comfortable fit with 56% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Coder 7B Instruct | 7B | Q6_K | 54.8 GB | 32,768 | Comfortable fit with 56% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 7B Instruct | 7B | Q8_0 | 44.5 GB | 32,768 | Comfortable fit with 65% headroom — room to extend context or run alongside other workloads. |
| CodeGemma 7B | 7B | Q4_K_M | 18.1 GB | 8,192 | Comfortable fit with 86% headroom — room to extend context or run alongside other workloads. |
| Codestral Mamba 7B | 7B | Q4_K_M | 53.2 GB | 32,768 | Comfortable fit with 58% headroom — room to extend context or run alongside other workloads. |
| CodeQwen 1.5 7B | 7B | Q4_K_M | 53.2 GB | 32,768 | Comfortable fit with 58% headroom — room to extend context or run alongside other workloads. |
| StarCoder 2 7B | 7B | Q4_K_M | 29.8 GB | 16,384 | Comfortable fit with 76% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Coder 3B | 3B | Q4_K_M | 42.7 GB | 32,768 | Comfortable fit with 66% headroom — room to extend context or run alongside other workloads. |
| StarCoder 2 3B | 3B | Q4_K_M | 23.3 GB | 16,384 | Comfortable fit with 82% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Coder 1.5B | 1.5B | Q4_K_M | 38.7 GB | 32,768 | Comfortable fit with 69% headroom — room to extend context or run alongside other workloads. |

Borderline.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Llama 3.3 70B Instruct | 70B | Q8_0 | 123.6 GB | 32,768 | Tight fit at Q8_0 — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3.6 35B-A3B (MTP) | 35B | Q3_K_M | 122.7 GB | 32,768 | Tight fit at Q3_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 2.5 32B Instruct | 32B | Q4_K_M | 119.1 GB | 32,768 | Tight fit at Q4_K_M — only 6% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 32B | 32B | Q5_K_M | 121.9 GB | 32,768 | Tight fit at Q5_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Gemma 4 31B Dense | 31B | Q4_K_M | 116.4 GB | 32,768 | Tight fit at Q4_K_M — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 30B-A3B | 30B | Q4_K_M | 113.8 GB | 32,768 | Tight fit at Q4_K_M — only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3.6 27B (MTP) | 27B | Q8_0 | 118.9 GB | 32,768 | Tight fit at Q8_0 — only 6% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Mistral Small 3 24B | 24B | Q8_0 | 109.5 GB | 32,768 | Tight fit at Q8_0 — only 13% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |

Not practical.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 132.4 GB | 32,768 | ~132.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| Qwen 3 Coder 32B | 32B | AWQ-INT4 | 132.4 GB | 32,768 | ~132.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| DeepSeek Coder V3 | 33B | AWQ-INT4 | 135.4 GB | 32,768 | ~135.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 7%. Drop quant or move to a larger build. |
| Llama 4 70B | 70B | AWQ-INT4 | 248.3 GB | 32,768 | ~248.3 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 97%. Drop quant or move to a larger build. |
| Llama 3.1 70B Instruct | 70B | Q4_K_M | 219.1 GB | 32,768 | ~219.1 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 74%. Drop quant or move to a larger build. |
| Qwen 3 72B | 72B | AWQ-INT4 | 254.4 GB | 32,768 | ~254.4 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 102%. Drop quant or move to a larger build. |
| Llama 4 Scout | 109B | Q4_K_M | 321.9 GB | 32,768 | ~321.9 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 155%. Drop quant or move to a larger build. |
| Qwen 3 235B-A22B | 235B | Q4_K_M | 653.7 GB | 32,768 | ~653.7 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 419%. Drop quant or move to a larger build. |
| DeepSeek Coder V2 236B | 236B | Q4_K_M | 656.4 GB | 32,768 | ~656.4 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 421%. Drop quant or move to a larger build. |
| DeepSeek V2.5 236B | 236B | Q4_K_M | 656.4 GB | 32,768 | ~656.4 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 421%. Drop quant or move to a larger build. |
| DeepSeek V4 Flash (284B MoE) | 284B | Q4_K_M | 782.8 GB | 32,768 | ~782.8 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 521%. Drop quant or move to a larger build. |
| DeepSeek V4 | 745B | AWQ-INT4 | 2307.0 GB | 32,768 | ~2307.0 GB needed at AWQ-INT4 + 32,768 ctx — overshoots effective VRAM by 1731%. Drop quant or move to a larger build. |
| Kimi K2.6 | 1000B | Q4_K_M | 2668.7 GB | 32,768 | ~2668.7 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 2018%. Drop quant or move to a larger build. |
| Ring-2.6-1T | 1000B | Q3_K_M | 2546.6 GB | 32,768 | ~2546.6 GB needed at Q3_K_M + 32,768 ctx — overshoots effective VRAM by 1921%. Drop quant or move to a larger build. |
| DeepSeek V4 Pro (1.6T MoE) | 1600B | Q4_K_M | 4249.1 GB | 32,768 | ~4249.1 GB needed at Q4_K_M + 32,768 ctx — overshoots effective VRAM by 3272%. Drop quant or move to a larger build. |
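
Every note in the three tables above falls out of one headroom ratio. A sketch of the bucketing logic (the thresholds are assumptions inferred from the notes, not the tool's exact cutoffs):

```python
def classify_fit(vram_needed_gb: float, effective_gb: float = 126.0) -> str:
    """Bucket a model + quant + context estimate by headroom."""
    headroom = 1 - vram_needed_gb / effective_gb
    if headroom >= 0.40:   # assumed cutoff
        return f"comfortable fit ({headroom:.0%} headroom)"
    if headroom >= 0.15:   # assumed cutoff
        return f"fits cleanly ({headroom:.0%} headroom)"
    if headroom >= 0:
        return f"tight fit ({headroom:.0%} headroom)"
    return f"doesn't fit (overshoots by {vram_needed_gb / effective_gb - 1:.0%})"

print(classify_fit(123.6))  # Llama 3.3 70B Q8_0 -> tight fit (2% headroom)
print(classify_fit(132.4))  # -> doesn't fit (overshoots by 5%)
```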
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
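One concrete piece of the mixed-card honesty, as a sketch: tensor parallelism shards weights evenly across ranks, so the smallest card caps every rank (the per-card overhead figure is an assumption):

```python
def tp_effective_vram_gb(card_vram_gb: list[float],
                         per_card_overhead_gb: float = 1.0) -> float:
    """Effective pool under tensor parallelism: every rank holds an
    equal shard, so the smallest card sets the shard size."""
    usable_per_card = min(card_vram_gb) - per_card_overhead_gb
    return usable_per_card * len(card_vram_gb)

# A 24 GB card + a 12 GB card behaves like 2 x 11 GB, not 36 GB.
print(tp_effective_vram_gb([24.0, 12.0]))  # 22.0
```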
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.