Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools it (Apple unified memory). We always explain why effective VRAM is lower than the total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
Mixed-GPU (asymmetric) configuration. Tensor-parallel doesn't work cleanly here: TP assumes identical cards, and every layer synchronizes across GPUs, so your faster card stalls waiting on the slower one. Use llama.cpp's layer-split with manual --tensor-split tuning to distribute layers in proportion to each card's VRAM. Effective capacity is ~33 GB after layer-split overhead, but the slowest card (22 GB effective) bottlenecks any operation that can't span cards.
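As a sketch of that arithmetic, assuming a 24 GB + 12 GB pair and a flat ~8% per-card derate (both numbers are illustrative; the real estimator is more detailed):

```sh
# Illustrative effective-VRAM math: derate each card for runtime overhead
# (driver context, buffers), then sum. The 8% flat derate is an assumption.
awk 'BEGIN {
  big = 24 * 0.92; small = 12 * 0.92
  printf "pool: %.0f GB, bottleneck card: %.0f GB\n", big + small, big
}'
```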
Best engine for this topology + skill level + use case.
Mixed-GPU configurations need llama.cpp's --tensor-split flag, with ratios tuned manually to each card's VRAM. vLLM's tensor-parallel requires identical cards and won't run cleanly here.
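A hypothetical launch (the GGUF filename is a placeholder; the 24,12 ratio assumes a 24 GB + 12 GB pair):

```sh
# Split layers across mismatched cards in proportion to their VRAM.
# Start at the raw VRAM ratio, then nudge until neither card OOMs.
./llama-server -m qwen2.5-coder-14b-instruct-q4_k_m.gguf \
  --n-gpu-layers 99 --tensor-split 24,12 --ctx-size 8192
```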
Ollama inherits llama.cpp's layer-split path with a friendlier UX. OLLAMA_GPU_OVERHEAD and per-GPU environment variables do most of what the manual flags do.
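A hypothetical equivalent in Ollama (OLLAMA_GPU_OVERHEAD reserves VRAM per GPU, in bytes; the 2 GiB margin is an assumed starting point, not a recommendation):

```sh
# Reserve per-GPU headroom so the scheduler stops short of OOM, and pin
# device order so the faster card is enumerated first (assumed order).
export OLLAMA_GPU_OVERHEAD=$((2 * 1024 * 1024 * 1024))
export CUDA_VISIBLE_DEVICES=0,1
ollama serve
```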
183 models considered, categorized by headroom at the recommended quant plus a sensible context for your use case. First, the comfortable fits (an example of spending that headroom follows the table):
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| DeepSeek MoE 16B Base | 16B | Q4_K_M | 20 GB | 4,096 | Fits cleanly at Q4_K_M + 4,096 ctx with 39% headroom. |
| StarCoder 2 15B | 15B | Q4_K_M | 27 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 18% headroom. |
| DeepSeek R1 Distill Qwen 14B | 14B | Q4_K_M | 25.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| Phi-4 Multimodal | 14B | Q4_K_M | 25.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| Qwen 2.5 Coder 14B Instruct | 14B | Q4_K_M | 25.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| Phi-4 Reasoning 14B | 14B | Q4_K_M | 25.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| GLM-4V 9B | 14B | Q4_K_M | 25.8 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| OLMo 2 13B | 13B | Q4_K_M | 17.4 GB | 4,096 | Comfortable fit with 47% headroom — room to extend context or run alongside other workloads. |
| Baichuan 4 13B | 13B | Q4_K_M | 24.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 25% headroom. |
| Stable LM 2 12B | 12B | Q4_K_M | 16.5 GB | 4,096 | Comfortable fit with 50% headroom — room to extend context or run alongside other workloads. |
| Llama 3.2 11B Vision Instruct | 11B | Q8_0 | 27.8 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 16% headroom. |
| Llama 3.2 11B Vision | 11B | Q4_K_M | 22.5 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 32% headroom. |
| Falcon 3 10B | 10B | Q4_K_M | 21.3 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom. |
| Gemma 2 9B Instruct | 9B | Q8_0 | 24.5 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 26% headroom. |
| Yi Coder 9B | 9B | Q4_K_M | 20.2 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| Nemotron 3 Nano 9B | 9B | Q4_K_M | 20.2 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| GLM-4 9B | 9B | Q4_K_M | 20.2 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| Tulu 3 8B | 8B | Q4_K_M | 19.1 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Molmo 7B-D | 8B | Q4_K_M | 13 GB | 4,096 | Comfortable fit with 61% headroom — room to extend context or run alongside other workloads. |
| Llama 3.1 Nemotron Nano 8B | 8B | Q4_K_M | 19.1 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Granite 3.0 8B Instruct | 8B | Q4_K_M | 13 GB | 4,096 | Comfortable fit with 61% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Llama 8B | 8B | Q4_K_M | 19.1 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| MiniCPM-V 2.6 8B | 8B | Q4_K_M | 19.1 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| OpenCoder 8B | 8B | Q4_K_M | 19.1 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
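For these comfortable rows the headroom is spendable. One hypothetical use: double the context on a 42%-headroom model, since KV cache grows roughly linearly with context length (filename and target context are placeholders; step up gradually and watch per-card VRAM):

```sh
# Same layer-split launch as above, with headroom spent on longer context.
./llama-server -m tulu-3-8b-q4_k_m.gguf \
  --n-gpu-layers 99 --tensor-split 24,12 --ctx-size 16384
```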
Borderline: these run, but only with a capped context or a lower quant (both fixes are sketched after the table).

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 32B Instruct | 32B | Q4_K_M | 32.4 GB | 8,192 | Tight fit at Q4_K_M — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DeepSeek V3 Lite (16B MoE) | 16B | Q4_K_M | 28.1 GB | 8,192 | Tight fit at Q4_K_M — only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DeepSeek Coder V2 Lite (16B) | 16B | Q4_K_M | 28.1 GB | 8,192 | Tight fit at Q4_K_M — only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Granite 3 MoE (3B active) | 16B | Q4_K_M | 28.1 GB | 8,192 | Tight fit at Q4_K_M — only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 2.5 14B Instruct | 14B | Q8_0 | 32.6 GB | 8,192 | Tight fit at Q8_0 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Phi-4 14B | 14B | Q8_0 | 32.6 GB | 8,192 | Tight fit at Q8_0 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 14B | 14B | Q8_0 | 32.6 GB | 8,192 | Tight fit at Q8_0 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Pixtral 12B | 12B | Q8_0 | 29.4 GB | 8,192 | Tight fit at Q8_0 — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Gemma 3 12B | 12B | Q8_0 | 29.4 GB | 8,192 | Tight fit at Q8_0 — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Mistral Nemo 12B Instruct | 12B | Q8_0 | 29.4 GB | 8,192 | Tight fit at Q8_0 — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 Embedding 8B | 8B | FP16 | 30.8 GB | 8,192 | Tight fit at FP16 — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| NV-Embed v2 | 8B | FP16 | 30.4 GB | 8,192 | Tight fit at FP16 — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
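The two escape hatches those notes keep repeating, sketched with placeholder filenames:

```sh
# Option 1: keep the quant, cap context so the KV cache fits the headroom.
./llama-server -m phi-4-14b-q8_0.gguf \
  --n-gpu-layers 99 --tensor-split 24,12 --ctx-size 4096

# Option 2: drop one quant level (Q8_0 -> Q6_K) and keep the 8,192 context.
./llama-server -m phi-4-14b-q6_k.gguf \
  --n-gpu-layers 99 --tensor-split 24,12 --ctx-size 8192
```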
Not practical on this build at the listed quant and context (the overshoot math is spot-checked after the table).

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| PaliGemma 2 10B | 10B | BF16 | 36 GB | 8,192 | ~36.0 GB needed at BF16 + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| Codestral 22B | 22B | Q4_K_M | 34.9 GB | 8,192 | ~34.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 6%. Drop quant or move to a larger build. |
| Mistral Small 3 24B | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Mistral 24B | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Dolphin 3.0 Mistral 24B | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Mistral Medium 3 24B (dense) | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Mistral Small 3.2 24B | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Devstral Small 2 24B | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Mistral Saba 24B | 24B | Q4_K_M | 37.2 GB | 8,192 | ~37.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Gemma 4 26B MoE | 26B | Q4_K_M | 39.5 GB | 8,192 | ~39.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| InternVL 2.5 26B | 26B | Q4_K_M | 39.5 GB | 8,192 | ~39.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Gemma 3 27B | 27B | Q4_K_M | 40.6 GB | 8,192 | ~40.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 23%. Drop quant or move to a larger build. |
| MedGemma 27B | 27B | Q4_K_M | 40.6 GB | 8,192 | ~40.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 23%. Drop quant or move to a larger build. |
| Qwen 3 30B-A3B | 30B | Q4_K_M | 44 GB | 8,192 | ~44.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 33%. Drop quant or move to a larger build. |
| Nemotron 3 Nano (30B-A3B) | 30B | Q4_K_M | 44 GB | 8,192 | ~44.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 33%. Drop quant or move to a larger build. |
| Gemma 4 31B Dense | 31B | Q4_K_M | 45.1 GB | 8,192 | ~45.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 37%. Drop quant or move to a larger build. |
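Every headroom and overshoot percentage in the three tables reduces to one formula against the ~33 GB effective pool; a quick spot-check (rounding may differ slightly from the page):

```sh
# headroom% = (effective - needed) / effective; negative means overshoot.
awk -v eff=33 'BEGIN {
  printf "14B at Q4_K_M (25.9 GB): %+.0f%%\n", (eff - 25.9) / eff * 100
  printf "Gemma 3 27B  (40.6 GB): %+.0f%%\n", (eff - 40.6) / eff * 100
}'
# -> +22% headroom and -23% (overshoot), matching the table notes.
```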
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.