Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective VRAM is lower than the total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
4× NVIDIA GeForce RTX 3090 = 96 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs ~112 GB/s over NVLink). With tensor parallelism, each card holds ~1/4 of the model weights and replicates activations + KV cache. After ~15% TP overhead plus per-card framework reserves, effective model capacity is ~76 GB — not 96. And the largest single tensor must still fit on one card: ~22 GB here.
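Spelled out, the arithmetic looks like the minimal sketch below. The 15% replication overhead and the ~1.4 GB per-card reserve are assumptions chosen to reproduce the ~76 GB figure, not the calculator's exact internals.

```python
# Minimal sketch of the effective-VRAM math above. Overhead constants are
# assumptions that reproduce the ~76 GB figure, not the tool's internals.

def effective_vram_gb(cards_gb: list[float], tp_overhead: float = 0.15,
                      per_card_reserve_gb: float = 1.4) -> float:
    """Tensor-parallel capacity: total VRAM minus activation/KV replication
    overhead and a fixed per-card framework reserve."""
    total = sum(cards_gb)
    return total * (1 - tp_overhead) - per_card_reserve_gb * len(cards_gb)

cards = [24.0] * 4  # 4x RTX 3090 = 96 GB total
print(f"{effective_vram_gb(cards):.1f} GB effective")    # ~76.0 GB
print(f"{min(cards) - 2.0:.1f} GB largest tensor/card")  # ~22 GB after a ~2 GB reserve
```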
Best engine for this topology + skill level + use case.
vLLM — tensor-parallel across NVLink/PCIe; works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly (launch sketch below).
ExLlamaV2 — the single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.
llama.cpp — layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM, but easier to debug.
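Hedged launch sketches for the vLLM and llama.cpp paths via their Python bindings — the model ids and paths are placeholders, and any flag not shown is left at its default:

```python
# Shown together for brevity; run one path at a time on a single box.

# vLLM: tensor-parallel across all four cards with AWQ-INT4 weights.
from vllm import LLM

llm = LLM(
    model="your-org/your-70b-awq",   # placeholder HF repo id
    tensor_parallel_size=4,          # one weight shard per RTX 3090
    quantization="awq",
    max_model_len=8192,
)

# llama.cpp (llama-cpp-python): layer split across cards; tensor_split here
# is the Python equivalent of the --tensor-split CLI flag.
from llama_cpp import Llama

lcpp = Llama(
    model_path="models/your-70b.Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=-1,                           # offload every layer to GPU
    tensor_split=[0.25, 0.25, 0.25, 0.25],     # even split across 4 cards
    n_ctx=8192,
)
```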
183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
Fits comfortably

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Llama 3.3 70B Instruct | 70B | Q5_K_M | 63.2 GB | 8,192 | Fits cleanly at Q5_K_M + 8,192 ctx with 17% headroom. |
| Aya 23 35B | 35B | Q4_K_M | 49.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom. |
| Command R 35B | 35B | Q4_K_M | 49.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom. |
| Phind CodeLlama 34B v2 | 34B | Q4_K_M | 48.5 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| Yi 1.5 34B | 34B | Q4_K_M | 48.5 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| DeepSeek Coder V3 | 33B | AWQ-INT4 | 61.1 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 20% headroom. |
| Qwen 2.5 32B Instruct | 32B | Q8_0 | 61.7 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 19% headroom. |
| DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| QwQ 32B Preview | 32B | Q4_K_M | 46.3 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| EXAONE 3.5 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| OLMo 2 32B | 32B | Q4_K_M | 46.3 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| Aya Expanse 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| Qwen 2.5 Coder 32B Instruct | 32B | Q8_0 | 47.8 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 37% headroom. |
| Qwen 3 32B | 32B | Q8_0 | 61.7 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 19% headroom. |
| Magistral 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| DeepSeek R1 Distill Qwen 32B | 32B | Q8_0 | 61.7 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 19% headroom. |
| Qwen 3 Coder 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| Gemma 4 31B Dense | 31B | Q8_0 | 60.1 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 21% headroom. |
| Qwen 3 30B-A3B | 30B | Q8_0 | 58.5 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 23% headroom. |
| Nemotron 3 Nano (30B-A3B) | 30B | Q8_0 | 58.5 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 23% headroom. |
| Gemma 3 27B | 27B | Q8_0 | 53.6 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 29% headroom. |
| MedGemma 27B | 27B | Q4_K_M | 40.6 GB | 8,192 | Comfortable fit with 47% headroom — room to extend context or run alongside other workloads. |
| Gemma 4 26B MoE | 26B | Q4_K_M | 39.5 GB | 8,192 | Comfortable fit with 48% headroom — room to extend context or run alongside other workloads. |
| InternVL 2.5 26B | 26B | Q4_K_M | 39.5 GB | 8,192 | Comfortable fit with 48% headroom — room to extend context or run alongside other workloads. |
Borderline

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Qwen 2.5 Math 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Tight fit at Q4_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Molmo 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Tight fit at Q4_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Jamba 1.5 Mini | 52B | Q4_K_M | 69.0 GB | 8,192 | Tight fit at Q4_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Mixtral 8x7B Instruct | 47B | Q5_K_M | 67.4 GB | 8,192 | Tight fit at Q5_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
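The "cap context tighter" advice falls out of how the KV cache scales: it grows linearly with context length. A minimal sketch, assuming a Llama-70B-style architecture (80 layers, 8 KV heads under GQA, head dim 128) and an fp16 cache — swap in your model's real numbers:

```python
# Per-sequence KV cache. Architecture constants below are assumptions for a
# Llama-70B-style model; adjust per model.

def kv_cache_gb(ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """2 (K and V) x layers x kv_heads x head_dim x ctx x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

print(f"{kv_cache_gb(4096):.2f} GB")    # ~1.25 GB at 4,096 ctx
print(f"{kv_cache_gb(8192):.2f} GB")    # ~2.50 GB -- doubling ctx doubles it
print(f"{kv_cache_gb(32768):.2f} GB")   # ~10 GB -- eats a 9% headroom budget
```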
Not practical

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Nemotron 3 Super 49B | 49B | AWQ-INT4 | 85.9 GB | 8,192 | ~85.9 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Hermes 4 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Llama 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| EVA Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| Llama 4 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| Llama 3.1 Nemotron 70B Instruct | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Llama 3.1 70B Instruct | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| OpenBioLLM Llama 3 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Tulu 3 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Qwen 3 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | ~121.6 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 60%. Drop quant or move to a larger build. |
| Qwen 2.5 72B Instruct | 72B | Q4_K_M | 91.6 GB | 8,192 | ~91.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 21%. Drop quant or move to a larger build. |
| Qwen 2.5-VL 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | ~121.6 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 60%. Drop quant or move to a larger build. |
| InternVL 2.5 78B | 78B | Q4_K_M | 98.4 GB | 8,192 | ~98.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 30%. Drop quant or move to a larger build. |
| Llama 3.2 90B Vision Instruct | 90B | Q4_K_M | 112.0 GB | 8,192 | ~112.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 47%. Drop quant or move to a larger build. |
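Putting the notes above on one footing: headroom is measured against effective VRAM, not total. The category cutoffs in the sketch below are assumptions inferred from the tables, not the tool's published thresholds.

```python
# Headroom math behind the fit notes. Thresholds (<0% no fit, <15% tight,
# <45% clean, else comfortable) are assumptions inferred from the tables.

EFFECTIVE_VRAM_GB = 76.0   # from the 4x RTX 3090 example above

def classify(vram_est_gb: float) -> str:
    headroom = (EFFECTIVE_VRAM_GB - vram_est_gb) / EFFECTIVE_VRAM_GB
    if headroom < 0:
        overshoot = vram_est_gb / EFFECTIVE_VRAM_GB - 1
        return f"not practical: overshoots effective VRAM by {overshoot:.0%}"
    if headroom < 0.15:
        return f"tight fit: only {headroom:.0%} headroom"
    if headroom < 0.45:
        return f"fits cleanly with {headroom:.0%} headroom"
    return f"comfortable fit with {headroom:.0%} headroom"

print(classify(63.2))   # Llama 3.3 70B @ Q5_K_M -> fits cleanly, 17% headroom
print(classify(69.5))   # Qwen 2.5 Math 72B @ Q4_K_M -> tight, 9% headroom
print(classify(85.9))   # Nemotron 3 Super 49B -> overshoots by 13%
```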
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.