Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools it (Apple unified memory). We always explain why effective VRAM is lower than the raw total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
8× NVIDIA GeForce RTX 4090 = 192 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs ~112 GB/s over NVLink). With tensor parallelism, each card holds ~1/8 of the model weights but replicates activations + KV cache. After the per-card runtime reserve and tensor-parallel overhead (~20% combined), effective model capacity is ~153 GB, and the largest single tensor one card can hold is ~22 GB.
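As a back-of-envelope check, here is that math as code. A minimal sketch, assuming a ~2 GB per-card runtime reserve and a ~13% tensor-parallel replication cost; both factors are assumptions chosen to reproduce the example above, not the calculator's exact internals.

```python
def effective_vram(per_card_gb: float, n_cards: int,
                   tp_overhead: float = 0.13,   # assumed TP replication cost
                   reserve_gb: float = 2.0):    # assumed per-card runtime reserve
    """Return (effective capacity GB, largest single-tensor GB)."""
    usable_per_card = per_card_gb - reserve_gb  # 24 - 2 = 22 GB per card
    pooled = usable_per_card * n_cards          # 176 GB usable before TP cost
    effective = pooled * (1 - tp_overhead)      # ~153 GB effective capacity
    return effective, usable_per_card

cap, max_tensor = effective_vram(24, 8)
print(f"effective ~{cap:.0f} GB, largest tensor ~{max_tensor:.0f} GB")
# effective ~153 GB, largest tensor ~22 GB
```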
Best engine for this topology + skill level + use case.
vLLM: tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.
ExLlamaV2: the single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.
llama.cpp: layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM, but easier to debug.
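The recommendation reduces to a small heuristic. A sketch under stated assumptions: the branch order and criteria below are our reading of the guidance above, not the picker's actual rules.

```python
def recommend_engine(n_gpus: int, concurrent_users: int,
                     prefers_simple: bool = False) -> str:
    """Illustrative engine pick mirroring the guidance above.
    Branch order is an assumption, not the calculator's real logic."""
    if prefers_simple:
        return "llama.cpp"   # layer-split via --tensor-split, easiest to debug
    if concurrent_users > 1 and n_gpus > 1:
        return "vLLM"        # tensor-parallel, multi-user serving
    return "ExLlamaV2"       # single-stream king for solo use

print(recommend_engine(n_gpus=8, concurrent_users=4))  # -> vLLM
```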
183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case: fits comfortably, borderline, or not practical. A sketch of the headroom math follows the first table.
Fits comfortably
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Command R+ 104B | 104B | Q4_K_M | 127.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 16% headroom. |
| Llama 3.2 90B Vision Instruct | 90B | Q4_K_M | 112 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 27% headroom. |
| InternVL 2.5 78B | 78B | Q4_K_M | 98.4 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| Qwen 2.5 Math 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Qwen 3 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 21% headroom. |
| Qwen 2.5 72B Instruct | 72B | Q5_K_M | 98 GB | 8,192 | Fits cleanly at Q5_K_M + 8,192 ctx with 36% headroom. |
| Molmo 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5-VL 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 21% headroom. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Hermes 4 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom. |
| Llama 3.3 70B Instruct | 70B | Q8_0 | 90.8 GB | 8,192 | Comfortable fit with 41% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Llama 70B | 70B | Q5_K_M | 95.5 GB | 8,192 | Fits cleanly at Q5_K_M + 8,192 ctx with 38% headroom. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom. |
| EVA Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom. |
| Llama 4 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom. |
| Llama 3.1 Nemotron 70B Instruct | 70B | Q4_K_M | 89.4 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Llama 3.1 70B Instruct | 70B | Q5_K_M | 95.5 GB | 8,192 | Fits cleanly at Q5_K_M + 8,192 ctx with 38% headroom. |
| OpenBioLLM Llama 3 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Tulu 3 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Jamba 1.5 Mini | 52B | Q4_K_M | 69 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Nemotron 3 Super 49B | 49B | AWQ-INT4 | 85.9 GB | 8,192 | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| Mixtral 8x7B Instruct | 47B | Q5_K_M | 67.4 GB | 8,192 | Comfortable fit with 56% headroom — room to extend context or run alongside other workloads. |
| Aya 23 35B | 35B | Q4_K_M | 49.7 GB | 8,192 | Comfortable fit with 68% headroom — room to extend context or run alongside other workloads. |
| Command R 35B | 35B | Q4_K_M | 49.7 GB | 8,192 | Comfortable fit with 68% headroom — room to extend context or run alongside other workloads. |
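Here is the headroom math behind these notes as a runnable sketch. The ~153 GB effective capacity comes from the 8×4090 example above; the category cutoffs are illustrative guesses that happen to reproduce the tables, not the calculator's published thresholds.

```python
EFFECTIVE_GB = 153.0  # from the 8x4090 example above

def classify(vram_est_gb: float, effective_gb: float = EFFECTIVE_GB) -> str:
    """Bucket a model by headroom against *effective* VRAM.
    Threshold values (10% / 40%) are inferred from the tables."""
    headroom = 1 - vram_est_gb / effective_gb
    if headroom < 0:
        overshoot = vram_est_gb / effective_gb - 1
        return f"not practical, overshoots by {overshoot:.0%}"
    if headroom < 0.10:
        return f"tight fit, only {headroom:.0%} headroom"
    if headroom < 0.40:
        return f"fits cleanly with {headroom:.0%} headroom"
    return f"comfortable fit with {headroom:.0%} headroom"

print(classify(127.9))  # Command R+ 104B   -> fits cleanly with 16% headroom
print(classify(149.5))  # Mistral Large 2   -> tight fit, only 2% headroom
print(classify(320.0))  # Kimi K1.5         -> not practical, overshoots by 109%
```

Note that headroom is measured against effective VRAM, not the 192 GB total, which is why a 127.9 GB estimate reads as 16% headroom rather than 33%.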
Borderline: tight fits. A KV-cache sizing sketch follows this table.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Mistral Large 2 (123B) | 123B | Q4_K_M | 149.5 GB | 8,192 | Tight fit at Q4_K_M — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Nemotron 3 Super (120B-A12B) | 120B | Q4_K_M | 146.1 GB | 8,192 | Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Llama 4 Scout | 109B | Q5_K_M | 143.2 GB | 8,192 | Tight fit at Q5_K_M — only 6% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Llama 3.2 90B Vision | 90B | AWQ-INT4 | 149.5 GB | 8,192 | Tight fit at AWQ-INT4 — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
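Why the tight fits warn about context: the KV cache grows linearly with context length, and 2-6% headroom on this build is only ~3-9 GB. The formula below is the standard transformer KV-cache calculation; the shape constants are Llama-3-70B's published values (80 layers, 8 KV heads under GQA, head dim 128), but treat it as back-of-envelope.

```python
def kv_cache_gb(ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """fp16 KV cache size in GB for one sequence (Llama-3-70B shapes)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return ctx * per_token / 1e9

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7,} ctx -> {kv_cache_gb(ctx):5.1f} GB KV cache")
#   8,192 ctx ->   2.7 GB
#  32,768 ctx ->  10.7 GB
# 131,072 ctx ->  42.9 GB
```

Quadrupling context from 8K to 32K adds ~8 GB of KV per sequence, which wipes out the headroom of nearly every row above.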
Not practical on this build. A weights-size sketch for the quant-drop advice follows this table.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Command R+ (Aug 2024) | 104B | AWQ-INT4 | 171.2 GB | 8,192 | ~171.2 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 12%. Drop quant or move to a larger build. |
| DBRX Base | 132B | Q4_K_M | 159.7 GB | 8,192 | ~159.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| DBRX Instruct | 132B | AWQ-INT4 | 214.6 GB | 8,192 | ~214.6 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 40%. Drop quant or move to a larger build. |
| Mixtral 8x22B Instruct | 141B | Q4_K_M | 169.9 GB | 8,192 | ~169.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 11%. Drop quant or move to a larger build. |
| WizardLM-2 8x22B | 141B | Q4_K_M | 169.9 GB | 8,192 | ~169.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 11%. Drop quant or move to a larger build. |
| GLM-5 Pro | 144B | AWQ-INT4 | 233.2 GB | 8,192 | ~233.2 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 52%. Drop quant or move to a larger build. |
| GLM-5 | 200B | Q4_K_M | 236.8 GB | 8,192 | ~236.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 55%. Drop quant or move to a larger build. |
| Kimi K1.5 | 200B | AWQ-INT4 | 320 GB | 8,192 | ~320.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 109%. Drop quant or move to a larger build. |
| Qwen 3 235B-A22B | 235B | Q4_K_M | 276.5 GB | 8,192 | ~276.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 81%. Drop quant or move to a larger build. |
| DeepSeek Coder V2 236B | 236B | Q4_K_M | 277.6 GB | 8,192 | ~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 81%. Drop quant or move to a larger build. |
| DeepSeek V2.5 236B | 236B | Q4_K_M | 277.6 GB | 8,192 | ~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 81%. Drop quant or move to a larger build. |
| Llama 3.1 Nemotron Ultra 253B | 253B | Q4_K_M | 296.9 GB | 8,192 | ~296.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 94%. Drop quant or move to a larger build. |
| DeepSeek V4 Flash (284B MoE) | 284B | Q4_K_M | 332 GB | 8,192 | ~332.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 117%. Drop quant or move to a larger build. |
| Hunyuan Large 389B MoE | 389B | Q4_K_M | 451.1 GB | 8,192 | ~451.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 195%. Drop quant or move to a larger build. |
| Qwen 3.5 235B-A17B (MoE) | 397B | Q4_K_M | 460.2 GB | 8,192 | ~460.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 201%. Drop quant or move to a larger build. |
| Jamba 1.5 Large | 398B | Q4_K_M | 461.3 GB | 8,192 | ~461.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 202%. Drop quant or move to a larger build. |
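On the "drop quant" advice: here is what one quant level is worth in weight footprint alone. The bits-per-weight figures are approximate community numbers for GGUF k-quants, and this deliberately ignores the KV cache and the tensor-parallel replication overhead the estimates above include.

```python
# Approximate bits-per-weight for common GGUF quant levels (community figures).
BPW = {"Q8_0": 8.5, "Q5_K_M": 5.69, "Q4_K_M": 4.85, "Q3_K_M": 3.91}

def weights_gb(params_b: float, quant: str) -> float:
    """Weights-only footprint in GB for a params_b-billion-parameter model."""
    return params_b * BPW[quant] / 8

for q in ("Q5_K_M", "Q4_K_M", "Q3_K_M"):
    print(f"Mistral Large 2 (123B) @ {q}: ~{weights_gb(123, q):.0f} GB weights")
# Q5_K_M ~87 GB, Q4_K_M ~75 GB, Q3_K_M ~60 GB:
# each step down buys roughly 12-15 GB on a 123B model.
```

These weights-only numbers sit far below the table's estimates precisely because the estimates include KV cache and replicated activations under tensor parallelism; compare the step sizes, not the columns.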
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Shopping by vertical? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding-workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.