Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective capacity comes in below the raw total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
4× NVIDIA H100 SXM = 320 GB total VRAM, but if the cards can't talk over NVLink, cross-card traffic is PCIe-bound (PCIe 5.0 x16 at ~64 GB/s versus ~900 GB/s over NVLink 4). With tensor parallelism, each card holds ~1/4 of the model weights and replicates activations + KV cache. After ~17% TP overhead, effective model capacity is ~266 GB. The largest single tensor on one card is capped at ~78 GB.
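The effective-VRAM math above can be sketched in a few lines. This is an illustrative reconstruction, not the tool's exact formula; the ~17% tensor-parallel overhead factor is inferred from the fit tables below.

```python
# Illustrative sketch of the effective-VRAM math: total VRAM minus a
# tensor-parallel overhead factor gives the capacity the fit tables
# are judged against.

def effective_vram_gb(cards: int, gb_per_card: float, tp_overhead: float = 0.17) -> float:
    """Capacity left for model weights + KV cache after TP overhead."""
    total = cards * gb_per_card
    return total * (1.0 - tp_overhead)

def headroom_pct(vram_needed_gb: float, effective_gb: float) -> float:
    """Positive = spare room, negative = overshoot."""
    return (1.0 - vram_needed_gb / effective_gb) * 100.0

effective = effective_vram_gb(4, 80)          # 4x H100 SXM
print(round(effective, 1))                    # 265.6 -- the ~266 GB figure
print(round(headroom_pct(169.9, effective)))  # 36 -- matches the Mixtral 8x22B row
```

Plugging any row's VRAM estimate into `headroom_pct` reproduces the headroom percentages quoted in the tables.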
Best engine for this topology + skill level + use case.
Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.
Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.
Layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM but easier to debug.
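The engine recommendations above boil down to a small decision rule. The sketch below is a hypothetical simplification of that logic (the function name and rules are illustrative, not the tool's actual code): vLLM for multi-GPU serving, ExLlamaV2 for solo-user throughput on a pair, llama.cpp for easy experimentation.

```python
# Hedged sketch of the engine-picking logic described above.
# All names and thresholds here are illustrative assumptions.

def recommend_engine(num_gpus: int, single_user: bool, experimenting: bool) -> str:
    if experimenting:
        return "llama.cpp"   # layer-split via --tensor-split, easiest to debug
    if single_user and num_gpus <= 2:
        return "ExLlamaV2"   # best single-stream throughput on an NVLink'd pair
    return "vLLM"            # tensor-parallel serving across all cards

print(recommend_engine(4, single_user=False, experimenting=False))  # vLLM
```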
183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Mixtral 8x22B Instruct | 141B | Q4_K_M | 169.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| WizardLM-2 8x22B | 141B | Q4_K_M | 169.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| DBRX Base | 132B | Q4_K_M | 159.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 40% headroom. |
| DBRX Instruct | 132B | AWQ-INT4 | 214.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 19% headroom. |
| Mistral Large 2 (123B) | 123B | Q4_K_M | 149.5 GB | 8,192 | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| Nemotron 3 Super (120B-A12B) | 120B | Q4_K_M | 146.1 GB | 8,192 | Comfortable fit with 45% headroom — room to extend context or run alongside other workloads. |
| Llama 4 Scout | 109B | Q5_K_M | 143.2 GB | 8,192 | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| Command R+ (Aug 2024) | 104B | AWQ-INT4 | 171.2 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 36% headroom. |
| Command R+ 104B | 104B | Q4_K_M | 127.9 GB | 8,192 | Comfortable fit with 52% headroom — room to extend context or run alongside other workloads. |
| Llama 3.2 90B Vision Instruct | 90B | Q4_K_M | 112 GB | 8,192 | Comfortable fit with 58% headroom — room to extend context or run alongside other workloads. |
| Llama 3.2 90B Vision | 90B | AWQ-INT4 | 149.5 GB | 8,192 | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| InternVL 2.5 78B | 78B | Q4_K_M | 98.4 GB | 8,192 | Comfortable fit with 63% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Math 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Comfortable fit with 74% headroom — room to extend context or run alongside other workloads. |
| Qwen 3 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | Comfortable fit with 54% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 72B Instruct | 72B | Q5_K_M | 98 GB | 8,192 | Comfortable fit with 63% headroom — room to extend context or run alongside other workloads. |
| Molmo 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Comfortable fit with 74% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5-VL 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | Comfortable fit with 54% headroom — room to extend context or run alongside other workloads. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | Comfortable fit with 66% headroom — room to extend context or run alongside other workloads. |
| Hermes 4 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Llama 3.3 70B Instruct | 70B | Q8_0 | 90.8 GB | 8,192 | Comfortable fit with 66% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Llama 70B | 70B | Q5_K_M | 95.5 GB | 8,192 | Comfortable fit with 64% headroom — room to extend context or run alongside other workloads. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| EVA Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Llama 4 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
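The "room to extend context" notes above come down to KV-cache growth, which scales linearly with context length. A hedged sketch, assuming Llama-3-70B-like dimensions (80 layers, 8 KV heads under GQA, head_dim 128, FP16 cache); check your model's actual config before relying on the numbers.

```python
# FP16 KV-cache size vs context length, using assumed 70B-class
# dimensions (80 layers, 8 KV heads, head_dim 128) -- illustrative only.

def kv_cache_gb(ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """K and V tensors, for every layer, for every cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx * per_token / 1e9

print(round(kv_cache_gb(8_192), 1))    # 2.7  -- modest at the tables' default context
print(round(kv_cache_gb(131_072), 1))  # 42.9 -- at 128K, headroom disappears fast
```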
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| GLM-5 | 200B | Q4_K_M | 236.8 GB | 8,192 | Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| GLM-5 Pro | 144B | AWQ-INT4 | 233.2 GB | 8,192 | Tight fit at AWQ-INT4 — only 12% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
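The three fit categories map to headroom bands. The thresholds below are reverse-engineered from the tables and are illustrative, not the tool's published cutoffs.

```python
# Illustrative fit classifier against the build's ~266 GB effective VRAM.
# Threshold values are assumptions inferred from the tables, not official.

def classify_fit(vram_needed_gb: float, effective_gb: float = 266.0) -> str:
    headroom = 1.0 - vram_needed_gb / effective_gb
    if headroom < 0:
        return "doesn't fit"
    if headroom < 0.15:
        return "tight"       # e.g. GLM-5 at 11% headroom
    if headroom < 0.40:
        return "fits cleanly"
    return "comfortable"

print(classify_fit(236.8))  # tight -- the GLM-5 row above
```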
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Kimi K1.5 | 200B | AWQ-INT4 | 320 GB | 8,192 | ~320.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Qwen 3 235B-A22B | 235B | Q4_K_M | 276.5 GB | 8,192 | ~276.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| DeepSeek Coder V2 236B | 236B | Q4_K_M | 277.6 GB | 8,192 | ~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| DeepSeek V2.5 236B | 236B | Q4_K_M | 277.6 GB | 8,192 | ~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| Llama 3.1 Nemotron Ultra 253B | 253B | Q4_K_M | 296.9 GB | 8,192 | ~296.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 12%. Drop quant or move to a larger build. |
| DeepSeek V4 Flash (284B MoE) | 284B | Q4_K_M | 332 GB | 8,192 | ~332.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 25%. Drop quant or move to a larger build. |
| Hunyuan Large 389B MoE | 389B | Q4_K_M | 451.1 GB | 8,192 | ~451.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 70%. Drop quant or move to a larger build. |
| Qwen 3.5 235B-A17B (MoE) | 397B | Q4_K_M | 460.2 GB | 8,192 | ~460.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 73%. Drop quant or move to a larger build. |
| Jamba 1.5 Large | 398B | Q4_K_M | 461.3 GB | 8,192 | ~461.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 73%. Drop quant or move to a larger build. |
| Llama 4 Maverick | 400B | Q4_K_M | 463.6 GB | 8,192 | ~463.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 74%. Drop quant or move to a larger build. |
| Llama 4 405B | 405B | AWQ-INT4 | 637.7 GB | 8,192 | ~637.7 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 140%. Drop quant or move to a larger build. |
| DeepSeek V3 (671B MoE) | 671B | Q4_K_M | 770.9 GB | 8,192 | ~770.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 190%. Drop quant or move to a larger build. |
| DeepSeek R1 (671B reasoning) | 671B | Q4_K_M | 770.9 GB | 8,192 | ~770.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 190%. Drop quant or move to a larger build. |
| Mistral Medium 3.5 (675B MoE) | 675B | Q4_K_M | 775.4 GB | 8,192 | ~775.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 192%. Drop quant or move to a larger build. |
| DeepSeek V4 | 745B | AWQ-INT4 | 1164.7 GB | 8,192 | ~1164.7 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 338%. Drop quant or move to a larger build. |
| Kimi K2.6 | 1000B | Q4_K_M | 1143.9 GB | 8,192 | ~1143.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 330%. Drop quant or move to a larger build. |
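The "move to a larger build" advice can be made concrete: given a model's VRAM estimate from the table, how many 80 GB cards would it take at the same ~17% tensor-parallel overhead? A rough sketch under those assumptions:

```python
import math

# How many 80 GB cards a given VRAM estimate implies, keeping the
# ~17% tensor-parallel overhead assumption. Illustrative only.

def cards_needed(vram_needed_gb: float, gb_per_card: float = 80.0,
                 tp_overhead: float = 0.17) -> int:
    usable_per_card = gb_per_card * (1.0 - tp_overhead)
    return math.ceil(vram_needed_gb / usable_per_card)

print(cards_needed(276.5))  # 5  -- Qwen 3 235B-A22B just misses the 4-card build
print(cards_needed(770.9))  # 12 -- DeepSeek V3 is a different class of cluster
```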
NVLink vs PCIe, tensor- vs pipeline-parallelism, and honest numbers for mixed-card rigs.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.