Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.
Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
8× NVIDIA GeForce RTX 4090 = 192 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs NVLink ~112 GB/s). With tensor-parallelism, each card holds ~1/8 of the model weights and replicates activations + KV cache. After 15% TP overhead, effective model capacity is ~150 GB. Largest single tensor on one card is ~22 GB.
Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.
No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.
Best engine for this topology + skill level + use case.
Ray Serve is the runtime you selected. It is compatible enough for this build, so the custom engine keeps it as the primary path while still showing alternatives.
Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.
Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.
315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Sarvam 105B | 105B | Q4_K_M | 113.2 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 25% headroom. |
| Sarvam 105B FP8 | 105B | Q4_K_M | 113.2 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 25% headroom. |
| Command R+ 104B | 104B | Q4_K_M | 115 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 23% headroom. |
| Command R+ (Aug 2024) | 104B | AWQ-INT4 | 115 GB | 8,192 | No measured row yet | Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom. |
| Llama 3.2 90B Vision | 90B | AWQ-INT4 | 99.6 GB | 8,192 | No measured row yet | Fits cleanly at AWQ-INT4 + 8,192 ctx with 34% headroom. |
| Llama 3.2 90B Vision Instruct | 90B | Q4_K_M | 98.6 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 34% headroom. |
| InternVL 2.5 78B | 78B | Q4_K_M | 86.3 GB | 8,192 | No measured row yet | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Math 72B | 72B | Q4_K_M | 61.1 GB | 4,096 | No measured row yet | Comfortable fit with 59% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 72B Instruct | 72B | Q5_K_M | 87.5 GB | 8,192 | No measured row yet | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| Molmo 72B | 72B | Q4_K_M | 61.1 GB | 4,096 | No measured row yet | Comfortable fit with 59% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5-VL 72B | 72B | AWQ-INT4 | 80.1 GB | 8,192 | No measured row yet | Comfortable fit with 47% headroom — room to extend context or run alongside other workloads. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Llama 70B | 70B | Q5_K_M | 84.4 GB | 8,192 | No measured row yet | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| Llama 3.1 70B Instruct | 70B | Q5_K_M | 84.4 GB | 8,192 | No measured row yet | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| Tulu 3 70B | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Llama 3.3 70B Instruct | 70B | Q8_0 | 76.2 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Hermes 4 Llama 3.3 70B | 70B | AWQ-INT4 | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Llama 3.1 Nemotron 70B Instruct | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Hermes 4 70B FP8 | 70B | Q4_K_M | 75.4 GB | 8,192 | No measured row yet | Comfortable fit with 50% headroom — room to extend context or run alongside other workloads. |
| Llama 4 70B | 70B | AWQ-INT4 | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| OpenBioLLM Llama 3 70B | 70B | Q4_K_M | 79.1 GB | 8,192 | No measured row yet | Comfortable fit with 47% headroom — room to extend context or run alongside other workloads. |
| EVA Llama 3.3 70B | 70B | AWQ-INT4 | 77 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Jamba 1.5 Mini | 52B | Q4_K_M | 57.5 GB | 8,192 | No measured row yet | Comfortable fit with 62% headroom — room to extend context or run alongside other workloads. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| DBRX Base | 132B | Q4_K_M | 144.8 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DBRX Instruct | 132B | AWQ-INT4 | 144.8 GB | 8,192 | No measured row yet | Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Mistral Large 2 (123B) | 123B | Q4_K_M | 138.2 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Nemotron 3 Super (120B-A12B) | 120B | Q4_K_M | 135.6 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Llama 4 Scout | 109B | Q5_K_M | 136.4 GB | 8,192 | No measured row yet | Tight fit at Q5_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Mixtral 8x22B Instruct | 141B | Q4_K_M | 158.7 GB | 8,192 | No measured row yet | ~158.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 6%. Drop quant or move to a larger build. |
| WizardLM-2 8x22B | 141B | Q4_K_M | 158.7 GB | 8,192 | No measured row yet | ~158.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 6%. Drop quant or move to a larger build. |
| GLM-5 Pro | 144B | AWQ-INT4 | 158.1 GB | 8,192 | No measured row yet | ~158.1 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| GLM-5 | 200B | Q4_K_M | 226 GB | 8,192 | No measured row yet | ~226.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 51%. Drop quant or move to a larger build. |
| Kimi K1.5 | 200B | AWQ-INT4 | 220.8 GB | 8,192 | No measured row yet | ~220.8 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 47%. Drop quant or move to a larger build. |
| Qwen 3 235B-A22B | 235B | Q4_K_M | 266.6 GB | 8,192 | No measured row yet | ~266.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 78%. Drop quant or move to a larger build. |
| DeepSeek Coder V2 236B | 236B | Q4_K_M | 258.7 GB | 8,192 | No measured row yet | ~258.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 72%. Drop quant or move to a larger build. |
| DeepSeek V2.5 236B | 236B | Q4_K_M | 258.7 GB | 8,192 | No measured row yet | ~258.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 72%. Drop quant or move to a larger build. |
| K-EXAONE 236B A23B | 236B | Q4_K_M | 254.3 GB | 8,192 | No measured row yet | ~254.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 70%. Drop quant or move to a larger build. |
| Llama 3.1 Nemotron Ultra 253B | 253B | Q4_K_M | 277.7 GB | 8,192 | No measured row yet | ~277.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 85%. Drop quant or move to a larger build. |
| DeepSeek V4 Flash (284B MoE) | 284B | Q4_K_M | 312.1 GB | 8,192 | No measured row yet | ~312.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build. |
| Hunyuan Large 389B MoE | 389B | Q4_K_M | 425.5 GB | 8,192 | No measured row yet | ~425.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 184%. Drop quant or move to a larger build. |
| Qwen 3.5 235B-A17B (MoE) | 397B | Q4_K_M | 435.8 GB | 8,192 | No measured row yet | ~435.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 191%. Drop quant or move to a larger build. |
| Jamba 1.5 Large | 398B | Q4_K_M | 440.5 GB | 8,192 | No measured row yet | ~440.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 194%. Drop quant or move to a larger build. |
| Llama 4 Maverick | 400B | Q4_K_M | 452 GB | 8,192 | No measured row yet | ~452.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 201%. Drop quant or move to a larger build. |
| Llama 4 405B | 405B | AWQ-INT4 | 444 GB | 8,192 | No measured row yet | ~444.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 196%. Drop quant or move to a larger build. |
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.