Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.
Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
2× NVIDIA GeForce RTX 3090 = 48 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs NVLink ~112 GB/s). With tensor-parallelism, each card holds ~1/2 of the model weights and replicates activations + KV cache. After 15% TP overhead, effective model capacity is ~37 GB. Largest single tensor on one card is ~22 GB.
Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.
No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.
Best engine for this topology + skill level + use case.
Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.
Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.
Layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM but easier to debug.
315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Falcon 40B Instruct | 40B | Q4_K_M | 28.1 GB | 2,048 | No measured row yet | Fits cleanly at Q4_K_M + 2,048 ctx with 24% headroom. |
| Pollux Judge 32B | 32B | Q4_K_M | 26.5 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 28% headroom. |
| Qwen 2.5 Coder 32B Instruct | 32B | Q4_K_M | 22.1 GB | 8,192 | No measured row yet | Comfortable fit with 40% headroom — room to extend context or run alongside other workloads. |
| Sarvam 30B | 30B | Q4_K_M | 24.8 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 33% headroom. |
| Gemma 3 27B | 27B | Q4_K_M | 30.3 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 18% headroom. |
| MedGemma 27B | 27B | Q4_K_M | 30.3 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 18% headroom. |
| InternVL 2.5 26B | 26B | Q4_K_M | 29.8 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Gemma 4 Turkish 26B (4B active) | 26B | Q4_K_M | 28 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 24% headroom. |
| Gemma 4 26B MoE | 26B | Q4_K_M | 29.8 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Mistral Small 3 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| Mistral Medium 3 24B (dense) | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| Dolphin 3.0 Mistral 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| Mistral Saba 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| Mistral Small 3.2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| Devstral Small 2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| Sarvam M | 24B | Q4_K_M | 19.9 GB | 4,096 | No measured row yet | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Mistral 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom. |
| GPT-OSS Swallow 20B RL v0.1 | 20B | Q4_K_M | 21.6 GB | 8,192 | No measured row yet | Comfortable fit with 42% headroom — room to extend context or run alongside other workloads. |
| GPT-NeoX 20B | 20B | Q4_K_M | 14.1 GB | 2,048 | No measured row yet | Comfortable fit with 62% headroom — room to extend context or run alongside other workloads. |
| DeepSeek V3 Lite (16B MoE) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 51% headroom — room to extend context or run alongside other workloads. |
| DeepSeek Coder V2 Lite (16B) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 51% headroom — room to extend context or run alongside other workloads. |
| Granite 3 MoE (3B active) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 51% headroom — room to extend context or run alongside other workloads. |
| DeepSeek MoE 16B Base | 16B | Q4_K_M | 14 GB | 4,096 | No measured row yet | Comfortable fit with 62% headroom — room to extend context or run alongside other workloads. |
| DeepSeek V2 Lite Chat | 16B | Q4_K_M | 16.9 GB | 8,192 | No measured row yet | Comfortable fit with 54% headroom — room to extend context or run alongside other workloads. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Qwen 3.6 35B-A3B (MTP) | 35B | Q3_K_M | 35.4 GB | 8,192 | No measured row yet | Tight fit at Q3_K_M — only 4% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DeepSeek Coder V3 | 33B | AWQ-INT4 | 36.5 GB | 8,192 | No measured row yet | Tight fit at AWQ-INT4 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| EXAONE 3.5 32B Instruct | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| EXAONE 3.5 32B Instruct AWQ | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 2.5 32B Instruct | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Magistral 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Aya Expanse 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| QwQ 32B Preview | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| EXAONE 4.0.1 32B | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 Coder 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen3 Swallow 32B RL v0.2 | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DeepSeek R1 Distill Qwen 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| OLMo 2 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| llm-jp 4 32B A3B Thinking | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Phind CodeLlama 34B v2 | 34B | Q4_K_M | 38 GB | 8,192 | No measured row yet | ~38.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3%. Drop quant or move to a larger build. |
| Yi 1.5 34B | 34B | Q4_K_M | 38 GB | 8,192 | No measured row yet | ~38.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3%. Drop quant or move to a larger build. |
| Aya 23 35B | 35B | Q4_K_M | 39.6 GB | 8,192 | No measured row yet | ~39.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 7%. Drop quant or move to a larger build. |
| Mihenk LLM v2 35B (Turkish Financial) | 35B | Q4_K_M | 37.8 GB | 8,192 | No measured row yet | ~37.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 2%. Drop quant or move to a larger build. |
| Command R 35B | 35B | Q4_K_M | 39.6 GB | 8,192 | No measured row yet | ~39.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 7%. Drop quant or move to a larger build. |
| ALIA 40b instruct 2601 | 40B | Q4_K_M | 43.1 GB | 8,192 | No measured row yet | ~43.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 17%. Drop quant or move to a larger build. |
| Mixtral 8X7B Instruct v0.1 GPTQ | 47B | Q4_K_M | 50.3 GB | 8,192 | No measured row yet | ~50.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 36%. Drop quant or move to a larger build. |
| Mixtral 8x7B Instruct | 47B | Q4_K_M | 52.9 GB | 8,192 | No measured row yet | ~52.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 43%. Drop quant or move to a larger build. |
| Nemotron 3 Super 49B | 49B | AWQ-INT4 | 53.9 GB | 8,192 | No measured row yet | ~53.9 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 46%. Drop quant or move to a larger build. |
| Jamba 1.5 Mini | 52B | Q4_K_M | 57.5 GB | 8,192 | No measured row yet | ~57.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 55%. Drop quant or move to a larger build. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 77 GB | 8,192 | No measured row yet | ~77.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Llama 70B | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | ~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build. |
| Llama 3.1 70B Instruct | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | ~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build. |
| Tulu 3 70B | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | ~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 77 GB | 8,192 | No measured row yet | ~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build. |
| Llama 3.3 70B Instruct | 70B | Q4_K_M | 44.7 GB | 8,192 | No measured row yet | ~44.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 21%. Drop quant or move to a larger build. |
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.