Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.
Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
Mixed-GPU (asymmetric) configuration. Tensor-parallel doesn't work cleanly because TP requires identical cards — your faster card stalls waiting on the slower one every layer. Use llama.cpp's layer-split with manual --tensor-split tuning to distribute layers by VRAM ratio. Effective capacity ~33 GB after layer-split overhead, but the slowest card (22 GB effective) bottlenecks single-tensor operations.
Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.
No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.
Best engine for this topology + skill level + use case.
Mixed-GPU configurations need llama.cpp's --tensor-split flag with manual ratio tuning by VRAM. vLLM's tensor-parallel requires identical cards and won't run cleanly here.
Inherits llama.cpp's layer-split path with friendlier UX. OLLAMA_GPU_OVERHEAD and per-card env vars do most of what manual flags do.
315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Pollux Judge 32B | 32B | Q4_K_M | 26.5 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 20% headroom. |
| Qwen 2.5 Coder 32B Instruct | 32B | Q4_K_M | 22.1 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 33% headroom. |
| Sarvam 30B | 30B | Q4_K_M | 24.8 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 25% headroom. |
| Gemma 4 Turkish 26B (4B active) | 26B | Q4_K_M | 28 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 15% headroom. |
| Mistral Small 3 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Mistral Medium 3 24B (dense) | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Dolphin 3.0 Mistral 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Mistral Saba 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Mistral Small 3.2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Devstral Small 2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Sarvam M | 24B | Q4_K_M | 19.9 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 40% headroom. |
| DeepSeek R1 Distill Mistral 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom. |
| Codestral 22B | 22B | Q4_K_M | 24.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 25% headroom. |
| GPT-OSS Swallow 20B RL v0.1 | 20B | Q4_K_M | 21.6 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom. |
| GPT-NeoX 20B | 20B | Q4_K_M | 14.1 GB | 2,048 | No measured row yet | Comfortable fit with 57% headroom — room to extend context or run alongside other workloads. |
| DeepSeek V3 Lite (16B MoE) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| DeepSeek Coder V2 Lite (16B) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| Granite 3 MoE (3B active) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| DeepSeek MoE 16B Base | 16B | Q4_K_M | 14 GB | 4,096 | No measured row yet | Comfortable fit with 58% headroom — room to extend context or run alongside other workloads. |
| DeepSeek V2 Lite Chat | 16B | Q4_K_M | 16.9 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| StarCoder 2 15B | 15B | Q4_K_M | 17 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Phi-4 14B | 14B | Q8_0 | 22.8 GB | 8,192 | No measured row yet | Fits cleanly at Q8_0 + 8,192 ctx with 31% headroom. |
| Qwen 2.5 14B Instruct | 14B | Q8_0 | 23.5 GB | 8,192 | No measured row yet | Fits cleanly at Q8_0 + 8,192 ctx with 29% headroom. |
| Qwen 2.5 Coder 14B Instruct | 14B | Q4_K_M | 15.8 GB | 8,192 | No measured row yet | Comfortable fit with 52% headroom — room to extend context or run alongside other workloads. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Falcon 40B Instruct | 40B | Q4_K_M | 28.1 GB | 2,048 | No measured row yet | Tight fit at Q4_K_M — only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Gemma 3 27B | 27B | Q4_K_M | 30.3 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3.6 27B (MTP) | 27B | Q5_K_M | 33 GB | 8,192 | No measured row yet | Tight fit at Q5_K_M — only 0% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| MedGemma 27B | 27B | Q4_K_M | 30.3 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| InternVL 2.5 26B | 26B | Q4_K_M | 29.8 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Gemma 4 26B MoE | 26B | Q4_K_M | 29.8 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M — only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Qwen 3 30B-A3B | 30B | Q4_K_M | 33.9 GB | 8,192 | No measured row yet | ~33.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3%. Drop quant or move to a larger build. |
| Nemotron 3 Nano (30B-A3B) | 30B | Q4_K_M | 33.9 GB | 8,192 | No measured row yet | ~33.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3%. Drop quant or move to a larger build. |
| Omni 31B Turkish Reasoning | 31B | Q4_K_M | 33.5 GB | 8,192 | No measured row yet | ~33.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 1%. Drop quant or move to a larger build. |
| Gemma 4 31B Dense | 31B | Q4_K_M | 34.4 GB | 8,192 | No measured row yet | ~34.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| EXAONE 3.5 32B Instruct | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| EXAONE 3.5 32B Instruct AWQ | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| Qwen 2.5 32B Instruct | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| Magistral 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| Aya Expanse 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| QwQ 32B Preview | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| EXAONE 4.0.1 32B | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| Qwen 3 Coder 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| Qwen3 Swallow 32B RL v0.2 | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build. |
| Qwen 3 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Qwen 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 9%. Drop quant or move to a larger build. |
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.