Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools it (Apple unified memory), and we always explain why effective VRAM is lower than the total.
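The pooling rule above can be sketched in a few lines. This is an illustrative model, not the tool's actual code; the ~1.5 GB overhead figure is the same assumption used in the single-GPU example below.

```python
# Discrete GPUs each keep their own memory pool; only unified-memory
# systems (e.g. Apple silicon) are honestly summed into one figure.
def effective_vram_gb(cards_gb, unified=False, overhead_gb=1.5):
    """Return usable VRAM per memory pool after runtime overhead."""
    if unified:
        # One physical pool: summing is honest here.
        return [sum(cards_gb) - overhead_gb]
    # Separate pools: each card pays its own overhead; never sum.
    return [c - overhead_gb for c in cards_gb]

print(effective_vram_gb([16.0]))                 # single 16 GB card
print(effective_vram_gb([24.0, 24.0]))           # two discrete cards: two pools
print(effective_vram_gb([64.0], unified=True))   # Apple unified memory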
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
Single NVIDIA GeForce RTX 5080 — 16 GB VRAM minus ~1.5 GB runtime overhead ≈ 14.5 GB usable for weights + KV cache + activations. The ~9% headroom we reserve covers the typical OS/driver footprint and leaves KV-cache room for an 8K-32K context.
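Here is why the KV cache dominates the "room for an 8K-32K context" part of the budget. The architecture numbers below are typical for a Llama-style 7-8B model with grouped-query attention, not any specific model in the tables.

```python
# Back-of-envelope for the KV-cache term in "weights + KV cache + activations".
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # K and V tensors each store one vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

# 32 layers, 8 KV heads, head_dim 128, fp16 cache:
print(kv_cache_gib(32, 8, 128, 8192))   # ~1 GiB at 8K context
print(kv_cache_gib(32, 8, 128, 32768))  # ~4 GiB at 32K — why context caps matter
```

The cache grows linearly with context length, which is why the borderline rows below warn that longer context will OOM even when the weights themselves fit.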
Best engine for this topology + skill level + use case.
183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
Fits comfortably:

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Qwen 2.5 Math 7B | 7B | Q4_K_M | 12.1 GB | 4,096 | Fits cleanly at Q4_K_M + 4,096 ctx with 19% headroom. |
| Janus-Pro 7B | 7B | Q4_K_M | 12.1 GB | 4,096 | Fits cleanly at Q4_K_M + 4,096 ctx with 19% headroom. |
| Nemotron Mini 4B Instruct | 4B | Q4_K_M | 9.4 GB | 4,096 | Fits cleanly at Q4_K_M + 4,096 ctx with 37% headroom. |
| EXAONE 3.5 2.4B | 2B | Q4_K_M | 12.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 15% headroom. |
| Granite 3.0 2B Instruct | 2B | Q4_K_M | 7.7 GB | 4,096 | Comfortable fit with 49% headroom — room to extend context or run alongside other workloads. |
| Moondream 2 | 2B | Q4_K_M | 5.3 GB | 2,048 | Comfortable fit with 65% headroom — room to extend context or run alongside other workloads. |
| SmolLM 2 1.7B Instruct | 2B | Q4_K_M | 11.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 21% headroom. |
| Whisper Large v3 | 2B | FP16 | 5.1 GB | n/a | Comfortable fit with 66% headroom — room to run alongside other workloads. |
| DeepSeek R1 Distill Qwen 1.5B | 2B | Q4_K_M | 11.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| Qwen 2.5 Coder 1.5B | 2B | Q4_K_M | 11.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| RWKV 7 'Goose' 1.5B | 2B | Q5_K_M | 11.8 GB | 8,192 | Fits cleanly at Q5_K_M + 8,192 ctx with 21% headroom. |
| Qwen 2.5 1.5B Instruct | 2B | Q4_K_M | 11.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom. |
| Llama 3.2 1B Instruct | 1B | Q8_0 | 11.6 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 23% headroom. |
| Gemma 3 1B | 1B | Q4_K_M | 11.1 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 26% headroom. |
| Whisper Large v3 Turbo | 1B | FP16 | 3.5 GB | n/a | Comfortable fit with 77% headroom — room to run alongside other workloads. |
| BGE M3 | 1B | FP16 | 11.5 GB | 8,192 | Fits cleanly at FP16 + 8,192 ctx with 24% headroom. |
| BGE Reranker v2 M3 | 1B | FP16 | 11.5 GB | 8,192 | Fits cleanly at FP16 + 8,192 ctx with 24% headroom. |
| Qwen 2.5 0.5B Instruct | 1B | Q4_K_M | 10.6 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 30% headroom. |
| SmolLM 2 360M Instruct | 0.4B | Q4_K_M | 10.4 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 31% headroom. |
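The headroom buckets in these tables look consistent with simple thresholds. The cutoffs below are inferred from the notes (tight at ≤ ~14%, comfortable at ≥ ~40%) and the ~14.9 GB effective figure implied by the percentages — they are a reconstruction, not the tool's confirmed values.

```python
# Illustrative fit classifier; thresholds and effective-VRAM figure inferred.
def classify_fit(vram_needed_gb, effective_gb, tight_max=0.14, comfy_min=0.40):
    headroom = (effective_gb - vram_needed_gb) / effective_gb
    if headroom < 0:
        return "not practical", headroom
    if headroom <= tight_max:
        return "borderline", headroom
    if headroom >= comfy_min:
        return "comfortable", headroom
    return "fits", headroom

# Reproducing a few rows at ~14.9 GB effective VRAM:
print(classify_fit(12.1, 14.9))  # Qwen 2.5 Math 7B → fits, ~19% headroom
print(classify_fit(13.0, 14.9))  # Molmo 7B-D → borderline
print(classify_fit(17.9, 14.9))  # 7B class at 8K ctx → not practical
```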
Borderline:

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Molmo 7B-D | 8B | Q4_K_M | 13 GB | 4,096 | Tight fit at Q4_K_M — only 14% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Granite 3.0 8B Instruct | 8B | Q4_K_M | 13 GB | 4,096 | Tight fit at Q4_K_M — only 14% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 2.5 7B Instruct | 7B | Q4_K_M | 14.9 GB | 8,192 | Tight fit at Q4_K_M — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Phi-3.5 Vision | 4B | Q4_K_M | 14.8 GB | 8,192 | Tight fit at Q4_K_M — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 4B | 4B | Q4_K_M | 14.5 GB | 8,192 | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Gemma 4 E4B (Effective 4B) | 4B | Q4_K_M | 14.5 GB | 8,192 | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Gemma 3 4B | 4B | Q4_K_M | 14.5 GB | 8,192 | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| MiniCPM 3 4B | 4B | Q4_K_M | 14.5 GB | 8,192 | Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Phi-3.5 Mini Instruct | 4B | Q4_K_M | 14.3 GB | 8,192 | Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Phi-4 Mini 4B | 4B | Q4_K_M | 14.3 GB | 8,192 | Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Phi-4 Reasoning Mini 4B | 4B | Q4_K_M | 14.3 GB | 8,192 | Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Llama 3.2 3B Instruct | 3B | Q8_0 | 14.8 GB | 8,192 | Tight fit at Q8_0 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 2.5 Coder 3B | 3B | Q4_K_M | 13.4 GB | 8,192 | Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Dolphin 3.0 Llama 3.2 3B | 3B | Q4_K_M | 13.4 GB | 8,192 | Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Hermes 3 Llama 3.2 3B | 3B | Q4_K_M | 13.4 GB | 8,192 | Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Ministral 3B Instruct | 3B | Q4_K_M | 13.4 GB | 8,192 | Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
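To see what "drop one quant level" buys, here are rough weights-only sizes. The bits-per-weight figures are approximate llama.cpp averages (K-quants mix block formats, so Q4_K_M lands near ~4.8 bpw rather than exactly 4.0); they are assumptions, not the tool's internal constants.

```python
# Approximate effective bits per weight for common llama.cpp quants.
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

def weights_gib(params_billions, quant):
    """Weights-only footprint in GiB; excludes KV cache and activations."""
    return params_billions * 1e9 * BPW[quant] / 8 / 2**30

for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(q, round(weights_gib(7, q), 1))  # a 7B model under three quants
```

Dropping from Q8_0 to Q4_K_M roughly halves the weight footprint, which is the lever the borderline notes above are pointing at.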
Not practical on this build:

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| PaliGemma 2 3B | 3B | BF16 | 17.8 GB | 8,192 | ~17.8 GB needed at BF16 + 8,192 ctx — overshoots effective VRAM by 19%. Drop quant or move to a larger build. |
| CodeGemma 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Falcon Mamba 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Qwen 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Mistral 7B Instruct v0.3 | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Qwen 2.5 Coder 7B Instruct | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Qwen 2.5-VL 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Codestral Mamba 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Qwen 3 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Qwen 2-VL 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| StarCoder 2 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| CodeQwen 1.5 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| LLaVA 1.6 Mistral 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| LLaVA-OneVision 7B | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Falcon 3 7B Instruct | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| InternLM 2.5 7B Chat | 7B | Q4_K_M | 17.9 GB | 8,192 | ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
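The "overshoots effective VRAM by N%" figures in this table are consistent with a simple ratio against the ~14.9 GB effective budget. The formula is an inference from the numbers, not the tool's documented method.

```python
# How the overshoot percentages above appear to be computed.
def overshoot_pct(needed_gb, effective_gb):
    return round((needed_gb / effective_gb - 1) * 100)

print(overshoot_pct(17.9, 14.9))  # the 7B rows: 20% over budget
print(overshoot_pct(17.8, 14.9))  # PaliGemma 2 3B at BF16: 19% over
```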
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
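A support truth table like the one described above reduces to a keyed lookup. The entries below are examples of widely known support facts (e.g. vLLM is Linux-first), not the tool's actual matrix.

```python
# Minimal shape for a runtime × OS × hardware support matrix (illustrative).
SUPPORT = {
    ("llama.cpp", "macos", "apple_silicon"): True,   # Metal backend
    ("llama.cpp", "linux", "nvidia"): True,          # CUDA backend
    ("vllm", "linux", "nvidia"): True,
    ("vllm", "windows", "nvidia"): False,            # Linux-first; use WSL2
}

def supported(runtime, os_name, hw):
    # Unknown combinations default to unsupported rather than guessing.
    return SUPPORT.get((runtime, os_name, hw), False)

print(supported("vllm", "linux", "nvidia"))
print(supported("vllm", "windows", "nvidia"))
```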
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: see AI PC build under $1,000 and AI PC build under $2,000 for the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.