Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.
Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
Single NVIDIA GeForce RTX 3080 16GB (Mobile) — 16 GB VRAM minus ~1.8 GB runtime/driver overhead = ~14 GB usable for weights + KV cache + activations. The remaining uncertainty band covers OS display use and background CUDA allocations.
Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.
| Model | Evidence | Quant | Tok/s | Provenance |
|---|---|---|---|---|
| Turkcell LLM 7B v1 NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 85.77 | |
| RefinedNeuro RN TR R2 NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 79.27 | |
| RefinedNeuro RN TR R1 NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 79.89 | |
| Qwen 3 4B NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 103.7 | |
| Qwen 3 14B NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 38.26 | |
| Qwen 2.5 7B Instruct NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 80.42 | |
| Phi-4 Reasoning 14B NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 40.44 | |
| Phi-3.5 Mini Instruct NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 155.4 | |
| Mistral Nemo 12B Instruct NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 65.72 | |
| Mistral 7B Instruct v0.3 NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 89.64 | |
| Llama 3.2 11B Vision Instruct NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 67.00 | |
| Malhajar Mistral 7B Turkish NVIDIA GeForce RTX 3080 16GB (Mobile) | 4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96 | Q4_K_M | 87.28 |
| Model | Benchmark | Setup | Score | Log |
|---|---|---|---|---|
| Qwen 3 8B | HumanEval+ | Q4_K_M / ollama-0.24 | 2.4 | gist |
| Llama 3.1 8B Instruct | MBPP+ | Q4_K_M / ollama-0.24 | 39.2 | gist |
| Phi-4 14B | MBPP+ | Q4_K_M / ollama-0.24 | 60.3 | gist |
| Qwen 2.5 Coder 7B Instruct | MBPP+ | Q4_K_M / ollama-0.24 | 66.9 | gist |
| Llama 3.1 8B Instruct | HumanEval+ | Q4_K_M / ollama-0.24 | 56.1 | gist |
| Phi-4 14B | HumanEval+ | Q4_K_M / ollama-0.24 | 78.7 | gist |
| Qwen 2.5 Coder 7B Instruct | HumanEval+ | Q4_K_M / ollama-0.24 | 81.1 | gist |
| Llama 3.2 3B Instruct | TurkishMMLU (Generative) | Q4_K_M / ollama-0.24 | 11.4 | gist |
| Turkish Llama 8B Instruct v0.1 | TurkishMMLU (Generative) | Q4_K_M / ollama-0.24 | 11.0 | gist |
Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.
Coding agents emit 5-15 tool calls per task. Each call carries the full agent system prompt + context. KV-cache budget for that prompt × concurrent requests is the limit. The decode side is well-served by any modern card; the prefill side bottlenecks first.
Best engine for this topology + skill level + use case.
AWQ-INT4 path fits 32B-class models on a 24 GB card with concurrent users. The production-default for self-hosted coding agents and multi-user serving.
Single-stream throughput king on consumer NVIDIA. EXL2 4.65bpw on a 4090 hits the highest tok/s in this class.
63 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Yi Coder 9B | 9B | Q4_K_M | 10.2 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 27% headroom. |
| Llama 3.1 8B Instruct | 8B | Q8_0 | 11.1 GB | 16,384 | No measured row yet | Fits cleanly at Q8_0 + 16,384 ctx with 21% headroom. |
| Gervásio 8B PTPT | 8B | Q4_K_M | 6.6 GB | 4,096 | No measured row yet | Comfortable fit with 53% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 7B Instruct | 7B | Q8_0 | 9.5 GB | 16,384 | Measured on this hardware: 80.42 tok/s at Q4_K_M (4,096 measured ctx). Fits cleanly at Q8_0 + 16,384 ctx with 32% headroom. | |
| CodeGemma 7B | 7B | Q4_K_M | 7.9 GB | 8,192 | No measured row yet | Comfortable fit with 43% headroom — room to extend context or run alongside other workloads. |
| Codestral Mamba 7B | 7B | Q4_K_M | 11.4 GB | 16,384 | No measured row yet | Fits cleanly at Q4_K_M + 16,384 ctx with 18% headroom. |
| CodeQwen 1.5 7B | 7B | Q4_K_M | 11.6 GB | 16,384 | No measured row yet | Fits cleanly at Q4_K_M + 16,384 ctx with 17% headroom. |
| StarCoder 2 7B | 7B | Q4_K_M | 11.6 GB | 16,384 | No measured row yet | Fits cleanly at Q4_K_M + 16,384 ctx with 17% headroom. |
| Salamandra 7B | 7B | Q4_K_M | 7.6 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Coder 3B | 3B | Q4_K_M | 8 GB | 32,768 | No measured row yet | Comfortable fit with 43% headroom — room to extend context or run alongside other workloads. |
| StarCoder 2 3B | 3B | Q4_K_M | 5.1 GB | 16,384 | No measured row yet | Comfortable fit with 63% headroom — room to extend context or run alongside other workloads. |
| ColPali v1.3 | 3B | Q4_K_M | 1.8 GB | 0 | No measured row yet | Comfortable fit with 87% headroom — room to extend context or run alongside other workloads. |
| Salamandra 2B | 2B | Q4_K_M | 2.4 GB | 8,192 | No measured row yet | Comfortable fit with 83% headroom — room to extend context or run alongside other workloads. |
| mxbai-rerank-large-v2 | 2B | Q4_K_M | 4 GB | 32,768 | No measured row yet | Comfortable fit with 72% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Coder 1.5B | 2B | Q4_K_M | 4.1 GB | 32,768 | No measured row yet | Comfortable fit with 71% headroom — room to extend context or run alongside other workloads. |
| Florence-2 Large | 1B | Q4_K_M | 0.4 GB | 0 | No measured row yet | Comfortable fit with 97% headroom — room to extend context or run alongside other workloads. |
| GOT-OCR 2.0 | 1B | Q4_K_M | 0.3 GB | 0 | No measured row yet | Comfortable fit with 98% headroom — room to extend context or run alongside other workloads. |
| SigLIP SO400M (patch14-384) | 0B | Q4_K_M | 0.3 GB | 0 | No measured row yet | Comfortable fit with 98% headroom — room to extend context or run alongside other workloads. |
| Jina Reranker v2 Base Multilingual | 0B | Q4_K_M | 0.2 GB | 1,024 | No measured row yet | Comfortable fit with 99% headroom — room to extend context or run alongside other workloads. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Qwen 3 14B | 14B | Q4_K_M | 15.8 GB | 8,192 | Measured on this hardware: 38.26 tok/s at Q4_K_M (4,096 measured ctx). The larger target context still needs validation before calling it comfortable. | |
| OpenCoder 8B | 8B | Q4_K_M | 13 GB | 16,384 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 3 8B | 8B | Q4_K_M | 13.1 GB | 16,384 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| DeepSeek R1 Distill Llama 8B | 8B | Q4_K_M | 13 GB | 16,384 | No measured row yet | Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| EXAONE Deep 7.8B | 8B | Q4_K_M | 12.3 GB | 16,384 | No measured row yet | Tight fit at Q4_K_M — only 12% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Qwen 2.5 Coder 7B Instruct | 7B | Q6_K | 13.6 GB | 16,384 | No measured row yet | Tight fit at Q6_K — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| VibeThinker-3B | 3B | FP16 | 12.3 GB | 32,768 | No measured row yet | Tight fit at FP16 — only 12% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Model | Params | Quant | VRAM est. | Context | Evidence | Note |
|---|---|---|---|---|---|---|
| Qwen 2.5 14B Instruct | 14B | Q4_K_M | 16.4 GB | 8,192 | No measured row yet | ~16.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 17%. Drop quant or move to a larger build. |
| Qwen 2.5 Coder 14B Instruct | 14B | Q4_K_M | 15.8 GB | 8,192 | No measured row yet | ~15.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| StarCoder 2 15B | 15B | Q4_K_M | 17 GB | 8,192 | No measured row yet | ~17.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 21%. Drop quant or move to a larger build. |
| DeepSeek V3 Lite (16B MoE) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | ~18.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 28%. Drop quant or move to a larger build. |
| DeepSeek Coder V2 Lite (16B) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | ~18.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 28%. Drop quant or move to a larger build. |
| Codestral 22B | 22B | Q4_K_M | 24.7 GB | 8,192 | No measured row yet | ~24.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 76%. Drop quant or move to a larger build. |
| Mistral Small 3 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | ~26.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 91%. Drop quant or move to a larger build. |
| Devstral Small 2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | ~26.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 91%. Drop quant or move to a larger build. |
| Qwen 3.6 27B (MTP) | 27B | Q3_K_M | 27.3 GB | 8,192 | No measured row yet | ~27.3 GB needed at Q3_K_M + 8,192 ctx — overshoots effective VRAM by 95%. Drop quant or move to a larger build. |
| Qwen 3 30B-A3B | 30B | Q4_K_M | 33.9 GB | 8,192 | No measured row yet | ~33.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 142%. Drop quant or move to a larger build. |
| Sarvam 30B | 30B | Q4_K_M | 24.8 GB | 4,096 | No measured row yet | ~24.8 GB needed at Q4_K_M + 4,096 ctx — overshoots effective VRAM by 77%. Drop quant or move to a larger build. |
| Gemma 4 31B Dense | 31B | Q4_K_M | 34.4 GB | 8,192 | No measured row yet | ~34.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 146%. Drop quant or move to a larger build. |
| Qwen 2.5 32B Instruct | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 157%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 157%. Drop quant or move to a larger build. |
| Qwen 3 Coder 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 157%. Drop quant or move to a larger build. |
| Qwen 2.5 Coder 32B Instruct | 32B | Q4_K_M | 22.1 GB | 8,192 | No measured row yet | ~22.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 58%. Drop quant or move to a larger build. |
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.