RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP · Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other retail affiliate programs). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we have no affiliate relationship with, and cards that sell well that we refuse to recommend. Read more →


Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

Build summary

  • Total VRAM: 16 GB
  • Effective VRAM: ~15 GB (range 13-14 GB)
  • Topology: single GPU (interconnect: none)
  • Setup difficulty: beginner (speed penalty ~0%)
Why effective VRAM is lower than total

Single NVIDIA GeForce RTX 3080 16GB (Mobile) — 16 GB VRAM minus ~1.5 GB runtime overhead ≈ 14.5 GB usable for weights + KV cache + activations. The ~8% headroom we reserve covers the typical OS/driver footprint and leaves KV-cache room for an 8K-32K context.
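The subtraction above can be sketched as a tiny helper. The ~1.5 GB overhead default comes from the example in this section; real overhead varies by OS, driver, and runtime, so treat it as an illustrative assumption rather than a measured constant:

```python
def effective_vram_gb(total_gb: float, overhead_gb: float = 1.5) -> float:
    """Usable VRAM for weights + KV cache + activations on a single GPU.

    overhead_gb models the OS/driver + runtime footprint described above;
    ~1.5 GB is an illustrative default, not a universal figure.
    """
    return total_gb - overhead_gb

# 16 GB card minus ~1.5 GB overhead:
print(effective_vram_gb(16.0))  # 14.5
```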

Recommended runtime

Best engine for this topology, skill level, and use case.

ExLlamaV2 · primary · setup: involved
Highest single-stream throughput on consumer NVIDIA. EXL2 mixed-bit quants are the leading consumer-tier inference format.

llama.cpp · alternative · setup: moderate
Cross-format flexibility — GGUF works everywhere; it's the engine that powers Ollama and LM Studio.

Models that fit your build

183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
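The three buckets below can be sketched as a simple headroom check. The ≥15% cutoff matches the "Comfortable" label in the tables; the function itself is an illustrative sketch under that assumption, not the site's actual engine:

```python
def categorize(vram_needed_gb: float, effective_vram_gb: float) -> str:
    """Bucket a model by headroom against effective VRAM (illustrative)."""
    headroom = (effective_vram_gb - vram_needed_gb) / effective_vram_gb
    if headroom >= 0.15:
        return "comfortable"    # room to extend context or run other workloads
    if headroom > 0.0:
        return "borderline"     # fits, but longer context risks OOM
    return "not practical"      # overshoots effective VRAM

# With ~15 GB effective VRAM:
print(categorize(12.1, 15.0))   # comfortable (~19% headroom)
print(categorize(14.9, 15.0))   # borderline (~1% headroom)
print(categorize(17.9, 15.0))   # not practical
```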

Comfortable
19 models · ≥15% headroom
Model · Params · Quant · VRAM est. · Context · Note
Qwen 2.5 Math 7B · 7B · Q4_K_M · 12.1 GB · 4,096 · Fits cleanly at Q4_K_M + 4,096 ctx with 19% headroom.
Janus-Pro 7B · 7B · Q4_K_M · 12.1 GB · 4,096 · Fits cleanly at Q4_K_M + 4,096 ctx with 19% headroom.
Nemotron Mini 4B Instruct · 4B · Q4_K_M · 9.4 GB · 4,096 · Fits cleanly at Q4_K_M + 4,096 ctx with 37% headroom.
EXAONE 3.5 2.4B · 2B · Q4_K_M · 12.7 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 15% headroom.
Granite 3.0 2B Instruct · 2B · Q4_K_M · 7.7 GB · 4,096 · Comfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Moondream 2 · 2B · Q4_K_M · 5.3 GB · 2,048 · Comfortable fit with 65% headroom — room to extend context or run alongside other workloads.
SmolLM 2 1.7B Instruct · 2B · Q4_K_M · 11.9 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 21% headroom.
Whisper Large v3 · 2B · FP16 · 5.1 GB · 0 · Comfortable fit with 66% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Qwen 1.5B · 2B · Q4_K_M · 11.7 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom.
Qwen 2.5 Coder 1.5B · 2B · Q4_K_M · 11.7 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom.
RWKV 7 'Goose' 1.5B · 2B · Q5_K_M · 11.8 GB · 8,192 · Fits cleanly at Q5_K_M + 8,192 ctx with 21% headroom.
Qwen 2.5 1.5B Instruct · 2B · Q4_K_M · 11.7 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 22% headroom.
Llama 3.2 1B Instruct · 1B · Q8_0 · 11.6 GB · 8,192 · Fits cleanly at Q8_0 + 8,192 ctx with 23% headroom.
Gemma 3 1B · 1B · Q4_K_M · 11.1 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 26% headroom.
Whisper Large v3 Turbo · 1B · FP16 · 3.5 GB · 0 · Comfortable fit with 77% headroom — room to extend context or run alongside other workloads.
BGE M3 · 1B · FP16 · 11.5 GB · 8,192 · Fits cleanly at FP16 + 8,192 ctx with 24% headroom.
BGE Reranker v2 M3 · 1B · FP16 · 11.5 GB · 8,192 · Fits cleanly at FP16 + 8,192 ctx with 24% headroom.
Qwen 2.5 0.5B Instruct · 1B · Q4_K_M · 10.6 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 30% headroom.
SmolLM 2 360M Instruct · 0B · Q4_K_M · 10.4 GB · 8,192 · Fits cleanly at Q4_K_M + 8,192 ctx with 31% headroom.
Borderline
16 models · tight, may need quant downgrade
Model · Params · Quant · VRAM est. · Context · Note
Molmo 7B-D · 8B · Q4_K_M · 13 GB · 4,096 · Tight fit at Q4_K_M — only 14% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Granite 3.0 8B Instruct · 8B · Q4_K_M · 13 GB · 4,096 · Tight fit at Q4_K_M — only 14% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 2.5 7B Instruct · 7B · Q4_K_M · 14.9 GB · 8,192 · Tight fit at Q4_K_M — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Phi-3.5 Vision · 4B · Q4_K_M · 14.8 GB · 8,192 · Tight fit at Q4_K_M — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 4B · 4B · Q4_K_M · 14.5 GB · 8,192 · Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Gemma 4 E4B (Effective 4B) · 4B · Q4_K_M · 14.5 GB · 8,192 · Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Gemma 3 4B · 4B · Q4_K_M · 14.5 GB · 8,192 · Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
MiniCPM 3 4B · 4B · Q4_K_M · 14.5 GB · 8,192 · Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Phi-3.5 Mini Instruct · 4B · Q4_K_M · 14.3 GB · 8,192 · Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Phi-4 Mini 4B · 4B · Q4_K_M · 14.3 GB · 8,192 · Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Phi-4 Reasoning Mini 4B · 4B · Q4_K_M · 14.3 GB · 8,192 · Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Llama 3.2 3B Instruct · 3B · Q8_0 · 14.8 GB · 8,192 · Tight fit at Q8_0 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 2.5 Coder 3B · 3B · Q4_K_M · 13.4 GB · 8,192 · Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Dolphin 3.0 Llama 3.2 3B · 3B · Q4_K_M · 13.4 GB · 8,192 · Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Hermes 3 Llama 3.2 3B · 3B · Q4_K_M · 13.4 GB · 8,192 · Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Ministral 3B Instruct · 3B · Q4_K_M · 13.4 GB · 8,192 · Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Not practical
16 models · oversize for this build
Model · Params · Quant · VRAM est. · Context · Note
PaliGemma 2 3B · 3B · BF16 · 17.8 GB · 8,192 · ~17.8 GB needed at BF16 + 8,192 ctx — overshoots effective VRAM by 19%. Drop quant or move to a larger build.
CodeGemma 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Falcon Mamba 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
DeepSeek R1 Distill Qwen 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Mistral 7B Instruct v0.3 · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Qwen 2.5 Coder 7B Instruct · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Qwen 2.5-VL 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Codestral Mamba 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Qwen 3 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Qwen 2-VL 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
StarCoder 2 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
CodeQwen 1.5 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
LLaVA 1.6 Mistral 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
LLaVA-OneVision 7B · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
Falcon 3 7B Instruct · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
InternLM 2.5 7B Chat · 7B · Q4_K_M · 17.9 GB · 8,192 · ~17.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build.
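For intuition about where "VRAM est." figures like these come from, a common back-of-envelope estimate is quantized weight bytes plus an FP16 KV cache. This sketch is a lower bound (it ignores activation buffers and runtime scratch, which is why real estimates run higher), and every architecture number in the example call is a hypothetical assumption, not a real model card:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     ctx: int, kv_cache_bytes: int = 2) -> float:
    """Rough VRAM lower bound in GB: quantized weights + FP16 KV cache."""
    weights = params_b * 1e9 * bits_per_weight / 8                    # weight bytes
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_cache_bytes  # K and V caches
    return (weights + kv) / 1e9

# Hypothetical 7B model: ~4.8 bits/weight (Q4_K_M-ish), 32 layers,
# 8 KV heads (GQA), head_dim 128, 8,192 context, 2-byte cache entries:
print(round(estimate_vram_gb(7, 4.8, 32, 8, 128, 8192), 1))  # 5.3
```

Activation memory, fragmentation, and safety margins are what push an engine's published estimate well above this floor.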

Related

  • Multi-GPU buying guide → NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
  • Hardware combinations → Curated multi-GPU / cluster setups with effective-VRAM math.
  • Setup path-finder → OS + runtime install commands for your stack.
  • Compatibility matrix → Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

See something off? Submit a benchmark · Report outdated · Suggest a correction. We read every submission; editorial review takes 1-7 days.