RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.co · Independently operated

Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

Build summary

Total VRAM: 48 GB
Effective VRAM: ~33 GB (range 28-36 GB)
Topology: mixed GPU, PCIe
Setup difficulty: advanced (speed penalty ~35%)
Why effective VRAM is lower than total

Mixed-GPU (asymmetric) configuration. Tensor parallelism (TP) doesn't work cleanly here because it assumes identical cards: the faster card stalls waiting on the slower one at every layer. Use llama.cpp's layer-split mode with manual --tensor-split tuning to distribute layers in proportion to each card's VRAM. Effective capacity is ~33 GB after layer-split overhead, and the slowest card (22 GB effective) still bottlenecks single-tensor operations.

Measured evidence on this hardware

Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.

No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.

Recommended runtime

Best engine for this topology + skill level + use case.

llama.cpp (layer-split) · primary · involved

Mixed-GPU configurations need llama.cpp's --tensor-split flag with manual ratio tuning by VRAM. vLLM's tensor-parallel requires identical cards and won't run cleanly here.
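As a rough illustration of sizing the split by VRAM, the sketch below turns per-card memory into a proportion string for llama.cpp's --tensor-split and assembles a llama-server command in layer-split mode. The card sizes (24 GB and 16 GB), the 1 GB per-card reserve, the model file name, and both helper names are hypothetical; they are not taken from this build, and real ratios usually need tuning by hand after a first run.

```python
def tensor_split_arg(vram_gb, reserve_gb=1.0):
    """Return comma-separated proportions sized by usable VRAM per card.

    reserve_gb is an assumed per-card allowance for runtime overhead.
    """
    usable = [max(v - reserve_gb, 0.0) for v in vram_gb]
    total = sum(usable)
    if total <= 0:
        raise ValueError("no usable VRAM left after reserving overhead")
    return ",".join(f"{u / total:.2f}" for u in usable)


def llama_server_command(model_path, vram_gb):
    """Build a llama-server argument list that splits layers across GPUs."""
    return [
        "llama-server",
        "-m", model_path,          # hypothetical GGUF path
        "-ngl", "99",              # offload as many layers as possible
        "--split-mode", "layer",   # whole layers per GPU, not tensor rows
        "--tensor-split", tensor_split_arg(vram_gb),
    ]


# Hypothetical asymmetric pair: one 24 GB card and one 16 GB card.
cmd = llama_server_command("model.gguf", [24.0, 16.0])
print(" ".join(cmd))
```

With these assumed sizes the split comes out as 0.61,0.39. If one card runs out of memory at load time, shifting a few points of the ratio toward the other card is the usual first adjustment.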

Ollama · alternative · moderate

Inherits llama.cpp's layer-split path with friendlier UX. OLLAMA_GPU_OVERHEAD and per-card env vars do most of what manual flags do.

WORKLOAD PROFILE: FITS
Pollux Judge 32B @ Q4_K_M, 4K context on NVIDIA GeForce RTX 4090
VRAM ceiling: 33 GB
Weights: 18 GB
KV cache: 8.0 GB
Activations: 0.9 GB
Runtime: 1.8 GB
Headroom: 4.7 GB
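To see how a KV-cache figure of this size can arise, here is a minimal sketch of the standard sizing formula: two tensors (K and V) per layer, each holding one vector per KV head per token. The layer count, head counts, and head dimension below are hypothetical shapes chosen for illustration; they are not the published architecture of any model on this page, and the engine's own formula is not documented here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    2 accounts for the separate K and V tensors; bytes_per_elem=2 assumes FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical shape without grouped-query attention: 64 layers, 64 KV heads.
full = kv_cache_bytes(n_layers=64, n_kv_heads=64, head_dim=128, context_len=4096)

# Same hypothetical shape with grouped-query attention: only 8 KV heads.
gqa = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=4096)

print(f"full attention: {full / GIB:.1f} GiB, GQA: {gqa / GIB:.1f} GiB")
```

Under these assumptions the cache is 8 GiB at 4,096 tokens without grouped-query attention and 1 GiB with it, which is why the KV line can vary so much between models of similar parameter count.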
ESTIMATED DECODE RATE: 32 tok/s
Bandwidth-derived estimate · efficiency 0.55. Real-world rates land within ±20% on well-tuned runtimes.
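A bandwidth-derived estimate of this kind can be sketched as follows, assuming decode is memory-bound so that each generated token requires reading the full set of weights once. The function name is hypothetical, the 1008 GB/s figure is the RTX 4090's rated memory bandwidth, and the formula is a common rule of thumb rather than the engine's confirmed method.

```python
def estimate_decode_tok_s(bandwidth_gb_s, weights_gb, efficiency=0.55):
    """Rule-of-thumb decode rate for a memory-bound model.

    Assumes every token reads all weights once; efficiency scales the
    rated bandwidth down to what a runtime typically sustains.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return efficiency * bandwidth_gb_s / weights_gb


# 18 GB of weights and efficiency 0.55 come from the workload profile above;
# 1008 GB/s is the assumed bandwidth input.
rate = estimate_decode_tok_s(bandwidth_gb_s=1008.0, weights_gb=18.0)
print(f"{rate:.1f} tok/s")
```

This simplified version gives about 30.8 tok/s, which falls inside the ±20% band around the page's 32 tok/s figure. It ignores KV-cache reads and multi-GPU transfer costs, both of which lower the real rate.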

Models that fit your build

315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable
24 models · ≥15% headroom
Model | Params | Quant | VRAM est. | Context | Evidence | Note
Pollux Judge 32B | 32B | Q4_K_M | 26.5 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 20% headroom.
Qwen 2.5 Coder 32B Instruct | 32B | Q4_K_M | 22.1 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 33% headroom.
Sarvam 30B | 30B | Q4_K_M | 24.8 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 25% headroom.
Gemma 4 Turkish 26B (4B active) | 26B | Q4_K_M | 28 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 15% headroom.
Mistral Small 3 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Mistral Medium 3 24B (dense) | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Dolphin 3.0 Mistral 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Mistral Saba 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Mistral Small 3.2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Devstral Small 2 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Sarvam M | 24B | Q4_K_M | 19.9 GB | 4,096 | No measured row yet | Fits cleanly at Q4_K_M + 4,096 ctx with 40% headroom.
DeepSeek R1 Distill Mistral 24B | 24B | Q4_K_M | 26.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Codestral 22B | 22B | Q4_K_M | 24.7 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 25% headroom.
GPT-OSS Swallow 20B RL v0.1 | 20B | Q4_K_M | 21.6 GB | 8,192 | No measured row yet | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom.
GPT-NeoX 20B | 20B | Q4_K_M | 14.1 GB | 2,048 | No measured row yet | Comfortable fit with 57% headroom; room to extend context or run alongside other workloads.
DeepSeek V3 Lite (16B MoE) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom; room to extend context or run alongside other workloads.
DeepSeek Coder V2 Lite (16B) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom; room to extend context or run alongside other workloads.
Granite 3 MoE (3B active) | 16B | Q4_K_M | 18 GB | 8,192 | No measured row yet | Comfortable fit with 46% headroom; room to extend context or run alongside other workloads.
DeepSeek MoE 16B Base | 16B | Q4_K_M | 14 GB | 4,096 | No measured row yet | Comfortable fit with 58% headroom; room to extend context or run alongside other workloads.
DeepSeek V2 Lite Chat | 16B | Q4_K_M | 16.9 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom; room to extend context or run alongside other workloads.
StarCoder 2 15B | 15B | Q4_K_M | 17 GB | 8,192 | No measured row yet | Comfortable fit with 49% headroom; room to extend context or run alongside other workloads.
Phi-4 14B | 14B | Q8_0 | 22.8 GB | 8,192 | No measured row yet | Fits cleanly at Q8_0 + 8,192 ctx with 31% headroom.
Qwen 2.5 14B Instruct | 14B | Q8_0 | 23.5 GB | 8,192 | No measured row yet | Fits cleanly at Q8_0 + 8,192 ctx with 29% headroom.
Qwen 2.5 Coder 14B Instruct | 14B | Q4_K_M | 15.8 GB | 8,192 | No measured row yet | Comfortable fit with 52% headroom; room to extend context or run alongside other workloads.
Borderline
6 models · tight, may need quant downgrade
Model | Params | Quant | VRAM est. | Context | Evidence | Note
Falcon 40B Instruct | 40B | Q4_K_M | 28.1 GB | 2,048 | No measured row yet | Tight fit at Q4_K_M: only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Gemma 3 27B | 27B | Q4_K_M | 30.3 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M: only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3.6 27B (MTP) | 27B | Q5_K_M | 33 GB | 8,192 | No measured row yet | Tight fit at Q5_K_M: only 0% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
MedGemma 27B | 27B | Q4_K_M | 30.3 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M: only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
InternVL 2.5 26B | 26B | Q4_K_M | 29.8 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M: only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Gemma 4 26B MoE | 26B | Q4_K_M | 29.8 GB | 8,192 | No measured row yet | Tight fit at Q4_K_M: only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Not practical
16 models · oversize for this build
Model | Params | Quant | VRAM est. | Context | Evidence | Note
Qwen 3 30B-A3B | 30B | Q4_K_M | 33.9 GB | 8,192 | No measured row yet | ~33.9 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 3%. Drop quant or move to a larger build.
Nemotron 3 Nano (30B-A3B) | 30B | Q4_K_M | 33.9 GB | 8,192 | No measured row yet | ~33.9 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 3%. Drop quant or move to a larger build.
Omni 31B Turkish Reasoning | 31B | Q4_K_M | 33.5 GB | 8,192 | No measured row yet | ~33.5 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 1%. Drop quant or move to a larger build.
Gemma 4 31B Dense | 31B | Q4_K_M | 34.4 GB | 8,192 | No measured row yet | ~34.4 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 4%. Drop quant or move to a larger build.
EXAONE 3.5 32B Instruct | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 5%. Drop quant or move to a larger build.
EXAONE 3.5 32B Instruct AWQ | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 5%. Drop quant or move to a larger build.
Qwen 2.5 32B Instruct | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
Magistral 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
Aya Expanse 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
QwQ 32B Preview | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
EXAONE 4.0.1 32B | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 5%. Drop quant or move to a larger build.
Qwen 3 Coder 32B | 32B | AWQ-INT4 | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at AWQ-INT4 + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
Qwen3 Swallow 32B RL v0.2 | 32B | Q4_K_M | 34.5 GB | 8,192 | No measured row yet | ~34.5 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 5%. Drop quant or move to a larger build.
Qwen 3 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
DeepSeek R1 Distill Qwen 32B | 32B | Q4_K_M | 36 GB | 8,192 | No measured row yet | ~36.0 GB needed at Q4_K_M + 8,192 ctx; overshoots effective VRAM by 9%. Drop quant or move to a larger build.
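
The three tiers can be approximated with a small headroom calculation. This is a sketch under assumptions: headroom is taken as 1 minus the VRAM estimate divided by effective VRAM, the comfortable cutoff is the 15% stated above, and 33 GB is the rounded effective figure from the build summary. The function names are hypothetical and the engine's exact rule is not published on this page.

```python
def headroom(vram_est_gb, effective_gb=33.0):
    """Fraction of effective VRAM left over; negative means an overshoot."""
    return 1.0 - vram_est_gb / effective_gb


def fit_tier(vram_est_gb, effective_gb=33.0, comfortable_min=0.15):
    """Classify a model by headroom, using the 15% cutoff for 'comfortable'."""
    h = headroom(vram_est_gb, effective_gb)
    if h < 0:
        return "not practical"
    if h >= comfortable_min:
        return "comfortable"
    return "borderline"


# VRAM estimates taken from rows in the tables above.
for name, need in [
    ("Pollux Judge 32B", 26.5),
    ("Gemma 4 Turkish 26B", 28.0),
    ("Falcon 40B Instruct", 28.1),
    ("Gemma 3 27B", 30.3),
    ("Qwen 3 30B-A3B", 33.9),
]:
    print(f"{name}: {headroom(need):.1%} headroom, {fit_tier(need)}")
```

Under this assumed rule, Falcon 40B Instruct at 28.1 GB has about 14.8% headroom, which displays as 15% but sits just under the cutoff, while Gemma 4 Turkish 26B at 28 GB has about 15.2% and clears it. That would account for two rows showing the same rounded percentage in different tiers.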

Related

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 map to the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

See something off? Submit a benchmark · Report outdated · Suggest a correction. We read every submission. Editorial review takes 1-7 days.