RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
← Back to Will-it-run

Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

Build summary

Total VRAM
48 GB
Effective VRAM
~46 GB
range 42-46 GB
Topology
single gpu
none
Setup difficulty
beginner
speed penalty ~0%
Why effective VRAM is lower than total

Single NVIDIA RTX 4090 48GB (China-mod) — 48 GB VRAM minus ~1.8 GB runtime/driver overhead = ~46 GB usable for weights + KV cache + activations. The remaining uncertainty band covers OS display use and background CUDA allocations.

Measured evidence on this hardware

Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.

No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.

Coding agents — agentic tool-call burst

Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.

Bottleneck: kv cache

Coding agents emit 5-15 tool calls per task. Each call carries the full agent system prompt + context. KV-cache budget for that prompt × concurrent requests is the limit. The decode side is well-served by any modern card; the prefill side bottlenecks first.

Budget for
  • •32K context with KV-cache room to spare (~3-4 GB on 4090 AWQ-INT4)
  • •Prefix cache: prefer SGLang for >5 tool calls / task
  • •Decode latency: aim for >40 tok/s sustained

Recommended runtime

Best engine for this topology + skill level + use case.

vLLM
primary
involved

AWQ-INT4 path fits 32B-class models on a 24 GB card with concurrent users. The production-default for self-hosted coding agents and multi-user serving.

ExLlamaV2
alternative
involved

Single-stream throughput king on consumer NVIDIA. EXL2 4.65bpw on a 4090 hits the highest tok/s in this class.

WORKLOAD PROFILE
FITS
Phind CodeLlama 34B v2 @ Q4_K_M, 8K context on NVIDIA RTX 4090 48GB (China-mod)
0 GB48 GBVRAM ceiling
Weights20 GB
KV cache17 GB
Activations1.0 GB
Runtime1.8 GB
Headroom8.2 GB
ESTIMATED DECODE RATE
35 tok/s
Bandwidth-derived estimate · efficiency 0.70. Real-world rates land within ±20% on well-tuned runtimes.
35 tokens per second02550100150

Models that fit your build

63 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable
24 models · ≥15% headroom
ModelParamsQuantVRAM est.ContextEvidenceNote
Phind CodeLlama 34B v234BQ4_K_M38 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 17% headroom.
DeepSeek Coder V333BAWQ-INT436.5 GB8,192No measured row yetFits cleanly at AWQ-INT4 + 8,192 ctx with 21% headroom.
Qwen 2.5 32B Instruct32BQ4_K_M36 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 22% headroom.
DeepSeek R1 Distill Qwen 3 32B32BAWQ-INT436 GB8,192No measured row yetFits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom.
Qwen 3 Coder 32B32BAWQ-INT436 GB8,192No measured row yetFits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom.
Qwen 2.5 Coder 32B Instruct32BQ8_037.9 GB8,192No measured row yetFits cleanly at Q8_0 + 8,192 ctx with 18% headroom.
Qwen3 Swallow 32B RL v0.232BQ4_K_M34.5 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 25% headroom.
Gemma 4 31B Dense31BQ4_K_M34.4 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 25% headroom.
Qwen 3 30B-A3B30BQ4_K_M33.9 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 26% headroom.
Sarvam 30B30BQ4_K_M24.8 GB4,096No measured row yetComfortable fit with 46% headroom — room to extend context or run alongside other workloads.
Devstral Small 2 24B24BQ4_K_M26.7 GB8,192No measured row yetComfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Codestral 22B22BQ8_035.2 GB8,192No measured row yetFits cleanly at Q8_0 + 8,192 ctx with 24% headroom.
DeepSeek V3 Lite (16B MoE)16BQ4_K_M18 GB8,192No measured row yetComfortable fit with 61% headroom — room to extend context or run alongside other workloads.
DeepSeek Coder V2 Lite (16B)16BQ4_K_M18 GB8,192No measured row yetComfortable fit with 61% headroom — room to extend context or run alongside other workloads.
StarCoder 2 15B15BQ4_K_M17 GB8,192No measured row yetComfortable fit with 63% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 14B Instruct14BQ8_023.5 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 14B Instruct14BQ4_K_M15.8 GB8,192No measured row yetComfortable fit with 66% headroom — room to extend context or run alongside other workloads.
Qwen 3 14B14BQ8_022.8 GB8,192No measured row yetComfortable fit with 51% headroom — room to extend context or run alongside other workloads.
Yi Coder 9B9BQ4_K_M10.2 GB8,192No measured row yetComfortable fit with 78% headroom — room to extend context or run alongside other workloads.
OpenCoder 8B8BQ4_K_M13 GB16,384No measured row yetComfortable fit with 72% headroom — room to extend context or run alongside other workloads.
Llama 3.1 8B Instruct8BFP1619.1 GB16,384No measured row yetComfortable fit with 59% headroom — room to extend context or run alongside other workloads.
Qwen 3 8B8BQ8_016.6 GB16,384No measured row yetComfortable fit with 64% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Llama 8B8BQ4_K_M13 GB16,384No measured row yetComfortable fit with 72% headroom — room to extend context or run alongside other workloads.
Gervásio 8B PTPT8BQ4_K_M6.6 GB4,096No measured row yetComfortable fit with 86% headroom — room to extend context or run alongside other workloads.
Borderline
5 models · tight, may need quant downgrade
ModelParamsQuantVRAM est.ContextEvidenceNote
Llama 3.3 70B Instruct70BQ4_K_M44.7 GB8,192No measured row yetTight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3.6 35B-A3B (MTP)35BQ5_K_M42.8 GB8,192No measured row yetTight fit at Q5_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 32B32BQ5_K_M39.1 GB8,192No measured row yetTight fit at Q5_K_M — only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3.6 27B (MTP)27BQ8_043.6 GB8,192No measured row yetTight fit at Q8_0 — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Mistral Small 3 24B24BQ8_039.3 GB8,192No measured row yetTight fit at Q8_0 — only 15% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Not practical
15 models · oversize for this build
ModelParamsQuantVRAM est.ContextEvidenceNote
Llama 3.1 70B Instruct70BQ4_K_M77 GB8,192No measured row yet~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 67%. Drop quant or move to a larger build.
Llama 4 70B70BAWQ-INT477 GB8,192No measured row yet~77.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 67%. Drop quant or move to a larger build.
Sarvam 105B105BQ4_K_M113.2 GB8,192No measured row yet~113.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 146%. Drop quant or move to a larger build.
Llama 4 Scout109BQ4_K_M122.8 GB8,192No measured row yet~122.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 167%. Drop quant or move to a larger build.
Qwen 3 235B-A22B235BQ4_K_M266.6 GB8,192No measured row yet~266.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 480%. Drop quant or move to a larger build.
DeepSeek Coder V2 236B236BQ4_K_M258.7 GB8,192No measured row yet~258.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 462%. Drop quant or move to a larger build.
DeepSeek V2.5 236B236BQ4_K_M258.7 GB8,192No measured row yet~258.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 462%. Drop quant or move to a larger build.
DeepSeek V4 Flash (284B MoE)284BQ4_K_M312.1 GB8,192No measured row yet~312.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 578%. Drop quant or move to a larger build.
MiniMax-M3428BQ3_K_M433.1 GB8,192No measured row yet~433.1 GB needed at Q3_K_M + 8,192 ctx — overshoots effective VRAM by 842%. Drop quant or move to a larger build.
DeepSeek V4745BAWQ-INT4818.8 GB8,192No measured row yet~818.8 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 1680%. Drop quant or move to a larger build.
GLM-5.2753BQ3_K_M762 GB8,192No measured row yet~762.0 GB needed at Q3_K_M + 8,192 ctx — overshoots effective VRAM by 1556%. Drop quant or move to a larger build.
Kimi K2.61000BQ4_K_M1130 GB8,192No measured row yet~1130.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 2357%. Drop quant or move to a larger build.
Ring-2.6-1T1000BQ3_K_M1011.9 GB8,192No measured row yet~1011.9 GB needed at Q3_K_M + 8,192 ctx — overshoots effective VRAM by 2100%. Drop quant or move to a larger build.
Kimi K2.7-Code1000BQ3_K_M1011.9 GB8,192No measured row yet~1011.9 GB needed at Q3_K_M + 8,192 ctx — overshoots effective VRAM by 2100%. Drop quant or move to a larger build.
DeepSeek V4 Pro (1.6T MoE)1600BQ4_K_M1766 GB8,192No measured row yet~1766.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3739%. Drop quant or move to a larger build.

Related

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

See something off?Submit a benchmark·Report outdated·Suggest a correctionWe read every submission. Editorial review takes 1-7 days.