RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
← Back to Will-it-run

Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

Build summary

Total VRAM
192 GB
Effective VRAM
~150 GB
range 142-154 GB
Topology
single node multi gpu
pcie
Setup difficulty
advanced
speed penalty ~18%
Why effective VRAM is lower than total

8× NVIDIA GeForce RTX 4090 = 192 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs NVLink ~112 GB/s). With tensor-parallelism, each card holds ~1/8 of the model weights and replicates activations + KV cache. After 15% TP overhead, effective model capacity is ~150 GB. Largest single tensor on one card is ~22 GB.

Measured evidence on this hardware

Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.

No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.

Recommended runtime

Best engine for this topology + skill level + use case.

Ray Serve
primary
involved

Ray Serve is the runtime you selected. It is compatible enough for this build, so the custom engine keeps it as the primary path while still showing alternatives.

vLLM
primary
involved

Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.

ExLlamaV2
alternative
involved

Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.

WORKLOAD PROFILE
FITS
Sarvam 105B @ Q4_K_M, 8K context on NVIDIA GeForce RTX 4090
0 GB150 GBVRAM ceiling
Weights58 GB
KV cache53 GB
Activations2.9 GB
Runtime1.8 GB
Headroom35 GB
ESTIMATED DECODE RATE
10 tok/s
Bandwidth-derived estimate · efficiency 0.55. Real-world rates land within ±20% on well-tuned runtimes.
10 tokens per second02550100150

Models that fit your build

315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable
24 models · ≥15% headroom
ModelParamsQuantVRAM est.ContextEvidenceNote
Sarvam 105B105BQ4_K_M113.2 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 25% headroom.
Sarvam 105B FP8105BQ4_K_M113.2 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 25% headroom.
Command R+ 104B104BQ4_K_M115 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 23% headroom.
Command R+ (Aug 2024)104BAWQ-INT4115 GB8,192No measured row yetFits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom.
Llama 3.2 90B Vision90BAWQ-INT499.6 GB8,192No measured row yetFits cleanly at AWQ-INT4 + 8,192 ctx with 34% headroom.
Llama 3.2 90B Vision Instruct90BQ4_K_M98.6 GB8,192No measured row yetFits cleanly at Q4_K_M + 8,192 ctx with 34% headroom.
InternVL 2.5 78B78BQ4_K_M86.3 GB8,192No measured row yetComfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Math 72B72BQ4_K_M61.1 GB4,096No measured row yetComfortable fit with 59% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 72B Instruct72BQ5_K_M87.5 GB8,192No measured row yetComfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Molmo 72B72BQ4_K_M61.1 GB4,096No measured row yetComfortable fit with 59% headroom — room to extend context or run alongside other workloads.
Qwen 2.5-VL 72B72BAWQ-INT480.1 GB8,192No measured row yetComfortable fit with 47% headroom — room to extend context or run alongside other workloads.
Dolphin 3 Llama 3.3 70B70BAWQ-INT477 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Llama 70B70BQ5_K_M84.4 GB8,192No measured row yetComfortable fit with 44% headroom — room to extend context or run alongside other workloads.
Llama 3.1 70B Instruct70BQ5_K_M84.4 GB8,192No measured row yetComfortable fit with 44% headroom — room to extend context or run alongside other workloads.
Tulu 3 70B70BQ4_K_M77 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Hermes 3 Llama 3.1 70B70BQ4_K_M77 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Llama 3.3 70B Instruct70BQ8_076.2 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Hermes 4 Llama 3.3 70B70BAWQ-INT477 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Llama 3.1 Nemotron 70B Instruct70BQ4_K_M77 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Hermes 4 70B FP870BQ4_K_M75.4 GB8,192No measured row yetComfortable fit with 50% headroom — room to extend context or run alongside other workloads.
Llama 4 70B70BAWQ-INT477 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
OpenBioLLM Llama 3 70B70BQ4_K_M79.1 GB8,192No measured row yetComfortable fit with 47% headroom — room to extend context or run alongside other workloads.
EVA Llama 3.3 70B70BAWQ-INT477 GB8,192No measured row yetComfortable fit with 49% headroom — room to extend context or run alongside other workloads.
Jamba 1.5 Mini52BQ4_K_M57.5 GB8,192No measured row yetComfortable fit with 62% headroom — room to extend context or run alongside other workloads.
Borderline
5 models · tight, may need quant downgrade
ModelParamsQuantVRAM est.ContextEvidenceNote
DBRX Base132BQ4_K_M144.8 GB8,192No measured row yetTight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
DBRX Instruct132BAWQ-INT4144.8 GB8,192No measured row yetTight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Mistral Large 2 (123B)123BQ4_K_M138.2 GB8,192No measured row yetTight fit at Q4_K_M — only 8% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Nemotron 3 Super (120B-A12B)120BQ4_K_M135.6 GB8,192No measured row yetTight fit at Q4_K_M — only 10% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Llama 4 Scout109BQ5_K_M136.4 GB8,192No measured row yetTight fit at Q5_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Not practical
16 models · oversize for this build
ModelParamsQuantVRAM est.ContextEvidenceNote
Mixtral 8x22B Instruct141BQ4_K_M158.7 GB8,192No measured row yet~158.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 6%. Drop quant or move to a larger build.
WizardLM-2 8x22B141BQ4_K_M158.7 GB8,192No measured row yet~158.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 6%. Drop quant or move to a larger build.
GLM-5 Pro144BAWQ-INT4158.1 GB8,192No measured row yet~158.1 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 5%. Drop quant or move to a larger build.
GLM-5200BQ4_K_M226 GB8,192No measured row yet~226.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 51%. Drop quant or move to a larger build.
Kimi K1.5200BAWQ-INT4220.8 GB8,192No measured row yet~220.8 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 47%. Drop quant or move to a larger build.
Qwen 3 235B-A22B235BQ4_K_M266.6 GB8,192No measured row yet~266.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 78%. Drop quant or move to a larger build.
DeepSeek Coder V2 236B236BQ4_K_M258.7 GB8,192No measured row yet~258.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 72%. Drop quant or move to a larger build.
DeepSeek V2.5 236B236BQ4_K_M258.7 GB8,192No measured row yet~258.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 72%. Drop quant or move to a larger build.
K-EXAONE 236B A23B236BQ4_K_M254.3 GB8,192No measured row yet~254.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 70%. Drop quant or move to a larger build.
Llama 3.1 Nemotron Ultra 253B253BQ4_K_M277.7 GB8,192No measured row yet~277.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 85%. Drop quant or move to a larger build.
DeepSeek V4 Flash (284B MoE)284BQ4_K_M312.1 GB8,192No measured row yet~312.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build.
Hunyuan Large 389B MoE389BQ4_K_M425.5 GB8,192No measured row yet~425.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 184%. Drop quant or move to a larger build.
Qwen 3.5 235B-A17B (MoE)397BQ4_K_M435.8 GB8,192No measured row yet~435.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 191%. Drop quant or move to a larger build.
Jamba 1.5 Large398BQ4_K_M440.5 GB8,192No measured row yet~440.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 194%. Drop quant or move to a larger build.
Llama 4 Maverick400BQ4_K_M452 GB8,192No measured row yet~452.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 201%. Drop quant or move to a larger build.
Llama 4 405B405BAWQ-INT4444 GB8,192No measured row yet~444.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 196%. Drop quant or move to a larger build.

Related

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

See something off?Submit a benchmark·Report outdated·Suggest a correctionWe read every submission. Editorial review takes 1-7 days.