RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo


Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
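The shareable-URL behavior is plain query-string serialization of the build fields. A minimal sketch of the idea (the parameter names and URL here are hypothetical, not the site's actual scheme):

```python
from urllib.parse import urlencode


def build_url(base: str, build: dict) -> str:
    """Serialize a build dict into a shareable URL (param names illustrative)."""
    return f"{base}?{urlencode(build)}"


url = build_url("https://example.com/custom", {
    "gpus": "4x-h100-sxm",  # GPU list, encoded as one token
    "cpu": "epyc-9354",
    "ram": 512,
    "os": "linux",
    "runtime": "vllm",
    "use": "chat",
})
print(url)
```

Anyone opening that URL can reconstruct the build by parsing the same fields back out with `urllib.parse.parse_qs`.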

Build summary

  • Total VRAM: 320 GB
  • Effective VRAM: ~266 GB (range 251-273 GB)
  • Topology: single-node multi-GPU · PCIe
  • Setup difficulty: advanced · speed penalty ~18%
Why effective VRAM is lower than total

4× NVIDIA H100 SXM = 320 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs NVLink ~112 GB/s). With tensor-parallelism, each card holds ~1/4 of the model weights and replicates activations + KV cache. After 15% TP overhead, effective model capacity is ~266 GB. Largest single tensor on one card is ~78 GB.
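The arithmetic above can be approximated with a flat-overhead model. This is an illustrative sketch, not the site's exact engine: it assumes effective capacity is total VRAM times one minus the tensor-parallel overhead fraction, which lands in the same ballpark as the ~266 GB verdict rather than reproducing it exactly:

```python
def effective_vram_gb(num_gpus: int, vram_per_gpu_gb: float,
                      tp_overhead: float = 0.15) -> float:
    """Rough effective capacity under tensor parallelism.

    Each card holds 1/num_gpus of the weights but replicates
    activations + KV cache; that cost is modeled here as a flat
    overhead fraction of total VRAM.
    """
    total = num_gpus * vram_per_gpu_gb
    return total * (1.0 - tp_overhead)


# 4x H100 SXM (80 GB each) = 320 GB total
print(round(effective_vram_gb(4, 80), 1))  # 272.0 -- same ballpark as ~266 GB
```

The real engine evidently charges a slightly larger penalty (hence the 251-273 GB range); the flat 15% here is only the headline figure from the explanation above.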

Recommended runtime

Best engine for this topology + skill level + use case.

vLLM · primary · setup: involved

Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.

ExLlamaV2 · alternative · setup: involved

Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.

llama.cpp · alternative · setup: moderate

Layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM but easier to debug.
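For the llama.cpp route, --tensor-split takes comma-separated per-GPU proportions. A small helper (our own sketch, not part of llama.cpp) that derives the flag value from per-card VRAM:

```python
def tensor_split_arg(free_vram_gb: list[float]) -> str:
    """Build a llama.cpp --tensor-split value proportional to free VRAM."""
    total = sum(free_vram_gb)
    return ",".join(f"{v / total:.2f}" for v in free_vram_gb)


# Four equal 80 GB cards split into equal quarters
print(tensor_split_arg([80, 80, 80, 80]))  # 0.25,0.25,0.25,0.25

# Mixed cards get proportional shares (e.g. 24 GB + 12 GB)
print(tensor_split_arg([24, 12]))          # 0.67,0.33
```

Passing VRAM headroom rather than nameplate capacity matters on a card that also drives a display, since a few GB are already spoken for.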

Models that fit your build

183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable
24 models · ≥15% headroom
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Mixtral 8x22B Instruct | 141B | Q4_K_M | 169.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| WizardLM-2 8x22B | 141B | Q4_K_M | 169.9 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| DBRX Base | 132B | Q4_K_M | 159.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 40% headroom. |
| DBRX Instruct | 132B | AWQ-INT4 | 214.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 19% headroom. |
| Mistral Large 2 (123B) | 123B | Q4_K_M | 149.5 GB | 8,192 | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| Nemotron 3 Super (120B-A12B) | 120B | Q4_K_M | 146.1 GB | 8,192 | Comfortable fit with 45% headroom — room to extend context or run alongside other workloads. |
| Llama 4 Scout | 109B | Q5_K_M | 143.2 GB | 8,192 | Comfortable fit with 46% headroom — room to extend context or run alongside other workloads. |
| Command R+ (Aug 2024) | 104B | AWQ-INT4 | 171.2 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 36% headroom. |
| Command R+ 104B | 104B | Q4_K_M | 127.9 GB | 8,192 | Comfortable fit with 52% headroom — room to extend context or run alongside other workloads. |
| Llama 3.2 90B Vision Instruct | 90B | Q4_K_M | 112 GB | 8,192 | Comfortable fit with 58% headroom — room to extend context or run alongside other workloads. |
| Llama 3.2 90B Vision | 90B | AWQ-INT4 | 149.5 GB | 8,192 | Comfortable fit with 44% headroom — room to extend context or run alongside other workloads. |
| InternVL 2.5 78B | 78B | Q4_K_M | 98.4 GB | 8,192 | Comfortable fit with 63% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 Math 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Comfortable fit with 74% headroom — room to extend context or run alongside other workloads. |
| Qwen 3 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | Comfortable fit with 54% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5 72B Instruct | 72B | Q5_K_M | 98 GB | 8,192 | Comfortable fit with 63% headroom — room to extend context or run alongside other workloads. |
| Molmo 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Comfortable fit with 74% headroom — room to extend context or run alongside other workloads. |
| Qwen 2.5-VL 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | Comfortable fit with 54% headroom — room to extend context or run alongside other workloads. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | Comfortable fit with 66% headroom — room to extend context or run alongside other workloads. |
| Hermes 4 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Llama 3.3 70B Instruct | 70B | Q8_0 | 90.8 GB | 8,192 | Comfortable fit with 66% headroom — room to extend context or run alongside other workloads. |
| DeepSeek R1 Distill Llama 70B | 70B | Q5_K_M | 95.5 GB | 8,192 | Comfortable fit with 64% headroom — room to extend context or run alongside other workloads. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| EVA Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
| Llama 4 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | Comfortable fit with 55% headroom — room to extend context or run alongside other workloads. |
Borderline
2 models · tight, may need quant downgrade
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| GLM-5 | 200B | Q4_K_M | 236.8 GB | 8,192 | Tight fit at Q4_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| GLM-5 Pro | 144B | AWQ-INT4 | 233.2 GB | 8,192 | Tight fit at AWQ-INT4 — only 12% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
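The "KV cache for longer context will OOM" warnings reflect the fact that KV-cache size grows linearly with context length. A back-of-envelope estimator using standard attention math (the model shape in the example is a typical Llama-3-70B-class configuration, not a value pulled from this site's database):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9


# 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(round(kv_cache_gb(80, 8, 128, 8192), 2))    # 2.68 GB at 8,192 ctx
print(round(kv_cache_gb(80, 8, 128, 131072), 1))  # 42.9 GB at 131,072 ctx
```

Sixteen times the context means sixteen times the cache, which is why a model that fits at 8,192 ctx with 11% headroom cannot simply be handed a 128K window.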
Not practical
16 models · oversize for this build
| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Kimi K1.5 | 200B | AWQ-INT4 | 320 GB | 8,192 | ~320.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 20%. Drop quant or move to a larger build. |
| Qwen 3 235B-A22B | 235B | Q4_K_M | 276.5 GB | 8,192 | ~276.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| DeepSeek Coder V2 236B | 236B | Q4_K_M | 277.6 GB | 8,192 | ~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| DeepSeek V2.5 236B | 236B | Q4_K_M | 277.6 GB | 8,192 | ~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build. |
| Llama 3.1 Nemotron Ultra 253B | 253B | Q4_K_M | 296.9 GB | 8,192 | ~296.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 12%. Drop quant or move to a larger build. |
| DeepSeek V4 Flash (284B MoE) | 284B | Q4_K_M | 332 GB | 8,192 | ~332.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 25%. Drop quant or move to a larger build. |
| Hunyuan Large 389B MoE | 389B | Q4_K_M | 451.1 GB | 8,192 | ~451.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 70%. Drop quant or move to a larger build. |
| Qwen 3.5 235B-A17B (MoE) | 397B | Q4_K_M | 460.2 GB | 8,192 | ~460.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 73%. Drop quant or move to a larger build. |
| Jamba 1.5 Large | 398B | Q4_K_M | 461.3 GB | 8,192 | ~461.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 73%. Drop quant or move to a larger build. |
| Llama 4 Maverick | 400B | Q4_K_M | 463.6 GB | 8,192 | ~463.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 74%. Drop quant or move to a larger build. |
| Llama 4 405B | 405B | AWQ-INT4 | 637.7 GB | 8,192 | ~637.7 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 140%. Drop quant or move to a larger build. |
| DeepSeek V3 (671B MoE) | 671B | Q4_K_M | 770.9 GB | 8,192 | ~770.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 190%. Drop quant or move to a larger build. |
| DeepSeek R1 (671B reasoning) | 671B | Q4_K_M | 770.9 GB | 8,192 | ~770.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 190%. Drop quant or move to a larger build. |
| Mistral Medium 3.5 (675B MoE) | 675B | Q4_K_M | 775.4 GB | 8,192 | ~775.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 192%. Drop quant or move to a larger build. |
| DeepSeek V4 | 745B | AWQ-INT4 | 1164.7 GB | 8,192 | ~1164.7 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 338%. Drop quant or move to a larger build. |
| Kimi K2.6 | 1000B | Q4_K_M | 1143.9 GB | 8,192 | ~1143.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 330%. Drop quant or move to a larger build. |

Related

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 map out the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

See something off? Submit a benchmark · Report outdated · Suggest a correction. We read every submission. Editorial review takes 1-7 days.