Custom build engine — what local AI can your hardware run?

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

GPUs in your build

CPU class

System RAM (GB)

OS

Use case

Your skill level

Runtime preference

Build summary

Total VRAM

192 GB

Effective VRAM

~153 GB

range 144-156 GB

Topology

single node multi gpu

pcie

Setup difficulty

advanced

speed penalty ~18%

Why effective VRAM is lower than total

8× NVIDIA GeForce RTX 4090 = 192 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs NVLink ~112 GB/s). With tensor-parallelism, each card holds ~1/8 of the model weights and replicates activations + KV cache. After 15% TP overhead, effective model capacity is ~153 GB. Largest single tensor on one card is ~22 GB.

Recommended runtime

Best engine for this topology + skill level + use case.

vLLM

primary

involved

Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.

ExLlamaV2

alternative

involved

Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.

llama.cpp

alternative

moderate

Layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM but easier to debug.

Models that fit your build

183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable

24 models · ≥15% headroom

Model	Params	Quant	VRAM est.	Context	Note
Command R+ 104B	104B	Q4_K_M	127.9 GB	8,192	Fits cleanly at Q4_K_M + 8,192 ctx with 16% headroom.
Llama 3.2 90B Vision Instruct	90B	Q4_K_M	112 GB	8,192	Fits cleanly at Q4_K_M + 8,192 ctx with 27% headroom.
InternVL 2.5 78B	78B	Q4_K_M	98.4 GB	8,192	Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom.
Qwen 2.5 Math 72B	72B	Q4_K_M	69.5 GB	4,096	Comfortable fit with 55% headroom — room to extend context or run alongside other workloads.
Qwen 3 72B	72B	AWQ-INT4	121.6 GB	8,192	Fits cleanly at AWQ-INT4 + 8,192 ctx with 21% headroom.
Qwen 2.5 72B Instruct	72B	Q5_K_M	98 GB	8,192	Fits cleanly at Q5_K_M + 8,192 ctx with 36% headroom.
Molmo 72B	72B	Q4_K_M	69.5 GB	4,096	Comfortable fit with 55% headroom — room to extend context or run alongside other workloads.
Qwen 2.5-VL 72B	72B	AWQ-INT4	121.6 GB	8,192	Fits cleanly at AWQ-INT4 + 8,192 ctx with 21% headroom.
Hermes 3 Llama 3.1 70B	70B	Q4_K_M	89.4 GB	8,192	Comfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Hermes 4 Llama 3.3 70B	70B	AWQ-INT4	118.5 GB	8,192	Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom.
Llama 3.3 70B Instruct	70B	Q8_0	90.8 GB	8,192	Comfortable fit with 41% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Llama 70B	70B	Q5_K_M	95.5 GB	8,192	Fits cleanly at Q5_K_M + 8,192 ctx with 38% headroom.
Dolphin 3 Llama 3.3 70B	70B	AWQ-INT4	118.5 GB	8,192	Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom.
EVA Llama 3.3 70B	70B	AWQ-INT4	118.5 GB	8,192	Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom.
Llama 4 70B	70B	AWQ-INT4	118.5 GB	8,192	Fits cleanly at AWQ-INT4 + 8,192 ctx with 23% headroom.
Llama 3.1 Nemotron 70B Instruct	70B	Q4_K_M	89.4 GB	8,192	Comfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Llama 3.1 70B Instruct	70B	Q5_K_M	95.5 GB	8,192	Fits cleanly at Q5_K_M + 8,192 ctx with 38% headroom.
OpenBioLLM Llama 3 70B	70B	Q4_K_M	89.4 GB	8,192	Comfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Tulu 3 70B	70B	Q4_K_M	89.4 GB	8,192	Comfortable fit with 42% headroom — room to extend context or run alongside other workloads.
Jamba 1.5 Mini	52B	Q4_K_M	69 GB	8,192	Comfortable fit with 55% headroom — room to extend context or run alongside other workloads.
Nemotron 3 Super 49B	49B	AWQ-INT4	85.9 GB	8,192	Comfortable fit with 44% headroom — room to extend context or run alongside other workloads.
Mixtral 8x7B Instruct	47B	Q5_K_M	67.4 GB	8,192	Comfortable fit with 56% headroom — room to extend context or run alongside other workloads.
Aya 23 35B	35B	Q4_K_M	49.7 GB	8,192	Comfortable fit with 68% headroom — room to extend context or run alongside other workloads.
Command R 35B	35B	Q4_K_M	49.7 GB	8,192	Comfortable fit with 68% headroom — room to extend context or run alongside other workloads.

Borderline

4 models · tight, may need quant downgrade

Model	Params	Quant	VRAM est.	Context	Note
Mistral Large 2 (123B)	123B	Q4_K_M	149.5 GB	8,192	Tight fit at Q4_K_M — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Nemotron 3 Super (120B-A12B)	120B	Q4_K_M	146.1 GB	8,192	Tight fit at Q4_K_M — only 5% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Llama 4 Scout	109B	Q5_K_M	143.2 GB	8,192	Tight fit at Q5_K_M — only 6% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Llama 3.2 90B Vision	90B	AWQ-INT4	149.5 GB	8,192	Tight fit at AWQ-INT4 — only 2% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.

Not practical

16 models · oversize for this build

Model	Params	Quant	VRAM est.	Context	Note
Command R+ (Aug 2024)	104B	AWQ-INT4	171.2 GB	8,192	~171.2 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 12%. Drop quant or move to a larger build.
DBRX Base	132B	Q4_K_M	159.7 GB	8,192	~159.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 4%. Drop quant or move to a larger build.
DBRX Instruct	132B	AWQ-INT4	214.6 GB	8,192	~214.6 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 40%. Drop quant or move to a larger build.
Mixtral 8x22B Instruct	141B	Q4_K_M	169.9 GB	8,192	~169.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 11%. Drop quant or move to a larger build.
WizardLM-2 8x22B	141B	Q4_K_M	169.9 GB	8,192	~169.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 11%. Drop quant or move to a larger build.
GLM-5 Pro	144B	AWQ-INT4	233.2 GB	8,192	~233.2 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 52%. Drop quant or move to a larger build.
GLM-5	200B	Q4_K_M	236.8 GB	8,192	~236.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 55%. Drop quant or move to a larger build.
Kimi K1.5	200B	AWQ-INT4	320 GB	8,192	~320.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 109%. Drop quant or move to a larger build.
Qwen 3 235B-A22B	235B	Q4_K_M	276.5 GB	8,192	~276.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 81%. Drop quant or move to a larger build.
DeepSeek Coder V2 236B	236B	Q4_K_M	277.6 GB	8,192	~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 81%. Drop quant or move to a larger build.
DeepSeek V2.5 236B	236B	Q4_K_M	277.6 GB	8,192	~277.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 81%. Drop quant or move to a larger build.
Llama 3.1 Nemotron Ultra 253B	253B	Q4_K_M	296.9 GB	8,192	~296.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 94%. Drop quant or move to a larger build.
DeepSeek V4 Flash (284B MoE)	284B	Q4_K_M	332 GB	8,192	~332.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 117%. Drop quant or move to a larger build.
Hunyuan Large 389B MoE	389B	Q4_K_M	451.1 GB	8,192	~451.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 195%. Drop quant or move to a larger build.
Qwen 3.5 235B-A17B (MoE)	397B	Q4_K_M	460.2 GB	8,192	~460.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 201%. Drop quant or move to a larger build.
Jamba 1.5 Large	398B	Q4_K_M	461.3 GB	8,192	~461.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 202%. Drop quant or move to a larger build.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.

Describe your build

Build summary

Recommended runtime

Models that fit your build

Related

Shopping a full build instead of a single card?