Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.
Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective VRAM is lower than the total.
Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.
4× NVIDIA GeForce RTX 3090 = 96 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs ~112 GB/s over NVLink). With tensor parallelism, each card holds ~1/4 of the model weights and replicates activations + KV cache. After ~15% TP overhead plus per-card framework reserves, effective model capacity is ~76 GB — not 96. And the largest single tensor must still fit on one card: ~22 GB here.
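Spelled out, the arithmetic looks like the minimal sketch below. The 15% replication overhead and the ~1.4 GB per-card reserve are assumptions chosen to reproduce the ~76 GB figure, not the calculator's exact internals.

```python
# Minimal sketch of the effective-VRAM math above. Overhead constants are
# assumptions that reproduce the ~76 GB figure, not the tool's internals.

def effective_vram_gb(cards_gb: list[float], tp_overhead: float = 0.15,
                      per_card_reserve_gb: float = 1.4) -> float:
    """Tensor-parallel capacity: total VRAM minus activation/KV replication
    overhead and a fixed per-card framework reserve."""
    total = sum(cards_gb)
    return total * (1 - tp_overhead) - per_card_reserve_gb * len(cards_gb)

cards = [24.0] * 4  # 4x RTX 3090 = 96 GB total
print(f"{effective_vram_gb(cards):.1f} GB effective")    # ~76.0 GB
print(f"{min(cards) - 2.0:.1f} GB largest tensor/card")  # ~22 GB after a ~2 GB reserve
```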
Best engine for this topology + skill level + use case.
vLLM — tensor-parallel across NVLink/PCIe; works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly (launch sketch below).
ExLlamaV2 — the single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.
llama.cpp — layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM, but easier to debug.
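Hedged launch sketches for the vLLM and llama.cpp paths via their Python bindings — the model ids and paths are placeholders, and any flag not shown is left at its default:

```python
# Shown together for brevity; run one path at a time on a single box.

# vLLM: tensor-parallel across all four cards with AWQ-INT4 weights.
from vllm import LLM

llm = LLM(
    model="your-org/your-70b-awq",   # placeholder HF repo id
    tensor_parallel_size=4,          # one weight shard per RTX 3090
    quantization="awq",
    max_model_len=8192,
)

# llama.cpp (llama-cpp-python): layer split across cards; tensor_split here
# is the Python equivalent of the --tensor-split CLI flag.
from llama_cpp import Llama

lcpp = Llama(
    model_path="models/your-70b.Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=-1,                           # offload every layer to GPU
    tensor_split=[0.25, 0.25, 0.25, 0.25],     # even split across 4 cards
    n_ctx=8192,
)
```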
183 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.
Fits comfortably

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Llama 3.3 70B Instruct | 70B | Q5_K_M | 63.2 GB | 8,192 | Fits cleanly at Q5_K_M + 8,192 ctx with 17% headroom. |
| Aya 23 35B | 35B | Q4_K_M | 49.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom. |
| Command R 35B | 35B | Q4_K_M | 49.7 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 35% headroom. |
| Phind CodeLlama 34B v2 | 34B | Q4_K_M | 48.5 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| Yi 1.5 34B | 34B | Q4_K_M | 48.5 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 36% headroom. |
| DeepSeek Coder V3 | 33B | AWQ-INT4 | 61.1 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 20% headroom. |
| Qwen 2.5 32B Instruct | 32B | Q8_0 | 61.7 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 19% headroom. |
| DeepSeek R1 Distill Qwen 3 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| QwQ 32B Preview | 32B | Q4_K_M | 46.3 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| EXAONE 3.5 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| OLMo 2 32B | 32B | Q4_K_M | 46.3 GB | 8,192 | Fits cleanly at Q4_K_M + 8,192 ctx with 39% headroom. |
| Aya Expanse 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| Qwen 2.5 Coder 32B Instruct | 32B | Q8_0 | 47.8 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 37% headroom. |
| Qwen 3 32B | 32B | Q8_0 | 61.7 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 19% headroom. |
| Magistral 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| DeepSeek R1 Distill Qwen 32B | 32B | Q8_0 | 61.7 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 19% headroom. |
| Qwen 3 Coder 32B | 32B | AWQ-INT4 | 59.6 GB | 8,192 | Fits cleanly at AWQ-INT4 + 8,192 ctx with 22% headroom. |
| Gemma 4 31B Dense | 31B | Q8_0 | 60.1 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 21% headroom. |
| Qwen 3 30B-A3B | 30B | Q8_0 | 58.5 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 23% headroom. |
| Nemotron 3 Nano (30B-A3B) | 30B | Q8_0 | 58.5 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 23% headroom. |
| Gemma 3 27B | 27B | Q8_0 | 53.6 GB | 8,192 | Fits cleanly at Q8_0 + 8,192 ctx with 29% headroom. |
| MedGemma 27B | 27B | Q4_K_M | 40.6 GB | 8,192 | Comfortable fit with 47% headroom — room to extend context or run alongside other workloads. |
| Gemma 4 26B MoE | 26B | Q4_K_M | 39.5 GB | 8,192 | Comfortable fit with 48% headroom — room to extend context or run alongside other workloads. |
| InternVL 2.5 26B | 26B | Q4_K_M | 39.5 GB | 8,192 | Comfortable fit with 48% headroom — room to extend context or run alongside other workloads. |
Borderline

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Qwen 2.5 Math 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Tight fit at Q4_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Molmo 72B | 72B | Q4_K_M | 69.5 GB | 4,096 | Tight fit at Q4_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Jamba 1.5 Mini | 52B | Q4_K_M | 69.0 GB | 8,192 | Tight fit at Q4_K_M — only 9% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
| Mixtral 8x7B Instruct | 47B | Q5_K_M | 67.4 GB | 8,192 | Tight fit at Q5_K_M — only 11% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level. |
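The "cap context tighter" advice falls out of how the KV cache scales: it grows linearly with context length. A minimal sketch, assuming a Llama-70B-style architecture (80 layers, 8 KV heads under GQA, head dim 128) and an fp16 cache — swap in your model's real numbers:

```python
# Per-sequence KV cache. Architecture constants below are assumptions for a
# Llama-70B-style model; adjust per model.

def kv_cache_gb(ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """2 (K and V) x layers x kv_heads x head_dim x ctx x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

print(f"{kv_cache_gb(4096):.2f} GB")    # ~1.25 GB at 4,096 ctx
print(f"{kv_cache_gb(8192):.2f} GB")    # ~2.50 GB -- doubling ctx doubles it
print(f"{kv_cache_gb(32768):.2f} GB")   # ~10 GB -- eats a 9% headroom budget
```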
Not practical

| Model | Params | Quant | VRAM est. | Context | Note |
|---|---|---|---|---|---|
| Nemotron 3 Super 49B | 49B | AWQ-INT4 | 85.9 GB | 8,192 | ~85.9 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build. |
| Hermes 3 Llama 3.1 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Hermes 4 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| DeepSeek R1 Distill Llama 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Dolphin 3 Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| EVA Llama 3.3 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| Llama 4 70B | 70B | AWQ-INT4 | 118.5 GB | 8,192 | ~118.5 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 56%. Drop quant or move to a larger build. |
| Llama 3.1 Nemotron 70B Instruct | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Llama 3.1 70B Instruct | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| OpenBioLLM Llama 3 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Tulu 3 70B | 70B | Q4_K_M | 89.4 GB | 8,192 | ~89.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 18%. Drop quant or move to a larger build. |
| Qwen 3 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | ~121.6 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 60%. Drop quant or move to a larger build. |
| Qwen 2.5 72B Instruct | 72B | Q4_K_M | 91.6 GB | 8,192 | ~91.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 21%. Drop quant or move to a larger build. |
| Qwen 2.5-VL 72B | 72B | AWQ-INT4 | 121.6 GB | 8,192 | ~121.6 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 60%. Drop quant or move to a larger build. |
| InternVL 2.5 78B | 78B | Q4_K_M | 98.4 GB | 8,192 | ~98.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 30%. Drop quant or move to a larger build. |
| Llama 3.2 90B Vision Instruct | 90B | Q4_K_M | 112.0 GB | 8,192 | ~112.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 47%. Drop quant or move to a larger build. |
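Putting the notes above on one footing: headroom is measured against effective VRAM, not total. The category cutoffs in the sketch below are assumptions inferred from the tables, not the tool's published thresholds.

```python
# Headroom math behind the fit notes. Thresholds (<0% no fit, <15% tight,
# <45% clean, else comfortable) are assumptions inferred from the tables.

EFFECTIVE_VRAM_GB = 76.0   # from the 4x RTX 3090 example above

def classify(vram_est_gb: float) -> str:
    headroom = (EFFECTIVE_VRAM_GB - vram_est_gb) / EFFECTIVE_VRAM_GB
    if headroom < 0:
        overshoot = vram_est_gb / EFFECTIVE_VRAM_GB - 1
        return f"not practical: overshoots effective VRAM by {overshoot:.0%}"
    if headroom < 0.15:
        return f"tight fit: only {headroom:.0%} headroom"
    if headroom < 0.45:
        return f"fits cleanly with {headroom:.0%} headroom"
    return f"comfortable fit with {headroom:.0%} headroom"

print(classify(63.2))   # Llama 3.3 70B @ Q5_K_M -> fits cleanly, 17% headroom
print(classify(69.5))   # Qwen 2.5 Math 72B @ Q4_K_M -> tight, 9% headroom
print(classify(85.9))   # Nemotron 3 Super 49B -> overshoots by 13%
```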
NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.
Curated multi-GPU / cluster setups with effective-VRAM math.
OS + runtime install commands for your stack.
Runtime × OS × hardware support truth table.
If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 and AI PC build under $2,000 walk through the realistic 2026 budget tiers.
Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.
Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.