Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

GPUs in your build

CPU class

System RAM (GB)

Use case

Your skill level

Runtime preference

Build summary

Total VRAM

48 GB

Effective VRAM

~37 GB

range 35-38 GB

Topology

single node multi gpu

pcie

Setup difficulty

intermediate

speed penalty ~18%

Why effective VRAM is lower than total

2× NVIDIA GeForce RTX 3090 = 48 GB total VRAM, but without NVLink, cross-card bandwidth is PCIe-bound (~32 GB/s vs NVLink ~112 GB/s). With tensor-parallelism, each card holds ~1/2 of the model weights and replicates activations + KV cache. After 15% TP overhead, effective model capacity is ~37 GB. Largest single tensor on one card is ~22 GB.

Measured evidence on this hardware

Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.

No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.

Recommended runtime

Best engine for this topology + skill level + use case.

vLLM

primary

involved

Tensor-parallel across NVLink/PCIe — works on every recent consumer + datacenter pair. AWQ-INT4 + 70B fits dual 3090 / dual 4090 cleanly.

ExLlamaV2

alternative

involved

Single-stream king. EXL2 4.0bpw + 70B fits dual 3090 with NVLink and beats vLLM on solo-user throughput.

llama.cpp

alternative

moderate

Layer-split via --tensor-split is the experimentation-friendly path. Worse throughput than vLLM but easier to debug.

WORKLOAD PROFILE

FITS

Falcon 40B Instruct @ Q4_K_M, 2K context on NVIDIA GeForce RTX 3090

0 GB37 GBVRAM ceiling

Weights22 GB

KV cache5.0 GB

Activations1.1 GB

Runtime1.8 GB

Headroom7.1 GB

ESTIMATED DECODE RATE

30 tok/s

Bandwidth-derived estimate · efficiency 0.70. Real-world rates land within ±20% on well-tuned runtimes.

Models that fit your build

315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable

24 models · ≥15% headroom

Model	Params	Quant	VRAM est.	Context	Evidence	Note
Falcon 40B Instruct	40B	Q4_K_M	28.1 GB	2,048	No measured row yet	Fits cleanly at Q4_K_M + 2,048 ctx with 24% headroom.
Pollux Judge 32B	32B	Q4_K_M	26.5 GB	4,096	No measured row yet	Fits cleanly at Q4_K_M + 4,096 ctx with 28% headroom.
Qwen 2.5 Coder 32B Instruct	32B	Q4_K_M	22.1 GB	8,192	No measured row yet	Comfortable fit with 40% headroom — room to extend context or run alongside other workloads.
Sarvam 30B	30B	Q4_K_M	24.8 GB	4,096	No measured row yet	Fits cleanly at Q4_K_M + 4,096 ctx with 33% headroom.
Gemma 3 27B	27B	Q4_K_M	30.3 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 18% headroom.
MedGemma 27B	27B	Q4_K_M	30.3 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 18% headroom.
InternVL 2.5 26B	26B	Q4_K_M	29.8 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Gemma 4 Turkish 26B (4B active)	26B	Q4_K_M	28 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 24% headroom.
Gemma 4 26B MoE	26B	Q4_K_M	29.8 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 19% headroom.
Mistral Small 3 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
Mistral Medium 3 24B (dense)	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
Dolphin 3.0 Mistral 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
Mistral Saba 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
Mistral Small 3.2 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
Devstral Small 2 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
Sarvam M	24B	Q4_K_M	19.9 GB	4,096	No measured row yet	Comfortable fit with 46% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Mistral 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 28% headroom.
GPT-OSS Swallow 20B RL v0.1	20B	Q4_K_M	21.6 GB	8,192	No measured row yet	Comfortable fit with 42% headroom — room to extend context or run alongside other workloads.
GPT-NeoX 20B	20B	Q4_K_M	14.1 GB	2,048	No measured row yet	Comfortable fit with 62% headroom — room to extend context or run alongside other workloads.
DeepSeek V3 Lite (16B MoE)	16B	Q4_K_M	18 GB	8,192	No measured row yet	Comfortable fit with 51% headroom — room to extend context or run alongside other workloads.
DeepSeek Coder V2 Lite (16B)	16B	Q4_K_M	18 GB	8,192	No measured row yet	Comfortable fit with 51% headroom — room to extend context or run alongside other workloads.
Granite 3 MoE (3B active)	16B	Q4_K_M	18 GB	8,192	No measured row yet	Comfortable fit with 51% headroom — room to extend context or run alongside other workloads.
DeepSeek MoE 16B Base	16B	Q4_K_M	14 GB	4,096	No measured row yet	Comfortable fit with 62% headroom — room to extend context or run alongside other workloads.
DeepSeek V2 Lite Chat	16B	Q4_K_M	16.9 GB	8,192	No measured row yet	Comfortable fit with 54% headroom — room to extend context or run alongside other workloads.

Borderline

16 models · tight, may need quant downgrade

Model	Params	Quant	VRAM est.	Context	Evidence	Note
Qwen 3.6 35B-A3B (MTP)	35B	Q3_K_M	35.4 GB	8,192	No measured row yet	Tight fit at Q3_K_M — only 4% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
DeepSeek Coder V3	33B	AWQ-INT4	36.5 GB	8,192	No measured row yet	Tight fit at AWQ-INT4 — only 1% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
EXAONE 3.5 32B Instruct	32B	Q4_K_M	34.5 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
EXAONE 3.5 32B Instruct AWQ	32B	Q4_K_M	34.5 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 2.5 32B Instruct	32B	Q4_K_M	36 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Magistral 32B	32B	AWQ-INT4	36 GB	8,192	No measured row yet	Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Aya Expanse 32B	32B	AWQ-INT4	36 GB	8,192	No measured row yet	Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
QwQ 32B Preview	32B	Q4_K_M	36 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
DeepSeek R1 Distill Qwen 3 32B	32B	AWQ-INT4	36 GB	8,192	No measured row yet	Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
EXAONE 4.0.1 32B	32B	Q4_K_M	34.5 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 Coder 32B	32B	AWQ-INT4	36 GB	8,192	No measured row yet	Tight fit at AWQ-INT4 — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen3 Swallow 32B RL v0.2	32B	Q4_K_M	34.5 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 32B	32B	Q4_K_M	36 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
DeepSeek R1 Distill Qwen 32B	32B	Q4_K_M	36 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
OLMo 2 32B	32B	Q4_K_M	36 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
llm-jp 4 32B A3B Thinking	32B	Q4_K_M	34.5 GB	8,192	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.

Not practical

16 models · oversize for this build

Model	Params	Quant	VRAM est.	Context	Evidence	Note
Phind CodeLlama 34B v2	34B	Q4_K_M	38 GB	8,192	No measured row yet	~38.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3%. Drop quant or move to a larger build.
Yi 1.5 34B	34B	Q4_K_M	38 GB	8,192	No measured row yet	~38.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 3%. Drop quant or move to a larger build.
Aya 23 35B	35B	Q4_K_M	39.6 GB	8,192	No measured row yet	~39.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 7%. Drop quant or move to a larger build.
Mihenk LLM v2 35B (Turkish Financial)	35B	Q4_K_M	37.8 GB	8,192	No measured row yet	~37.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 2%. Drop quant or move to a larger build.
Command R 35B	35B	Q4_K_M	39.6 GB	8,192	No measured row yet	~39.6 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 7%. Drop quant or move to a larger build.
ALIA 40b instruct 2601	40B	Q4_K_M	43.1 GB	8,192	No measured row yet	~43.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 17%. Drop quant or move to a larger build.
Mixtral 8X7B Instruct v0.1 GPTQ	47B	Q4_K_M	50.3 GB	8,192	No measured row yet	~50.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 36%. Drop quant or move to a larger build.
Mixtral 8x7B Instruct	47B	Q4_K_M	52.9 GB	8,192	No measured row yet	~52.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 43%. Drop quant or move to a larger build.
Nemotron 3 Super 49B	49B	AWQ-INT4	53.9 GB	8,192	No measured row yet	~53.9 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 46%. Drop quant or move to a larger build.
Jamba 1.5 Mini	52B	Q4_K_M	57.5 GB	8,192	No measured row yet	~57.5 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 55%. Drop quant or move to a larger build.
Dolphin 3 Llama 3.3 70B	70B	AWQ-INT4	77 GB	8,192	No measured row yet	~77.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build.
DeepSeek R1 Distill Llama 70B	70B	Q4_K_M	77 GB	8,192	No measured row yet	~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build.
Llama 3.1 70B Instruct	70B	Q4_K_M	77 GB	8,192	No measured row yet	~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build.
Tulu 3 70B	70B	Q4_K_M	77 GB	8,192	No measured row yet	~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build.
Hermes 3 Llama 3.1 70B	70B	Q4_K_M	77 GB	8,192	No measured row yet	~77.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 108%. Drop quant or move to a larger build.
Llama 3.3 70B Instruct	70B	Q4_K_M	44.7 GB	8,192	No measured row yet	~44.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 21%. Drop quant or move to a larger build.

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.