Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective is lower than total.

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

GPUs in your build

CPU class

System RAM (GB)

Use case

Your skill level

Runtime preference

Build summary

Total VRAM

13824 GB

Effective VRAM

~13823 GB

range 12716-13822 GB

Topology

single gpu

none

Setup difficulty

beginner

speed penalty ~0%

Why effective VRAM is lower than total

Single NVIDIA GB200 NVL72 — 13824 GB VRAM minus ~1.5 GB runtime overhead = ~13822 GB usable for weights + KV cache + activations. The 8% headroom we reserve covers the typical OS/driver footprint and gives KV-cache room for an 8K-32K context.

Coding agents — agentic tool-call burst

Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.

Bottleneck: kv cache

Coding agents emit 5-15 tool calls per task. Each call carries the full agent system prompt + context. KV-cache budget for that prompt × concurrent requests is the limit. The decode side is well-served by any modern card; the prefill side bottlenecks first.

Budget for

•32K context with KV-cache room to spare (~3-4 GB on 4090 AWQ-INT4)
•Prefix cache: prefer SGLang for >5 tool calls / task
•Decode latency: aim for >40 tok/s sustained

Recommended runtime

Best engine for this topology + skill level + use case.

vLLM

primary

involved

AWQ-INT4 path fits 32B-class models on a 24 GB card with concurrent users. The production-default for self-hosted coding agents and multi-user serving.

ExLlamaV2

alternative

involved

Single-stream throughput king on consumer NVIDIA. EXL2 4.65bpw on a 4090 hits the highest tok/s in this class.

Models that fit your build

47 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable

24 models · ≥15% headroom

Model	Params	Quant	VRAM est.	Context	Note
DeepSeek V4 Pro (1.6T MoE)	1600B	Q4_K_M	4248.9 GB	32,768	Comfortable fit with 69% headroom — room to extend context or run alongside other workloads.
Kimi K2.6	1000B	Q4_K_M	2668.5 GB	32,768	Comfortable fit with 81% headroom — room to extend context or run alongside other workloads.
Ring-2.6-1T	1000B	FP16	4134.6 GB	32,768	Comfortable fit with 70% headroom — room to extend context or run alongside other workloads.
DeepSeek V4	745B	AWQ-INT4	2306.8 GB	32,768	Comfortable fit with 83% headroom — room to extend context or run alongside other workloads.
DeepSeek V4 Flash (284B MoE)	284B	Q5_K_M	807.6 GB	32,768	Comfortable fit with 94% headroom — room to extend context or run alongside other workloads.
DeepSeek Coder V2 236B	236B	Q4_K_M	656.2 GB	32,768	Comfortable fit with 95% headroom — room to extend context or run alongside other workloads.
DeepSeek V2.5 236B	236B	Q4_K_M	656.2 GB	32,768	Comfortable fit with 95% headroom — room to extend context or run alongside other workloads.
Qwen 3 235B-A22B	235B	Q5_K_M	674.2 GB	32,768	Comfortable fit with 95% headroom — room to extend context or run alongside other workloads.
Llama 4 Scout	109B	FP16	481.5 GB	32,768	Comfortable fit with 97% headroom — room to extend context or run alongside other workloads.
Qwen 3 72B	72B	AWQ-INT4	254.2 GB	32,768	Comfortable fit with 98% headroom — room to extend context or run alongside other workloads.
Llama 4 70B	70B	AWQ-INT4	248.1 GB	32,768	Comfortable fit with 98% headroom — room to extend context or run alongside other workloads.
Llama 3.3 70B Instruct	70B	Q8_0	123.4 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Llama 3.1 70B Instruct	70B	Q5_K_M	225.1 GB	32,768	Comfortable fit with 98% headroom — room to extend context or run alongside other workloads.
Qwen 3.6 35B-A3B (MTP)	35B	FP16	178.1 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Phind CodeLlama 34B v2	34B	Q4_K_M	73.7 GB	16,384	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
DeepSeek Coder V3	33B	AWQ-INT4	135.2 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 32B Instruct	32B	Q8_0	134.3 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Qwen 3 32B	32B	Q8_0	134.3 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 32B Instruct	32B	Q8_0	78.9 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
DeepSeek R1 Distill Qwen 3 32B	32B	AWQ-INT4	132.2 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Qwen 3 Coder 32B	32B	AWQ-INT4	132.2 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Gemma 4 31B Dense	31B	Q8_0	131.2 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Qwen 3 30B-A3B	30B	Q8_0	128 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.
Qwen 3.6 27B (MTP)	27B	FP16	145.3 GB	32,768	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.

Borderline

0 models · tight, may need quant downgrade

No borderline models — clean fit ladder.

Not practical

0 models · oversize for this build

Every considered model fits.

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.