Custom build engine — what local AI can your hardware run?

Describe your build

Add GPUs, set CPU/RAM/OS, optionally pick a runtime + use case. URL updates as you change fields — share a build by copying the URL.

GPUs in your build

CPU class

System RAM (GB)

OS

Use case

Your skill level

Runtime preference

Build summary

Total VRAM

16 GB

Effective VRAM

~14 GB

range 13-14 GB

Topology

single gpu

none

Setup difficulty

beginner

speed penalty ~0%

Why effective VRAM is lower than total

Single NVIDIA GeForce RTX 3080 16GB (Mobile) — 16 GB VRAM minus ~1.8 GB runtime/driver overhead = ~14 GB usable for weights + KV cache + activations. The remaining uncertainty band covers OS display use and background CUDA allocations.

Measured evidence on this hardware

Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.

Speed rows

12 measured

Model	Evidence	Quant	Tok/s	Provenance
Turkcell LLM 7B v1 NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	85.77	Measured hererun logindependent reproduction pending
RefinedNeuro RN TR R2 NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	79.27	Measured hererun logindependent reproduction pending
RefinedNeuro RN TR R1 NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	79.89	Measured hererun logindependent reproduction pending
Qwen 3 4B NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	103.7	Measured hererun logindependent reproduction pending
Qwen 3 14B NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	38.26	Measured hererun logindependent reproduction pending
Qwen 2.5 7B Instruct NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	80.42	Measured hererun logindependent reproduction pending
Phi-4 Reasoning 14B NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	40.44	Measured hererun logindependent reproduction pending
Phi-3.5 Mini Instruct NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	155.4	Measured hererun logindependent reproduction pending
Mistral Nemo 12B Instruct NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	65.72	Measured hererun logindependent reproduction pending
Mistral 7B Instruct v0.3 NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	89.64	Measured hererun logindependent reproduction pending
Llama 3.2 11B Vision Instruct NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	67.00	Measured hererun logindependent reproduction pending
Malhajar Mistral 7B Turkish NVIDIA GeForce RTX 3080 16GB (Mobile)	4K ctx ollama version is 0.24.0 Microsoft Windows [Version 10.0.26200.8457] Driver 571.96	Q4_K_M	87.28	Measured hererun logindependent reproduction pending

Quality rows

9 public

Model	Benchmark	Setup	Score	Log
Qwen 3 8B	HumanEval+	Q4_K_M / ollama-0.24	2.4	gist
Llama 3.1 8B Instruct	MBPP+	Q4_K_M / ollama-0.24	39.2	gist
Phi-4 14B	MBPP+	Q4_K_M / ollama-0.24	60.3	gist
Qwen 2.5 Coder 7B Instruct	MBPP+	Q4_K_M / ollama-0.24	66.9	gist
Llama 3.1 8B Instruct	HumanEval+	Q4_K_M / ollama-0.24	56.1	gist
Phi-4 14B	HumanEval+	Q4_K_M / ollama-0.24	78.7	gist
Qwen 2.5 Coder 7B Instruct	HumanEval+	Q4_K_M / ollama-0.24	81.1	gist
Llama 3.2 3B Instruct	TurkishMMLU (Generative)	Q4_K_M / ollama-0.24	11.4	gist
Turkish Llama 8B Instruct v0.1	TurkishMMLU (Generative)	Q4_K_M / ollama-0.24	11.0	gist

Coding agents — agentic tool-call burst

Workload-specific bottleneck. Where this kind of work actually breaks first, and what to budget for.

Bottleneck: kv cache

Coding agents emit 5-15 tool calls per task. Each call carries the full agent system prompt + context. KV-cache budget for that prompt × concurrent requests is the limit. The decode side is well-served by any modern card; the prefill side bottlenecks first.

Budget for

•32K context with KV-cache room to spare (~3-4 GB on 4090 AWQ-INT4)
•Prefix cache: prefer SGLang for >5 tool calls / task
•Decode latency: aim for >40 tok/s sustained

Recommended runtime

Best engine for this topology + skill level + use case.

vLLM

primary

involved

AWQ-INT4 path fits 32B-class models on a 24 GB card with concurrent users. The production-default for self-hosted coding agents and multi-user serving.

ExLlamaV2

alternative

involved

Single-stream throughput king on consumer NVIDIA. EXL2 4.65bpw on a 4090 hits the highest tok/s in this class.

Models that fit your build

63 models considered (filtered by coding). Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable

19 models · ≥15% headroom

Model	Params	Quant	VRAM est.	Context	Evidence	Note
Yi Coder 9B	9B	Q4_K_M	10.2 GB	8,192	No measured row yet	Fits cleanly at Q4_K_M + 8,192 ctx with 27% headroom.
Llama 3.1 8B Instruct	8B	Q8_0	11.1 GB	16,384	No measured row yet	Fits cleanly at Q8_0 + 16,384 ctx with 21% headroom.
Gervásio 8B PTPT	8B	Q4_K_M	6.6 GB	4,096	No measured row yet	Comfortable fit with 53% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 7B Instruct	7B	Q8_0	9.5 GB	16,384	80.42 tok/sMeasured hereNVIDIA GeForce RTX 3080 16GB (Mobile) / 4K ctx	Measured on this hardware: 80.42 tok/s at Q4_K_M (4,096 measured ctx). Fits cleanly at Q8_0 + 16,384 ctx with 32% headroom.
CodeGemma 7B	7B	Q4_K_M	7.9 GB	8,192	No measured row yet	Comfortable fit with 43% headroom — room to extend context or run alongside other workloads.
Codestral Mamba 7B	7B	Q4_K_M	11.4 GB	16,384	No measured row yet	Fits cleanly at Q4_K_M + 16,384 ctx with 18% headroom.
CodeQwen 1.5 7B	7B	Q4_K_M	11.6 GB	16,384	No measured row yet	Fits cleanly at Q4_K_M + 16,384 ctx with 17% headroom.
StarCoder 2 7B	7B	Q4_K_M	11.6 GB	16,384	No measured row yet	Fits cleanly at Q4_K_M + 16,384 ctx with 17% headroom.
Salamandra 7B	7B	Q4_K_M	7.6 GB	8,192	No measured row yet	Comfortable fit with 46% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 3B	3B	Q4_K_M	8 GB	32,768	No measured row yet	Comfortable fit with 43% headroom — room to extend context or run alongside other workloads.
StarCoder 2 3B	3B	Q4_K_M	5.1 GB	16,384	No measured row yet	Comfortable fit with 63% headroom — room to extend context or run alongside other workloads.
ColPali v1.3	3B	Q4_K_M	1.8 GB	0	No measured row yet	Comfortable fit with 87% headroom — room to extend context or run alongside other workloads.
Salamandra 2B	2B	Q4_K_M	2.4 GB	8,192	No measured row yet	Comfortable fit with 83% headroom — room to extend context or run alongside other workloads.
mxbai-rerank-large-v2	2B	Q4_K_M	4 GB	32,768	No measured row yet	Comfortable fit with 72% headroom — room to extend context or run alongside other workloads.
Qwen 2.5 Coder 1.5B	2B	Q4_K_M	4.1 GB	32,768	No measured row yet	Comfortable fit with 71% headroom — room to extend context or run alongside other workloads.
Florence-2 Large	1B	Q4_K_M	0.4 GB	0	No measured row yet	Comfortable fit with 97% headroom — room to extend context or run alongside other workloads.
GOT-OCR 2.0	1B	Q4_K_M	0.3 GB	0	No measured row yet	Comfortable fit with 98% headroom — room to extend context or run alongside other workloads.
SigLIP SO400M (patch14-384)	0B	Q4_K_M	0.3 GB	0	No measured row yet	Comfortable fit with 98% headroom — room to extend context or run alongside other workloads.
Jina Reranker v2 Base Multilingual	0B	Q4_K_M	0.2 GB	1,024	No measured row yet	Comfortable fit with 99% headroom — room to extend context or run alongside other workloads.

Borderline

7 models · tight, may need quant downgrade

Model	Params	Quant	VRAM est.	Context	Evidence	Note
Qwen 3 14B	14B	Q4_K_M	15.8 GB	8,192	38.26 tok/sMeasured hereNVIDIA GeForce RTX 3080 16GB (Mobile) / 4K ctx	Measured on this hardware: 38.26 tok/s at Q4_K_M (4,096 measured ctx). The larger target context still needs validation before calling it comfortable.
OpenCoder 8B	8B	Q4_K_M	13 GB	16,384	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 3 8B	8B	Q4_K_M	13.1 GB	16,384	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
DeepSeek R1 Distill Llama 8B	8B	Q4_K_M	13 GB	16,384	No measured row yet	Tight fit at Q4_K_M — only 7% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
EXAONE Deep 7.8B	8B	Q4_K_M	12.3 GB	16,384	No measured row yet	Tight fit at Q4_K_M — only 12% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
Qwen 2.5 Coder 7B Instruct	7B	Q6_K	13.6 GB	16,384	No measured row yet	Tight fit at Q6_K — only 3% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.
VibeThinker-3B	3B	FP16	12.3 GB	32,768	No measured row yet	Tight fit at FP16 — only 12% headroom. KV cache for longer context will OOM. Cap context tighter or drop one quant level.

Not practical

16 models · oversize for this build

Model	Params	Quant	VRAM est.	Context	Evidence	Note
Qwen 2.5 14B Instruct	14B	Q4_K_M	16.4 GB	8,192	No measured row yet	~16.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 17%. Drop quant or move to a larger build.
Qwen 2.5 Coder 14B Instruct	14B	Q4_K_M	15.8 GB	8,192	No measured row yet	~15.8 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 13%. Drop quant or move to a larger build.
StarCoder 2 15B	15B	Q4_K_M	17 GB	8,192	No measured row yet	~17.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 21%. Drop quant or move to a larger build.
DeepSeek V3 Lite (16B MoE)	16B	Q4_K_M	18 GB	8,192	No measured row yet	~18.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 28%. Drop quant or move to a larger build.
DeepSeek Coder V2 Lite (16B)	16B	Q4_K_M	18 GB	8,192	No measured row yet	~18.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 28%. Drop quant or move to a larger build.
Codestral 22B	22B	Q4_K_M	24.7 GB	8,192	No measured row yet	~24.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 76%. Drop quant or move to a larger build.
Mistral Small 3 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	~26.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 91%. Drop quant or move to a larger build.
Devstral Small 2 24B	24B	Q4_K_M	26.7 GB	8,192	No measured row yet	~26.7 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 91%. Drop quant or move to a larger build.
Qwen 3.6 27B (MTP)	27B	Q3_K_M	27.3 GB	8,192	No measured row yet	~27.3 GB needed at Q3_K_M + 8,192 ctx — overshoots effective VRAM by 95%. Drop quant or move to a larger build.
Qwen 3 30B-A3B	30B	Q4_K_M	33.9 GB	8,192	No measured row yet	~33.9 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 142%. Drop quant or move to a larger build.
Sarvam 30B	30B	Q4_K_M	24.8 GB	4,096	No measured row yet	~24.8 GB needed at Q4_K_M + 4,096 ctx — overshoots effective VRAM by 77%. Drop quant or move to a larger build.
Gemma 4 31B Dense	31B	Q4_K_M	34.4 GB	8,192	No measured row yet	~34.4 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 146%. Drop quant or move to a larger build.
Qwen 2.5 32B Instruct	32B	Q4_K_M	36 GB	8,192	No measured row yet	~36.0 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 157%. Drop quant or move to a larger build.
DeepSeek R1 Distill Qwen 3 32B	32B	AWQ-INT4	36 GB	8,192	No measured row yet	~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 157%. Drop quant or move to a larger build.
Qwen 3 Coder 32B	32B	AWQ-INT4	36 GB	8,192	No measured row yet	~36.0 GB needed at AWQ-INT4 + 8,192 ctx — overshoots effective VRAM by 157%. Drop quant or move to a larger build.
Qwen 2.5 Coder 32B Instruct	32B	Q4_K_M	22.1 GB	8,192	No measured row yet	~22.1 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by 58%. Drop quant or move to a larger build.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole BOM honestly: AI PC build under $1,000 or AI PC build under $2,000 cover the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.