Custom build engine

Describe your build — any GPUs, CPU, RAM, OS, runtime, use case. We'll compute effective VRAM honestly, recommend a runtime, and tell you which models fit comfortably, which are borderline, and which aren't practical.

Total VRAM ≠ pooled VRAM. We never sum VRAM unless the silicon truly pools (Apple unified memory). We always explain why effective VRAM is lower than total VRAM.

Calculations follow the RunLocalAI Will-It-Run Framework: effective VRAM, model working set, runtime constraints, fit tiers, and measured-vs-estimated evidence labels.

Describe your build

Add GPUs, set CPU/RAM/OS, and optionally pick a runtime and use case. The URL updates as you change fields, so you can share a build by copying the URL.

GPUs in your build

CPU class

System RAM (GB)

Use case

Your skill level

Runtime preference

Build summary

Total VRAM

0 GB

Effective VRAM

~0 GB

range 0-0 GB

Topology

apple unified

unified-memory · pooled

Setup difficulty

beginner

speed penalty ~0%

Genuinely pooled memory

Apple Silicon unified memory is genuinely pooled: 0 GB shared between CPU and GPU. macOS reserves ~8 GB for the OS and apps, leaving ~0 GB for inference. Unlike NVIDIA multi-GPU, you don't pay an interconnect penalty here. The trade-off is bandwidth: 800 GB/s vs an RTX 4090's 1 TB/s.
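The unified-memory arithmetic above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the function name `effective_unified_gb` and the 96 GB example machine are hypothetical, and only the ~8 GB macOS reserve comes from the text.

```python
def effective_unified_gb(total_gb: float, os_reserve_gb: float = 8.0) -> float:
    """Memory left for inference on a unified-memory machine.

    os_reserve_gb defaults to the ~8 GB the page says macOS keeps for the
    OS and apps. The result is clamped at zero so an empty build (0 GB)
    does not report negative memory.
    """
    return max(total_gb - os_reserve_gb, 0.0)


# Hypothetical 96 GB machine: 96 - 8 = 88 GB available for inference.
print(effective_unified_gb(96))

# Empty build, as in the summary above: nothing is left for inference.
print(effective_unified_gb(0))
```

The clamp is a design choice in this sketch; it mirrors the "~0 GB" the summary shows when no hardware has been selected.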

Measured evidence on this hardware

Publicly inspectable measured rows for the selected hardware slug(s). Exact measured rows calibrate the fit table instead of leaving it as pure VRAM estimation.

No publicly inspectable benchmark rows are attached to this exact hardware yet. The engine will still calculate fit and runtime, but speed rows will remain estimated.

Recommended runtime

Best engine for this topology + skill level + use case.

MLX-LM

primary

moderate

Apple's first-party inference path; native Metal kernels outperform llama.cpp's Metal backend by 15-30% on most decoder-only models.

llama.cpp

alternative

moderate

Cross-platform fallback if you need GGUF format or want to share the same checkpoint across Mac and Linux/Windows.

WORKLOAD PROFILE

OVERFLOW

all-MiniLM-L6-v2 @ Q4_K_M, 0.3K context on Apple Mac Studio (M3 Ultra)

VRAM ceiling: 0.0 GB

Weights: 0.0 GB

KV cache: 0.0 GB

Activations: 0.0 GB

Runtime: 0.5 GB

Overflow: 0.5 GB
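The breakdown above reads as a simple sum: the four components add up to the working set, and whatever exceeds the VRAM ceiling is overflow. The sketch below assumes that reading. The `WorkingSet` class and its field names are hypothetical and are not taken from the engine.

```python
from dataclasses import dataclass


@dataclass
class WorkingSet:
    """Memory components of one workload, all in GB."""
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    runtime_gb: float

    def total_gb(self) -> float:
        # Assumption: the working set is the plain sum of the four parts.
        return (self.weights_gb + self.kv_cache_gb
                + self.activations_gb + self.runtime_gb)

    def overflow_gb(self, ceiling_gb: float) -> float:
        # Anything above the VRAM ceiling overflows; never negative.
        return max(self.total_gb() - ceiling_gb, 0.0)


# Values from the profile above: only the 0.5 GB runtime overhead is
# non-zero, and the ceiling is 0.0 GB, so all of it overflows.
profile = WorkingSet(weights_gb=0.0, kv_cache_gb=0.0,
                     activations_gb=0.0, runtime_gb=0.5)
print(profile.overflow_gb(ceiling_gb=0.0))  # 0.5
```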

ESTIMATED DECODE RATE

44,000 tok/s

Bandwidth-derived estimate · efficiency 0.55. Real-world rates land within ±20% on well-tuned runtimes.
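A bandwidth-derived estimate of this kind is usually computed by assuming that each decoded token requires reading the model weights from memory once, so the rate is roughly usable bandwidth divided by bytes read per token. The sketch below follows that common rule of thumb. It is not confirmed to be the engine's formula, the function names are hypothetical, and the 0.01 GB weight size is an illustrative assumption (the page only shows the weights rounded to 0.0 GB).

```python
def estimate_decode_tok_s(bandwidth_gb_s: float,
                          gb_read_per_token: float,
                          efficiency: float = 0.55) -> float:
    """Rule-of-thumb decode rate for a memory-bandwidth-bound model.

    efficiency is the fraction of peak bandwidth actually achieved;
    0.55 is the value quoted on the page.
    """
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s * efficiency / gb_read_per_token


def plausible_range(rate: float, tolerance: float = 0.20) -> tuple[float, float]:
    """Apply the page's stated +/-20% band around an estimate."""
    return rate * (1 - tolerance), rate * (1 + tolerance)


# 800 GB/s at 0.55 efficiency with a hypothetical 0.01 GB read per token
# gives roughly 44,000 tok/s.
rate = estimate_decode_tok_s(800, 0.01)
low, high = plausible_range(rate)
print(round(rate), round(low), round(high))
```

For very small models the weights are so light that compute and launch overhead, not bandwidth, tend to dominate, so an estimate this high is best treated as an upper bound.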

Models that fit your build

315 models considered. Categorized by headroom at the recommended quant + a sensible context for your use case.

Comfortable

0 models · ≥15% headroom

No model fits comfortably on this build.

Borderline

0 models · tight, may need quant downgrade

No borderline models — clean fit ladder.

Not practical

16 models · oversize for this build

Model	Params	Quant	VRAM est.	Context	Evidence	Note
all-MiniLM-L6-v2	0B	Q4_K_M	0 GB	256	No measured row yet	~0.0 GB needed at Q4_K_M + 256 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Piper	0B	Q4_K_M	0 GB	0	No measured row yet	~0.0 GB needed at Q4_K_M + 0 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Whisper Tiny	0B	Q4_K_M	0 GB	30	No measured row yet	~0.0 GB needed at Q4_K_M + 30 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Whisper Base	0B	Q4_K_M	0 GB	30	No measured row yet	~0.0 GB needed at Q4_K_M + 30 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Kokoro 82M	0B	Q4_K_M	0.1 GB	0	No measured row yet	~0.1 GB needed at Q4_K_M + 0 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
all-mpnet-base-v2	0B	Q4_K_M	0.1 GB	384	No measured row yet	~0.1 GB needed at Q4_K_M + 384 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
paraphrase-multilingual-MiniLM-L12-v2	0B	Q4_K_M	0.1 GB	128	No measured row yet	~0.1 GB needed at Q4_K_M + 128 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
gpt2-base-french	0B	Q4_K_M	0.1 GB	1,024	No measured row yet	~0.1 GB needed at Q4_K_M + 1,024 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
GPT-2 Spanish	0B	Q4_K_M	0.1 GB	1,024	No measured row yet	~0.1 GB needed at Q4_K_M + 1,024 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
SmolLM2 135M Instruct	0B	Q4_K_M	0.2 GB	8,192	No measured row yet	~0.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Nomic Embed Text v1.5	0B	Q4_K_M	0.2 GB	8,192	No measured row yet	~0.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
GTE ModernBERT Base	0B	Q4_K_M	0.2 GB	8,192	No measured row yet	~0.2 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Dostoevsky Doesn't Write It GPT2	0B	Q4_K_M	0.1 GB	1,024	No measured row yet	~0.1 GB needed at Q4_K_M + 1,024 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Whisper Small	0B	Q4_K_M	0.2 GB	30	No measured row yet	~0.2 GB needed at Q4_K_M + 30 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Gemma 3 270M	0B	Q4_K_M	0.3 GB	8,192	No measured row yet	~0.3 GB needed at Q4_K_M + 8,192 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
Jina Reranker v2 Base Multilingual	0B	Q4_K_M	0.2 GB	1,024	No measured row yet	~0.2 GB needed at Q4_K_M + 1,024 ctx — overshoots effective VRAM by Infinity%. Drop quant or move to a larger build.
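
The three tiers above can be expressed as a headroom check. In this sketch the 15% comfortable threshold comes from the page, while the borderline band (0% up to 15% headroom) is an assumption, since the page only calls it "tight". The function names are hypothetical. The guard for zero effective VRAM matters here: with no hardware selected, a percentage overshoot is undefined, which is presumably why the Note column reads "Infinity%".

```python
import math


def headroom_fraction(needed_gb: float, effective_gb: float) -> float:
    """Fraction of effective VRAM left free after loading the model.

    Returns -inf when there is no effective VRAM, instead of dividing
    by zero.
    """
    if effective_gb <= 0:
        return -math.inf
    return (effective_gb - needed_gb) / effective_gb


def fit_tier(needed_gb: float, effective_gb: float,
             comfortable_headroom: float = 0.15) -> str:
    """Classify a model as comfortable, borderline, or not practical."""
    headroom = headroom_fraction(needed_gb, effective_gb)
    if headroom >= comfortable_headroom:
        return "comfortable"
    if headroom >= 0:
        return "borderline"  # assumed band: fits, but under 15% headroom
    return "not practical"


# With 0 GB effective VRAM, even a 0.3 GB model is not practical.
print(fit_tier(needed_gb=0.3, effective_gb=0.0))

# Hypothetical 24 GB build: 20 GB fits comfortably, 22 GB is borderline.
print(fit_tier(20, 24), fit_tier(22, 24), fit_tier(30, 24))
```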

Multi-GPU buying guide →

NVLink vs PCIe, tensor- vs pipeline-parallel, mixed-card honesty.

Hardware combinations →

Curated multi-GPU / cluster setups with effective-VRAM math.

Setup path-finder →

OS + runtime install commands for your stack.

Compatibility matrix →

Runtime × OS × hardware support truth table.

Shopping a full build instead of a single card?

If you're sizing a fresh AI build (not just a card to drop into an existing system), the build-budget walkthroughs cover the whole bill of materials (BOM): AI PC build under $1,000 and AI PC build under $2,000 address the realistic 2026 budget tiers.

Vertical-fit shopping? AI PC for students covers the budget + portability tradeoffs; AI PC for developers covers the coding workflow specifics; AI PC for small business covers the document-RAG / always-on machine.

Form-factor first? See best laptop for local AI, best Mac for local AI, best mini PC for local AI, or best used GPU for local AI.