Eight inputs (use case, budget, scale, privacy posture, and more) and we compose the full rig: GPU + runtime + 1-3 model picks + first-run workflow + cost rollup + ready-to-paste install script. Three tiers side by side so the upgrade path stays visible.
Every recommendation cites its rule-based scoring; measured tok/s figures carry a confidence chip wherever they appear. We don't invent numbers: when the data isn't there, we say so.
URL updates as you change fields — share or bookmark a result.
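Keeping the URL in sync with form state is a simple round-trip through the query string. A minimal sketch of the idea in Python; the field names (`use_case`, `budget`, etc.) are illustrative, not the configurator's actual parameter names:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def state_to_url(base, state):
    """Serialize configurator state into a shareable, bookmarkable URL."""
    return base + "?" + urlencode(sorted(state.items()))

def url_to_state(url):
    """Recover the state from a shared URL."""
    qs = parse_qs(urlsplit(url).query)
    return {k: v[0] for k, v in qs.items()}

# Hypothetical field names for illustration only.
state = {"use_case": "coding", "budget": "1500", "scale": "solo", "privacy": "local-only"}
print(state_to_url("https://example.com/build", state))
```

Sorting the items before encoding keeps the URL stable across visits, so the same inputs always produce the same shareable link.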
One step down on budget. What you give up; what you keep.
Your inputs, our recommendation. Read the full card below.
One step up on budget. What you'd gain; what it costs.
Production multi-user serving needs continuous batching and paged attention, both core advantages of vLLM over Ollama (which sees roughly 3-5× lower throughput on multi-user workloads).
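Why batching dominates: single-sequence decoding is memory-bandwidth-bound, so adding sequences to a step costs far less than running them serially. A toy back-of-envelope model (the timing constants are assumptions for illustration, not measurements of any runtime):

```python
# Toy decode-throughput model: each generated token costs one step.
# Assumed numbers: ~20 ms per step for batch size 1, ~1 ms marginal
# cost per extra sequence in the batch (memory-bandwidth-bound regime).
def tokens_per_sec(batch_size, per_step_ms=20.0, marginal_ms=1.0):
    """Aggregate tokens/s across all users sharing each decode step."""
    step_ms = per_step_ms + marginal_ms * (batch_size - 1)
    return batch_size * 1000.0 / step_ms

sequential = tokens_per_sec(1)    # one user served at a time
batched = tokens_per_sec(16)      # 16 users share every decode step
print(round(batched / sequential, 1))  # aggregate speedup
```

Under these assumed constants, 16 concurrent users get roughly 9× the aggregate throughput of serial serving, which is the shape of the vLLM-vs-Ollama gap on multi-user workloads.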
--tensor-parallel-size N for multi-GPU setups
--enable-prefix-caching for chat workloads
Strongest general-purpose model at 14B in 2026. Multilingual tokenizer (1.7× more efficient on Turkish/Asian languages than Llama). Reasoning mode available.
Microsoft's reasoning-focused 14B trained on heavy synthetic data. Beats Llama 3.1 8B on math/code benchmarks. Weaker creative writing.
python -m vllm.entrypoints.openai.api_server --model qwen3-14b --port 8000
http://localhost:11434 (Ollama) or :8000 (vLLM).
#!/usr/bin/env bash
# RunLocalAI stack installer — vLLM production serving
set -e
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model qwen3-14b \
  --port 8000 \
  --enable-prefix-caching \
  --tensor-parallel-size 1  # bump to N for multi-GPU
Just the hardware-pick question, with side-by-side compare, price/perf scatter, and score breakdown per dimension.
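Once the vLLM server from the install script is up, it speaks the OpenAI-compatible chat API. A minimal smoke test (the model name must match the `--model` flag; the endpoint path is the standard `/v1/chat/completions` route vLLM implements):

```python
import json
from urllib import request

# Payload per the OpenAI-compatible chat API; model matches --model above.
payload = {
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Say hello in Turkish."}],
    "max_tokens": 32,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# resp = request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library works the same way against this endpoint, so apps built for hosted APIs can point at the local server unchanged.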
Reverse direction: I have this hardware — what fits? Use this to validate the recommendation against your actual rig.
Drill into the model picks: Q4_K_M vs Q5_K_M vs Q8 on your specific VRAM, with quality curve + VRAM fit visualization.
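The VRAM-fit check behind that comparison is simple arithmetic: weight size is roughly parameter count times effective bits per weight, plus headroom for KV cache and activations. A sketch under assumed constants (the bits-per-weight figures and 15% overhead are rules of thumb, not the site's exact model):

```python
# Approximate effective bits per weight for common GGUF quant levels
# (assumed round numbers; real files vary slightly by layer mix).
BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5}

def fits(params_b, quant, vram_gb, overhead=1.15):
    """Return (fits?, weight size in GB) for a params_b-billion model."""
    weight_gb = params_b * BITS[quant] / 8
    return weight_gb * overhead <= vram_gb, round(weight_gb, 1)

for q in BITS:
    ok, gb = fits(14, q, 12)  # a 14B model on a 12 GB card
    print(q, gb, "fits" if ok else "won't fit")
```

This is also the reverse-lookup direction: fix `vram_gb` to your card and scan quant levels (or model sizes) to see what clears the bar.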
Tune every assumption: utilization, electricity rate, cloud equivalent rate, amortization horizon.
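The cost rollup those assumptions feed is two terms: amortized hardware plus electricity, compared against a cloud-equivalent bill. A sketch of that math with illustrative numbers (all inputs here are placeholders, not the calculator's defaults):

```python
def monthly_cost(hw_price, horizon_months, watts, util, kwh_rate):
    """Local cost/month: amortized hardware + electricity at given utilization."""
    amortized = hw_price / horizon_months
    electricity = (watts / 1000) * 24 * 30 * util * kwh_rate
    return amortized + electricity

def cloud_equiv(m_tokens_per_month, rate_per_m_tokens):
    """Cloud cost/month for the same token volume."""
    return m_tokens_per_month * rate_per_m_tokens

# Illustrative inputs: $1600 rig over 36 months, 350 W at 25% utilization,
# $0.15/kWh, vs 120M tokens/month at $0.60 per million tokens.
local = monthly_cost(hw_price=1600, horizon_months=36, watts=350, util=0.25, kwh_rate=0.15)
cloud = cloud_equiv(m_tokens_per_month=120, rate_per_m_tokens=0.60)
print(round(local, 2), round(cloud, 2))
```

The amortization horizon is the biggest lever: stretch it and local wins sooner; shorten it and the cloud comparison tightens.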
18 hand-curated stack recipes for specific outcomes (coding agent, offline RAG, dual-3090, Mac cluster, iPhone, etc.)
37 curated apps that plug into the runtime + model picks above: chat UIs, coding agents, RAG, voice, image, browser ext, mobile, editor plugins. Filter by your VRAM + privacy posture.