RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP · Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.co · Independently operated

Stack Builder

Eight inputs — use case, budget, scale, privacy posture — and we compose the full rig: GPU + runtime + 1-3 model picks + first-run workflow + cost rollup + ready-to-paste install script. Three tiers side-by-side so the upgrade path stays visible.

Every recommendation references rule-based scoring; measured tok/s carries a confidence chip when surfaced. We don't invent numbers — when the data isn't there we say so.

Tell us about your build

URL updates as you change fields — share or bookmark a result.

Side-by-side: budget vs balanced vs stretch

one step down · your inputs · one step up
Budget
~$350

One step down on budget. What you give up; what you keep.

GPU
NVIDIA RTX 2080 Ti 22GB (China-mod) · 22 GB
Runtime
vLLM
Top model
Qwen 3 14B · Q4_K_M
3-yr TCO
—
Balanced
~$899

Your inputs, our recommendation. Read the full card below.

GPU
NVIDIA GeForce RTX 3090 · 24 GB
Runtime
vLLM
Top model
Qwen 3 14B · Q4_K_M
3-yr TCO
$1,046
Break-even
188 mo vs cloud
Stretch
~$2,500

One step up on budget. What you'd gain; what it costs.

GPU
NVIDIA L4 · 24 GB
Runtime
vLLM
Top model
Qwen 3 14B · Q4_K_M
3-yr TCO
$2,530
Break-even
1251 mo vs cloud

Your recommended stack

full breakdown — read top to bottom
Balanced — recommended
NVIDIA · 24 GB VRAM · ~$899

NVIDIA GeForce RTX 3090 + vLLM + Qwen 3 14B

§ Hardware
NVIDIA GeForce RTX 3090 · 24 GB. Headroom for up to Qwen 2.5 Coder 32B Q4 with 32K context.
Expected throughput: 30-60 tok/s on 32B Q4 single-stream; 80-130 tok/s on 13B Q4.
Estimated (rule-based scoring) · Full hardware page →
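Those rates translate directly into wait time. A quick check for a 400-token answer (the length is an arbitrary example of ours):

```shell
# Seconds to stream a 400-token answer at the quoted single-stream rates.
awk 'BEGIN {
  tokens = 400
  printf "32B Q4: %.0f-%.0f s\n", tokens/60, tokens/30   # at 30-60 tok/s
  printf "13B Q4: %.0f-%.0f s\n", tokens/130, tokens/80  # at 80-130 tok/s
}'
```

So the 32B pick means waiting roughly 7-13 seconds per answer single-stream, the 13B pick 3-5 seconds.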
§ Runtime
vLLM

Production multi-user serving needs continuous batching + paged attention — both are vLLM's core advantage over Ollama (which sees ~3-5× lower throughput on multi-user workloads).

  • Use --tensor-parallel-size N for multi-GPU setups
  • Enable --enable-prefix-caching for chat workloads
  • Use AWQ-INT4 quantization for the best speed/quality balance
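Putting the three tips together, a launch command looks roughly like this. A sketch, not a verified config: the AWQ checkpoint id `Qwen/Qwen3-14B-AWQ` is an assumption, and the `DRY_RUN` guard is ours so you can inspect the command before committing GPU time:

```shell
#!/usr/bin/env bash
# Sketch: all three runtime tips in one vLLM launch command.
# MODEL is an assumed AWQ-INT4 checkpoint id; substitute your own.
MODEL="Qwen/Qwen3-14B-AWQ"
ARGS=(-m vllm.entrypoints.openai.api_server
      --model "$MODEL"
      --quantization awq           # AWQ-INT4 weights
      --enable-prefix-caching      # reuse shared chat prefixes
      --tensor-parallel-size 1     # bump to N for multi-GPU
      --port 8000)
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "python ${ARGS[*]}"        # print the command instead of launching
else
  python "${ARGS[@]}"
fi
```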
§ Model picks (2)
  • Qwen 3 14B
    14B params
    Q4_K_M
    ~8.6 GB

    Strongest general-purpose model at 14B in 2026. Multilingual tokenizer (1.7× more efficient on Turkish/Asian languages than Llama). Reasoning mode available.

    Community-reported · 30-45 tok/s on 16 GB VRAM
  • Phi-4 14B
    14B params
    Q4_K_M
    ~8.5 GB

    Microsoft's reasoning-focused 14B, trained heavily on synthetic data. Beats Llama 3.1 8B on math/code benchmarks. Weaker at creative writing.

    Editorial estimate · 30-45 tok/s on 16 GB VRAM
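The ~8.5 GB sizes check out from parameter count alone. Q4_K_M averages roughly 4.85 bits per weight (an approximation; the exact mix of quant types varies per model):

```shell
# GGUF size estimate: params x bits-per-weight / 8 bits per byte.
awk -v params=14e9 -v bpw=4.85 'BEGIN { printf "%.1f GB\n", params * bpw / 8 / 1e9 }'
```

That lands on 8.5 GB, matching both cards; real files drift slightly because embedding and output layers are usually kept at higher precision.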
§ First-run workflow
  1. Install vLLM on Linux.
  2. Start vLLM with the model: python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-14B --port 8000
  3. Connect a coding agent: install Cline (VS Code extension) or Aider (CLI) and point it at http://localhost:8000 for this vLLM stack (Ollama serves on :11434 by default).
  4. Verify the full loop end-to-end before adding observability, monitoring, or any second model.
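Step 4's end-to-end check can be a single request against vLLM's OpenAI-compatible endpoint. A sketch; the model field must match whatever you passed to --model, and the fallback echo is our addition:

```shell
# Minimal smoke test for the serving loop. Assumes the server from step 2 is up on :8000.
payload='{"model":"Qwen/Qwen3-14B","messages":[{"role":"user","content":"Say ok"}],"max_tokens":8}'

# Sanity-check the JSON before sending it anywhere.
echo "$payload" | python3 -c 'import json, sys; json.load(sys.stdin); print("payload ok")'

curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$payload" || echo "server not reachable on :8000"
```

A JSON response with a choices array means the full loop works; only then add observability or a second model.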
§ Total cost of ownership (3-year)
Upfront
$899
hardware
Monthly electricity
$4
at 0.84 kWh/day
3-year total
$1,046
upfront + electricity
Cloud equivalent
$319
same token volume
Break-even
188 mo
months until local overtakes cloud
Tok/s assumed
112.3
extrapolated
Assumptions: 4 hr/day active, 60% utilization, 3-year amortization, $0.30/M token cloud equivalent. Tune in cost calculator →
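The card's numbers reproduce from its stated assumptions. In this back-of-envelope check, the ~350 W draw and $0.16/kWh electricity rate are our own inferences (they make the stated 0.84 kWh/day and ~$4/month figures come out):

```shell
# Reproduce the Balanced card's 3-year TCO and break-even.
awk 'BEGIN {
  upfront   = 899
  kwh_day   = 0.35 * 4 * 0.6          # assumed ~350 W x 4 h/day x 60% util = 0.84 kWh/day
  elec_mo   = kwh_day * 30.4 * 0.16   # ~$4.09/month at an assumed $0.16/kWh
  cloud_mo  = 319 / 36                # $319 cloud equivalent spread over 36 months
  printf "3-yr total: $%.0f\n", upfront + elec_mo * 36
  printf "break-even: %.0f mo\n", upfront / (cloud_mo - elec_mo)
}'
```

Both the $1,046 total and the 188-month break-even fall out exactly, which suggests the card's $4/month is a rounded $4.09.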

Install script

copy and paste — this gets you to first token
#!/usr/bin/env bash
# RunLocalAI stack installer — vLLM production serving
set -euo pipefail

pip install vllm

# Model id assumes the Hugging Face repo Qwen/Qwen3-14B; swap in your own checkpoint if needed.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-14B \
  --port 8000 \
  --enable-prefix-caching \
  --tensor-parallel-size 1   # bump to N for multi-GPU

Where to go from here

GPU chooser →

Just the hardware-pick question, with side-by-side compare, price/perf scatter, and score breakdown per dimension.

Custom build engine →

Reverse direction: I have this hardware — what fits? Use this to validate the recommendation against your actual rig.

Quant Advisor →

Drill into the model picks: Q4_K_M vs Q5_K_M vs Q8 on your specific VRAM, with quality curve + VRAM fit visualization.

TCO calculator →

Tune every assumption: utilization, electricity rate, cloud equivalent rate, amortization horizon.

Curated stacks →

18 hand-curated stack recipes for specific outcomes (coding agent, offline RAG, dual-3090, Mac cluster, iPhone, etc.)

Apps directory →

37 curated apps that plug into the runtime + model picks above: chat UIs, coding agents, RAG, voice, image, browser extensions, mobile, editor plugins. Filter by your VRAM + privacy posture.