Eight inputs (use case, budget, scale, privacy posture, and more) and we compose the full rig: GPU + runtime + 1-3 model picks + first-run workflow + cost rollup + ready-to-paste install script. Three tiers side by side so the upgrade path stays visible.
Every recommendation cites its rule-based scoring; measured tok/s figures carry a confidence chip wherever they appear. We don't invent numbers: when the data isn't there, we say so.
URL updates as you change fields — share or bookmark a result.
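Keeping the URL in sync with form state is a simple round-trip through the query string. A minimal sketch of the idea in Python; the field names (`use_case`, `budget`, etc.) are illustrative, not the configurator's actual parameter names:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def state_to_url(base, state):
    """Serialize configurator state into a shareable, bookmarkable URL."""
    return base + "?" + urlencode(sorted(state.items()))

def url_to_state(url):
    """Recover the state from a shared URL."""
    qs = parse_qs(urlsplit(url).query)
    return {k: v[0] for k, v in qs.items()}

# Hypothetical field names for illustration only.
state = {"use_case": "coding", "budget": "1500", "scale": "solo", "privacy": "local-only"}
print(state_to_url("https://example.com/build", state))
```

Sorting the items before encoding keeps the URL stable across visits, so the same inputs always produce the same shareable link.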
One step down on budget. What you give up; what you keep.
Your inputs, our recommendation. Read the full card below.
One step up on budget. What you'd gain; what it costs.
Production multi-user serving needs continuous batching and paged attention, both core advantages of vLLM over Ollama (which sees roughly 3-5× lower throughput on multi-user workloads).
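Why batching dominates: single-sequence decoding is memory-bandwidth-bound, so adding sequences to a step costs far less than running them serially. A toy back-of-envelope model (the timing constants are assumptions for illustration, not measurements of any runtime):

```python
# Toy decode-throughput model: each generated token costs one step.
# Assumed numbers: ~20 ms per step for batch size 1, ~1 ms marginal
# cost per extra sequence in the batch (memory-bandwidth-bound regime).
def tokens_per_sec(batch_size, per_step_ms=20.0, marginal_ms=1.0):
    """Aggregate tokens/s across all users sharing each decode step."""
    step_ms = per_step_ms + marginal_ms * (batch_size - 1)
    return batch_size * 1000.0 / step_ms

sequential = tokens_per_sec(1)    # one user served at a time
batched = tokens_per_sec(16)      # 16 users share every decode step
print(round(batched / sequential, 1))  # aggregate speedup
```

Under these assumed constants, 16 concurrent users get roughly 9× the aggregate throughput of serial serving, which is the shape of the vLLM-vs-Ollama gap on multi-user workloads.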
--tensor-parallel-size N for multi-GPU setups
--enable-prefix-caching for chat workloads
Strongest general-purpose model at 14B in 2026. Multilingual tokenizer (1.7× more efficient on Turkish/Asian languages than Llama). Reasoning mode available.
Microsoft's reasoning-focused 14B trained on heavy synthetic data. Beats Llama 3.1 8B on math/code benchmarks. Weaker creative writing.
python -m vllm.entrypoints.openai.api_server --model qwen3-14b --port 8000
http://localhost:11434 (Ollama) or :8000 (vLLM).
#!/usr/bin/env bash
# RunLocalAI stack installer — vLLM production serving
set -e
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model qwen3-14b \
  --port 8000 \
  --enable-prefix-caching \
  --tensor-parallel-size 1  # bump to N for multi-GPU
Just the hardware-pick question, with side-by-side compare, price/perf scatter, and score breakdown per dimension.
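Once the vLLM server from the install script is up, it speaks the OpenAI-compatible chat API. A minimal smoke test (the model name must match the `--model` flag; the endpoint path is the standard `/v1/chat/completions` route vLLM implements):

```python
import json
from urllib import request

# Payload per the OpenAI-compatible chat API; model matches --model above.
payload = {
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Say hello in Turkish."}],
    "max_tokens": 32,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send the request:
# resp = request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library works the same way against this endpoint, so apps built for hosted APIs can point at the local server unchanged.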
Reverse direction: I have this hardware — what fits? Use this to validate the recommendation against your actual rig.
Drill into the model picks: Q4_K_M vs Q5_K_M vs Q8 on your specific VRAM, with quality curve + VRAM fit visualization.
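The VRAM-fit check behind that comparison is simple arithmetic: weight size is roughly parameter count times effective bits per weight, plus headroom for KV cache and activations. A sketch under assumed constants (the bits-per-weight figures and 15% overhead are rules of thumb, not the site's exact model):

```python
# Approximate effective bits per weight for common GGUF quant levels
# (assumed round numbers; real files vary slightly by layer mix).
BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5}

def fits(params_b, quant, vram_gb, overhead=1.15):
    """Return (fits?, weight size in GB) for a params_b-billion model."""
    weight_gb = params_b * BITS[quant] / 8
    return weight_gb * overhead <= vram_gb, round(weight_gb, 1)

for q in BITS:
    ok, gb = fits(14, q, 12)  # a 14B model on a 12 GB card
    print(q, gb, "fits" if ok else "won't fit")
```

This is also the reverse-lookup direction: fix `vram_gb` to your card and scan quant levels (or model sizes) to see what clears the bar.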
Tune every assumption: utilization, electricity rate, cloud equivalent rate, amortization horizon.
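The cost rollup those assumptions feed is two terms: amortized hardware plus electricity, compared against a cloud-equivalent bill. A sketch of that math with illustrative numbers (all inputs here are placeholders, not the calculator's defaults):

```python
def monthly_cost(hw_price, horizon_months, watts, util, kwh_rate):
    """Local cost/month: amortized hardware + electricity at given utilization."""
    amortized = hw_price / horizon_months
    electricity = (watts / 1000) * 24 * 30 * util * kwh_rate
    return amortized + electricity

def cloud_equiv(m_tokens_per_month, rate_per_m_tokens):
    """Cloud cost/month for the same token volume."""
    return m_tokens_per_month * rate_per_m_tokens

# Illustrative inputs: $1600 rig over 36 months, 350 W at 25% utilization,
# $0.15/kWh, vs 120M tokens/month at $0.60 per million tokens.
local = monthly_cost(hw_price=1600, horizon_months=36, watts=350, util=0.25, kwh_rate=0.15)
cloud = cloud_equiv(m_tokens_per_month=120, rate_per_m_tokens=0.60)
print(round(local, 2), round(cloud, 2))
```

The amortization horizon is the biggest lever: stretch it and local wins sooner; shorten it and the cloud comparison tightens.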
18 hand-curated stack recipes for specific outcomes (coding agent, offline RAG, dual-3090, Mac cluster, iPhone, etc.)
37 curated apps that plug into the runtime + model picks above: chat UIs, coding agents, RAG, voice, image, browser ext, mobile, editor plugins. Filter by your VRAM + privacy posture.