Stack · L3 execution · Workstation tier

Build a local reasoning-model stack (May 2026)

Run a reasoning-class model locally for math, code synthesis, multi-step analysis, and long-horizon problem-solving. Honest about the reasoning-token cost (extra 200-2000 tokens per query) and the hardware requirements that follow.

By Fredoline Eruo · Last reviewed 2026-05-06 · ~12 min read
The stack
  1. Model · Primary reasoning model
    deepseek-r1-distill-qwen-32b

    DeepSeek R1 Distill Qwen 32B is the reasoning model that actually runs in 24GB of VRAM via AWQ-INT4. It delivers stronger reasoning quality per parameter than the full DeepSeek R1 (which needs ~700GB of weights and is out of reach locally without a cluster). The distill gives ~80% of R1's reasoning at 5% of the VRAM.

  2. Model · Alternative reasoning model (Qwen team)
    qwq-32b

    QwQ 32B is the Qwen team's open reasoning model. Its reasoning style differs slightly from DeepSeek R1's: pick QwQ when you want shorter reasoning blocks and faster wall-clock answers; pick DeepSeek R1 Distill when you want longer, more thorough reasoning.

  3. Model · General model with reasoning toggle
    qwen-3-32b

    Qwen 3 32B has a reasoning-mode toggle (the <think> block convention) that you can enable per-query. Useful when most of your workload doesn't need reasoning — fall back to standard mode for chat, enable thinking for math / code / analysis.

  4. Tool · Inference engine
    vllm

    vLLM over Ollama for reasoning models: continuous batching matters because reasoning-token emissions are long (a single query can emit 5000+ tokens). Prefix caching helps when batched reasoning-mode queries share system prompts. KV-cache management matters more here than on chat models.

  5. Tool · Frontend with reasoning-block rendering
    openwebui

    Open WebUI renders <think> blocks as collapsible reasoning sections — the right UX for reasoning models. The user sees the conclusion first and can expand to inspect the reasoning. Cleaner than a wall of thinking tokens.

  6. Hardware · GPU (minimum tier for 32B AWQ + 32K context)
    rtx-4090

    RTX 4090 24GB is the floor. 32B AWQ + 32K context fits with ~2GB of headroom — enough for reasoning-block emission but tight. The 5090 32GB is the comfortable tier; an M3 Max or M4 Max with 64GB is a credible alternative via MLX-LM. A quick VRAM sanity check follows this list.
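
To verify the headroom claim on your own card, check VRAM while the model is loaded (assumes the standard NVIDIA driver tooling is installed):

# Total vs. used VRAM while vLLM is serving; the difference is the
# headroom left for KV-cache growth during reasoning-block emission.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv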

Why reasoning models change the calculus

Reasoning models — DeepSeek R1 family, QwQ, Qwen 3 in thinking mode — emit <think> blocks before their actual answers. These blocks contain the model's internal chain of thought, often 200-2000 tokens of intermediate reasoning that never becomes part of the final answer. The architectural reality this stack respects: reasoning models cost 2-5x more tokens per query than chat models, and the latency budget shifts accordingly.

For some workloads, this is dramatically worth it. Math problems, multi-step code synthesis, complex analysis tasks — reasoning models often beat chat models of the same parameter count by 20-40% on these specific benchmarks. For other workloads (chat, simple code edits, summarization), the reasoning-token tax is pure overhead and a chat model is the right pick.

The headline architectural choice this stack makes: 32B-class distilled reasoning models, not the frontier full models. The full DeepSeek R1 needs ~700GB of weights — impossible locally without a multi-machine cluster. The Distill Qwen 32B variant captures ~80% of the reasoning quality at 5% of the VRAM, fits an RTX 4090 in AWQ-INT4, and runs at 30-40 tok/s. That's the realistic local-reasoning-model tier.

Step-by-step setup

1. Bring up vLLM with DeepSeek R1 Distill Qwen 32B

# AWQ-INT4 fits 24GB with 32K context — but with reasoning-block
# emission, KV cache fills quickly. Conservative settings:
docker run --gpus all -d --name vllm \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --restart unless-stopped \
  vllm/vllm-openai:v0.17.1 \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --enforce-eager

--gpu-memory-utilization 0.85 rather than 0.9 because reasoning-block emission produces longer outputs than chat models — the KV cache needs more headroom. The --enforce-eager flag avoids CUDA graph compilation issues that some R1 distill versions trigger.
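
Once the container is up, a quick smoke test confirms the reasoning convention (a sketch assuming curl and jq are installed; the prompt is arbitrary). With the flags above (no reasoning parser configured), the <think> block arrives inline in the content field:

# Ask a small math question and print the raw reply; the content
# opens with a <think>...</think> block, then the final answer.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "temperature": 0.6,
        "max_tokens": 2048
      }' | jq -r '.choices[0].message.content'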

2. Optional — also load Qwen 3 32B for reasoning-toggle workflows

# Qwen 3 32B has a reasoning-mode toggle. Run as a second vLLM
# instance on a different port if you have headroom (or swap it in
# when needed):
docker run --gpus all --rm -d --name vllm-qwen3 \
  -p 8001:8000 \
  vllm/vllm-openai:v0.17.1 \
  --model Qwen/Qwen3-32B-AWQ \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --enable-chunked-prefill

# NOTE: two resident 32B AWQ instances need roughly 40GB of VRAM
# for weights alone, so co-hosting requires dual GPUs (pin each
# instance to its own card, e.g. --gpus '"device=1"') or a
# 48GB-class card. On a single 4090 or 5090, swap models rather
# than running both.
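
Toggling Qwen 3's reasoning per query goes through the chat template. A sketch assuming vLLM's chat_template_kwargs request passthrough, which Qwen 3's template reads as enable_thinking:

# Standard-mode (no <think> block) query against the Qwen 3 instance.
# Omit chat_template_kwargs, or set enable_thinking to true, to get
# the reasoning block back for math / code / analysis queries.
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-32B-AWQ",
        "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}],
        "chat_template_kwargs": {"enable_thinking": false}
      }' | jq -r '.choices[0].message.content'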

3. Wire Open WebUI as the reasoning-aware frontend

docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  ghcr.io/open-webui/open-webui:latest

Open WebUI renders <think> blocks as collapsible sections — the user sees the conclusion first and can expand to inspect the reasoning. This UX pattern is what makes reasoning models actually usable in chat. Without it, the user sees a wall of thinking tokens before the answer.

4. Configure for reasoning-aware sampling

# Reasoning models have specific sampling recommendations:
# - temperature 0.6-0.8 (higher than chat for diverse reasoning paths)
# - top_p 0.95
# - presence_penalty 0 (don't penalize reasoning-token repetition)
# - frequency_penalty 0

# In Open WebUI, set these as workspace defaults for the reasoning
# models. The UI exposes all four sliders.

# For code synthesis tasks specifically, lower temperature to 0.2
# during the answer block while keeping it 0.8 during the reasoning
# block. vLLM doesn't support per-block temperature natively, so
# this is a manual workflow; Open WebUI's per-message temperature
# control is the workaround.
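
The same defaults expressed as a raw API call (a sketch; the prompt is a placeholder, the values are the ones listed above):

# Reasoning-friendly sampling: temperature 0.6, top_p 0.95, and no
# repetition penalties, since reasoning legitimately restates itself.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
        "messages": [{"role": "user", "content": "Prove the sum of two odd integers is even."}],
        "temperature": 0.6,
        "top_p": 0.95,
        "presence_penalty": 0,
        "frequency_penalty": 0,
        "max_tokens": 4096
      }'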

The reasoning-token tax

The cost reasoning models impose on every query, with honest numbers:

  • Trivial questions (“what's the capital of France”): reasoning models still emit 100-300 thinking tokens. Pure overhead; use a chat model.
  • Standard chat / explanation: 300-800 thinking tokens. Marginal benefit vs chat models; sometimes worth it for the polish.
  • Math problems / step-by-step analysis: 500-1500 thinking tokens. Strong benefit vs chat models on correctness.
  • Complex code synthesis / architecture decisions: 1000-3000 thinking tokens. Substantial benefit; this is where reasoning models earn their tax.
  • Multi-step planning / proof construction: 2000-5000+ thinking tokens. The tier where reasoning models genuinely outperform anything else available locally.

The cost in wall-clock time on RTX 4090: each thinking token costs ~25-30ms (the same throughput as answer tokens). A query with 2000 thinking tokens adds ~50-60 seconds before the actual answer starts streaming. Plan UX accordingly — show progress; never block-wait silently.
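
To measure the tax on your own hardware, time one reasoning-heavy query and read the token counts back from the usage field (a sketch assuming jq; the prompt is illustrative):

# completion_tokens includes the thinking tokens, so
# completion_tokens / elapsed seconds = effective throughput.
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
        "messages": [{"role": "user", "content": "Plan a five-step migration from REST to gRPC for a payments service."}],
        "temperature": 0.6,
        "max_tokens": 8192
      }' | jq '.usage'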

Failure modes you'll hit

  1. Reasoning blocks leak into structured output. Some clients parse model output as JSON; the <think> block breaks the parse. Strip thinking tokens before structured-output parsing (a sketch follows this list), or instruct the model to skip reasoning when emitting JSON.
  2. Context-window exhaustion on long reasoning. Complex tasks can emit 5000+ thinking tokens. With 32K context and a 4K input prompt, that leaves ~23K tokens for reasoning + answer. Most queries fit; pathological cases don't. Use a reasoning model with longer context if you hit this regularly.
  3. OOM on KV cache during reasoning. KV cache scales with output length. Long thinking blocks blow past VRAM budgets sized for chat-class outputs. Set a conservative --gpu-memory-utilization (0.85, not 0.9).
  4. QwQ vs DeepSeek R1 reasoning style mismatch. QwQ's reasoning is shorter and more direct; DeepSeek R1's is more thorough. Switching between them mid-conversation produces inconsistent UX. Pick one; stick with it per workflow.
  5. Sampler config drift. Reasoning models are more sensitive to sampler parameters than chat models. A temperature of 1.0 (chat default) often produces incoherent reasoning. Use 0.6-0.8.
  6. Tool-calling format confusion. Reasoning models trained on chain-of-thought sometimes emit reasoning inside tool-call JSON, breaking the parse. Newer reasoning-tuned models handle this; older ones don't. Test with your specific tool-calling client.
  7. Premature stopping on EOS during reasoning. Some configs treat </think> as a stop token. Verify that the stop-token list excludes reasoning-block delimiters.
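
For failure mode 1, a minimal strip pass before the JSON parser (a sketch assuming the <think>…</think> delimiters of the R1/QwQ convention; strip_think is a hypothetical helper name):

# Delete the <think>...</think> span, newlines included, before the
# payload reaches a JSON parser. perl -0 slurps the whole input so
# the regex can match across lines.
strip_think() {
  perl -0pe 's/<think>.*?<\/think>\s*//s'
}

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ",
       "messages": [{"role": "user", "content": "Return {\"status\": \"ok\"} as JSON, nothing else."}]}' \
  | jq -r '.choices[0].message.content' | strip_think | jq .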

Variations and alternatives

Apple Silicon variation. Replace vLLM + RTX 4090 with MLX-LM on an M3 or M4 Max with 64GB of unified memory. The unified-memory architecture handles reasoning-block emissions well; long-context throughput stays stable. Pick this when you're Apple-native, and expect a ~30% throughput drop.
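
A sketch of the Apple Silicon bring-up, assuming the mlx-lm package and a community 4-bit conversion of the distill (the exact repo name below is illustrative, not verified):

# Serve an OpenAI-compatible endpoint via MLX-LM on the Mac itself.
pip install mlx-lm
python -m mlx_lm.server \
  --model mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit \
  --port 8000
# Point Open WebUI at http://localhost:8000/v1 exactly as before.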

SGLang variation. Replace vLLM with SGLang if you process many reasoning queries with shared system prompts (batch reasoning workflows). RadixAttention's prefix tree reuses the shared-prefix KV cache across queries, compounding the savings in reasoning-heavy batches.
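
A minimal SGLang bring-up under the same model and context settings (a sketch; the flags follow SGLang's documented launch_server interface):

# SGLang server; RadixAttention prefix caching is on by default.
pip install "sglang[all]"
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-AWQ \
  --port 8000 \
  --context-length 32768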

Higher-VRAM variation. Dual RTX 4090s let you run two reasoning models simultaneously, one per GPU: DeepSeek R1 Distill for thorough reasoning, QwQ for fast reasoning, switched per query type. The same pair in TP=2, or a single RTX 5090 32GB, instead buys one model far more KV-cache headroom for long reasoning blocks.

Cloud-API hybrid. Use full DeepSeek R1 (not Distill) via DeepSeek's API for the hardest tasks; fall back to local Distill for the routine ones. Open WebUI's provider abstraction makes the dual-backend pattern natural.
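
In Open WebUI the hybrid is one environment change: OPENAI_API_BASE_URLS takes semicolon-separated backends, with keys paired positionally in OPENAI_API_KEYS (the DeepSeek key below is a placeholder):

# Local vLLM plus DeepSeek's hosted API as a second provider.
docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1;https://api.deepseek.com/v1" \
  -e OPENAI_API_KEYS="any-string;sk-your-deepseek-key" \
  ghcr.io/open-webui/open-webui:latest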

Who should avoid this stack

  • Anyone whose workload is mostly chat or simple tasks. Reasoning-token tax is pure overhead. Use the workstation stack instead.
  • Anyone with strict latency budgets. Reasoning models add 50-300% to wall-clock time. If sub-second response is required, chat models or smaller reasoning models (7B-class) are the only viable options.
  • Anyone on 16GB VRAM. 32B reasoning models don't fit. Drop to 14B-class reasoning (less capable but still useful) or use API.
  • Anyone whose tool-calling client doesn't handle reasoning blocks. Older agent harnesses break on <think> blocks. Verify compatibility before committing.

Going deeper