
Build an RTX 4090 AI workstation stack (May 2026)

A general-purpose AI workstation built around a single RTX 4090 24GB — runs a 32B-class coding model, a 14B chat model, and serves agent workloads to a small team on the same box.

By Fredoline Eruo · Last reviewed 2026-05-06 · ~11 min read

The stack
  1. Hardware · GPU (the hardware that defines this stack)
    rtx-4090

    24GB of VRAM is the first-class consumer tier in May 2026 — the 4080's 16GB has no headroom for 32K context on 32B models, and the 5090 helps but costs 2-3x as much for ~30% more throughput. The 4090 stays the sweet spot until 5090 supply normalises.

  2. Tool · Inference engine (production-grade serving)
    vllm

    vLLM over Ollama for the production-serving role on this box — continuous batching matters when 3-5 users hit the same model concurrently, and the OpenAI-compatible endpoint makes Open WebUI / AnythingLLM / OpenHands plug in without adapter code. Keep Ollama installed alongside for ad-hoc model swaps.

  3. Tool · Model-swap layer (ad-hoc experiments)
    ollama

    Ollama lives next to vLLM, not as competition: it owns the 'I want to try a new model right now' surface. One-line model pulls beat re-rendering vLLM Docker configs every time. Run on a different port (11434) to avoid clashes.

  4. Model · Coding model (32B class)
    qwen-2.5-coder-32b-instruct

    Qwen 2.5 Coder 32B AWQ-INT4 is the strongest model that fits 24GB with real context room — beats DeepSeek Coder V2 Lite on coding benchmarks at the same VRAM budget. Reserve 8-10GB of VRAM for KV cache; 32K context is the sweet spot.

  5. Model · Chat model (low-latency general-purpose)
    qwen-3-14b

    Qwen 3 14B at a 4-bit quant (Ollama's default pull) fits with massive headroom — FP16 weights alone would be ~28GB and wouldn't fit the card at all. Serves chat, summaries, and tool-call workloads at 60+ tok/s with low TTFT on a warm prefix. The right default when you don't need coding-class reasoning.

  6. Tool · Team chat frontend
    openwebui

    Open WebUI over AnythingLLM for the chat-frontend role on a workstation: better multi-user ergonomics, cleaner pipelines for tool calls. AnythingLLM wins for RAG-first workspaces; Open WebUI wins when you want a polished chat UI for a small team.

  7. Tool · RAG workspace frontend
    anythingllm

    Pairs with Open WebUI on the same box — different roles. AnythingLLM owns the 'chat with my documents' workflow; Open WebUI owns 'chat with the model directly.' Each runs as its own Docker container and points at the same vLLM endpoint.

Why this stack on this hardware

The 4090's 24GB VRAM creates a specific architectural window. 16GB cards cannot run a 32B-class coding model with real context room; 48GB+ cards (L40S / 6000 Ada / 5090 paired) shift you into a different cost tier. The 4090 sits in the middle and rewards a stack that respects both constraints — fits the model AND keeps headroom for batch serving across multiple users.

The headline architectural choice this stack makes: vLLM and Ollama coexist on the same machine, serving different workflows. Most guides treat them as competitors; on a 24GB workstation they're complementary — vLLM owns the “production endpoint we serve to frontends” role; Ollama owns the “I just want to try this model right now” role. Different ports, different lifecycle expectations, different model rotation rates.

Step-by-step setup

1. Bring up vLLM as the production endpoint

# Run vLLM on port 8000 — the production-facing endpoint
docker run --gpus all -d --name vllm \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --restart unless-stopped \
  vllm/vllm-openai:v0.17.1 \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --enable-chunked-prefill

Note --gpu-memory-utilization 0.85 rather than 0.9 — leaving 15% (3.6GB) headroom for Ollama to coexist on the same card. Ollama doesn't pre-reserve VRAM the way vLLM does, so the same allocation that worked at 0.9 in a single-runtime setup will OOM the moment Ollama loads something.
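The 0.85 split is checkable with back-of-envelope arithmetic (assuming a 24GB card and Qwen2.5-32B's published config — 64 layers, 8 KV heads under GQA, head dim 128, FP16 KV cache):

```shell
# Reservation vs. headroom at --gpu-memory-utilization 0.85 on 24GB
awk 'BEGIN { printf "reserved: %.1f GB  headroom: %.1f GB\n", 24 * 0.85, 24 * 0.15 }'

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (FP16)
per_token=$((2 * 64 * 8 * 128 * 2))
echo "KV at 32K context: $((per_token * 32768 / 1073741824)) GiB"
```

An ~8GiB KV pool at full 32K context lands inside the 8-10GB the model section recommends reserving, and vLLM carves it out of the ~20.4GB it pre-allocates — the remaining ~3.6GB is what Ollama gets to work with.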

2. Add Ollama on a different port for ad-hoc work

# Ollama on its default port (11434) — no Docker, native install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a chat model that's distinct from the coding model in vLLM
ollama pull qwen3:14b

# Verify both runtimes are alive on different ports
curl http://localhost:8000/v1/models   # vLLM (Qwen Coder 32B)
curl http://localhost:11434/api/tags   # Ollama (Qwen 3 14B)

Both can run concurrently because the allocations fit side by side — vLLM holds Qwen Coder plus its KV-cache pool in a standing reservation; Ollama loads Qwen 3 14B on demand and unloads it when idle. Total VRAM under load is ~22GB; once Ollama's idle timeout fires, usage falls back to vLLM's standing reservation.
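The 14B slot's footprint is easy to sanity-check (weights ≈ params × bits / 8; KV cache and runtime overhead excluded):

```shell
# Weight-only footprint of a 14B model at two precisions
awk 'BEGIN { p = 14e9
  printf "FP16: %.0f GB\n", p * 16 / 8 / 1e9
  printf "INT4: %.0f GB\n", p * 4  / 8 / 1e9 }'
```

FP16 weights alone are 28GB — past the card's 24GB before any KV cache — which is why the 4-bit pull (~7GB) is what coexists with vLLM here.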

3. Wire Open WebUI as the team frontend

docker run -d --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  --restart unless-stopped \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  -e ENABLE_OLLAMA_API=true \
  -e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
  ghcr.io/open-webui/open-webui:latest

Open WebUI sees both endpoints and presents them as model options. Users pick “Qwen Coder 32B (vLLM)” for coding tasks and “Qwen 3 14B (Ollama)” for chat. Same UI; different runtimes; the model switcher is transparent.
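A sketch of what the frontends send under the hood — the same OpenAI-compatible call works from any client (the model field must match the --model value vLLM was launched with; vLLM accepts any non-empty API key unless one was configured):

```shell
# Direct chat-completion request to the vLLM endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Write a binary search in Python"}],
        "max_tokens": 128
      }'
```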

4. Add AnythingLLM for the RAG-workspace surface

docker run -d --name anythingllm \
  -p 3001:3001 \
  --add-host=host.docker.internal:host-gateway \
  --restart unless-stopped \
  --cap-add SYS_ADMIN \
  -v anythingllm-storage:/app/server/storage \
  -e LLM_PROVIDER="generic-openai" \
  -e GENERIC_OPEN_AI_BASE_PATH="http://host.docker.internal:8000/v1" \
  -e GENERIC_OPEN_AI_MODEL_PREF="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ" \
  mintplexlabs/anythingllm

AnythingLLM gets the same vLLM endpoint as Open WebUI — they share the model, isolate the workspace. Different role: Open WebUI for direct chat, AnythingLLM for “chat with my documents.” Both alive on different ports (3000 and 3001).

OS-level tuning that actually matters

The configuration that affects throughput on this stack (and the configuration that doesn't):

  • NVIDIA driver >= 555. Older drivers have FlashAttention-2 kernel selection bugs that silently halve vLLM throughput. Run nvidia-smi --query-gpu=driver_version --format=csv to verify.
  • nvidia-persistenced running. Without it, the GPU re-initializes on every CUDA context create, adding 100-300ms to first-token latency on cold starts. Enable with sudo systemctl enable --now nvidia-persistenced.
  • NVMe scheduler set to none. The default mq-deadline scheduler adds latency on model load. echo none | sudo tee /sys/block/nvme0n1/queue/scheduler or pin in /etc/udev/rules.d/.
  • System RAM 64GB minimum. The OS file cache holds the model weights between vLLM cold starts; 32GB systems re-read from disk on every restart, adding 20-40 seconds to startup.
  • Power limit set to 350W for thermal sustainability under continuous load. The 4090's 450W TDP is fine for bursts but reduces card lifetime in sustained inference. nvidia-smi -pl 350 sets it; pin in a systemd service to make it persistent.
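The power-limit bullet says to pin nvidia-smi -pl 350 via systemd; a minimal sketch (unit name and path are illustrative, not from any NVIDIA package):

```shell
# Oneshot unit that reapplies the power limit at every boot
sudo tee /etc/systemd/system/nvidia-powerlimit.service >/dev/null <<'EOF'
[Unit]
Description=Pin GPU power limit to 350W
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-powerlimit.service
```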

What does NOT meaningfully matter on this stack: PCIe bifurcation (single-card workloads don't care), RAM frequency above DDR5-5600 (CPU-side bandwidth isn't the bottleneck), CPU core count above 8 (vLLM threading is GPU-bound).

Failure modes you'll hit

  1. OOM when both vLLM and Ollama load. vLLM at --gpu-memory-utilization 0.9 + Ollama loading a 14B model = OOM. Drop vLLM to 0.85 (recommended above) or run Ollama with OLLAMA_KEEP_ALIVE=0 so it unloads aggressively.
  2. Open WebUI can't see vLLM models. host.docker.internal doesn't resolve on Linux by default. Either run with --add-host=host.docker.internal:host-gateway or use the host network mode (--network=host).
  3. Coil whine on light load. 4090s sing audibly under low-utilization GPU load. Power-limiting to 350W usually cures it; if it doesn't, the card is within spec (Nvidia's position) but you may want to RMA. Most stack-builders accept it as the price of consumer-tier hardware.
  4. Thermal throttling at 30+ minutes of sustained load. Stock 4090 cooling handles bursts but a tight chassis with one 120mm exhaust will hit 87°C and throttle. Verify with nvidia-smi --query-gpu=temperature.gpu --format=csv -l 1 during a long generation; add chassis fans or undervolt if it climbs past 80°C.
  5. Open WebUI persistent-volume corruption. Killing the container during write can corrupt the SQLite db. Mitigate with --restart unless-stopped (above) and an explicit volume backup before any docker rm.
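Concrete commands for the first two failure modes (the drop-in path follows systemd convention for Ollama's stock Linux service; adjust if your install differs):

```shell
# Failure mode 1: unload Ollama models the moment they go idle
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_KEEP_ALIVE=0"\n' |
  sudo tee /etc/systemd/system/ollama.service.d/keepalive.conf >/dev/null
sudo systemctl daemon-reload && sudo systemctl restart ollama

# Failure mode 2: make host.docker.internal resolve on Linux —
# add this flag to each frontend's docker run invocation:
#   --add-host=host.docker.internal:host-gateway
```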

Variations and alternatives

5090 swap. If you have a 5090, the architectural shape doesn't change — same vLLM + Ollama + frontends pattern. You get ~30% more throughput and FP4 support; not enough to justify the price jump on its own, but if you already own one, no reconfiguration is needed.

Multi-GPU 4090 variation. 2x 4090 with NVLink isn't a thing on consumer SKUs (NVIDIA disabled NVLink on Ada consumer); 2x over PCIe loses 30-40% of throughput to interconnect bandwidth. Usually not worth it unless the model genuinely won't fit. See /systems/distributed-inference for the math.

SGLang variation. If your team workflow is heavy on agent loops with stable system prompts (10+ tool calls per task on a fixed prefix), SGLang can replace vLLM for 1.3-1.7x aggregate throughput. The frontends and Ollama side stay the same.
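A sketch of the swap (flag names per SGLang's server CLI at time of writing — verify against your installed version; everything downstream keeps pointing at port 8000):

```shell
# Replace the vLLM container with SGLang's OpenAI-compatible server
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --port 8000 \
  --context-length 32768
```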

Linux vs Windows-WSL2. Both work. Linux is ~5% faster on raw throughput due to lower CUDA driver overhead and better filesystem performance for the model cache. WSL2 catches up on most workloads; the only place it regresses meaningfully is rapid model-swap workflows where the file cache matters most.

Going deeper