Execution layer · L3

Stacks

Executable architectures. Each stack ties model + runtime + memory + protocol + hardware into one shipping recipe — with hardware budget, setup commands, expected outcome, and the operator-grade reasoning behind every pick. Where the catalog tells you what exists and the operational reviews tell you whether to use it, the stacks tell you how to actually assemble the thing.

Stack · L3 · Workstation tier · 7 components

Build a local coding-agent stack (May 2026)

A coding agent that drafts diffs, runs tests, and edits files autonomously — entirely on your hardware, with persistent memory of the codebase.

RTX 4090 24GB or M3 Max 64GB · 64GB+ system RAM · 1TB NVMe · Linux or macOS
Stack · L3 · Workstation tier · 7 components

Build an RTX 4090 AI workstation stack (May 2026)

A general-purpose AI workstation built around a single RTX 4090 24GB — runs a 32B-class coding model, a 14B chat model, and serves agent workloads to a small team on the same box.

RTX 4090 24GB · 64GB DDR5 · 1TB NVMe · 850W PSU · Linux (Ubuntu 24.04 LTS recommended) or Windows 11 with WSL2
Stack · L3 · Workstation tier · 7 components

Build a Mac-native AI stack (May 2026)

A Mac-native local AI stack that takes full advantage of unified memory and (optionally) scales across multiple Macs via Thunderbolt 5 — runs 32B-class models comfortably on a single Mac, frontier-class models across a cluster.

M3 Max 64GB+ unified memory (or M4 Pro / M4 Max) · macOS 14.5+ (macOS 26.2+ for RDMA over Thunderbolt 5) · 1TB+ SSD · optional second Mac via Thunderbolt 5 for clustering
Stack · L3 · Workstation tier · 5 components

Build an offline RAG workstation stack (May 2026)

Search and chat with thousands of private documents on a single workstation that never phones home — for legal, medical, financial, or any data class that legally cannot leave the network.

RTX 4090 24GB or M3 Max 64GB · 64GB+ system RAM · 2TB NVMe (documents + vector store) · Linux with no internet egress allowed by default
Stack · L3 · Production tier · 4 components

Build a distributed inference homelab stack (May 2026)

Run 70B-405B class models across 2-4 GPU machines on a controlled LAN. Real interconnect requirements; real monitoring; real failure modes. The path beyond 'just buy a bigger card.'

2-4× GPU nodes (each: 1-2× A100 80GB or RTX 5090 32GB · 256GB+ RAM · 2TB NVMe) · 100Gbps Ethernet or InfiniBand interconnect · dedicated head node for Ray · managed switch · UPS recommended
Stack · L3 · Workstation tier · 8 components

Build a memory-enabled local agent stack (May 2026)

An agent that takes a task, remembers what happened in prior sessions, retrieves relevant context from prior decisions, and avoids re-discovering the same dead ends — all on local hardware with no data leaving the machine.

RTX 4090 24GB (or 5090 32GB) · 64GB+ system RAM · 1TB NVMe · Linux preferred · PostgreSQL on the same box for the SQL-MCP path
Stack · L3 · Homelab tier · 5 components

Build a 16GB VRAM local AI stack (May 2026)

A useful local AI workstation on a 16GB VRAM card (RTX 4060 Ti 16GB, RTX 4080 Super, RTX 5070 Ti) or on M-series Apple Silicon with 24GB+ unified memory. Daily-driver quality at the budget tier, without pretending it's a 4090.

RTX 4060 Ti 16GB / RTX 4080 Super 16GB / RTX 5070 Ti 16GB / RX 7800 XT 16GB · 32GB+ system RAM · 1TB NVMe · Linux / Windows / macOS
Stack · L3 · Workstation tier · 6 components

Build a local reasoning-model stack (May 2026)

Run a reasoning-class model locally for math, code synthesis, multi-step analysis, and long-horizon problem-solving. Honest about the reasoning-token cost (extra 200-2000 tokens per query) and the hardware requirements that follow.

RTX 4090 24GB (minimum) or 5090 32GB / M3-M4 Max 64GB · 64GB+ system RAM · 1TB NVMe · long-context-friendly thermal envelope (sustained generation runs hot)
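
The reasoning-token cost quoted above (200-2000 extra tokens per query) translates directly into wall-clock delay before the answer appears. A minimal arithmetic sketch follows; the decode speed of 25 tokens/s is a hypothetical placeholder, not a measured figure for any card listed here.

```python
def reasoning_overhead_seconds(extra_tokens: int, decode_tokens_per_s: float) -> float:
    """Extra wall-clock time spent generating reasoning tokens before the answer."""
    if decode_tokens_per_s <= 0:
        raise ValueError("decode_tokens_per_s must be positive")
    return extra_tokens / decode_tokens_per_s


# Hypothetical decode speed; measure your own with the runtime you deploy.
ASSUMED_DECODE_TOKENS_PER_S = 25.0

for extra in (200, 2000):
    delay = reasoning_overhead_seconds(extra, ASSUMED_DECODE_TOKENS_PER_S)
    print(f"{extra} reasoning tokens -> {delay:.0f} s of extra generation")
```

At the assumed speed the two ends of the range differ by a factor of ten in added latency, which is why sustained decode throughput matters more for this stack than for a plain chat model.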
Stack · L3 · Workstation tier · 6 components

Build a local vision-model stack (May 2026)

Run a vision-language model locally for image understanding, document Q&A over screenshots, OCR-plus-reasoning, and visual analysis tasks. All processing on your hardware; images never leave the network.

RTX 4090 24GB / 5090 32GB / M3-M4 Max 64GB (vision tokens consume more VRAM than text) · 64GB+ system RAM · 1TB NVMe (image cache + model weights)
Stack · L3 · Production tier · 4 components

Build a multi-machine Apple Silicon cluster (May 2026)

Run frontier-class models (DeepSeek V3, Llama 4 Maverick) locally on a personally affordable Apple Silicon cluster. Honest about what works (Thunderbolt 5 RDMA, Exo's pipeline parallelism) and what doesn't (NVIDIA-only frameworks, training workloads, multi-tenant serving).

4-8× M4 Pro Mac Mini (each: 64GB unified memory recommended, 32GB minimum) · macOS 26.2+ on every node (RDMA prerequisite) · Thunderbolt 5 daisy-chain or hub · 10GbE backup network · 1500W UPS
Stack · L3 · Workstation tier · 7 components

Build a fully offline coding stack (May 2026)

An autonomous coding agent that runs entirely on a workstation with no outbound network egress. Pre-staged models, audited dependency chain, network-monitored verification — the stack that holds up to real air-gap audits.

RTX 4090 24GB · 64GB+ system RAM · 2TB NVMe · Linux (Ubuntu 24.04 LTS recommended) with iptables egress block · pre-staged models on local mirror
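
One way to sanity-check the egress block from inside the workstation is to attempt outbound TCP connections and confirm that every one fails. The sketch below is an illustration only, not the stack's own verification procedure: `find_egress_leaks` and the probe targets are hypothetical names and values, and it covers TCP only (not UDP, DNS, or ICMP), so it complements rather than replaces network-side monitoring.

```python
import socket
from typing import Callable, Iterable


def find_egress_leaks(
    targets: Iterable[tuple[str, int]],
    timeout_s: float = 2.0,
    connect: Callable[..., socket.socket] = socket.create_connection,
) -> list[tuple[str, int]]:
    """Return the targets that accepted an outbound TCP connection.

    An empty list means every attempt failed, which is what an egress
    block should produce. `connect` is injectable so the check can be
    exercised without touching the network.
    """
    leaks = []
    for host, port in targets:
        try:
            conn = connect((host, port), timeout=timeout_s)
        except OSError:
            continue  # refused, unreachable, or timed out: blocked as intended
        conn.close()
        leaks.append((host, port))
    return leaks


# Example use on the air-gapped box (IP literals avoid a DNS lookup):
#   leaks = find_egress_leaks([("1.1.1.1", 443), ("8.8.8.8", 53)])
#   print("egress blocked" if not leaks else f"LEAK: {leaks}")
```

Running the check on a schedule and alerting on any non-empty result is a cheap regression guard against a firewall rule being flushed by an update.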
Stack · L3 · Workstation tier · 7 components

Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs

Build a 70B-class local-AI workstation for under $2,500 total system cost. Run Llama 3.3 70B / Qwen 2.5 72B / DeepSeek R1 Distill Llama 70B at AWQ-INT4 with 8K context and serve 4-8 concurrent agent loops via vLLM tensor-parallel-2.

2× RTX 3090 24GB (used) · NVLink 3-slot bridge · ATX motherboard with 2× PCIe 4.0 x16 slots (e.g. ASRock Rack ROMED8-2T or MSI X670E Carbon) · 1000W Platinum PSU · 64GB DDR5 · 1TB NVMe · Ubuntu 22.04 LTS or 24.04 LTS · 3-slot or open-frame chassis with intake + inter-card cooling
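
The claim that a 70B model at INT4 with 8K context fits on two 24GB cards can be checked with back-of-envelope arithmetic. The sketch below rests on stated assumptions: a nominal 4 bits per weight (ignoring quantization scale and zero-point overhead), the Llama 3 70B attention shape (80 layers, 8 KV heads, head dimension 128), and an FP16 KV cache. Activations, CUDA context, and runtime overhead are not modeled.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights, ignoring scale/zero-point overhead."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """KV-cache bytes one token occupies: keys and values, for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed shape: Llama 3 70B family (80 layers, 8 KV heads, head dim 128).
weights = weight_bytes(70e9, 4)
kv_per_token = kv_cache_bytes_per_token(80, 8, 128)
kv_per_sequence = kv_per_token * 8192   # one full 8K-context sequence
total_vram = 2 * 24 * GIB               # two 24GB cards
headroom = total_vram - weights         # before activations and runtime overhead

print(f"weights:       {weights / GIB:.1f} GiB")
print(f"KV per 8K seq: {kv_per_sequence / GIB:.1f} GiB")
print(f"headroom:      {headroom / GIB:.1f} GiB")
print(f"full 8K sequences that fit (upper bound): {int(headroom // kv_per_sequence)}")
```

Under these assumptions the upper bound comes out at six full 8K sequences. Real capacity is lower once activations and runtime overhead are counted, though sequences shorter than 8K take proportionally less cache.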
Stack · L3 · Production tier · 7 components

Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink

Build a new-card 70B-class AI workstation with FP8 support and ~30% faster inference than dual-3090, accepting that NVLink absence costs ~10-15% of theoretical tensor-parallel performance.

2× RTX 4090 24GB · ATX motherboard with 2× PCIe 4.0/5.0 x16 slots + adequate spacing for triple-slot cards (workstation board recommended: TRX50, W790, X670E Carbon WiFi) · 1500W Platinum PSU · 64-128GB DDR5 · 1TB NVMe · Ubuntu 24.04 LTS · Open-frame or large server chassis with riser cables for slot-2 placement
Stack · L3 · Homelab tier · 7 components

Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling

Build a homelab inference rig capable of 100B-class dense or large MoE models on 4× used RTX 3090s. Total system cost $3,500-5,000. Plan for basement/server-room placement; 1400W under load.

4× RTX 3090 24GB (used) · NVLink 3-slot bridges (paired: cards 0-1 and 2-3) · WRX80 / W790 workstation motherboard with 4× PCIe 4.0 x16 slots · 1600W Platinum PSU OR dual 1000W PSUs synced via Add2PSU · 128GB DDR5 ECC · 2TB NVMe · Open-frame mining rig OR 4U server chassis · Inter-card cooling fans · Ubuntu 24.04 LTS
Stack · L3 · Homelab tier · 6 components

Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path

For users who already own a 4090 or a 3090 and want to add the other card without selling the existing one. Run 70B-class models via llama.cpp layer-split with manual --tensor-split tuning. Documents the asymmetric tradeoffs honestly.

1× RTX 4090 24GB · 1× RTX 3090 24GB · ATX motherboard with 2× PCIe 4.0 x16 slots · 1300W Platinum PSU · 64-128GB DDR5 · 1TB NVMe · Ubuntu 24.04 LTS · Open-frame chassis OR ATX tower with riser cable for slot 2
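
A reasonable starting point for manual --tensor-split tuning is to split in proportion to the VRAM each card actually has free, since the card driving a display usually has less. The helper below is a hypothetical sketch: `tensor_split_arg` and the 1024 MiB reserve are illustrative choices, and the free-memory figures are made-up inputs you would replace with values read from `nvidia-smi`.

```python
def tensor_split_arg(free_mib: list[int], reserve_mib: int = 1024) -> str:
    """Build a comma-separated proportion list from free VRAM per GPU.

    `reserve_mib` is held back on every card for KV cache growth and
    runtime overhead before the proportions are computed.
    """
    usable = [max(free - reserve_mib, 0) for free in free_mib]
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable VRAM left after the reserve")
    return ",".join(f"{u / total:.2f}" for u in usable)


# Made-up readings: GPU 0 drives a desktop session, GPU 1 is headless.
split = tensor_split_arg([22500, 24000])
print(f"--tensor-split {split}")
```

This only balances memory. Because the 4090 computes faster than the 3090, shifting a few more layers toward it may improve throughput, so treat the output as a first guess and benchmark from there.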
Stack · L3 · Production tier · 7 components

4× H100 SXM tensor-parallel workstation — frontier MoE serving reference

Deploy frontier-MoE serving (Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick) at organizational scale. High-concurrency multi-tenant inference with vLLM tensor-parallel-4 and SGLang RadixAttention.

DGX-H100 8-GPU chassis OR equivalent 4× H100 SXM build (Supermicro AS-4124GO-NART, ASUS ESC8000-E11) · 4× H100 80GB SXM5 with NVLink-Switch · 2TB DDR5 ECC · 8TB+ NVMe (model weights + checkpoints) · 25 GbE or InfiniBand for cluster expansion · datacenter rack with 30 kW power + tier-3 cooling · Ubuntu 22.04 LTS with NVIDIA enterprise driver stack
Stack · L3 · Homelab tier · 6 components

iPhone on-device AI stack — Llama 3.2 3B / Phi-3.5 Mini via MLX Swift

Ship an iOS app that runs a 3B-class LLM on-device for one of: summarization, classification, on-the-edge voice processing, or offline tutoring. Battery-aware enough to pass App Store review. Honest about thermal throttling.

iPhone 15 Pro or newer (A17 Pro+ Neural Engine) · iPad M4 (38 TOPS NPU + 120 GB/s memory bandwidth) for tablet-tier · iOS 17.4+ for MLX Swift Apple Intelligence-class deployment · Mac with Xcode 15.4+ for the build/sign toolchain · ~5 GB Mac storage for the model + Xcode caches
Stack · L3 · Homelab tier · 8 components

Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub

Ship an Android app that runs a 3-4B-class LLM on-device for summarization, classification, on-the-edge voice processing, or offline-first features. Battery- and thermal-aware. Honest about the choice between MLC LLM (cross-device) and Qualcomm AI Hub (Snapdragon-locked).

Snapdragon 8 Gen 3 / 8 Elite (Hexagon NPU, 45-80 TOPS INT8) for the NPU path · or any modern Adreno-equipped device for the MLC LLM GPU path · Android 13+ for ExecuTorch / MLC LLM compatibility · Android Studio + NDK for build/sign · ~5GB workstation storage for the model + Android Studio caches

What a stack is (and isn't)

A stack is the answer to the question “what do I actually install, in what order, on what hardware, to get this specific outcome.” It pins five layers — model, runtime, memory (KV cache + RAG store), protocol (API surface, MCP, agent framework), and hardware — into a single recipe with concrete versions, commands, and the expected end state. The stack assumes you know what each layer does in the abstract and shows you which specific picks compose into a working system.
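
The five-layer definition above maps naturally onto a small record type. The sketch below is one hypothetical way to model it; the `Stack` class, its field names, and the placeholder values are illustrative and do not describe how the site stores its stacks.

```python
from dataclasses import dataclass

LAYERS = ("model", "runtime", "memory", "protocol", "hardware")


@dataclass(frozen=True)
class Stack:
    """One recipe: five pinned layers plus the commands and the expected end state."""
    name: str
    model: str
    runtime: str
    memory: str      # KV cache + RAG store
    protocol: str    # API surface, MCP, agent framework
    hardware: str
    setup_commands: tuple[str, ...] = ()
    expected_outcome: str = ""

    def missing_layers(self) -> list[str]:
        """Layers left blank; a complete stack returns an empty list."""
        return [layer for layer in LAYERS if not getattr(self, layer).strip()]


# Placeholder values only; a real stack pins concrete versions.
draft = Stack(
    name="local coding agent",
    model="<model + quantization>",
    runtime="<inference runtime + version>",
    memory="<vector store + KV cache settings>",
    protocol="",
    hardware="RTX 4090 24GB",
)
print(draft.missing_layers())
```

Making the record frozen reflects the idea that a stack is a pinned recipe: changing a layer means publishing a new stack, not mutating the old one.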

It is not a generic listicle. There is no “top-10-things-you-might-want” energy. A stack either runs on the hardware it specifies or it doesn't; a stack either produces the named outcome or it doesn't. Stacks that don't hold up under that test get rewritten, deprecated, or marked “needs reproduction.”

How stacks relate to the rest of the site

The model catalog and tool catalog tell you what exists. The hardware verdicts tell you which hardware is worth your money. The operator paths tell you the order to learn things in. Stacks tell you the assembled architecture for one specific use case — what model + what runtime + what hardware + what wrapper code. When a buyer guide recommends a card, that card shows up under “Featured in stacks” on its verdict page — stacks are the glue between the catalog and the deployed system.
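
The "Featured in stacks" listing described above is a reverse index from components to the stacks that use them. A minimal sketch, assuming a plain mapping of stack titles to component names; the `featured_in` function and the data layout are hypothetical, with titles and cards taken from the listing on this page.

```python
from collections import defaultdict


def featured_in(stacks: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert stack -> components into component -> sorted list of stacks."""
    index: dict[str, list[str]] = defaultdict(list)
    for stack_name, components in stacks.items():
        for component in components:
            index[component].append(stack_name)
    return {component: sorted(names) for component, names in index.items()}


stacks = {
    "Dual RTX 3090 workstation stack": ["RTX 3090 24GB"],
    "Quad RTX 3090 workstation stack": ["RTX 3090 24GB"],
    "Mixed RTX 4090 + 3090 workstation": ["RTX 4090 24GB", "RTX 3090 24GB"],
}

index = featured_in(stacks)
for component, names in index.items():
    print(component, "->", names)
```

Each entry in a component's list corresponds to one incoming edge from a stack to that catalog entity.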

Editorial discipline

Every stack pick on these pages carries a one-line operator note that explains why this and not the obvious alternative. That's the difference between a recipe that compounds authority and a recipe that's a glorified affiliate list. Stacks are graph nodes — every component picked here adds an incoming edge to a catalog entity. Suggestions for the next stack we should write? Drop us a line.