Stacks
Executable architectures. Each stack ties model + runtime + memory + protocol + hardware into one shipping recipe — with hardware budget, setup commands, expected outcome, and the operator-grade reasoning behind every pick. Where the catalog tells you what exists and the operational reviews tell you whether to use it, the stacks tell you how to actually assemble the thing.
Build a local coding-agent stack (May 2026)
A coding agent that drafts diffs, runs tests, and edits files autonomously — entirely on your hardware, with persistent memory of the codebase.
Build an RTX 4090 AI workstation stack (May 2026)
A general-purpose AI workstation built around a single RTX 4090 24GB — runs a 32B-class coding model, a 14B chat model, and serves agent workloads to a small team on the same box.
Build a Mac-native AI stack (May 2026)
A Mac-native local AI stack that takes full advantage of unified memory and (optionally) scales across multiple Macs via Thunderbolt 5 — runs 32B-class models comfortably on a single Mac, frontier-class models across a cluster.
Build an offline RAG workstation stack (May 2026)
Search and chat with thousands of private documents on a single workstation that never phones home — for legal, medical, financial, or any data class that legally cannot leave the network.
Build a distributed inference homelab stack (May 2026)
Run 70B-405B class models across 2-4 GPU machines on a controlled LAN. Real interconnect requirements; real monitoring; real failure modes. The path beyond 'just buy a bigger card.'
Build a memory-enabled local agent stack (May 2026)
An agent that takes a task, remembers what happened in prior sessions, retrieves relevant context from prior decisions, and avoids re-discovering the same dead ends — all on local hardware with no data leaving the machine.
Build a 16GB VRAM local AI stack (May 2026)
A useful local AI workstation on a 16GB VRAM card (RTX 4060 Ti 16GB, RTX 4080 Super, RTX 5070, M-class Apple Silicon with 24GB+ unified memory). Daily-driver quality at the budget tier without trying to pretend it's a 4090.
Build a local reasoning-model stack (May 2026)
Run a reasoning-class model locally for math, code synthesis, multi-step analysis, and long-horizon problem-solving. Honest about the reasoning-token cost (extra 200-2000 tokens per query) and the hardware requirements that follow.
Build a local vision-model stack (May 2026)
Run a vision-language model locally for image understanding, document Q&A over screenshots, OCR-plus-reasoning, and visual analysis tasks. All processing on your hardware; images never leave the network.
Build a multi-machine Apple Silicon cluster (May 2026)
Run frontier-class models (DeepSeek V3, Llama 4 Maverick) locally on a personal-affordable Apple Silicon cluster. Honest about what works (Thunderbolt 5 RDMA, Exo's pipeline parallel) and what doesn't (NVIDIA-only frameworks, training workloads, multi-tenant serving).
Build a fully offline coding stack (May 2026)
An autonomous coding agent that runs entirely on a workstation with no outbound network egress. Pre-staged models, audited dependency chain, network-monitored verification — the stack that holds up to real air-gap audits.
Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs
Build a 70B-class local-AI workstation for under $2,500 total system cost. Run Llama 3.3 70B / Qwen 2.5 72B / DeepSeek R1 Distill Llama 70B at AWQ-INT4 with 8K context and serve 4-8 concurrent agent loops via vLLM tensor-parallel-2.
Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink
Build a new-card 70B-class AI workstation with FP8 support and ~30% faster inference than dual-3090, accepting that NVLink absence costs ~10-15% of theoretical tensor-parallel performance.
Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling
Build a homelab inference rig capable of 100B-class dense or large MoE models on 4× used RTX 3090s. Total system cost $3,500-5,000. Plan for basement/server-room placement; 1400W under load.
Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path
For users who already own a 4090 OR a 3090 and want to add the other card without selling the existing one. Run 70B-class models via llama.cpp layer-split with manual --tensor-split tuning. Document the asymmetric tradeoffs honestly.
4× H100 SXM tensor-parallel workstation — frontier MoE serving reference
Deploy frontier-MoE serving (Qwen 3.5 235B-A17B, DeepSeek V4 Pro, Llama 4 Maverick) at organizational scale. High-concurrency multi-tenant inference with vLLM tensor-parallel-4 and SGLang RadixAttention.
iPhone on-device AI stack — Llama 3.2 3B / Phi-3.5 Mini via MLX Swift
Ship an iOS app that runs a 3B-class LLM on-device for one of: summarization, classification, on-the-edge voice processing, or offline tutoring. Battery-aware enough to pass App Store review. Honest about thermal throttling.
Android on-device AI stack — Phi-3.5 Mini / Llama 3.2 3B via MLC LLM or Qualcomm AI Hub
Ship an Android app that runs a 3-4B-class LLM on-device for summarization, classification, on-the-edge voice processing, or offline-first features. Battery + thermal aware. Decide between MLC LLM (cross-device) vs Qualcomm AI Hub (Snapdragon-locked) honestly.
What a stack is (and isn't)
A stack is the answer to the question “what do I actually install, in what order, on what hardware, to get this specific outcome.” It pins five layers — model, runtime, memory (KV cache + RAG store), protocol (API surface, MCP, agent framework), and hardware — into a single recipe with concrete versions, commands, and the expected end state. The stack assumes you know what each layer does in the abstract and shows you which specific picks compose into a working system.
It is not a generic listicle. There is no “top-10-things-you-might-want” energy. A stack either runs on the hardware it specifies or it doesn't; a stack either produces the named outcome or it doesn't. Stacks that don't hold up under that test get rewritten, deprecated, or marked “needs reproduction.”
How stacks relate to the rest of the site
The model catalog and tool catalog tell you what exists. The hardware verdicts tell you which hardware is worth your money. The operator paths tell you the order to learn things in. Stacks tell you the assembled architecture for one specific use case — what model + what runtime + what hardware + what wrapper code. When a buyer guide recommends a card, that card shows up under “Featured in stacks” on its verdict page — stacks are the glue between the catalog and the deployed system.
Editorial discipline
Every stack pick on these pages carries a one-line operator note that explains why this and not the obvious alternative. That's the difference between a recipe that compounds authority and a recipe that's a glorified affiliate list. Stacks are graph nodes — every component picked here adds an incoming edge to a catalog entity. Suggestions for the next stack we should write? Drop us a line.