Backlink-ready visuals

Resources

Original diagrams and reference assets for local AI. Free to embed in articles, blog posts, GitHub READMEs, slide decks — attribution appreciated. Each diagram is hand-built SVG, dark-mode aware, accessible, and dependency-free.

License: CC-BY-4.0. Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0

Methodology

#methodology

The trust layer behind every score grid, benchmark badge, and confidence tier in the catalog. Operator-language formulas, the four-state verification ladder for community submissions, the reproduction protocol that lifts rows up that ladder, and the honest limits of any rule-based system.
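A four-state ladder is easy to picture as code. A minimal sketch, assuming hypothetical state names (this page does not enumerate them); the only part taken from the text above is the shape: an ordered ladder that a successful reproduction moves a row up.

```python
from enum import IntEnum

class Verification(IntEnum):
    """Hypothetical four-state ladder; the real state names may differ."""
    UNVERIFIED = 0   # raw community submission
    PLAUSIBLE = 1    # passes the rule-based checks
    REPRODUCED = 2   # independently reproduced once
    VERIFIED = 3     # reproduced under the documented protocol

def promote(state: Verification) -> Verification:
    """A successful reproduction lifts a row one rung up the ladder."""
    return Verification(min(state + 1, Verification.VERIFIED))
```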

Interactive calculators

#calculators

Operator-grade math, no email gate, no tracking. Pure client-side. Same formulas the engine pages use — just exposed for sharing, citing, and embedding in articles.

Quantization cheat sheet

#quantization-cheat-sheet

The bits-per-parameter footprint of every quant format the will-it-run engine knows about, from FP16 down to Q2_K. Color-coded by quality tier so the trade-off is visible at a glance — production-safe ≥6 bpp, sweet spot 4–6 bpp, degraded <4 bpp. The dashed line marks the Q4_K_M production sweet spot for 24 GB cards.

[Diagram: Quantization formats · bits per parameter. Horizontal bar chart, color-coded by quality tier (≥ 6 bpp production-safe, 4–6 bpp sweet spot, < 4 bpp degraded), with a dashed line at the Q4_K_M sweet spot, 4.83 bpp. Lower bits = smaller VRAM, higher quality loss.]

  • FP16 · 16.00 bpp · training reference
  • FP8 · 8.00 bpp · Hopper / Ada-class only
  • Q8_0 · 8.50 bpp · near-FP16 quality
  • Q6_K · 6.60 bpp · production-safe
  • Q5_K_M · 5.50 bpp · good fidelity
  • Q4_K_M · 4.83 bpp · sweet spot, default
  • AWQ · 4.25 bpp · GPU-friendly INT4
  • EXL2_4 · 4.00 bpp · ExLlamaV2 4-bit
  • MLX_4 · 4.50 bpp · Apple Silicon 4-bit
  • Q3_K_M · 3.90 bpp · noticeable loss
  • Q2_K · 2.70 bpp · emergency-fit only
Lower bits = smaller VRAM but more quality loss. Q4_K_M and AWQ-INT4 hit the production sweet spot for 24 GB cards. The bits-per-param numbers come from the canonical BITS_PER_PARAM table the will-it-run engine uses — Q4_K_M is 4.83, not 4, because it preserves 6-bit weights on attention and FFN layers.
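The footprint math behind the chart is one multiplication. A minimal sketch using the bpp values above; the table and function here are illustrative stand-ins, not the engine's actual source:

```python
# Bits-per-parameter for common quant formats (values from the chart above).
BITS_PER_PARAM = {
    "FP16": 16.00, "FP8": 8.00, "Q8_0": 8.50, "Q6_K": 6.60, "Q5_K_M": 5.50,
    "Q4_K_M": 4.83, "AWQ": 4.25, "EXL2_4": 4.00, "MLX_4": 4.50,
    "Q3_K_M": 3.90, "Q2_K": 2.70,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """Weight footprint in GB: parameters x bits-per-param / 8 bits per byte."""
    return params_billion * BITS_PER_PARAM[quant] / 8

print(f"{weight_gb(32, 'Q4_K_M'):.1f} GB")  # 32B at Q4_K_M -> 19.3 GB of weights
```

Weights are only the floor; the GPU-memory-flow section below adds KV cache, activations, and runtime overhead on top.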

Local AI hardware checklist

#hardware-checklist

The eight stages between a parts list and a stable local-AI rig — VRAM, bandwidth, software class, PSU, airflow, PCIe lanes, NVMe, OS. Skip one and the bottleneck moves there. Use this as a pre-purchase gate before the spec sheet wins out over the workload.

[Diagram: Hardware-buying checklist. Eight numbered stages, each with a checkmark and one operator note; a right-rail call-to-action points to the custom-build engine at /will-it-run.]

  1. VRAM ≥ 12 GB · table-stakes for 13B-class models; below that you live in 7B world.
  2. Memory bandwidth ≥ 600 GB/s · decode is bandwidth-bound; compute matters less than throughput.
  3. CUDA-class software · or accept the ROCm tax: kernel patches, narrower runtime support.
  4. PSU ≥ 1000W Gold · sustained 350W+ cards punish undersized PSUs under load.
  5. Case airflow · sustained inference = sustained heat; mesh fronts beat glass panels.
  6. PCIe Gen4 x8+ for multi-GPU · NVLink-less dual cards lean on PCIe for tensor-parallel sync.
  7. NVMe Gen4 · KV-cache spill, model swap, and weight loading all hit storage hard.
  8. Linux for serious operators · vLLM / TensorRT-LLM / SGLang are Linux-first; Windows is a tax.
Eight stages between a parts list and a stable local-AI rig. The order is intentional: VRAM gates which models you can serve, bandwidth gates how fast they decode, the rest gate how long the rig stays up under sustained load.
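The same gates translate directly into a pre-purchase check. A minimal sketch with illustrative field names; the thresholds mirror the checklist above:

```python
from dataclasses import dataclass

@dataclass
class Build:
    # Illustrative fields; plug your parts list in here.
    vram_gb: float
    bandwidth_gbps: float   # memory bandwidth, GB/s
    cuda_class: bool        # CUDA-capable, vs. ROCm / other
    psu_watts: int
    mesh_airflow: bool
    pcie_gen4_x8: bool      # per-GPU lanes for multi-GPU builds
    nvme_gen4: bool
    linux: bool

CHECKS = [
    ("VRAM >= 12 GB",             lambda b: b.vram_gb >= 12),
    ("bandwidth >= 600 GB/s",     lambda b: b.bandwidth_gbps >= 600),
    ("CUDA-class software",       lambda b: b.cuda_class),
    ("PSU >= 1000W",              lambda b: b.psu_watts >= 1000),
    ("case airflow",              lambda b: b.mesh_airflow),
    ("PCIe Gen4 x8+ (multi-GPU)", lambda b: b.pcie_gen4_x8),
    ("NVMe Gen4",                 lambda b: b.nvme_gen4),
    ("Linux",                     lambda b: b.linux),
]

def bottlenecks(build: Build) -> list[str]:
    """Every stage you skip is where the bottleneck moves."""
    return [name for name, ok in CHECKS if not ok(build)]
```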

Local AI stack architecture

#local-ai-stack

The seven-layer mental model — hardware to workflow. Where each concern actually lives. Use this to explain to teammates why one runtime change doesn't cascade into hardware changes (and vice versa).

[Diagram: Local AI stack architecture. Seven layers, bottom to top: L1 Hardware (GPU, CPU, RAM) · L2 OS (Linux, macOS, Windows) · L3 Driver (CUDA, ROCm, Metal) · L4 Runtime (Ollama, vLLM, llama.cpp) · L5 Model (weights + tokenizer) · L6 Tool (Open WebUI, n8n, Cursor) · L7 Workflow (agents, RAG, scripts).]
Each layer is a separate concern. A failure mode at one layer almost never has a fix at a different layer — VRAM exhaustion is hardware, a kernel mismatch is driver, a tokenizer bug is model. Diagnose at the layer that owns the symptom.
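That triage rule is small enough to write down. A minimal sketch with an illustrative symptom list; extend it with your own failure modes:

```python
# Map each symptom to the layer that owns it (L1 hardware ... L7 workflow).
SYMPTOM_LAYER = {
    "CUDA out of memory":              "L1 hardware (VRAM)",
    "driver/library version mismatch": "L3 driver",
    "no kernel image for this arch":   "L3 driver",
    "model loads but outputs gibberish": "L5 model (tokenizer / chat template)",
    "slow decode while GPU sits idle": "L4 runtime (offload / config)",
    "agent loops forever":             "L7 workflow",
}

def diagnose(symptom: str) -> str:
    """Diagnose at the layer that owns the symptom, not a neighboring one."""
    return SYMPTOM_LAYER.get(symptom, "unmapped: start with L4 runtime logs")
```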

GPU memory flow under inference

#gpu-memory-flow

Where your VRAM actually goes. Model weights are the headline number, but KV cache + activations + runtime overhead consume meaningful budget — especially at long context. Answers the most common 'why does my model OOM mid-task?' question.

[Diagram: GPU VRAM partitioning during inference. Two bars for a 24 GB card running a 32B AWQ-INT4 model: at 4K context, 18 GB of weights plus KV cache, activations, and runtime overhead fill the full 24 GB; at 32K context the KV cache alone grows to ~10 GB, pushing the total to ~32 GB and spilling past the 24 GB limit.]
VRAM is not just weights. KV cache scales with context length and batch size, so a model that fits at 4K can spill at 32K on the same card. Plan headroom against the workload, not the weight file.
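The KV-cache growth in the diagram follows the standard transformer formula. A minimal sketch, assuming an illustrative GQA 32B-class shape (64 layers, 8 KV heads, head_dim 128) rather than any specific model:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim
    x context x batch x bytes per element (2 for FP16)."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

# Illustrative 32B-class shape: 64 layers, 8 KV heads (GQA), head_dim 128.
print(f"{kv_cache_gb(64, 8, 128, 4_096):.1f} GB")   # ~1.1 GB at 4K context
print(f"{kv_cache_gb(64, 8, 128, 32_768):.1f} GB")  # ~8.6 GB at 32K context
```

Context length and batch size are both linear terms, which is why an 8x context jump turns a comfortable fit into a spill.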

Multi-GPU topology — single, NVLink, PCIe, multi-node

#multi-gpu-topology

Total VRAM is not pooled VRAM. NVLink is not magic. PCIe-only multi-GPU is real but slower. Multi-node is bandwidth-bound. Settles the most common multi-GPU misconception in one panel.

[Diagram: Multi-GPU topology comparison. Four panels: single GPU, no interconnect · dual GPU over NVLink, direct GPU-to-GPU at ~600 GB/s · dual GPU over PCIe 5.0 x16 only, ~64 GB/s through the PCIe root · multi-node over InfiniBand / Ethernet, ~25–50 GB/s.]
Tensor parallelism is bandwidth-bound. NVLink moves the per-token activations and partial results between GPUs at order-of-magnitude higher rates than PCIe, and networked nodes are slower still. Match the topology to the parallelism strategy.
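The gaps are easiest to feel as wall-clock transfer time. A back-of-envelope sketch using the link speeds from the diagram; real collectives add latency and protocol overhead, so treat these as floors:

```python
# Milliseconds to move 1 GB of tensor-parallel traffic across each link class.
LINKS_GBPS = {"NVLink": 600, "PCIe 5.0 x16": 64, "InfiniBand/Eth": 35}  # GB/s

for link, gbps in LINKS_GBPS.items():
    print(f"{link:>14}: {1000 / gbps:5.1f} ms per GB")
# NVLink ~1.7 ms, PCIe ~15.6 ms, network ~28.6 ms, before latency
# and collective-algorithm overhead are counted.
```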

Runtime ecosystem 2026

#runtime-ecosystem

The runtime constellation around model weights. Each engine owns a distinct OS / hardware / workload sweet spot. Use this when picking between Ollama, vLLM, llama.cpp, MLX-LM, ExLlamaV2, SGLang, TensorRT-LLM.

[Diagram: Local LLM runtime ecosystem. Model weights (gguf · safetensors · mlx) at the center, ringed by seven runtimes: Ollama, llama.cpp, vLLM, SGLang, MLX-LM, ExLlamaV2, TensorRT-LLM, each color-coded by primary OS fit: Linux, macOS, Windows, or cross-platform.]
Runtimes are not interchangeable. Each is shaped by its primary OS and hardware target. Pick the runtime closest to your stack rather than forcing a Linux-first project onto Windows or vice versa.
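A first pass at that choice fits in a lookup table. A minimal sketch; the shortlists are a simplification of each project's primary target, not the full matrix at /compatibility:

```python
# First-pass runtime shortlist by (OS, accelerator); simplified, not exhaustive.
RUNTIME_FIT = {
    ("linux", "nvidia"):   ["vLLM", "SGLang", "TensorRT-LLM", "ExLlamaV2"],
    ("linux", "amd"):      ["vLLM", "llama.cpp"],      # expect the ROCm tax
    ("macos", "apple"):    ["MLX-LM", "llama.cpp", "Ollama"],
    ("windows", "nvidia"): ["Ollama", "llama.cpp"],    # Linux-first engines lag here
    ("any", "cpu"):        ["llama.cpp", "Ollama"],
}

def shortlist(os_name: str, accel: str) -> list[str]:
    return RUNTIME_FIT.get((os_name, accel), RUNTIME_FIT[("any", "cpu")])
```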

Local RAG architecture

#local-rag-architecture

The retrieval-augmented generation pipeline end to end: documents → chunker → embedder → vector DB → reranker → LLM → response. The reranker is the most undervalued stage; this diagram puts it where it earns its keep.

[Diagram: Local RAG architecture flow. Documents (PDF, MD, HTML) → Chunker (split + overlap) → Embedder (BGE, E5) → Vector DB (Qdrant, pgvector) → Reranker (cross-encoder) → LLM (local model) → Response (cited answer). Ingest and query paths meet at the vector DB; all stages run on the same machine: no network hop, no third-party API.]
A local RAG pipeline collapses retrieval and generation onto one box. The embedder and vector DB live alongside the LLM, so query latency is bounded by disk and memory rather than any external API.
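The whole pipeline fits in a short script. A minimal sketch using sentence-transformers for the embedder and reranker stages; the model names are common defaults rather than this page's recommendation, handbook.md stands in for your corpus, and the final LLM call is left as a stub:

```python
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")         # Embedder stage
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # Reranker stage

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Chunker stage: naive fixed-size split with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# Ingest: embed chunks into an in-memory "vector DB" (swap in Qdrant/pgvector).
docs = chunk(open("handbook.md").read())
index = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 20, final: int = 5) -> list[str]:
    """Query path: top-k by cosine similarity, then cross-encoder rerank."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(index @ q)[::-1][:k]
    scores = reranker.predict([(query, docs[i]) for i in top])
    return [docs[i] for i in top[np.argsort(scores)[::-1][:final]]]

# LLM stage: pass retrieve(question) to your local model as grounding context.
```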

Embedding these diagrams

How to use these in your own writing without fuss.

Every diagram on this page is pure inline SVG with semantic labels. Three ways to embed:

  1. Screenshot the diagram from this page — the easiest path. Include the citation line.
  2. Copy the SVG node from your browser’s dev tools and save it as a .svg file. Use as-is in articles or slides; the SVG is text-only, no embedded fonts, dark-mode-friendly via CSS class fills.
  3. Reference the source by linking to runlocalai.co/resources and naming the diagram by its anchor.

Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0

Found a diagram useful in something you published? We’d love to see it — drop us a note at support@runlocalai.co.

More to come

Roadmap for the visual layer.

  • Shipped — Quantization-format cheat sheet (FP16 → Q2_K)
  • Shipped — Hardware-buying checklist (8 stages)
  • Hardware-tier decision tree ($0 → $4000+)
  • Local-AI privacy checklist
  • Runtime × OS compatibility matrix (already at /compatibility)
  • Coding-agent architecture (OpenHands + vLLM + RAG + sandbox)
  • Mobile / on-device path (NPU → runtime → small model)

All assets here are CC-BY-4.0 — same citation as the other diagrams. Suggested credit: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0