Backlink-ready visuals

Resources

Original diagrams and reference assets for local AI. Free to embed in articles, blog posts, GitHub READMEs, slide decks — attribution appreciated. Each diagram is hand-built SVG, dark-mode aware, accessible, and dependency-free.

License: CC-BY-4.0. Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0

Methodology

#methodology

The trust layer behind every score grid, benchmark badge, and confidence tier in the catalog. Operator-language formulas, the four-state verification ladder for community submissions, the reproduction protocol that lifts rows up that ladder, and the honest limits of any rule-based system.
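A four-state ladder is easy to picture as code. A minimal sketch, assuming hypothetical state names (this page does not enumerate them); the only part taken from the text above is the shape: an ordered ladder that a successful reproduction moves a row up.

```python
from enum import IntEnum

class Verification(IntEnum):
    """Hypothetical four-state ladder; the real state names may differ."""
    UNVERIFIED = 0   # raw community submission
    PLAUSIBLE = 1    # passes the rule-based checks
    REPRODUCED = 2   # independently reproduced once
    VERIFIED = 3     # reproduced under the documented protocol

def promote(state: Verification) -> Verification:
    """A successful reproduction lifts a row one rung up the ladder."""
    return Verification(min(state + 1, Verification.VERIFIED))
```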

Interactive calculators

#calculators

Operator-grade math, no email gate, no tracking. Pure client-side. Same formulas the engine pages use — just exposed for sharing, citing, and embedding in articles.

Quantization cheat sheet

#quantization-cheat-sheet

The bits-per-parameter footprint of every quant format the will-it-run engine knows about, from FP16 down to Q2_K. Color-coded by quality tier so the trade-off is visible at a glance — production-safe ≥6 bpp, sweet spot 4–6 bpp, degraded <4 bpp. The dashed line marks the Q4_K_M production sweet spot for 24 GB cards.

[Diagram: Quantization formats · bits per parameter. Horizontal bar chart, color-coded by quality tier (≥ 6 bpp production-safe, 4–6 bpp sweet spot, < 4 bpp degraded), with a dashed line at the Q4_K_M sweet spot, 4.83 bpp. Lower bits = smaller VRAM, higher quality loss.]

  • FP16 · 16.00 bpp · training reference
  • FP8 · 8.00 bpp · Hopper / Ada-class only
  • Q8_0 · 8.50 bpp · near-FP16 quality
  • Q6_K · 6.60 bpp · production-safe
  • Q5_K_M · 5.50 bpp · good fidelity
  • Q4_K_M · 4.83 bpp · sweet spot, default
  • AWQ · 4.25 bpp · GPU-friendly INT4
  • EXL2_4 · 4.00 bpp · ExLlamaV2 4-bit
  • MLX_4 · 4.50 bpp · Apple Silicon 4-bit
  • Q3_K_M · 3.90 bpp · noticeable loss
  • Q2_K · 2.70 bpp · emergency-fit only
Lower bits = smaller VRAM but more quality loss. Q4_K_M and AWQ-INT4 hit the production sweet spot for 24 GB cards. The bits-per-param numbers come from the canonical BITS_PER_PARAM table the will-it-run engine uses — Q4_K_M is 4.83, not 4, because it preserves 6-bit weights on attention and FFN layers.
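The footprint math behind the chart is one multiplication. A minimal sketch using the bpp values above; the table and function here are illustrative stand-ins, not the engine's actual source:

```python
# Bits-per-parameter for common quant formats (values from the chart above).
BITS_PER_PARAM = {
    "FP16": 16.00, "FP8": 8.00, "Q8_0": 8.50, "Q6_K": 6.60, "Q5_K_M": 5.50,
    "Q4_K_M": 4.83, "AWQ": 4.25, "EXL2_4": 4.00, "MLX_4": 4.50,
    "Q3_K_M": 3.90, "Q2_K": 2.70,
}

def weight_gb(params_billion: float, quant: str) -> float:
    """Weight footprint in GB: parameters x bits-per-param / 8 bits per byte."""
    return params_billion * BITS_PER_PARAM[quant] / 8

print(f"{weight_gb(32, 'Q4_K_M'):.1f} GB")  # 32B at Q4_K_M -> 19.3 GB of weights
```

Weights are only the floor; the GPU-memory-flow section below adds KV cache, activations, and runtime overhead on top.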

Local AI hardware checklist

#hardware-checklist

The eight stages between a parts list and a stable local-AI rig — VRAM, bandwidth, software class, PSU, airflow, PCIe lanes, NVMe, OS. Skip one and the bottleneck moves there. Use this as a pre-purchase gate before the spec sheet wins out over the workload.

[Diagram: Hardware-buying checklist. Eight numbered stages, each with a checkmark and one operator note; a right-rail call-to-action points to the custom-build engine at /will-it-run.]

  1. VRAM ≥ 12 GB · table-stakes for 13B-class models; below that you live in 7B world.
  2. Memory bandwidth ≥ 600 GB/s · decode is bandwidth-bound; compute matters less than throughput.
  3. CUDA-class software · or accept the ROCm tax: kernel patches, narrower runtime support.
  4. PSU ≥ 1000W Gold · sustained 350W+ cards punish undersized PSUs under load.
  5. Case airflow · sustained inference = sustained heat; mesh fronts beat glass panels.
  6. PCIe Gen4 x8+ for multi-GPU · NVLink-less dual cards lean on PCIe for tensor-parallel sync.
  7. NVMe Gen4 · KV-cache spill, model swap, and weight loading all hit storage hard.
  8. Linux for serious operators · vLLM / TensorRT-LLM / SGLang are Linux-first; Windows is a tax.
Eight stages between a parts list and a stable local-AI rig. The order is intentional: VRAM gates which models you can serve, bandwidth gates how fast they decode, the rest gate how long the rig stays up under sustained load.
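The same gates translate directly into a pre-purchase check. A minimal sketch with illustrative field names; the thresholds mirror the checklist above:

```python
from dataclasses import dataclass

@dataclass
class Build:
    # Illustrative fields; plug your parts list in here.
    vram_gb: float
    bandwidth_gbps: float   # memory bandwidth, GB/s
    cuda_class: bool        # CUDA-capable, vs. ROCm / other
    psu_watts: int
    mesh_airflow: bool
    pcie_gen4_x8: bool      # per-GPU lanes for multi-GPU builds
    nvme_gen4: bool
    linux: bool

CHECKS = [
    ("VRAM >= 12 GB",             lambda b: b.vram_gb >= 12),
    ("bandwidth >= 600 GB/s",     lambda b: b.bandwidth_gbps >= 600),
    ("CUDA-class software",       lambda b: b.cuda_class),
    ("PSU >= 1000W",              lambda b: b.psu_watts >= 1000),
    ("case airflow",              lambda b: b.mesh_airflow),
    ("PCIe Gen4 x8+ (multi-GPU)", lambda b: b.pcie_gen4_x8),
    ("NVMe Gen4",                 lambda b: b.nvme_gen4),
    ("Linux",                     lambda b: b.linux),
]

def bottlenecks(build: Build) -> list[str]:
    """Every stage you skip is where the bottleneck moves."""
    return [name for name, ok in CHECKS if not ok(build)]
```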

Local AI stack architecture

#local-ai-stack

The seven-layer mental model — hardware to workflow. Where each concern actually lives. Use this to explain to teammates why one runtime change doesn't cascade into hardware changes (and vice versa).

[Diagram: Local AI stack architecture. Seven layers, bottom to top: L1 Hardware (GPU, CPU, RAM) · L2 OS (Linux, macOS, Windows) · L3 Driver (CUDA, ROCm, Metal) · L4 Runtime (Ollama, vLLM, llama.cpp) · L5 Model (weights + tokenizer) · L6 Tool (Open WebUI, n8n, Cursor) · L7 Workflow (agents, RAG, scripts).]
Each layer is a separate concern. A failure mode at one layer almost never has a fix at a different layer — VRAM exhaustion is hardware, a kernel mismatch is driver, a tokenizer bug is model. Diagnose at the layer that owns the symptom.
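That triage rule is small enough to write down. A minimal sketch with an illustrative symptom list; extend it with your own failure modes:

```python
# Map each symptom to the layer that owns it (L1 hardware ... L7 workflow).
SYMPTOM_LAYER = {
    "CUDA out of memory":              "L1 hardware (VRAM)",
    "driver/library version mismatch": "L3 driver",
    "no kernel image for this arch":   "L3 driver",
    "model loads but outputs gibberish": "L5 model (tokenizer / chat template)",
    "slow decode while GPU sits idle": "L4 runtime (offload / config)",
    "agent loops forever":             "L7 workflow",
}

def diagnose(symptom: str) -> str:
    """Diagnose at the layer that owns the symptom, not a neighboring one."""
    return SYMPTOM_LAYER.get(symptom, "unmapped: start with L4 runtime logs")
```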

GPU memory flow under inference

#gpu-memory-flow

Where your VRAM actually goes. Model weights are the headline number, but KV cache + activations + runtime overhead consume meaningful budget — especially at long context. Answers the most common 'why does my model OOM mid-task?' question.

[Diagram: GPU VRAM partitioning during inference. Two bars for a 24 GB card running a 32B AWQ-INT4 model: at 4K context, 18 GB of weights plus KV cache, activations, and runtime overhead fill the full 24 GB; at 32K context the KV cache alone grows to ~10 GB, pushing the total to ~32 GB and spilling past the 24 GB limit.]
VRAM is not just weights. KV cache scales with context length and batch size, so a model that fits at 4K can spill at 32K on the same card. Plan headroom against the workload, not the weight file.
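The KV-cache growth in the diagram follows the standard transformer formula. A minimal sketch, assuming an illustrative GQA 32B-class shape (64 layers, 8 KV heads, head_dim 128) rather than any specific model:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim
    x context x batch x bytes per element (2 for FP16)."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

# Illustrative 32B-class shape: 64 layers, 8 KV heads (GQA), head_dim 128.
print(f"{kv_cache_gb(64, 8, 128, 4_096):.1f} GB")   # ~1.1 GB at 4K context
print(f"{kv_cache_gb(64, 8, 128, 32_768):.1f} GB")  # ~8.6 GB at 32K context
```

Context length and batch size are both linear terms, which is why an 8x context jump turns a comfortable fit into a spill.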

Multi-GPU topology — single, NVLink, PCIe, multi-node

#multi-gpu-topology

Total VRAM is not pooled VRAM. NVLink is not magic. PCIe-only multi-GPU is real but slower. Multi-node is bandwidth-bound. Settles the most common multi-GPU misconception in one panel.

[Diagram: Multi-GPU topology comparison. Four panels: single GPU, no interconnect · dual GPU over NVLink, direct GPU-to-GPU at ~600 GB/s · dual GPU over PCIe 5.0 x16 only, ~64 GB/s through the PCIe root · multi-node over InfiniBand / Ethernet, ~25–50 GB/s.]
Tensor parallelism is bandwidth-bound. NVLink moves the per-token activations and partial results between GPUs at order-of-magnitude higher rates than PCIe, and networked nodes are slower still. Match the topology to the parallelism strategy.
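The gaps are easiest to feel as wall-clock transfer time. A back-of-envelope sketch using the link speeds from the diagram; real collectives add latency and protocol overhead, so treat these as floors:

```python
# Milliseconds to move 1 GB of tensor-parallel traffic across each link class.
LINKS_GBPS = {"NVLink": 600, "PCIe 5.0 x16": 64, "InfiniBand/Eth": 35}  # GB/s

for link, gbps in LINKS_GBPS.items():
    print(f"{link:>14}: {1000 / gbps:5.1f} ms per GB")
# NVLink ~1.7 ms, PCIe ~15.6 ms, network ~28.6 ms, before latency
# and collective-algorithm overhead are counted.
```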

Runtime ecosystem 2026

#runtime-ecosystem

The runtime constellation around model weights. Each engine owns a distinct OS / hardware / workload sweet spot. Use this when picking between Ollama, vLLM, llama.cpp, MLX-LM, ExLlamaV2, SGLang, TensorRT-LLM.

[Diagram: Local LLM runtime ecosystem. Model weights (gguf · safetensors · mlx) at the center, ringed by seven runtimes: Ollama, llama.cpp, vLLM, SGLang, MLX-LM, ExLlamaV2, TensorRT-LLM, each color-coded by primary OS fit: Linux, macOS, Windows, or cross-platform.]
Runtimes are not interchangeable. Each is shaped by its primary OS and hardware target. Pick the runtime closest to your stack rather than forcing a Linux-first project onto Windows or vice versa.
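A first pass at that choice fits in a lookup table. A minimal sketch; the shortlists are a simplification of each project's primary target, not the full matrix at /compatibility:

```python
# First-pass runtime shortlist by (OS, accelerator); simplified, not exhaustive.
RUNTIME_FIT = {
    ("linux", "nvidia"):   ["vLLM", "SGLang", "TensorRT-LLM", "ExLlamaV2"],
    ("linux", "amd"):      ["vLLM", "llama.cpp"],      # expect the ROCm tax
    ("macos", "apple"):    ["MLX-LM", "llama.cpp", "Ollama"],
    ("windows", "nvidia"): ["Ollama", "llama.cpp"],    # Linux-first engines lag here
    ("any", "cpu"):        ["llama.cpp", "Ollama"],
}

def shortlist(os_name: str, accel: str) -> list[str]:
    return RUNTIME_FIT.get((os_name, accel), RUNTIME_FIT[("any", "cpu")])
```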

Local RAG architecture

#local-rag-architecture

The retrieval-augmented generation pipeline end to end: documents → chunker → embedder → vector DB → reranker → LLM → response. The reranker is the most undervalued stage; this diagram puts it where it earns its keep.

[Diagram: Local RAG architecture flow. Documents (PDF, MD, HTML) → Chunker (split + overlap) → Embedder (BGE, E5) → Vector DB (Qdrant, pgvector) → Reranker (cross-encoder) → LLM (local model) → Response (cited answer). Ingest and query paths meet at the vector DB; all stages run on the same machine: no network hop, no third-party API.]
A local RAG pipeline collapses retrieval and generation onto one box. The embedder and vector DB live alongside the LLM, so query latency is bounded by disk and memory rather than any external API.
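The whole pipeline fits in a short script. A minimal sketch using sentence-transformers for the embedder and reranker stages; the model names are common defaults rather than this page's recommendation, handbook.md stands in for your corpus, and the final LLM call is left as a stub:

```python
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")         # Embedder stage
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # Reranker stage

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Chunker stage: naive fixed-size split with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# Ingest: embed chunks into an in-memory "vector DB" (swap in Qdrant/pgvector).
docs = chunk(open("handbook.md").read())
index = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 20, final: int = 5) -> list[str]:
    """Query path: top-k by cosine similarity, then cross-encoder rerank."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(index @ q)[::-1][:k]
    scores = reranker.predict([(query, docs[i]) for i in top])
    return [docs[i] for i in top[np.argsort(scores)[::-1][:final]]]

# LLM stage: pass retrieve(question) to your local model as grounding context.
```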

Embedding these diagrams

How to use these in your own writing without fuss.

Every diagram on this page is pure inline SVG with semantic labels. Three ways to embed:

  1. Screenshot the diagram from this page — the easiest path. Include the citation line.
  2. Copy the SVG node from your browser’s dev tools and save it as a .svg file. Use as-is in articles or slides; the SVG is text-only, no embedded fonts, dark-mode-friendly via CSS class fills.
  3. Reference the source by linking to runlocalai.co/resources and naming the diagram by its anchor.

Suggested citation: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0

Found a diagram useful in something you published? We’d love to see it — drop us a note at support@runlocalai.co.

More to come

Roadmap for the visual layer.

  • Shipped — Quantization-format cheat sheet (FP16 → Q2_K)
  • Shipped — Hardware-buying checklist (8 stages)
  • Hardware-tier decision tree ($0 → $4000+)
  • Local-AI privacy checklist
  • Runtime × OS compatibility matrix (already at /compatibility)
  • Coding-agent architecture (OpenHands + vLLM + RAG + sandbox)
  • Mobile / on-device path (NPU → runtime → small model)

All assets here are CC-BY-4.0 — same citation as the other diagrams. Suggested credit: Diagram by RunLocalAI · runlocalai.co · CC-BY-4.0