The math behind every prediction.
Nothing hidden, nothing hand-waved. Every number on /will-it-run comes out of the formulas on this page. If a number on the site is wrong, it's wrong here too — we'd rather you find the bug than trust us blindly.
For a citation-ready definition of the named framework, use the RunLocalAI Will-It-Run Framework.
Last reviewed 2026-05-13 · revision 02
The trust model
Every prediction on the site sits on a 5-tier confidence ladder. The chip on each row tells you which tier the number comes from — and the RunLocalAI Score applies a multiplier (1.00 / 0.95 / 0.85 / 0.80 / 0.70) so the headline never pretends to be more solid than the underlying data.
The full ladder with weights and tier mappings is in § 10 — Confidence ladder. The TL;DR for first-time readers:
- M / M~ (Measured). We, or a contributor whose result we reproduced, ran the combination: an exact match (M) or the same hardware at a different ctx/quant (M~). Trust most.
- C (Community). Submitted via /community with a citation and editorially reviewed, but not yet independently reproduced.
- ~ (Extrapolated). We have a measurement on similar-bandwidth hardware and scaled it by the bandwidth ratio.
- E (Estimated). Pure formula, with no benchmark yet for this hardware tier. Trust least.
VRAM accounting
Loading and running a model needs four pools of memory. Sum them, and you have the answer to “will it fit?”
total_vram = weights + kv_cache + activations + runtime_overhead
Weights
Calculated from the model's parameter count and quantization precision:
weights_gb = (parameter_count_b × bits_per_param) / 8
The non-obvious part: bits_per_param isn't always what the quant name suggests. Q4_K_M stores some attention and feed-forward tensors at 6-bit precision and the rest at 4-bit, so its effective average is 4.83 bits/param, not 4. The same logic applies to other K-quant variants. We use the calibrated values, not the naive ones.
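To make the gap concrete, here is a minimal Python sketch of the weights formula. The function name and the round 8B parameter count are illustrative assumptions; the 4.83 bits/param figure is the calibrated Q4_K_M value quoted above.

```python
def weights_gb(parameter_count_b: float, bits_per_param: float) -> float:
    """Weight memory in GB: billions of parameters x bits per parameter / 8."""
    return parameter_count_b * bits_per_param / 8


# Illustrative 8B-parameter model (a round number, assumed for the example).
naive = weights_gb(8.0, 4.0)         # what the quant name suggests
calibrated = weights_gb(8.0, 4.83)   # effective Q4_K_M average from the text
print(f"naive: {naive:.2f} GB, calibrated: {calibrated:.2f} GB")
```

On an 8B model the calibrated figure is roughly 0.8 GB larger than the naive one, which is enough to flip a borderline "fits" verdict.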
Activations and runtime overhead
Activation memory empirically scales as 0.05 × weights_gb + 0.001 × context_length — about 5% of weights, plus ~1KB per context token for intermediate buffers.
Runtime overhead (driver, kernel buffers, allocator fragmentation) is fixed per backend: ~1.8 GB for CUDA, ~2.0 GB for ROCm, ~0.7 GB for Metal/MLX, ~0.5 GB for CPU paths.
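The four pools can be summed in a few lines. This is a sketch under stated assumptions, not the engine's code: the function names are hypothetical, the overhead table copies the per-backend figures above, and the activation term is passed in as a precomputed value rather than derived here. The input numbers are purely illustrative.

```python
# Fixed runtime overhead per backend, in GB (figures from the text).
RUNTIME_OVERHEAD_GB = {"cuda": 1.8, "rocm": 2.0, "metal": 0.7, "cpu": 0.5}


def total_vram_gb(weights_gb, kv_cache_gb, activations_gb, backend):
    """Sum the four memory pools: weights + KV cache + activations + overhead."""
    return weights_gb + kv_cache_gb + activations_gb + RUNTIME_OVERHEAD_GB[backend]


def fits(total_gb, vram_budget_gb):
    """True when the summed footprint is within the available budget."""
    return total_gb <= vram_budget_gb


# Illustrative inputs only: 4.83 GB weights, 1.0 GB KV cache, 0.3 GB activations.
total = total_vram_gb(4.83, 1.0, 0.3, "cuda")
print(f"total: {total:.2f} GB, fits in 8 GB: {fits(total, 8.0)}")
```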
KV cache (the part most calculators get wrong)
Every token in the context window costs key-value cache memory. With Grouped-Query Attention (GQA, used by most modern open models), the formula is:
kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × context_length × batch_size × bytes_per_element
The 2 accounts for both K and V tensors. The crucial detail: it uses num_kv_heads, NOT num_attention_heads. For Llama 3.1 8B that's 8 KV heads vs 32 attention heads, a 4× smaller cache than a naive calculator would predict.
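The difference between the GQA-aware and naive calculations is easy to check in code. A minimal Python sketch follows; the function name is illustrative, and the 32-layer, head_dim 128 shape for Llama 3.1 8B is an assumption not stated in the text (only the 8 vs 32 head counts are).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length,
                   batch_size=1, bytes_per_element=2):
    """KV cache size in bytes; the leading 2 covers the K and V tensors."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_length * batch_size * bytes_per_element)


# Llama 3.1 8B at an 8,192-token context with an FP16 (2-byte) cache.
# Assumed shape: 32 layers, head_dim 128.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, context_length=8192)
naive = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, context_length=8192)
print(f"GQA: {gqa / 2**30:.1f} GiB, naive: {naive / 2**30:.1f} GiB")
```

Under these assumptions the GQA cache is 1 GiB, while counting all 32 attention heads would overstate it as 4 GiB.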
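Worked example (80 layers, 8 KV heads, head_dim 128, 1,024-token context, batch size 1, 2 bytes per element):
kv_cache_bytes = 2 × 80 × 8 × 128 × 1024 × 1 × 2 = 335,544,320 bytes (320 MiB)
Decode speed (tokens per second)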
LLM token generation is memory-bandwidth bound, not compute bound. Every token requires reading the full model weights from memory. The theoretical ceiling is:
peak_tps = bandwidth_gbps / weights_gb
Real-world tok/s is some fraction of peak, depending on the runner + backend pair:
| Runner + backend | Efficiency |
|---|---|
| ExLlamaV2 (CUDA) | 80% |
| llama.cpp (CUDA) | 65% |
| vLLM (CUDA single-stream) | 70% |
| MLX (Metal) | 70% |
| llama.cpp (Metal) | 62% |
| llama.cpp (ROCm) | 55% |
| llama.cpp (Vulkan) | 45% |
| CPU-only | 45% |
| NPU-class (vendor SDK / ONNX RT) | 40% |
| Hybrid GPU/CPU offload | 30% |
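The ceiling and the efficiency table combine into a one-line estimate. A minimal Python sketch, assuming bandwidth is expressed in GB/s (gigabytes per second) and using the table's row labels as dictionary keys; the function names are illustrative, not the engine's actual API.

```python
# Efficiency factors copied from the table above.
EFFICIENCY = {
    "ExLlamaV2 (CUDA)": 0.80,
    "llama.cpp (CUDA)": 0.65,
    "vLLM (CUDA single-stream)": 0.70,
    "MLX (Metal)": 0.70,
    "llama.cpp (Metal)": 0.62,
    "llama.cpp (ROCm)": 0.55,
    "llama.cpp (Vulkan)": 0.45,
    "CPU-only": 0.45,
    "NPU-class (vendor SDK / ONNX RT)": 0.40,
    "Hybrid GPU/CPU offload": 0.30,
}


def peak_tps(bandwidth_gb_s, weights_gb):
    """Theoretical ceiling: every token reads the full weights once."""
    return bandwidth_gb_s / weights_gb


def predicted_tps(bandwidth_gb_s, weights_gb, runner_backend):
    """Peak scaled by the runner + backend efficiency factor."""
    return peak_tps(bandwidth_gb_s, weights_gb) * EFFICIENCY[runner_backend]


peak = peak_tps(960, 4.9)
estimate = predicted_tps(960, 4.9, "llama.cpp (CUDA)")
print(f"peak: {peak:.0f} tok/s, llama.cpp (CUDA): {estimate:.0f} tok/s")
```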
Worked example (960 GB/s of memory bandwidth, 4.9 GB of weights):
peak_tps = 960 / 4.9 ≈ 196 tok/s
Time to first token
Prefill (processing the prompt before generating any tokens) is compute-bound, not memory-bound. The full prompt is processed in parallel, so matrix-multiplication throughput is the bottleneck:
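ttft_ms = 1000 × (prompt_tokens × params × 2) / (compute_tflops × 10^12 × efficiency)
Here params is the raw parameter count and compute_tflops is the card's FP16 throughput in TFLOPS.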
For most consumer cards on a typical 512-token prompt this is well under a second. Old cards or huge prompts can push it past 2 seconds — at which point we surface “TTFT: slow” on the row. We currently predict TTFT only when the hardware has known FP16 TFLOPS in our database.
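A short Python sketch of the prefill estimate, with the unit conversions (TFLOPS to FLOPS, seconds to milliseconds) written out explicitly. The function names, the 80 TFLOPS card, and the 0.5 prefill efficiency are hypothetical values chosen for illustration; the 2-second "slow" threshold comes from the text.

```python
def ttft_ms(prompt_tokens, params, compute_tflops, efficiency):
    """Time to first token in milliseconds.

    params is the raw parameter count; prefill costs about 2 FLOPs
    per parameter per prompt token.
    """
    flops_needed = 2 * prompt_tokens * params
    sustained_flops = compute_tflops * 1e12 * efficiency
    return flops_needed / sustained_flops * 1000


def is_slow(ttft_milliseconds, threshold_ms=2000):
    """Mirror the 'TTFT: slow' flag: anything past 2 seconds."""
    return ttft_milliseconds > threshold_ms


# Hypothetical inputs: 512-token prompt, 8B params, 80 TFLOPS FP16, 50% efficiency.
example = ttft_ms(prompt_tokens=512, params=8e9, compute_tflops=80, efficiency=0.5)
print(f"TTFT: {example:.1f} ms, slow: {is_slow(example)}")
```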
CPU offload (harmonic mean)
When a model doesn't fit fully in VRAM, llama.cpp can split layers between GPU and CPU. The combined speed is the harmonic mean weighted by layer fraction:
combined_tps = 1 / ( gpu_frac / gpu_tps + cpu_frac / cpu_tps )
CPU tok/s depends on the memory bandwidth class: DDR5-5600 dual-channel ≈ 80 GB/s effective, DDR4-3200 ≈ 40 GB/s, an 8-channel server (DDR5 EPYC) ≈ 360 GB/s. We multiply by 0.4 to account for AVX2/AVX-512 efficiency vs theoretical FP32 throughput.
This is why our tool marks “CPU is the bottleneck” when the CPU side of the offload is the slower path. Most calculators don't tell you this, and the answer is often counterintuitive: upgrade RAM, not GPU.
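The counterintuitive result falls straight out of the harmonic mean. A minimal Python sketch: the function names and the 100 / 10 tok/s inputs are illustrative, and the bottleneck test (CPU layers take more time per token than GPU layers) is an assumed reading of "the slower path".

```python
def combined_tps(gpu_frac, gpu_tps, cpu_tps):
    """Harmonic mean of GPU and CPU speed, weighted by layer fraction."""
    cpu_frac = 1 - gpu_frac
    return 1 / (gpu_frac / gpu_tps + cpu_frac / cpu_tps)


def cpu_is_bottleneck(gpu_frac, gpu_tps, cpu_tps):
    """Assumed rule: the CPU side dominates the per-token time."""
    return (1 - gpu_frac) / cpu_tps > gpu_frac / gpu_tps


# Illustrative split: 75% of layers on a 100 tok/s GPU, 25% on a 10 tok/s CPU.
base = combined_tps(0.75, 100, 10)
faster_gpu = combined_tps(0.75, 200, 10)   # double the GPU speed
faster_cpu = combined_tps(0.75, 100, 20)   # double the CPU speed
print(f"base {base:.1f}, 2x GPU {faster_gpu:.1f}, 2x CPU {faster_cpu:.1f} tok/s")
```

With these inputs, doubling GPU speed moves the result from about 31 to about 35 tok/s, while doubling CPU speed moves it to 50 tok/s.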
Apple Silicon (unified memory)
Unified memory means CPU and GPU share one pool. There's no PCIe transfer cost. Effective bandwidth is the chip's memory bandwidth: 546 GB/s on M4 Max, 819 GB/s on M3 Ultra. MLX-LM gets a 0.70 efficiency factor; llama.cpp Metal gets 0.62.
The 0.75 OS-reserve factor. For the Score and Fit utilities we treat systemRamTypicalGb × 0.75 as the effective “VRAM” budget — i.e. assume 25% of unified memory is reserved for the OS + running apps + dock cache. That number comes from Activity Monitor measurements on M-series Macs running a typical daily-driver workload (Safari + Mail + a notes app + Activity Monitor itself). On a 64GB M3 Max with 22GB of model weights loaded, memory_pressure starts climbing past ~80% allocation — which puts the honest ceiling at ~48GB usable, matching the 0.75 factor. The factor is conservative on purpose: a Mac with a minimal daemon set (no Safari, no Mail) can push closer to 0.85, but we'd rather under-promise.
Calibration note: the 0.75 figure was editor-measured on macOS Tahoe 26.2 (M3 Max 64GB, typical daily-driver load — Safari + Mail + Activity Monitor + a notes app), May 2026. The constant remains valid through macOS 26.5: Apple's 26.5 release notes describe no memory-subsystem changes that would shift the daemon overhead. If your Activity Monitor numbers disagree on a current build, tell us via /community and we'll refresh the constant.
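The unified-memory budget and decode estimate can be sketched together. The function names are hypothetical and the 22 GB weight size is an illustrative input; the 0.75 reserve factor, the 546 GB/s M4 Max bandwidth, and the 0.70 MLX efficiency are the figures given above.

```python
OS_RESERVE_FACTOR = 0.75  # share of unified memory treated as usable "VRAM"


def unified_memory_budget_gb(system_ram_gb, factor=OS_RESERVE_FACTOR):
    """Effective budget after reserving memory for the OS and running apps."""
    return system_ram_gb * factor


def apple_decode_tps(bandwidth_gb_s, weights_gb, efficiency):
    """Bandwidth-bound decode estimate scaled by the runner efficiency."""
    return bandwidth_gb_s / weights_gb * efficiency


budget = unified_memory_budget_gb(64)          # 64 GB Mac
tps = apple_decode_tps(546, 22, 0.70)          # M4 Max bandwidth, MLX factor
print(f"budget: {budget:.0f} GB, estimate: {tps:.1f} tok/s")
```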
Mixture-of-experts (MoE)
For MoE models like Qwen 3 235B-A22B (235B total / 22B active) or Llama 4 Scout (109B / 17B active), VRAM is gated by total parameters but speed is gated by active parameters. We use:
vram_footprint ← uses parameter_count_b (total)
speed_estimate ← uses active_parameter_count_b (per-token)
This is why an MoE model needs roughly the same VRAM as a dense model of the same total size, yet decodes at close to the speed of a dense model of the active size.
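The total-versus-active split can be sketched as follows. This is an illustration under assumptions, not the engine's implementation: the function name is hypothetical, and the 4.83 bits/param, 819 GB/s bandwidth, and 0.70 efficiency inputs are borrowed from other sections of this page purely as example values.

```python
def moe_estimates(total_params_b, active_params_b, bits_per_param,
                  bandwidth_gb_s, efficiency):
    """Return (weights footprint in GB, decode tok/s) for an MoE model.

    The footprint uses total parameters; speed uses only the parameters
    read per token (the active ones).
    """
    footprint_gb = total_params_b * bits_per_param / 8
    active_gb = active_params_b * bits_per_param / 8
    tps = bandwidth_gb_s / active_gb * efficiency
    return footprint_gb, tps


# Qwen 3 235B-A22B: 235B total, 22B active.
footprint, moe_tps = moe_estimates(235, 22, 4.83, 819, 0.70)

# A dense model of the same total size reads all 235B parameters per token.
_, dense_tps = moe_estimates(235, 235, 4.83, 819, 0.70)
print(f"footprint {footprint:.1f} GB, MoE {moe_tps:.1f} tok/s, dense {dense_tps:.1f} tok/s")
```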
Use-case ranking
When you pick a use case (Coding, Reasoning, Vision, etc.), we re-rank the comfortable tier by a fit score AND filter out models too small to be useful for the workload:
- Direct tag matches on the model's use_cases field
- Family heuristics (e.g., DeepSeek R1 family scores higher for reasoning)
- Slug heuristics (e.g., anything with “coder” in the slug for coding)
- Hard filters (Vision = multimodal-only)
- Parameter-count modifiers (bigger models score higher for harder tasks like reasoning)
- Minimum-params floor per use case: chat ≥0.5B, coding ≥3B, agents ≥7B, reasoning ≥7B, vision ≥1B, long-context ≥3B, creative ≥1.5B. A 360M-class model gets dropped from the agents ranking before fit scoring runs.
final_rank = 0.6 × use_case_fit + 0.4 × normalize(predicted_tok/s)
where normalize(tps) = min(100, log10(max(1, tps)) / log10(120) × 100)
→ 60 tok/s ≈ 85 normalised
→ 200 tok/s ≈ 100 (capped, "fast enough")
→ 5 tok/s ≈ 33
The normalisation matters. Earlier versions used raw tok/s as the speed component, which let tiny, fast models (e.g. SmolLM 2 360M at 2000+ tok/s on a 4090) crush actually-useful coders in the ranking. With the log normalisation, the speed score is already ~85 at 60 tok/s and caps at 100 from 120 tok/s upward, so beyond “fast enough” the use-case fit decides the order.
Weighted toward fit, but speed still matters.
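The ranking formula is small enough to run directly. A minimal Python sketch, assuming use_case_fit is on a 0 to 100 scale (the text does not state its range); the two example models and their fit scores are made up for illustration.

```python
import math


def normalize(tps):
    """Log-scale speed score, capped at 100 once tps reaches 120."""
    return min(100.0, math.log10(max(1.0, tps)) / math.log10(120) * 100)


def final_rank(use_case_fit, predicted_tps):
    """Blend: 60% use-case fit, 40% normalised speed."""
    return 0.6 * use_case_fit + 0.4 * normalize(predicted_tps)


# Hypothetical candidates for a coding workload.
tiny_fast = final_rank(use_case_fit=20, predicted_tps=2000)   # tiny, very fast
useful_coder = final_rank(use_case_fit=90, predicted_tps=60)  # larger, slower
print(f"tiny-fast: {tiny_fast:.1f}, useful coder: {useful_coder:.1f}")
```

Because the speed term is capped, the 2000 tok/s model cannot outrank a better-fitting model that is merely fast enough.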
Confidence ladder
Every tok/s number gets a one-letter badge:
| Badge | Tier | Meaning |
|---|---|---|
| M | Measured | We ran this exact model + hardware + quant + runner combo. Most accurate. |
| M~ | Measured-near | We ran this model on this hardware at a different context size or quant; extrapolated by quant ratio. |
| C | Community | Submitted by a third party with citation. Useful for breadth, less reliable than measured. |
| ~ | Extrapolated | We have a measurement for this model on similar-bandwidth hardware; scaled by bandwidth ratio. |
| E | Estimated | Pure formula. No benchmark data for this model on this hardware tier yet. |
Each tier carries a weight (the 1.00 / 0.95 / 0.85 / 0.80 / 0.70 multipliers from the trust model) that drives composite scoring. See the RunLocalAI Score page for how confidence-weighted aggregates roll up.
Bench protocol (V36.52)
The downloadable runlocalai-bench CLI implements a single fixed protocol we tag V36.52:
- 5 runs — 1 cold-start (excluded from steady-state aggregates) + 4 steady-state runs.
- 2-second pause between runs so kernel-launch cache pressure is consistent.
- Median + P5 + P95 computed over the 4 steady runs. The median is the headline.
- Variance gate — submissions where the (max − min) / median exceeds 20% are flagged for editorial review; clean runs land on medium confidence directly.
- Ollama-only. The current CLI shells out to ollama run --verbose and parses its stderr. Community submissions are therefore Ollama numbers. vLLM / MLX / llama.cpp rows in the catalog come from editor-run measurements or cited third-party sources; see § 14.
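The aggregation steps in this protocol can be sketched in Python. This is not the CLI's code: the function names are hypothetical, the cold-start run is assumed to be the first of the five, and the percentile method (linear interpolation between sorted samples) is an assumption, since the protocol text does not specify one.

```python
import statistics


def percentile(sorted_values, pct):
    """Linear-interpolation percentile over an already-sorted list (assumed method)."""
    rank = (len(sorted_values) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + span * (rank - lower)


def summarize(runs_tps):
    """Aggregate 5 runs: drop the cold start, report median/P5/P95 and the gate."""
    if len(runs_tps) != 5:
        raise ValueError("protocol expects exactly 5 runs")
    steady = sorted(runs_tps[1:])  # assumes run 0 is the cold start
    median = statistics.median(steady)
    spread = (steady[-1] - steady[0]) / median
    return {
        "median": median,
        "p5": percentile(steady, 5),
        "p95": percentile(steady, 95),
        "flagged": spread > 0.20,  # variance gate from the protocol
    }


clean = summarize([31.0, 40.0, 41.0, 42.0, 43.0])
noisy = summarize([30.0, 40.0, 50.0, 45.0, 42.0])
print(clean)
print(noisy)
```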
curl -fsSL https://runlocalai.co/bench.mjs -o bench.mjs
node bench.mjs --hardware your-gpu-slug --model llama3.1:8b --submit
Freshness SLA
The public /freshness audit buckets every catalog row by age of last review. These thresholds are descriptive, not promissory — the editorial cadence is best-effort, not contractual:
- Fresh (< 30 days). Touched within the last month. Default state for actively-maintained surfaces.
- Recent (30–90 days). Likely still accurate, may benefit from a refresh on pricing or driver-version drift.
- Aging (90–180 days). Flagged for editorial review. Numbers should be re-verified before high-stakes use.
- Stale (> 180 days). Treat with caution — model versions, runtime patches, and street pricing have likely shifted. Submit a measured benchmark to refresh the row.
- Never reviewed. Catalog row without a review timestamp. Surfaced separately on /freshness rather than silently filtered.
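The bucketing above can be expressed as a small function. A Python sketch with a hypothetical function name; the handling of the exact boundaries is an assumption, because the listed ranges overlap at 90 days and leave exactly 180 days unassigned. Here 90 counts as Aging and 180 still counts as Aging, with Stale starting after 180.

```python
from datetime import date


def freshness_bucket(last_reviewed, today):
    """Map a row's last-review date to a freshness bucket."""
    if last_reviewed is None:
        return "never reviewed"
    age_days = (today - last_reviewed).days
    if age_days < 30:
        return "fresh"
    if age_days < 90:
        return "recent"
    if age_days <= 180:   # boundary choice is an assumption
        return "aging"
    return "stale"


reviewed = date(2026, 5, 13)
print(freshness_bucket(reviewed, date(2026, 6, 1)))   # 19 days old
print(freshness_bucket(reviewed, date(2026, 9, 1)))   # 111 days old
print(freshness_bucket(None, date(2026, 6, 1)))
```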
Known limits & caveats
- No thermal model. A poorly-cooled laptop will perform 10–30% below predictions during sustained inference. We mark this on hardware verdicts but don't auto-discount.
- Driver versions matter. New CUDA / ROCm / Metal drivers can change real performance by 5–15%. We only track this when it shows up in measured benchmarks.
- Multi-GPU approximation. Tensor parallelism is modelled as 2× VRAM, ~1.5× speed. Real numbers vary by interconnect (PCIe vs NVLink) and runner support.
- No spec-decoding bonus. Speculative decoding and draft-model setups can multiply speed by 1.5–3× but require paired models. Not predicted yet.
- FP16 KV default. KV cache quantization (FP8 / INT4 KV) is supported in our math, but the prediction defaults to FP16 unless the runner is known to default to FP8.
Calibration sources
The constants in our engine were calibrated against:
Found a wrong number?
Please tell us at Contact support. Include the URL, the model + hardware combo, and what you measured.
Even better — submit a measured benchmark with runlocalai-bench --submit:
curl -fsSL https://runlocalai.co/bench.mjs -o bench.mjs
node bench.mjs --hardware your-gpu-slug --model llama3.1:8b --submit
We update the engine weekly; verified corrections ship within a week and get a credit on the page.