Scoring methodology — how the v17 catalog scores work
The catalog score grids you see on tool and hardware pages — Excellent / Strong / Acceptable / Limited / Poor — are computed, not handed down. This page documents exactly how every dimension is derived, what metadata feeds it, and what the score deliberately cannot capture.
Why we computed scores at all
For most of the catalog’s history we leaned on prose verdicts — the “our verdict” paragraph at the top of every tool and hardware page. Verdicts are honest and hand-written, but they are unsortable, impossible to compare across pages, and invisible to anyone scanning a list. The v17 score grid solves the comparison problem without replacing the prose: every score is a computed signal — derived from catalog metadata via a documented formula — that lets comparison surfaces (per-card score grids, the runtime table at /maps/inference-runtimes-2026, and the build-vs-build matrix at /compare/builds) sort and filter consistently. The scores sit next to the prose; they do not replace it.
Composite scores — our position changed in May 2026
Historical position: for most of the catalog’s history we argued composite scores were dishonest — that a 92/100 implied the weighting matched what you actually care about, which is almost never true. A homelab operator on Linux weights perf-per-watt and Linux friendliness above beginner-friendliness; a first-time buyer on Windows weights the inverse. A single number obscures that mismatch behind false precision. We kept scores per-dimension.
What changed: the critiques above are all real, but the diagnosis was incomplete. The dishonesty isn’t the composite itself — it’s the composite presented without a visible breakdown, without a confidence multiplier, and without naming the underlying anchor measurement. In May 2026 we shipped the RunLocalAI Score — a composite 0–1000 number — but designed it to answer each original objection head-on (a sketch of the computation follows the list below):
- The four sub-scores (Throughput / VRAM-fit / Ecosystem / Efficiency) are always rendered next to the headline, so the reader can immediately disagree with the weighting they don’t share.
- A confidence multiplier (1.00 measured · 0.95 measured-near · 0.85 community · 0.80 extrapolated · 0.70 estimated) penalizes the headline when the underlying data is weak — no “850/1000” for a hardware row with zero measured benchmarks.
- The anchor model is named in the rationale — “Anchored to high-confidence measured benchmark on Llama 3.1 8B — 127 tok/s” — so the reader sees exactly which workload the throughput sub-score was calibrated to.
- The chip tier (S / A / B / C / D) is what reads at a glance; the raw 0–1000 number is for sorting. Tier breakpoints ladder honestly with the precision the inputs can support.
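For readers who think in code, here is a minimal sketch of how a composite built this way fits together. The confidence multipliers are the published ones; the weights, field names, and S/A/B/C/D breakpoints are placeholders rather than the production values:

```typescript
type Confidence = "measured" | "measured-near" | "community" | "extrapolated" | "estimated";

// Multipliers as published above; weak data drags the headline down.
const CONFIDENCE_MULTIPLIER: Record<Confidence, number> = {
  "measured": 1.0,
  "measured-near": 0.95,
  "community": 0.85,
  "extrapolated": 0.8,
  "estimated": 0.7,
};

// Hypothetical weights. The design point is that the sub-scores are always
// rendered next to the headline, not that these particular numbers are right.
const WEIGHTS = { throughput: 0.4, vramFit: 0.25, ecosystem: 0.2, efficiency: 0.15 };

interface SubScores {
  throughput: number; // each sub-score is 0-100
  vramFit: number;
  ecosystem: number;
  efficiency: number;
}

function runLocalAiScore(sub: SubScores, confidence: Confidence): number {
  const weighted =
    sub.throughput * WEIGHTS.throughput +
    sub.vramFit * WEIGHTS.vramFit +
    sub.ecosystem * WEIGHTS.ecosystem +
    sub.efficiency * WEIGHTS.efficiency;                              // 0-100
  return Math.round(weighted * 10 * CONFIDENCE_MULTIPLIER[confidence]); // 0-1000
}

// Hypothetical breakpoints for the S/A/B/C/D chip (the published per-dimension
// breakpoints scaled by 10); the shipped chip thresholds may differ.
function tierChip(score: number): "S" | "A" | "B" | "C" | "D" {
  if (score >= 850) return "S";
  if (score >= 700) return "A";
  if (score >= 500) return "B";
  if (score >= 300) return "C";
  return "D";
}
```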
The per-dimension scores documented later on this page are still canonical for the comparison surfaces (/compare/hardware, the runtime table at /maps/inference-runtimes-2026, the build-vs-build matrix at /compare/builds). The composite RunLocalAI Score is an additional surface for the leaderboard and the hardware-page score card — not a replacement.
Why scores round to nearest 5
Rule-based scoring is approximate. The inputs (vendor, VRAM, bandwidth, category, GitHub stars) carry real signal but not enough to justify 76-vs-78 distinctions. Rounding every output to the nearest five prevents the false-precision trap — if a formula tweak nudges a card’s raw score by a point or two, the displayed grade should rarely move.
Tier breakpoints sit on multiples of five too: 85+ is Excellent, 70-84 Strong, 50-69 Acceptable, 30-49 Limited, below 30 Poor. So the rounding and the tier alignment never conflict.
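In code, the round-then-label rule is small. The breakpoints are exactly the ones stated above; the function names are illustrative:

```typescript
type Tier = "Excellent" | "Strong" | "Acceptable" | "Limited" | "Poor";

// Every raw dimension score is clamped to 0-100 and rounded to the nearest five
// before it is displayed or tiered.
function roundTo5(raw: number): number {
  return Math.min(100, Math.max(0, Math.round(raw / 5) * 5));
}

// Breakpoints sit on multiples of five, so a rounded score always lands cleanly
// inside exactly one tier.
function tierLabel(score: number): Tier {
  if (score >= 85) return "Excellent";
  if (score >= 70) return "Strong";
  if (score >= 50) return "Acceptable";
  if (score >= 30) return "Limited";
  return "Poor";
}

tierLabel(roundTo5(86.4)); // -> "Excellent" (rounds to 85)
```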
The 10 dimensions
Eight dimensions apply across both tools and hardware (a few, as noted below, only make sense in practice for one or the other); the remaining two — VRAM-per-dollar and perf-per-watt — are hardware-only. Each section below names the dimension, states the formula in operator language, and walks through a concrete example.
Compatibility
Breadth of supported runtimes, OSes, and accelerator backends. For tools we count the unique OS targets and GPU targets in the osSupported and gpuSupported arrays, weight each at twelve points, and add a ten-point base. Four OSes plus four GPU vendors caps the dimension at Excellent. For hardware we score by vendor: NVIDIA = 95 (CUDA universe), Apple = 75 (full MLX + llama.cpp + Metal), AMD = 60 (ROCm covers the popular runtimes but often a release behind), Intel/Qualcomm = 35 (real but thin). Example: an RTX 4090 lands in Excellent because every runtime ships CUDA wheels first; an Intel Arc A770 lands in Limited because IPEX-LLM exists but most operators never reach for it.
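A sketch of both halves of the formula. osSupported and gpuSupported are the field names used above; the target strings in the example call and the lookup key spellings are illustrative:

```typescript
// Tool-side compatibility as stated above: twelve points per unique OS target and
// per unique GPU target, plus a ten-point base, capped at 100.
function toolCompatibility(osSupported: string[], gpuSupported: string[]): number {
  const targets = new Set(osSupported).size + new Set(gpuSupported).size;
  return Math.min(100, 10 + 12 * targets);
}

toolCompatibility(
  ["linux", "macos", "windows", "freebsd"],
  ["nvidia", "amd", "apple", "intel"],
); // 4 OSes + 4 GPU vendors -> capped at 100, Excellent

// Hardware-side compatibility is a vendor lookup with the values stated above.
const HARDWARE_COMPATIBILITY: Record<string, number> = {
  NVIDIA: 95,   // CUDA universe: every runtime ships CUDA wheels first
  Apple: 75,    // MLX + llama.cpp + Metal
  AMD: 60,      // ROCm covers the popular runtimes, often a release behind
  Intel: 35,    // real but thin (IPEX-LLM)
  Qualcomm: 35,
};
```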
Runtime maturity
Tool-only. Proxies for “is this stable enough to base a build on?” Inputs: GitHub stars (logarithmic, capped at 70), presence of a long operational review (+25 if the L1.5 review exceeds 600 chars), and presence of L1.25 enrichment body (+5 if > 700 chars). 50K+ stars plus a real review pegs the dimension at Excellent. Example: llama.cpp and vLLM both clear the maturity bar comfortably; a 2K-star fork without an editorial review lands in Limited regardless of how clever the code is.
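A sketch of the maturity formula. The logarithmic scaling constant is a placeholder (the text only commits to a capped log term), and the star counts in the example calls are illustrative:

```typescript
// Tool-only maturity score. The 13 * log10(stars) scaling is a placeholder;
// the page only commits to "logarithmic, capped at 70" for the stars term.
function runtimeMaturity(githubStars: number, reviewChars: number, enrichmentChars: number): number {
  const starsTerm = Math.min(70, 13 * Math.log10(Math.max(1, githubStars)));
  const reviewBonus = reviewChars > 600 ? 25 : 0;        // L1.5 operational review present
  const enrichmentBonus = enrichmentChars > 700 ? 5 : 0; // L1.25 enrichment body present
  return Math.min(100, starsTerm + reviewBonus + enrichmentBonus);
}

runtimeMaturity(70000, 2400, 1200); // large, reviewed project -> ~93, Excellent
runtimeMaturity(2000, 0, 0);        // 2K-star fork, no review -> ~43, Limited
```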
Setup complexity (inverse — lower is better)
How much infrastructure work to go from zero to a running model. Inverse-scored: a low complexity ranks as Excellent because the operator burden is low. Per-tool overrides set the canonical install path: Ollama and LM Studio at 95 (one-line install, no driver wrangling), Open WebUI / AnythingLLM at 85, llama.cpp at 60 (a build step but no Python soup), vLLM and SGLang at 45 (CUDA wheel + Python env management), TensorRT-LLM at 25 (engine-build step is real work). Example: a beginner reaching for vLLM is reaching past the comfort line — the dimension says Limited, and the verdict prose underneath says the same in human language.
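The override table, expressed as a lookup. The slug spellings are assumptions; the values are the ones stated above, and because the dimension is inverse a high number means a low setup burden:

```typescript
// Per-tool setup-complexity overrides. Higher = less setup work for the operator.
const SETUP_COMPLEXITY_OVERRIDES: Record<string, number> = {
  "ollama": 95,       // one-line install, no driver wrangling
  "lm-studio": 95,
  "open-webui": 85,
  "anythingllm": 85,
  "llama.cpp": 60,    // a build step, but no Python soup
  "vllm": 45,         // CUDA wheel + Python env management
  "sglang": 45,
  "tensorrt-llm": 25, // engine-build step is real work
};
```

The other override-table dimensions below (maintenance burden, beginner-friendliness, Linux-friendliness) follow the same lookup shape with the values given in their sections.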
Maintenance burden (inverse — lower is better)
How often a working install breaks under its own weight after three months. Heavy server frameworks (vLLM, SGLang, TensorRT-LLM) require version-pinning discipline; consumer GUIs (Ollama, LM Studio) auto-update without surprise. Per-tool overrides: Ollama and LM Studio at 85, runner / GUI category at 70, server frameworks at 40-50. Example: an RTX 4090 + Ollama install runs for eighteen months without intervention; the same hardware running TensorRT-LLM in production requires monthly engine rebuilds when the backend bumps. See /systems/local-ai-maintenance for the operator playbook this dimension proxies.
Stability
Long-session reliability — can you leave a coding-agent loop running for 24 hours without it falling over? For tools this is adoption-weighted (stars are a proxy for “has anyone hit this bug before”) plus an editorial-review bonus. For hardware this is vendor + tier: NVIDIA enthusiast/high tiers at 85, Apple at 80, AMD at 60. Editorial verdict presence adds +5 to confirm the rating is operator-tested, not assumed. Example: an RTX 3090 in a quiet workstation hits Excellent because the Ampere driver tree is the most battle-tested in the local-AI ecosystem.
Beginner-friendliness
First-time-user fit. Inputs: tool category, canonical install path, and explicit per-tool overrides. Ollama at 95, LM Studio at 90, GUI category at 70, llama.cpp at 30 (build step disqualifies for beginners), vLLM/SGLang/TensorRT-LLM at 15. For hardware: NVIDIA consumer cards at 80, Apple Silicon at 90, AMD at 50 (ROCm tax). The chooser at /choose-my-gpu uses this dimension to render a green “First-time-user friendly” chip when the user picks beginner skill level. Example: an Apple M3 Ultra plus LM Studio is the most beginner-friendly path on the catalog; a dual-3090 plus vLLM is the least.
Linux-friendliness
Linux ecosystem fit. For tools: 80 if the osSupported array includes Linux, 95 for the Linux-first servers (vLLM, SGLang), 30 if the tool is macOS- or Windows-only. For hardware: NVIDIA at 90 (CUDA reference platform), AMD at 80 (ROCm canonical path), Apple at 10 (macOS-only by definition), Intel at 70. Example: an Apple M3 Ultra is Excellent everywhere except this dimension, which lands at Poor because Linux on Apple Silicon is Asahi-grade and not a target for AI runtimes.
Mobile-friendliness
Hardware-only (in practice). Edge / mobile / battery suitability. Inputs: device type, NPU presence, Apple unified memory, sustained power. Mobile SoCs (Snapdragon X, Apple A-series in iPhone) and PC NPUs at 85, devices with an NPU at 75, Apple Silicon at 70 (low-power unified memory), low-power discrete GPUs at 50, desktop monsters (300W+) at 10. Example: an iPhone 17 Pro lands in Excellent for on-device chat at 4B Q4; an RTX 4090 lands in Poor here because no mobile chassis can carry it.
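A sketch of the ladder. The rule order, the input shape, and the low-power threshold are assumptions; the point values are the ones stated above:

```typescript
// Mobile-friendliness ladder (illustrative). Rule order and the 120W threshold
// are assumptions; the returned values come from the text above.
interface MobilityInputs {
  deviceType: "mobile-soc" | "pc-npu" | "apple-silicon" | "discrete-gpu";
  hasNpu: boolean;
  powerDrawW: number;
}

function mobileFriendliness(hw: MobilityInputs): number {
  if (hw.deviceType === "mobile-soc" || hw.deviceType === "pc-npu") return 85;
  if (hw.deviceType === "apple-silicon") return 70; // low-power unified memory
  if (hw.hasNpu) return 75;                          // any other device with an NPU
  if (hw.powerDrawW >= 300) return 10;               // desktop monsters
  if (hw.deviceType === "discrete-gpu" && hw.powerDrawW <= 120) return 50; // low-power dGPU
  return 30;                                         // fallback, assumed
}
```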
VRAM-per-dollar (hardware only)
Hardware-only. Formula: vramGb / currentStreetPriceUsd × 2500, capped at 100. We use street price when known and fall back to MSRP. The 2500 multiplier is calibrated against the used-3090-at-$700-with-24GB reference (≈ 0.034 GB/$ → 86, Excellent). A new RTX 4090 at $1,800 with 24 GB lands at 0.013 GB/$ ≈ 33, Limited. An Apple M3 Ultra at $5,000 with 192 GB unified memory lands at 0.038 GB/$ ≈ 95, Excellent. An H100 at $30K with 80 GB lands at 0.0027 ≈ 7, Poor — because per-dollar is not what an H100 buyer optimizes for.
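The formula, with the calibration points above as worked examples (the behaviour when both prices are missing is an assumption):

```typescript
// VRAM-per-dollar exactly as stated: GB per dollar, scaled by 2500, capped at 100.
// Street price when known, MSRP as the fallback.
function vramPerDollar(vramGb: number, currentStreetPriceUsd?: number, msrpUsd?: number): number {
  const price = currentStreetPriceUsd ?? msrpUsd;
  if (!price) return 0; // no price data: assumed to read as Poor
  return Math.min(100, (vramGb / price) * 2500);
}

vramPerDollar(24, 700);   // used RTX 3090  -> ~86 (Excellent)
vramPerDollar(24, 1800);  // new RTX 4090   -> ~33 (Limited)
vramPerDollar(192, 5000); // Apple M3 Ultra -> ~96, displayed as 95 (Excellent)
vramPerDollar(80, 30000); // H100           -> ~7  (Poor)
```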
Perf-per-watt (hardware only)
Hardware-only. Formula: vramBandwidthGbps / powerDrawW × 22, capped at 100. Bandwidth is the right proxy for inference perf because LLM decode is memory-bound. Calibration: an Apple M3 Ultra at 800 GB/s ÷ 180W = 4.4 GB/s/W, scoring 95+ (Excellent); an RTX 4090 at 1008 GB/s ÷ 450W = 2.24 GB/s/W, scoring 50 (Acceptable); an H100 SXM at 3350 GB/s ÷ 700W = 4.79 GB/s/W, scoring 100 (Excellent). The dimension is what makes Apple Silicon’s headline figure on the catalog real — not because Apple is faster, but because it does the same work for half the watts.
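The same shape for efficiency, with the calibration points above as worked examples:

```typescript
// Perf-per-watt exactly as stated: memory bandwidth per watt, scaled by 22, capped at 100.
function perfPerWatt(vramBandwidthGbps: number, powerDrawW: number): number {
  return Math.min(100, (vramBandwidthGbps / powerDrawW) * 22);
}

perfPerWatt(800, 180);  // Apple M3 Ultra -> ~98, displayed as 95+ (Excellent)
perfPerWatt(1008, 450); // RTX 4090       -> ~49, displayed as 50 (Acceptable)
perfPerWatt(3350, 700); // H100 SXM       -> capped at 100 (Excellent)
```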
How we’d revise a score
Scores are derived from catalog metadata. When the metadata changes, the score recomputes automatically — the next time the page renders. The deliberate revision pattern:
- A vendor publishes a price drop. We update currentStreetPriceUsd in the hardware row; the VRAM-per-dollar score lifts on the next deploy.
- A runtime ships a major version with new OS or accelerator targets (e.g. vLLM adding ROCm). We update the osSupported or gpuSupported array; the compatibility score lifts.
- We complete an L1.5 operational review. The dimensions that gate on review presence (runtime maturity, stability) get the editorial bonus on the next render.
We do not shift dimension formulas without a methodology revision recorded in /changelog — moving formulas silently would break the comparability promise the score grid makes.
When scores become stale
Scores derived from catalog metadata are stale when the metadata is stale. The two most common drift sources:
- Hardware street prices — used-market 3090s drifted from $1,200 in late 2022 to $700 in 2024 to $550 in 2026. Each of those moves shifts the VRAM-per-dollar dimension by a tier or more. We re-verify hardware prices on the schedule documented at /editorial-policy.
- Tool maturity — a 12K-star runtime hitting 30K stars within twelve months crosses the maturity ceiling. Stars are re-fetched on a regular cadence so the dimension stays current without manual intervention.
Per-page review dates appear in the byline strip on every entity page. If you see a review date older than eighteen months, treat the score grid with the same skepticism you would apply to a stale benchmark — the directional signal is usually still right, the precision is not.
How to read tier labels
- Excellent (85-100). The dimension is a strength — this is among the best in the catalog for that signal. Reach for this entity if the dimension is your priority.
- Strong (70-84). Solid. Not the leader, but no serious gap. Most operator builds live in this range across most dimensions.
- Acceptable (50-69). Workable but not optimal. A tradeoff is being made. Read the verdict prose to understand which.
- Limited (30-49). A real weakness. The entity is fighting against the dimension. Pair it with a counterpart that covers the gap, or pick a different entity.
- Poor (0-29). A mismatch. Either the dimension is actively bad (NVIDIA on macOS = 0) or the data is missing. Do not rely on this entity for that signal.
The honest limits — what scoring CAN’T capture
A rule-based scoring system measures what its rules know to look for. The signals we deliberately do not encode:
- Workload-specific quality. A score does not know whether your prompt happens to expose a tokenizer edge case in a specific runtime, or whether a model’s instruction-following collapses past 16K context. Those failures show up in benchmarks and editorial verdicts, not in the score grid.
- Vendor-specific drift. ROCm 6.0 → 6.1 broke things that ROCm 6.2 fixed. The compatibility score does not move during that window. Editorial review notes do.
- Subjective fit. Two cards can both be Strong on beginner-friendliness; a real beginner will still find one friendlier than the other. Read the prose.
- Time-of-purchase market conditions. Allocation shortages, multi-week regressions, a successor model paper — none reflect in scores until the metadata catches up.
The score grid is a fast-scan layer on top of the verdict prose, not a replacement for it. When the score and the prose disagree, the prose is the truer signal — the score grid exists so comparison surfaces have something honest to sort by; the verdict exists so the right entity ends up on your shortlist.
Adjacent reading: /editorial-policy for how verdicts are written, /how-we-make-money for the affiliate-disclosure surface that complements this trust layer, and /changelog for any methodology revisions.