
Trust at RunLocalAI

A benchmark site is only useful if you can trust the numbers. This page explains, in plain English, how we earn the right to publish one. It covers the four-state ladder a measurement climbs, the three things we have promised never to do, and the parts of the system where we are honest about what we cannot prove.

By Fredoline Eruo · Last reviewed 2026-05-17

Why trust is the product

Anyone can scrape model cards, list a few GPUs, and call the result a benchmark site. The hard part — the part that takes years and is the reason this site exists — is being a place where the number on the page is the number you would get if you ran the same configuration on the same hardware tonight. Approximately. Within known noise. With the caveats clearly listed instead of hidden in a footnote.

That is the entire product. Not the directory, not the calculators, not the comparison pages. The product is: an operator landing here from a search engine can trust the row they are reading. Everything else exists to support that one promise. The trust apparatus on this page is the contract that makes the rest of the catalog meaningful.

We document the apparatus publicly because the alternative — a proprietary “trust score” that nobody can audit — is what every competitor does. We refuse it. If our methodology cannot survive being read by a skeptical engineer, the methodology is the problem and we should fix it, not hide it.

The four-state trust ladder

Every benchmark row on the site sits at one of four states. The state is the answer to a single question: how confident are we that this number reflects reality? The states form a ladder; a row enters at the bottom and climbs as evidence accumulates.

Community submitted

Community submitted. An operator has submitted the measurement, an editor has reviewed it for plausibility and metadata completeness, and it is waiting for reproduction. One source is useful for triage, but it is not load-bearing evidence for a hardware purchase decision and does not enter public numeric surfaces.

Reproduced

Reproduced. A second operator has run the same configuration and arrived within ±15% of the original. Two sources, agreement. The measurement now has a real claim to being a measurement of the world rather than of one particular machine.

Independently reproduced

Independently reproduced. Two or more independent operators agree with the original. Three sources or more, agreement. This is the strongest community signal we publish. It still has limits, but it is far more useful than a single unverified leaderboard number.

Editorial

Editorial. Measured by RunLocalAI on hardware listed on the About page, using the exact protocol in the reproduction guide. The author byline is a named human; the run date is recorded; the reproducible command is published. A well-resourced reader can re-run the measurement and check our work.

The ladder is intentionally short. A longer ladder with more intermediate rungs would imply finer discrimination than the underlying signal supports. Four states is the most we can meaningfully distinguish; collapsing to three would lose the gap between “one operator says so” and “two operators say so,” which is a step change rather than an incremental gain.
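The community rungs of the ladder can be sketched as a small classifier. This is a minimal illustration, not the site's implementation: the names `TrustState`, `within_tolerance`, and `community_state` are hypothetical. It assumes the ±15% band is measured relative to the original submission and that an editor has already established that each reproduction comes from a different operator. The Editorial state is assigned by RunLocalAI's own measurement, so it is listed but never inferred.

```python
from enum import Enum


class TrustState(Enum):
    COMMUNITY_SUBMITTED = "community submitted"
    REPRODUCED = "reproduced"
    INDEPENDENTLY_REPRODUCED = "independently reproduced"
    EDITORIAL = "editorial"  # set by editorial measurement, never inferred here


TOLERANCE = 0.15  # the ±15% agreement band described in the text


def within_tolerance(original: float, reproduction: float) -> bool:
    """True if the reproduction lands within ±15% of the original value."""
    return abs(reproduction - original) <= TOLERANCE * abs(original)


def community_state(original: float, reproductions: list[float]) -> TrustState:
    """Classify a community row by how many other operators agree with it."""
    agreeing = sum(within_tolerance(original, r) for r in reproductions)
    if agreeing >= 2:  # three or more sources in agreement
        return TrustState.INDEPENDENTLY_REPRODUCED
    if agreeing == 1:  # two sources in agreement
        return TrustState.REPRODUCED
    return TrustState.COMMUNITY_SUBMITTED
```

For example, an original of 40.0 tok/s with reproductions of 44.0 and 37.0 would sit at independently reproduced, while a single reproduction of 50.0 (25% away) would leave the row at community submitted.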

Two further signals can layer on top of any state on the ladder: Stale for measurements older than 18 months, and Verified owner for submissions from operators we have editorially reviewed as hardware owners (see the operators page for what that actually means and what it does not). Stale and verified-owner are modifiers; they do not move a row up or down the ladder by themselves.
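The modifier rule can be sketched the same way. This is an assumption-laden illustration: `months_between`, `is_stale`, and `badges` are hypothetical names, age is counted in whole calendar months, and a row that is exactly 18 months old to the day is treated as stale. The point it shows is that modifiers are appended next to the ladder state and never change it.

```python
from datetime import date

STALE_AFTER_MONTHS = 18  # threshold stated in the text


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the current month is not complete yet
    return months


def is_stale(measured_on: date, today: date) -> bool:
    """Assumed boundary: 18 full months elapsed counts as stale."""
    return months_between(measured_on, today) >= STALE_AFTER_MONTHS


def badges(state: str, measured_on: date, today: date, verified_owner: bool) -> list[str]:
    """Return the ladder state followed by any modifiers; the state is untouched."""
    result = [state]
    if is_stale(measured_on, today):
        result.append("stale")
    if verified_owner:
        result.append("verified owner")
    return result
```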

Three promises we have made

Three rules constrain the entire system. They were not chosen because they were convenient. They were chosen because each one rules out a specific failure mode that other benchmark sites routinely exhibit.

1. We never auto-publish

No submission becomes public without an editor reviewing it. Not rate-limited public, not tentatively public, not provisionally public: never public until a named editor has read the row and approved it. Most submissions are approved within 48 hours; some are not approved at all. Rejected submissions never appear on the public site, in archive views, or in search results. We retain a private moderation/audit record so reviewers can detect abuse and explain future decisions, but the rejected numeric claim is never presented as evidence.

The discipline matters because the cheap version of a benchmark site is one where every submission auto-publishes and a moderation queue catches problems after the fact. That model produces a site where the median row is unverified. Editorial review before publication is slower; it is also the entire reason the median row on this site is something an operator can act on.

2. We never publish percentages

The confidence engine internally accumulates a numeric score, but the score never reaches the page. We render four tier labels — low, moderate, high, very-high — and that is the entire vocabulary. A “78% confidence” pill on a benchmark row would be false precision; the underlying inputs are heuristics. Tier labels are honest about what the engine actually knows.
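A minimal sketch of how an internal score could collapse into the four labels. The cut points and the assumption that the score lies between 0 and 1 are hypothetical; this page does not document the engine's real thresholds. `tier_label` is an illustrative name.

```python
import bisect

TIERS = ("low", "moderate", "high", "very-high")  # the full public vocabulary
# Hypothetical cut points on an assumed 0..1 score; not the engine's real values.
CUTS = (0.25, 0.5, 0.75)


def tier_label(score: float) -> str:
    """Map an internal numeric score to a tier label; the number itself is never shown."""
    return TIERS[bisect.bisect_right(CUTS, score)]
```

Under these assumed cut points, a score of 0.78 renders as very-high, and the reader never sees the 0.78.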

The same discipline applies elsewhere. Catalog scores render as tiers, not percentages. Decode rates round to one decimal. TTFT rounds to the nearest 10 ms. Wherever a number could be reported to a precision that exceeds the noise floor, we round it to the coarser precision. False precision is operator-hostile.
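The display rounding can be illustrated as follows. The function names are hypothetical, and the tie-breaking rule (round half up) is an assumption, since the page states only the target precision.

```python
from decimal import Decimal, ROUND_HALF_UP


def display_decode_rate(tok_per_s: float) -> str:
    """Format a decode rate with exactly one decimal place."""
    value = Decimal(str(tok_per_s)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return str(value)


def display_ttft_ms(ttft_ms: float) -> int:
    """Round time-to-first-token to the nearest 10 ms."""
    tens = (Decimal(str(ttft_ms)) / 10).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(tens) * 10
```

`Decimal` is used instead of the built-in `round` because `round` resolves ties to the nearest even digit, which would send 185 ms to 180 ms rather than 190 ms.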

3. We never invent numbers on sparse data

If we do not have a benchmark for a given model on a given GPU, we do not publish a benchmark row for that pair. We may show a predicted speed inside the Will-It-Run checker, but it stays labeled as estimated or extrapolated and is excluded from public benchmark evidence exports. The benchmark table renders the empty state honestly: a card that says “no benchmarks yet, submit one” with a link to the submission form.

The RunLocalAI Will-It-Run Framework computes predicted feasibility based on VRAM math and may estimate decode speed from bandwidth when no exact row exists. The promise does not forbid those estimates; it is narrower and stricter: predicted speed never becomes a benchmark row, never receives an owner-measured badge, and never enters the evidence snapshot until measured or reproduced data exists.
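To make the distinction concrete, here is a rough sketch of the kind of estimate involved. It is not the Will-It-Run Framework's formula. It uses a common back-of-envelope approximation (decode speed is bounded by memory bandwidth divided by the bytes of weights read per token), and `Estimate`, `estimate_decode`, and the `overhead_gb` allowance for KV cache and runtime buffers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Estimate:
    fits: bool
    decode_tok_per_s: Optional[float]
    label: str = "estimated"  # an estimate is never relabeled as a benchmark


def estimate_decode(params_billion: float, bits_per_weight: float,
                    vram_gb: float, bandwidth_gb_s: float,
                    overhead_gb: float = 1.5) -> Estimate:
    """Rough feasibility and an idealized upper bound on decode speed."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    fits = weights_gb + overhead_gb <= vram_gb
    # Each generated token reads roughly all weights once, so bandwidth bounds speed.
    speed = bandwidth_gb_s / weights_gb if fits else None
    return Estimate(fits=fits, decode_tok_per_s=speed)
```

Whatever the real formula looks like, the output carries the "estimated" label for its whole life; only a measured or reproduced run can produce a benchmark row.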

Where the catalog stands right now

The promises above are easier to take seriously when there are numbers attached. This block is a live read of the catalog at page load — not a snapshot, not a dashboard. The numbers move when we publish.

CATALOG · HARDWARE: 154 (149 with editorial verdict)
CATALOG · MODELS: 315 (272 with editorial verdict)
OWNER MEASURED: strict public evidence gate
SOURCE-BACKED: reproduced official/community rows
REPRODUCED: independent reproduction attached
ESTIMATES EXCLUDED: excluded from public benchmark table
AUDIT KNOWN FALSE POSITIVES: tok/s · benchmarks · peer-comparisons*
BENCHMARK COVERAGE: 0.3 benchmarks per hardware unit

* The audit metric. A regex audit (scripts/v38b-audit-verdicts-hardened.ts in the repo) sweeps every editorial verdict, hardware and models both, for unsourced tok/s claims, fabricated benchmark scores, and peer-model comparisons that lack attribution. The V38b hardening pass took the count from 366 flags to 3. The three that remain are known false positives: each is properly attributed in-text, and the audit script is expected to flag them. The rule is that the count of unattributed claims stays at zero; we re-run the audit before any verdict batch ships.
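The idea behind such a sweep can be sketched in a few lines. This is a Python illustration only, not the repository's TypeScript script: the patterns, the attribution phrases, and the name `unattributed_claims` are all assumptions chosen for the example.

```python
import re

# A numeric throughput claim such as "42.5 tok/s" (illustrative pattern).
TOK_S_CLAIM = re.compile(r"\b\d+(?:\.\d+)?\s*(?:tok/s|tokens/s)", re.IGNORECASE)
# Hypothetical attribution phrases; a real audit would need a richer list.
ATTRIBUTION = re.compile(
    r"\b(?:according to|measured by|reported by|source:)", re.IGNORECASE
)


def unattributed_claims(verdict: str) -> list[str]:
    """Return sentences that state a tok/s figure without an attribution phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", verdict.strip())
    return [
        s for s in sentences
        if TOK_S_CLAIM.search(s) and not ATTRIBUTION.search(s)
    ]
```

A heuristic like this flags by sentence, so a claim attributed one sentence earlier would still be reported, which is one plausible source of known false positives.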

The honest caveat. Benchmark coverage at 39 measurements across 154 hardware units is the catalog’s most-visible content gap — it’s the long-term backfill that turns “catalog” into “benchmark hub.” Until coverage thickens, the workload-fit matrix on each hardware page extrapolates from VRAM × bandwidth math, and the page labels that extrapolation explicitly as such. We’d rather show extrapolation marked clearly than invent a tok/s number to fill the row.
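For reference, the 0.3 coverage figure is consistent with these counts, assuming it is the measurement count divided by the hardware count and rounded to one decimal like the other displayed rates:

```latex
\[
\text{coverage} = \frac{39\ \text{measurements}}{154\ \text{hardware units}} \approx 0.253 \approx 0.3 \quad \text{(rounded to one decimal)}
\]
```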

What we cannot prove

There are three places where we are honest about the limits of our verification. The first defense against being misleading is being precise about what we are not claiming.

Hardware ownership. When a community submission carries the verified-owner modifier, we mean an editor has reviewed evidence — build photos, prior contributions, public posts about the hardware — and concluded the operator probably owns the machine they claim to own. We cannot prove it. There is no cryptographic attestation chain for a workstation under a desk. Verified-owner is editorial judgment, not a credential, and the operators page documents exactly what evidence we look for and where the limits are.

Reproducibility at scale. Editorial benchmarks ship a reproducible command. We cannot guarantee that every reader running the command on their own hardware will get the same number. Hardware varies, drivers vary, OS configurations vary, thermal headroom varies. We can guarantee that the command we published is what we ran, that the hardware in the byline is the hardware we ran it on, and that the number we recorded is the number we measured. Beyond that, the reproduction guide documents the protocol; the rest is the world being noisy.

Long-term accuracy. Runtime ecosystems move fast. A benchmark taken twelve months ago on that month’s llama.cpp build is not a benchmark on whatever llama.cpp is shipping today. We retire very-stale benchmarks with explicit stale signals; we do not silently update old rows with new runtime behavior. Every benchmark on the site is a measurement at a moment in time, and the trust apparatus exists in part to make that moment legible, so a reader can decide for themselves whether the moment is still relevant to their decision.

Deeper reads

Four pages drill further into specific parts of the apparatus. Each is the operator-actionable detail behind one of the sub-systems summarized above.

/resources/will-it-run-framework - the named framework behind effective VRAM, model working set, fit tiers, and measured-vs-estimated labels.
/trust/benchmarks — the confidence engine factors, rejection criteria, reproduction methodology, stale-data retirement.
/trust/editorial — the editorial review process, conflict-of-interest discipline, scoring methodology, and how editorial accountability is logged internally.
/trust/operators — what verified-owner means in practice, the evidence we accept, the identity information we never require, the hideContributor flag, and how reputation is editorially earned rather than algorithmically scored.

How we are funded

Some hardware links on this site are affiliate links, and we run contextual ads. That revenue never changes a verdict, a benchmark number, or a trust-state assignment: the editorial and commercial sides are walled off, and a recommendation is never pay-to-play. The full breakdown (which links pay us, which do not, and how the wall is enforced) is on the “How we make money” page.
