
How benchmarks earn confidence

The four-state ladder explained on the trust index is the public face of a more detailed engine. This page documents the engine: what makes a row climb, what gets a submission rejected outright, how reproductions are verified, and what happens to a measurement when it stops being current.

By Fredoline Eruo · Last reviewed 2026-05-07

Confidence engine factors

The engine in src/lib/benchmarks/confidence.ts accumulates a small number of signals and maps the result to one of four tier labels. The signals, in roughly decreasing order of impact:

Reproduction count. The largest single factor. The first independent reproduction is a step change — moving from “one operator says so” to “two operators say so” is conceptually a different category, not an incremental gain. Subsequent reproductions add weight with diminishing returns; by the fourth or fifth confirmation the row is at very-high and additional confirmations no longer move it.

Age. Time since the recorded measurement date. Confidence decays gradually for the first 18 months. Past 18 months, a row drops a tier automatically and the public detail page begins rendering an explicit Stale badge. Beyond 24 months, very-stale rows are reviewed for either a refresh or a deindex decision.

Variance across rows. When multiple measurements exist for the same model + hardware + runtime configuration, the spread is itself a signal. Tight clustering (rows within ~10% of each other) lifts confidence — the configuration is stable. Wide spread (30%+ divergence) drops confidence and surfaces the configuration to editorial review.

Runtime consistency. Each measurement is anchored to a specific runtime version. When a runtime ships a major version with kernel changes — vLLM rewriting an attention path, llama.cpp dropping a flag, ExLlama changing a scheduler — rows on the prior version drop a tier. The annotation table that drives this is small and editorially maintained; it is not automatic.

Missing fields. Every blank in the discipline set costs the row points: blank runtime version, blank driver version, blank OS label, missing VRAM peak, missing TTFT. A row missing all of these caps at low regardless of how many reproductions it accumulates, because no future reproducer can match the configuration cleanly. This is the easiest factor for a contributor to address — fill in the form fields, and the row jumps a tier on the next moderation pass.
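
To make the shape of the accumulation concrete, here is a minimal TypeScript sketch covering all five signals. Every name, weight, and cutoff in it (RowSignals, tierFor, the score-to-tier mapping) is an illustrative assumption, not the logic in confidence.ts; the authoritative mechanics are on the methodology page linked below.

```ts
// Illustrative sketch only. Names, weights, and thresholds are assumptions,
// not the values in src/lib/benchmarks/confidence.ts.

type Tier = "low" | "moderate" | "high" | "very-high";

interface RowSignals {
  reproductions: number;        // independent confirmations of this row
  ageMonths: number;            // months since the recorded measurement date
  peerSpreadPct: number | null; // divergence across rows for the same config, if any
  missingFields: number;        // blanks among runtime, driver, OS, VRAM peak, TTFT
  runtimeSuperseded: boolean;   // a recorded breaking change shipped since this run
}

const LADDER: Tier[] = ["low", "moderate", "high", "very-high"];
const dropTier = (t: Tier): Tier => LADDER[Math.max(0, LADDER.indexOf(t) - 1)];

function tierFor(row: RowSignals): Tier {
  let score = 0;

  // Reproduction count: the first confirmation is a step change;
  // later ones add weight with diminishing returns, flat past the fifth.
  if (row.reproductions >= 1) {
    score += 40 + Math.min(row.reproductions - 1, 4) * 10;
  }

  // Age: gradual decay over the first 18 months.
  score -= Math.min(row.ageMonths, 18) * 1.5;

  // Variance: tight clustering (~10%) lifts, wide spread (30%+) drops.
  if (row.peerSpreadPct !== null) {
    if (row.peerSpreadPct <= 10) score += 10;
    else if (row.peerSpreadPct >= 30) score -= 20;
  }

  // Missing fields: every blank in the discipline set costs points.
  score -= row.missingFields * 8;

  let tier: Tier =
    score >= 70 ? "very-high" :
    score >= 40 ? "high" :
    score >= 15 ? "moderate" :
    "low";

  // Hard modifiers run after the score mapping.
  if (row.ageMonths > 18) tier = dropTier(tier);    // automatic Stale drop
  if (row.runtimeSuperseded) tier = dropTier(tier); // runtime version bump
  if (row.missingFields >= 5) tier = "low";         // metadata cap

  return tier;
}
```

The hard modifiers running after the score mapping is the point of the sketch: it is why a metadata-starved row caps at low no matter how many reproductions it accumulates.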

The full per-factor mechanics — including the outlier-penalty rule and the per-runtime version-bump table — live at /resources/confidence-methodology. That page is the technical reference; this one is the trust explanation.

When a submission is rejected

Rejected submissions never appear on the site. Not in archive views, not in stale-data dumps, not in the audit feed. The rejection is final. An operator whose submission is rejected is not banned from the site — they can submit again, ideally addressing the rejection reason — but the specific rejected row is gone.

The criteria that trigger rejection, written plainly:

  • Implausible numbers. A claimed decode rate that is more than 3x the median for similar configurations is flagged automatically and gets editorial review (a sketch of this flag follows the list). Some of these are genuine outliers caused by an unusual driver branch or a non-default power setting; those get approved with editorial notes documenting the configuration. Most of them are measurement errors — wrong context length, wrong quant, server-mode misconfigured — and those are rejected.
  • Missing critical metadata. A submission with no runtime version, no driver version, no OS label, and no quant format cannot anchor to a specific configuration; no reproducer could match it cleanly. Rejected with a note asking for the missing fields.
  • Hardware mismatch. A submission that claims a card the operator demonstrably does not own — caught when the submitter has prior submissions on a different tier of hardware that contradict the claim, or when the photo evidence obviously shows a different card. Rare; rejected when found.
  • Adversarial submissions. Submissions that appear designed to manipulate the catalog rather than report a measurement — duplicate floods, copy-paste of another contributor's row with the submitter name swapped, submissions that paste vendor marketing numbers as if they were measured. Rejected, and the submitter is flagged for editorial scrutiny on future submissions.
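
The first criterion's automatic flag is simple enough to sketch. This is a hypothetical shape (needsPlausibilityReview and its inputs are assumptions, not the site's moderation code):

```ts
// Hypothetical sketch of the implausible-numbers flag. A claimed decode
// rate past 3x the peer median goes to editorial review; the review, not
// the flag itself, decides between approval-with-notes and rejection.

function needsPlausibilityReview(
  claimedTokS: number,
  peerTokS: number[], // measured tok/s for similar configurations
): boolean {
  if (peerTokS.length === 0) return false; // nothing comparable to check against
  const sorted = [...peerTokS].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return claimedTokS > 3 * median;
}
```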

Reproduction methodology

A reproduction is not a free upvote. The protocol is documented at /resources/reproduction-guide in operational detail; here is the trust-relevant summary.

A reproduction must run the same model at the same quantization on hardware in the same tier using a runtime version that does not have a recorded breaking change relative to the original. The measurement must include the median of three runs after a warmup run (the first run is discarded; thermal and cache state make it unreliable).

A reproduction counts as successful when the result is within ±15% of the original tok/s. The 15% band reflects real-world noise floors on consumer hardware: thermal headroom, background processes, power-management state, driver branch differences. A tighter band would flag legitimate runs as failures; a looser band would let drift accumulate. 15% is the compromise.
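
A minimal sketch of that check, assuming the caller has already discarded the warmup run (reproductionResult and its shape are hypothetical names, not the site's code):

```ts
// Hypothetical sketch: median of three post-warmup runs, checked
// against a ±15% band around the original measurement.

function reproductionResult(
  originalTokS: number,
  postWarmupRuns: [number, number, number],
): { medianTokS: number; withinBand: boolean } {
  const medianTokS = [...postWarmupRuns].sort((a, b) => a - b)[1];
  const withinBand = Math.abs(medianTokS - originalTokS) / originalTokS <= 0.15;
  return { medianTokS, withinBand };
}
```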

Reproductions outside the band are not rejected — they are published as separate rows with the spread visible. A configuration where reproductions diverge by 20–30% is itself useful information; the operator reading the page can see the spread and judge for themselves whether the configuration is worth pursuing. The trust apparatus does not silence disagreement; it surfaces it.

The state transitions that follow a successful reproduction are documented at /resources/verification-policy. Briefly: Community submitted moves to Reproduced on first successful reproduction, and to Independently reproduced when the third independent operator confirms.
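
In code shape, the ladder reduces to a function of independent confirmations. This is again a hypothetical sketch; the binding rules live at /resources/verification-policy:

```ts
// Hypothetical sketch of the verification ladder.

type VerificationState =
  | "community-submitted"         // initial state for operator submissions
  | "reproduced"                  // first successful independent reproduction
  | "independently-reproduced";   // third independent operator confirms

function verificationState(independentConfirmations: number): VerificationState {
  if (independentConfirmations >= 3) return "independently-reproduced";
  if (independentConfirmations >= 1) return "reproduced";
  return "community-submitted";
}
```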

Stale-data retirement

Two thresholds, at 18 and 24 months, divide a benchmark's lifecycle into three stages.

0–18 months. Confidence decays gradually but the row renders normally. The byline date and run date are visible; the operator can see how recent the measurement is and weigh it accordingly.

18–24 months. The public detail page renders an explicit Stale badge. The row is still useful but flagged. Editorial decides whether to refresh, retire, or annotate.

24+ months. Very-stale. The row is reviewed for either a refresh measurement or a retirement note. Retired rows do not disappear silently — the page remains, the byline remains, the run date remains, but a clear retirement banner explains that the runtime ecosystem has moved enough that the original number is mostly historical.
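
The three stages reduce to a single age check. A hypothetical sketch, with names assumed:

```ts
// Hypothetical sketch of the lifecycle stages described above.

type LifecycleStage = "current" | "stale" | "very-stale";

function lifecycleStage(ageMonths: number): LifecycleStage {
  if (ageMonths >= 24) return "very-stale"; // reviewed for refresh or retirement
  if (ageMonths >= 18) return "stale";      // Stale badge on the detail page
  return "current";                         // renders normally; confidence decays
}
```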

The reason these thresholds exist: the runtime ecosystem moves fast enough that a 2-year-old llama.cpp benchmark is mostly directional information. Pretending otherwise would be a quiet version of the manufactured-numbers failure the third promise on the trust index rules out. Stale signals are how we are honest about it.

Where the engine cannot help

The engine is a measurement of the available evidence. It cannot manufacture certainty where evidence is absent.

A row at moderate with one reproducer and full metadata is not the same kind of object as an editorial measurement on owner hardware, even if the engine assigns them similar tier labels. The label captures the available signal; it does not capture the underlying source. That is why the community-submitted, reproduced, and editorial badges render alongside the tier — they convey the source channel directly, without compressing it through the engine.

And: the engine does not catch every error. A reproducer who runs the wrong quant format and reports a number that happens to fall within ±15% of the original by coincidence will register as a successful reproduction. Editorial review catches the obvious cases; the long tail is partly defended by variance flagging and partly by the same thing every benchmark site relies on — the assumption that most operators are not adversaries. We accept that limit and disclose it rather than pretending the system is foolproof.

Where to go next

The reproduction guide at /resources/reproduction-guide documents the operator-facing protocol that turns a sitting benchmark into a reproduced one. Contributing a successful reproduction is the most direct way to lift a row's confidence.