Trust moat · Verification policy

How RunLocalAI verifies community benchmarks

When a community contributor submits a benchmark, where does the row end up? Why does one row render with a green “Reproduced” badge while the next row, on the same hardware, renders with a neutral outline? This page documents the policy that drives those decisions, so you can decide for yourself how much to trust any row you read.

By Fredoline Eruo · Last reviewed 2026-05-07

Why we document this in public

The trust moat for a content site lives in the gap between the published number and the reader’s confidence in it. We can publish careful editorial measurements, accept thoughtful community submissions, and run a verification flow on top — but if the machinery is invisible, the reader has no way to tell whether what they’re reading is a measurement, a single-source claim, or a cross-validated result. So we publish the policy. Every state, every transition, every rejection criterion. The trust signal in the badge only works if you can audit how it got there.

The four-state trust ladder

Every community benchmark row lives in exactly one state. The state determines which trust badge renders on the public detail page and how the confidence engine weighs the row.

1. Queued

Submitted to the moderation queue, not yet reviewed. Rows in this state do not render publicly. They’re visible only in the admin moderation console, where editorial reviewers work the queue within a 1–7 day window. Most submissions sit here for at most a few days before moving forward.

2. Approved-public

Editorial reviewed the row, found nothing disqualifying, and published it. The row renders with a Community submitted badge — neutral outline, deliberately understated. This is the lowest public-facing trust tier. Treat the number as a single data point from a single operator: useful, but not yet validated. The confidence engine starts these rows in the low tier; good metadata can lift them to moderate.

3. Reproduced

An editor or a verified contributor re-ran the benchmark on identical or closely similar hardware, following the discipline rules in the reproduction guide, and the numbers landed within tolerance (typically within 10% of the original). The row renders with a green Reproduced badge. The trust weight in the confidence engine lifts a tier; most reproduced rows reach high on the confidence ladder.
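To make the tolerance concrete, here is a minimal TypeScript sketch. It assumes “within 10%” means relative difference against the original number; the function name is illustrative, not part of our pipeline.

```ts
// Minimal tolerance check, assuming "within 10%" is a relative
// comparison against the original value (an assumption; the exact
// comparison reviewers apply may differ per metric).
function withinTolerance(
  original: number,
  reproduced: number,
  tolerance = 0.1,
): boolean {
  if (original <= 0) return false; // tok/s and TTFT are strictly positive
  return Math.abs(reproduced - original) / original <= tolerance;
}

withinTolerance(42.0, 39.1); // true: 6.9% below the original
withinTolerance(42.0, 35.0); // false: 16.7% off, row stays Approved-public
```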

For a representative example of how the badge appears alongside the prose verdict, see any reproduced benchmark detail page (e.g. /benchmarks/1) — the badge sits next to the headline tok/s number; the build-context block below shows exactly what the reproducer matched against.

4. Independently reproduced

Two or more independent operators have reproduced the row, AND the original submitter is not one of them. This is the highest trust state a community benchmark can reach. The row renders with a blue Independently reproduced badge. The confidence engine moves these rows toward very-high if they also have complete metadata and recent runtime versions.

Editorial measurements skip the community-submitted entry point entirely — they start at the Editorial trust badge, the green-checkmark variant — but they can still pick up the independently-reproduced descriptor when independent operators confirm them. This is the strongest signal we publish.
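For readers who think in types, here is a hedged sketch of the ladder and the promotion rule for the top rung. Every name below is illustrative, the shape of the policy rather than our actual schema.

```ts
type TrustState =
  | "queued"                   // moderation queue, never renders publicly
  | "approved-public"          // Community submitted badge (neutral outline)
  | "reproduced"               // green Reproduced badge
  | "independently-reproduced" // blue badge, highest community state
  | "rejected"                 // terminal, never renders
  | "duplicate";               // terminal, closed with a link to the canonical row

interface Reproduction {
  operatorId: string; // stable pseudonymous id, e.g. an IP-hash
  withinTolerance: boolean;
}

// Top-rung rule: two or more distinct operators reproduced the row,
// and the original submitter is not counted among them.
function qualifiesAsIndependent(
  submitterId: string,
  repros: Reproduction[],
): boolean {
  const independents = new Set(
    repros
      .filter((r) => r.withinTolerance && r.operatorId !== submitterId)
      .map((r) => r.operatorId),
  );
  return independents.size >= 2;
}
```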

The two terminal states

Two states close the loop without progressing forward:

  • Rejected. Editorial declined to publish. Documented in the moderation note. The row stays in the database (audit trail, see below) but never renders publicly.
  • Duplicate. The submission was substantively identical to an existing row (same model + hardware + runtime + quant within a few weeks). The duplicate is closed and a note links to the canonical row.
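As a sketch, the duplicate check keys on exactly the tuple named above. The 21-day window below is an assumed value; the policy only commits to “a few weeks.”

```ts
interface Submission {
  model: string;
  hardware: string;
  runtime: string;
  quant: string;
  submittedAt: Date;
}

// Illustrative duplicate check: same model + hardware + runtime + quant
// within a rolling window. 21 days is an assumption for "a few weeks".
const DUP_WINDOW_DAYS = 21;

function isDuplicateOf(candidate: Submission, existing: Submission): boolean {
  const sameConfig =
    candidate.model === existing.model &&
    candidate.hardware === existing.hardware &&
    candidate.runtime === existing.runtime &&
    candidate.quant === existing.quant;
  const daysApart =
    Math.abs(candidate.submittedAt.getTime() - existing.submittedAt.getTime()) /
    86_400_000; // ms per day
  return sameConfig && daysApart <= DUP_WINDOW_DAYS;
}
```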

Why submissions get rejected

Reviewers reject (move a row to rejected) when any of the following applies. The criteria are deliberately specific so contributors can avoid them.

  • Implausible numbers. A submission claiming tok/s rates that exceed the theoretical memory-bandwidth ceiling for the hardware, or TTFT values that don’t match the runtime’s known prefill profile. Reviewers cross-check against the confidence engine’s outlier penalty before rejecting — sometimes the number is real and surprising — but a number that violates the laws of memory bandwidth gets rejected (see the arithmetic sketched after this list).
  • Missing context. A row missing enough metadata (runtime version blank, driver version blank, OS label blank, no notes) that a reproducer cannot match the configuration. We don’t publish rows that nobody could reproduce — the row would be a dead end.
  • Hostile content. Slurs, attempts to embed scripts in markdown notes, deliberately misleading framing, off-topic political content. Rejected, and the IP-hash gets added to the rate-limit watchlist. (We hash IPs on ingestion and never store raw values.)
  • Dishonest framing. Claims of editorial measurement on a community submission, misrepresented concurrent user counts, claims of hardware ownership the submitter clearly doesn’t have. Rejected, and if a contact email is on file the submitter gets a one-line warning.
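The memory-bandwidth ceiling in the first criterion is checkable arithmetic: during decode, a memory-bound model streams roughly its full weight set from memory for every generated token, so throughput cannot exceed bandwidth divided by weight size. A sketch with illustrative numbers:

```ts
// Rough plausibility ceiling for decode throughput. Real systems land
// well below this; a claim above it is physically implausible.
function maxPlausibleTokS(
  memoryBandwidthGBps: number, // hardware spec, GB/s
  modelWeightsGB: number,      // quantized weight size, GB
): number {
  return memoryBandwidthGBps / modelWeightsGB;
}

// Illustrative: ~1000 GB/s of bandwidth and a ~40 GB quantized model
// cap out near 25 tok/s. A submission claiming 80 tok/s on that
// configuration violates the ceiling and gets flagged.
maxPlausibleTokS(1000, 40); // 25
```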

Anonymity and credit

The default posture is anonymous. The submission form at /submit/benchmark does not require name, email, or any other identifying information. Most contributors leave those fields blank and we render the attribution as (anonymous).

When a contributor opts to provide a name, the public detail page renders it as plain text. When a submitter URL is also provided — a personal site, a GitHub profile, a Mastodon handle — the name becomes a link with rel="nofollow noopener": nofollow because we don’t pass link equity to user-supplied URLs, and noopener so the destination tab can’t access our window. Discord servers and Telegram groups are not appropriate as credit URLs, and reviewers clear those fields before publishing.

Editorial veto exists. A boolean flag on each row lets reviewers force the public render to (anonymous) regardless of what the submitter typed. Used for spammy self-promotion in the credit URL, for off-platform communities we don’t link to, and for hostile content the submitter tried to inject through the credit field. The submission stays in the database; only the attribution is suppressed.
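A hedged sketch of the attribution render described above; the field names and the escapeHtml helper are assumptions, not our actual code:

```ts
interface Attribution {
  name?: string;          // blank for most submissions
  url?: string;           // cleared by reviewers for Discord/Telegram links
  editorialVeto: boolean; // forces (anonymous) regardless of input
}

// escapeHtml is an assumed helper: user input never reaches HTML raw.
function renderAttribution(
  a: Attribution,
  escapeHtml: (s: string) => string,
): string {
  if (a.editorialVeto || !a.name) return "(anonymous)";
  const name = escapeHtml(a.name);
  if (!a.url) return name;
  // nofollow: no link equity passes to user-supplied URLs.
  // noopener: the destination tab cannot reach our window object.
  return `<a href="${escapeHtml(a.url)}" rel="nofollow noopener">${name}</a>`;
}
```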

How we assess “verified hardware ownership”

We accept three forms of evidence that a contributor actually owns the hardware they’re benchmarking on:

  • A reproducible benchmark history — the contributor has previously submitted multiple internally consistent benchmarks across different models on the same hardware row. The IP-hash mechanism (sketched at the end of this section) lets us correlate submissions even when contributors are anonymous.
  • A contributor track record — named contributors with a public presence (a GitHub profile, prior published submissions at credible URLs) whose public profile mentions the hardware are credible by default.
  • A cross-checkable driver/runtime version pattern — submissions with non-default driver versions that we can cross-check against known driver-release notes. Faking these requires either having the hardware or being unusually motivated.

What we don’t accept: screenshots of nvidia-smi (trivial to fabricate), photos of hardware (stock photos exist), or unsupported claims of ownership. The verification bar is deliberately pragmatic — we’re not a peer-reviewed venue. The point is to catch coordinated submission campaigns and obvious fabrications, not to demand a level of evidence that suppresses honest first-time contributors.
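For the IP-hash mechanism referenced above, a minimal sketch, assuming a keyed hash with a server-side secret so raw addresses never persist; the scheme shown is an assumption, not our actual ingestion pipeline:

```ts
import { createHmac } from "node:crypto";

// Hash-on-ingestion sketch: the raw IP is keyed-hashed before it ever
// touches the database. The stored value correlates submissions from the
// same operator but cannot be reversed to an address without the secret.
function ipHash(rawIp: string, serverSecret: string): string {
  return createHmac("sha256", serverSecret).update(rawIp).digest("hex");
}
```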

The audit-log discipline

Every state change on every community benchmark row writes a row to the editorial audit log: who changed it, what state it moved from, what state it moved to, the moderation note, the timestamp. Two rules we treat as non-negotiable:

  • We never delete a raw submission. Even rejected submissions stay in the database. The audit trail is what protects against accusations of selective moderation — if someone claims we silently buried their submission, the database shows the moderation note and the editor who closed it. Hard delete is reserved for legal-takedown scenarios.
  • Every state transition is logged. Including the queued-to-rejected path. Including the approved-public-to-reproduced path. Including reversals (which require a second editor’s signoff, a guard against single-reviewer drift). The audit log is the source of truth for reviewer accountability.
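Sketched as data, an audit entry carries exactly the fields listed above, plus the second-signoff guard for reversals. All names are illustrative:

```ts
interface AuditEntry {
  benchmarkId: number;
  editor: string;
  fromState: TrustState; // see the ladder sketch above
  toState: TrustState;
  moderationNote: string;
  secondEditor?: string; // required when the transition is a reversal
  at: Date;
}

// Append-only discipline: entries are inserted, never updated or deleted,
// and a reversal without a second editor's signoff is refused outright.
function appendTransition(
  log: AuditEntry[],
  entry: AuditEntry,
  isReversal: boolean,
): void {
  if (isReversal && !entry.secondEditor) {
    throw new Error("reversals require a second editor's signoff");
  }
  log.push(entry);
}
```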

We don’t publish the audit log itself — moderation notes sometimes contain context that’s only useful internally. But we’ll surface a row’s state history to anyone who asks about a specific submission, and the policy here is what the log enforces.

How to read a benchmark’s trust badge

Three quick mental defaults for an operator landing on a benchmark page from search:

  • Editorial (green checkmark): we measured it, on our hardware, with reproducible commands. Treat as the strongest signal.
  • Reproduced or Independently reproduced (green / blue): a community-submitted number that’s been independently validated. Treat as nearly as strong as editorial.
  • Community submitted (neutral outline): one operator’s data point, reviewed for plausibility but not independently confirmed. Useful directional signal; treat with appropriate skepticism, and consider reproducing it yourself if your decision rests on it.

The confidence engine at /resources/confidence-methodology layers on top of the badge state to produce a tier label (low / moderate / high / very-high) that captures additional factors: row age, runtime-version drift, missing-field penalty, outlier checks. The badge tells you the verification state; the confidence tier tells you whether the row is also fresh, complete, and consistent.
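A sketch of that layering only (the real factors and weights live on the methodology page): the badge state sets a starting tier, and freshness, completeness, and consistency checks move it from there. Everything below is illustrative.

```ts
type Tier = "low" | "moderate" | "high" | "very-high";

const TIERS: Tier[] = ["low", "moderate", "high", "very-high"];

// Illustrative layering: badge state picks the starting tier, then each
// penalty that fires pulls the row down one tier (floor at "low").
function confidenceTier(
  state: TrustState, // see the ladder sketch above
  penalties: { stale: boolean; missingFields: boolean; outlier: boolean },
): Tier {
  let i =
    state === "independently-reproduced" ? 3 :
    state === "reproduced" ? 2 :
    0; // approved-public starts low (metadata lifts omitted from this sketch)
  for (const fired of Object.values(penalties)) {
    if (fired && i > 0) i--;
  }
  return TIERS[i];
}
```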

Adjacent reading: /resources/reproduction-guide for the protocol that lifts rows up the trust ladder, /resources/confidence-methodology for how the badge state interacts with the confidence engine, and /editorial-policy for how editorial measurements themselves are produced.