Benchmark methodology — how we run scores | runlocalai

The trust gate

Quality benchmark leaderboards require raw public logs. Hardware tokens-per-second rows use the broader confidence ladder: source, operator, environment, reproduction state, and missing evidence are shown on the row, and sparse rows are labeled lower confidence.

A quality benchmark row appears on a public leaderboard if and only if it satisfies all of these conditions:

A public Gist URL in the test_run_log_url field, containing the full stdout + stderr of the run. No Gist, no public render. This is enforced at the SQL query level on every leaderboard page.
A reproducible runner in the public repo. Every benchmark we support has its runner script in scripts/. A third party can clone the repo, install it, and replicate the run independently.
Explicit rig metadata recorded per row: model slug, quantization, runtime version, hardware slug. Same model at Q4_K_M vs Q6_K is a different row, not a silent average.
A trust tier rendered on the row's badge (see below).
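
The four conditions above can be read as a single predicate over a row. The sketch below is illustrative only: the text says the real gate is enforced in SQL, and of the names used here only test_run_log_url comes from this page. The QualityRow shape, the other field names, and both helper functions are assumptions.

```typescript
// Hypothetical row shape. Only test_run_log_url is a field name taken from
// this page; every other name is an assumption made for illustration.
type TrustTier = "FIRST-PARTY" | "VERIFIED" | "COMMUNITY" | "PENDING";

interface QualityRow {
  test_run_log_url: string | null; // public Gist with full stdout + stderr
  runner_script: string | null;    // e.g. a file under scripts/
  model_slug: string | null;
  quantization: string | null;     // e.g. "Q4_K_M"
  runtime_version: string | null;
  hardware_slug: string | null;
  trust_tier: TrustTier;
}

// Returns the names of the trust-gate fields that are missing or blank.
function missingGateFields(row: QualityRow): string[] {
  const required: (keyof QualityRow)[] = [
    "test_run_log_url",
    "runner_script",
    "model_slug",
    "quantization",
    "runtime_version",
    "hardware_slug",
  ];
  return required.filter((field) => {
    const value = row[field];
    return value === null || value.trim() === "";
  });
}

// A row renders publicly only if nothing is missing and it is not PENDING.
function passesTrustGate(row: QualityRow): boolean {
  return row.trust_tier !== "PENDING" && missingGateFields(row).length === 0;
}
```

Returning the list of missing fields, rather than a bare boolean, makes it easy to tell a submitter exactly what to fix.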

Trust tiers

FIRST-PARTY

A runlocalai operator ran this benchmark on a stated rig. The runner script and Gist log are the only evidence required. This is the default tier for everything we publish ourselves.

VERIFIED

A community submission that one of our operators has independently checked. Verification usually means re-running the benchmark on our own rig and confirming the score falls within the noise margin (roughly ±2 percentage points).

COMMUNITY

An independent submission with a Gist log and reproduction command, documented but not yet re-verified by us. It renders publicly because the trust-gate conditions are met; the badge tells the reader this is independent evidence.

PENDING

Submitted via /benchmarks/submit but missing one or more trust-gate fields (Gist not posted, runner command unclear, etc.). Does NOT render publicly until corrected.
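
As a rough illustration of how the tiers relate, the sketch below promotes a COMMUNITY row to VERIFIED when an operator re-run lands within the noise margin, and hides PENDING rows. The function names and the 0 to 100 percentage scale are assumptions. The 2-point default mirrors the approximate margin quoted above, and actual verification may involve more than this one comparison.

```typescript
type Tier = "FIRST-PARTY" | "VERIFIED" | "COMMUNITY" | "PENDING";

// Scores are assumed to be percentages on a 0-100 scale, so the margin is
// expressed in percentage points.
function withinNoiseMargin(
  submittedPct: number,
  rerunPct: number,
  marginPp: number = 2,
): boolean {
  return Math.abs(submittedPct - rerunPct) <= marginPp;
}

// A COMMUNITY row is promoted only when the re-run agrees within the margin;
// every other tier is returned unchanged.
function tierAfterRerun(current: Tier, submittedPct: number, rerunPct: number): Tier {
  if (current === "COMMUNITY" && withinNoiseMargin(submittedPct, rerunPct)) {
    return "VERIFIED";
  }
  return current;
}

// PENDING is the only tier that does not render publicly.
function rendersPublicly(tier: Tier): boolean {
  return tier !== "PENDING";
}
```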

The reproduction contract

Every quality benchmark row carries the full reproduction context. Hardware tok/s rows carry these fields when available, and the public detail page marks missing fields as "Not provided":

Runner script — the exact file in our public repo that produced this score (e.g., scripts/run-humaneval-plus.ts).
Reproduction command — the exact CLI invocation, including all flags (quantization, runtime, endpoint, hardware tag).
Raw log Gist — stdout + stderr verbatim. Anyone can diff your re-run against ours.
Per-subtask breakdown when the benchmark has sub-categories (MMLU has 57 subjects, TurkishMMLU has 9). Stored in per_subtask_json so you can spot if a model is great at history but weak at math.
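
To illustrate what the per-subtask breakdown enables, here is a minimal sketch that lists a model's weakest subtasks. It assumes per_subtask_json holds a flat JSON object mapping subtask name to accuracy. The stored format is not specified on this page, so treat that shape and the function name as hypothetical.

```typescript
// Assumed shape: { "<subtask name>": <accuracy between 0 and 1>, ... }
type SubtaskScores = Record<string, number>;

// Returns the `count` lowest-scoring subtasks, weakest first.
function weakestSubtasks(perSubtaskJson: string, count: number): [string, number][] {
  const scores = JSON.parse(perSubtaskJson) as SubtaskScores;
  return Object.keys(scores)
    .map((subtask): [string, number] => [subtask, scores[subtask]])
    .sort((a, b) => a[1] - b[1])
    .slice(0, count);
}
```

Called with a count of 2 on a breakdown such as {"history": 0.9, "math": 0.4, "law": 0.6}, it returns the math entry first and the law entry second.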

What we don’t do

No unlabeled vendor scores. Quality leaderboards require a Gist proving we ran it (or a community contributor did, with a Gist). Hardware tok/s rows may include vendor-published or official figures only when they are labeled lower confidence and kept distinct from measured coverage.
No cherry-picking best-of-N. Every published score comes from a single greedy sample at temperature 0, reported as pass@1, accuracy@1, or the equivalent. We never take the best of three runs to inflate a number.
No silent re-runs. The unique constraint (model, benchmark, quant, runtime, hardware) means a re-run upserts the existing row — we keep the latest score, not the best one. Older runs are still in the row history via tested_at.
No paid placement. Models don’t buy their way to the top. Sponsored entries are not a future feature; the moment they appear, the leaderboard’s value collapses.
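
The "no silent re-runs" rule above can be pictured as an upsert keyed on the five-part unique constraint. The sketch below uses an in-memory Map in place of the real table. The record shape and function names are assumptions, and it does not model the row history.

```typescript
// Hypothetical record: the five key columns follow the unique constraint
// named above; score and tested_at are the values a re-run replaces.
interface RunRecord {
  model: string;
  benchmark: string;
  quant: string;
  runtime: string;
  hardware: string;
  score: number;
  tested_at: string; // ISO 8601 timestamp
}

// JSON-encoding the tuple avoids collisions that a plain separator could cause.
function rowKey(run: RunRecord): string {
  return JSON.stringify([run.model, run.benchmark, run.quant, run.runtime, run.hardware]);
}

// Upsert: a re-run with the same key replaces the stored row, whether its
// score is higher or lower than the previous one.
function upsertRun(table: Map<string, RunRecord>, run: RunRecord): void {
  table.set(rowKey(run), run);
}
```

Because the key ignores the score, a weaker re-run overwrites a stronger earlier one, which is what keeping the latest score rather than the best one requires.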

How to submit a score

See /benchmarks/submit for the form. You’ll need:

A public Gist URL with the full runner output (run our script, or your own that follows the benchmark’s methodology, and paste output into a Gist).
The exact reproduction command you ran (so a third party can verify).
Model slug from our /models catalog, plus quantization, runtime, and hardware tag.
A submitter handle (GitHub or email) so we can credit you and ping you if verification questions arise.

Citation

If you reference runlocalai benchmark data in a paper or article, please cite the underlying benchmark (HumanEval+, TurkishMMLU, etc.) plus our leaderboard URL with the access date:

runlocalai.co benchmark leaderboard, accessed 2026-05-26.
URL: https://www.runlocalai.co/benchmarks/quality
JSON: https://www.runlocalai.co/api/v1/quality-benchmarks

Changes to this methodology

When we change scoring methodology (e.g., switch from loglikelihood to generative for a multi-choice task), the benchmark gets a new slug (e.g., turkish-mmlu-generative vs turkish-mmlu) so old and new scores aren’t silently comparable. The methodology section of each benchmark detail page documents exactly what changed.
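
One consequence of the slug rule is that scores should only be ranked against rows sharing the same benchmark slug. A minimal sketch of that grouping follows. The ScoredRow shape and its field names are assumptions, not the site's schema.

```typescript
// Hypothetical row shape for illustration.
interface ScoredRow {
  benchmark_slug: string; // e.g. "turkish-mmlu" or "turkish-mmlu-generative"
  model_slug: string;
  score: number;
}

// Buckets rows by benchmark slug so that scores produced under different
// methodologies never end up in the same ranking.
function groupByBenchmark(rows: ScoredRow[]): Map<string, ScoredRow[]> {
  const groups = new Map<string, ScoredRow[]>();
  for (const row of rows) {
    const bucket = groups.get(row.benchmark_slug) ?? [];
    bucket.push(row);
    groups.set(row.benchmark_slug, bucket);
  }
  return groups;
}
```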