How we run benchmarks
The contract that every score on /benchmarks/quality is held to — whether we ran it or a community contributor did.
The trust gate
Quality benchmark leaderboards require raw public logs. Hardware tokens-per-second rows use the broader confidence ladder: source, operator, environment, reproduction state, and missing evidence are shown on the row, and sparse rows are labeled lower confidence.
A quality benchmark row appears on a public leaderboard if and only if it satisfies all of these conditions:
- A public Gist URL in the test_run_log_url field, containing the full stdout + stderr of the run. No Gist, no public render. This is enforced at the SQL query level on every leaderboard page.
- A reproducible runner in the public repo. Every benchmark we support has its runner script in scripts/. A third party can clone, install, and replicate independently.
- Explicit rig metadata recorded per row: model slug, quantization, runtime version, hardware slug. Same model at Q4_K_M vs Q6_K is a different row, not a silent average.
- A trust tier rendered on the row's badge (see below).
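As a rough illustration, the gate conditions above can be written as a single predicate. This is a sketch only; the real check is enforced in SQL, as noted above. Only test_run_log_url is a field name taken from this page. The QualityRow type, the other property names, and the gist.github.com URL pattern are assumptions made for the example.

```typescript
// Hypothetical row shape. Only test_run_log_url comes from this page;
// every other property name is an assumption for this sketch.
interface QualityRow {
  test_run_log_url: string | null;
  runner_script: string | null; // e.g. a path under scripts/
  model_slug: string | null;
  quantization: string | null;
  runtime_version: string | null;
  hardware_slug: string | null;
  trust_tier: string | null;
}

// Assumption: a public Gist URL is an https URL on gist.github.com.
const GIST_URL = /^https:\/\/gist\.github\.com\/\S+$/;

// Returns the names of the gate fields a row is missing.
// An empty array means the row satisfies every condition.
function missingTrustGateFields(row: QualityRow): string[] {
  const missing: string[] = [];
  const isBlank = (v: string | null): boolean => v === null || v.trim() === "";

  if (row.test_run_log_url === null || !GIST_URL.test(row.test_run_log_url.trim())) {
    missing.push("test_run_log_url");
  }
  if (isBlank(row.runner_script)) missing.push("runner_script");
  if (isBlank(row.model_slug)) missing.push("model_slug");
  if (isBlank(row.quantization)) missing.push("quantization");
  if (isBlank(row.runtime_version)) missing.push("runtime_version");
  if (isBlank(row.hardware_slug)) missing.push("hardware_slug");
  if (isBlank(row.trust_tier)) missing.push("trust_tier");
  return missing;
}

// "If and only if": the row renders exactly when nothing is missing.
function passesTrustGate(row: QualityRow): boolean {
  return missingTrustGateFields(row).length === 0;
}
```

Returning the list of missing fields, rather than a bare boolean, makes it easy to tell a submitter what to correct.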
Trust tiers
A runlocalai operator ran this score on a stated rig. The runner script + Gist log are the only evidence required. Default tier for everything we publish ourselves.
A community submission that an operator has independently checked. Verification usually means re-running on our own rig and confirming the score is within the noise margin (roughly ±2 percentage points).
Independent submission with a Gist log and reproduction command, documented but not yet re-verified by us. Renders publicly because trust gate conditions are met; the badge tells the reader this is independent evidence.
Submitted via /benchmarks/submit but missing one or more trust-gate fields (Gist not posted, runner command unclear, etc.). Does NOT render publicly until corrected.
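The verification step described for checked community submissions amounts to a tolerance comparison. A minimal sketch, assuming scores are percentages on a 0-100 scale and the margin is given in percentage points; the function name and the default of 2 are illustrative, based on the "~±2pp" figure above rather than on any published threshold.

```typescript
// Assumption: both scores are percentages (0-100) and marginPp is in percentage points.
// withinNoiseMargin is a hypothetical helper name, not part of any published runner.
function withinNoiseMargin(submitted: number, rerun: number, marginPp: number = 2): boolean {
  return Math.abs(submitted - rerun) <= marginPp;
}

// A submitted 71.5 against a re-run of 73.0 differs by 1.5pp, inside the margin.
const submissionConfirmed = withinNoiseMargin(71.5, 73.0);
```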
The reproduction contract
Every quality benchmark row carries the full reproduction context. Hardware tok/s rows carry these fields when available, and the public detail page marks missing fields as "Not provided":
- Runner script — the exact file in our public repo that produced this score (e.g., scripts/run-humaneval-plus.ts).
- Reproduction command — the exact CLI invocation, including all flags (quantization, runtime, endpoint, hardware tag).
- Raw log Gist — stdout + stderr verbatim. Anyone can diff your re-run against ours.
- Per-subtask breakdown when the benchmark has sub-categories (MMLU has 57 subjects, TurkishMMLU has 9). Stored in per_subtask_json so you can spot if a model is great at history but weak at math.
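To show how a per-subtask breakdown can be read, here is a small sketch that lists a model's weakest sub-categories. The shape of per_subtask_json is not specified on this page, so the mapping from subtask name to accuracy in percent used here is an assumption, as are the function name and the sample numbers.

```typescript
// Assumed shape for per_subtask_json: subtask name -> accuracy in percent.
type PerSubtask = Record<string, number>;

// Returns the `count` lowest-scoring subtasks, weakest first.
function weakestSubtasks(perSubtask: PerSubtask, count: number): Array<[string, number]> {
  return Object.keys(perSubtask)
    .map((name) => [name, perSubtask[name]] as [string, number])
    .sort((a, b) => a[1] - b[1])
    .slice(0, count);
}

// Illustrative numbers only, not real benchmark results.
const sampleBreakdown: PerSubtask = { history: 82, math: 41, law: 67 };
const weakest = weakestSubtasks(sampleBreakdown, 2); // math first, then law
```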
What we don’t do
- No unlabeled vendor scores. Quality leaderboards require a Gist proving we ran it (or a community contributor did, with a Gist). Hardware tok/s rows may include vendor-published or official figures only when they are labeled lower confidence and kept distinct from measured coverage.
- No cherry-picking best-of-N. Every published score is greedy / temperature-0 / single sample. Pass@1, accuracy@1, etc. No best-of-3 to inflate.
- No silent re-runs. The unique constraint (model, benchmark, quant, runtime, hardware) means a re-run upserts the existing row: we keep the latest score, not the best one. Older runs are still in the row history via tested_at.
- No paid placement. Models don’t buy their way to the top. Sponsored entries are not a future feature; the moment they appear, the leaderboard’s value collapses.
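The "latest, not best" rule can be sketched with an in-memory table keyed on the unique constraint. This is an illustration, not the actual database logic: the ScoreRow type, the rowKey and upsert helpers, and the assumption that tested_at is a UTC ISO 8601 string (so that plain string comparison orders runs by time) are all introduced for the example.

```typescript
interface ScoreRow {
  model: string;
  benchmark: string;
  quant: string;
  runtime: string;
  hardware: string;
  score: number;
  tested_at: string; // assumed: UTC ISO 8601, e.g. "2026-05-01T12:00:00Z"
}

// One key per (model, benchmark, quant, runtime, hardware) tuple.
// JSON.stringify avoids collisions that a plain separator could cause.
function rowKey(r: ScoreRow): string {
  return JSON.stringify([r.model, r.benchmark, r.quant, r.runtime, r.hardware]);
}

// Insert when the key is new; otherwise overwrite with the more recent run,
// even if its score is lower than the one already stored.
function upsert(table: Record<string, ScoreRow>, incoming: ScoreRow): void {
  const key = rowKey(incoming);
  const existing = table[key];
  if (existing === undefined || incoming.tested_at >= existing.tested_at) {
    table[key] = incoming;
  }
}
```

Because quantization is part of the key, the same model at Q4_K_M and at Q6_K lands in two separate rows rather than overwriting each other.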
How to submit a score
See /benchmarks/submit for the form. You’ll need:
- A public Gist URL with the full runner output (run our script, or your own that follows the benchmark’s methodology, and paste output into a Gist).
- The exact reproduction command you ran (so a third party can verify).
- Model slug from our /models catalog, plus quantization, runtime, and hardware tag.
- A submitter handle (GitHub or email) so we can credit you and ping you if verification questions arise.
Citation
If you reference runlocalai benchmark data in a paper or article, please cite the underlying benchmark (HumanEval+, TurkishMMLU, etc.) plus our leaderboard URL with the access date:
runlocalai.co benchmark leaderboard, "2026-05-26".
URL: https://www.runlocalai.co/benchmarks/quality
JSON: https://www.runlocalai.co/api/v1/quality-benchmarks
Changes to this methodology
When we change scoring methodology (e.g., switch from loglikelihood to generative for a multi-choice task), the benchmark gets a new slug (e.g., turkish-mmlu-generative vs turkish-mmlu) so old and new scores aren’t silently comparable. The methodology section of each benchmark detail page documents exactly what changed.