Trust moat · Reproduction protocol

How to reproduce a RunLocalAI benchmark

Every benchmark in the catalog carries a trust badge — editorial, community-submitted, reproduced, independently-reproduced. This page documents the protocol that lifts a row from one tier to the next, and what your reproduction needs to look like for it to count.

By Fredoline Eruo · Last reviewed 2026-05-07

What reproduction actually means

Reproduction is not “I got a similar number on different hardware.” It is the same model, the same quant, the same runtime, on similar or identical hardware, run with discipline. The whole point of the trust ladder is that the words mean something — a badge that reads Reproduced tells the next operator that someone other than the original submitter ran the same configuration and saw the same numbers. Loosening the matching set would dissolve the signal.

Four dimensions need to match before your run counts as a reproduction:

  • Model. Same family, same parameter count, same instruct/base variant. Llama-3.1-8B-Instruct and Llama-3.1-8B (base) are different rows. Quants from different providers (Bartowski vs Unsloth) are usually close enough — note the alternate provider in your submission.
  • Hardware. Same GPU SKU. RTX 3090 and 3090 Ti are different rows. CPU inference reproductions need the same CPU family + RAM tier.
  • Quant. Same bit-width and same family. GGUF Q4_K_M, AWQ-int4, and GPTQ-int4 are not interchangeable.
  • Runtime. Same engine. llama.cpp and Ollama (which wraps llama.cpp) are typically interchangeable; vLLM and llama.cpp are not.

If three of four match and one drifts, what you have is a related benchmark, not a reproduction. Submit it as a fresh row instead of through the reproduce flow.
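To make the rule concrete, here is a minimal sketch of the four-dimension check, assuming both configurations are held as plain dicts. The field names and the classify_run helper are illustrative, not the catalog's actual schema.

    MATCH_KEYS = ("model", "hardware", "quant", "runtime")

    def classify_run(original: dict, yours: dict) -> str:
        """Count as a reproduction only when all four dimensions match."""
        drifted = [k for k in MATCH_KEYS if original.get(k) != yours.get(k)]
        if not drifted:
            return "reproduction"  # submit through the reproduce flow
        return f"fresh row (drifted: {', '.join(drifted)})"

    print(classify_run(
        {"model": "Llama-3.1-8B-Instruct", "hardware": "RTX 3090",
         "quant": "GGUF Q4_K_M", "runtime": "llama.cpp"},
        {"model": "Llama-3.1-8B-Instruct", "hardware": "RTX 3090 Ti",
         "quant": "GGUF Q4_K_M", "runtime": "llama.cpp"},
    ))  # -> fresh row (drifted: hardware)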

The ten-step reproduction protocol

Follow these in order. Skipping a step is fine if you understand what it controls for; skipping multiple steps means your submission will likely land at community-submitted rather than reproduced.

  1. Read the original submission’s metadata in full. On the public detail page, expand the build context block — model slug, hardware slug, quant, runtime version, driver version, OS label, concurrent users count. Note the values; you’ll need to match them.
  2. Confirm your environment matches the matching set above. If your hardware or runtime drifts, switch to a fresh-submission flow rather than a reproduction flow.
  3. Update your runtime to the recorded version. If the original ran on llama.cpp 1.0.4 and you’re on 1.0.7, pin back. Runtime minor versions matter; we’ve seen 8% shifts on the same hardware across a single CUDA bump.
  4. Capture your driver / CUDA / ROCm / Metal version. Run nvidia-smi, rocm-smi, or system_profiler SPDisplaysDataType and record the version string verbatim.
  5. Warm up. Send a 200-token prompt and discard the response. This loads the model into VRAM and primes the KV cache allocator. First-run numbers are unreliable.
  6. Time the actual benchmark. Send a fresh 500-token prompt with max_tokens=512 and a deterministic seed. Record T0 (request sent), T1 (first token received), T2 (response complete). A timing sketch covering steps 5 through 8 follows this list.
  7. Compute TTFT and decode rate honestly. TTFT is T1 - T0 in milliseconds. Decode rate is tokens_generated / (T2 - T1) in tok/s. Don’t use (T2 - T0) — that conflates prefill with decode and inflates your number on long prompts.
  8. Repeat the timing run three times. Single-run numbers are noise. Sample three, report the median. Keep the spread handy; if it’s wider than 10% across runs, document why in your submission notes.
  9. Capture the VRAM peak throughout the run. Use nvidia-smi --query-gpu=memory.used --loop=1 or your platform's equivalent for the duration of the run and record the high-water mark. Long-context workloads spike during prefill; that spike is what determines whether a build OOMs. A polling sketch follows this list.
  10. Submit through the reproduce flow. On the original benchmark detail page, click “Reproduce this benchmark.” The form pre-fills the model + hardware + tool + quant fields and locks them so the linkage is preserved. Fill in your numbers, your runtime version, your driver version, your OS label. Submit.
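The timing steps (5 through 8) are the easiest to get subtly wrong, so here is a minimal sketch, assuming a llama.cpp server exposing an OpenAI-compatible /v1/completions endpoint on localhost:8080. The URL, the payload fields, and the one-SSE-chunk-per-token approximation are assumptions; adapt them to your runtime, and prefer the usage field in the final chunk if your engine reports exact token counts.

    import statistics
    import time

    import requests

    URL = "http://localhost:8080/v1/completions"  # assumed endpoint

    def timed_run(prompt: str) -> tuple[float, float]:
        """Return (ttft_ms, decode_tok_s) for one streamed completion."""
        t0 = time.perf_counter()                      # T0: request sent
        resp = requests.post(URL, json={
            "prompt": prompt,
            "max_tokens": 512,
            "seed": 42,        # deterministic seed per step 6
            "stream": True,
        }, stream=True, timeout=600)
        t1 = None
        tokens = 0
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":
                break
            if t1 is None:
                t1 = time.perf_counter()              # T1: first token received
            tokens += 1        # one SSE chunk ~ one token (approximation)
        t2 = time.perf_counter()                      # T2: response complete
        ttft_ms = (t1 - t0) * 1000
        decode_tok_s = tokens / (t2 - t1)             # decode only, never (T2 - T0)
        return ttft_ms, decode_tok_s

    # Step 5: warm-up run (~200-token prompt), response discarded.
    timed_run("warm-up " * 100)

    # Steps 6-8: three timed runs; report the median, watch the spread.
    runs = [timed_run("your 500-token benchmark prompt here") for _ in range(3)]
    rates = sorted(r[1] for r in runs)
    spread = (rates[-1] - rates[0]) / statistics.median(rates)
    print(f"median TTFT {statistics.median(r[0] for r in runs):.0f} ms, "
          f"median decode {statistics.median(rates):.1f} tok/s, "
          f"spread {spread:.0%}"
          + (" -- document why in your notes" if spread > 0.10 else ""))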
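For step 9, a minimal polling sketch, assuming an NVIDIA card with nvidia-smi on PATH and a single GPU (with multiple GPUs the lines interleave and need per-device parsing; rocm-smi users will need different parsing entirely). Run it alongside the benchmark and stop it with Ctrl-C once the run completes.

    import subprocess

    peak_mib = 0
    proc = subprocess.Popen(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "--loop=1"],
        stdout=subprocess.PIPE, text=True,
    )
    try:
        for line in proc.stdout:
            used = int(line.strip())
            if used > peak_mib:
                peak_mib = used
                print(f"new peak: {peak_mib} MiB")  # prefill spikes show up here
    except KeyboardInterrupt:
        pass
    finally:
        proc.terminate()
        print(f"high-water mark: {peak_mib} MiB")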

What to expect after you submit

Editorial review for community submissions takes 1–7 days. Your row enters the queue at status queued; once a reviewer works through it, the row transitions to approved-public (renders publicly with a community-submitted badge), to reproduced (the reviewer judged your run a clean reproduction of the original, and the original’s trust badge lifts a tier), or, in rare cases, to rejected with a one-line reason.

The verification policy at /resources/verification-policy documents the four states in detail and the criteria for each transition. The confidence engine at /resources/confidence-methodology documents how a successful reproduction propagates to the row’s public confidence tier.

When your numbers don’t match

Some divergence is expected and informative. The cleanest mental model (a small triage sketch follows the list):

  • Within 10% of the original. Your run reproduced. Submit it. Most clean reproductions land here — the same hardware + runtime + quant + model produces the same number within noise.
  • 10–25% divergence. Likely a real environmental difference. The usual suspects: a different driver branch, background workload competing for the GPU, thermal envelope (consumer cards throttle harder in warm rooms), BIOS power-limit setting. Submit anyway with the delta documented in the notes field; reviewers will decide whether it lifts the original to independently-reproduced or whether it should fragment into a fresh row.
  • More than 25% divergence. Something is genuinely different. Don’t submit through the reproduce flow. Submit a new benchmark with full context, and consider opening a feedback note against the original via /submit/feedback so reviewers can investigate whether the original is stale or wrong.
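The triage sketch below encodes the three bands. The thresholds come straight from this page; the function name and sample numbers are illustrative.

    def triage(original_tok_s: float, yours_tok_s: float) -> str:
        """Map relative divergence onto the three submission paths above."""
        delta = abs(yours_tok_s - original_tok_s) / original_tok_s
        if delta <= 0.10:
            return f"{delta:.0%} -- reproduced; submit via the reproduce flow"
        if delta <= 0.25:
            return f"{delta:.0%} -- environmental difference; submit with the delta documented"
        return f"{delta:.0%} -- genuinely different; submit a fresh row and consider /submit/feedback"

    print(triage(48.2, 44.9))  # 7% -> reproduced
    print(triage(48.2, 31.0))  # 36% -> fresh row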

Numbers that diverge by 50%+ usually indicate a mismatch in the matching set the submitter didn’t notice — different quant family, different instruct variant, a fresh-driver-vs-old-driver gap wide enough to count as a different configuration. Re-check the matching set before submitting.

Why the protocol is this strict

The trust-badge system is the spine of how the catalog earns operator trust. A visitor reading a benchmark page should be able to glance at the badge and know whether the number is a measurement, a single-source claim, or a cross-validated result. That signal only works if the words mean something — a row marked Reproduced has to be a row that someone actually reproduced, not a row that looked similar to a related result.

We err on the side of leaving rows at lower trust tiers rather than promoting them on weak evidence. A community submission that we publish as community-submitted is honest about what it is. A community submission promoted to reproduced on a sloppy match would be dishonest, and the trust-moat work is worthless if we cheat at the boundary.

Ready to submit?

Click into the public detail page of the benchmark you want to reproduce — the “Reproduce this benchmark” button on every approved row preserves the linkage automatically. Or, if you’re submitting a fresh benchmark with no original to anchor to, head straight to /submit/benchmark.

Frequently asked questions

Do I have to give my name? No. The default posture is anonymous. The submission form does not require name or email, and the IP-hash mechanism we use for rate-limiting is not the same as identifying you — we hash IPs on ingestion and never store the raw value. If you do want public credit, the optional submitter URL field renders your name as a link with rel="nofollow noopener" on the public detail page.
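For the curious, a minimal sketch of what salted IP hashing for rate-limiting can look like. The salt handling and hash choice are assumptions for illustration, not RunLocalAI's actual implementation; the point is that only the digest is ever stored.

    import hashlib
    import os

    SALT = os.environ["RATE_LIMIT_SALT"].encode()  # assumed server-side secret

    def rate_limit_key(ip: str) -> str:
        """Derive a stable, non-reversible bucket key; the raw IP is never stored."""
        return hashlib.sha256(SALT + ip.encode()).hexdigest()

    print(rate_limit_key("203.0.113.7")[:16])  # same IP -> same bucket key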

What happens to rejected submissions? They stay in the database with a moderation note explaining why, but they don’t render publicly. The audit trail protects against accusations of selective moderation. The most common rejection reasons are missing runtime metadata (a row a reproducer can’t match against), implausible numbers that exceed the hardware’s memory-bandwidth ceiling, and dishonest framing (claiming editorial measurement, misrepresenting concurrent-user counts). The full criteria are in the verification policy.

How long does the badge upgrade take? Once your reproduction is approved, the original benchmark’s public badge updates on the next page render — there’s no separate publish step. The confidence tier in the methodology engine may take longer to fully reflect the new state because some of the confidence factors (variance across multiple rows, runtime-version drift) need additional submissions to compute. The badge moves first; the confidence tier follows.

Adjacent reading: /resources/verification-policy for the four-state ladder, /resources/confidence-methodology for how successful reproductions propagate to confidence tiers, and /resources/scoring-methodology for the v17 catalog-score engine that sits alongside this trust layer.