Trust moat · Versioned benchmarking

Versioned benchmarking — runtime, driver, CUDA, and OS metadata

A benchmark without a runtime version is a measurement of a moving target. Two operators on the “same” setup — same GPU, same model, same quant — can report decode rates 30% apart simply because one is a few llama.cpp builds behind the other and the newer build re-tuned an attention path. This page documents the metadata fields we track, why they matter, and how the regression-candidate detector turns version churn into a useful signal instead of noise.

Editorial · Methodology · Operator-reviewed
By Fredoline Eruo · Last reviewed 2026-05-08

The five version fields we track

Every benchmark row carries up to five version fields beyond the obvious model and hardware. They’re collected at submission time, surfaced on the row itself, and consumed by the confidence engine and the regression-candidate detector. A code sketch of the field shapes follows the list.

  • Runtime version. The exact build of the inference engine — e.g. llama.cpp b3421, vLLM 0.6.4, ExLlamaV2 0.2.3, MLX-LM 0.18.0. Free-form text on the form; the moderation pass canonicalises it before publish so two reports of vLLM v0.6.4 and vllm 0.6.4 collapse to one bucket.
  • Driver version. NVIDIA (555.42), AMD (ROCm 6.2.1), or Apple (the macOS build for MLX runs, since the driver and the OS ship together on Apple Silicon).
  • CUDA / ROCm / Metal version. The compute stack the runtime was actually compiled against. This field decouples “driver” from “compute toolkit” — you can run the 555 driver against CUDA 12.4 binaries and CUDA 12.6 binaries, and the two produce measurably different numbers on some kernels.
  • OS label. Distribution + version, or macOS build, or the Windows feature update ring. The label is a string (Ubuntu 24.04, macOS 15.2, Windows 11 23H2), not a parsed kernel number, because the kernel number stops being a useful signal once you cross distros.
  • Model build identifier. The specific GGUF / GPTQ / AWQ artefact — ideally the SHA-256 of the file, or at minimum the publisher + filename. Two GGUFs with the same nominal name can differ by quantisation calibration; the build ID lets us tell.
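
To make the shape concrete, here is a minimal TypeScript sketch of how those five fields might sit on a row, together with the kind of canonicalisation pass described for the runtime field. The names (VersionMetadata, canonicaliseRuntime) and the regex are illustrative, not the catalog’s actual schema or moderation code.

```ts
// Illustrative shape for the five version fields on a benchmark row.
// Field and type names are hypothetical, not the catalog's real schema.
interface VersionMetadata {
  runtimeVersion?: string;  // e.g. "llama.cpp b3421", "vLLM 0.6.4"
  driverVersion?: string;   // e.g. "NVIDIA 555.42", "ROCm 6.2.1"
  computeToolkit?: string;  // e.g. "CUDA 12.6", "ROCm 6.2"
  osLabel?: string;         // e.g. "Ubuntu 24.04", "macOS 15.2"
  modelBuildId?: string;    // ideally a SHA-256, else publisher + filename
}

// One way to collapse free-form runtime strings into a single bucket so
// that "vLLM v0.6.4" and "vllm 0.6.4" compare equal after moderation.
function canonicaliseRuntime(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/\bv(?=\d)/g, "") // drop a standalone "v" prefix before digits
    .replace(/\s+/g, " ");     // normalise internal whitespace
}

// canonicaliseRuntime("vLLM v0.6.4") === canonicaliseRuntime("vllm 0.6.4") // true
```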

Why every field matters

Each field corresponds to a real source of measurement variance an operator might encounter when reproducing a benchmark on “the same” configuration.

Runtime version is the largest single mover in practice. The inference-engine ecosystem ships kernel rewrites every few weeks. vLLM’s 0.6 → 0.7 transition rewrote attention; llama.cpp ships flag-default changes that move steady-state decode by 10–20% on the same hardware; SGLang’s scheduler has had two major reworks in the last year. A benchmark on a six-month-old runtime is not directly comparable to one on the current runtime.

Driver version matters most on NVIDIA, where the major-branch jump (e.g. 535 → 555 → 560) sometimes brings a new CUDA driver runtime that downstream toolkits compile against. ROCm has the same problem on AMD: 6.0, 6.1, 6.2 are not equivalent for production inference. Metal/MLX is mostly OS-coupled, which is why the OS label carries the signal there.

CUDA / ROCm / Metal version shows up most often in anomalies: a row that’s 15% slower than the cluster median on an otherwise identical configuration often differs only in the toolkit version the runtime was compiled against. Without this field we can’t investigate.

OS label matters at the margins — kernel scheduler tweaks, default frequency-governor settings, IO scheduler defaults, and the small but real performance cost of Windows’s default GPU scheduling vs. Linux’s. We don’t weigh this heavily, but rows that disagree with the cluster on OS get an investigation flag.

Model build ID is the field most often left blank and most often the explanation for genuine outliers. Two GGUFs of Llama-3.1-8B-Q4_K_M from different uploaders can differ by 3–5% on quality benchmarks because the importance-matrix calibration differs. The numbers aren’t lying; they’re measuring different artefacts.
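
For submitters who want to fill that field, here is a hedged sketch of one way to compute a build ID locally, assuming Node’s built-in node:crypto and node:fs modules; the hashArtefact helper is hypothetical, not catalog tooling.

```ts
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Stream the artefact through SHA-256 so a multi-gigabyte GGUF never has
// to fit in memory. Illustrative helper, not the catalog's own tooling.
function hashArtefact(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

// hashArtefact("Llama-3.1-8B-Q4_K_M.gguf").then(console.log);
```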

How missing fields are handled

Many community submissions arrive with fields blank. We never invent a value. Instead the submission flows through three rules:

  • Blank fields are rendered as blank, not assumed. The detail page shows a dash (—) where a version is missing. The form encourages the submitter to fill it in but doesn’t block submission — getting the row queued matters more than having every field perfect on day one.
  • The confidence engine penalises blanks. Each missing version field reduces the row’s confidence score (see /resources/confidence-methodology). A row with all five version fields blank caps at low confidence regardless of how many reproductions it accumulates, because no reproducer can match the configuration cleanly.
  • Reproduction lets the row inherit metadata. When a reproducer publishes their own run with full version metadata and the numbers match the original within reasonable tolerance, the editorial pass can attach those versions to the original row as “observed configuration during reproduction.” The original submitter gets credit; the row gets fields it didn’t have before.
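
A rough sketch of the second rule, the blank-field penalty, as it might look in code. The VersionFields alias, the 5% per-field penalty, and the 0.4 cap are placeholders; the real weights live in the confidence methodology, not here.

```ts
// Minimal stand-in for the row's five version fields (see the earlier sketch).
type VersionFields = Partial<Record<
  "runtimeVersion" | "driverVersion" | "computeToolkit" | "osLabel" | "modelBuildId",
  string
>>;

const FIELD_KEYS = [
  "runtimeVersion", "driverVersion", "computeToolkit", "osLabel", "modelBuildId",
] as const;

// Placeholder penalty: each blank field shaves 5% off the row's confidence
// score, and a fully blank row is hard-capped at a "low confidence" ceiling
// of 0.4 no matter how many reproductions it accumulates. The real weights
// are documented in /resources/confidence-methodology.
function applyBlankFieldPenalty(baseScore: number, meta: VersionFields): number {
  const blanks = FIELD_KEYS.filter((k) => !(meta[k] ?? "").trim()).length;
  if (blanks === FIELD_KEYS.length) return Math.min(baseScore, 0.4);
  return baseScore * (1 - 0.05 * blanks);
}

// applyBlankFieldPenalty(0.9, { runtimeVersion: "llama.cpp b3421" }) -> ~0.72 (four blanks)
```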

Regression candidates — not confirmed regressions

The regression-candidate detector lives at src/lib/benchmarks/intelligence/regression.ts. It scans for a specific pattern: a benchmark configuration where rows on a newer runtime version are systematically slower than rows on an older one, controlling for hardware. When it finds one, it surfaces the configuration on the benchmark intelligence panel as a candidate, not a confirmed regression.

The distinction matters. A genuine performance regression in an inference engine is a real thing that happens, and surfacing it early is one of the most useful things this catalog can do for operators. But a single row that’s slower on a newer runtime is not a regression — it might be a different driver, a different CUDA build, a different model artefact, a different thermal envelope, or a benchmark that was simply less rigorous. The candidate label is an invitation to investigate, not a verdict.
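
The production scan in src/lib/benchmarks/intelligence/regression.ts is the source of truth; the sketch below only illustrates the grouping-and-compare pattern it looks for, with a placeholder row shape, a hypothetical findRegressionCandidates helper, an injected version comparator, and an arbitrary 10% threshold.

```ts
// Illustrative sketch of the candidate scan; row shape, threshold, and
// version comparator are placeholders, not the production detector.
interface BenchRow {
  hardware: string;        // e.g. "RTX 4090"
  model: string;           // model + quant, e.g. "Llama-3.1-8B Q4_K_M"
  runtimeVersion: string;  // canonicalised runtime build
  decodeTps: number;       // steady-state decode tokens/sec
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Group rows by (hardware, model) so the comparison controls for hardware,
// then flag any configuration where the newer runtime's median decode rate
// trails the older one's by more than a placeholder 10%.
function findRegressionCandidates(
  rows: BenchRow[],
  isNewer: (a: string, b: string) => boolean, // version ordering is runtime-specific
) {
  const byConfig = new Map<string, BenchRow[]>();
  for (const r of rows) {
    const key = `${r.hardware}::${r.model}`;
    byConfig.set(key, [...(byConfig.get(key) ?? []), r]);
  }

  const candidates: { config: string; older: string; newer: string; drop: number }[] = [];
  for (const [config, group] of byConfig) {
    const byVersion = new Map<string, number[]>();
    for (const r of group) {
      byVersion.set(r.runtimeVersion, [...(byVersion.get(r.runtimeVersion) ?? []), r.decodeTps]);
    }
    for (const [older, olderTps] of byVersion) {
      for (const [newer, newerTps] of byVersion) {
        if (!isNewer(newer, older)) continue;
        const drop = 1 - median(newerTps) / median(olderTps);
        if (drop > 0.1) candidates.push({ config, older, newer, drop });
      }
    }
  }
  return candidates;
}
```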

Two thresholds drive promotion from candidate to confirmed: first, three or more independent reports across two or more operators must agree that the newer version is slower at the same configuration; second, the gap must persist after controlling for the other version fields. Once both thresholds are met, editorial promotes the finding, and it appears on the runtime’s tool page with the affected version range.
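
Read as code, the two thresholds reduce to a small predicate. In this hedged sketch the “controlling for the other version fields” step is collapsed into a single boolean on each report, which is much cruder than the editorial check it stands in for.

```ts
// Hypothetical promotion check for the two thresholds above.
interface CandidateReport {
  operatorId: string;
  newerIsSlower: boolean;
  // true when the report compared the two runtime versions with driver,
  // toolkit, OS, and model build held constant
  otherFieldsHeldConstant: boolean;
}

function qualifiesForPromotion(reports: CandidateReport[]): boolean {
  const agreeing = reports.filter((r) => r.newerIsSlower && r.otherFieldsHeldConstant);
  const operators = new Set(agreeing.map((r) => r.operatorId));
  // Three or more independent reports across two or more operators.
  return agreeing.length >= 3 && operators.size >= 2;
}
```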

How version drift demotes confidence

Version drift is the gap between when a benchmark was measured and the current state of the runtime ecosystem. It enters the confidence engine through two channels.

First, an editorial annotation table records when a runtime ships a major version with kernel changes that genuinely move performance. When that annotation lands, all rows on the prior version drop a confidence tier automatically. We don’t retroactively republish the old numbers as “equivalent on the new version” — nobody’s actually re-run them yet, so the trust pill correctly demotes.

Second, the row’s age and the runtime’s release velocity combine. A 12-month-old benchmark on llama.cpp drifts differently from a 12-month-old benchmark on TensorRT-LLM, because the two runtimes ship at very different cadences. The confidence engine knows this, and the older the row gets relative to its runtime’s release pace, the more its tier slides.
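
A sketch of how those two channels might combine, assuming the four-tier ladder and purely placeholder thresholds; the demote and applyVersionDrift helpers and every number here are illustrative, not the engine’s actual parameters.

```ts
// Hypothetical drift demotion: channel 1 is the editorial annotation table,
// channel 2 is age measured against the runtime's release cadence.
type Tier = "high" | "medium" | "low" | "unverified";
const LADDER: Tier[] = ["high", "medium", "low", "unverified"];

function demote(tier: Tier, steps: number): Tier {
  return LADDER[Math.min(LADDER.indexOf(tier) + steps, LADDER.length - 1)];
}

interface DriftInput {
  tier: Tier;
  rowAgeMonths: number;
  // Median weeks between runtime releases: llama.cpp ships far faster than
  // TensorRT-LLM, so the same row age drifts differently.
  runtimeReleaseCadenceWeeks: number;
  // Channel 1: an editorial annotation recorded a kernel-changing major
  // release newer than this row's runtime version.
  annotatedMajorChangeSinceRow: boolean;
}

function applyVersionDrift(input: DriftInput): Tier {
  let tier = input.tier;
  if (input.annotatedMajorChangeSinceRow) tier = demote(tier, 1);
  // Channel 2: convert age to "release cycles missed" (placeholder thresholds).
  const cyclesMissed = (input.rowAgeMonths * 4.345) / input.runtimeReleaseCadenceWeeks;
  if (cyclesMissed > 12) tier = demote(tier, 2);
  else if (cyclesMissed > 6) tier = demote(tier, 1);
  return tier;
}
```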

What this methodology cannot tell you

Several real failure modes sit outside what version metadata can catch.

  • Compiler flags and build provenance. Two builds of llama.cpp at the same git tag can perform differently if one was built with a different compiler or with non-default architecture flags. We don’t collect this; the field would be filled in by approximately zero submitters and would just create more blank fields.
  • Thermal and power state. A GPU running at its thermal limit produces different numbers from the same GPU in a chilled rack. We capture nothing here. The reproduction guide calls out warmup and ambient conditions, but the row itself doesn’t encode them.
  • Background workload. The submitter’s machine running another VRAM tenant during the benchmark quietly sabotages the result. We can’t see that and can’t exclude it.
  • BIOS / firmware. Memory training, PCIe link state, NVIDIA Resizable BAR settings — all matter, none tracked. Editorial measurements record the BIOS version in our internal notes; we don’t expose the field on the public form because it’d be left blank 95% of the time and create false signal.

Adjacent reading

This page sits in a small constellation of trust documentation. The confidence methodology documents how missing fields, age, and version drift translate into the four-tier confidence ladder; the verification policy documents the discrete state machine for community submissions; the reproduction guide documents the operator protocol for promoting a row up that ladder, including which version fields a reproducer should record. The runtime health methodology is the sibling page for how we score the underlying engines that produce these versions in the first place.

Next recommended step

Confidence methodology (/resources/confidence-methodology): how missing version fields, age, and runtime drift translate into the four-tier confidence ladder.

Back to /resources. See also /editorial-policy and /changelog for any methodology revisions.