Benchmark cohort coverage
The intelligence graph compares your benchmark to its cohort: measurements with the same model, same hardware, same quant bucket, and same context bucket. Cohorts with fewer than 5 measurements can't produce confident outlier flags. This page shows which cohorts have signal and which are underpowered.
The cohorts ranked first below are the ones where each additional measurement counts most toward usable outlier detection. If you have the rig, the “Reproduce” link on each row prefills the submission form.
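As a rough illustration of the cohort definition above, the sketch below groups measurements by the four cohort fields and lists the cohorts that fall under the 5-measurement threshold. The record layout, the `CohortKey` type, and the function names are assumptions made for this example, not the site's actual schema.

```python
from collections import Counter
from typing import NamedTuple

# From the page: cohorts under 5 measurements can't produce confident outlier flags.
OUTLIER_THRESHOLD = 5


class CohortKey(NamedTuple):
    """Hypothetical cohort key: the four fields a cohort shares."""
    model: str
    hardware: str
    quant_bucket: str
    context_bucket: str


def cohort_sizes(measurements):
    """Count measurements per cohort.

    Each measurement is assumed to be a dict carrying the four key fields.
    """
    return Counter(
        CohortKey(m["model"], m["hardware"], m["quant_bucket"], m["context_bucket"])
        for m in measurements
    )


def underpowered(sizes):
    """Return only the cohorts that are below the outlier-detection threshold."""
    return {key: n for key, n in sizes.items() if n < OUTLIER_THRESHOLD}
```

A cohort with exactly 5 measurements is treated as having signal here, since the page only calls cohorts under 5 underpowered.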
Cohorts where one more measurement matters
Ranked: low / moderate confidence first, then proximity to the 5-row outlier-detection threshold, then recency. Each new measurement on any of these cohorts moves it closer to that threshold.
| Cohort | Confidence | Rows | Reproduced | Latest | Action |
|---|---|---|---|---|---|
| unknown · 16-32K | Low | 2 | 0 | 2026-05-06 | Reproduce → |
| 4-bit · 16-32K | Low | 2 | 0 | 2026-05-06 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 0 | 2026-05-13 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 0 | 2026-05-13 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 0 | 2026-05-13 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 0 | 2026-05-13 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 0 | 2026-05-13 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 0 | 2026-05-13 | Reproduce → |
| 4-bit · ≤4K | Low | 1 | 1 | 2026-05-11 | Reproduce → |
| 4-bit · 4-8K | Low | 1 | 0 | 2026-05-10 | Reproduce → |
| 4-bit · 16-32K | Low | 1 | 0 | 2026-05-06 | Reproduce → |
| 4-bit · 4-8K | Low | 1 | 0 | 2026-05-05 | Reproduce → |
| 4-bit · 4-8K | Low | 1 | 0 | 2026-05-04 | Reproduce → |
| 4-bit · 4-8K | Low | 1 | 0 | 2026-05-04 | Reproduce → |
| 4-bit · 16-32K | Low | 1 | 0 | 2026-05-03 | Reproduce → |
| 4-bit · 4-8K | Low | 1 | 0 | 2026-05-02 | Reproduce → |
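The ranking rule stated above the table can be sketched as a three-part sort key. This is a minimal sketch, not the site's implementation: the row layout and function names are hypothetical, and it assumes that "low / moderate first" treats those two tiers as a single leading group rather than ordering low ahead of moderate.

```python
from datetime import date

THRESHOLD = 5  # the 5-row outlier-detection threshold from the page
LEADING_TIERS = {"low", "moderate"}  # assumption: these two tiers share one group


def rank_key(row):
    """Sort key: confidence group, then gap to the threshold, then newest first."""
    group = 0 if row["confidence"] in LEADING_TIERS else 1
    gap = max(THRESHOLD - row["rows"], 0)  # rows still needed to reach the threshold
    recency = -row["latest"].toordinal()   # negated so later dates sort earlier
    return (group, gap, recency)


def rank_cohorts(rows):
    """Return cohort rows in display order without mutating the input list."""
    return sorted(rows, key=rank_key)
```

Under these assumptions a 2-row cohort sorts ahead of every 1-row cohort, and 1-row cohorts are ordered by their latest measurement date, which matches the order of the table above.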
How cohort confidence is derived
Cohort labels mirror the per-benchmark confidence engine: low / moderate / high / very-high. Never percentages.
- Very-high: ≥5 measurements + ≥2 reproductions.
- High: ≥5 measurements, fewer than 2 reproductions.
- Moderate: 3-4 measurements, below the outlier-detection threshold.
- Low: 1-2 measurements, single-source. The intelligence graph cannot draw conclusions.
A cohort last touched more than 18 months ago is demoted one tier, because runtimes and drivers have likely drifted since then. A cohort with only one runtime represented is called out: there is no runtime-drift signal until a second runtime lands.
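The tier rules and the staleness demotion above can be expressed as a small function. This is a sketch under stated assumptions, not the confidence engine itself: the function names are hypothetical, 18 months is approximated as 548 days, and a stale Low cohort is assumed to stay Low because there is no tier beneath it.

```python
from datetime import date, timedelta

TIERS = ["low", "moderate", "high", "very-high"]
STALE_AFTER = timedelta(days=548)  # assumption: 18 months approximated in days


def base_tier(measurements, reproductions):
    """Map measurement and reproduction counts to a tier, per the list above."""
    if measurements < 1:
        raise ValueError("a cohort needs at least one measurement")
    if measurements >= 5:
        return "very-high" if reproductions >= 2 else "high"
    if measurements >= 3:
        return "moderate"
    return "low"


def cohort_confidence(measurements, reproductions, last_touched, today):
    """Apply the one-tier demotion when the cohort was last touched too long ago."""
    tier = base_tier(measurements, reproductions)
    if today - last_touched > STALE_AFTER:
        # Demote one tier; "low" has nowhere lower to go (assumption).
        tier = TIERS[max(TIERS.index(tier) - 1, 0)]
    return tier


def has_runtime_drift_signal(runtimes):
    """True once at least two distinct runtimes are represented in the cohort."""
    return len(set(runtimes)) >= 2
```

For example, a cohort with 5 measurements and 2 reproductions that was last touched two years ago would come out as `"high"` rather than `"very-high"` under this sketch.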
Next recommended step
Editorially curated benchmark opportunities, ranked by impact.