Benchmark request policy
Anyone can ask for a specific (model × hardware × runtime) measurement that the catalog doesn’t yet cover. Operators with the gear can claim those asks and run them. Editorial moderates the whole loop. This page documents what happens at every stage — what gets accepted, what doesn’t, what a claim actually buys, and how privacy and moderation hold the market together.
What a request actually is
A benchmark request is a structured public ask for a specific measurement — usually one we don’t already have, or one an operator wants confirmed under different conditions. The form at /benchmarks/request takes a model, a hardware target, an optional runtime preference, an optional quant, and a one-line reason the measurement matters. That bundle is the request. Nothing more.
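For readers who think in schemas, the bundle can be pictured as a small record. This is a minimal sketch assuming hypothetical field names, not the site's actual schema:

```typescript
// Hypothetical shape of a benchmark request as described above.
// Field names are illustrative; runtime and quant are optional, per the form.
interface BenchmarkRequest {
  model: string;      // e.g. "llama-3.1-8b-instruct"
  hardware: string;   // e.g. "RTX 4090, 24 GB"
  runtime?: string;   // optional runtime preference, e.g. "llama.cpp"
  quant?: string;     // optional quant, e.g. "Q4_K_M"
  reason: string;     // one line on why the measurement matters
}
```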
Two other surfaces sit nearby and are easy to confuse with this one. /benchmarks/wanted is the editorial-curated wishlist — rows we’ve identified as cohort gaps from inside the catalog, not requests an outside operator filed. /submit/benchmark is what you use when you ran the measurement yourself and want to publish the result. The request market sits between these two: a public ask, filed by anyone, that may or may not eventually become a published measurement once an operator picks it up.
The model has a deliberate asymmetry. Filing a request costs almost nothing — a form, no account, no credit. Running the measurement costs real time and real hardware. The market works because that asymmetry is tolerable as long as editorial moderation keeps the request side honest enough that operators trust the queue is worth scanning.
The lifecycle — pending, accepted, claimed, measured
Every request lives in exactly one of four states. Each transition is logged in the editorial moderation history and is reversible only with a second-editor signoff, the same audit discipline the verification policy applies to community benchmarks.
Pending — submitted, not yet reviewed. The request is in the moderation queue and is not visible on the public market view. Pending requests stay private to editorial and to the submitter. Review window is one to seven days; we don’t promise a same-day turnaround because the queue is moderated by humans, not by a rule engine.
Accepted — editorial reviewed the request, judged it useful, and added it to the public market. State moves from pending to accepted and the row appears at /benchmarks/wanted. Three things acceptance does not mean. It does not mean we will run the measurement ourselves. It does not mean an operator will ever claim it. It does not promise a timeline. Acceptance is a statement that the request is well-formed and useful enough to be worth surfacing — nothing more.
Claimed — an operator (verified or anonymous) has signaled they intend to run the measurement. Claims are visible on the public market view as a softer signal than acceptance, with a claimant pseudonym if one was provided. Critically: a claim creates no public credit by itself. An operator who claims twenty requests and finishes none of them earns nothing. Only a successfully measured + moderated submission lands as a contribution. We hold this line because the alternative — crediting intent — would let coordinated parties farm reputation by claiming everything in sight.
Measured — a benchmark submission landed at /submit/benchmark and was linked to the request. The request lifecycle ends here; the measurement itself goes through normal community moderation, which means it lands as Community submitted by default and may be promoted to Reproduced or Independently reproduced via the same reproduction discipline that governs every other community benchmark on the site. Linking is automatic when the submitter uses the form’s “answers request” field; manual linkage by editorial covers the cases where they didn’t.
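Sketched as a transition table, for readers who prefer code. The state names follow the policy above; the table itself, and the assumption that a submission can land on an accepted request that was never claimed, are illustrative rather than a published spec:

```typescript
// Forward transitions only. Reversals require a second-editor signoff
// and are logged in the moderation history, per the policy above.
type RequestState = "pending" | "accepted" | "claimed" | "measured";

const TRANSITIONS: Record<RequestState, RequestState[]> = {
  pending: ["accepted"],             // rejection is private and leaves no public state
  accepted: ["claimed", "measured"], // assumption: a submission may land without a prior claim
  claimed: ["measured"],             // a claim by itself earns no credit
  measured: [],                      // lifecycle ends; normal community moderation takes over
};

function canTransition(from: RequestState, to: RequestState): boolean {
  return TRANSITIONS[from].includes(to);
}
```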
What does NOT get accepted
A request that fits any of the following patterns is rejected at moderation. Rejection is private — rejected requests do not appear on the public market and the submitter is not named.
- Duplicate of an existing benchmark. If we already publish a recent measurement for the same model + same hardware + same runtime + same quant, the request is closed with a pointer to the existing row. “Recent” means within the last twelve months and not flagged Stale.
- Impossible setup. Asking for a 70B-parameter model at FP16 on a 4GB GPU is not a measurement; it’s a memory-error report. The will-it-run engine catches the most obvious cases at submission time; reviewers catch the subtler ones. Combinations that are technically possible but require extreme tradeoffs (dense 70B on 24GB, partial offload to system RAM with significant degradation) get a clarifying note rather than an outright rejection. A rough feasibility sketch follows this list.
- Too vague. Requests must specify at minimum a model and a hardware target. “I want a Llama benchmark on a 3090” is too vague — which Llama, which size, which quant family. Editorial bumps requests like this back with a specific clarification ask before reconsidering.
- Already covered by an editorial benchmark. Editorial rows we measured ourselves satisfy any request for the same configuration; the request gets a pointer to the editorial row.
- Spam. Crypto-pump prompts, link-farm reasons, repeated submissions from the same IP hash with mechanical variance. Spam requests are rejected and the IP hash goes on the rate-limit watchlist.
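To make the impossible-setup line concrete, here is a back-of-the-envelope feasibility check in the spirit of the rule above. The byte-per-parameter figures and thresholds are illustrative assumptions; the actual will-it-run engine is more involved.

```typescript
// Rough weights-only memory estimate; KV cache and runtime overhead are ignored.
const BYTES_PER_PARAM: Record<string, number> = {
  fp16: 2,
  q8: 1,
  q4: 0.5,
};

function estimateWeightsGiB(paramsBillions: number, precision: string): number {
  return (paramsBillions * 1e9 * (BYTES_PER_PARAM[precision] ?? 2)) / 1024 ** 3;
}

// 70B at FP16 needs roughly 130 GiB for weights alone, so asking for it on a
// 4 GB GPU is rejected outright; tight-but-possible setups get a clarifying note.
function moderationHint(paramsBillions: number, precision: string, vramGiB: number): string {
  const need = estimateWeightsGiB(paramsBillions, precision);
  if (need > vramGiB * 8) return "reject: impossible setup";
  if (need > vramGiB) return "clarifying note: needs offload or heavier quantization";
  return "plausible";
}
```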
Privacy and rate-limiting
The request form takes an optional email so we can notify you when an accepted request is measured. Emails are never public, never sold, never used for marketing. They exist for transactional notifications only, and you can request deletion at any time via the contact page.
We hash submitter IPs at ingestion (a salted SHA-256, the salt rotated quarterly) and never store raw IPs. The hash exists to rate-limit submissions per source and to correlate spam campaigns; it cannot be reversed to an IP after the salt rotates. The hash is not exposed in any public surface.
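A minimal sketch of that ingestion-time hashing, assuming Node's built-in crypto module; salt storage, quarterly rotation, and the actual rate-limit window are elided:

```typescript
import { createHash } from "node:crypto";

// Salted SHA-256 of the submitter IP. The raw IP is never stored; the hash
// supports per-source rate limiting and spam correlation within a salt period.
function hashSubmitterIp(ip: string, quarterlySalt: string): string {
  return createHash("sha256").update(quarterlySalt + ip).digest("hex");
}

// Hypothetical per-source counter keyed by the hash, never by the raw IP.
const recentByHash = new Map<string, number>();

function allowSubmission(ip: string, salt: string, limit = 5): boolean {
  const key = hashSubmitterIp(ip, salt);
  const count = (recentByHash.get(key) ?? 0) + 1;
  recentByHash.set(key, count);
  return count <= limit;
}
```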
Pseudonyms are honored. If you submit as “@gpu-nerd” the public market shows that string and only that string. We don’t doxx submitters who later become well-known under a real name; the pseudonym stays unless the submitter asks us to change it. Rejected requests stay private to editorial — if your request was rejected as spam or as too vague, no public record exists.
How requests improve confidence
Each measurement that lands on a previously-requested setup does three things for the catalog’s confidence engine. It lifts cohort coverage on the relevant cohort coverage report — one fewer gap, one more data point. It can fill a competitor gap when the request was filed against a model + hardware combo that other catalogs haven’t measured. And it validates workflows when the request was tied to a stack-level question, the kind documented in the workflow validation methodology.
The full mechanics live in the confidence methodology document. The short version: a request that gets answered is worth more than a request that doesn’t, and a request that gets answered with a reproduction is worth more than one answered with a single submission. The trust ladder applies the same way it does everywhere else, because the request market is upstream of the same machinery that produces every other number on the site — documented at /trust/benchmarks.
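As a purely illustrative ordering (the real weighting lives in the confidence methodology and is not this simple), the relationship reads something like:

```typescript
// Illustrative only: relative value of a request by outcome, not real weights.
type RequestOutcome = "unanswered" | "answered" | "answered_and_reproduced";

const RELATIVE_VALUE: Record<RequestOutcome, number> = {
  unanswered: 0,                // an accepted request alone adds no confidence
  answered: 1,                  // a single Community submitted measurement
  answered_and_reproduced: 2,   // the same measurement moved up the trust ladder
};
```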
What this policy cannot do
Three honest acknowledgements about what the request market won’t deliver, no matter how clean the policy is.
It cannot guarantee any specific request will be measured. Acceptance is a statement that the request is useful, not that any operator will pick it up. Some popular configurations get claimed in hours. Some niche but legitimate ones sit accepted-but-unclaimed for months. We don’t promise a timeline because we don’t control the supply side. If your question is time-critical, the request market is the wrong tool — the right tool is renting the hardware in the cloud and running the measurement yourself, then submitting the result via /submit/benchmark.
It cannot pre-screen for measurement quality. Acceptance happens at the request stage, before any measurement exists. When the operator delivers, the resulting submission goes through the same moderation as any other community benchmark. A request being accepted does not put a thumb on the scale at submission review — if the numbers don’t hold up to the discipline rules in the reproduction guide, the submission lands at Community submitted at best, the same as any other.
It cannot fully prevent gaming. Coordinated parties can file legitimate-looking requests they intend to claim and publish themselves to manufacture coverage on a particular model. The IP-hash rate-limit catches the obvious cases; the moderation review catches more; but a sufficiently patient and distributed campaign can route around both. We accept this. The backstop is the same one the rest of the trust moat relies on — reproduction. A submission that no independent operator can reproduce never makes it past Community submitted regardless of how it got onto the queue, which limits what coordination can buy.
Adjacent reading
The request form is at /benchmarks/request; the public market view is at /benchmarks/wanted. The verification policy documents the four-state ladder for the resulting submissions, and the reproduction guide is the operator-side discipline for moving rows up that ladder. Trust transparency lives at /trust/benchmarks, and the confidence engine that ultimately weights every row is documented at /resources/confidence-methodology.
Next recommended step
Ask the catalog to surface a specific model × hardware × runtime measurement you need; the form is at /benchmarks/request.
Back to /resources. See also /editorial-policy for the broader editorial discipline this policy operates inside.