Methodology hub

Every methodology surface on RunLocalAI lives in one place: the rules behind every benchmark badge, every confidence tier, every regression candidate, and every reproduction request. We document the rules so you can disagree with them specifically, not vaguely.

Anything missing? Tell us at /submit/feedback; we publish methodology corrections.

Trust + scoring

How tok/s + VRAM scores combine with confidence tiers into the editorial verdict. The four-state verification ladder for community submissions.

Why confidence is a 4-tier vocabulary (low / moderate / high / very-high) and never a percentage, plus the derivedFrom rationale recorded on every row.
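
For illustration only, here is how that vocabulary sits on a row. derivedFrom is the documented field name; every other field is invented for the sketch.

```python
# Illustrative row shape. "derivedFrom" is the field named above;
# the other fields are invented for this sketch, not our actual schema.
row = {
    "tok_per_s": 42.3,         # invented value
    "confidence": "moderate",  # always one of the four tier words, never a percentage
    "derivedFrom": "owner run, not yet independently reproduced",
}
```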

What editorial does before a community submission goes public. What gets rejected. Why some submissions sit pending for days.

What the benchmark badges mean, how operator-grade scoring differs from synthetic vendor scoring, and where we admit uncertainty.

Benchmarks

How to reproduce an existing benchmark step-by-step so your run lifts the row's confidence tier.

Operator-grade capture protocol used on the 6 owned rigs: cold-start prep, standard prompt, 3-run median, sha256 reproducibility hash, mandatory environment metadata (driver + runtime + OS). The exact recipe behind every source=owner row.
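
A minimal sketch of that recipe in Python, assuming a placeholder run_inference() plus invented driver/runtime strings and field names, not our actual tooling or schema:

```python
# Minimal sketch of the owner-rig capture recipe. run_inference(),
# the driver/runtime strings, and the field names are placeholders.
import hashlib
import json
import platform
import random
import statistics

STANDARD_PROMPT = "The standard prompt text goes here."  # fixed across all owner runs

def run_inference(prompt: str) -> float:
    # Placeholder: swap in a real cold-start run against your runtime
    # and return the measured tok/s.
    return 40.0 + random.random()

def capture_row() -> dict:
    # Three cold-start runs; the median damps one-off outliers.
    samples = [run_inference(STANDARD_PROMPT) for _ in range(3)]
    row = {
        "source": "owner",
        "tok_per_s": statistics.median(samples),
        "samples": samples,
        "env": {  # mandatory metadata: capture real values, not labels
            "os": platform.platform(),
            "driver": "NVIDIA 560.35.03",        # e.g. read from nvidia-smi
            "runtime": "llama.cpp build b1234",  # exact build, not just the name
        },
    }
    # Fingerprint of exactly what was measured and under which environment,
    # so a reviewer can check the row was not edited after capture.
    canonical = json.dumps(row, sort_keys=True).encode()
    row["sha256"] = hashlib.sha256(canonical).hexdigest()
    return row

print(capture_row())
```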

10-item printable checklist: what to measure, what NOT to claim, sampler pinning, version capture, and reproducibility hygiene.

How /benchmarks/regressions reaches 'possible regression' / 'possible improvement' / 'insufficient'. Threshold rules + known noise sources.
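
The verdict logic has roughly this shape; the 5% band and 3-sample minimum below are invented for the sketch, not the documented thresholds:

```python
# Illustrative shape of a threshold rule. The 5% band and the 3-sample
# minimum are invented here; the linked page documents the real
# thresholds and the known noise sources they account for.
import statistics

def classify_change(baseline: list[float], candidate: list[float],
                    threshold: float = 0.05, min_samples: int = 3) -> str:
    # Too few samples on either side: refuse to call it either way.
    if len(baseline) < min_samples or len(candidate) < min_samples:
        return "insufficient"
    base = statistics.median(baseline)
    delta = (statistics.median(candidate) - base) / base
    if delta <= -threshold:   # tok/s dropped past the noise band
        return "possible regression"
    if delta >= threshold:    # tok/s rose past the noise band
        return "possible improvement"
    # Inside the noise band: not enough evidence for either verdict.
    return "insufficient"

print(classify_change([41.0, 40.2, 40.8], [37.1, 36.8, 37.4]))  # possible regression
```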

8-stage flow: request submitted → accepted → claimed → submitted → moderated → approved-public → reproduced → independently-reproduced.
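
Modeled as a state machine, the flow looks roughly like this; the one-step-forward transition rule is an illustrative guess, not the site's actual moderation machinery:

```python
# The 8-stage flow as an ordered list. The one-step-forward rule is an
# illustrative guess at the transition logic.
STAGES = [
    "request-submitted",
    "accepted",
    "claimed",
    "submitted",
    "moderated",
    "approved-public",
    "reproduced",
    "independently-reproduced",
]

def can_advance(current: str, target: str) -> bool:
    # A request moves exactly one stage forward at a time.
    return STAGES.index(target) == STAGES.index(current) + 1

assert can_advance("claimed", "submitted")
assert not can_advance("accepted", "moderated")
```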

How distinct-operator + distinct-hardware reproductions lift confidence from moderate → high → very-high.
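
A sketch of that promotion rule, assuming each qualifying reproduction lifts a row exactly one tier; the linked page documents the real counting rules:

```python
# Assumes each qualifying reproduction (distinct operator AND distinct
# hardware) lifts a row exactly one tier, capped at the top tier.
TIERS = ["low", "moderate", "high", "very-high"]

def promoted_tier(current: str, qualifying_reproductions: int) -> str:
    idx = min(TIERS.index(current) + qualifying_reproductions, len(TIERS) - 1)
    return TIERS[idx]

assert promoted_tier("moderate", 1) == "high"
assert promoted_tier("moderate", 2) == "very-high"
```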

Editorial policy

What we accept on /benchmarks/request, what we reject, what counts as an editorial-priority signal.

Single-table view of every major runtime across 13 operational dimensions (maintenance burden, reproducibility, lock-in, OS support, etc.).

Who writes here, how named-author bylines work, and what 'editorial' versus 'community' provenance means on a benchmark row.

How operator profiles get the 'verified-owner' signal. Why we don't accept self-attested verification.

Why methodology lives in one place

Editorial credibility lives or dies on whether you can defend every score with a specific rule. We publish the rules so a knowledgeable operator can disagree with this rule on this page rather than handwaving "your scoring seems off."

None of these methodologies are settled forever. We update them when contributors point out flaws. Every change ships in the changelog.