Community submitted · Eval-harness moderated · Editorial review: 1-7 days

Submit an evaluation

Submit a reproducible lm-evaluation-harness score for a local model + runtime + hardware combination. This form is distinct from /submit/benchmark, which collects tok/s and VRAM measurements.

Read the benchmark methodology checklist before submitting. Reproducibility is the design point.

What we accept

Standard tasks: MMLU, HellaSwag, ARC-Challenge, GSM8K, HumanEval, TruthfulQA. Other tasks will be reviewed editorially.

Local runners only: vLLM, llama.cpp, Ollama, MLX, SGLang, ExLlamaV2 with TabbyAPI. NOT cloud APIs. NOT closed weights.

Required metadata: exact command line, lm-evaluation-harness commit hash, runtime version, driver, quantization, context length. Raw harness output JSON preserved verbatim.
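
A minimal sketch of a submission-ready run, assuming the vLLM backend; the model, task, and memory settings below are illustrative, not requirements:

    # Pin the harness to an exact commit so task semantics are reproducible.
    git clone https://github.com/EleutherAI/lm-evaluation-harness
    cd lm-evaluation-harness && pip install -e .
    git rev-parse HEAD   # include this hash in your submission

    # Example run; adjust model, task, and settings to your hardware.
    lm_eval --model vllm \
      --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=auto,gpu_memory_utilization=0.90 \
      --tasks gsm8k \
      --batch_size auto \
      --output_path results/ \
      --log_samples

Submit the command exactly as you ran it, plus the hash from git rev-parse HEAD.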

What we don't accept

Rejected submissions stay private to editorial.

  • Submissions without the exact command line — reproducibility is impossible without it.
  • Submissions without the lm-evaluation-harness commit hash — task semantics drift across major versions.
  • Submissions where the eval was run via a hosted API. We evaluate LOCAL runtimes only.
  • Open-ended generation evals (chat-arena style). The judge drifts; the gaming surface is obvious. Deferred indefinitely.
  • Submissions that mix eval scores with tok/s benchmarks into one number. Use /submit/benchmark for throughput; this form for correctness.
Submission notes

Raw harness output must parse as JSON. We preserve it verbatim and never mutate it.

Version metadata is strongly encouraged; it affects your submission's confidence tier.
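
A quick parse check before submitting saves a review round-trip; the path below is a placeholder for wherever your harness run wrote its output:

    # Fails loudly if the file is not valid JSON.
    python -m json.tool results/your_harness_output.json > /dev/null && echo "valid JSON"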

Privacy

Email is optional. It is used only for moderator follow-up and to notify you when your submission is reviewed. Email never renders publicly.

We hash your IP for rate-limiting (3 submissions per hour). Daily salt rotation. Raw IPs never persisted.
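
For the curious, a minimal sketch of daily-salted IP hashing; SECRET and CLIENT_IP are hypothetical names, and this is illustrative, not our production code:

    # Day-scoped salt: secret + UTC date, so hashes can't be linked across days.
    salt=$(printf '%s%s' "$SECRET" "$(date -u +%Y-%m-%d)" | sha256sum | awk '{print $1}')
    # Only this digest is stored for the hourly counter; the raw IP is discarded.
    printf '%s%s' "$salt" "$CLIENT_IP" | sha256sum | awk '{print $1}'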