For submitting a reproducible lm-evaluation-harness score on a local model + runtime + hardware combination. Distinct from /submit/benchmark, which covers tok/s + VRAM measurements.
Read the benchmark methodology checklist before submitting. Reproducibility is the design point.
Standard tasks: MMLU, HellaSwag, ARC-Challenge, GSM8K, HumanEval, TruthfulQA. Other tasks will be reviewed editorially.
Local runners only: vLLM, llama.cpp, Ollama, MLX, SGLang, ExLlamaV2 with TabbyAPI. NOT cloud APIs. NOT closed weights.
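The sketch below shows the minimal shape of a qualifying run: one standard task against locally held weights, with the raw output kept for the submission. It assumes a recent lm-evaluation-harness that exposes `lm_eval.simple_evaluate`; the model path and task choice are illustrative placeholders, not requirements.

```python
# Minimal sketch of a qualifying run, assuming a recent
# lm-evaluation-harness that exposes lm_eval.simple_evaluate.
# The model path is a hypothetical placeholder.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",                                    # an accepted local runner
    model_args="pretrained=/models/my-local-model",  # local weights, not a cloud API
    tasks=["hellaswag"],                             # one task from the standard list
)

# Keep the raw harness output verbatim; it is part of the submission.
with open("raw_harness_output.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
```

Running the `lm_eval` CLI instead is equally fine; either way, what you submit must be the exact invocation you ran.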
Required metadata: exact command line, lm-evaluation-harness commit hash, runtime version, driver version, quantization, context length. Raw harness output JSON preserved verbatim.
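As a concrete illustration, here is one way to assemble that metadata before filling in the form. The field names are hypothetical, not the endpoint's actual schema, and the placeholder values must be replaced with what you actually ran.

```python
# Illustrative metadata bundle for a submission. Field names are
# hypothetical, not the endpoint's actual schema.
import json
import subprocess

metadata = {
    # The exact command line, copied verbatim from your shell history.
    "command": "lm_eval --model vllm --model_args pretrained=/models/my-local-model --tasks hellaswag",
    # Commit hash of the harness checkout that produced the score
    # (assumes a local clone at ./lm-evaluation-harness).
    "harness_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd="lm-evaluation-harness", text=True
    ).strip(),
    "runtime_version": "vllm 0.x.y",  # placeholder: report your real version
    "driver_version": "550.xx",       # placeholder: GPU driver as installed
    "quantization": "none",           # e.g. none, Q4_K_M, AWQ, GPTQ
    "context_length": 4096,
}

with open("submission_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```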
Rejected submissions stay private to editorial.
Email is optional. Used only for moderator follow-up and to notify you when your submission is reviewed. Email never renders publicly.
We hash your IP for rate-limiting (3 submissions per hour). Daily salt rotation. Raw IPs never persisted.
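For the curious, the scheme amounts to something like the sketch below: a keyed hash of the IP under a salt that lives for one day and is never written to disk. This is a hypothetical illustration, not our production code.

```python
# Hypothetical sketch of salted IP hashing with daily rotation.
# Names and storage choices are illustrative only.
import datetime
import hashlib
import hmac
import secrets

_salts: dict[str, bytes] = {}  # in-memory only; raw IPs and old salts never persist

def _daily_salt() -> bytes:
    day = datetime.date.today().isoformat()
    if day not in _salts:
        _salts.clear()  # discard yesterday's salt outright
        _salts[day] = secrets.token_bytes(32)
    return _salts[day]

def rate_limit_key(ip: str) -> str:
    # HMAC rather than a bare hash, so the key cannot be reversed by
    # enumerating the IPv4 space without knowing the salt.
    return hmac.new(_daily_salt(), ip.encode(), hashlib.sha256).hexdigest()
```

The resulting key indexes a counter capped at 3 submissions per hour; because the salt rotates, keys from different days cannot be joined.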