For submitting a reproducible lm-evaluation-harness score on a local model + runtime + hardware combination. Distinct from /submit/benchmark, which covers tok/s + VRAM measurements.
Read the benchmark methodology checklist before submitting. Reproducibility is the design point.
Standard tasks: MMLU, HellaSwag, ARC-Challenge, GSM8K, HumanEval, TruthfulQA. Other tasks will be reviewed editorially.
Local runners only: vLLM, llama.cpp, Ollama, MLX, SGLang, ExLlamaV2 with TabbyAPI. NOT cloud APIs. NOT closed weights.
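The sketch below shows the minimal shape of a qualifying run: one standard task against locally held weights, with the raw output kept for the submission. It assumes a recent lm-evaluation-harness that exposes `lm_eval.simple_evaluate`; the model path and task choice are illustrative placeholders, not requirements.

```python
# Minimal sketch of a qualifying run, assuming a recent
# lm-evaluation-harness that exposes lm_eval.simple_evaluate.
# The model path is a hypothetical placeholder.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",                                    # an accepted local runner
    model_args="pretrained=/models/my-local-model",  # local weights, not a cloud API
    tasks=["hellaswag"],                             # one task from the standard list
)

# Keep the raw harness output verbatim; it is part of the submission.
with open("raw_harness_output.json", "w") as f:
    json.dump(results, f, indent=2, default=str)
```

Running the `lm_eval` CLI instead is equally fine; either way, what you submit must be the exact invocation you ran.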
Required metadata: exact command line, lm-evaluation-harness commit hash, runtime version, driver version, quantization, context length. Raw harness output JSON preserved verbatim.
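As a concrete illustration, here is one way to assemble that metadata before filling in the form. The field names are hypothetical, not the endpoint's actual schema, and the placeholder values must be replaced with what you actually ran.

```python
# Illustrative metadata bundle for a submission. Field names are
# hypothetical, not the endpoint's actual schema.
import json
import subprocess

metadata = {
    # The exact command line, copied verbatim from your shell history.
    "command": "lm_eval --model vllm --model_args pretrained=/models/my-local-model --tasks hellaswag",
    # Commit hash of the harness checkout that produced the score
    # (assumes a local clone at ./lm-evaluation-harness).
    "harness_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd="lm-evaluation-harness", text=True
    ).strip(),
    "runtime_version": "vllm 0.x.y",  # placeholder: report your real version
    "driver_version": "550.xx",       # placeholder: GPU driver as installed
    "quantization": "none",           # e.g. none, Q4_K_M, AWQ, GPTQ
    "context_length": 4096,
}

with open("submission_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```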
Rejected submissions stay private to editorial.
Email is optional. Used only for moderator follow-up and to notify you when your submission is reviewed. Email never renders publicly.
We hash your IP for rate-limiting (3 submissions per hour). Daily salt rotation. Raw IPs never persisted.
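For the curious, the scheme amounts to something like the sketch below: a keyed hash of the IP under a salt that lives for one day and is never written to disk. This is a hypothetical illustration, not our production code.

```python
# Hypothetical sketch of salted IP hashing with daily rotation.
# Names and storage choices are illustrative only.
import datetime
import hashlib
import hmac
import secrets

_salts: dict[str, bytes] = {}  # in-memory only; raw IPs and old salts never persist

def _daily_salt() -> bytes:
    day = datetime.date.today().isoformat()
    if day not in _salts:
        _salts.clear()  # discard yesterday's salt outright
        _salts[day] = secrets.token_bytes(32)
    return _salts[day]

def rate_limit_key(ip: str) -> str:
    # HMAC rather than a bare hash, so the key cannot be reversed by
    # enumerating the IPv4 space without knowing the salt.
    return hmac.new(_daily_salt(), ip.encode(), hashlib.sha256).hexdigest()
```

The resulting key indexes a counter capped at 3 submissions per hour; because the salt rotates, keys from different days cannot be joined.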