BLK · CODING BENCHMARKSfirst-party · per-quant · reproducible

Coding benchmark leaderboard

Reviewed HumanEval+ scores for open-weight local models at real-world quantizations. First-party rows were run by RunLocalAI; community rows render only after review. Every public row links to its raw test-run Gist.

Test runs
9
Unique models
5
Quant variants
1
Benchmark suites
2
METHODOLOGY

How these numbers are produced

  1. Each model loads into the listed runtime (Ollama / vLLM / llama.cpp) at the listed quantization on the listed hardware.
  2. EvalPlus is used as the official scorer. For OpenAI-compatible local runtimes, we first generate deterministic JSONL samples, then score them with python -m evalplus.evaluate.
  3. Greedy sampling (deterministic; one sample per problem). Score = pass@1 on HumanEval+ (164 problems with augmented tests) × 100.
  4. Raw stdout + stderr captured and posted to a public GitHub Gist before the row enters this table. Click any score to verify.
  5. Runner source: scripts/run-humaneval-plus.ts (public repo) — reproducible end-to-end.
#ModelBenchmarkQuantRigScoreTrustLog
1Qwen 2.5 Coder 7B Instruct
tested 2026-05-28
HumanEval+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
81.1
First-party
command published
Gist →
2Phi-4 14B
tested 2026-05-28
HumanEval+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
78.7
First-party
command published
Gist →
3Trendyol LLM Asure 12B
tested 2026-05-27
MBPP+
Q4_K_M
ollama-0.24.0
rtx-5080
71.7
First-party
command published
Gist →
4Trendyol LLM Asure 12B
tested 2026-05-27
HumanEval+
Q4_K_M
ollama-0.24.0
rtx-5080
69.5
First-party
command published
Gist →
5Qwen 2.5 Coder 7B Instruct
tested 2026-05-29
MBPP+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
66.9
First-party
command published
Gist →
6Phi-4 14B
tested 2026-05-29
MBPP+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
60.3
First-party
command published
Gist →
7Llama 3.1 8B Instruct
tested 2026-05-28
HumanEval+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
56.1
First-party
command published
Gist →
8Llama 3.1 8B Instruct
tested 2026-05-29
MBPP+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
39.2
First-party
command published
Gist →
9Qwen 3 8B
tested 2026-05-29
HumanEval+
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
2.4
First-party
command published
Gist →

A note on quant comparisons. The same model at different quantization levels can score materially differently — especially on code, where precision loss hurts. When you see Q4_K_M next to Q6_K for the same model, the gap is the “quant tax” on coding specifically. Take that into account when picking what to run on /will-it-run.