BLK · CODING BENCHMARKSfirst-party · per-quant · reproducible
Coding benchmark leaderboard
Reviewed HumanEval+ scores for open-weight local models at real-world quantizations. First-party rows were run by RunLocalAI; community rows render only after review. Every public row links to its raw test-run Gist.
Test runs
9
Unique models
5
Quant variants
1
Benchmark suites
2
METHODOLOGY
How these numbers are produced
- Each model loads into the listed runtime (Ollama / vLLM / llama.cpp) at the listed quantization on the listed hardware.
- EvalPlus is used as the official scorer. For OpenAI-compatible local runtimes, we first generate deterministic JSONL samples, then score them with
python -m evalplus.evaluate. - Greedy sampling (deterministic; one sample per problem). Score = pass@1 on HumanEval+ (164 problems with augmented tests) × 100.
- Raw stdout + stderr captured and posted to a public GitHub Gist before the row enters this table. Click any score to verify.
- Runner source:
scripts/run-humaneval-plus.ts(public repo) — reproducible end-to-end.
| # | Model | Benchmark | Quant | Rig | Score | Trust | Log |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 2.5 Coder 7B Instruct tested 2026-05-28 | HumanEval+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 81.1 | First-party command published | Gist → |
| 2 | Phi-4 14B tested 2026-05-28 | HumanEval+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 78.7 | First-party command published | Gist → |
| 3 | Trendyol LLM Asure 12B tested 2026-05-27 | MBPP+ | Q4_K_M | ollama-0.24.0 rtx-5080 | 71.7 | First-party command published | Gist → |
| 4 | Trendyol LLM Asure 12B tested 2026-05-27 | HumanEval+ | Q4_K_M | ollama-0.24.0 rtx-5080 | 69.5 | First-party command published | Gist → |
| 5 | Qwen 2.5 Coder 7B Instruct tested 2026-05-29 | MBPP+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 66.9 | First-party command published | Gist → |
| 6 | Phi-4 14B tested 2026-05-29 | MBPP+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 60.3 | First-party command published | Gist → |
| 7 | Llama 3.1 8B Instruct tested 2026-05-28 | HumanEval+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 56.1 | First-party command published | Gist → |
| 8 | Llama 3.1 8B Instruct tested 2026-05-29 | MBPP+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 39.2 | First-party command published | Gist → |
| 9 | Qwen 3 8B tested 2026-05-29 | HumanEval+ | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 2.4 | First-party command published | Gist → |
A note on quant comparisons. The same model at different quantization levels can score materially differently — especially on code, where precision loss hurts. When you see Q4_K_M next to Q6_K for the same model, the gap is the “quant tax” on coding specifically. Take that into account when picking what to run on /will-it-run.