BLK · CODING BENCHMARKSfirst-party · per-quant · reproducible

Coding benchmark leaderboard

Reviewed HumanEval+ scores for open-weight local models at real-world quantizations. First-party rows were run by RunLocalAI; community rows render only after review. Every public row links to its raw test-run Gist.

Test runs

Unique models

Quant variants

Benchmark suites

METHODOLOGY

How these numbers are produced

Each model loads into the listed runtime (Ollama / vLLM / llama.cpp) at the listed quantization on the listed hardware.
EvalPlus is used as the official scorer. For OpenAI-compatible local runtimes, we first generate deterministic JSONL samples, then score them with python -m evalplus.evaluate.
Greedy sampling (deterministic; one sample per problem). Score = pass@1 on HumanEval+ (164 problems with augmented tests) × 100.
Raw stdout + stderr captured and posted to a public GitHub Gist before the row enters this table. Click any score to verify.
Runner source: scripts/run-humaneval-plus.ts (public repo) — reproducible end-to-end.

#	Model	Benchmark	Quant	Rig	Score	Trust	Log
1	Qwen 2.5 Coder 7B Instruct tested 2026-05-28	HumanEval+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	81.1	First-party command published	Gist →
2	Phi-4 14B tested 2026-05-28	HumanEval+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	78.7	First-party command published	Gist →
3	Trendyol LLM Asure 12B tested 2026-05-27	MBPP+	Q4_K_M	ollama-0.24.0 rtx-5080	71.7	First-party command published	Gist →
4	Trendyol LLM Asure 12B tested 2026-05-27	HumanEval+	Q4_K_M	ollama-0.24.0 rtx-5080	69.5	First-party command published	Gist →
5	Qwen 2.5 Coder 7B Instruct tested 2026-05-29	MBPP+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	66.9	First-party command published	Gist →
6	Phi-4 14B tested 2026-05-29	MBPP+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	60.3	First-party command published	Gist →
7	Llama 3.1 8B Instruct tested 2026-05-28	HumanEval+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	56.1	First-party command published	Gist →
8	Llama 3.1 8B Instruct tested 2026-05-29	MBPP+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	39.2	First-party command published	Gist →
9	Qwen 3 8B tested 2026-05-29	HumanEval+	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	2.4	First-party command published	Gist →

A note on quant comparisons. The same model at different quantization levels can score materially differently — especially on code, where precision loss hurts. When you see Q4_K_M next to Q6_K for the same model, the gap is the “quant tax” on coding specifically. Take that into account when picking what to run on /will-it-run.