BLK · QUALITY LEADERBOARDS
reviewed · per-quant · raw logs

Quality benchmarks

Reviewed quality benchmark scores for open-weight local models. First-party rows are produced on RunLocalAI hardware; community rows render only after review. Each public score carries a stated quantization, runtime, hardware target, and raw-log Gist.

By trust tier: first-party: 13 · verified: 0 · community: 0 · pending: 0 · rejected: 0

BENCHMARK · HUMANEVAL-PLUS

HumanEval+ (EvalPlus)

164 Python coding problems from EvalPlus, which extends the original HumanEval benchmark with augmented test suites (roughly 80 times as many tests as the original). Measures functional code gener

5 test runs

Coding

#	Model	Quant	Rig	Score	Trust	Log
1	Qwen 2.5 Coder 7B Instruct	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	81.1/100	First-party · command published	Gist →
2	Phi-4 14B	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	78.7/100	First-party · command published	Gist →
3	Trendyol LLM Asure 12B	Q4_K_M	ollama-0.24.0 rtx-5080	69.5/100	First-party · command published	Gist →
4	Llama 3.1 8B Instruct	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	56.1/100	First-party · command published	Gist →
5	Qwen 3 8B	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	2.4/100	First-party · command published	Gist →
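
As a rough sanity check on how a score out of 100 relates to the 164 problems, the sketch below assumes each score is the percentage of problems whose generated solution passes every test, rounded to one decimal place. That scoring rule is an assumption, not something this page states, and `score_from_passes` is a hypothetical helper. The pass counts are illustrative values chosen because they reproduce scores shown in the table; they are not taken from the raw logs.

```python
def score_from_passes(passed: int, total: int = 164) -> float:
    """Percentage of problems solved, rounded to one decimal place.

    Assumption: the leaderboard score is a plain pass percentage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100 * passed / total, 1)


# Hypothetical pass counts (not read from any Gist log).
print(score_from_passes(133))  # 81.1
print(score_from_passes(4))    # 2.4
```

Under this assumption, one additional solved problem moves a HumanEval+ score by about 0.6 points, so small gaps between adjacent rows may correspond to only a handful of problems.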

Full HumanEval+ (EvalPlus) leaderboard →

BENCHMARK · MBPP-PLUS

MBPP+ (EvalPlus)

EvalPlus's augmented MBPP suite: introductory Python programming tasks with stronger hidden tests than the original MBPP benchmark. Measures short-form functional code generation q

4 test runs

Coding

#	Model	Quant	Rig	Score	Trust	Log
1	Trendyol LLM Asure 12B	Q4_K_M	ollama-0.24.0 rtx-5080	71.7/100	First-party · command published	Gist →
2	Qwen 2.5 Coder 7B Instruct	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	66.9/100	First-party · command published	Gist →
3	Phi-4 14B	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	60.3/100	First-party · command published	Gist →
4	Llama 3.1 8B Instruct	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	39.2/100	First-party · command published	Gist →

Full MBPP+ (EvalPlus) leaderboard →

BENCHMARK · TURKISH-MMLU-GENERATIVE

TurkishMMLU (Generative)

Turkish-translated Massive Multitask Language Understanding benchmark: 900 questions across 9 subjects (Biology, Chemistry, Geography, History, Mathematics, Philosophy, Physics, Re

4 test runs

Turkish

General knowledge

Multilingual

#	Model	Quant	Rig	Score	Trust	Log
1	Trendyol LLM Asure 12B	Q4_K_M	ollama-0.24.0 rtx-5080	58.9/100	First-party · command published	Gist →
2	Llama 3.2 3B Instruct	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	11.4/100	First-party · command published	Gist →
3	Turkish Llama 8B Instruct v0.1	Q4_K_M	ollama-0.24 rtx-3080	11.0/100	First-party · command published	Gist →
4	Turkish Llama 8B Instruct v0.1	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	11.0/100	First-party · command published	Gist →

Full TurkishMMLU (Generative) leaderboard →

HOW SCORES EARN PUBLIC RENDER

The trust gate

Every row must link to a public Gist with the raw stdout + stderr of the run. No Gist, no public render.
Each row shows its trust tier: first-party (we ran it), verified (a community submission reviewed with a named verifier, timestamp, raw log, and reproduction command), or pending (submitted, awaiting review, and hidden from public leaderboards).
Reproduction commands are published so that a third party can replicate a run independently. The full methodology lives at /benchmarks/methodology.
Each model + quant + runtime + hardware tuple is unique. Anonymous submissions cannot overwrite an existing public row; reviewers decide whether a new run replaces or supplements the record.
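
The rules above can be sketched as a small gate function. This is a minimal illustration under stated assumptions, not the site's actual implementation: the `Submission` type, its field names, `PUBLIC_TIERS`, `renders_publicly`, and `publish` are all hypothetical, and the Gist URL is a placeholder. It also assumes that only first-party and verified rows render publicly, and that any collision on the uniqueness tuple is left for a reviewer rather than resolved automatically.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumption: only these tiers are shown on public leaderboards.
PUBLIC_TIERS = {"first-party", "verified"}


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for one benchmark run."""
    model: str
    quant: str
    runtime: str
    hardware: str
    score: float
    tier: str
    gist_url: Optional[str] = None  # raw stdout + stderr of the run

    @property
    def key(self) -> Tuple[str, str, str, str]:
        # The model + quant + runtime + hardware tuple must be unique.
        return (self.model, self.quant, self.runtime, self.hardware)


def renders_publicly(sub: Submission) -> bool:
    """No Gist, no public render; hidden tiers never render."""
    return bool(sub.gist_url) and sub.tier in PUBLIC_TIERS


def publish(board: Dict[tuple, Submission], sub: Submission) -> bool:
    """Add a row if it passes the gate and its key is free.

    Returns False when the row is rejected or when the key already
    exists; in the latter case a reviewer would decide whether the
    new run replaces or supplements the record.
    """
    if not renders_publicly(sub):
        return False
    if sub.key in board:
        return False
    board[sub.key] = sub
    return True


board: Dict[tuple, Submission] = {}
row = Submission(
    model="Phi-4 14B",
    quant="Q4_K_M",
    runtime="ollama-0.24",
    hardware="rtx-3080-16gb-mobile",
    score=78.7,
    tier="first-party",
    gist_url="https://example.invalid/gist",  # placeholder, not a real log
)
print(publish(board, row))  # True: passes the gate, key is new
print(publish(board, row))  # False: same tuple already on the board
```

Keying the board on the four-field tuple means the same model on different hardware, such as the two Turkish Llama 8B rows, occupies separate entries.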