Reviewed quality benchmark scores for open-weight local models. First-party rows are produced on RunLocalAI hardware; community rows render only after review. Each public score carries a stated quantization, runtime, hardware target, and raw-log Gist.
164 Python coding problems with augmented test suites (~80 hidden tests per problem) from the EvalPlus extension of the original HumanEval benchmark. Measures functional code gener
| # | Model | Quant | Rig | Score | Trust | Log |
|---|---|---|---|---|---|---|
| 1 | Qwen 2.5 Coder 7B Instruct | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 81.1/100 | First-party command published | Gist → |
| 2 | Phi-4 14B | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 78.7/100 | First-party command published | Gist → |
| 3 | Trendyol LLM Asure 12B | Q4_K_M | ollama-0.24.0 rtx-5080 | 69.5/100 | First-party command published | Gist → |
| 4 | Llama 3.1 8B Instruct | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 56.1/100 | First-party command published | Gist → |
| 5 | Qwen 3 8B | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 2.4/100 | First-party command published | Gist → |
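As a rough reading aid, the scores above can be converted into approximate solved-problem counts. This sketch assumes (the page does not state it) that a score is the percentage of the 164 problems for which the generated solution passes every test; the names `approx_solved` and `score_from_solved` are hypothetical helpers, not part of EvalPlus or RunLocalAI.

```python
# Problem count taken from the benchmark description (164 Python problems).
HUMANEVAL_PLUS_PROBLEMS = 164


def approx_solved(score_pct: float, total: int = HUMANEVAL_PLUS_PROBLEMS) -> int:
    """Approximate number of solved problems.

    ASSUMPTION: score_pct == 100 * solved / total, rounded to one decimal.
    """
    return round(score_pct / 100 * total)


def score_from_solved(solved: int, total: int = HUMANEVAL_PLUS_PROBLEMS) -> float:
    """Inverse of approx_solved: percentage score rounded to one decimal."""
    return round(100 * solved / total, 1)


if __name__ == "__main__":
    # Scores copied from the table above.
    for score in (81.1, 78.7, 69.5, 56.1, 2.4):
        solved = approx_solved(score)
        print(f"{score:>5}/100 ~ {solved}/{HUMANEVAL_PLUS_PROBLEMS} problems")
```

Under that assumption, every score in the table maps back to a whole number of problems, which is a quick sanity check when comparing a published score against the raw log.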
EvalPlus's augmented MBPP suite: introductory Python programming tasks with stronger hidden tests than the original MBPP benchmark. Measures short-form functional code generation q
| # | Model | Quant | Rig | Score | Trust | Log |
|---|---|---|---|---|---|---|
| 1 | Trendyol LLM Asure 12B | Q4_K_M | ollama-0.24.0 rtx-5080 | 71.7/100 | First-party command published | Gist → |
| 2 | Qwen 2.5 Coder 7B Instruct | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 66.9/100 | First-party command published | Gist → |
| 3 | Phi-4 14B | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 60.3/100 | First-party command published | Gist → |
| 4 | Llama 3.1 8B Instruct | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 39.2/100 | First-party command published | Gist → |
Turkish-translated Massive Multitask Language Understanding benchmark: 900 questions across 9 subjects (Biology, Chemistry, Geography, History, Mathematics, Philosophy, Physics, Re
| # | Model | Quant | Rig | Score | Trust | Log |
|---|---|---|---|---|---|---|
| 1 | Trendyol LLM Asure 12B | Q4_K_M | ollama-0.24.0 rtx-5080 | 58.9/100 | First-party command published | Gist → |
| 2 | Llama 3.2 3B Instruct | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 11.4/100 | First-party command published | Gist → |
| 3 | Turkish Llama 8B Instruct v0.1 | Q4_K_M | ollama-0.24 rtx-3080 | 11.0/100 | First-party command published | Gist → |
| 4 | Turkish Llama 8B Instruct v0.1 | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 11.0/100 | First-party command published | Gist → |
Trust levels: first-party (we ran it); verified (community submission reviewed, with a named verifier, timestamp, raw log, and reproduction command); pending (submitted but awaiting review, and hidden from public leaderboards).
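The three trust states described above could be modeled roughly as follows. This is a minimal sketch, not the site's actual implementation: the `Trust` enum, the `ScoreRow` fields, and the `public_rows` function are all hypothetical names chosen to mirror the table columns, and the log URLs in the usage example are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    FIRST_PARTY = "first-party"  # run by the site itself
    VERIFIED = "verified"        # community submission that passed review
    PENDING = "pending"          # submitted, awaiting review, not shown publicly


@dataclass(frozen=True)
class ScoreRow:
    # Hypothetical record shape mirroring the table columns.
    model: str
    quant: str
    rig: str        # runtime plus hardware target
    score: float    # out of 100
    trust: Trust
    log_url: str    # link to the raw-log Gist


def public_rows(rows: list[ScoreRow]) -> list[ScoreRow]:
    """Hide pending rows and rank the rest by score, highest first."""
    visible = [r for r in rows if r.trust is not Trust.PENDING]
    return sorted(visible, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    rows = [
        ScoreRow("Llama 3.1 8B Instruct", "Q4_K_M",
                 "ollama-0.24 rtx-3080-16gb-mobile", 56.1,
                 Trust.FIRST_PARTY, "<placeholder>"),
        ScoreRow("Example community model", "Q4_K_M",
                 "example-rig", 99.0, Trust.PENDING, "<placeholder>"),
        ScoreRow("Phi-4 14B", "Q4_K_M",
                 "ollama-0.24 rtx-3080-16gb-mobile", 78.7,
                 Trust.FIRST_PARTY, "<placeholder>"),
    ]
    for rank, row in enumerate(public_rows(rows), start=1):
        print(rank, row.model, f"{row.score}/100", row.trust.value)
```

Filtering before ranking means a pending submission never affects the public ordering, however high its claimed score.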