Turkish-translated Massive Multitask Language Understanding benchmark: 900 questions across 9 subjects (Biology, Chemistry, Geography, History, Mathematics, Philosophy, Physics, Religion & Ethics, Turkish Language & Literature). We use the generative variant: the model emits a letter (A-E) as text and we parse it. Comparable to other generative MMLU runs.
How we run TurkishMMLU at runlocalai
Why generative, not loglikelihood: The traditional lm-evaluation-harness approach uses loglikelihood scoring, which requires the inference backend to return token logprobs. Ollama's chat/completions endpoint does not return logprobs, so we use a generative letter-pick approach. Generative MMLU scores typically land within ~3pp of loglikelihood MMLU for well-instructed models.
Runner source: scripts/run-turkish-mmlu.ts and scripts/turkish_mmlu_generative.py in the public repo.
Score is overall accuracy percentage across all 900 questions. Random baseline is 20% (5 answer choices). Strong models score 40-60%; weak/broken setups score below 20% (suggests context overflow or chat template mismatch).
| # | Model | Quant | Rig | Score | Trust | Log |
|---|---|---|---|---|---|---|
| 1 | Trendyol LLM Asure 12B 11.8B · gemma | Q4_K_M | ollama-0.24.0 rtx-5080 | 58.9 | First-party runlocalai command published | Gist → |
| 2 | Llama 3.2 3B Instruct 3B · llama | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 11.4 | First-party runlocalai command published | Gist → |
| 3 | Turkish Llama 8B Instruct v0.1 8B · llama | Q4_K_M | ollama-0.24 rtx-3080 | 11.0 | First-party runlocalai command published | Gist → |
| 4 | Turkish Llama 8B Instruct v0.1 8B · llama | Q4_K_M | ollama-0.24 rtx-3080-16gb-mobile | 11.0 | First-party runlocalai command published | Gist → |
| Model | Biology | History | Physics | Chemistry | Geography | Philosophy | Mathematics | Religion and Ethics | Turkish Language and Literature |
|---|---|---|---|---|---|---|---|---|---|
| Trendyol LLM Asure 12B (Q4_K_M) | 56 | 65 | 37 | 47 | 70 | 84 | 35 | 83 | 53 |
| Llama 3.2 3B Instruct (Q4_K_M) | 8 | 16 | 12 | 6 | 7 | 15 | 11 | 6 | 22 |
| Turkish Llama 8B Instruct v0.1 (Q4_K_M) | 13 | 11 | 11 | 6 | 11 | 15 | 10 | 10 | 12 |
First-party text-only TurkishMMLU generative run on local Ollama tag alibayram/Trendyol-LLM-Asure-12B:latest. Source model card: alibayram/Trendyol-LLM-Asure-12B; local GGUF source: alibayram/Trendyol-LLM-Asure-12B-Q4_K_M-GGUF. Hardware: RTX 5080 16GB, NVIDIA driver 595.97.
English-trained 3B baseline comparison vs Turkish-specialized 8B. Run on RTX 3080 Laptop 16GB, num_ctx=8192. Expected to score near random (20%) or below since model has no Turkish specialization.
Baseline run on Ollama 0.24 with default 2048 context window. Score is below the 20% random-guess baseline — strong indicator that 5-shot Turkish prompts (which average ~2000 tokens due to morphology) were silently truncated by Ollama. Re-run with --num-ctx 8192 expected to land 30-45%. Published as-is so the methodology improvement is measurable; this row is intentionally NOT promoted to 'verified'.
Re-run on RTX 3080 Laptop (16 GB) with `num_ctx=8192` to test the earlier hypothesis that the prior 11% score was caused by Ollama's default 2048-context window truncating 5-shot Turkish prompts. The re-run **landed at the same 11.00%**, ruling out the truncation hypothesis. The honest reading: Turkish-Llama-8B-Instruct-v0.1 was trained as a **Turkish conversational** model, not a multi-choice reasoning model. It speaks Turkish fluently but underperforms even the 20% random-guess baseline on TurkishMMLU's scientific/historical/literary subjects. Per-subject results: Biology 13%, Chemistry 6%, Geography 15% est., History 11%, Mathematics 12-15% est., Philosophy 15%, Physics 11%, Religious Culture & Ethics 10%, Turkish Language & Literature 12%. Use this model for chat/customer-service Turkish, not for structured Q&A. Higher-knowledge Turkish models (Trendyol Asure 12B at 58.89%) are the right anchor for general-knowledge use cases.