TurkishMMLU (Generative)

Turkish-translated Massive Multitask Language Understanding benchmark: 900 questions across 9 subjects (Biology, Chemistry, Geography, History, Mathematics, Philosophy, Physics, Religion & Ethics, Turkish Language & Literature). We use the generative variant: the model emits a letter (A-E) as text and we parse it. Comparable to other generative MMLU runs.

Test runs

Metric

accuracy

Range

/100

Best score

58.9

+Submit a score Runner script →Original source →Vendor leaderboard →JSON →

METHODOLOGY

How we run TurkishMMLU at runlocalai

Load the AYueksel/TurkishMMLU dataset from HuggingFace.
For each of 9 subtasks, build 5-shot prompts using the dev split as exemplars.
Send each test question via the OpenAI-compatible chat completions endpoint (Ollama / vLLM / etc.).
Parse the model's response for the first standalone letter A-E.
Aggregate accuracy per subtask and overall.

Why generative, not loglikelihood: The traditional lm-evaluation-harness approach uses loglikelihood scoring, which requires the inference backend to return token logprobs. Ollama's chat/completions endpoint does not return logprobs, so we use a generative letter-pick approach. Generative MMLU scores typically land within ~3pp of loglikelihood MMLU for well-instructed models.

Runner source: scripts/run-turkish-mmlu.ts and scripts/turkish_mmlu_generative.py in the public repo.

HOW TO READ THE SCORE

Score is overall accuracy percentage across all 900 questions. Random baseline is 20% (5 answer choices). Strong models score 40-60%; weak/broken setups score below 20% (suggests context overflow or chat template mismatch).

LEADERBOARD

Public reviewed runs, ranked

4 rows

#	Model	Quant	Rig	Score	Trust	Log
1	Trendyol LLM Asure 12B 11.8B · gemma	Q4_K_M	ollama-0.24.0 rtx-5080	58.9	First-party runlocalai command published	Gist →
2	Llama 3.2 3B Instruct 3B · llama	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	11.4	First-party runlocalai command published	Gist →
3	Turkish Llama 8B Instruct v0.1 8B · llama	Q4_K_M	ollama-0.24 rtx-3080	11.0	First-party runlocalai command published	Gist →
4	Turkish Llama 8B Instruct v0.1 8B · llama	Q4_K_M	ollama-0.24 rtx-3080-16gb-mobile	11.0	First-party runlocalai command published	Gist →

SUB-TASK BREAKDOWN

Per-subject accuracy

Model	Biology	History	Physics	Chemistry	Geography	Philosophy	Mathematics	Religion and Ethics	Turkish Language and Literature
Trendyol LLM Asure 12B (Q4_K_M)	56	65	37	47	70	84	35	83	53
Llama 3.2 3B Instruct (Q4_K_M)	8	16	12	6	7	15	11	6	22
Turkish Llama 8B Instruct v0.1 (Q4_K_M)	13	11	11	6	11	15	10	10	12

OPERATOR NOTES

Trendyol LLM Asure 12B · Q4_K_M · ollama-0.24.0 · rtx-5080

First-party text-only TurkishMMLU generative run on local Ollama tag alibayram/Trendyol-LLM-Asure-12B:latest. Source model card: alibayram/Trendyol-LLM-Asure-12B; local GGUF source: alibayram/Trendyol-LLM-Asure-12B-Q4_K_M-GGUF. Hardware: RTX 5080 16GB, NVIDIA driver 595.97.

Llama 3.2 3B Instruct · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

English-trained 3B baseline comparison vs Turkish-specialized 8B. Run on RTX 3080 Laptop 16GB, num_ctx=8192. Expected to score near random (20%) or below since model has no Turkish specialization.

Turkish Llama 8B Instruct v0.1 · Q4_K_M · ollama-0.24 · rtx-3080

Baseline run on Ollama 0.24 with default 2048 context window. Score is below the 20% random-guess baseline — strong indicator that 5-shot Turkish prompts (which average ~2000 tokens due to morphology) were silently truncated by Ollama. Re-run with --num-ctx 8192 expected to land 30-45%. Published as-is so the methodology improvement is measurable; this row is intentionally NOT promoted to 'verified'.

Turkish Llama 8B Instruct v0.1 · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

Re-run on RTX 3080 Laptop (16 GB) with `num_ctx=8192` to test the earlier hypothesis that the prior 11% score was caused by Ollama's default 2048-context window truncating 5-shot Turkish prompts. The re-run **landed at the same 11.00%**, ruling out the truncation hypothesis. The honest reading: Turkish-Llama-8B-Instruct-v0.1 was trained as a **Turkish conversational** model, not a multi-choice reasoning model. It speaks Turkish fluently but underperforms even the 20% random-guess baseline on TurkishMMLU's scientific/historical/literary subjects. Per-subject results: Biology 13%, Chemistry 6%, Geography 15% est., History 11%, Mathematics 12-15% est., Philosophy 15%, Physics 11%, Religious Culture & Ethics 10%, Turkish Language & Literature 12%. Use this model for chat/customer-service Turkish, not for structured Q&A. Higher-knowledge Turkish models (Trendyol Asure 12B at 58.89%) are the right anchor for general-knowledge use cases.