RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
/benchmarks/quality/turkish-mmlu-generative
BENCHMARK · TURKISH-MMLU-GENERATIVE
Turkish
General knowledge
Multilingual

TurkishMMLU (Generative)

Turkish-translated Massive Multitask Language Understanding benchmark: 900 questions across 9 subjects (Biology, Chemistry, Geography, History, Mathematics, Philosophy, Physics, Religion & Ethics, Turkish Language & Literature). We use the generative variant: the model emits a letter (A-E) as text and we parse it. Comparable to other generative MMLU runs.

Test runs
4
Metric
accuracy
Range
/100
Best score
58.9
+Submit a scoreRunner script →Original source →Vendor leaderboard →JSON →
METHODOLOGY

How we run TurkishMMLU at runlocalai

  1. Load the AYueksel/TurkishMMLU dataset from HuggingFace.
  2. For each of 9 subtasks, build 5-shot prompts using the dev split as exemplars.
  3. Send each test question via the OpenAI-compatible chat completions endpoint (Ollama / vLLM / etc.).
  4. Parse the model's response for the first standalone letter A-E.
  5. Aggregate accuracy per subtask and overall.

Why generative, not loglikelihood: The traditional lm-evaluation-harness approach uses loglikelihood scoring, which requires the inference backend to return token logprobs. Ollama's chat/completions endpoint does not return logprobs, so we use a generative letter-pick approach. Generative MMLU scores typically land within ~3pp of loglikelihood MMLU for well-instructed models.

Runner source: scripts/run-turkish-mmlu.ts and scripts/turkish_mmlu_generative.py in the public repo.

HOW TO READ THE SCORE

Score is overall accuracy percentage across all 900 questions. Random baseline is 20% (5 answer choices). Strong models score 40-60%; weak/broken setups score below 20% (suggests context overflow or chat template mismatch).

LEADERBOARD

Public reviewed runs, ranked

4 rows
#ModelQuantRigScoreTrustLog
1Trendyol LLM Asure 12B
11.8B · gemma
Q4_K_M
ollama-0.24.0
rtx-5080
58.9
First-party
runlocalai
command published
Gist →
2Llama 3.2 3B Instruct
3B · llama
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
11.4
First-party
runlocalai
command published
Gist →
3Turkish Llama 8B Instruct v0.1
8B · llama
Q4_K_M
ollama-0.24
rtx-3080
11.0
First-party
runlocalai
command published
Gist →
4Turkish Llama 8B Instruct v0.1
8B · llama
Q4_K_M
ollama-0.24
rtx-3080-16gb-mobile
11.0
First-party
runlocalai
command published
Gist →
SUB-TASK BREAKDOWN

Per-subject accuracy

ModelBiologyHistoryPhysicsChemistryGeographyPhilosophyMathematicsReligion and EthicsTurkish Language and Literature
Trendyol LLM Asure 12B (Q4_K_M)566537477084358353
Llama 3.2 3B Instruct (Q4_K_M)81612671511622
Turkish Llama 8B Instruct v0.1 (Q4_K_M)13111161115101012
OPERATOR NOTES
Trendyol LLM Asure 12B · Q4_K_M · ollama-0.24.0 · rtx-5080

First-party text-only TurkishMMLU generative run on local Ollama tag alibayram/Trendyol-LLM-Asure-12B:latest. Source model card: alibayram/Trendyol-LLM-Asure-12B; local GGUF source: alibayram/Trendyol-LLM-Asure-12B-Q4_K_M-GGUF. Hardware: RTX 5080 16GB, NVIDIA driver 595.97.

Llama 3.2 3B Instruct · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

English-trained 3B baseline comparison vs Turkish-specialized 8B. Run on RTX 3080 Laptop 16GB, num_ctx=8192. Expected to score near random (20%) or below since model has no Turkish specialization.

Turkish Llama 8B Instruct v0.1 · Q4_K_M · ollama-0.24 · rtx-3080

Baseline run on Ollama 0.24 with default 2048 context window. Score is below the 20% random-guess baseline — strong indicator that 5-shot Turkish prompts (which average ~2000 tokens due to morphology) were silently truncated by Ollama. Re-run with --num-ctx 8192 expected to land 30-45%. Published as-is so the methodology improvement is measurable; this row is intentionally NOT promoted to 'verified'.

Turkish Llama 8B Instruct v0.1 · Q4_K_M · ollama-0.24 · rtx-3080-16gb-mobile

Re-run on RTX 3080 Laptop (16 GB) with `num_ctx=8192` to test the earlier hypothesis that the prior 11% score was caused by Ollama's default 2048-context window truncating 5-shot Turkish prompts. The re-run **landed at the same 11.00%**, ruling out the truncation hypothesis. The honest reading: Turkish-Llama-8B-Instruct-v0.1 was trained as a **Turkish conversational** model, not a multi-choice reasoning model. It speaks Turkish fluently but underperforms even the 20% random-guess baseline on TurkishMMLU's scientific/historical/literary subjects. Per-subject results: Biology 13%, Chemistry 6%, Geography 15% est., History 11%, Mathematics 12-15% est., Philosophy 15%, Physics 11%, Religious Culture & Ethics 10%, Turkish Language & Literature 12%. Use this model for chat/customer-service Turkish, not for structured Q&A. Higher-knowledge Turkish models (Trendyol Asure 12B at 58.89%) are the right anchor for general-knowledge use cases.