Pollux Judge 32B
A 32B judge model built to score other LLMs' Russian-language outputs. Give it an instruction, a response, and a rubric — it returns a numeric score plus a written rationale. Built on T-pro-it-1.0 and trained entirely on synthetic POLLUX dataset data.
If you're running Russian-language model evals and need a local, auditable judge, this is a credible option: MIT license, structured output, no API dependency. The single-criterion-per-run constraint is a real workflow cost, though: evaluating on multiple axes means a separate judge run for each axis. At 32B it's not cheap to run, and 1,010 downloads with 5 likes suggest limited community validation so far. Worth testing if you have a clear rubric workflow; skip if you need flexible or multi-dimensional scoring out of the box.
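To make the workflow cost concrete, here is a minimal sketch of a multi-axis evaluation done as one judge call per criterion. It assumes a hypothetical `judge` callable that wraps however you serve the model and returns a score plus rationale; `Verdict`, `evaluate_on_axes`, and `stub_judge` are illustrative names, not part of the model's API, and the model's real prompt template is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    score: float
    rationale: str


# Hypothetical signature: (instruction, response, criterion_name, rubric) -> Verdict
JudgeFn = Callable[[str, str, str, str], Verdict]


def evaluate_on_axes(
    judge: JudgeFn,
    instruction: str,
    response: str,
    rubrics: Dict[str, str],
) -> Dict[str, Verdict]:
    """Run one judge call per criterion, never several criteria in one call."""
    results: Dict[str, Verdict] = {}
    for name, rubric in rubrics.items():
        results[name] = judge(instruction, response, name, rubric)
    return results


def stub_judge(instruction: str, response: str, criterion: str, rubric: str) -> Verdict:
    """Stand-in for a real model call so the sketch runs without a GPU."""
    return Verdict(score=1.0, rationale=f"stub verdict for {criterion}")


if __name__ == "__main__":
    rubrics = {
        "fluency": "0 = unreadable, 2 = natural Russian",
        "factuality": "0 = wrong, 2 = fully correct",
    }
    verdicts = evaluate_on_axes(stub_judge, "Question", "Answer", rubrics)
    for name, verdict in verdicts.items():
        print(name, verdict.score, verdict.rationale)
```

With N criteria the loop issues N judge calls, so evaluation time scales linearly with the number of axes you score.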
Why this rating
Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.25/10. License is explicit MIT on the HF card and commercial use is correctly flagged. Metadata aligns with the card: 32B params, Russian, finetuned from T-pro-it-1.0 (Qwen2 family — 'other' is acceptable since it's a judge derivative). Description and verdict are honest, operator-voiced, and call out real constraints (single-criterion-per-run, synthetic training data, 4096 ctx tightness, weak community signal). Use case is sharply scoped to Russian LLM eval pipelines, which is a narrow but legitimate niche for local-first ops teams. Minor concern: context length of 4096 isn't directly verified in the excerpt shown but is plausible for a T-pro-it-1.0 derivative — worth a second check.
Flags:
- contextLength 4096 not explicitly confirmed in README excerpt; verify against base model T-pro-it-1.0 config
- Niche brand fit: Russian-only judge has limited audience for runlocalai's primarily English-speaking operator base
Strengths
- MIT license, commercial use permitted
- Returns both a numeric score and a text rationale in one pass
- Structured rubric input keeps scoring consistent across runs
- 32B scale gives it headroom for nuanced Russian-language judgment
Weaknesses
- One criterion per run only: putting several criteria into a single prompt is unsupported, and the results are unpredictable
- You must supply explicit criteria; the model will not choose its own
- Trained entirely on synthetic data, which may not reflect messy real-world responses
- 4096-token context is tight for long response evaluation tasks
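A quick pre-flight check can catch evaluation items that will not fit the context window before you spend a judge run on them. The sketch below is a rough budget check under stated assumptions: the 4096 limit is the figure quoted on this page, the three-characters-per-token ratio is a placeholder guess rather than a measured value, and `fits_in_context` and `rough_token_count` are hypothetical helper names. Pass the real tokenizer's counting function as `count_tokens` for accurate results.

```python
from typing import Callable

CONTEXT_LIMIT = 4096  # figure quoted on this page, pending verification


def rough_token_count(text: str) -> int:
    """Crude placeholder: assume about 3 characters per token (an assumption)."""
    return (len(text) + 2) // 3


def fits_in_context(
    instruction: str,
    response: str,
    rubric: str,
    reserve_for_output: int = 512,
    count_tokens: Callable[[str], int] = rough_token_count,
    limit: int = CONTEXT_LIMIT,
) -> bool:
    """Return True if the three inputs plus an output reserve fit the limit.

    Prompt-template overhead is not counted; raise reserve_for_output to cover it.
    """
    used = sum(count_tokens(part) for part in (instruction, response, rubric))
    return used + reserve_for_output <= limit


if __name__ == "__main__":
    print(fits_in_context("a" * 300, "b" * 300, "c" * 300))
    print(fits_in_context("", "x" * 12000, ""))
```

Items that fail the check can be truncated, split, or routed to manual review instead of being silently cut off by the runtime.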
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 17.6 GB | 23 GB |
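The table's numbers can be sanity-checked with simple arithmetic. This sketch assumes exactly 32 billion parameters and decimal gigabytes (10^9 bytes), neither of which the page states; the function names are illustrative.

```python
PARAMS = 32e9            # assumption: exactly 32 billion parameters
FILE_SIZE_GB = 17.6      # Q4_K_M file size from the table
VRAM_REQUIRED_GB = 23.0  # Q4_K_M VRAM requirement from the table


def bits_per_weight(file_size_gb: float, params: float) -> float:
    """Average storage per parameter implied by the file size."""
    return file_size_gb * 1e9 * 8 / params


def fits(gpu_vram_gb: float, required_gb: float = VRAM_REQUIRED_GB) -> bool:
    """True if a card's VRAM meets the listed requirement."""
    return gpu_vram_gb >= required_gb


print(round(bits_per_weight(FILE_SIZE_GB, PARAMS), 2))  # 4.4
print(round(VRAM_REQUIRED_GB - FILE_SIZE_GB, 1))        # 5.4
print(fits(24.0), fits(16.0))                           # True False
```

Under those assumptions the file works out to about 4.4 bits per weight, and the listed requirement leaves roughly 5.4 GB beyond the weights themselves. The page does not say what that margin covers.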
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
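What's the minimum VRAM to run Pollux Judge 32B?
About 23 GB for Q4_K_M, the only quantization listed here.
Can I use Pollux Judge 32B commercially?
Yes. The model is released under the MIT license, which permits commercial use.
What's the context length of Pollux Judge 32B?
4096 tokens as listed here, though that figure has not been confirmed against the base model's config.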
Source: huggingface.co/ai-forever/pollux-judge-32b
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.