Capability notes
Machine translation quality is evaluated through **automated metrics** (BLEU, COMET, chrF) scored against reference translations, and **human evaluation** measuring adequacy and fluency. LLM-based translation using instruction-tuned models has closed the gap with specialized neural machine translation (NMT) on high-resource pairs (English↔French/German/Spanish/Chinese) and often exceeds NMT on low-resource pairs.
**BLEU scores** (0-100): [Llama 3.3 70B](/models/llama-3-3-70b) achieves 35-40 English→German, [Aya Expanse 32B](/models/aya-expanse-32b) 33-38, Google Translate 38-43, DeepL 40-45. The ~5-point gap on European pairs is noticeable — human evaluators prefer the commercial output roughly 60-70% of the time in A/B tests. For low-resource pairs (English→Swahili, English→Bengali), Aya Expanse often outperforms Google Translate because it was trained on 100+ languages, including pairs commercial APIs neglect.
**COMET scores** (0-1; correlates with human judgment better than BLEU): Aya Expanse 32B scores 0.82-0.88 on WMT test sets for high-resource pairs vs DeepL's 0.85-0.92. The gap narrows to 0.01-0.03 for mid-resource pairs (English↔Czech, English↔Turkish).
**Model selection**: [Aya Expanse 32B](/models/aya-expanse-32b) is the best general-purpose open-weight multilingual translator — 100+ languages with instruction-following, including formality control. [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) handles complex translation + localization (idioms, cultural references, marketing copy). [Llama 3.3 70B](/models/llama-3-3-70b) is strong on Western European pairs but weaker on Asian and African languages. [DeepSeek V4](/models/deepseek-v4) with reasoning mode suits ambiguous source text (legal, philosophical).
**Specialized NMT vs LLM**: NMT models (Argos Translate, OPUS-MT, Meta's NLLB-200 at 3.3B params) are 10-100× smaller than LLMs and run on CPU with 100-1000× the throughput of LLM inference. For constrained-domain translation (technical manuals, medical reports), specialized NMT matches or exceeds LLM quality at a fraction of the compute. NLLB-200 is the reference NMT baseline for 200 languages.
If you just want to try this
Lowest-friction path to a working setup.
Pull [Aya Expanse 32B](/models/aya-expanse-32b) on [Ollama](/tools/ollama) (`ollama pull aya-expanse:32b`). Best open-weight multilingual instruction model for straightforward translation — covers 100+ languages, understands translation-specific instructions ("translate to informal Japanese"), and fits consumer hardware. At Q4, ~20 GB VRAM — an [RTX 3090](/hardware/rtx-3090) or [RTX 4090](/hardware/rtx-4090) handles it. At Q2-Q3, fits 16 GB cards like [RTX 5070 Ti](/hardware/rtx-5070-ti).
Prompt format: "Translate the following [source language] text to [target language]. Preserve formatting, proper nouns, and technical terms. If a term has no direct translation, keep the original and add [explanation in brackets]."
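A minimal way to wire that template into Ollama's local REST API (`/api/generate` on the default port 11434); the model tag assumes `ollama pull aya-expanse:32b` has been run:

```python
# Sketch: the prompt template above, sent to a local Ollama server.
import json
import urllib.request

PROMPT_TEMPLATE = (
    "Translate the following {src} text to {tgt}. "
    "Preserve formatting, proper nouns, and technical terms. "
    "If a term has no direct translation, keep the original and "
    "add [explanation in brackets].\n\n{text}"
)

def build_prompt(src: str, tgt: str, text: str) -> str:
    return PROMPT_TEMPLATE.format(src=src, tgt=tgt, text=text)

def translate(text: str, src: str, tgt: str, model: str = "aya-expanse:32b") -> str:
    body = json.dumps({
        "model": model,
        "prompt": build_prompt(src, tgt, text),
        "stream": False,  # one complete response instead of token stream
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (assumes Ollama is running):
#   print(translate("The cache is invalidated on write.", "English", "German"))
```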
For European pairs where quality matters most, [Llama 3.3 70B](/models/llama-3-3-70b) at Q4 on [RTX 4090](/hardware/rtx-4090) produces noticeably better translations — the 2-5-point BLEU gap is perceptible to native speakers. Q4 weights need ~40 GB: expect partial CPU offload on an RTX 4090, or run fully in unified memory on a [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max).
If you don't have >16 GB VRAM, use [llama.cpp](/tools/llama-cpp) with CPU inference. Aya Expanse 32B Q4 on a 16-core CPU with 32 GB RAM translates at 5-10 tok/s — a 5,000-word document in 5-10 minutes. For simplest possible setup: Argos Translate (`pip install argostranslate`) runs on CPU at 500-1000 words/second for supported pairs. Quality lower than LLM but 100× faster on any laptop.
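A sketch of the Argos Translate flow under those assumptions; the `en→fr` pair is illustrative, and the first call downloads the language package from the public index:

```python
# Sketch: offline batch translation with Argos Translate
# (pip install argostranslate).
def chunk_paragraphs(text: str) -> list[str]:
    """Split on blank lines so each paragraph is translated as one unit."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def install_pair(src: str = "en", tgt: str = "fr") -> None:
    import argostranslate.package
    argostranslate.package.update_package_index()
    pkg = next(p for p in argostranslate.package.get_available_packages()
               if p.from_code == src and p.to_code == tgt)
    argostranslate.package.install_from_path(pkg.download())

def translate_document(text: str, src: str = "en", tgt: str = "fr") -> str:
    import argostranslate.translate
    return "\n\n".join(argostranslate.translate.translate(p, src, tgt)
                       for p in chunk_paragraphs(text))

# Usage:
#   install_pair("en", "fr")   # one-time package download
#   print(translate_document("Hello.\n\nGoodbye."))
```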
For production deployment
Operator-grade recommendation.
Production translation pipelines choose between LLM-based (higher quality, higher cost) and specialized NMT (lower quality, lower cost) by language pair, domain, and accuracy requirements.
**LLM translation pipeline**: Deploy [Aya Expanse 32B](/models/aya-expanse-32b) or [Llama 3.3 70B](/models/llama-3-3-70b) behind [vLLM](/tools/vllm) as a translation API. Continuous batching handles 10-50 concurrent requests on a single [RTX 4090](/hardware/rtx-4090). Chunk by paragraph, not sentence — paragraph-level context improves pronoun resolution and discourse coherence by 30-50%.
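A sketch of the paragraph-chunked request loop against a vLLM OpenAI-compatible endpoint; the endpoint URL, model name, and prompt wording are assumptions to adapt to your deployment:

```python
# Sketch: paragraph-level chunking against a vLLM server started with e.g.
#   vllm serve CohereForAI/aya-expanse-32b
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"

def split_paragraphs(doc: str) -> list[str]:
    """Chunk on blank lines: paragraph context helps pronoun resolution."""
    return [p.strip() for p in doc.split("\n\n") if p.strip()]

def translate_paragraph(para: str, tgt: str,
                        model: str = "CohereForAI/aya-expanse-32b") -> str:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user",
                      "content": f"Translate to {tgt}. Preserve formatting.\n\n{para}"}],
        "temperature": 0.2,  # low temperature keeps translation literal
    }).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def translate_document(doc: str, tgt: str) -> str:
    return "\n\n".join(translate_paragraph(p, tgt) for p in split_paragraphs(doc))

# Usage (assumes the vLLM server is up):
#   print(translate_document("First para.\n\nSecond para.", "French"))
```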
**Specialized NMT pipeline**: Deploy NLLB-200 (3.3B params) via [CTranslate2](https://github.com/OpenNMT/CTranslate2). On [RTX 3060 12GB](/hardware/rtx-3060-12gb), translates 500-2000 words/second — 50-200× faster than LLM. For constrained domains, NMT quality is within 1-3 BLEU of LLM.
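The NLLB-200 + CTranslate2 path can be sketched like this, following CTranslate2's documented NLLB workflow; the converted-model directory and the FLORES code subset are assumptions:

```python
# Sketch: NLLB-200 under CTranslate2. Assumes the model was converted first:
#   ct2-transformers-converter --model facebook/nllb-200-3.3B --output_dir nllb-ct2
FLORES = {"en": "eng_Latn", "fr": "fra_Latn", "sw": "swh_Latn"}  # subset

def flores_code(lang: str) -> str:
    """Map ISO-639-1 to the FLORES-200 code NLLB expects."""
    return FLORES[lang]

def translate_nllb(text: str, src: str, tgt: str,
                   model_dir: str = "nllb-ct2") -> str:
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device="cuda")
    tok = transformers.AutoTokenizer.from_pretrained(
        "facebook/nllb-200-3.3B", src_lang=flores_code(src))
    # NLLB takes the target language tag as the first target token.
    source = [tok.convert_ids_to_tokens(tok.encode(text))]
    result = translator.translate_batch(
        source, target_prefix=[[flores_code(tgt)]])
    target = result[0].hypotheses[0][1:]  # drop the language-tag prefix
    return tok.decode(tok.convert_tokens_to_ids(target))

# Usage: translate_nllb("The pump pressure exceeds spec.", "en", "fr")
```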
**Hybrid pipeline (pragmatic)**: Route by language pair and domain. High-resource European pairs → Argos Translate/OPUS-MT (fast, cheap). Low-resource or domain-complex → Aya Expanse 32B (slower, costlier). Quality-gate: run COMET quality estimation on NMT output; if score <0.75, re-route to LLM. Catches 80-90% of NMT failures.
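The quality gate reduces to a small routing function. The scorer and fallback are injected here so the logic is testable without loading a COMET checkpoint; in production the scorer would be a reference-free QE model such as Unbabel/wmt22-cometkiwi-da via the `comet` package:

```python
# Sketch: COMET-gated routing between the NMT and LLM engines.
from typing import Callable

COMET_THRESHOLD = 0.75

def route(src: str, nmt_out: str,
          qe_score: Callable[[str, str], float],
          llm_translate: Callable[[str], str]) -> tuple[str, str]:
    """Return (engine, translation): keep NMT output unless QE flags it."""
    if qe_score(src, nmt_out) >= COMET_THRESHOLD:
        return ("nmt", nmt_out)
    # Low quality-estimation score: escalate to the slower, better engine.
    return ("llm", llm_translate(src))
```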
**Cost economics**: Aya Expanse 32B on an RTX 4090 (450 W, $0.10-0.15/kWh) translates ~5,000-15,000 words/kWh → roughly $0.007-0.03 per 1,000 words in electricity. Google Translate API: $20 per million characters, about $120 per million words at ~6 characters/word. At 10M words/month, self-hosted electricity runs $70-300 vs ~$1,200 API; at 1B words/month, electricity alone is ~$7,000-30,000 vs ~$120,000 API, though sustaining that volume needs a multi-GPU fleet and amortized hardware cost.
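The electricity-only arithmetic, as a worked sketch (characters-per-word is an assumption, ~6 for English; hardware amortization excluded):

```python
# Worked arithmetic for the self-hosted vs API cost comparison.
def self_hosted_cost_per_1000_words(words_per_kwh: float,
                                    usd_per_kwh: float) -> float:
    # Each kWh buys words_per_kwh words of translation.
    return usd_per_kwh / (words_per_kwh / 1000)

def api_cost_per_1000_words(usd_per_million_chars: float,
                            chars_per_word: float = 6.0) -> float:
    return 1000 * chars_per_word * usd_per_million_chars / 1_000_000

# Best case: 15,000 words/kWh at $0.10/kWh -> ~$0.0067 per 1,000 words.
# Worst case: 5,000 words/kWh at $0.15/kWh -> $0.03 per 1,000 words.
# Google Translate at $20/M chars -> ~$0.12 per 1,000 words.
```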
**Quality assurance**: Human evaluation of 1-5% of translations (stratified by language pair and COMET score). Track BLEU/COMET over time per language pair. Maintain terminology database for consistent translation of domain-specific terms.
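One way to implement the stratified sample; the strata and per-stratum rates here (5% of low-COMET output, 1% of the rest) are illustrative, not prescriptive:

```python
# Sketch: stratified sampling for human review, keyed on
# (language pair, COMET bucket).
import random
from collections import defaultdict

def strata_key(item: dict) -> tuple:
    bucket = "low" if item["comet"] < 0.75 else "ok"
    return (item["pair"], bucket)

def sample_for_review(items: list[dict], seed: int = 0) -> list[dict]:
    rng = random.Random(seed)  # fixed seed for reproducible audits
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for it in items:
        groups[strata_key(it)].append(it)
    picked: list[dict] = []
    for key, group in groups.items():
        rate = 0.05 if key[1] == "low" else 0.01
        k = max(1, round(rate * len(group)))  # review at least one per stratum
        picked.extend(rng.sample(group, k))
    return picked
```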
What breaks
Failure modes operators see in the wild.
- **Code-switching corruption.** Source text containing small amounts of a second language causes the model to switch output language mid-sentence, producing trilingual gibberish. Common in software docs. Mitigation: detect language segments in source, translate each independently with explicit language tags.
- **Formality level mismatch.** Wrong register — informal "tu" for formal "vous," casual "du" for "Sie." Translation is word-correct but socially wrong. Mitigation: explicit formality instruction in prompt or post-translation formality classifier.
- **Cultural context loss.** Idioms and metaphors translated literally produce nonsense. "It's raining cats and dogs" → literal translation meaningless in target language. Mitigation: explicit localization instruction ("replace idioms with target-language equivalents"). Maintain glossary of idiom pairs per language pair.
- **Proper noun mistranslation.** Names, brands get translated when they shouldn't or stay untranslated when they should. Mitigation: NER pre-pass to identify named entities → preserve via placeholder → reinsert after translation.
- **Hallucinated additions in empty segments.** Incomplete sentences or placeholder text ("TODO") cause LLMs to "helpfully" generate plausible content. Dangerous in technical docs. Mitigation: mark empty segments with explicit tags; post-process to verify no content added where source was empty.
- **RTL language rendering issues.** Arabic/Hebrew/Persian text mixed with LTR text (numbers, URLs) produces incorrect visual ordering. Mitigation: wrap mixed-direction segments in Unicode bidirectional control characters. Test on actual RTL-configured systems.
- **Sentence segmentation errors.** Wrong boundary detection (abbreviations, decimal numbers) produces disconnected translations. Chinese/Japanese without explicit boundaries particularly vulnerable. Mitigation: use language-specific segmentation (spaCy, Stanza, ICU BreakIterator), not regex.
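The placeholder round-trip for proper nouns (the mitigation above) can be sketched as follows; a real pipeline would use an NER model such as spaCy, and the regex here is only a stand-in for illustration:

```python
# Sketch: mask named entities before translation, reinsert after.
import re

# Stand-in "NER": capitalized multiword names or all-caps acronyms.
ENTITY = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b|\b[A-Z]{2,}\b")

def mask_entities(text: str) -> tuple[str, dict[str, str]]:
    table: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        token = f"⟦E{len(table)}⟧"  # opaque token the model won't translate
        table[token] = m.group(0)
        return token
    return ENTITY.sub(repl, text), table

def unmask_entities(text: str, table: dict[str, str]) -> str:
    for token, original in table.items():
        text = text.replace(token, original)
    return text
```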
Hardware guidance
**Hobbyist ($500-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). Runs Aya Expanse 32B Q2-Q3 (~10-14 GB) or 7-8B LLMs at Q8 for translation. Quality modest at Q2 — BLEU drops 2-4 points from Q4 baseline. CPU+NMT (NLLB-200, Argos) for high-throughput adequate-quality translation. [Apple M4 Pro](/hardware/apple-m4-pro) 24GB runs Aya 32B Q3 — a solid $1,400 translation workstation with unified memory advantage.
**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Aya 32B Q4-Q5 with 16K context — quality sweet spot. 5090 32 GB runs Llama 3.3 70B Q4 entirely in VRAM. Throughput: 50-200 paragraphs/min for 32B, 20-80 for 70B.
**Enterprise ($8,000-$25,000)**: [RTX A6000](/hardware/rtx-a6000) 48 GB or [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for sustained 24/7 serving. 2× [RTX 5090](/hardware/rtx-5090) (64 GB combined) run Llama 3.3 70B at Q5-Q6 with tensor parallelism (Q8 weights alone are ~70 GB and leave no room for KV cache). More VRAM → larger models at less aggressive quantization → directly improves BLEU/COMET.
**Frontier ($50,000+)**: [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) or [H200](/hardware/nvidia-h200) for [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) at FP8 — best-in-class multilingual and rare-language performance. Worth it when translation quality directly impacts revenue.
**CPU-only viable**: 16-32 core CPUs (Ryzen 9950X, i9-14900K) with 64+ GB RAM run 32B Q4 at 8-15 tok/s via [llama.cpp](/tools/llama-cpp). For overnight batch of 50,000+ pages, CPU is more cost-effective than GPU — trading time for hardware cost.
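A back-of-envelope check on that batch, under assumed values (300 words/page, ~1.3 output tokens per word, 10 tok/s for the 32B LLM, ~750 words/s for CPU NMT): the LLM path is far from overnight on one box, while NMT finishes in hours — plan to shard LLM batches across machines or route the bulk to NMT.

```python
# Worked arithmetic: hours to translate a batch on one CPU box.
def llm_batch_hours(pages: int, words_per_page: float = 300.0,
                    tokens_per_word: float = 1.3,
                    tok_per_s: float = 10.0) -> float:
    return pages * words_per_page * tokens_per_word / tok_per_s / 3600

def nmt_batch_hours(pages: int, words_per_page: float = 300.0,
                    words_per_s: float = 750.0) -> float:
    return pages * words_per_page / words_per_s / 3600

# 50,000 pages: ~540 h for the 32B LLM at 10 tok/s vs ~5.6 h for CPU NMT.
```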
Runtime guidance
**If you need one-off translations on your machine** → [Ollama](/tools/ollama) with [Aya Expanse 32B](/models/aya-expanse-32b). Zero setup, interactive with format preservation. European pairs: [Llama 3.3 70B](/models/llama-3-3-70b) on Ollama or [LM Studio](/tools/lm-studio). Apple Silicon: [MLX LM](/tools/mlx-lm).
**If building production translation API** → [vLLM](/tools/vllm) behind FastAPI. Continuous batching handles 10-50 concurrent requests on single [RTX 4090](/hardware/rtx-4090) with <5s latency/paragraph. Request queuing with priority (interactive before batch).
**If speed/cost dominate over quality (constrained domain)** → Specialized NMT via [CTranslate2](https://github.com/OpenNMT/CTranslate2) with OPUS-MT/NLLB-200. 2-4× speedup over raw Transformers. Deploy as primary engine with LLM as quality-escalation fallback (COMET <0.75 triggers LLM re-translation).
**If batch document translation (website localization)** → Argos Translate for high-resource pairs at 500-1,000 words/second on CPU. For quality-critical: paragraph-chunked LLM translation via vLLM batch. Pipeline: ingestion → language detection → paragraph segmentation → LLM translation → terminology validation → output assembly.
**If real-time chat translation** → Streaming via WebSocket. Aya Expanse 32B on RTX 4090: 200-500ms TTFT for short sentences. Sub-200ms: use NLLB-200 distilled to 600M params. Trade quality for speed.
**If glossary-enforced translation (domain terminology)** → [vLLM](/tools/vllm) with glossary-injected prompts. Maintain a terminology database (JSON/SQL) of source→target mappings per language pair and prepend the matching entries to each prompt at request time. For NMT: constrained decoding to force specific term translations.
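A sketch of the lookup-and-prepend step; the glossary store and prompt wording are assumptions (any per-pair terminology database works the same way):

```python
# Sketch: glossary lookup + prompt assembly for terminology enforcement.
GLOSSARY = {
    ("en", "de"): {"circuit breaker": "Leistungsschalter",
                   "ground fault": "Erdschluss"},
}

def relevant_terms(text: str, src: str, tgt: str) -> dict[str, str]:
    """Only inject terms that actually occur in the source text."""
    terms = GLOSSARY.get((src, tgt), {})
    lower = text.lower()
    return {s: t for s, t in terms.items() if s in lower}

def glossary_prompt(text: str, src: str, tgt: str) -> str:
    lines = [f'- "{s}" must be translated as "{t}"'
             for s, t in relevant_terms(text, src, tgt).items()]
    glossary = ("Use this terminology:\n" + "\n".join(lines) + "\n\n") if lines else ""
    return f"{glossary}Translate the following {src} text to {tgt}:\n\n{text}"
```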