RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
  • Will it run?
GUIDES
  • Best GPU
  • Best laptop
  • Best Mac
  • Best used GPU
  • Best budget GPU
  • Best GPU for Ollama
  • Best GPU for SD
  • AI PC build $2K
  • CUDA vs ROCm
  • 16 vs 24 GB
  • Compare hardware
  • Custom compare
REF
  • Systems
  • Ecosystem maps
  • Pillar guides
  • Methodology
  • Glossary
  • Errors KB
  • Troubleshooting
  • Resources
  • Public API
EDITOR
  • About
  • About the author
  • Changelog
  • Latest
  • Updates
  • Submit benchmark
  • Send feedback
  • Trust
  • Editorial policy
  • How we make money
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

SYS · ONLINE · UPTIME 100% · 2026 · operator-owned
Tags: text · machine translation · language translation · multilingual

Translation

Between-language text translation. Multilingual instruction-tuned models handle this competently; specialized translation models exist for very-low-resource languages.

Capability notes

Machine translation quality is evaluated through **automated metrics** (BLEU, COMET, chrF) scored against reference translations, and **human evaluation** measuring adequacy and fluency. LLM-based translation using instruction-tuned models has closed the gap with specialized neural machine translation (NMT) on high-resource pairs (English↔French/German/Spanish/Chinese) and often exceeds NMT on low-resource pairs.

**BLEU scores** (0-100): [Llama 3.3 70B](/models/llama-3-3-70b) achieves 35-40 on English→German, [Aya Expanse 32B](/models/aya-expanse-32b) 33-38, Google Translate 38-43, DeepL 40-45. The ~5-point gap on European pairs is noticeable — human evaluators prefer the commercial output in roughly 60-70% of A/B tests. For low-resource pairs (English→Swahili, English→Bengali), Aya Expanse often outperforms Google Translate because its training corpus (the Aya collection) spans 100+ languages, including pairs commercial APIs neglect.

**COMET scores** (0-1, better correlation with human judgment): Aya Expanse 32B scores 0.82-0.88 on WMT test sets for high-resource pairs vs DeepL's 0.85-0.92. The gap narrows to 0.01-0.03 for mid-resource pairs (English↔Czech, English↔Turkish).

**Model selection**: [Aya Expanse 32B](/models/aya-expanse-32b) is the best general-purpose open-weight multilingual translator — 23 first-class languages (trained on a 100+-language corpus) with instruction-following that includes formality control. [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) handles complex translation + localization (idioms, cultural references, marketing copy). [Llama 3.3 70B](/models/llama-3-3-70b) is strong on Western European pairs, weaker on Asian and African languages. [DeepSeek V4](/models/deepseek-v4) with reasoning mode suits ambiguous source text (legal, philosophical).

**Specialized NMT vs LLM**: NMT models (Argos Translate, OPUS-MT, Meta's NLLB-200 at 3.3B params) are 10-100× smaller than LLMs and run on CPU at 100-1000× higher throughput. For constrained-domain translation (technical manuals, medical reports), specialized NMT matches or exceeds LLM quality at a fraction of the compute. NLLB-200 is the reference NMT baseline for 200 languages.
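Automated metrics like BLEU reduce to n-gram precision against a reference plus a brevity penalty. A toy sentence-level sketch for intuition only — real evaluation should use sacreBLEU or COMET, which handle tokenization, smoothing, and corpus-level statistics properly:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: smoothed n-gram precision x brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        overlap = sum((ngrams(hyp, n) & ngrams(ref, n)).values())  # clipped matches
        total = max(sum(ngrams(hyp, n).values()), 1)
        log_p += math.log((overlap + 1) / (total + 1))  # add-1 smoothing
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(log_p / max_n)
```

Scores are only comparable under identical tokenization and smoothing — one reason published BLEU numbers for the same system can differ by several points.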

If you just want to try this

Lowest-friction path to a working setup.

Pull [Aya Expanse 32B](/models/aya-expanse-32b) on [Ollama](/tools/ollama) (`ollama pull aya-expanse:32b`). Best open-weight multilingual instruction model for straightforward translation — 23 first-class languages (trained on a 100+-language corpus), understands translation-specific instructions ("translate to informal Japanese"), and fits consumer hardware. At Q4 it needs ~20 GB VRAM — an [RTX 3090](/hardware/rtx-3090) or [RTX 4090](/hardware/rtx-4090) handles it. At Q2-Q3 it fits 16 GB cards like the [RTX 5070 Ti](/hardware/rtx-5070-ti).

Prompt format: "Translate the following [source language] text to [target language]. Preserve formatting, proper nouns, and technical terms. If a term has no direct translation, keep the original and add [explanation in brackets]."

For European pairs where quality matters most, [Llama 3.3 70B](/models/llama-3-3-70b) at Q4 on an [RTX 4090](/hardware/rtx-4090) produces noticeably better translations — a 2-5 point BLEU gap that native speakers can perceive. It requires ~40 GB — fits an RTX 4090 with partial offload, or a [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max).

If you don't have >16 GB VRAM, use [llama.cpp](/tools/llama-cpp) with CPU inference. Aya Expanse 32B Q4 on a 16-core CPU with 32 GB RAM translates at 5-10 tok/s — a 5,000-word document in 5-10 minutes.

For the simplest possible setup: Argos Translate (`pip install argostranslate`) runs on CPU at 500-1,000 words/second for supported pairs. Quality is lower than an LLM's, but it's ~100× faster and runs on any laptop.
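If you script this, a small helper keeps every request on the template above (a sketch — the function name and the optional `formality` switch are mine, not an Ollama or Aya API):

```python
def build_translation_prompt(text, source_lang, target_lang, formality=None):
    """Build the translation prompt described above. `formality` is optional
    ("formal"/"informal"); instruction-tuned models honor register hints."""
    instructions = (
        f"Translate the following {source_lang} text to {target_lang}. "
        "Preserve formatting, proper nouns, and technical terms. "
        "If a term has no direct translation, keep the original and add "
        "[explanation in brackets]."
    )
    if formality:
        instructions += f" Use {formality} register."
    return f"{instructions}\n\n{text}"
```

Pipe the result to `ollama run aya-expanse:32b` or POST it to whatever runtime you serve the model with.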

For production deployment

Operator-grade recommendation.

Production translation pipelines choose between LLM-based (higher quality, higher cost) and specialized NMT (lower quality, lower cost) by language pair, domain, and accuracy requirements.

**LLM translation pipeline**: Deploy [Aya Expanse 32B](/models/aya-expanse-32b) or [Llama 3.3 70B](/models/llama-3-3-70b) behind [vLLM](/tools/vllm) as a translation API. Continuous batching handles 10-50 concurrent requests on a single [RTX 4090](/hardware/rtx-4090). Chunk by paragraph, not sentence — paragraph-level context improves pronoun resolution and discourse coherence by 30-50%.

**Specialized NMT pipeline**: Deploy NLLB-200 (3.3B params) via [CTranslate2](https://github.com/OpenNMT/CTranslate2). On an [RTX 3060 12GB](/hardware/rtx-3060-12gb), it translates 500-2,000 words/second — 50-200× faster than an LLM. For constrained domains, NMT quality is within 1-3 BLEU of the LLM.

**Hybrid pipeline (pragmatic)**: Route by language pair and domain. High-resource European pairs → Argos Translate/OPUS-MT (fast, cheap). Low-resource or domain-complex → Aya Expanse 32B (slower, costlier). Quality-gate: run COMET quality estimation on NMT output; if the score is <0.75, re-route to the LLM. This catches 80-90% of NMT failures.

**Cost economics**: Aya Expanse 32B on an RTX 4090 (450 W, $0.10-0.15/kWh) translates ~5,000-15,000 words/kWh → roughly $0.01-0.03 in electricity per 1,000 words. Google Translate API: $20/million characters, which at ~5-6 characters per word is ~$0.10-0.12 per 1,000 words. At 10M words/month that's ~$100-300 self-hosted vs ~$1,000-1,200 API — a real gap, though GPU amortization eats much of it. At 1B words/month it's $3,000-10,000 self-hosted vs $100,000+ API — decisive.

**Quality assurance**: Human evaluation of 1-5% of translations (stratified by language pair and COMET score). Track BLEU/COMET over time per language pair. Maintain a terminology database for consistent translation of domain-specific terms.
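The hybrid quality-gate can be sketched as a router. `translate_nmt`, `translate_llm`, and `comet_qe_score` are stand-ins for your NMT engine (e.g. CTranslate2 + NLLB-200), your LLM endpoint, and a reference-free COMET-QE model — none of these names are real library APIs:

```python
# NMT first, escalate to the LLM when quality estimation falls below
# the threshold. Callables are injected so the router stays engine-agnostic.

COMET_THRESHOLD = 0.75  # below this, re-route to the LLM

def translate_with_gate(text, src, tgt, translate_nmt, translate_llm, comet_qe_score):
    draft = translate_nmt(text, src, tgt)   # fast, cheap first pass
    score = comet_qe_score(text, draft)     # reference-free QE, 0-1
    if score >= COMET_THRESHOLD:
        return draft, "nmt", score
    better = translate_llm(text, src, tgt)  # slower, higher quality
    return better, "llm", score
```

Log the engine label and QE score per request — that stratification is exactly what the human-evaluation sampling below needs.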

What breaks

Failure modes operators see in the wild.

- **Code-switching corruption.** Source text containing small amounts of a second language causes the model to switch output language mid-sentence, producing trilingual gibberish. Common in software docs. Mitigation: detect language segments in the source and translate each independently with explicit language tags.
- **Formality-level mismatch.** Wrong register — informal "tu" for formal "vous," casual "du" for "Sie." The translation is word-correct but socially wrong. Mitigation: explicit formality instruction in the prompt, or a post-translation formality classifier.
- **Cultural context loss.** Idioms and metaphors translated literally produce nonsense: "It's raining cats and dogs" rendered word-for-word is meaningless in the target language. Mitigation: explicit localization instruction ("replace idioms with target-language equivalents"); maintain a glossary of idiom pairs per language pair.
- **Proper-noun mistranslation.** Names and brands get translated when they shouldn't, or stay untranslated when they should. Mitigation: NER pre-pass to identify named entities → preserve via placeholder → reinsert after translation.
- **Hallucinated additions in empty segments.** Incomplete sentences or placeholder text ("TODO") cause LLMs to "helpfully" generate plausible content. Dangerous in technical docs. Mitigation: mark empty segments with explicit tags; post-process to verify no content was added where the source was empty.
- **RTL rendering issues.** Arabic/Hebrew/Persian text mixed with LTR content (numbers, URLs) produces incorrect visual ordering. Mitigation: wrap mixed-direction segments in Unicode bidirectional control characters, and test on actual RTL-configured systems.
- **Sentence segmentation errors.** Wrong boundary detection (abbreviations, decimal numbers) produces disconnected translations; Chinese/Japanese text without explicit boundaries is particularly vulnerable. Mitigation: use language-specific segmentation (spaCy, Stanza, ICU BreakIterator), not regex.
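The proper-noun mitigation (NER pre-pass → placeholder → reinsert) reduces to two small functions. A sketch with the entity list supplied by the caller — production would get it from a real NER pass (spaCy or similar):

```python
def mask_entities(text, entities):
    """Swap known named entities for numbered placeholders before translation,
    so the model cannot translate or mangle them."""
    mapping = {}
    for i, ent in enumerate(entities):
        placeholder = f"[[E{i}]]"  # chosen to survive translation untouched
        mapping[placeholder] = ent
        text = text.replace(ent, placeholder)
    return text, mapping

def unmask_entities(translated, mapping):
    """Reinsert the original entities after translation."""
    for placeholder, ent in mapping.items():
        translated = translated.replace(placeholder, ent)
    return translated
```

Verify the placeholder token per model — some models copy unusual bracket sequences through reliably, others rewrite them, so test before trusting the round trip.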

Hardware guidance

**Hobbyist ($500-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). Runs Aya Expanse 32B at Q2-Q3 (~10-14 GB) or 7-8B LLMs at Q8 for translation. Quality is modest at Q2 — BLEU drops 2-4 points from the Q4 baseline. Use CPU+NMT (NLLB-200, Argos) for high-throughput, adequate-quality translation. [Apple M4 Pro](/hardware/apple-m4-pro) 24GB runs Aya 32B Q3 — a solid $1,400 translation workstation with the unified-memory advantage.

**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). Aya 32B at Q4-Q5 with 16K context — the quality sweet spot. The 5090's 32 GB runs Llama 3.3 70B Q4 entirely in VRAM. Throughput: 50-200 paragraphs/min for 32B, 20-80 for 70B.

**Enterprise ($8,000-$25,000)**: [RTX A6000](/hardware/rtx-a6000) 48 GB or [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for sustained 24/7 serving. 2× [RTX 5090](/hardware/rtx-5090) (64 GB total) for tensor-parallel 70B at Q8. More VRAM → larger models at higher quantization → directly improves BLEU/COMET.

**Frontier ($50,000+)**: [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) or [H200](/hardware/nvidia-h200) for [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) at FP8 — best-in-class multilingual and rare-language performance. Worth it when translation quality directly impacts revenue.

**CPU-only viable**: 16-32-core CPUs (Ryzen 9950X, i9-14900K) with 64+ GB RAM run 32B Q4 at 8-15 tok/s via [llama.cpp](/tools/llama-cpp). For an overnight batch of 50,000+ pages, CPU is more cost-effective than GPU — trading time for hardware cost.
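The VRAM figures above follow from simple arithmetic: quantized weight bytes plus a KV cache that grows with context. A rough estimator — the KV head count, KV precision, and fixed overhead are assumptions that vary by model:

```python
def estimate_vram_gb(params_b, quant_bits, n_layers, kv_heads, head_dim,
                     context_len, kv_bytes=2, overhead_gb=1.5):
    """Rough VRAM estimate: quantized weights + KV cache + fixed overhead.
    KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * bytes."""
    weights_gb = params_b * quant_bits / 8  # params in billions -> GB
    kv_gb = 2 * n_layers * kv_heads * head_dim * context_len * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb
```

For a 32B model at ~4.5 effective bits (Q4), 64 layers, 8 KV heads of dim 128, and 16K context, this lands near 24 GB — which is why the tiers above pair 32B Q4 + 16K context with 24 GB cards.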

Runtime guidance

**If you need one-off translations on your machine** → [Ollama](/tools/ollama) with [Aya Expanse 32B](/models/aya-expanse-32b). Zero setup, interactive, with format preservation. European pairs: [Llama 3.3 70B](/models/llama-3-3-70b) on Ollama or [LM Studio](/tools/lm-studio). Apple Silicon: [MLX LM](/tools/mlx-lm).

**If you're building a production translation API** → [vLLM](/tools/vllm) behind FastAPI. Continuous batching handles 10-50 concurrent requests on a single [RTX 4090](/hardware/rtx-4090) with <5 s latency per paragraph. Add request queuing with priority (interactive before batch).

**If speed/cost dominate over quality (constrained domain)** → Specialized NMT via [CTranslate2](https://github.com/OpenNMT/CTranslate2) with OPUS-MT/NLLB-200. 2-4× speedup over raw Transformers. Deploy as the primary engine with the LLM as a quality-escalation fallback (COMET <0.75 triggers LLM re-translation).

**If batch document translation (website localization)** → Argos Translate for high-resource pairs at 500-1,000 words/second on CPU. For quality-critical work: paragraph-chunked LLM translation via vLLM batch. Pipeline: ingestion → language detection → paragraph segmentation → LLM translation → terminology validation → output assembly.

**If real-time chat translation** → Streaming via WebSocket. Aya Expanse 32B on an RTX 4090: 200-500 ms TTFT for short sentences. For sub-200 ms, use NLLB-200 distilled to 600M params and trade quality for speed.

**If glossary-enforced translation (domain terminology)** → vLLM with prompt injection. Maintain a terminology database (JSON/SQL) of source→target mappings per language pair and prepend the relevant glossary entries to the prompt. vLLM supports prompt templates with variable substitution for glossary injection. For NMT: constrained decoding to force specific term translations.
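Glossary-enforced prompting reduces to prepending the relevant term pairs. A sketch — the function name and glossary schema are illustrative, and the server (vLLM or otherwise) just receives whatever prompt you build:

```python
def inject_glossary(prompt, glossary, source_text):
    """Prepend only the glossary entries that actually occur in the source,
    keeping the prompt short. `glossary` maps source terms -> required targets."""
    relevant = {s: t for s, t in glossary.items() if s in source_text}
    if not relevant:
        return prompt
    lines = "\n".join(f'- "{s}" must be translated as "{t}"'
                      for s, t in relevant.items())
    return f"Use this terminology without exception:\n{lines}\n\n{prompt}"
```

Filtering to terms present in the source matters at scale: a full domain glossary can be thousands of entries, and prompt tokens cost latency and context budget.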

Setup walkthrough

  1. Install [Ollama](/tools/ollama) → `ollama pull aya-expanse:8b` (~5 GB — Cohere's multilingual model, 23 languages).
  2. `ollama run aya-expanse:8b` → prompt: "Translate the following English to Japanese: 'The cherry blossoms bloom in early April.'"
  3. First translation arrives in 2-5 seconds. Quality is competitive with Google Translate on supported pairs, with better instruction-following (register, formatting).
  4. For European languages: `ollama pull llama3.2:3b` (~2 GB, lighter) — handles EN↔DE/FR/ES/IT competently.
  5. For low-resource languages (Swahili, Urdu, Bengali): `ollama pull aya-expanse:32b` (~20 GB) — the 32B variant has dramatically better low-resource coverage.
  6. Batch: `cat phrases.txt | while read line; do ollama run aya-expanse:8b "Translate to French: $line"; done`

The cheap setup

Aya Expanse 8B runs at 40-60 tok/s on a used GTX 1060 6 GB ($60) — a paragraph in 2-5 seconds. Llama 3.2 3B runs on any $300 laptop CPU at 20-40 tok/s for major European languages. Translation is VRAM-light in the 3B-8B range. Build: used Dell Optiplex ($150) + GTX 1060 6 GB ($60) + 16 GB RAM ($30). Total: ~$240. For broad low-resource coverage, the 32B Aya Expanse needs ~20-24 GB of VRAM — out of reach at this budget.

The serious setup

Used [RTX 3090 24 GB](/hardware/rtx-3090) (~$700-900). Runs Aya Expanse 32B at 25-40 tok/s — near-Google-Translate quality across 23 languages, including low-resource pairs. Qwen 2.5 32B (multilingual instruction-tuned) at 40-60 tok/s for Asian language pairs. Pair with a Ryzen 7 7700X + 32 GB DDR5 + 1 TB NVMe. Total: ~$1,500-1,800. For enterprise translation (100K+ words/day), batch with [vLLM](/tools/vllm) — a 3-5× throughput improvement. Translation is not VRAM-intensive below 32B.

Common beginner mistake

**The mistake:** Using an English-centric chat model (Llama 3.1 8B, Mistral 7B) for translation and getting awkward, literal output.

**Why it fails:** English-centric models are trained predominantly on English data — they understand other languages as "vocabulary learned from English explanations" rather than with native fluency. Translations come out grammatically correct but stylistically unnatural — like a textbook, not a native speaker.

**The fix:** Use a true multilingual model: Aya Expanse (23 languages, trained on multilingual corpora), Qwen 2.5 (strong Asian-language support), or Command R+ (enterprise-grade multilingual). These models produce natural-sounding translations because they learned the languages natively, not as a translation task.

Recommended setup for translation

Recommended hardware
Best GPU for local AI →
All workloads ranked across VRAM tiers.
Recommended runtimes

Browse all tools for runtimes that fit this workload.

Budget build
AI PC under $1,000 →
Best GPU for this task
Best GPU for local AI →

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
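"Bandwidth decides decode speed" is quantifiable: single-stream decode must stream the active weights from VRAM once per generated token, so tok/s is bounded above by bandwidth divided by model size:

```python
def decode_tok_s_upper_bound(bandwidth_gb_s, model_size_gb):
    """Memory-bandwidth ceiling on single-stream decode speed: every token
    reads the full (quantized) weights once."""
    return bandwidth_gb_s / model_size_gb

# e.g. a ~1008 GB/s card with an 18 GB Q4 32B model:
# ceiling = 1008 / 18 = 56 tok/s; observed speeds land below this.
```

Real decode lands below the ceiling (KV-cache reads, kernel overhead), and MoE models stream only active-expert weights — which is why they decode faster than total parameter count suggests.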

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running translation locally. Each links to a diagnose+fix walkthrough.

  • CUDA out of memory →
  • Model keeps crashing →
  • Ollama running slow →
  • llama.cpp too slow →

Before you buy

Verify your specific hardware can handle translation before committing money.

  • Will it run on my hardware? →
  • Custom compatibility check →
  • GPU recommender (4 questions) →

Featured models

  • Command R+ (Aug 2024)
  • Aya Expanse 32B
  • Qwen 3 32B
Buyer guides
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
  • Best used GPU for local AI →
  • Will it run on my hardware? →
Compare hardware
  • Curated head-to-heads →
  • Custom comparison tool →
  • RTX 4090 vs RTX 5090 →
  • RTX 3090 vs RTX 4090 →
Troubleshooting
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →
  • Model keeps crashing →
Specialized buyer guides
  • GPU for ComfyUI (image-gen) →
  • GPU for KoboldCpp (RP/long-context) →
  • GPU for AI agents →
  • GPU for local OCR →
  • GPU for voice cloning →
  • Upgrade from RTX 3060 →
  • Beginner setup →
  • AI PC for students →
Updated 2026 roundup
  • Best free local AI tools (2026) →