RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Natural language processing

Machine Translation

Machine translation (MT) is the task of automatically translating text from one natural language to another, today almost always with a neural network model. Operators encounter MT when running models like NLLB-200, M2M-100, or specialized fine-tunes of Llama or Mistral that translate between language pairs. The model takes a source-language sentence as input and generates a target-language sentence token by token. MT models are typically encoder-decoder transformers, though some decoder-only LLMs can translate with appropriate prompting. Key operator concerns: VRAM usage (NLLB-200 3.3B needs roughly 6.6 GB for weights in FP16 and about 2 GB at 4-bit), latency (translating a paragraph takes seconds on consumer GPUs), and the trade-off between model size, quality, and speed.
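The VRAM figures for a model follow from simple arithmetic: parameter count times bits per parameter. The sketch below is a rough estimate of weight memory only; the function name is ours, and it ignores activations, decoding caches, and framework overhead. Real 4-bit formats also store scaling data, so actual files are somewhat larger than the 4-bit line suggests.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# NLLB-200 3.3B at different precisions (weights only; runtime
# overhead comes on top of these numbers).
for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: {estimate_weight_memory_gb(3.3e9, bits):.2f} GB")
```

For a 3.3B-parameter model this gives about 6.6 GB in FP16 and about 1.65 GB at a pure 4 bits per weight, which is why the same model can look either comfortable or tight on a 12 GB card depending on precision.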

Deeper dive

Modern neural machine translation (NMT) uses transformer architectures. Encoder-decoder models like NLLB-200 and M2M-100 process the source sentence with an encoder, then an autoregressive decoder generates the translation. Decoder-only LLMs (e.g., Llama, Mistral) can also translate via prompting, for example by showing a worked pair such as "Translate from English to French: 'Hello' -> 'Bonjour'" before the sentence to translate. However, they may require careful prompt engineering and are often less reliable for low-resource languages. Quantization (e.g., Q4_K_M in GGUF) reduces VRAM footprint but can slightly degrade translation quality, especially for rare words. Operators running MT locally should consider: language pair coverage (NLLB-200 supports 200 languages), batch size for throughput, and whether to use CPU offload for models that exceed VRAM. Hugging Face Transformers runs NLLB-200 and M2M-100 directly. llama.cpp, vLLM, and Ollama are built mainly around decoder-only LLMs, so their support for encoder-decoder MT models is limited and varies by architecture and version; check each tool's supported-model list before assuming a given MT checkpoint will load. With those tools, translation is usually done by prompting a decoder-only LLM, which in Ollama can be packaged with a custom Modelfile.
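Prompting a decoder-only LLM for translation usually means giving it an instruction, optionally a few worked pairs, and then the sentence to translate. The following is a minimal sketch of one such prompt layout; the function name and the exact line format are our own choices, not a format any particular model requires.

```python
def build_translation_prompt(text, src_lang, tgt_lang, examples=()):
    """Build a few-shot translation prompt for a decoder-only LLM.

    `examples` is an iterable of (source, target) pairs shown to the
    model before the sentence that should actually be translated.
    """
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {text}")
    # End on the target-language label so the model continues with
    # the translation rather than with commentary.
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


prompt = build_translation_prompt(
    "Good morning", "English", "French", examples=[("Hello", "Bonjour")]
)
print(prompt)
```

Ending the prompt on the target-language label is a common way to keep the model from adding explanations; stopping generation at the first newline helps for the same reason.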

Practical example

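An operator wants to translate English to Swahili on an RTX 3060 12GB. They download NLLB-200-distilled-600M from Hugging Face and load it via Transformers: model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M"). That call loads the unquantized checkpoint (about 2.4 GB of weights in FP32, about 1.2 GB in FP16); a 4-bit copy would be around 350 MB but needs a separate quantization step. Either way the model fits entirely in VRAM and translates a 100-word paragraph in ~2 seconds. The 3.3B version (~2 GB at 4-bit, ~6.6 GB in FP16) still fits, but latency increases to ~8 seconds. As an alternative, they might try a 7B Llama fine-tune (Q4, ~4 GB), but they must prompt it correctly and should verify quality on Swahili, since decoder-only LLMs are often less reliable for low-resource languages.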
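The English-to-Swahili case can be sketched with Transformers as follows. This assumes `torch` and `transformers` are installed and that the checkpoint can be downloaded; the function and constant names are ours, and `swh_Latn` is the NLLB language code we assume for Swahili. The model is reloaded on every call to keep the sketch short, which a real script would avoid.

```python
MODEL_NAME = "facebook/nllb-200-distilled-600M"
SRC_LANG = "eng_Latn"  # NLLB code for English
TGT_LANG = "swh_Latn"  # NLLB code for Swahili (assumed target here)


def translate(text, src_lang=SRC_LANG, tgt_lang=TGT_LANG, max_new_tokens=256):
    """Translate `text` with NLLB; loads the model on each call for brevity."""
    # Imported lazily so the heavy dependencies are only needed when
    # a translation is actually requested.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=dtype)
    model = model.to(device)

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # NLLB selects the output language by forcing its language code
    # as the first generated token.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `translate("Where is the nearest hospital?")` returns the Swahili text as a string. Swapping `MODEL_NAME` for `facebook/nllb-200-3.3B` exercises the larger model with the same code path.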

Workflow example

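In a local AI workflow, an operator might run machine translation via a script using Hugging Face Transformers. They load a model like facebook/nllb-200-3.3B, set the source language code on the tokenizer (e.g., eng_Latn), and call model.generate() with the target language code (e.g., fra_Latn) forced as the first decoder token. NLLB is steered by these language codes, not by a natural-language instruction such as "Translate to French: Hello"; that prompt style applies to decoder-only LLMs, which llama.cpp runs from GGUF files via its CLI (llama-cli in current builds, formerly ./main). Before planning around a GGUF build of NLLB or serving it through vLLM with --model facebook/nllb-200-3.3B, confirm that the tool version in use supports the NLLB architecture, since encoder-decoder coverage in both is limited. Operators monitor VRAM usage with nvidia-smi and adjust maximum sequence length or batch size to avoid OOM.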
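The monitoring and batching steps can be scripted as well. The sketch below reads used and total VRAM through nvidia-smi's CSV query mode and splits a list of sentences into fixed-size batches; the helper names are ours, and the batch size an operator settles on is an empirical choice, not something this code derives.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) on every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def chunked(items, batch_size):
    """Yield consecutive batches of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example with captured output instead of a live GPU:
print(parse_gpu_memory("2310, 12288\n"))
print(list(chunked(["a", "b", "c"], 2)))
```

A simple loop can call `query_gpu_memory()` between batches and lower the batch size if used memory approaches the total.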

Reviewed by Fredoline Eruo. See our editorial policy.
