Machine Translation — AI glossary

Machine translation (MT) is the task of automatically translating text from one natural language to another, today almost always with a neural network model. Operators encounter MT when running dedicated models such as NLLB-200 or M2M-100, or fine-tunes of general-purpose LLMs such as Llama or Mistral. The model takes a source-language sentence as input and generates the target-language sentence token by token. Dedicated MT models are typically encoder-decoder transformers, though decoder-only LLMs can also translate with appropriate prompting. Key operator concerns are VRAM usage (the weights of NLLB-200 3.3B take roughly 6.6 GB at FP16 and roughly 2 GB at 4-bit), latency (translating a paragraph typically takes seconds on a consumer GPU), and the trade-off between model size, quality, and speed.
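The VRAM figures can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimate only. The function name is illustrative, and it counts weights alone, ignoring activations, the decoder cache, and framework overhead. Real 4-bit formats also store scaling factors, so they use somewhat more than 4 bits per weight.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Weight footprint of a 3.3B-parameter model at common precisions.
for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"3.3B parameters at {label}: ~{weight_memory_gb(3.3e9, bits):.2f} GB")
```

Actual usage at inference time will be higher than these numbers, so leave headroom when deciding whether a model fits on a given card.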

Deeper dive

Modern neural machine translation (NMT) uses transformer architectures. Encoder-decoder models like NLLB-200 and M2M-100 process the source sentence with an encoder, then an autoregressive decoder generates the translation. Decoder-only LLMs (e.g., Llama, Mistral) can also translate via prompting, for example with a few-shot prompt such as "Translate from English to French. English: Hello. French: Bonjour. English: Good morning. French:". However, they may require careful prompt engineering and are often less reliable for low-resource languages. Quantization (e.g., Q4_K_M in the GGUF format) reduces the VRAM footprint but can slightly degrade translation quality, especially for rare words. Operators running MT locally should consider language pair coverage (NLLB-200 supports 200 languages), batch size for throughput, and whether to use CPU offload for models that exceed VRAM. Hugging Face Transformers runs both encoder-decoder MT models and decoder-only LLMs. llama.cpp and vLLM are aimed mainly at decoder-only LLMs, and their support for encoder-decoder architectures varies by version, so check the supported-model list before relying on them for a model like NLLB. Ollama has no dedicated translation mode, but a custom Modelfile can wrap a translation-capable LLM with a fixed prompt.
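Prompting a decoder-only LLM for translation usually works better with a consistent template than with ad hoc strings. The following is a minimal sketch of a few-shot prompt builder. The function name and the exact template are assumptions for illustration, not a format required by any particular model, and a chat-tuned model may also need its own chat template applied on top.

```python
from typing import Iterable, Tuple


def build_translation_prompt(
    text: str,
    source_lang: str,
    target_lang: str,
    examples: Iterable[Tuple[str, str]] = (),
) -> str:
    """Build a few-shot translation prompt that ends where the model should continue."""
    lines = [f"Translate from {source_lang} to {target_lang}."]
    for src, tgt in examples:
        # Each worked example shows the model the expected input/output pattern.
        lines.append(f"{source_lang}: {src}")
        lines.append(f"{target_lang}: {tgt}")
    lines.append(f"{source_lang}: {text}")
    lines.append(f"{target_lang}:")  # left open for the model to complete
    return "\n".join(lines)


print(build_translation_prompt("Good morning", "English", "French", [("Hello", "Bonjour")]))
```

Ending the prompt on the open target-language label keeps the model from adding commentary before the translation. Generation should then be stopped at the first newline.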

Practical example

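An operator wants to translate English to Swahili on an RTX 3060 12GB. They download NLLB-200-distilled-600M from Hugging Face and run it via Transformers: model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M"). This call loads the unquantized checkpoint (roughly 1.2 GB of weights at FP16); a 4-bit copy (~350 MB) requires a separate quantization step or a runtime that supports it. Either way the model fits entirely in VRAM, and as an illustrative figure it translates a 100-word paragraph in ~2 seconds. If they try the 3.3B version (Q4, ~2 GB), it still fits, but latency increases to ~8 seconds. A 7B Llama fine-tune (Q4, ~4 GB) is another option, but it must be prompted correctly, and its Swahili quality should be checked against NLLB rather than assumed to be higher.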
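The Transformers route in this example can be sketched as follows. This is a minimal illustration, not a tuned setup: it assumes the torch and transformers packages are installed, uses the FLORES-200 codes eng_Latn and swh_Latn for English and Swahili, loads FP16 weights on a GPU (FP32 on CPU), and picks max_new_tokens=256 arbitrarily. The function name is hypothetical.

```python
MODEL_NAME = "facebook/nllb-200-distilled-600M"
SOURCE_LANG = "eng_Latn"  # FLORES-200 code for English
TARGET_LANG = "swh_Latn"  # FLORES-200 code for Swahili


def translate_en_to_sw(text: str, max_new_tokens: int = 256) -> str:
    """Translate one English passage to Swahili with NLLB via Transformers."""
    # Imported lazily so the module can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # src_lang tells the tokenizer which language tag to attach to the input.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang=SOURCE_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=dtype).to(device)

    inputs = tokenizer(text, return_tensors="pt").to(device)
    output_ids = model.generate(
        **inputs,
        # Forcing the first decoder token selects the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(TARGET_LANG),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling translate_en_to_sw("Where is the nearest hospital?") downloads the checkpoint on first use. In a long-running script the tokenizer and model would be loaded once and reused rather than reloaded on every call.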

Workflow example

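In a local AI workflow, an operator might run machine translation via a script using Hugging Face Transformers. They load a model like facebook/nllb-200-3.3B, tokenize the source text with the source language code set (e.g., eng_Latn), and call model.generate() with the target language code forced as the first generated token (e.g., fra_Latn for French). llama.cpp is aimed mainly at decoder-only LLMs in GGUF format, so NLLB support should not be assumed; a translation-capable LLM can be prompted with llama-cli -m model.gguf -p "Translate to French: Hello", where model.gguf is a placeholder file name and the binary was called ./main in older builds. For batch translation, vLLM offers high throughput, but its encoder-decoder support is limited and version-dependent, so confirm that the architecture is supported before passing --model facebook/nllb-200-3.3B. Operators monitor VRAM usage with nvidia-smi and adjust maximum sequence length or batch size to avoid out-of-memory (OOM) errors.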
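Adjusting batch size to avoid OOM can be automated. The sketch below is one possible approach, not a feature of any of the tools named above: it translates in batches and halves the batch size whenever the backend raises an out-of-memory error. The names translate_in_batches and translate_fn are hypothetical. translate_fn stands for any callable that translates a list of sentences, and oom_error would be set to torch.cuda.OutOfMemoryError when the backend is PyTorch on a GPU.

```python
from typing import Callable, List, Sequence, Type


def translate_in_batches(
    sentences: Sequence[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 16,
    oom_error: Type[BaseException] = MemoryError,
) -> List[str]:
    """Translate sentences in batches, halving the batch size on out-of-memory errors."""
    results: List[str] = []
    i = 0
    while i < len(sentences):
        batch = list(sentences[i:i + batch_size])
        try:
            results.extend(translate_fn(batch))
            i += len(batch)  # advance only after the batch succeeded
        except oom_error:
            if batch_size == 1:
                raise  # a single sentence does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
    return results


# Stand-in backend for demonstration: fails on batches larger than two sentences.
def fake_translate(batch: List[str]) -> List[str]:
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [s.upper() for s in batch]


print(translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=8))
```

With a real GPU backend it is usually worth freeing cached memory before retrying, and sorting sentences by length so that each batch pads to a similar size.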