T5
T5 (Text-to-Text Transfer Transformer) is a sequence-to-sequence model from Google that casts every NLP task into a text-to-text format. Input and output are always text strings, and the task is signaled by a prefix on the input such as 'translate English to German: ' or 'summarize: '. Operators encounter T5 in Hugging Face Transformers for fine-tuning or inference, and in quantized forms for local deployment. Its encoder-decoder architecture splits the parameters between an encoder that reads the input and a decoder that generates the output. Because most local inference tooling is tuned for decoder-only models, T5 often runs less efficiently on consumer hardware than a decoder-only model of similar parameter count.
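To make the text-to-text framing concrete, here is a minimal sketch of how different tasks reduce to string pairs. The `TextToTextExample` class and `frame` helper are hypothetical names for illustration, the 'sentiment: ' prefix is an assumed example rather than one T5 was necessarily trained with, and the sample sentences are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    source: str  # what the encoder reads, task prefix included
    target: str  # what the decoder is expected to emit


def frame(prefix: str, text: str, target: str) -> TextToTextExample:
    """Cast one task instance into string-in, string-out form."""
    return TextToTextExample(source=f"{prefix}{text}", target=target)


examples = [
    frame("translate English to German: ", "That is good.", "Das ist gut."),
    frame(
        "summarize: ",
        "The council met on Monday and approved the new budget.",
        "Council approves budget.",
    ),
    # Classification is text too: the label is emitted as a word,
    # not as a class index. This prefix is an illustrative assumption.
    frame("sentiment: ", "I loved this film.", "positive"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

The point of the sketch is that nothing about the model interface changes between tasks; only the prefix and the expected output string differ.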
Deeper dive
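T5 was introduced by Google in 2019 and comes in sizes from 60M to 11B parameters. The key innovation is the unified text-to-text framework: every task (translation, summarization, classification, Q&A) is framed as 'input text → output text'. The model is a standard Transformer encoder-decoder that uses relative position biases in place of absolute position embeddings. For operators, the encoder-decoder structure changes the inference profile compared with decoder-only models (like GPT): the encoder runs once over the full input, then the decoder generates output tokens one at a time while cross-attending to the encoder output. Both stacks must be resident in memory, but both are already included in the quoted parameter count, so weight memory scales with parameter count in the usual way. Quantization (e.g., to 8-bit or 4-bit) reduces weight memory but does not remove the encoder pass or the per-token cross-attention. Variants like Flan-T5 are instruction-fine-tuned on many tasks for better zero-shot performance. On consumer GPUs, T5-3B at 8-bit fits in ~8 GB of VRAM, but inference is often slower than a similar-sized Llama model, in part because local runtimes are optimized mainly for decoder-only architectures.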
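As a rough check on VRAM figures like these, the arithmetic for weight memory can be sketched as parameter count times bits per parameter. This is a back-of-the-envelope estimate only: the function name `weight_memory_gb` is hypothetical, the parameter counts are the nominal sizes, and the result excludes activations, the decoder cache, quantization metadata, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Nominal parameter counts; actual checkpoints differ slightly.
models = [("T5-Small", 60e6), ("T5-3B", 3e9), ("T5-11B", 11e9)]

for name, params in models:
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(params, bits):.2f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name:9s} {row}")
```

By this estimate the weights of a 3B model at 8-bit take about 3 GB, which leaves headroom on an 8 GB card for everything the estimate leaves out.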
Practical example
As an illustrative comparison, an RTX 4060 Ti with 16 GB of VRAM can run Flan-T5-XL (3B) at 8-bit quantization (about 3.5 GB) with a 512-token context at roughly 15 tokens/sec. The same card runs Llama 3.1 8B at Q4 (about 5 GB) at roughly 40 tokens/sec. T5 is slower here despite its smaller memory footprint, which likely reflects encoder-decoder overhead and less optimized runtime paths rather than parameter count alone.
Workflow example
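In Hugging Face Transformers, loading T5 for 8-bit inference looks like: from transformers import BitsAndBytesConfig, T5ForConditionalGeneration, T5Tokenizer; tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-xl'); model = T5ForConditionalGeneration.from_pretrained('google/flan-t5-xl', quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map='auto'). The tokenizer does not prepend the task prefix; include it in the input string yourself (e.g., 'summarize: ' followed by the text). In LM Studio, T5 models appear under the 'Text-to-Text' category; selecting one shows the encoder-decoder architecture in the model info panel. Ollama imports models through custom Modelfiles, which typically point at GGUF weights, so a GPTQ checkpoint (e.g., TheBloke/Flan-T5-XL-GPTQ) cannot be used directly and needs a GGUF equivalent, with T5 support depending on the runtime version.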
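The Transformers workflow can be sketched end to end as follows. This is a minimal sketch, not a tuned setup: it assumes `transformers`, `accelerate`, `bitsandbytes`, and `sentencepiece` are installed, it uses `AutoTokenizer` rather than `T5Tokenizer` as a convenience, and `generate_text` and `EXAMPLE_PROMPT` are hypothetical names. The sample text is made up.

```python
# The task prefix is part of the input string; the tokenizer does not add it.
EXAMPLE_PROMPT = (
    "summarize: The council met on Monday and, after a long debate, "
    "approved the new budget for the coming year."
)


def generate_text(
    prompt: str,
    model_name: str = "google/flan-t5-xl",
    max_new_tokens: int = 64,
) -> str:
    """Load a T5 checkpoint in 8-bit and generate output for one prompt."""
    # Imported here so the module can be loaded without the heavy
    # dependencies; assumption: an 8-bit capable GPU setup is present.
    from transformers import (
        AutoTokenizer,
        BitsAndBytesConfig,
        T5ForConditionalGeneration,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )

    # The encoder reads the whole prompt once; generate() then runs the
    # decoder step by step until it stops or hits max_new_tokens.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text(EXAMPLE_PROMPT)` downloads the checkpoint on first use and typically needs a CUDA-capable GPU. For repeated calls, loading the tokenizer and model once outside the function would avoid reloading the weights each time.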
Reviewed by Fredoline Eruo. See our editorial policy.