Transformer & LLM components

Encoder

An encoder is a neural network component that processes input data (text, images, audio) into a dense representation—a vector or sequence of vectors—that captures the input's meaning or structure. In transformer architectures, the encoder uses self-attention to build a context-aware representation of the entire input, which is then passed to a decoder or used directly for tasks like classification or retrieval. Operators encounter encoders in models like BERT (encoder-only), T5 (encoder-decoder), or vision transformers (ViT). Encoder-only models are common for embedding generation, where the output representation is used for semantic search or clustering.

Deeper dive

In the transformer architecture, the encoder consists of a stack of identical layers, each containing a multi-head self-attention sublayer and a feed-forward network. Unlike decoders, which mask future positions, encoders use bidirectional self-attention: each token can attend to every other token in the input sequence. This makes encoders well suited to understanding tasks (e.g., sentiment analysis, named entity recognition) rather than generation. Popular encoder-only models include BERT, RoBERTa, and DistilBERT. Operators running local AI often use encoder models to generate embeddings: the encoder outputs one fixed-size vector per token, which can be pooled (for example, by averaging) into a single representation for the whole input. These embeddings feed into vector databases or indexes (e.g., Chroma, FAISS) for retrieval-augmented generation (RAG) or semantic search. Encoder-decoder models like T5 and BART use the encoder's output to condition the decoder's generation. When quantizing an encoder model, the same VRAM considerations apply as for any other model: BERT-base has roughly 110 million parameters, so its weights take ~440 MB at FP32, ~220 MB at FP16, and ~55 MB at 4 bits per weight, fitting easily on most GPUs.
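The weight-memory arithmetic for an encoder can be sketched in a few lines. This is a rough estimate, not a measurement: it assumes a round figure of 110 million parameters for BERT-base, counts weights only (no activations or runtime overhead), and treats Q4 as a nominal 4 bits per weight. The function name and table are illustrative.

```python
# Rough weight-only memory estimate for an encoder at several precisions.
# Assumptions: ~110 million parameters for BERT-base; activations, runtime
# overhead, and quantization metadata are ignored.

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "Q4": 0.5,  # nominal 4 bits per weight; real formats add some overhead
}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Return approximate weight storage in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


if __name__ == "__main__":
    bert_base_params = 110_000_000  # assumed round figure
    for precision in BYTES_PER_PARAM:
        size = weight_memory_mb(bert_base_params, precision)
        print(f"{precision}: ~{size:.0f} MB")
```

Actual VRAM use at inference time will be higher than these figures, since batch size and sequence length determine how much memory activations need.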

Practical example

An operator running semantic search on local documents might use sentence-transformers/all-MiniLM-L6-v2, an encoder-only model. With Ollama, they pull the model with ollama pull all-minilm and then request an embedding with curl http://localhost:11434/api/embeddings -d '{"model": "all-minilm", "prompt": "Your text here"}'. The returned embedding vector (384 floats) can be stored in a vector database. The model has roughly 23 million parameters, so on a 6 GB VRAM GPU its weights at FP16 use only ~45 MB, leaving room for other tasks.
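The same request can be made from Python with only the standard library. This is a minimal sketch that assumes an Ollama server is listening on localhost:11434 with all-minilm already pulled, and that the response carries the vector under an "embedding" key; the helper names build_request and embed are illustrative, not part of any Ollama client.

```python
import json
import urllib.request

# Endpoint taken from the curl command above; assumes a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_request(text: str, model: str = "all-minilm") -> urllib.request.Request:
    """Build the POST request that mirrors the curl command."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "all-minilm") -> list[float]:
    """Send the request and return the embedding vector.

    Assumes the JSON response has the vector under the "embedding" key.
    """
    with urllib.request.urlopen(build_request(text, model), timeout=30) as resp:
        body = json.load(resp)
    return body["embedding"]
```

Calling embed("Your text here") against a running server should return a list of floats that can be written straight into a vector database. Keeping the request construction separate from the network call makes the payload easy to inspect without a server.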

Workflow example

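In a RAG pipeline, the operator first runs an encoder model to embed all documents, for example with placeholder strings standing in for real documents: python -c "from sentence_transformers import SentenceTransformer; docs = ['first document text', 'second document text']; model = SentenceTransformer('all-MiniLM-L6-v2'); embeddings = model.encode(docs); print(embeddings.shape)". Then, at query time, the same encoder embeds the user's question. The vector store (e.g., ChromaDB) compares the query embedding to the document embeddings using a similarity or distance metric such as cosine similarity and returns the closest matches. The operator can expect latency on the order of ~10-50 ms per embedding on a modern GPU, depending on sequence length.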
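The comparison step at query time can be illustrated without a vector database. The sketch below ranks documents by cosine similarity in plain Python; the three-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and the function names are illustrative. A real store uses an approximate nearest-neighbor index instead of this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy vectors standing in for encoder output (hypothetical values).
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.9, 0.1, 0.0]

ranking = top_k(toy_query, toy_docs)
for doc_index, score in ranking:
    print(f"doc {doc_index}: similarity {score:.3f}")
```

With these toy values, document 0 ranks first and document 1 second, because the query points mostly along the first axis.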

Reviewed by Fredoline Eruo.

When it doesn't work