RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
Transformer & LLM components

Encoder

An encoder is a neural network component that processes input data (text, images, audio) into a dense representation—a vector or sequence of vectors—that captures the input's meaning or structure. In transformer architectures, the encoder uses self-attention to build a context-aware representation of the entire input, which is then passed to a decoder or used directly for tasks like classification or retrieval. Operators encounter encoders in models like BERT (encoder-only), T5 (encoder-decoder), or vision transformers (ViT). Encoder-only models are common for embedding generation, where the output representation is used for semantic search or clustering.
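
As a rough illustration of how self-attention turns per-token vectors into context-aware ones, the toy sketch below mixes every token with every other token and then mean-pools the result into a single vector. It is a simplification under loud assumptions: one head, identity projections instead of learned query/key/value weights, and no layer normalization or feed-forward sublayer. The names `self_attention` and `mean_pool` are hypothetical, not taken from any library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy bidirectional self-attention (assumption: identity Q/K/V projections).

    Every token attends to every token in the input, so each output vector
    is a weighted average of all input vectors.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product score between this token and every token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Weighted sum of all token vectors gives the contextual vector.
        outputs.append([sum(w * v[d] for w, v in zip(weights, tokens))
                        for d in range(dim)])
    return outputs

def mean_pool(vectors):
    """Average per-token vectors into one vector for the whole input."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up 2-d token vectors
contextual = self_attention(tokens)             # one vector per token
pooled = mean_pool(contextual)                  # one vector for the whole input
```

Real encoders stack many such layers with learned weights; the point here is only the shape of the computation: a sequence of vectors in, a sequence of context-mixed vectors out, optionally pooled to one.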

Deeper dive

In the transformer architecture, the encoder consists of a stack of identical layers, each containing a multi-head self-attention sublayer and a feed-forward network. Decoders use causal (masked) self-attention, where each token attends only to earlier positions; encoders use bidirectional self-attention, where each token can attend to every token in the input sequence. This makes encoders well suited to understanding tasks (e.g., sentiment analysis, named entity recognition) rather than generation. Popular encoder-only models include BERT, RoBERTa, and DistilBERT.

Operators running local AI often use encoder models to generate embeddings: the encoder outputs one fixed-size vector per token, and these are typically pooled (for example by averaging) into a single vector for the whole input. These embeddings feed into vector stores (e.g., Chroma, FAISS) for retrieval-augmented generation (RAG) or semantic search. Encoder-decoder models like T5 and BART use the encoder's output to condition the decoder's generation.

When quantizing an encoder model, the same VRAM considerations apply as for larger models. BERT-base has roughly 110M parameters, so its weights take ~440 MB at FP32, ~220 MB at FP16, and on the order of 55 MB at 4-bit before quantization overhead, fitting easily on most GPUs.
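
Weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope calculation, assuming roughly 110M parameters for BERT-base and ignoring activations, runtime buffers, and the per-block scale factors that real 4-bit formats add. The helper name `weight_memory_mb` and the precision labels are hypothetical.

```python
# Approximate bytes per parameter for common precisions (assumption: "q4" is
# treated as exactly 4 bits; real 4-bit formats carry some extra overhead).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weight_memory_mb(n_params, precision):
    """Estimate weight storage in MB (10^6 bytes) for a given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e6

BERT_BASE_PARAMS = 110_000_000  # approximate parameter count, an assumption

for precision in ("fp32", "fp16", "int8", "q4"):
    print(f"{precision}: ~{weight_memory_mb(BERT_BASE_PARAMS, precision):.0f} MB")
```

Actual VRAM use will be somewhat higher than these figures, since the runtime also holds activations and framework overhead.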

Practical example

An operator running semantic search on local documents might use sentence-transformers/all-MiniLM-L6-v2, an encoder-only model. With Ollama, they pull the model with ollama pull all-minilm, then request an embedding with curl http://localhost:11434/api/embeddings -d '{"model": "all-minilm", "prompt": "Your text here"}'. The returned embedding vector (384 floats) can be stored in a vector database. The model has roughly 22M parameters, so its weights take about 45 MB at FP16 (about 90 MB at FP32); on a 6 GB VRAM GPU that leaves nearly all memory free for other tasks.
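
The same request can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; it assumes an Ollama server is listening on localhost:11434 with all-minilm already pulled, and that the response carries the vector under an `embedding` key, as this endpoint does in the Ollama API. The function names are hypothetical. Newer Ollama releases also document an /api/embed endpoint with a different request shape, so check the API docs for your version.

```python
import json
import urllib.request

# Endpoint and model name as used in the curl example above.
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(text, model="all-minilm", url=OLLAMA_URL):
    """Build the POST request without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(text, model="all-minilm", url=OLLAMA_URL):
    """Send the request and return the embedding as a list of floats.

    Requires a running Ollama server; raises urllib.error.URLError otherwise.
    """
    request = build_embedding_request(text, model, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["embedding"]

# Usage (needs the server running):
#   vector = embed("Your text here")
#   print(len(vector))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call happens.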

Workflow example

In a RAG pipeline, the operator first runs an encoder model to embed all documents: python -c "from sentence_transformers import SentenceTransformer; docs = ['first document', 'second document']; model = SentenceTransformer('all-MiniLM-L6-v2'); embeddings = model.encode(docs); print(embeddings.shape)". Then, at query time, the same encoder embeds the user's question; using a different model for queries would place them in an incompatible vector space. The vector store (e.g., ChromaDB) compares the query embedding to the document embeddings using a similarity metric such as cosine similarity and returns the closest matches. The operator sees latency of roughly 10-50 ms per embedding on a modern GPU, depending on sequence length.
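
The query-time comparison can be sketched in plain Python. This is an illustrative brute-force version, not how ChromaDB or FAISS work internally (they use optimized and often approximate indexes). The vectors are made-up 2-dimensional stand-ins for real embeddings, and `top_k` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up stand-ins for document and query embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.1]

results = top_k(query_vec, doc_vecs, k=2)
print(results)  # document 0 ranks first, then document 2
```

Brute-force scoring like this is linear in the number of documents, which is fine for small local collections and is why dedicated vector indexes exist for larger ones.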

Reviewed by Fredoline Eruo. See our editorial policy.
