RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Natural language processing

Word Embedding

A word embedding is a dense vector of floating-point numbers that represents a word (or token) as a point in a high-dimensional space. In practice, every token in a language model's vocabulary has a corresponding embedding vector, typically with 768 to 4096 dimensions. The model learns these vectors during training so that words with related meanings (e.g., 'king' and 'queen') end up close together in that space. Operators encounter embeddings as the first layer of a transformer model: input token IDs are converted to embedding vectors before being processed by attention layers. The embedding dimension usually equals the model's hidden size, so it directly affects VRAM usage and inference speed.
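The lookup and the "close together" property can be illustrated with a minimal sketch. The three-word vocabulary and the 4-dimensional vectors below are hypothetical values made up for illustration; a real model learns its vectors during training and uses far more dimensions.

```python
import math

# Hypothetical toy vocabulary and hand-picked 4-dimensional vectors.
# Real embeddings are learned and have hundreds to thousands of dimensions.
vocab = {"king": 0, "queen": 1, "banana": 2}
embedding_matrix = [
    [0.9, 0.8, 0.1, 0.0],  # king
    [0.9, 0.7, 0.2, 0.1],  # queen
    [0.0, 0.1, 0.9, 0.8],  # banana
]


def embed(token):
    """Return the embedding row for a token: a plain table lookup by token ID."""
    return embedding_matrix[vocab[token]]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_royal = cosine_similarity(embed("king"), embed("queen"))
sim_fruit = cosine_similarity(embed("king"), embed("banana"))
print(f"king vs queen:  {sim_royal:.3f}")
print(f"king vs banana: {sim_fruit:.3f}")
```

With these made-up values, 'king' scores much higher against 'queen' than against 'banana', which is the behavior a trained embedding table is expected to show for related words.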

Deeper dive

Word embeddings are the foundation of how neural networks represent language. Unlike one-hot encoding (a sparse vector as long as the vocabulary, with a single 1), embeddings are learned, dense, and low-dimensional relative to the vocabulary size. A key property is that some semantic relationships show up as approximate vector arithmetic: the classic example is vec('king') - vec('man') + vec('woman') ≈ vec('queen'). In modern LLMs, embeddings are learned jointly with the rest of the model via backpropagation. The embedding matrix has shape [vocab_size, d_model], where d_model is the hidden dimension (e.g., 4096 for Llama 3.1 8B). This matrix alone can be large: for a 128k vocabulary and 4096 dimensions in FP16, it is 128k × 4096 × 2 bytes ≈ 1 GB. Operators should note that embedding lookup is a gather that copies one row per input token, so it is memory-bound rather than compute-bound: it depends on memory bandwidth, not on GPU clock speed.
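The king/queen arithmetic can be sketched with toy numbers. The 3-dimensional vectors below are hypothetical and chosen by hand so that the analogy works exactly; in a trained model the relationship only holds approximately, and the helper name `analogy` is an illustrative choice, not a library function.

```python
import math

# Hypothetical 3-d vectors; axes loosely read as (royalty, maleness, femaleness).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is the usual convention for this kind of analogy test, since the nearest vector to the result is otherwise often one of the inputs.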

Practical example

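When running Llama 3.1 8B (vocab size 128k, d_model=4096) on an RTX 3090 (24 GB VRAM), the embedding layer alone occupies about 1 GB in FP16. Whether quantization shrinks it depends on the format. Weight-only schemes that quantize just the linear layers (GPTQ, AWQ, and bitsandbytes are common examples) typically leave the embedding in 16-bit, so it still uses ~1 GB. llama.cpp GGUF mixes such as Q4_K_M generally quantize the token-embedding tensor as well, which brings it down to roughly 0.3 GB. Either way, the embedding layer is a fixed cost that does not scale with context length, and whatever memory it takes is memory you cannot spend on context.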
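The ~1 GB figure follows from vocab_size × d_model × bytes per weight. The sketch below reproduces that arithmetic; `embedding_bytes` is a hypothetical helper, 128,256 is the exact vocabulary size that the text rounds to 128k, and the 4.5 bits-per-weight row is an assumption added only to show what the same table would occupy if it were stored in a 4-bit block format.

```python
def embedding_bytes(vocab_size, d_model, bits_per_weight):
    """Size of a [vocab_size, d_model] embedding matrix in bytes."""
    return vocab_size * d_model * bits_per_weight / 8


VOCAB_SIZE = 128_256  # Llama 3.1 vocabulary; the text rounds this to 128k
D_MODEL = 4096        # hidden dimension of Llama 3.1 8B

fp16_bytes = embedding_bytes(VOCAB_SIZE, D_MODEL, 16)
# Assumption: roughly 4.5 bits per weight for a 4-bit block-quantized tensor
# (4-bit values plus per-block scale overhead).
q4_bytes = embedding_bytes(VOCAB_SIZE, D_MODEL, 4.5)

for label, size in [("FP16", fp16_bytes), ("~4.5 bpw", q4_bytes)]:
    print(f"{label:>9}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

Printing both decimal GB and binary GiB helps when comparing against tools that report memory in MiB or GiB.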

Workflow example

In llama.cpp, when you load a model, the embedding matrix is loaded as part of the model weights (the token_embd.weight tensor in GGUF files). The load log reports the embedding dimension, not the matrix size, as n_embd (for example, a line containing 'n_embd = 4096'); the exact log prefix varies between versions. In Hugging Face Transformers, the embedding layer is accessed via model.get_input_embeddings(). When fine-tuning with LoRA, the embedding layer is usually frozen (not updated): adapters typically target the attention and MLP projections, and training the embedding would mean updating the full vocab_size × d_model matrix. The usual exception is when new tokens are added to the vocabulary.

Reviewed by Fredoline Eruo. See our editorial policy.
