Natural language processing

Word Embedding

A word embedding is a dense vector of floating-point numbers that represents a word (or token) as a point in a high-dimensional space. In practice, every token in a language model's vocabulary has its own embedding vector, typically 768 to 4096 dimensions in commonly run models. The model learns these vectors during training so that words with similar meanings (e.g., 'king' and 'queen') end up close together in that space. Operators encounter embeddings as the first layer of a transformer model: input token IDs are converted to embeddings before being processed by the attention layers. Because the embedding dimension is also the model's hidden width, it directly affects VRAM usage and inference speed.
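The lookup described above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular framework implements it: the three-word vocabulary, the 4-dimensional vectors, and the helper names embed and cosine_similarity are all made up for the example.

```python
import math

# Toy vocabulary and embedding table. The values are invented for
# illustration; a real model learns them during training.
vocab = {"king": 0, "queen": 1, "banana": 2}
embedding_table = [
    [0.90, 0.80, 0.10, 0.00],  # row 0: king
    [0.85, 0.90, 0.15, 0.05],  # row 1: queen
    [0.00, 0.10, 0.90, 0.80],  # row 2: banana
]

def embed(tokens):
    """Map each token to its row in the table (a plain gather, no arithmetic)."""
    return [embedding_table[vocab[t]] for t in tokens]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king, queen, banana = embed(["king", "queen", "banana"])
# With these toy values, 'king' is closer to 'queen' than to 'banana'.
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```

In a real model the table has one row per vocabulary entry and thousands of columns, but the operation is the same: a token ID selects a row.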

Deeper dive

Word embeddings are the foundation of how neural networks represent language. Unlike one-hot encoding (a sparse vector as long as the vocabulary, with a single 1), embeddings are learned, dense, and far lower-dimensional than the vocabulary size. A well-known property is that some semantic relationships show up as vector arithmetic: the classic example is vec('king') - vec('man') + vec('woman') ≈ vec('queen'). This was demonstrated with standalone embedding models such as word2vec and holds only approximately. In modern LLMs, embeddings are typically learned jointly with the rest of the model via backpropagation. The embedding matrix has shape [vocab_size, d_model], where d_model is the hidden dimension (e.g., 4096 for Llama 3.1 8B). This matrix alone can be large: for a 128k vocabulary (128,256 tokens for Llama 3.1) and 4096 dimensions, it's 128k × 4096 × 2 bytes (if FP16) ≈ 1 GB. Operators should note that embedding lookup is a memory-bound operation, not a compute-bound one: it reads one row per input token and does no arithmetic, so it depends on memory bandwidth rather than GPU clock speed, and its cost is small next to the attention and feed-forward layers.
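The king/queen arithmetic can be reproduced on hand-made vectors. The sketch below assumes a 2-dimensional space where one axis loosely stands for "royalty" and the other for "gender"; the numbers, the extra word 'apple', and the helper names cosine and analogy are hypothetical and chosen only so the example works. Real embeddings have no such labeled axes, and the analogy holds only approximately.

```python
import math

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# Purely illustrative; not taken from any trained model.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.8, 0.05],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is the usual convention for this test; without it, the nearest vector is often one of the inputs.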

Practical example

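When running Llama 3.1 8B (vocab size 128k, d_model=4096) on an RTX 3090 (24 GB VRAM), the embedding layer alone occupies about 1 GB in FP16. Whether that shrinks under quantization depends on the scheme. llama.cpp's Q4_K_M recipe generally quantizes the token embedding tensor along with the other weights (commonly to Q4_K, about 4.5 bits per weight), which brings it down to roughly 0.3 GB. Some quantization formats and conversion settings keep embeddings at 16-bit precision for accuracy, in which case the ~1 GB remains. The per-tensor types listed in the model file or the load log show which case applies. Either way, the embedding layer is a fixed cost that does not depend on context length, so whatever it occupies is subtracted from the memory left for the KV cache, and that limits how much context you can fit.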
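The footprint is simple arithmetic once the storage type of the embedding tensor is known. The sketch below compares 16-bit storage with roughly 4.5 bits per weight, which is the nominal rate of llama.cpp's Q4_K type; whether a given file actually stores its embeddings that way is an assumption to check per model. The helper name tensor_bytes is made up for the example.

```python
def tensor_bytes(rows, cols, bits_per_weight):
    """Storage for a rows x cols weight matrix at the given bit width."""
    return rows * cols * bits_per_weight / 8

VOCAB_SIZE = 128_256  # Llama 3.1 vocabulary size (the "128k" in the text)
D_MODEL = 4096        # hidden dimension of Llama 3.1 8B

fp16 = tensor_bytes(VOCAB_SIZE, D_MODEL, 16)
q4 = tensor_bytes(VOCAB_SIZE, D_MODEL, 4.5)  # assumed ~4.5 bits/weight

for label, size in (("16-bit", fp16), ("~4.5-bit", q4)):
    print(f"{label}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
# 16-bit: 1.05 GB (0.98 GiB)
# ~4.5-bit: 0.30 GB (0.28 GiB)
```

The same function works for any other weight matrix, which makes it handy for budgeting VRAM before downloading a model.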

Workflow example

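In llama.cpp, when you load a model, the embedding matrix is loaded as one of the model's weight tensors (token_embd.weight in GGUF files). Depending on the build and offload settings, it may be kept in host memory rather than VRAM. The load log reports the embedding dimension as n_embd (4096 for this model); the exact log prefix varies between versions. Note that this figure is the vector width, not the size of the matrix in bytes. In Hugging Face Transformers, the embedding layer is accessed via model.get_input_embeddings(). When fine-tuning with LoRA, the embedding layer is usually frozen (not updated): adapters are typically attached to the attention and feed-forward projections, and training the embeddings means updating the full matrix, which is mostly done only when new tokens are added to the vocabulary.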
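The cost argument for freezing the embeddings can be made concrete by counting trainable parameters. The sketch below assumes a LoRA rank of 16 applied to a single square 4096 × 4096 projection; both choices are illustrative rather than recommendations, and the helper names are made up for the example.

```python
def lora_params(out_features, in_features, rank):
    """Trainable parameters for one LoRA adapter: two low-rank factors,
    A with shape [rank, in_features] and B with shape [out_features, rank]."""
    return rank * (in_features + out_features)

def full_matrix_params(rows, cols):
    """Trainable parameters when a whole matrix is updated directly."""
    return rows * cols

VOCAB_SIZE = 128_256  # Llama 3.1 vocabulary size
D_MODEL = 4096        # hidden dimension of Llama 3.1 8B
RANK = 16             # assumed LoRA rank, for illustration only

adapter = lora_params(D_MODEL, D_MODEL, RANK)
embedding = full_matrix_params(VOCAB_SIZE, D_MODEL)

print(f"one LoRA adapter:      {adapter:,} parameters")
print(f"full embedding matrix: {embedding:,} parameters")
print(f"ratio: about {embedding // adapter:,}x")
```

Under these assumptions, training the embedding matrix directly adds several thousand times more trainable parameters than one adapter, along with the matching gradient and optimizer state.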

Reviewed by Fredoline Eruo. See our editorial policy.
