Positional Encoding
Positional encoding is the technique transformer models use to inject information about the position of each token in a sequence. Unlike recurrent neural networks, transformers process all tokens in parallel and have no built-in notion of order. In the original design, positional encodings are added to the token embeddings before they enter the model's attention layers, allowing the model to distinguish between "the cat sat on the mat" and "the mat sat on the cat." Common implementations use fixed sinusoidal functions (as in the original Transformer paper) or learned embeddings. In local AI, positional encoding affects how well a model handles long contexts; many newer models use Rotary Position Embedding (RoPE), which rotates the query and key vectors inside attention instead of adding a vector to the embeddings, and which tends to extend to longer sequences more gracefully.
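As a rough illustration of the additive scheme, the sketch below computes the sinusoidal encoding from the original Transformer paper and adds it to a list of token embeddings. The function names (sinusoidal_encoding, add_positions) are made up for this example, and real models do this with tensor libraries, not Python lists.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Fixed sinusoidal encoding for one position (illustrative helper)."""
    enc = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions shares one frequency; frequencies
        # decrease geometrically as the dimension index grows.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))  # even dimension
        enc.append(math.cos(angle))  # odd dimension
    return enc[:d_model]

def add_positions(token_embeddings: list[list[float]]) -> list[list[float]]:
    """Add the encoding for position 0, 1, 2, ... to each embedding."""
    d_model = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(emb, sinusoidal_encoding(pos, d_model))]
        for pos, emb in enumerate(token_embeddings)
    ]

# The same (all-zero) embedding placed at two positions ends up different.
same_token_twice = [[0.0] * 4, [0.0] * 4]
with_positions = add_positions(same_token_twice)
print(with_positions[0] != with_positions[1])
```

Because the encoding depends only on the position index, identical tokens at different positions reach the attention layers as different vectors.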
Deeper dive
The original Transformer paper (Vaswani et al., 2017) used fixed sinusoidal positional encodings: each pair of dimensions carries a sine and a cosine of the position at its own frequency, with frequencies decreasing geometrically across dimensions. The authors hypothesized that this makes relative positions easy to learn, because for any fixed offset k the encoding at position p+k can be represented as a linear function of the encoding at position p. Later works introduced learned positional embeddings (e.g., BERT), where each position has a trainable vector. These cannot represent positions beyond the maximum sequence length fixed at training time. Rotary Position Embedding (RoPE), used in Llama, Mistral, and many modern models, rotates the query and key vectors by an angle proportional to position, so the attention score between two tokens depends on their relative offset. In local AI, the choice of positional encoding influences how well a model handles context windows beyond its training length: RoPE-based models can often be used with extended context via techniques like NTK-aware scaling or YaRN, though output quality can degrade as the window is stretched.
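The relative-position property of RoPE can be checked numerically. The following is a minimal sketch, assuming adjacent dimensions are paired and a frequency base of 10000; production implementations may pair dimensions differently, and the rope and dot helpers here are illustrative names, not a library API.

```python
import math

def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each adjacent pair of `vec` by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # angle grows linearly with position
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of (x, y)
    return out

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 tokens apart, so the scores should agree
# up to floating-point error.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

Shifting both positions by the same amount leaves the score unchanged, which is the sense in which RoPE encodes relative position directly in the attention computation.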
Practical example
Llama 3.1 8B, which supports a 128K context window, uses RoPE. If you load a model with learned absolute positional embeddings (like the original BERT, which is limited to 512 positions) into a local runtime and pass a sequence longer than its maximum trained length, the runtime will either truncate the input or raise an error. With a RoPE model, you can often increase the context window by adjusting the RoPE scaling factor in llama.cpp (e.g., --rope-scale 2.0 for roughly 2x the trained context).
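One way to picture what a scaling factor does is linear scaling (position interpolation), where positions are divided by the factor before the rotation angles are computed. The sketch below assumes that linear scheme and a base of 10000; it is not llama.cpp's code, other scaling methods such as YaRN behave differently, and rope_angles is a hypothetical helper.

```python
def rope_angles(position: int, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> list[float]:
    """Rotation angles for one position under linear RoPE scaling (assumed)."""
    scaled_position = position / scale  # compress positions by the scale factor
    return [scaled_position * base ** (-i / head_dim)
            for i in range(0, head_dim, 2)]

# With scale 2.0, position 16000 gets the same angles that position 8000
# gets without scaling, so angles stay inside the range seen in training.
scaled = rope_angles(16000, head_dim=8, scale=2.0)
unscaled = rope_angles(8000, head_dim=8)
print(scaled == unscaled)
```

The trade-off in this scheme is that neighbouring tokens end up closer together in angle, which is one reason extended contexts may lose some precision without fine-tuning.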
Workflow example
In llama.cpp, when loading a model that uses RoPE, you can set --rope-scale and --rope-freq-base to extend the context window. For example, to run a 7B model trained with an 8K context at 32K, you might pass --rope-scale 4.0 and raise the context size to match (-c 32768). In Hugging Face Transformers, RoPE is applied inside the attention layers; the position_ids tensor can be passed to the model explicitly and is otherwise generated automatically. In LM Studio, the context length slider adjusts the maximum sequence length, and the software handles RoPE scaling automatically for supported models.
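The arithmetic behind the scale factor is just the ratio of target to trained context length. The helper below is a hypothetical convenience (not part of llama.cpp) that builds the corresponding command-line arguments, assuming linear scaling and the -c and --rope-scale flags.

```python
def llama_cpp_context_args(trained_ctx: int, target_ctx: int) -> list[str]:
    """Build context-related llama.cpp arguments (illustrative helper)."""
    if target_ctx <= trained_ctx:
        # No extension needed: only set the context size.
        return ["-c", str(target_ctx)]
    scale = target_ctx / trained_ctx  # e.g. 32768 / 8192 = 4.0
    return ["-c", str(target_ctx), "--rope-scale", f"{scale:g}"]

args = llama_cpp_context_args(trained_ctx=8192, target_ctx=32768)
print(" ".join(args))
```

Check the model card for the trained context length before choosing a factor, since models that already ship with extended-context settings may not need any manual scaling.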
Reviewed by Fredoline Eruo. See our editorial policy.