Transformer & LLM components

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is a subword tokenization algorithm. Its vocabulary is built by iteratively merging the most frequent adjacent pairs of symbols (characters or bytes) in a training corpus, and the learned merges are then used to split new text into tokens. In local AI, tokenization converts raw text into the token IDs that the model processes. BPE produces a fixed-size vocabulary (e.g., 32,000 or 128,000 tokens) that balances coverage of common words with the ability to represent rare or unknown words as subword units. Operators encounter BPE when loading tokenizer files (e.g., tokenizer.json, or vocab.json together with merges.txt) that define the vocabulary and merge rules. Token count directly affects context window usage and inference speed: a prompt that splits into more tokens fills more of the context window and takes longer to process, and generation time grows with the number of output tokens.
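To make the tokenizer file less abstract, here is a minimal sketch that summarizes an already-parsed tokenizer.json. It assumes the Hugging Face layout in which a BPE tokenizer stores its vocabulary under model.vocab and its merge rules under model.merges; other tokenizer types and other file formats are laid out differently. The function names are illustrative, not part of any library.

```python
import json


def load_tokenizer_json(path):
    """Read a tokenizer.json file from disk and return it as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_tokenizer(config):
    """Count vocabulary entries and merge rules in a parsed tokenizer.json.

    Assumes the Hugging Face BPE layout: config["model"]["vocab"] maps
    token strings to IDs and config["model"]["merges"] lists merge rules.
    Missing keys are treated as empty rather than raising an error.
    """
    model = config.get("model", {})
    return {
        "model_type": model.get("type"),
        "vocab_size": len(model.get("vocab", {})),
        "num_merges": len(model.get("merges", [])),
        "added_tokens": len(config.get("added_tokens", [])),
    }


# Usage (the path is a placeholder for a file you have downloaded):
# print(summarize_tokenizer(load_tokenizer_json("tokenizer.json")))
```

The vocabulary size and the number of merges are reported separately because they are not the same quantity: the vocabulary also contains the base symbols that the merges start from.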

Deeper dive

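BPE was originally a data compression technique, adapted for NLP by Sennrich et al. (2016). The algorithm starts with a vocabulary of individual characters (or bytes) and counts all adjacent pairs in the training corpus. It then merges the most frequent pair into a new token, adds it to the vocabulary, and repeats until a target vocabulary size is reached. The resulting merge operations are stored as an ordered list of rules. During tokenization, the input text is split into characters (or bytes), and the merge rules are applied in the order they were learned, so merges learned earlier (from more frequent pairs) take priority. Modern LLMs such as GPT-2, GPT-4, and Llama 3 use a byte-level variant of BPE whose base vocabulary is the 256 possible byte values, so any Unicode text can be encoded without an unknown token. Earlier Llama models and Mistral use SentencePiece-style BPE, which falls back to bytes for characters outside the vocabulary. Most implementations still apply a pre-tokenization step (for example, a regex split on whitespace and punctuation) before merging. The tokenizer is typically a separate file (e.g., tokenizer.json in Hugging Face format) that contains the merge rules and special tokens. Operators should know that tokenization is not always exactly reversible: tokenizers that normalize text or add a prefix space can map different inputs to the same token sequence, so decoding may not reproduce the original string byte for byte. The tokenizer's behavior also affects model performance on non-English text and code, which often split into more tokens per word.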
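The training loop and the merge-in-learned-order tokenization described above can be sketched in a few lines. This is a toy character-level version for illustration only: it works on a list of whole words, skips pre-tokenization and byte-level handling, and the function names (learn_bpe, merge_pair, tokenize) are made up for this example rather than taken from any tokenizer library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from single characters; identical words are counted once with a frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def tokenize(word, merges):
    """Split `word` into characters, then apply merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(merges)                      # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(tokenize("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it is still tokenized into two known subwords. That is the property that lets BPE represent rare or unseen words.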

Practical example

When you download a model like Mistral-7B-v0.1, the tokenizer file tokenizer.json defines a 32,000-token BPE vocabulary and the merge rules that produce it. The word 'hello' might tokenize as ['hel', 'lo'] (two tokens) while 'Hello' might be ['Hello'] (one token), because tokenization is case-sensitive. A 4K context window holds 4,096 tokens, shared between prompt and response, so a 100-word paragraph (~130 tokens) leaves room for a response of roughly 3,966 tokens. If the tokenizer splits words into more tokens, as often happens with non-English text or code, the same text consumes more of the window and the effective context shrinks.
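The context budget arithmetic can be written out as a small helper. This is a rough sketch: it assumes a 4K window of exactly 4,096 tokens and uses the ratio of about 1.3 tokens per English word implied by the example above, which is only an approximation. For a real budget, count tokens with the model's own tokenizer. The function names are illustrative.

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token estimate from a word count (the 1.3 ratio is an assumption)."""
    return round(word_count * tokens_per_word)


def remaining_tokens(context_window, prompt_tokens):
    """Tokens left for the response after the prompt is placed in the window."""
    if prompt_tokens > context_window:
        raise ValueError("prompt does not fit in the context window")
    return context_window - prompt_tokens


prompt = estimate_tokens(100)          # 100-word paragraph
print(prompt)                          # 130
print(remaining_tokens(4096, prompt))  # 3966
```

Raising tokens_per_word, as a tokenizer that splits text more finely effectively does, shrinks the space left for the response.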

Workflow example

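In llama.cpp, tokenization happens automatically when you run ./main -m model.gguf -p "Hello world" (the binary is named llama-cli in recent builds). The runtime loads the tokenizer from the GGUF file and converts the prompt to token IDs. You can inspect tokenization with the bundled tokenize tool (llama-tokenize in recent builds), for example llama-tokenize -m model.gguf -p "Hello world", which prints the tokens and their IDs. The exact flags vary by version, so check --help. In Ollama, the tokenizer is embedded in the model file, so you don't interact with it directly. In Hugging Face Transformers, you can call tokenizer.tokenize("Hello world") to see the subword splits, or tokenizer.encode("Hello world") to get the token IDs.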
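For the Hugging Face route, a small helper can pair each subword with its ID. The sketch below relies only on the tokenize and convert_tokens_to_ids methods that Transformers tokenizers provide; the helper name inspect_tokenization is made up for this example, and the commented usage assumes the transformers package is installed and that the "gpt2" tokenizer (chosen only as a small, freely available example) can be downloaded.

```python
def inspect_tokenization(tokenizer, text):
    """Return (token, id) pairs for `text` using a Hugging Face style tokenizer."""
    tokens = tokenizer.tokenize(text)              # subword strings
    ids = tokenizer.convert_tokens_to_ids(tokens)  # matching vocabulary IDs
    return list(zip(tokens, ids))


# Usage (requires the transformers package and network access on first run):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# for token, token_id in inspect_tokenization(tokenizer, "Hello world"):
#     print(token_id, repr(token))
```

Byte-level BPE tokenizers such as GPT-2's display a leading space as the character 'Ġ' inside the token string, so a token that looks odd in this output is usually just a word with its preceding space attached.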

Reviewed by Fredoline Eruo. See our editorial policy.
