Transformer & LLM components

Tokenization

Tokenization is the process of converting text into the numeric tokens a model can process. Modern systems use subword tokenization (BPE, WordPiece, or SentencePiece), which strikes a balance between character-level tokenization (fully general, but very long sequences) and word-level tokenization (short sequences, but breaks on out-of-vocabulary words).
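As a concrete illustration, the sketch below encodes a sentence into token IDs and decodes each ID back into its subword piece. It assumes the tiktoken library is installed and uses its cl100k_base BPE vocabulary purely as an example; any BPE tokenizer would show the same idea.

```python
import tiktoken

# cl100k_base is a BPE vocabulary shipped with tiktoken (GPT-4-era models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
ids = enc.encode(text)                   # text -> integer token IDs
pieces = [enc.decode([i]) for i in ids]  # decode each ID to see its subword piece

print(ids)     # e.g. a list of integers; exact IDs depend on the vocabulary
print(pieces)  # e.g. ['Token', 'ization', ' splits', ...]
```

Note how common fragments become single tokens while rarer words are split into several pieces, which is exactly the subword trade-off described above.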

The tokenizer is part of the model: different models tokenize the same string differently. "OpenAI's GPT" might be 3 tokens under GPT-4's tokenizer and 4 under Llama's. This affects both the context budget and output speed: a more efficient tokenizer packs more text into each token, so the same context window holds more content and each generated token covers more text.
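A quick way to see this is to count tokens for the same string under two different vocabularies. The sketch below compares two encodings that ship with tiktoken (an assumption, chosen because they need no model download); the exact counts are illustrative and will differ from the GPT-4-vs-Llama example above.

```python
import tiktoken

text = "OpenAI's GPT"

old_bpe = tiktoken.get_encoding("gpt2")         # GPT-2/3-era vocabulary
new_bpe = tiktoken.get_encoding("cl100k_base")  # GPT-4-era vocabulary

# The same string usually maps to a different number of tokens under each vocabulary.
print(len(old_bpe.encode(text)), old_bpe.encode(text))
print(len(new_bpe.encode(text)), new_bpe.encode(text))
```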

Special tokens (<bos>, <eos>, <|im_start|>, etc.) signal structure to the model — chat formatting, instruction boundaries, end-of-sequence. Misformatted special tokens are a common cause of garbled output in DIY inference setups.
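Rather than concatenating these markers by hand, chat-capable tokenizers can render a conversation with the correct special tokens for you. The sketch below uses the apply_chat_template method from Hugging Face transformers; the model name is illustrative, and the exact special tokens in the rendered prompt depend on that model's template.

```python
from transformers import AutoTokenizer

# Illustrative model name; any instruct model with a chat template works similarly.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Render the conversation with the model's own special tokens, avoiding the
# misformatted boundaries that cause garbled output in hand-rolled setups.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```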
