17. Tokenizer Impact on Quality
The tokenizer is often overlooked, but it significantly affects model quality. Understanding a tokenizer's properties helps you predict how a model will behave on your specific text.
What tokenizers do:
A tokenizer converts text to tokens (integers the model processes) and back. Different tokenizers produce different token sequences for the same text.
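To make the round trip concrete, here is a minimal sketch of a toy greedy longest-match tokenizer. It is not how any production tokenizer is implemented; `build_tokenizer` and both vocabularies are hypothetical, chosen only to show that two vocabularies can encode the same text into different sequences that decode back to the same string.

```python
def build_tokenizer(vocab):
    """Return (encode, decode) for a toy greedy longest-match tokenizer."""
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    max_len = max(len(tok) for tok in vocab)

    def encode(text):
        ids, pos = [], 0
        while pos < len(text):
            # Try the longest candidate piece first, then shrink.
            for size in range(min(max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + size]
                if piece in token_to_id:
                    ids.append(token_to_id[piece])
                    pos += size
                    break
            else:
                raise ValueError(f"no token covers {text[pos]!r}")
        return ids

    def decode(ids):
        return "".join(vocab[i] for i in ids)

    return encode, decode


# Hypothetical vocabularies: single characters only vs. characters plus merged pieces.
chars = sorted(set("tokenization"))
encode_a, decode_a = build_tokenizer(chars)
encode_b, decode_b = build_tokenizer(chars + ["token", "ization"])

text = "tokenization"
ids_a = encode_a(text)
ids_b = encode_b(text)
print(len(ids_a), len(ids_b))  # character-level needs many more tokens
print(decode_a(ids_a) == text, decode_b(ids_b) == text)
```

Both tokenizers are lossless here, but the one with merged pieces represents the word in far fewer tokens.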
Comparing tokenization:
# Example: the same sentence under different tokenizers
text = "Understanding AI models requires knowledge of tokenization."
# Token counts differ by tokenizer (illustrative figures)
# Llama tokenizer:                      9 tokens
# cl100k_base (GPT-4):                 11 tokens
# SentencePiece model (vocab-dependent): 8 tokens
Why token count matters:
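- Context efficiency: the more tokens the input takes, the less of the context window is left for the rest of the prompt and for output
- Generation speed: output is produced one token at a time, so more tokens for the same text means more generation steps
- Cost (for API): pricing is per token, so more tokens means higher cost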
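The arithmetic behind these three points can be sketched as follows. The context window, prompt sizes, and price are hypothetical figures picked for illustration, and `budget` is a helper name introduced here, not part of any API.

```python
def budget(context_window, prompt_tokens, price_per_1k_tokens):
    """Return (tokens left for output, prompt cost) for a single request."""
    remaining = max(context_window - prompt_tokens, 0)
    cost = prompt_tokens / 1000 * price_per_1k_tokens
    return remaining, cost


# Hypothetical: the same document costs 6000 tokens under one tokenizer
# and 7500 under another, with an 8192-token window and $0.01 per 1k tokens.
compact = budget(context_window=8192, prompt_tokens=6000, price_per_1k_tokens=0.01)
verbose = budget(context_window=8192, prompt_tokens=7500, price_per_1k_tokens=0.01)
print("compact tokenizer:", compact)
print("verbose tokenizer:", verbose)
```

Under these assumed numbers, the less efficient tokenizer leaves less than a third of the output room and costs 25% more for the same document.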
Common tokenizer types:
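| Tokenizer | Used by | Characteristics |
|---|---|---|
| tiktoken (cl100k_base) | GPT-4, GPT-3.5 Turbo | Byte-level BPE, ~100k vocab; efficient on code and English |
| SentencePiece | Llama 2, Mistral 7B | Trains on raw text without pre-tokenization; language-agnostic |
| BPE | Many models, including those above | Algorithm that builds the vocabulary by merging frequent symbol pairs |
| WordPiece | BERT-family models | ~30k vocab in BERT; greedy longest-match splits with `##` continuation pieces |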
Vocabulary size effects:
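Small vocab (<30k): Smaller embedding and output layers, but text splits into more, shorter tokens
Large vocab (>50k): Fewer tokens per text and better coverage of diverse scripts, at the cost of larger embedding and output layers
A tokenizer with a 32k vocab needs ~2 bytes per token for vocabulary indices, since 32k fits in 16 bits. At 100 tokens/second, that is 200 bytes/second of overhead, which is negligible.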
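The index-size arithmetic, and the place where vocabulary size does cost something (the embedding table), can be checked with a short sketch. The hidden dimension of 4096 and the helper names are assumptions for illustration.

```python
def index_bytes(vocab_size):
    """Smallest whole number of bytes that can hold any token ID in [0, vocab_size)."""
    bits = max(1, (vocab_size - 1).bit_length())
    return (bits + 7) // 8


def embedding_params(vocab_size, hidden_dim):
    """Parameters in a vocab_size x hidden_dim embedding table."""
    return vocab_size * hidden_dim


for vocab_size in (32_000, 128_000):
    per_token = index_bytes(vocab_size)
    print(
        vocab_size,
        f"{per_token} bytes/token",
        f"{per_token * 100} bytes/s at 100 tokens/s",
        f"{embedding_params(vocab_size, hidden_dim=4096):,} embedding params",
    )
```

The index overhead stays tiny at any realistic vocabulary size; the embedding table, which grows linearly with the vocabulary, is the part that scales.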
Non-English efficiency:
Tokenizers trained primarily on English are inefficient for other languages:
# Example: English vs. Chinese tokenization (`tokenizer` is any object with an encode() method)
english_text = "The quick brown fox"
english_tokens = tokenizer.encode(english_text)
# 5 tokens for 19 characters (3.8 chars/token)
chinese_text = "??????"  # six Chinese characters; the original string did not survive encoding
chinese_tokens = tokenizer.encode(chinese_text)
# 4 tokens for 6 characters (1.5 chars/token)
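By this measure, an English-centric tokenizer spends roughly 2.5x more tokens per character on Chinese than on English (3.8 / 1.5 ≈ 2.5), which eats into the context window. Because one Chinese character usually carries more meaning than one Latin letter, comparing token counts on translated (parallel) text gives a fairer estimate of the real cost.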
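One likely contributor, for tokenizers that operate on UTF-8 bytes, is that common Chinese characters occupy three bytes each while ASCII letters occupy one. Unless the vocabulary has learned merges covering those bytes, each character can cost several tokens. The sketch below only measures bytes per character; the sample strings are illustrative and no real tokenizer is involved.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes used per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


english = "The quick brown fox"
chinese = "你好世界"  # "hello world", four characters

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(chinese))  # 3.0
```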
Testing tokenizer efficiency:
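def test_tokenizer_efficiency(tokenizer, texts_by_language, baseline="english"):
    """Compare characters per token across languages, relative to a baseline language."""
    results = {}
    for language, texts in texts_by_language.items():
        total_chars = sum(len(t) for t in texts)
        total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
        results[language] = {
            "chars_per_token": total_chars / total_tokens,
            "total_tokens": total_tokens,
        }
    # A ratio below 1.0 means the language tokenizes less efficiently than the baseline.
    # The baseline key must be present in texts_by_language.
    baseline_cpt = results[baseline]["chars_per_token"]
    for stats in results.values():
        stats["efficiency_ratio"] = stats["chars_per_token"] / baseline_cpt
    return results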
Compare token counts for English, code, and a non-English language across 3 models. Note how tokenizer efficiency varies with content type.
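A possible starting point for this exercise is sketched below. The three encoders are deliberately simple stand-ins (whitespace, character, and UTF-8 byte splitting) so the script runs without downloading anything; they are assumptions for illustration, and you would replace them with the `encode` methods of the real tokenizers you want to compare. The sample texts are also placeholders.

```python
def compare_token_counts(encoders, samples):
    """Return {encoder_name: {sample_name: token_count}}."""
    return {
        name: {label: len(encode(text)) for label, text in samples.items()}
        for name, encode in encoders.items()
    }


# Stand-in encoders; swap in real tokenizers' encode callables here.
encoders = {
    "whitespace": lambda text: text.split(),
    "characters": lambda text: list(text),
    "utf8_bytes": lambda text: list(text.encode("utf-8")),
}

samples = {
    "english": "The quick brown fox",
    "code": "for i in range(10): print(i)",
    "chinese": "你好世界",
}

counts = compare_token_counts(encoders, samples)
for name, row in counts.items():
    print(name, row)
```

Even with these crude stand-ins, the ranking of content types changes from one encoder to the next, which is the pattern to look for with real tokenizers.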