Tokenizer Impact on Quality — Understanding AI Models (Chapter 17)

The tokenizer is often overlooked but affects model quality significantly. Understanding tokenizer properties helps you predict model behavior on your specific text.

What tokenizers do:

A tokenizer converts text to tokens (integers the model processes) and back. Different tokenizers produce different token sequences for the same text.
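
To make the round trip concrete, here is a minimal sketch with two toy tokenizers. `CharTokenizer` and `WordTokenizer` are hypothetical classes written for this illustration; they are not the tokenizers of any real model, and they build their vocabulary from the sample text itself.

```python
class CharTokenizer:
    """Toy tokenizer (hypothetical): one token per character."""

    def __init__(self, corpus):
        self.vocab = sorted(set(corpus))
        self.ids = {piece: i for i, piece in enumerate(self.vocab)}

    def encode(self, text):
        return [self.ids[ch] for ch in text]

    def decode(self, tokens):
        return "".join(self.vocab[i] for i in tokens)


class WordTokenizer:
    """Toy tokenizer (hypothetical): one token per space-separated word."""

    def __init__(self, corpus):
        self.vocab = sorted(set(corpus.split(" ")))
        self.ids = {piece: i for i, piece in enumerate(self.vocab)}

    def encode(self, text):
        return [self.ids[word] for word in text.split(" ")]

    def decode(self, tokens):
        return " ".join(self.vocab[i] for i in tokens)


text = "Understanding AI models requires knowledge of tokenization."
char_tok = CharTokenizer(text)
word_tok = WordTokenizer(text)

char_ids = char_tok.encode(text)
word_ids = word_tok.encode(text)

# Same text, two very different token sequences.
print(len(char_ids), len(word_ids))
# Both tokenizers decode back to the original text.
print(char_tok.decode(char_ids) == text, word_tok.decode(word_ids) == text)
```

Real tokenizers sit between these two extremes: they split text into subword pieces, so sequences are shorter than character-level output while the vocabulary stays bounded.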

Comparing tokenization:

# Example with different tokenizers
text = "Understanding AI models requires knowledge of tokenization."

# Token count varies by tokenizer (illustrative counts; exact values depend on the tokenizer version)
# llama_tokenizer:     9 tokens
# cl100k_base (GPT-4): 11 tokens
# sentencepiece:       8 tokens

Why token count matters:

Context efficiency: More prompt tokens = less of the context window left for output
Generation speed: More tokens = more generation steps
Cost (for API): More tokens = higher cost
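
The arithmetic behind these three points can be sketched as follows. The context window size, the prices, and the two token counts are hypothetical numbers chosen only for illustration; substitute the values for your own model and provider.

```python
def remaining_output_budget(context_window, prompt_tokens):
    """Tokens left for generation once the prompt occupies the context window."""
    return max(context_window - prompt_tokens, 0)


def request_cost(prompt_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request when input and output are billed per 1,000 tokens."""
    return (prompt_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


CONTEXT_WINDOW = 8192      # hypothetical context window
PRICE_IN = 0.01            # hypothetical price per 1k prompt tokens
PRICE_OUT = 0.03           # hypothetical price per 1k output tokens

# The same document, tokenized by two hypothetical tokenizers.
for name, prompt_tokens in (("tokenizer_a", 900), ("tokenizer_b", 1100)):
    budget = remaining_output_budget(CONTEXT_WINDOW, prompt_tokens)
    cost = request_cost(prompt_tokens, 500, PRICE_IN, PRICE_OUT)
    print(name, budget, round(cost, 4))
```

The less efficient tokenizer leaves less room for output and costs more for the same document, even though the text itself did not change.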

Common tokenizer types:

Tokenizer	Used by	Characteristics
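Tiktoken (cl100k_base)	GPT-4, GPT-3.5 Turbo	Byte-level BPE; good for code, English
SentencePiece	Llama 2, Mistral 7B	Works on raw text without whitespace pre-splitting; language-agnostic
BPE	Many models	Merges frequent symbol pairs; balances vocabulary size against sequence length
WordPiece	Older models (e.g. BERT)	About 30k vocab in BERT; greedy longest-match splits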
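
Since BPE underlies several rows of the table, a minimal sketch of one BPE training step may help: count adjacent symbol pairs across a word-frequency table, then merge the most frequent pair into a new symbol. The tiny corpus and the function names are assumptions made for this illustration; production tokenizers add byte-level handling, pre-tokenization, and many thousands of merges.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical corpus: each word split into characters, with its frequency.
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

best = most_frequent_pair(words)
words_after = merge_pair(words, best)
print(best)
print(words_after)
```

Repeating this step grows the vocabulary by one symbol per merge, which is where the trade-off between vocabulary size and tokens per text comes from.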

Vocabulary size effects:

Small vocab (<30k):  Smaller embedding and output layers, but shorter tokens (more tokens per text)
Large vocab (>50k):  Longer tokens (fewer tokens per text), more efficient for diverse scripts

A tokenizer with a 32k vocabulary needs about 2 bytes per token to store vocabulary indices. At 100 tokens/second, that is 200 bytes/second of overhead, which is negligible.
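
The 2-byte figure can be checked with a short calculation. The sketch below also estimates the size of the embedding table, which is where vocabulary size has a larger effect; the hidden size of 4096 and the 128k comparison vocabulary are assumptions chosen for illustration, not values from any particular model.

```python
def index_bytes(vocab_size):
    """Smallest whole number of bytes that can store any token ID in the vocabulary."""
    bits = max(1, (vocab_size - 1).bit_length())
    return (bits + 7) // 8


def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in an embedding table with one row per token."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 4096          # hypothetical model width
TOKENS_PER_SECOND = 100     # rate used in the text above

for vocab_size in (32_000, 128_000):
    per_token = index_bytes(vocab_size)
    print(
        vocab_size,
        per_token,                                   # bytes per token ID
        per_token * TOKENS_PER_SECOND,               # bytes per second of IDs
        embedding_parameters(vocab_size, HIDDEN_SIZE),
    )
```

The index overhead stays at a few hundred bytes per second either way, while the embedding table grows in direct proportion to the vocabulary.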

Non-English efficiency:

Tokenizers trained primarily on English are inefficient for other languages:

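# Example: English vs. Chinese tokenization with an English-centric tokenizer
# (`tokenizer` is any object with an encode(text) method that returns token IDs)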
english_text = "The quick brown fox"
english_tokens = tokenizer.encode(english_text)
# 5 tokens for 19 characters (3.8 chars/token)

chinese_text = "??????"
chinese_tokens = tokenizer.encode(chinese_text)
# 4 tokens for 6 characters (1.5 chars/token)

This means a model with an English-centric tokenizer needs roughly 2.5x as many tokens per character for Chinese as for English (3.8 / 1.5 ≈ 2.5), which eats into the context window.
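
The effect on the context window can be worked out from the two chars-per-token figures in the example. This is a rough sketch: the 4096-token window is a hypothetical value, and characters per token is only a proxy, since a Chinese character usually carries more meaning than a Latin letter.

```python
ENGLISH_CHARS_PER_TOKEN = 3.8   # from the example above
CHINESE_CHARS_PER_TOKEN = 1.5   # from the example above
CONTEXT_WINDOW = 4096           # hypothetical context window, in tokens


def chars_that_fit(context_window, chars_per_token):
    """Approximate number of characters that fit in the context window."""
    return int(context_window * chars_per_token)


# How many times more tokens each Chinese character costs, on average.
token_ratio = ENGLISH_CHARS_PER_TOKEN / CHINESE_CHARS_PER_TOKEN

print(round(token_ratio, 2))
print(chars_that_fit(CONTEXT_WINDOW, ENGLISH_CHARS_PER_TOKEN))
print(chars_that_fit(CONTEXT_WINDOW, CHINESE_CHARS_PER_TOKEN))
```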

Testing tokenizer efficiency:

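def test_tokenizer_efficiency(tokenizer, texts_by_language, baseline="english"):
    """Compare characters per token across languages, relative to a baseline language.

    texts_by_language maps a language name to a list of strings and must
    contain the baseline key.
    """
    results = {}

    for language, texts in texts_by_language.items():
        total_chars = sum(len(t) for t in texts)
        total_tokens = sum(len(tokenizer.encode(t)) for t in texts)

        results[language] = {
            "chars_per_token": total_chars / total_tokens,
            "total_tokens": total_tokens,
        }

    # 1.0 means parity with the baseline; below 1.0 means the language
    # needs more tokens per character than the baseline does.
    baseline_chars_per_token = results[baseline]["chars_per_token"]
    for stats in results.values():
        stats["efficiency_ratio"] = stats["chars_per_token"] / baseline_chars_per_token

    return results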
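
A check like the one above can be tried without downloading a model by using a stand-in tokenizer. The sketch below uses a hypothetical `ByteTokenizer` that emits one token per UTF-8 byte, plus a small hypothetical helper `chars_per_token`; the sample strings are illustrative. It gives a floor for byte-based tokenizers before any merges are learned, so real tokenizers should score better than this.

```python
class ByteTokenizer:
    """Stand-in tokenizer (hypothetical): one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def chars_per_token(tokenizer, texts):
    """Average number of characters covered by each token."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    return total_chars / total_tokens


samples = {
    "english": ["hello world", "The quick brown fox"],
    "chinese": ["你好世界"],
}

tok = ByteTokenizer()
rates = {language: chars_per_token(tok, texts) for language, texts in samples.items()}

for language, rate in rates.items():
    print(language, round(rate, 2))
```

ASCII text costs one byte per character, while each of these Chinese characters takes three bytes in UTF-8, which is one reason byte-based vocabularies trained mostly on English start at a disadvantage for Chinese.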