Tokenizer mismatch — when input encoding doesn't match the model
Tokenizer errors usually mean the loaded tokenizer doesn't match the model weights, the chat template is wrong, or special tokens (BOS/EOS) weren't preserved through quantization. Verify tokenizer config first.
Diagnostic order — most likely first
Tokenizer files don't match the model checkpoint
Model loaded from one repo, tokenizer from another. Token IDs go out of bounds during inference, and `tokenizer.vocab_size != model.config.vocab_size`.
Always load both from the same repo: `AutoTokenizer.from_pretrained(repo)` and `AutoModelForCausalLM.from_pretrained(repo)` with the same `repo`. Don't mix-and-match versions.
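A minimal sanity check for the Transformers path, assuming a Hub workflow (the repo id below is only an example, substitute your own):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "meta-llama/Meta-Llama-3-8B-Instruct"  # example repo id -- use your own

# Load both artifacts from the SAME repo so token IDs line up with the embedding table.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# len(tokenizer) counts added tokens too; it must not exceed the model's
# embedding table, or out-of-bounds token IDs will crash inference.
assert len(tokenizer) <= model.config.vocab_size, (
    f"tokenizer has {len(tokenizer)} tokens, model embeds {model.config.vocab_size}"
)
```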
Wrong chat template applied at inference
Output is coherent but the model never stops, or it answers as if it were running in completion mode rather than chat mode.
In Transformers: `tokenizer.apply_chat_template(messages, ...)`. In llama.cpp: `--chat-template <name>` (e.g., 'llama3', 'chatml', 'mistral'). In Ollama: ensure the Modelfile has the right TEMPLATE block.
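A short sketch of the Transformers path; the repo id and messages are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # example

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
]

# Render the conversation with the model's own template and append the
# assistant header so the model knows it is its turn to respond.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```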
Special tokens stripped during quantization
BOS / EOS / pad / system tokens converted to plain text instead of being recognized. Model never stops generating, or starts mid-sentence.
Use a quant where special tokens are preserved. In llama.cpp, the `--special` flag enables special-token output so you can see whether they're recognized. Otherwise re-quantize from the safetensors source, running `convert-hf-to-gguf.py --vocab-only` first to verify that the vocabulary converts cleanly.
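One way to spot stripped special tokens, sketched against a Transformers tokenizer (example repo id); for a GGUF, run the same check on the original safetensors repo and compare with what the runtime reports:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # example

# A preserved special token encodes to exactly one ID; a stripped one is
# split into several ordinary text tokens.
for token in (tokenizer.bos_token, tokenizer.eos_token):
    if token is None:
        continue
    ids = tokenizer.encode(token, add_special_tokens=False)
    verdict = "OK" if len(ids) == 1 else f"BROKEN: split into {len(ids)} ids"
    print(f"{token!r} -> {ids} ({verdict})")
```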
Custom tokenizer schema not yet supported by runtime
New model architectures (Qwen 3, DeepSeek V3) sometimes ship custom tokenizer extensions that older runtimes don't handle. Errors mention 'unknown special token' or schema mismatch.
Update the runtime to HEAD: `pip install --upgrade transformers`, build llama.cpp from the latest commit, etc. Support for a custom tokenizer typically lags the model's release by a few weeks.
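A hedged loading sketch; `some-org/brand-new-model` is a stand-in repo id, and `trust_remote_code=True` only helps when the model ships its own tokenizer class:

```python
import transformers
from transformers import AutoTokenizer

print("transformers version:", transformers.__version__)

try:
    # trust_remote_code lets a brand-new architecture ship its own tokenizer
    # class before native support lands in a transformers release.
    tokenizer = AutoTokenizer.from_pretrained(
        "some-org/brand-new-model",  # hypothetical repo id
        trust_remote_code=True,
    )
    print("loaded:", type(tokenizer).__name__)
except Exception as exc:  # broad on purpose: this is a diagnostic script
    print("tokenizer failed to load -- runtime may be too old:", exc)
```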
Wrong vocabulary used for fine-tune
The fine-tune was trained with an extended vocabulary (added tokens) but the runtime sees only the base vocabulary. Token IDs above the base vocab size throw out-of-range errors.
Use the fine-tune's tokenizer, not the base model's. Check `tokenizer_config.json` for `added_tokens_decoder` — fine-tunes often add tokens that the base tokenizer doesn't have.
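A quick way to see what the fine-tune added, assuming both tokenizers are on the Hub (both repo ids below are placeholders):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder
tuned = AutoTokenizer.from_pretrained("some-org/llama3-finetune")            # placeholder

print("base size:", len(base), "| fine-tune size:", len(tuned))

# Tokens the fine-tune layered on top of the base vocabulary; if this dict is
# non-empty, the base tokenizer cannot reproduce the fine-tune's token IDs.
for token, token_id in sorted(tuned.get_added_vocab().items(), key=lambda kv: kv[1]):
    print(f"{token_id}: {token!r}")
```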
Frequently asked questions
How do I check if my tokenizer matches my model?
`print(tokenizer.vocab_size, model.config.vocab_size)` — they should match (at minimum, `len(tokenizer)` must not exceed the model's vocab size). Also verify the chat template: `tokenizer.chat_template` should be non-None for chat models. If either check fails, you have a mismatch.
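The same answer wrapped into one function, with an example repo id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def check_match(repo: str) -> None:
    """Run the quick checks above against a single repo."""
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)
    print("tokenizer size:", len(tokenizer))
    print("model vocab_size:", model.config.vocab_size)
    print("chat template present:", tokenizer.chat_template is not None)

check_match("meta-llama/Meta-Llama-3-8B-Instruct")  # example repo id
```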
Can I use a different tokenizer with the same model?
Generally no — token IDs are model-specific. Same-family models (Llama 3.0 vs 3.1) often have compatible tokenizers but verify before assuming. Different families (Llama vs Mistral) never share tokenizers.
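One way to verify before assuming, sketched with placeholder repo ids; identical IDs on sample text plus matching special tokens is a reasonable spot-check, not a proof:

```python
from transformers import AutoTokenizer

# Placeholder repo ids for two same-family models.
a = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
b = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

sample = "Tokenizers must map this text to identical IDs to be interchangeable."
print("same ids:", a.encode(sample) == b.encode(sample))
print("same special ids:",
      (a.bos_token_id, a.eos_token_id) == (b.bos_token_id, b.eos_token_id))
```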
Why are special tokens so often broken in quants?
Quantization scripts sometimes strip or reorder special tokens during conversion. Always download from reputable converters (bartowski, lmstudio-community, mradermacher on HuggingFace). Random uploaders ship broken tokenizers more often than not.
Related troubleshooting
When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
Ollama 'model not found' errors trace to typos in the model name, pulling a model that doesn't exist in the official registry, network blocks on the registry, or pulling from a custom registry without auth.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: