RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Troubleshooting
  4. /Tokenizer mismatch / 'Unknown token' / 'Token ID out of range'
fatal✓Editorial·Reviewed May 2026

Tokenizer mismatch — when input encoding doesn't match the model

Tokenizer errors usually mean the loaded tokenizer doesn't match the model weights, the chat template is wrong, or special tokens (BOS/EOS) weren't preserved through quantization. Verify tokenizer config first.

Hugging Face TransformersvLLMllama.cppOllamaany tokenizer-using lib
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Tokenizer files don't match the model checkpoint

Diagnose

Loaded model from one repo, tokenizer from another. Token IDs go out of bounds during inference. `tokenizer.vocab_size != model.config.vocab_size`.

Fix

Always load both from the same repo: `AutoTokenizer.from_pretrained(repo)` and `AutoModelForCausalLM.from_pretrained(repo)` with the same `repo`. Don't mix-and-match versions.

#2

Wrong chat template applied at inference

Diagnose

Output is coherent but the model never stops, or it answers as if you're prompting completion mode instead of chat mode.

Fix

In Transformers: `tokenizer.apply_chat_template(messages, ...)`. In llama.cpp: `--chat-template <name>` (e.g., 'llama3', 'chatml', 'mistral'). In Ollama: ensure the Modelfile has the right TEMPLATE block.

#3

Special tokens stripped during quantization

Diagnose

BOS / EOS / pad / system tokens converted to plain text instead of being recognized. Model never stops generating, or starts mid-sentence.

Fix

Use a quant where special tokens are preserved. In llama.cpp: `--special` flag forces special-token handling. Otherwise re-quantize from safetensors source with `convert-hf-to-gguf.py --vocab-only` first to verify.

#4

Custom tokenizer schema not yet supported by runtime

Diagnose

New model architectures (Qwen 3, DeepSeek V3) sometimes ship custom tokenizer extensions that older runtimes don't handle. Errors mention 'unknown special token' or schema mismatch.

Fix

Update runtime to HEAD: `pip install --upgrade transformers`, build llama.cpp from latest commit, etc. Custom tokenizer support lags model release by a few weeks.

#5

Wrong vocabulary used for fine-tune

Diagnose

Fine-tune was trained with extended vocabulary (added tokens) but the runtime sees the base vocabulary. Token IDs above base vocab size error.

Fix

Use the fine-tune's tokenizer, not the base model's. Check `tokenizer_config.json` for `added_tokens_decoder` — fine-tunes often add tokens that the base tokenizer doesn't have.

Frequently asked questions

How do I check if my tokenizer matches my model?

`print(tokenizer.vocab_size, model.config.vocab_size)` — they should match. Also verify chat template: `tokenizer.chat_template` should be non-None for chat models. If either fails, you have a mismatch.

Can I use a different tokenizer with the same model?

Generally no — token IDs are model-specific. Same-family models (Llama 3.0 vs 3.1) often have compatible tokenizers but verify before assuming. Different families (Llama vs Mistral) never share tokenizers.

Why are special tokens so often broken in quants?

Quantization scripts sometimes strip or reorder special tokens during conversion. Always download from reputable converters (bartowski, lmstudio-community, mradermacher on HuggingFace). Random uploaders ship broken tokenizers more often than not.

Related troubleshooting

GGUF tokenizer mismatch / 'tokenizer model not found'

When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.

Model keeps crashing / segfault during inference

Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.

Ollama: 'model not found' / 'pull manifest unknown' errors

Ollama 'model not found' errors trace to typos in the model name, pulling a model that doesn't exist in the official registry, network blocks on the registry, or pulling from a custom registry without auth.

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time:

  • Best GPU for local AI
  • Best laptop for local AI
  • Best Mac for local AI

Where next?

All troubleshooting guides
OrBest GPU for local AIWill it run on my hardware?