GGUF model outputs garbage — tokenizer / chat-template mismatch — fix and explanation

Cause

Environment: llama.cpp, Ollama, LM Studio, koboldcpp running GGUF files.

Severity: medium — model loads but output is unusable.

- The GGUF was converted with an older llama.cpp that did not yet support the model's tokenizer (Tekken for Mistral Nemo, the BPE pre-tokenizer and merges for Llama 3).
- The chat template embedded in the GGUF does not match the format the model was trained on, so the runtime needs an explicit override (a Modelfile in Ollama).
- The BOS token policy is wrong: `tokenizer.ggml.add_bos_token` is true on a model trained without BOS.
- Special tokens (`<|eot_id|>`, `<|im_end|>`) are not registered as stop tokens, so generation runs past the end of the turn.
- The GGUF comes from a LoRA merge that did not update the tokenizer metadata.

Solution

1. Re-pull the GGUF from a maintainer who tracks tokenizer updates:

ollama pull mistral-nemo:12b-instruct-2407
# or
hf download bartowski/Mistral-Nemo-Instruct-2407-GGUF \
  Mistral-Nemo-Instruct-2407-Q4_K_M.gguf

2. Inspect the embedded chat template:

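# --chat-template-file loads a template, it does not print one; dump the GGUF metadata instead (pip install gguf; long values may be shortened)
gguf-dump --no-tensors model.gguf | grep chat_template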
# or
ollama show llama3.1:8b --modelfile | grep -A5 TEMPLATE
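To read the dumped template at a glance, a small helper can classify it by its marker tokens. This is a minimal sketch, not part of Ollama or llama.cpp: the function name `guess_template_family` and its labels are hypothetical. Only the marker strings come from the formats themselves (`<|start_header_id|>` for Llama 3, `<|im_start|>` for ChatML, `[INST]` for Mistral-style instruct models).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a chat template read from stdin by its marker tokens.
# Example: ollama show llama3.1:8b --modelfile | guess_template_family
guess_template_family() {
  local text
  text="$(cat)"   # read the whole template (or Modelfile) from stdin
  case "$text" in
    *'<|start_header_id|>'*) echo "llama3" ;;        # Llama 3 header markers
    *'<|im_start|>'*)        echo "chatml" ;;        # ChatML turn markers
    *'[INST]'*)              echo "mistral-inst" ;;  # Mistral / Llama 2 instruct markers
    *)                       echo "unknown" ;;
  esac
}
```

If the reported family differs from the model you loaded (for example `chatml` on a Llama 3 fine-tune), the embedded template is a likely culprit.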

3. Override with the correct template via Modelfile:

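FROM ./model.gguf
# No literal <|begin_of_text|> here: Ollama's published Llama 3 template omits it because the runtime prepends BOS itself.
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"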

ollama create llama3.1:8b-fixed -f Modelfile
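Before running `ollama create`, it can help to confirm that every end-of-turn token used in the Modelfile also has a matching `PARAMETER stop` line. The sketch below is an illustration only: `missing_stops` is a hypothetical helper, its token list is an assumption covering Llama 3 and ChatML, and it expects the exact spelling `PARAMETER stop "<token>"`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print end-of-turn tokens that appear in a Modelfile
# but have no matching PARAMETER stop line.
missing_stops() {
  local modelfile="$1" tok
  # Assumed candidate list; extend it for other model families.
  for tok in '<|eot_id|>' '<|im_end|>' '<|end_of_text|>'; do
    # Token is used somewhere in the file, yet never declared as a stop string.
    if grep -qF -- "$tok" "$modelfile" &&
       ! grep -qF -- "PARAMETER stop \"$tok\"" "$modelfile"; then
      echo "$tok"
    fi
  done
}
```

Running `missing_stops Modelfile` prints nothing when every token it finds is covered. Any token it prints is a candidate for generation running past the end of the turn.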

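4. Disable BOS injection if the model was trained without one:

# Overrides the GGUF metadata flag for this run; llama-cli may reject --no-bos depending on the build
./llama-cli -m model.gguf --override-kv tokenizer.ggml.add_bos_token=bool:false -p "Hello"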
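A related BOS problem is a template that hard-codes the BOS token while the runtime also prepends one, which puts two BOS tokens at the start of the prompt. The following sketch checks for the hard-coded case. `starts_with_literal_bos` is a hypothetical helper, and it knows only two BOS spellings (`<|begin_of_text|>` for Llama 3, `<s>` for Llama 2 and Mistral), so treat other families as unchecked.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the text on stdin begins with a literal BOS token.
# Example: ollama show --template llama3.1:8b | starts_with_literal_bos && echo "template hard-codes BOS"
starts_with_literal_bos() {
  local text
  text="$(cat)"
  case "$text" in
    '<|begin_of_text|>'*|'<s>'*) return 0 ;;  # Llama 3 BOS, Llama 2 / Mistral BOS
    *) return 1 ;;
  esac
}
```

When the check succeeds and the GGUF also sets `tokenizer.ggml.add_bos_token` to true, drop one of the two sources of BOS rather than both.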

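5. Update llama.cpp and reconvert if the tokenizer is genuinely missing tokens (rare, but it happens with research models):

git pull && cmake -B build && cmake --build build --config Release -j
pip install -r requirements.txt  # dependencies of the conversion script
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16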
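Reconversion only helps if the source directory actually contains the tokenizer files, which merged or re-exported checkpoints sometimes lack. As a rough pre-flight check, the sketch below looks for the files a Hugging Face model directory normally carries. `check_tokenizer_files` is a hypothetical helper and the file list is an assumption about typical checkpoints, not a statement of what the converter requires for every architecture.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: report tokenizer-related files missing from a HF model directory.
check_tokenizer_files() {
  local dir="$1" missing=0 f
  for f in config.json tokenizer_config.json; do
    [ -f "$dir/$f" ] || { echo "missing: $f"; missing=1; }
  done
  # Either a fast tokenizer (tokenizer.json) or a SentencePiece model (tokenizer.model) is expected.
  if [ ! -f "$dir/tokenizer.json" ] && [ ! -f "$dir/tokenizer.model" ]; then
    echo "missing: tokenizer.json or tokenizer.model"
    missing=1
  fi
  return "$missing"   # 0 when everything was found, 1 otherwise
}
```

For example, `check_tokenizer_files /path/to/hf-model && python convert_hf_to_gguf.py /path/to/hf-model` skips the conversion when a file is absent.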
