RuntimeError: CUDA error: device-side assert triggered
Cause
A kernel hit an assert failure on the GPU. The most common assert in inference code is an index out of bounds in an embedding lookup (a token ID greater than or equal to the vocabulary size), which happens when the wrong tokenizer was paired with the model, or when special token IDs in the input exceed what the model's embedding table expects.
Once a device-side assert fires, the CUDA context is poisoned: every subsequent CUDA call surfaces the same error, even unrelated ones. The only fix is restarting the process.
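To see the failure mode in isolation, here is a minimal sketch; it assumes PyTorch and a CUDA GPU, and the sizes and variable names are purely illustrative:

import torch
emb = torch.nn.Embedding(10, 4).cuda()   # valid indices are 0..9
ids = torch.tensor([10], device="cuda")  # 10 is out of bounds
out = emb(ids)                           # the launch itself returns immediately
torch.cuda.synchronize()                 # the device-side assert surfaces here

Note that the error is reported at the synchronize call, not at the lookup that caused it, which is exactly why step 1 below matters.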
Solution
1. Re-run with synchronous kernel launches to get an accurate stack trace:
CUDA_LAUNCH_BLOCKING=1 python your_script.py
Without this, the error surfaces at an unrelated later op, because CUDA kernel launches are asynchronous and the assert is only reported at the next synchronization point.
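If you can't edit the launch command (a notebook, a managed runner), the same flag can be set from Python instead, as long as the assignment runs before any CUDA work:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # read when the CUDA context is created
import torch  # import torch (and do all CUDA work) only after the assignment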
2. Find the failing kernel in the now-synchronous traceback. Most often it's an embedding lookup (PyTorch's assert message typically contains srcIndex < srcSelectDimSize) or a scatter/gather op.
3. Check token ID bounds. Print the maximum token ID and compare it to model.config.vocab_size; valid IDs must be strictly below the vocab size:
print(input_ids.max(), model.config.vocab_size)
If the max ID is out of range, you loaded the tokenizer from one model and the weights from another. Load both from the same checkpoint.
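A slightly fuller version of the check, as a sketch: it assumes a Hugging Face-style model/tokenizer pair plus the input_ids you already have, and uses get_input_embeddings() to read the actual embedding table size:

emb = model.get_input_embeddings()                  # the model's nn.Embedding
print("embedding rows:", emb.num_embeddings)
print("tokenizer size:", len(tokenizer))            # includes added tokens
print("max input id:", input_ids.max().item())
assert input_ids.max().item() < emb.num_embeddings  # must be strictly below

One subtle variant: if tokens were added with tokenizer.add_special_tokens(...) but model.resize_token_embeddings(len(tokenizer)) was never called, the new IDs are out of range even though the tokenizer and weights came from the same checkpoint.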
4. If the assert is in attention (SoftmaxBackward, mask application): your attention mask shape doesn't match the input shape. Common after manual padding; compare the shapes:
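A quick shape check, assuming 2-D input_ids and attention_mask tensors as a Hugging Face tokenizer returns them:

print(input_ids.shape, attention_mask.shape)
assert attention_mask.shape == input_ids.shape  # both should be (batch, seq_len)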
5. Restart the process. There is no way to recover the CUDA context once an assert fires; in a notebook, restart the kernel rather than re-running cells:
# Whatever launched it
pkill -f "your_script"
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.