RuntimeError: CUDA error: device-side assert triggered
Cause
A kernel hit an assert failure on the GPU. The most common assert in inference code is an index out of bounds in an embedding lookup (a token ID greater than or equal to the vocabulary size), which happens when the wrong tokenizer was paired with the model, or when special token IDs in the input exceed what the model's embedding table expects.
Once a device-side assert fires, the CUDA context is poisoned: every subsequent CUDA call surfaces the same error, even unrelated ones. The only fix is restarting the process.
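To see the failure mode in isolation, here is a minimal sketch; it assumes PyTorch and a CUDA GPU, and the sizes and variable names are purely illustrative:

import torch
emb = torch.nn.Embedding(10, 4).cuda()   # valid indices are 0..9
ids = torch.tensor([10], device="cuda")  # 10 is out of bounds
out = emb(ids)                           # the launch itself returns immediately
torch.cuda.synchronize()                 # the device-side assert surfaces here

Note that the error is reported at the synchronize call, not at the lookup that caused it, which is exactly why step 1 below matters.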
Solution
1. Re-run with synchronous kernel launches to get an accurate stack trace:
CUDA_LAUNCH_BLOCKING=1 python your_script.py
Without this, the error surfaces at an unrelated later op, because CUDA kernel launches are asynchronous and the assert is only reported at the next synchronization point.
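If you can't edit the launch command (a notebook, a managed runner), the same flag can be set from Python instead, as long as the assignment runs before any CUDA work:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # read when the CUDA context is created
import torch  # import torch (and do all CUDA work) only after the assignment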
2. Find the failing kernel in the now-synchronous traceback. Most often it's an embedding lookup (PyTorch's assert message typically contains srcIndex < srcSelectDimSize) or a scatter/gather op.
3. Check token ID bounds. Print the maximum token ID and compare it to model.config.vocab_size; valid IDs must be strictly below the vocab size:
print(input_ids.max(), model.config.vocab_size)
If the max ID is out of range, you loaded the tokenizer from one model and the weights from another. Load both from the same checkpoint.
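A slightly fuller version of the check, as a sketch: it assumes a Hugging Face-style model/tokenizer pair plus the input_ids you already have, and uses get_input_embeddings() to read the actual embedding table size:

emb = model.get_input_embeddings()                  # the model's nn.Embedding
print("embedding rows:", emb.num_embeddings)
print("tokenizer size:", len(tokenizer))            # includes added tokens
print("max input id:", input_ids.max().item())
assert input_ids.max().item() < emb.num_embeddings  # must be strictly below

One subtle variant: if tokens were added with tokenizer.add_special_tokens(...) but model.resize_token_embeddings(len(tokenizer)) was never called, the new IDs are out of range even though the tokenizer and weights came from the same checkpoint.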
4. If the assert is in attention (SoftmaxBackward, mask application): your attention mask shape doesn't match the input shape. Common after manual padding; compare the shapes:
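A quick shape check, assuming 2-D input_ids and attention_mask tensors as a Hugging Face tokenizer returns them:

print(input_ids.shape, attention_mask.shape)
assert attention_mask.shape == input_ids.shape  # both should be (batch, seq_len)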
5. Restart the process. There is no way to recover the CUDA context once an assert fires; in a notebook, restart the kernel rather than re-running cells:
# Whatever launched it
pkill -f "your_script"
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.