ROCm HSA status error — recover an AMD GPU mid-inference
HSA / HIP errors mid-inference on AMD GPUs usually trace to thermal limits, kernel-driver mismatch, or known-bad memory modes on consumer cards. Here's the diagnostic order.
Diagnostic order — most likely first
Card thermal throttle hitting an unstable clock state
Crash correlates with sustained load. `rocm-smi` shows GPU temp > 95°C right before crash. Reproducible.
Improve case airflow. Step the memory clock down a performance level: `rocm-smi --setmclk <level>` takes a DPM level index, not a MHz offset, and the available levels vary by card. On RDNA 3, an undervolt profile via MorePowerTool lowers thermals further.
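Before changing clocks, confirm the thermal correlation: log temperatures during a repro run, then scan the log for samples near the throttle point. A minimal sketch — the sample lines below mimic `rocm-smi --showtemp` output, which is an assumption; formats vary across ROCm versions, so adjust the match to what your card actually prints.

```shell
# During a repro run, log temps alongside inference, e.g.:
#   while sleep 1; do rocm-smi --showtemp >> gpu_temp.log; done
# Here we fake two samples so the scan below is self-contained.
printf '%s\n' \
  'GPU[0] : Temperature (Sensor edge) (C): 82.0' \
  'GPU[0] : Temperature (Sensor edge) (C): 97.5' > gpu_temp.log

# Flag any sample above the 95 C throttle threshold.
awk -F': ' '/Temperature/ && $NF+0 > 95 { print "THROTTLE RISK:", $0 }' gpu_temp.log
```

If the flagged samples cluster just before each crash, treat it as thermal before suspecting drivers.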
Kernel + ROCm version mismatch after distro update
It was working last week; it broke after `apt upgrade`. `dkms status` shows the amdgpu module as 'failed', or built against a different kernel than the one running.
Reinstall with matching versions: `sudo amdgpu-install --usecase=rocm,dkms` then reboot. For consumer cards on rolling distros (Arch), pin the kernel version against the ROCm release.
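A quick way to check for the skew before reinstalling — a sketch, assuming `dkms status` output names the kernel the module was built for (the exact format varies across DKMS versions):

```shell
# Compare the running kernel against the kernels the amdgpu module is built for.
kernel=$(uname -r)
if dkms status amdgpu 2>/dev/null | grep -q "$kernel.*installed"; then
  echo "OK: amdgpu DKMS module built for $kernel"
else
  echo "MISMATCH: rebuild with 'sudo amdgpu-install --usecase=rocm,dkms' and reboot"
fi
```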
RDNA 3 memory bandwidth bug at high context
7900 XTX specifically. Crashes only at long context (>16K) on certain models. A known issue, tracked in the llama.cpp and ROCm GitHub issue trackers.
Use a Q5_K_M quant instead of Q4 (the slightly different memory-access pattern sidesteps the bug). Or cap context at 8K for affected models. Or build llama.cpp from HEAD to pick up the latest ROCm patches.
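Capping context is the least invasive workaround. A sketch using llama.cpp's server binary — the model path and layer count here are illustrative, not prescriptive:

```shell
# -c caps the context window at 8192 tokens; -ngl offloads layers to the GPU.
# Model path is an example; substitute your affected Q5_K_M quant.
./llama-server -m ./models/model-Q5_K_M.gguf -c 8192 -ngl 99
```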
Missing HSA gfx-version override for the card
`rocminfo` shows your card's gfx version (gfx1100, gfx1030, etc.) but ROCm libraries reject it because the binary distribution doesn't ship that target.
Set `HSA_OVERRIDE_GFX_VERSION=11.0.0` (RDNA 3) or `10.3.0` (RDNA 2) in your shell or systemd service. Many bundled ROCm builds need this for consumer cards.
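The override value follows mechanically from the gfx target `rocminfo` reports. A helper sketch — `gfx_to_override` is a hypothetical function, and the mappings shown cover only the common consumer targets:

```shell
# Map a gfx target (as reported by rocminfo) to the HSA override value.
gfx_to_override() {
  case "$1" in
    gfx110*) echo "11.0.0" ;;      # RDNA 3: 7900 XTX/XT, 7800 XT, ...
    gfx103*) echo "10.3.0" ;;      # RDNA 2: 6900 XT, 6800, ...
    *)       echo "unsupported" ;; # older targets: consider Vulkan instead
  esac
}

export HSA_OVERRIDE_GFX_VERSION="$(gfx_to_override gfx1100)"
echo "$HSA_OVERRIDE_GFX_VERSION"   # 11.0.0 for a 7900 XTX
```

In a systemd service the equivalent is `Environment=HSA_OVERRIDE_GFX_VERSION=11.0.0` under `[Service]`.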
Insufficient PCIe bandwidth (laptop or x4 slot)
Crashes during model load or first inference. `sudo lspci -vv | grep LnkSta` (root is needed for full link status on many systems) shows the GPU at PCIe 3.0 x4 instead of 4.0 x16.
Move card to a full-length x16 slot if available. For laptops with M.2-eGPU adapters, this is a hardware limitation — only short-context inference is reliable.
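The link check can be scripted. A sketch against a captured `LnkSta` line — the sample mimics `lspci -vv` output for a card stuck at Gen3 x4; in practice, pipe real `sudo lspci -vv` output through the same filter:

```shell
# lspci marks any link trained below its capability with "(downgraded)".
line='LnkSta: Speed 8GT/s (downgraded), Width x4 (downgraded)'
case "$line" in
  *downgraded*) echo "PCIe link downgraded: $line" ;;
  *)            echo "PCIe link at full capability" ;;
esac
```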
Frequently asked questions
Is ROCm production-ready on consumer AMD GPUs in 2026?
On RDNA 3 (7900 XTX, 7900 XT) with the gfx-version override and ROCm 6.x — yes for inference, with caveats. On RDNA 2 — workable but more friction. On older cards — use Vulkan via llama.cpp instead.
Should I switch to Vulkan if ROCm keeps crashing?
Yes. llama.cpp's Vulkan backend (`-DGGML_VULKAN=ON`) achieves 70-90% of ROCm performance for inference and is dramatically more stable on consumer AMD cards. The trade-off: no PyTorch / Transformers support.
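A build recipe sketch for the Vulkan backend, assuming the Vulkan SDK and shader compiler are installed (`libvulkan-dev` and `glslc` on Debian/Ubuntu; package names vary by distro):

```shell
# Configure and build llama.cpp with the Vulkan backend instead of ROCm/HIP.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```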
Can I run ROCm and CUDA on the same machine (multi-vendor GPU)?
Technically yes, but driver coexistence is fragile. Most operators dedicate a machine to one vendor. If you must, install in isolated containers.
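Container isolation for the ROCm side can look like the following — the image name is illustrative, while `/dev/kfd` and `/dev/dri` are the standard ROCm device passthrough:

```shell
# Run a ROCm workload in a container so the host's driver stacks stay separate.
docker run -it --device=/dev/kfd --device=/dev/dri \
  --group-add video rocm/rocm-terminal
```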
Related troubleshooting
ROCm is finicky on consumer AMD GPUs in 2026. Here's the install order, the gfx-version override that fixes 80% of detection failures, and when to give up and use Vulkan.
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: