
TensorRT-LLM build failed — fix the engine compilation

TensorRT-LLM compilation/build failures: missing CUDA arch flag, version mismatches, Python wheel OOM, and NVCC compute capability issues. Honest advice: for most users, vLLM is the saner path.

TensorRT-LLM · NVIDIA CUDA · Windows · Linux
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

CUDA architecture flag missing or wrong

Diagnose

Build error mentions `nvcc fatal: Unsupported GPU architecture` or the compiled engine runs but uses fallback kernels that are 3-10x slower. You didn't specify your GPU's compute capability.

Fix

Pass your GPU's compute capability to the build. RTX 30-series = 86, 40-series = 89, 50-series = 120. For a 4090, set the architecture explicitly on the source build, as shown in the sketch below (older TensorRT OSS builds take the CMake flag `-DGPU_ARCHS=89` instead). Without an explicit architecture, TensorRT-LLM compiles for the broadest compatibility, losing most of the speed gain.
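A minimal sketch of the two steps, assuming a recent NVIDIA driver (for the `compute_cap` query) and a source checkout of TensorRT-LLM (for `scripts/build_wheel.py`); flag names vary between releases, so check `--help` on your checkout:

```bash
# Query the compute capability of the installed GPU; a 4090 reports 8.9,
# which maps to architecture 89.
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Source-build invocation pinned to a single architecture. Adjust "89-real"
# to match the value reported above (e.g. "86-real" for a 3090).
python3 scripts/build_wheel.py --cuda_architectures "89-real"
```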

#2

TensorRT version mismatch with installed CUDA version

Diagnose

Import errors on `import tensorrt_llm`: `libnvinfer.so.10 not found` or version mismatch warnings. TensorRT and CUDA versions are tightly coupled.

Fix

Check the compatibility matrix in NVIDIA's TensorRT documentation. As a rough guide, TensorRT 10.x pairs with CUDA 12.4 and later; TensorRT 9.x pairs with CUDA 12.2-12.3. Pin both versions in your environment. Better yet, use the official TensorRT-LLM Docker image (`nvcr.io/nvidia/tensorrt-llm/release`) to bypass version hell entirely.
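Before pinning anything, confirm what is actually installed; a quick check, assuming the CUDA toolkit and the Python wheels live on the same machine:

```bash
# CUDA toolkit version the compiler will use.
nvcc --version

# TensorRT and TensorRT-LLM Python package versions.
python3 -c "import tensorrt; print('tensorrt', tensorrt.__version__)"
python3 -c "import tensorrt_llm; print('tensorrt_llm', tensorrt_llm.__version__)"

# The libnvinfer shared library the dynamic loader will actually pick up.
ldconfig -p | grep libnvinfer
```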

#3

Python wheel compilation runs out of system RAM (not VRAM)

Diagnose

The build process is killed outright during `pip install` or `python build.py` (not a CUDA out-of-memory error, but the kernel OOM killer terminating the process). `dmesg` shows `Out of memory: Killed process`. Compiling the TRT engine can need 32-64 GB of system RAM for large models.
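To confirm it really was the kernel OOM killer and not a CUDA error, check the kernel log and watch system memory while the build runs:

```bash
# Look for the kernel OOM killer entry naming the build process.
sudo dmesg | grep -i "out of memory"

# On systemd machines, the kernel journal shows the same event.
journalctl -k | grep -i "killed process"

# Watch system RAM (not VRAM) during the build; the Mem line is what matters.
watch -n 2 free -h
```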

Fix

Increase swap space (`sudo fallocate -l 32G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`). Close other memory-heavy processes. Or use a machine with more RAM — TRT-LLM compilation for 70B+ models can require 64 GB+ system RAM.
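The same swap setup spelled out with verification steps, assuming 32 GB of free disk on the root filesystem; skip the fstab line if you only need the swap for one build:

```bash
# Allocate a 32 GB swap file, lock down permissions, and enable it.
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Confirm the kernel is actually using it.
swapon --show
free -h

# Optional: make the swap file survive reboots.
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```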

#4

NVCC compute capability override needed but not set

Diagnose

Build succeeds but engine performance is 5-10x slower than expected. The engine compiled without SM-specific optimizations for your GPU.

Fix

Set `TORCH_CUDA_ARCH_LIST` environment variable before building: `export TORCH_CUDA_ARCH_LIST='8.9'` (for 4090) or `'8.6;8.9'` (for multi-arch). Also confirm `nvcc --version` shows the right CUDA version. Run the built engine with `--log_level=verbose` to see which kernels are being dispatched.
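A sketch of setting the override and double-checking what the toolchain sees, assuming PyTorch is installed in the build environment:

```bash
# Ask PyTorch what the GPU itself reports; a 4090 prints (8, 9).
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"

# Pin the architectures the build should target, then confirm the compiler version.
export TORCH_CUDA_ARCH_LIST='8.9'
nvcc --version
```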

Frequently asked questions

Is TensorRT-LLM worth it over vLLM?

For most users, no. vLLM is easier to set up, has better community support, and reaches 80-90% of TensorRT-LLM's peak throughput for most workloads. TensorRT-LLM shines in two scenarios: (1) you're serving at scale and the 10-20% throughput gain pays for the engineering time, (2) you're running on Jetson or small embedded GPUs where TRT's optimizations matter most.

What's the simplest way to get TensorRT-LLM working?

Use the official Docker image: `docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:latest`. This ships with all dependencies pre-built. Follow the examples in the container first, then adapt to your model. Building from source outside Docker is a rite of passage for masochists.
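A fuller invocation than the one-liner above, with the shared-memory and ulimit flags NVIDIA's container docs usually recommend; the volume mount path is just an example:

```bash
# --ipc=host and the ulimit flags avoid shared-memory errors in multi-GPU runs;
# the volume mount exposes local model weights inside the container.
docker run --rm -it --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$(pwd)/models:/workspace/models" \
  nvcr.io/nvidia/tensorrt-llm/release:latest
```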

Does TensorRT-LLM work on Windows?

Officially, no. TensorRT-LLM's build system assumes Linux. There are community workarounds via WSL2, but you're adding another compatibility layer on top of an already fragile build chain. If you need Windows + NVIDIA serving, use vLLM or llama.cpp.

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time.