RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Severity: fatal · Editorial · Reviewed May 2026

TensorRT-LLM build failed — fix the engine compilation

TensorRT-LLM compilation/build failures: missing CUDA arch flag, version mismatches, Python wheel OOM, and NVCC compute capability issues. Honest advice: for most users, vLLM is the saner path.

TensorRT-LLM · NVIDIA CUDA · Windows · Linux
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

CUDA architecture flag missing or wrong

Diagnose

Build error mentions `nvcc fatal: Unsupported GPU architecture` or the compiled engine runs but uses fallback kernels that are 3-10x slower. You didn't specify your GPU's compute capability.

Fix

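Tell the build your GPU's compute capability. RTX 30-series = 86, 40-series = 89, 50-series = 120 (100 is data-center Blackwell such as B200, not the consumer cards). `-DGPU_ARCHS=<your-arch>` is a CMake definition, so it belongs on the CMake command line, not in a Python script's arguments. If you build the wheel with TensorRT-LLM's `scripts/build_wheel.py`, the equivalent option is `--cuda_architectures`. Example for 4090: `python3 ./scripts/build_wheel.py --cuda_architectures "89-real"`. Without an explicit architecture, TensorRT-LLM compiles for the broadest compatibility, losing most of the speed gain.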
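If you are unsure which value applies to your card, the driver can usually tell you. The sketch below is a minimal helper, assuming a reasonably recent driver whose `nvidia-smi` supports the `compute_cap` query field; the function names are illustrative and not part of TensorRT-LLM. It turns a reported compute capability such as `8.9` into the two- or three-digit form the build expects.

```python
import shutil
import subprocess


def arch_from_compute_cap(compute_cap: str) -> str:
    """Convert a dotted compute capability ("8.9") to an arch number ("89")."""
    major, minor = compute_cap.strip().split(".")  # ValueError if not "X.Y"
    return f"{int(major)}{int(minor)}"


def detect_archs() -> list:
    """Return the sorted, de-duplicated arch numbers of all visible GPUs.

    Returns an empty list when nvidia-smi is missing or its output
    cannot be parsed (for example, on drivers without `compute_cap`).
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    archs = set()
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        try:
            archs.add(arch_from_compute_cap(line))
        except ValueError:
            # Unexpected output (error text, "N/A"); skip rather than guess.
            continue
    return sorted(archs)


if __name__ == "__main__":
    print("Detected architectures:", detect_archs())
```

An empty result means the helper could not determine anything, in which case look the card up in NVIDIA's compute capability table instead.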

#2

TensorRT version mismatch with installed CUDA version

Diagnose

Import errors on `import tensorrt_llm`: `libnvinfer.so.10 not found` or version mismatch warnings. TensorRT and CUDA versions are tightly coupled.

Fix

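Check the compatibility matrix in NVIDIA's TensorRT documentation for your exact release before installing anything. As a rough guide, TensorRT 10.x pairs with CUDA 12.4-12.5 and TensorRT 9.x with CUDA 12.2-12.3, but supported ranges shift between point releases, so treat the matrix as authoritative. Pin both versions in your environment. Or use the official TensorRT-LLM Docker image (`nvcr.io/nvidia/tensorrt-llm/release`), which ships a matched set, to bypass version hell entirely.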
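A quick pre-flight check can catch a mismatch before the import fails. The following is a rough sketch, not an official tool: the version ranges in `GUIDE_CUDA_RANGES` are copied from this guide rather than from NVIDIA's matrix, the function names are hypothetical, and it assumes the TensorRT Python package is installed under the distribution name `tensorrt`.

```python
import re
import shutil
import subprocess
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Assumption: ranges taken from this guide, NOT from NVIDIA's support matrix.
# Update them to match the matrix for the release you actually install.
GUIDE_CUDA_RANGES = {
    10: ((12, 4), (12, 5)),
    9: ((12, 2), (12, 3)),
}


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def cuda_in_guide_range(trt_version: str, cuda: Tuple[int, int]) -> Optional[bool]:
    """True/False if the TensorRT major version is in the table, else None."""
    major = int(trt_version.split(".")[0])
    if major not in GUIDE_CUDA_RANGES:
        return None
    low, high = GUIDE_CUDA_RANGES[major]
    return low <= cuda <= high


if __name__ == "__main__":
    cuda = None
    if shutil.which("nvcc"):
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=False
        ).stdout
        cuda = parse_nvcc_release(out)
    try:
        trt = version("tensorrt")
    except PackageNotFoundError:
        trt = None
    print("CUDA toolkit:", cuda, "| TensorRT:", trt)
    if cuda and trt:
        print("Within guide range:", cuda_in_guide_range(trt, cuda))
```

A `None` result means the table has no entry for that TensorRT major version, so the matrix is the only source of truth.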

#3

Python wheel compilation runs out of system RAM (not VRAM)

Diagnose

The build is hard-killed during `pip install` or `python build.py`: no Python `MemoryError` or traceback, just a process terminated by the kernel's OOM killer. `dmesg` shows `Out of memory: Killed process`. Compiling the TRT engine needs 32-64 GB of system RAM for large models.

Fix

Increase swap space (`sudo fallocate -l 32G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`). Close other memory-heavy processes. Or use a machine with more RAM — TRT-LLM compilation for 70B+ models can require 64 GB+ system RAM.
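Before starting a long build, it can help to check how much memory the kernel can actually hand out. This is a small Linux-only sketch that reads `/proc/meminfo` and adds available RAM to free swap; the 32 GiB threshold simply mirrors the lower bound quoted above, and the function names are illustrative.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value_in_kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def headroom_gib(meminfo_text: str) -> float:
    """Available RAM plus free swap, in GiB."""
    info = parse_meminfo(meminfo_text)
    total_kb = info.get("MemAvailable", 0) + info.get("SwapFree", 0)
    return total_kb / (1024 ** 2)


if __name__ == "__main__":
    path = "/proc/meminfo"
    if os.path.exists(path):
        with open(path) as handle:
            gib = headroom_gib(handle.read())
        print(f"RAM + swap headroom: {gib:.1f} GiB")
        if gib < 32:  # threshold taken from the 32-64 GB figure in this guide
            print("Likely too little for a large-model build; add swap first.")
    else:
        print("No /proc/meminfo found; this check only works on Linux.")
```

Swap counts toward the total here because it prevents the kill, but a build that leans heavily on swap will be much slower than one that fits in RAM.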

#4

NVCC compute capability override needed but not set

Diagnose

Build succeeds but engine performance is 5-10x slower than expected. The engine compiled without SM-specific optimizations for your GPU.

Fix

Set `TORCH_CUDA_ARCH_LIST` environment variable before building: `export TORCH_CUDA_ARCH_LIST='8.9'` (for 4090) or `'8.6;8.9'` (for multi-arch). Also confirm `nvcc --version` shows the right CUDA version. Run the built engine with `--log_level=verbose` to see which kernels are being dispatched.
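If you launch the build from a script, the variable can be set for that one process instead of your whole shell. The sketch below assumes arch numbers in the two- or three-digit form used earlier (`89`, `86`) and converts them to the dotted, semicolon-separated form shown above; `torch_cuda_arch_list` and `build_env` are hypothetical helper names.

```python
import os


def torch_cuda_arch_list(archs) -> str:
    """Convert ["86", "89"] into "8.6;8.9" (last digit is the minor version)."""
    dotted = []
    for arch in archs:
        digits = str(arch)
        if not digits.isdigit() or len(digits) < 2:
            raise ValueError(f"unexpected arch value: {arch!r}")
        dotted.append(f"{digits[:-1]}.{digits[-1]}")
    return ";".join(dotted)


def build_env(archs, base_env=None) -> dict:
    """Copy the environment and add TORCH_CUDA_ARCH_LIST for a child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = torch_cuda_arch_list(archs)
    return env


if __name__ == "__main__":
    env = build_env(["89"])
    print("TORCH_CUDA_ARCH_LIST =", env["TORCH_CUDA_ARCH_LIST"])
```

Pass the returned dictionary as `env=` to `subprocess.run` along with your own build command; the parent shell's environment stays untouched.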

Frequently asked questions

Is TensorRT-LLM worth it over vLLM?

For most users, no. vLLM is easier to set up, has better community support, and reaches 80-90% of TensorRT-LLM's peak throughput for most workloads. TensorRT-LLM shines in two scenarios: (1) you're serving at scale and the 10-20% throughput gain pays for the engineering time, (2) you're running on Jetson or small embedded GPUs where TRT's optimizations matter most.

What's the simplest way to get TensorRT-LLM working?

Use the official Docker image: `docker run --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:latest`. This ships with all dependencies pre-built. Follow the examples in the container first, then adapt to your model. Building from source outside Docker is a rite of passage for masochists.

Does TensorRT-LLM work on Windows?

Officially, no. TensorRT-LLM's build system assumes Linux. There are community workarounds via WSL2, but you're adding another compatibility layer on top of an already fragile build chain. If you need Windows + NVIDIA serving, use vLLM or llama.cpp.

Related troubleshooting

vLLM worker crashed / vLLM scheduler crash

vLLM worker/scheduler crashes: KV cache fraction misconfiguration, max-model-len exceeding VRAM, worker timeouts, NCCL failures, and quant incompatibility. The exact fix order that production operators use.

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time:

  • Best GPU for local AI
  • Best laptop for local AI
  • Best Mac for local AI
