TensorRT-LLM
NVIDIA's first-party inference compiler. Generates optimized engines per model + GPU pair, with the lowest latency on NVIDIA hardware. The pick when you're committed to a single SKU and need the absolute fastest tokens-per-second.
Overview
What TensorRT-LLM actually is
TensorRT-LLM is NVIDIA's first-party LLM inference engine, the production path through which NVIDIA itself benchmarks every Hopper- and Blackwell-class GPU. It is not a wrapper around PyTorch — it is a build pipeline that takes a model definition (Llama, Qwen, Mistral, GPT-J, MoE families, etc.) and a calibration dataset, runs it through TensorRT's graph optimizer, and emits a per-GPU-class engine binary with kernels selected, fused, and pre-tuned for that exact card.
That build-once-run-everywhere-on-the-same-GPU model is its signature. It is also its biggest cost: every change of model, quant, max sequence length, or tensor-parallel topology rebuilds the engine, and the build is not trivially fast. In return you get the highest single-node throughput numbers any inference engine produces on H100 / H200 / B100 / B200 — usually 1.3-2× faster than vLLM at the same precision on the same hardware in 2026.
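One way to make the rebuild rule concrete is to fingerprint the settings that force a new engine and compare that fingerprint against the one stored next to a cached build. The sketch below is illustrative only: `engine_key` and `needs_rebuild` are hypothetical helper names, not part of TensorRT-LLM, and the field list simply mirrors the triggers named above plus the GPU class.

```python
import hashlib
import json
from typing import Optional


def engine_key(model: str, quant: str, max_seq_len: int, tp_size: int, gpu: str) -> str:
    """Return a short, stable fingerprint of the settings that force a rebuild."""
    config = {
        "model": model,
        "quant": quant,
        "max_seq_len": max_seq_len,
        "tp_size": tp_size,
        "gpu": gpu,
    }
    # sort_keys keeps the JSON text (and so the hash) independent of dict order.
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def needs_rebuild(cached_key: Optional[str], **config) -> bool:
    """True when no engine is cached or any fingerprinted setting has changed."""
    return cached_key != engine_key(**config)


# Hypothetical deployment settings, for illustration only.
current = dict(model="llama-70b", quant="fp8", max_seq_len=8192, tp_size=4, gpu="H100")
key = engine_key(**current)
print(needs_rebuild(key, **current))                             # False
print(needs_rebuild(key, **{**current, "max_seq_len": 16384}))   # True
```

Storing the key beside each engine directory lets a deploy script skip the build when nothing relevant has changed.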
Where it fits in the stack
TensorRT-LLM lives at the engine layer for production NVIDIA datacenter serving. The canonical stack:
- Frontend: Triton Inference Server, or a custom gRPC / HTTP wrapper
- Engine: TensorRT-LLM Python runtime + the compiled .engine blob
- Hardware: H100 / H200 / B100 / B200 / GB200; the higher you go on the Hopper-Blackwell ladder, the bigger the relative gap to other engines
- Quant: FP16, BF16, FP8 (Hopper transformer engine), INT4 (AWQ / GPTQ), W4A16
It is not the right engine for consumer cards (use vLLM or ExLlamaV2 on an RTX 4090). It is not the right engine for prototyping or research (the rebuild loop is too slow). It is the engine for a fleet operator who has settled on a model, a quant, and a hardware tier, and wants the single-node throughput ceiling that exists on that hardware.
Best use cases
- 70B / 405B production serving on H100 / H200 clusters. The textbook use case; the 1.3-2× advantage over vLLM compounds across millions of tokens/day.
- FP8 inference on Hopper / Blackwell. TensorRT-LLM's FP8 transformer engine path is the most mature in the ecosystem; nothing else comes close on H100 in 2026.
- Multi-node tensor-parallel + pipeline-parallel. The combined TP+PP path with NCCL + InfiniBand is well-tuned. See /guides/running-local-ai-on-multiple-gpus-2026 and /stacks/h100-tensor-parallel-workstation.
- Speculative decoding in production. Medusa, EAGLE, and draft-model paths are first-class.
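For the combined TP+PP case, it helps to see how a flat list of ranks splits into pipeline stages and tensor-parallel groups. The sketch below assumes a common convention in which tensor-parallel ranks are contiguous, so each TP group can stay on one node's fast interconnect. It is not taken from TensorRT-LLM's source, and `rank_layout` is a hypothetical helper.

```python
def rank_layout(tp_size: int, pp_size: int) -> list[tuple[int, int, int]]:
    """Return (rank, pp_stage, tp_rank) for every rank in a TP x PP job."""
    layout = []
    for rank in range(tp_size * pp_size):
        pp_stage = rank // tp_size  # which pipeline stage this rank belongs to
        tp_rank = rank % tp_size    # position inside that stage's TP group
        layout.append((rank, pp_stage, tp_rank))
    return layout


# Example: 8 GPUs arranged as 2 pipeline stages of 4-way tensor parallelism.
for rank, stage, tp in rank_layout(tp_size=4, pp_size=2):
    print(f"rank {rank}: pipeline stage {stage}, tensor-parallel rank {tp}")
```

With this layout the total GPU count is always `tp_size * pp_size`, which is a quick check against the size of the cluster you actually have.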
OS support
| OS | Quality |
|---|---|
| Ubuntu 22.04 / 24.04 LTS | excellent — the production reference |
| RHEL / Rocky 8/9 | excellent — common enterprise target |
| Other Linux | partial — distro-dependent CUDA / NCCL packaging |
| Windows | not the target — datacenter Linux only |
| macOS | unsupported (no NVIDIA on Apple Silicon) |
The reference deployment runs from an NVIDIA container image inside Triton Inference Server. Bare-metal Python builds work but are not the production default.
Hardware / backend support
TensorRT-LLM is NVIDIA-only and is architecture-tuned. The supported targets in May 2026:
- H100 / H200 (Hopper): first-class, FP8 transformer engine fully supported
- B100 / B200 / GB200 (Blackwell): first-class; FP4 path matures through the year
- L40S / L40 / L4 (Ada): supported; FP8 is available on Ada's fourth-generation Tensor Cores
- A100 / A40 (Ampere): supported; falls back to FP16 / BF16 / INT4
- RTX 4090 / RTX 5090: supported, but the toolchain is engineered for datacenter serving; on consumer cards it is usually overkill
There is no AMD, no Apple Silicon, no Intel Arc path. For non-NVIDIA hardware, use vLLM (which has wider hardware coverage) or llama.cpp.
Model / quant format support
- FP16 / BF16 — reference baseline; best quality
- FP8 (E4M3 / E5M2) — Hopper-native; the throughput-king path on H100 / H200
- AWQ-INT4 — the Ada / Ampere-friendly INT4 path; calibration-based
- GPTQ-INT4 — supported, slightly behind AWQ in production
- W4A16 weight-only INT4 — for memory-bound serving
- No GGUF, no EXL2, no MLX — out of scope by design
For the cross-runtime quant ladder see /systems/quantization-formats.
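A rough way to compare these formats is the weight-only footprint: parameters times bits per weight, divided by eight. The sketch below is back-of-envelope arithmetic, not a measurement. It ignores KV cache, activations, and the scale/zero-point overhead that INT4 schemes carry, and `weight_gb` is a hypothetical helper.

```python
# Bits stored per weight for each format (quantization metadata not included).
BITS_PER_WEIGHT = {"fp16": 16, "bf16": 16, "fp8": 8, "int4": 4}


def weight_gb(num_params: float, precision: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9


for precision in ("fp16", "fp8", "int4"):
    print(f"70B @ {precision}: {weight_gb(70e9, precision):.0f} GB")
```

For a 70B model this gives roughly 140 GB at FP16, 70 GB at FP8, and 35 GB at INT4. That is why FP8 puts the weights under the 80 GB of a single H100, although with little room left for KV cache.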
Setup path
The reference path is the NVIDIA NGC container:
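```shell
docker run --gpus all --rm -it \
  nvcr.io/nvidia/tensorrt-llm/release:latest
# Inside the container (the release image is expected to ship tensorrt_llm
# already; install it only if the import fails):
pip install tensorrt_llm
# --checkpoint_dir expects a converted TensorRT-LLM checkpoint, not raw HF weights
trtllm-build --checkpoint_dir <trtllm_checkpoint> \
  --output_dir engines/llama70b \
  --gemm_plugin auto --max_batch_size 64
```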
For Triton serving, point the Triton model repository at the engines directory and start tritonserver. The complete pipeline is documented in NVIDIA's NGC catalog and the TensorRT-LLM examples repo.
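Once tritonserver is running, a smoke test can be as small as one HTTP request. The sketch below only builds the request, so it runs without a server. It assumes Triton's HTTP generate endpoint on port 8000, a model named `ensemble`, and the `text_input` / `max_tokens` / `text_output` field names used in the TensorRT-LLM backend examples; check all of these against your own model repository. `build_generate_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_generate_request(prompt: str, max_tokens: int,
                           host: str = "localhost:8000",
                           model: str = "ensemble") -> urllib.request.Request:
    """Build (but do not send) a POST to Triton's generate endpoint."""
    url = f"http://{host}/v2/models/{model}/generate"
    # Field names are assumptions taken from the backend's example configs.
    body = json.dumps({"text_input": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("What is TensorRT-LLM?", max_tokens=32)
print(req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["text_output"])
```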
What breaks first
In order of how often you'll hit them:
- Engine rebuild on any config change. Changed max sequence length? Rebuild. Changed TP size? Rebuild. Changed quant? Rebuild. Each rebuild for a 70B-class model takes 10-30 minutes on an H100.
- CUDA / cuDNN / TensorRT version drift. The engine is pinned to a TensorRT version; mixing engine versions across nodes silently corrupts outputs.
- NCCL topology mismatches. Multi-node TP+PP requires explicit NCCL config; misconfigured fabrics tank scaling without erroring.
- FP8 numerical instability on edge architectures. Some MoE routers and novel attention variants need per-layer precision overrides.
- HF model conversion drift. New model architectures land on HF first; TensorRT-LLM's converter sometimes lags by weeks.
Alternatives by intent
| If you want… | Reach for |
|---|---|
| Hardware coverage beyond NVIDIA | vLLM or llama.cpp |
| Faster iteration loop | vLLM (no rebuild step) |
| Best agentic prefix-cache hit rates | SGLang |
| Single-stream consumer-card throughput | ExLlamaV2 on an RTX 4090 |
| Apple Silicon | MLX-LM |
Best pairings
- NVIDIA H100 SXM + TensorRT-LLM + FP8 + 70B model — the production sweet spot
- NVIDIA H200 + TensorRT-LLM + 405B FP8 across 4× H200 — the frontier-self-host path
- Triton Inference Server as the gRPC / HTTP gateway in front of TRT-LLM engines
- NCCL + InfiniBand as the cluster fabric for multi-node serving
Who should avoid TensorRT-LLM
- Solo developers and homelabs. The build loop and infrastructure overhead are not worth it; use vLLM or Ollama.
- Operators on consumer hardware. An RTX 4090 has FP8 Tensor Cores but no NVLink and far less memory bandwidth than an H100, so the TRT-LLM advantage shrinks dramatically.
- AMD / Apple / Intel ecosystems. Wrong vendor.
- Anyone iterating on model choice or quant choice. Each iteration is a 10-30 minute rebuild; vLLM is a better fit.
- Workloads where the marginal 1.3-2× over vLLM doesn't justify the engineering cost. This is most workloads under 1M tokens/day.
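The last point can be checked with simple arithmetic: how many GPU-hours per day does a given speedup actually save at your volume? The sketch below uses made-up inputs (1,000 tokens/s baseline, $3 per GPU-hour, 1.5× speedup); substitute your own measurements. Both function names are hypothetical.

```python
def gpu_hours_per_day(tokens_per_day: float, tokens_per_sec: float) -> float:
    """GPU-hours needed to serve a day's tokens at a given sustained throughput."""
    return tokens_per_day / tokens_per_sec / 3600


def daily_savings_usd(tokens_per_day: float, baseline_tps: float,
                      speedup: float, usd_per_gpu_hour: float) -> float:
    """Money saved per day by running `speedup` times faster than the baseline."""
    baseline = gpu_hours_per_day(tokens_per_day, baseline_tps)
    faster = gpu_hours_per_day(tokens_per_day, baseline_tps * speedup)
    return (baseline - faster) * usd_per_gpu_hour


# Illustrative inputs only: adjust throughput, speedup, and price to your fleet.
for volume in (1e6, 1e9):
    saved = daily_savings_usd(volume, baseline_tps=1000, speedup=1.5, usd_per_gpu_hour=3.0)
    print(f"{volume:.0e} tokens/day -> ${saved:,.2f} saved per day")
```

Under these assumed numbers the saving at 1M tokens/day is well under a dollar, which is the scale at which the rebuild discipline is hard to justify.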
Related
- Stacks: /stacks/h100-tensor-parallel-workstation
- System guides: /systems/quantization-formats, /guides/running-local-ai-on-multiple-gpus-2026
- Hardware: NVIDIA H100 SXM, NVIDIA H200, RTX 5090
- Errors: /errors/wsl2-gpu-not-detected
Setup guidance
Install via the tensorrt_llm Python package in a venv with CUDA 12.4+: pip install tensorrt_llm. This requires the TensorRT SDK (download from NVIDIA Developer) and a matching cuDNN. Then:
- Convert the HuggingFace checkpoint to TensorRT-LLM format: python examples/llama/convert_checkpoint.py --model_dir ./Llama-3.1-8B-Instruct --output_dir ./trt_checkpoint --dtype float16
- Build the engine: trtllm-build --checkpoint_dir ./trt_checkpoint --output_dir ./trt_engine --gemm_plugin float16. Building a 70B engine takes ~2 hours on 8× H100; this is a one-time cost per model+GPU combination.
- Serve: python examples/run.py --engine_dir ./trt_engine --tokenizer_dir ./Llama-3.1-8B-Instruct --max_output_len 2048
- Verify with the run.py script or Triton's perf_analyzer.
The Triton Inference Server integration is the production path: package the engine as a Triton model repository and serve via the Triton HTTP/gRPC API. Time-to-first-response after engine build: ~10 seconds for model load + warmup.
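The convert, build, and run commands above share directories, so a small script that assembles them from one set of paths avoids mismatched arguments. The sketch below only prints the commands. The flags are the ones quoted in this section, while `build_pipeline` and its parameter names are hypothetical.

```python
import shlex


def build_pipeline(model_dir: str, ckpt_dir: str, engine_dir: str,
                   dtype: str = "float16") -> list[list[str]]:
    """Return the convert, build, and run commands as argv lists."""
    convert = ["python", "examples/llama/convert_checkpoint.py",
               "--model_dir", model_dir, "--output_dir", ckpt_dir, "--dtype", dtype]
    # The build step consumes the checkpoint directory written by the convert step.
    build = ["trtllm-build", "--checkpoint_dir", ckpt_dir,
             "--output_dir", engine_dir, "--gemm_plugin", dtype]
    run = ["python", "examples/run.py", "--engine_dir", engine_dir,
           "--tokenizer_dir", model_dir, "--max_output_len", "2048"]
    return [convert, build, run]


steps = build_pipeline("./Llama-3.1-8B-Instruct", "./trt_checkpoint", "./trt_engine")
for argv in steps:
    print(shlex.join(argv))
    # To execute instead of print: subprocess.run(argv, check=True)
```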
Workload fit
Best for:
- Latency-critical production serving on NVIDIA Hopper/Blackwell GPUs where the rest of the stack is already optimized and engine compilation time is acceptable overhead
- Cloud deployment on NVIDIA GPU instances where FP8 quantization enables fitting a 70B model on a single H100
- Enterprise deployments committed to a fixed set of models on a fixed GPU SKU
- NVIDIA Triton Inference Server deployments that integrate multiple model types (LLM + embedding + reranker) in one serving pipeline
Not suited for:
- Workflows requiring rapid model switching or daily model iteration (engine build time kills velocity; use vLLM)
- Non-NVIDIA GPU deployments
- CPU or Apple Silicon inference
- Experimentation and prototyping with new model architectures
Alternatives
Use TensorRT-LLM when minimum single-request latency on NVIDIA datacenter GPUs (H100, H200, B200) is the primary requirement: it wins 15–30% over vLLM on TTFT and per-token decode latency through graph-level fusion and kernel auto-tuning. TensorRT-LLM's quantization support, FP8 on Hopper/Blackwell and FP4 on Blackwell, is the most mature of any engine, so use it when you need to fit larger models into fewer GPUs at minimal quality loss. Switch to vLLM when iteration speed matters: vLLM hot-loads any HuggingFace model in seconds vs TensorRT-LLM's 1–3 hour engine build per model. Use SGLang when prefix caching is your throughput lever. Avoid TensorRT-LLM if you need AMD, Apple Silicon, or CPU backends; it is NVIDIA-only. Avoid it if you iterate on multiple model variants daily, because the build cost dominates workflow speed.
Troubleshooting + when to switch
- Problem: RuntimeError: TensorRT engine built with version X but runtime is version Y.
  Fix: TensorRT engines are not forward or backward compatible. Rebuild the engine with the exact same TensorRT version as your runtime. Pin the tensorrt_llm version in your requirements file and rebuild engines on upgrade.
- Problem: Engine build fails with OOM during weight conversion.
  Fix: Weight conversion loads the full FP16 model into CPU memory. For 70B models (~140 GB), you need a machine with 256+ GB system RAM. Use --workers 1 to reduce the memory spike from parallel conversion, or convert weights on a high-RAM CPU-only node before building on the GPU node.
- Problem: Inference latency higher than vLLM despite using TRT-LLM.
  Fix: The default GEMM plugin is float16. Switch to the fp8 plugin on H100/H200 for ~2× throughput. Ensure --use_fp8_context_fmha is enabled for FP8 flash attention on Hopper. Single-request latency wins apply only when the engine is correctly configured for the GPU arch; a float16 engine on H100 gets roughly half the peak Tensor Core throughput that FP8 offers.
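The version-mismatch problem is cheap to catch before a server starts taking traffic. The sketch below assumes the engine directory carries a JSON config with a top-level `version` field recording the build-time tensorrt_llm version; treat that field name as an assumption and confirm it against your own engine output. `check_engine_version` and the version strings are hypothetical.

```python
def check_engine_version(engine_config: dict, runtime_version: str) -> None:
    """Raise early if the engine was built with a different version than the runtime."""
    built_with = engine_config.get("version")
    if built_with != runtime_version:
        raise RuntimeError(
            f"engine built with {built_with!r} but runtime is {runtime_version!r}; "
            "rebuild the engine or pin the runtime"
        )


# Hypothetical values. In practice, json.load() the engine's config file and
# read the installed package version at startup.
engine_config = {"version": "0.17.0"}
check_engine_version(engine_config, "0.17.0")  # matching versions: no error

try:
    check_engine_version(engine_config, "0.18.0")
except RuntimeError as err:
    print(err)
```

Running this check on every node at deploy time also guards against a fleet where only some hosts were upgraded.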
Stack & relationships
How TensorRT-LLM relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.
Alternatives
- Competes with vLLM
TensorRT-LLM compiles to a fixed engine for one GPU SKU; vLLM runs PyTorch kernels with dynamic batching. Pick TensorRT-LLM if you need every microsecond on Hopper/Blackwell.
- Competes with SGLang
Different design philosophies — SGLang is dynamic-batching PyTorch; TensorRT-LLM is compile-once-per-SKU. Pick SGLang for iteration speed and prefix caching; TensorRT-LLM for absolute lowest TTFT on Hopper/Blackwell.
Avoid pairing with
- Works poorly with AnythingLLM
Doable through Triton's OpenAI shim. Operationally heavy; only worth it if you've already invested in the NVIDIA stack.
- Incompatible with MLX-LM
NVIDIA-only vs Apple-only. Same boundary as vLLM↔MLX. Listed explicitly so readers don't assume cross-platform support.
Featured in these stacks
The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Production tier · Role: FP8 throughput leader (when committed to NVIDIA stack) · Dual RTX 4090 workstation stack: newer-architecture 70B serving without NVLink
TensorRT-LLM extracts the FP8 advantage that the Ada architecture supports natively. Recompile-per-config friction is real, but for production deployments where the model + quant are stable, TRT-LLM throughput beats vLLM by 15-25%. Use only when committed to the rebuild discipline.
- Stack · L3 · Production tier · Role: Peak-throughput runtime (when stable config) · 4× H100 SXM tensor-parallel workstation: frontier MoE serving reference
TensorRT-LLM extracts an additional 15-25% throughput vs vLLM at the cost of recompile-per-config friction. Use when model + quant + batch size are stable for production deployment; not for development iteration.
Pros
- Peak NVIDIA hardware utilization
- FP8 / FP4 acceleration on Blackwell
Cons
- NVIDIA only
- Compilation step is heavy
Compatibility
| Property | Value |
|---|---|
| Operating systems | Linux, Windows |
| GPU backends | NVIDIA CUDA |
| License | Open source · free |
Runtime health
Operator-grade signals on how actively TensorRT-LLM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
8 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Ecosystem stability
Editorial rating from RunLocalAI — qualitative, not measured.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.