RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

server
Open source
free
4.3/5

TensorRT-LLM

NVIDIA's first-party inference compiler. Generates optimized engines per model + GPU pair, with the lowest latency on NVIDIA hardware. The pick when you're committed to a single SKU and need the absolute fastest tokens-per-second.

By Fredoline Eruo · Last verified Jun 12, 2026 · 12,000 GitHub stars

Overview

What TensorRT-LLM actually is

TensorRT-LLM is NVIDIA's first-party LLM inference engine, the production path through which NVIDIA itself benchmarks every Hopper- and Blackwell-class GPU. It is not a wrapper around PyTorch; it is a build pipeline that takes a model definition (Llama, Qwen, Mistral, GPT-J, MoE families, etc.) and, for quantized builds, a calibration dataset, runs it through TensorRT's graph optimizer, and emits a per-GPU-class engine binary with kernels selected, fused, and pre-tuned for that exact card.

That build-once-run-everywhere-on-the-same-GPU model is its signature. It is also its biggest cost: every change of model, quant, max sequence length, or tensor-parallel topology rebuilds the engine, and the build is not trivially fast. In return you get the highest single-node throughput numbers any inference engine produces on H100 / H200 / B100 / B200 — usually 1.3-2× faster than vLLM at the same precision on the same hardware in 2026.
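One way to make the rebuild rule concrete is to treat the build inputs as a cache key: if any field changes, the stored engine no longer applies. The sketch below is illustrative only. `EngineConfig`, `engine_key`, and `needs_rebuild` are hypothetical names, not part of TensorRT-LLM, and the field list simply mirrors the rebuild triggers named above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EngineConfig:
    """Hypothetical record of the inputs that pin a compiled engine."""
    model: str
    quant: str
    max_seq_len: int
    tp_size: int
    gpu: str


def engine_key(cfg: EngineConfig) -> str:
    """Stable short hash of the build inputs (sorted keys keep it deterministic)."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def needs_rebuild(built: EngineConfig, wanted: EngineConfig) -> bool:
    """True when any build input differs from what the engine was compiled for."""
    return engine_key(built) != engine_key(wanted)


if __name__ == "__main__":
    built = EngineConfig("llama-70b", "fp8", 8192, 4, "H100")
    wanted = replace(built, max_seq_len=16384)  # only the sequence length changed
    print(needs_rebuild(built, built), needs_rebuild(built, wanted))
```

Storing engines under a directory named after the key is one possible convention for keeping several configurations side by side without overwriting each other.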

Where it fits in the stack

TensorRT-LLM lives at the engine layer for production NVIDIA datacenter serving. The canonical stack:

  • Frontend: Triton Inference Server, or a custom gRPC / HTTP wrapper
  • Engine: TensorRT-LLM Python runtime + the compiled .engine blob
  • Hardware: H100 / H200 / B100 / B200 / GB200 — the higher you go on the Hopper-Blackwell ladder, the bigger the relative gap to other engines
  • Quant: FP16, BF16, FP8 (Hopper transformer engine), INT4 (AWQ / GPTQ), W4A16

It is not the right engine for consumer cards (use vLLM or ExLlamaV2 on an RTX 4090). It is not the right engine for prototyping or research (the rebuild loop is too slow). It is the engine for a fleet operator who has settled on a model, a quant, and a hardware tier, and wants the single-node throughput ceiling that exists on that hardware.

Best use cases

  • 70B / 405B production serving on H100 / H200 clusters. The textbook use case; the 1.3-2× advantage over vLLM compounds across millions of tokens/day.
  • FP8 inference on Hopper / Blackwell. TensorRT-LLM's FP8 transformer engine path is the most mature in the ecosystem; nothing else comes close on H100 in 2026.
  • Multi-node tensor-parallel + pipeline-parallel. The combined TP+PP path with NCCL + InfiniBand is well-tuned. See /guides/running-local-ai-on-multiple-gpus-2026 and /stacks/h100-tensor-parallel-workstation.
  • Speculative decoding in production. Medusa, EAGLE, and draft-model paths are first-class.

OS support

OS                          Quality
Ubuntu 22.04 / 24.04 LTS    excellent (the production reference)
RHEL / Rocky 8/9            excellent (common enterprise target)
Other Linux                 partial (distro-dependent CUDA / NCCL packaging)
Windows                     not the target (datacenter Linux only)
macOS                       unsupported (no NVIDIA on Apple Silicon)

The reference deployment is an NVIDIA-container-image-based path inside Triton Inference Server. Bare-metal Python builds work but are not the production-default.

Hardware / backend support

TensorRT-LLM is NVIDIA-only and is architecture-tuned. The supported targets in May 2026:

  • H100 / H200 (Hopper) — first-class, FP8 transformer engine fully supported
  • B100 / B200 / GB200 (Blackwell) — first-class; FP4 path matures through the year
  • L40S / L40 / L4 (Ada): supported; FP8 is available on Ada tensor cores, though memory bandwidth is well below Hopper's
  • A100 / A40 (Ampere) — supported; falls back to FP16 / BF16 / INT4
  • RTX 4090 / RTX 5090 — supported but engineered for datacenter; using it on consumer is overkill

There is no AMD, no Apple Silicon, no Intel Arc path. For non-NVIDIA hardware, use vLLM (which has wider hardware coverage) or llama.cpp.

Model / quant format support

  • FP16 / BF16 — reference baseline; best quality
  • FP8 (E4M3 / E5M2) — Hopper-native; the throughput-king path on H100 / H200
  • AWQ-INT4 — the Ada / Ampere-friendly INT4 path; calibration-based
  • GPTQ-INT4 — supported, slightly behind AWQ in production
  • W4A16 weight-only INT4 — for memory-bound serving
  • No GGUF, no EXL2, no MLX — out of scope by design

For the cross-runtime quant ladder see /systems/quantization-formats.
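As a rough guide to what each format on this list costs in memory, weight size is approximately parameter count times bits per weight divided by eight. The sketch below is a back-of-envelope estimate under stated assumptions: it uses nominal bit widths, ignores quantization scales, KV cache, and activations, and the names `BITS_PER_WEIGHT` and `weight_gb` are hypothetical.

```python
# Nominal storage width per weight for the formats listed above.
# Real engines add scale/zero-point overhead for the INT4 formats.
BITS_PER_WEIGHT = {"fp16": 16, "bf16": 16, "fp8": 8, "int4": 4}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


if __name__ == "__main__":
    for fmt in ("fp16", "fp8", "int4"):
        print(f"70B @ {fmt}: ~{weight_gb(70, fmt):.0f} GB of weights")
```

Halving the bit width halves the weight footprint, which is why the lower-precision formats are the ones used to fit a model onto fewer GPUs.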

Setup path

The reference path is the NVIDIA NGC container:

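# Start the NGC release container.
docker run --gpus all --rm -it \
  nvcr.io/nvidia/tensorrt-llm/release:latest

# Inside the container (the pip step is only needed if the image
# does not already provide the tensorrt_llm package):
pip install tensorrt_llm

# --checkpoint_dir expects a converted TensorRT-LLM checkpoint
# (the output of convert_checkpoint.py), not a raw HuggingFace directory.
trtllm-build --checkpoint_dir <trtllm_checkpoint> \
  --output_dir engines/llama70b \
  --gemm_plugin auto --max_batch_size 64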

For Triton serving, point the Triton model repository at the engines directory and start tritonserver. The complete pipeline is documented in NVIDIA's NGC catalog and the TensorRT-LLM examples repo.

What breaks first

In order of how often you'll hit them:

  1. Engine rebuild on any config change. Changed max sequence length? Rebuild. Changed TP size? Rebuild. Changed quant? Rebuild. Each rebuild for a 70B-class model takes 10-30 minutes on an H100.
  2. CUDA / cuDNN / TensorRT version drift. The engine is pinned to a TensorRT version; mixing engine versions across nodes silently corrupts outputs.
  3. NCCL topology mismatches. Multi-node TP+PP requires explicit NCCL config; misconfigured fabrics tank scaling without erroring.
  4. FP8 numerical instability on edge architectures. Some MoE routers and novel attention variants need per-layer precision overrides.
  5. HF model conversion drift. New model architectures land on HF first; TensorRT-LLM's converter sometimes lags by weeks.
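Version drift (item 2) is easier to catch if each engine directory carries a record of the TensorRT version it was built with, checked before the engine is loaded. The following is a minimal sketch of that idea, not a TensorRT-LLM feature: the sidecar file name `build_info.json` and the functions `record_build` and `check_runtime` are hypothetical, and the version strings are passed in by the caller rather than read from any library.

```python
import json
from pathlib import Path

SIDECAR = "build_info.json"  # hypothetical file name, written next to the engine


def record_build(engine_dir: Path, trt_version: str) -> None:
    """Write the build-time TensorRT version alongside the engine."""
    engine_dir.mkdir(parents=True, exist_ok=True)
    info = {"tensorrt_version": trt_version}
    (engine_dir / SIDECAR).write_text(json.dumps(info), encoding="utf-8")


def check_runtime(engine_dir: Path, runtime_version: str) -> None:
    """Fail loudly before serving if the runtime does not match the build."""
    info = json.loads((engine_dir / SIDECAR).read_text(encoding="utf-8"))
    built = info["tensorrt_version"]
    if built != runtime_version:
        raise RuntimeError(
            f"engine built with TensorRT {built}, runtime is {runtime_version}; "
            "rebuild the engine before serving"
        )
```

Running the check on every node at startup turns a mixed-version fleet into an explicit startup failure instead of something discovered from bad outputs.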

Alternatives by intent

If you want…                              Reach for
Hardware coverage beyond NVIDIA           vLLM or llama.cpp
Faster iteration loop                     vLLM (no rebuild step)
Best agentic prefix-cache hit rates       SGLang
Single-stream consumer-card throughput    ExLlamaV2 on an RTX 4090
Apple Silicon                             MLX-LM

Best pairings

  • NVIDIA H100 SXM + TensorRT-LLM + FP8 + 70B model — the production sweet spot
  • NVIDIA H200 + TensorRT-LLM + 405B FP8 across 4× H200 — the frontier-self-host path
  • Triton Inference Server as the gRPC / HTTP gateway in front of TRT-LLM engines
  • NCCL + InfiniBand as the cluster fabric for multi-node serving
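The pairings above can be sanity-checked with simple arithmetic: split the weights evenly across the tensor-parallel group and compare the per-GPU share with the card's memory. This is a sketch under loud assumptions: an even split, one byte per parameter for FP8, and no allowance for KV cache, activations, or runtime buffers, all of which consume the remaining headroom. The function names are hypothetical; the capacities are the 80 GB H100 and 141 GB H200 figures.

```python
# Per-GPU memory capacity in GB for the two cards named in the pairings.
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def per_gpu_weight_gb(params_billion: float, bytes_per_param: float, n_gpus: int) -> float:
    """Weight share per GPU, assuming an even tensor-parallel split."""
    return params_billion * bytes_per_param / n_gpus


def headroom_gb(gpu: str, params_billion: float, bytes_per_param: float, n_gpus: int) -> float:
    """Memory left per GPU after weights; must cover KV cache and activations."""
    return GPU_MEMORY_GB[gpu] - per_gpu_weight_gb(params_billion, bytes_per_param, n_gpus)


if __name__ == "__main__":
    print("70B FP8 on 1x H100 headroom:", headroom_gb("H100", 70, 1, 1), "GB")
    print("405B FP8 on 4x H200 headroom:", headroom_gb("H200", 405, 1, 4), "GB")
```

A negative or very small headroom means the configuration will not serve useful batch sizes even if the weights technically load.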

Who should avoid TensorRT-LLM

  • Solo developers and homelabs. The build loop and infrastructure overhead are not worth it; use vLLM or Ollama.
  • Operators on consumer hardware. An RTX 4090 has FP8-capable Ada tensor cores but no NVLink or HBM, so the TRT-LLM advantage shrinks dramatically.
  • AMD / Apple / Intel ecosystems. Wrong vendor.
  • Anyone iterating on model choice or quant choice. Each iteration is a 10-30 minute rebuild; vLLM is a better fit.
  • Workloads where the marginal 1.3-2× over vLLM doesn't justify the engineering cost. This is most workloads under 1M tokens/day.
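The last point in the list can be checked with a quick estimate of how much GPU time a speedup actually saves at a given daily volume. The sketch below is illustrative: the baseline throughput of 1,000 tokens/second is an assumed figure, not a measurement, and `gpu_seconds_saved_per_day` is a hypothetical helper.

```python
def gpu_seconds_saved_per_day(
    tokens_per_day: float, baseline_tok_per_s: float, speedup: float
) -> float:
    """GPU-seconds saved per day when throughput improves by `speedup`x."""
    if baseline_tok_per_s <= 0 or speedup <= 0:
        raise ValueError("throughput and speedup must be positive")
    baseline_seconds = tokens_per_day / baseline_tok_per_s
    # The faster engine needs baseline_seconds / speedup; the rest is saved.
    return baseline_seconds * (1 - 1 / speedup)


if __name__ == "__main__":
    # Assumed baseline: 1,000 tok/s aggregate throughput on the slower engine.
    for speedup in (1.3, 2.0):
        saved = gpu_seconds_saved_per_day(1_000_000, 1000.0, speedup)
        print(f"{speedup}x at 1M tokens/day saves ~{saved:.0f} GPU-seconds/day")
```

Under that assumed baseline, 1M tokens/day is under 17 minutes of GPU time to begin with, so even a 2× gain saves only minutes per day, far less than a single engine rebuild.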

Related

  • Stacks: /stacks/h100-tensor-parallel-workstation
  • System guides: /systems/quantization-formats, /guides/running-local-ai-on-multiple-gpus-2026
  • Hardware: NVIDIA H100 SXM, NVIDIA H200, RTX 5090
  • Errors: /errors/wsl2-gpu-not-detected

Setup guidance

Install via the tensorrt_llm Python package in a venv with CUDA 12.4+:

pip install tensorrt_llm

Requires the TensorRT SDK (download from NVIDIA Developer) and a matching cuDNN.

Start by converting a HuggingFace checkpoint to TensorRT format. First convert the weights:

python examples/llama/convert_checkpoint.py --model_dir ./Llama-3.1-8B-Instruct --output_dir ./trt_checkpoint --dtype float16

Then build the engine:

trtllm-build --checkpoint_dir ./trt_checkpoint --output_dir ./trt_engine --gemm_plugin float16

Building a 70B engine takes ~2 hours on 8× H100; this is a one-time cost per model+GPU combination.

Serve with:

python examples/run.py --engine_dir ./trt_engine --tokenizer_dir ./Llama-3.1-8B-Instruct --max_output_len 2048

The Triton Inference Server integration is the production path: package the engine as a Triton model repository and serve via the Triton HTTP/gRPC API. Verify with the run.py script or Triton's perf_analyzer. Time-to-first-response after engine build: ~10 seconds for model load + warmup.
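The convert-then-build flow described above is easy to get wrong when paths are typed by hand, since the build step must read exactly the directory the convert step wrote. A small sketch of generating both command lines from one set of paths follows. It only assembles and prints the commands, using the script names and flags quoted in this section; `build_commands` is a hypothetical helper, and nothing is executed.

```python
import shlex
from typing import List


def build_commands(
    hf_dir: str, ckpt_dir: str, engine_dir: str, dtype: str = "float16"
) -> List[List[str]]:
    """Return [convert_argv, build_argv] sharing one checkpoint directory."""
    convert = [
        "python", "examples/llama/convert_checkpoint.py",
        "--model_dir", hf_dir,
        "--output_dir", ckpt_dir,
        "--dtype", dtype,
    ]
    build = [
        "trtllm-build",
        "--checkpoint_dir", ckpt_dir,  # same directory the convert step wrote
        "--output_dir", engine_dir,
        "--gemm_plugin", dtype,
    ]
    return [convert, build]


if __name__ == "__main__":
    for argv in build_commands("./Llama-3.1-8B-Instruct", "./trt_checkpoint", "./trt_engine"):
        print(shlex.join(argv))  # copy-pasteable, shell-quoted command line
```

Keeping the commands as argument lists means they could later be passed to `subprocess.run` unchanged, without re-parsing a shell string.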

Workload fit

Best for:

  • Latency-critical production serving on NVIDIA Hopper/Blackwell GPUs where everything else is optimized and engine compilation time is acceptable overhead
  • Cloud deployment on NVIDIA GPU instances where FP8 quantization enables fitting a 70B model on a single H100
  • Enterprise deployments committed to a fixed set of models on a fixed GPU SKU
  • NVIDIA Triton Inference Server deployments that integrate multiple model types (LLM + embedding + reranker) in one serving pipeline

Not suited for:

  • Workflows requiring rapid model switching or daily model iteration (engine build time kills velocity; use vLLM)
  • Non-NVIDIA GPU deployments
  • CPU or Apple Silicon inference
  • Experimentation and prototyping with new model architectures

Alternatives

Use TensorRT-LLM when the lowest single-request latency on NVIDIA datacenter GPUs (H100, H200, B200) is the primary requirement: it wins 15–30% over vLLM on TTFT and per-token decode latency through graph-level fusion and kernel auto-tuning. TensorRT-LLM's FP8 and FP4 quantization support on Hopper/Blackwell is the most mature of any engine; use it when you need to fit larger models into fewer GPUs at minimal quality loss. Switch to vLLM when iteration speed matters: vLLM hot-loads any HuggingFace model in seconds vs TensorRT-LLM's 1–3 hour engine build per model. Use SGLang when prefix caching is your throughput lever. Avoid TensorRT-LLM if you need AMD, Apple Silicon, or CPU backends, since it is NVIDIA-only. Avoid it if you iterate on multiple model variants daily, because the build cost dominates workflow speed.

Troubleshooting + when to switch

Problem: RuntimeError: TensorRT engine built with version X but runtime is version Y.
Fix: TensorRT engines are not forward or backward compatible. Rebuild the engine with the exact same TensorRT version as your runtime. Pin the tensorrt_llm version in your requirements file and rebuild engines on upgrade.

Problem: Engine build fails with OOM during weight conversion.
Fix: Weight conversion loads the full FP16 model into CPU memory. For 70B models (~140 GB), you need a machine with 256+ GB system RAM. Use --workers 1 to reduce the memory spike from parallelism, or convert weights on a high-RAM CPU-only node before building on the GPU node.

Problem: Inference latency higher than vLLM despite using TRT-LLM.
Fix: The default GEMM plugin is float16. Switch to the fp8 plugin on H100/H200 for ~2× throughput. Ensure --use_fp8_context_fmha is enabled for FP8 flash attention on Hopper. Single-request latency wins apply only when the engine is correctly configured for the GPU arch; a float16 engine on H100 leaves roughly half of the peak tensor-core throughput unused.

Stack & relationships

How TensorRT-LLM relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.

TensorRT-LLM ↔ ecosystem

Alternatives

  • Competes with
    vLLM

    TensorRT-LLM compiles to a fixed engine for one GPU SKU; vLLM runs PyTorch kernels with dynamic batching. Pick TensorRT-LLM if you need every microsecond on Hopper/Blackwell.

  • Competes with
    SGLang

    Different design philosophies — SGLang is dynamic-batching PyTorch; TensorRT-LLM is compile-once-per-SKU. Pick SGLang for iteration speed and prefix caching; TensorRT-LLM for absolute lowest TTFT on Hopper/Blackwell.

Avoid pairing with

  • Works poorly with
    AnythingLLM

    Doable through Triton's OpenAI shim. Operationally heavy; only worth it if you've already invested in the NVIDIA stack.

  • Incompatible with
    MLX-LM

    NVIDIA-only vs Apple-only. Same boundary as vLLM↔MLX. Stated explicitly so readers don't assume cross-platform support.

Featured in these stacks

The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Production tier·Role: FP8 throughput leader (when committed to NVIDIA stack)
    Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink

    TensorRT-LLM extracts the FP8 advantage that the Ada architecture supports natively. Recompile-per-config friction is real, but for production deployments where the model + quant are stable, TRT-LLM throughput beats vLLM by 15-25%. Use only when committed to the rebuild discipline.

  • Stack · L3·Production tier·Role: Peak-throughput runtime (when stable config)
    4× H100 SXM tensor-parallel workstation — frontier MoE serving reference

    TensorRT-LLM extracts an additional 15-25% throughput vs vLLM at the cost of recompile-per-config friction. Use when model + quant + batch size are stable for production deployment; not for development iteration.

Pros

  • Peak NVIDIA hardware utilization
  • FP8 / FP4 acceleration on Blackwell

Cons

  • NVIDIA only
  • Compilation step is heavy

Compatibility

Operating systems
Linux
Windows
GPU backends
NVIDIA CUDA
License
Open source · free

Runtime health

Operator-grade signals on how actively TensorRT-LLM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active
Updated Jun 12, 2026

8 days since last refresh · source: lastUpdated

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Ecosystem stability

Editorial rating from RunLocalAI — qualitative, not measured.

4.3/5 ✓ Editorial

Get TensorRT-LLM

GitHub
https://github.com/NVIDIA/TensorRT-LLM

Frequently asked

Is TensorRT-LLM free?

Yes — TensorRT-LLM is free to use and open-source.

What operating systems does TensorRT-LLM support?

TensorRT-LLM supports Linux and Windows. Per the OS support table above, datacenter Linux is the production target and Windows is not.

Which GPUs work with TensorRT-LLM?

TensorRT-LLM runs on NVIDIA GPUs through CUDA only. There is no CPU-only, AMD, Apple Silicon, or Intel Arc path.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
