
TensorRT

TensorRT is NVIDIA's SDK for optimizing and deploying deep learning models on NVIDIA GPUs. It performs graph optimization, kernel fusion, and reduced-precision execution (FP16, INT8, INT4) to cut latency and memory usage. Operators encounter TensorRT when they want to maximize inference throughput on RTX or data-center GPUs, typically by converting a model from PyTorch or ONNX into a TensorRT engine. The engine is hardware-specific: by default, an engine built for an RTX 4090 won't run on an RTX 3060. TensorRT is not a self-contained inference stack like llama.cpp, which loads a portable model file directly; it is a compiler that produces a deployable engine, paired with a runtime that executes it.

Deeper dive

TensorRT works by taking a trained model (often in ONNX format) and applying a series of optimizations. First, it fuses layers, combining operations like convolution + batch normalization + ReLU into a single kernel, to reduce launch overhead and memory traffic. Second, it selects the fastest kernel for each layer given the target GPU architecture (e.g., Ampere, Ada Lovelace). Third, it can run weights and activations at lower precision (FP16, INT8, INT4), trading some accuracy for speed. INT8 quantization of activations typically needs a calibration dataset to choose scaling factors; FP16 does not. The output is a serialized engine file (commonly saved as .plan or .engine) that is loaded and run with the TensorRT runtime. For LLMs, TensorRT-LLM extends this with paged attention, in-flight batching, and multi-GPU support. Operators should note that building an engine takes minutes to hours, and the engine is tied to a specific CUDA version, TensorRT version, and GPU architecture. Rebuilding is required when any of those change.
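The fusion and quantization steps described above can be sketched in plain Python. This is an illustrative sketch only, not TensorRT's implementation: it uses a single-channel scalar "convolution", and the function names are hypothetical. The first part folds batch normalization into the preceding layer's weight and bias so that three operations become one multiply-add plus ReLU. The second part shows the simplest form of calibration (max-absolute-value scaling) for symmetric INT8; TensorRT's own calibrators are more sophisticated.

```python
import math

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch norm into the preceding layer: returns (folded_weight, folded_bias)."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta

def unfused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps: scalar 'convolution', batch norm, ReLU."""
    y = weight * x + bias
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta
    return max(y, 0.0)

def fused(x, folded_weight, folded_bias):
    """One step: multiply-add with folded parameters, then ReLU."""
    return max(folded_weight * x + folded_bias, 0.0)

def calibrate_scale(samples, num_bits=8):
    """Max-abs calibration: map the largest observed magnitude to the top integer."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    return max(abs(v) for v in samples) / qmax

def quantize(x, scale, num_bits=8):
    """Symmetric quantization: round to the nearest step and clip to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(-qmax, min(qmax, round(x / scale)))

def dequantize(q, scale):
    """Map an integer code back to an approximate real value."""
    return q * scale

# Hypothetical layer parameters, chosen only for illustration.
params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_f, b_f = fold_batchnorm(**params)

# Hypothetical calibration samples standing in for observed activations.
scale = calibrate_scale([-2.0, 0.5, 1.27])
```

The fused and unfused paths produce the same output up to floating-point rounding, which is why fusion costs no accuracy. Quantization does cost accuracy: each value is off by up to half a step (`scale / 2`), and values outside the calibrated range are clipped.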

Practical example

An operator running Llama 3.1 8B on an RTX 4090 (24 GB VRAM) might use TensorRT-LLM to build an INT4 engine. The build process converts the Hugging Face model to TensorRT-LLM's checkpoint format, quantizes the weights to INT4 (calibrating on a small dataset if the quantization method requires it), and produces a ~5 GB engine. Inference then runs at ~200 tok/s, compared to ~100 tok/s with llama.cpp Q4_K_M on the same GPU; treat these figures as indicative, since throughput depends on batch size, context length, and software versions. However, the engine is not portable: moving it to an RTX 3060 (12 GB) would require a new build targeting that GPU's compute capability (8.6, versus 8.9 on the RTX 4090).
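The engine size quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a round 8 billion parameters and counts weights only; the helper name is hypothetical, and GB here means 10^9 bytes.

```python
def weight_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 8.0e9  # assumption: round figure for an "8B" model

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(NUM_PARAMS, bits):.1f} GB")
```

Pure INT4 weights come to about 4 GB under these assumptions. The gap to a ~5 GB engine plausibly comes from quantization scales and layers kept in higher precision, though the exact breakdown depends on the build options. The same arithmetic shows why an FP16 build (about 16 GB of weights) leaves much less of a 24 GB card free for the KV cache.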

Workflow example

In practice, an operator using TensorRT-LLM's command-line tools would:
1) Install TensorRT-LLM and its dependencies.
2) Convert the Hugging Face checkpoint with the model's convert_checkpoint.py script, passing --model_dir ./llama-8b --output_dir ./ckpt --dtype float16 --use_weight_only --weight_only_precision int4, then run trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine to build the engine.
3) Serve with trtllm-serve, pointing it at ./engine and the model's tokenizer.
Exact commands and flags vary between TensorRT-LLM releases, so check each tool's --help output for the installed version. The build step is the main friction: it requires matching CUDA and TensorRT versions, and can fail if the GPU runs out of VRAM during the build. Operators often script this build as a one-time setup per GPU model.
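Scripting the build as a one-time setup usually comes down to deciding whether a previously built engine is still valid. A minimal sketch of that check follows, assuming the script records a fingerprint next to each engine. The function names and version strings are hypothetical placeholders, not part of TensorRT-LLM; only the compute capabilities (8.9 and 8.6) come from the example above.

```python
import hashlib
import json

def build_fingerprint(compute_capability, cuda_version, tensorrt_version):
    """Hash the three properties an engine is tied to into a short cache key."""
    payload = json.dumps(
        {"cc": compute_capability, "cuda": cuda_version, "trt": tensorrt_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def needs_rebuild(recorded_fingerprint, compute_capability, cuda_version, tensorrt_version):
    """True when the current environment differs from the one the engine was built in."""
    current = build_fingerprint(compute_capability, cuda_version, tensorrt_version)
    return recorded_fingerprint != current

# Placeholder version strings; substitute whatever the build machine reports.
CUDA = "cuda-placeholder"
TRT = "tensorrt-placeholder"

fp_4090 = build_fingerprint("8.9", CUDA, TRT)    # recorded when the engine was built
print(needs_rebuild(fp_4090, "8.9", CUDA, TRT))  # same GPU and versions
print(needs_rebuild(fp_4090, "8.6", CUDA, TRT))  # engine moved to an RTX 3060
```

In a real script the three inputs would be read from the machine rather than hard-coded, and a mismatch would trigger the convert-and-build steps again instead of loading the stale engine.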

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work