TensorRT-LLM — Model Optimization for Local Inference (Chapter 12)

TensorRT-LLM targets high-performance inference on NVIDIA GPUs, with speedups in the range of 2-5× over unoptimized implementations depending on model, batch size, and hardware. It compiles a model into a TensorRT engine built from optimized CUDA kernels, applying graph optimization, layer fusion, and precision calibration along the way.

Installation requires matching your CUDA version:

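# Check the CUDA version supported by the driver
nvidia-smi | grep "CUDA Version"
# Expected: CUDA Version: 12.1 or 12.2
# (this is the highest version the driver supports; nvcc --version shows the installed toolkit)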

# Clone the repository (needed for the example scripts or a source build)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive

# Install the prebuilt wheel (the pip package is tensorrt_llm)
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
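
The version check can also be scripted. The sketch below parses the CUDA Version field from nvidia-smi output; the function name parse_cuda_version and the sample header line are illustrative and not part of any NVIDIA tool. Keep in mind that this field reports the highest CUDA version the installed driver supports, which is not necessarily the toolkit version on the machine.

```python
import re
from typing import Optional, Tuple


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the 'CUDA Version: X.Y' field, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Sample header line in the layout nvidia-smi prints (values are illustrative).
sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"
print(parse_cuda_version(sample))
```

To run it against a live system, pass in the stdout of subprocess.run(["nvidia-smi"], capture_output=True, text=True).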

Model compilation converts Hugging Face checkpoints into TensorRT-LLM engines:

# compile_model.py
# Sketch against the high-level LLM API. Import paths have moved between
# releases (older versions expose these names under tensorrt_llm.hlapi),
# so check them against your installed version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Quantization configuration.
# FP8 needs Ada Lovelace or Hopper GPUs (RTX 40 series, H100);
# Ampere parts such as RTX 30 and A100 have no FP8 tensor cores.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,            # 8-bit floating point
    kv_cache_quant_algo=QuantAlgo.FP8,   # 8-bit KV cache
)

# Load the Hugging Face checkpoint, quantize it, and build the engine
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    dtype="float16",
    quant_config=quant_config,
)

# Save the built engine for later loading
llm.save("llama-70b-trtllm")
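
FP8 is only usable on GPUs with FP8 tensor cores, which start at compute capability 8.9 (Ada Lovelace) and include 9.0 (Hopper). A small guard like the following can catch a mismatch before a long build. Both supports_fp8 and the lookup table are hypothetical helpers for illustration, not TensorRT-LLM functions.

```python
from typing import Dict, Tuple

# First compute capability with FP8 tensor cores (Ada Lovelace = 8.9, Hopper = 9.0).
FP8_MIN_CAPABILITY: Tuple[int, int] = (8, 9)

# Published compute capabilities for a few common GPUs.
KNOWN_GPUS: Dict[str, Tuple[int, int]] = {
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}


def supports_fp8(capability: Tuple[int, int]) -> bool:
    """True if a GPU with this (major, minor) capability has FP8 tensor cores."""
    return capability >= FP8_MIN_CAPABILITY


for name, cap in KNOWN_GPUS.items():
    status = "FP8 available" if supports_fp8(cap) else "no FP8 tensor cores"
    print(f"{name} (sm_{cap[0]}{cap[1]}): {status}")
```

On a live machine, torch.cuda.get_device_capability() returns a (major, minor) tuple that can be passed straight to supports_fp8.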

Multi-GPU tensor parallelism:

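# Tensor parallelism across 4 GPUs
# Flag names follow recent releases (older ones used --max_output_len); check --help
# Step 1: quantize to FP8 and shard the checkpoint (script from the cloned repo;
# the intermediate directory name is arbitrary)
python examples/quantization/quantize.py \
    --model_dir meta-llama/Llama-2-70b-hf \
    --output_dir ./llama-70b-ckpt-fp8-tp4 \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --tp_size 4

# Step 2: build one engine per GPU rank
# max_seq_len covers prompt plus generated tokens: 4096 + 1024 = 5120
trtllm-build \
    --checkpoint_dir ./llama-70b-ckpt-fp8-tp4 \
    --output_dir ./llama-70b-trtllm-4gpu \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 5120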
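
A quick memory budget shows why a 70B model needs both quantization and tensor parallelism. The sketch below estimates per-GPU weight memory and worst-case KV cache memory for the build settings above (128 sequences of 4096 prompt tokens plus 1024 generated tokens, split over 4 GPUs). It assumes the published Llama-2-70B shape (80 layers, 8 KV heads, head dimension 128), a nominal 70 billion parameters, and an even split across GPUs. Activations and runtime overhead are ignored, so treat the numbers as lower bounds. All helper names are illustrative.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    params: float   # total parameter count (nominal)
    layers: int     # transformer layers
    kv_heads: int   # key/value heads (grouped-query attention)
    head_dim: int   # dimension of each head


# Assumed shape: Llama-2-70B as published in its model config.
LLAMA2_70B = ModelShape(params=70e9, layers=80, kv_heads=8, head_dim=128)


def weight_gib_per_gpu(shape: ModelShape, bytes_per_param: int, tp_size: int) -> float:
    """Weights only, split evenly across tp_size GPUs."""
    return shape.params * bytes_per_param / tp_size / GIB


def kv_cache_gib_per_gpu(shape: ModelShape, bytes_per_elem: int,
                         tokens: int, tp_size: int) -> float:
    """Worst-case KV cache: one key and one value vector per layer, head, and token."""
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_elem
    return per_token * tokens / tp_size / GIB


tokens = 128 * (4096 + 1024)  # max batch size x (prompt + generated tokens)
for label, nbytes in [("FP16", 2), ("FP8", 1)]:
    w = weight_gib_per_gpu(LLAMA2_70B, nbytes, tp_size=4)
    kv = kv_cache_gib_per_gpu(LLAMA2_70B, nbytes, tokens, tp_size=4)
    print(f"{label}: weights {w:.1f} GiB/GPU, KV cache {kv:.1f} GiB/GPU")
# FP16: weights 32.6 GiB/GPU, KV cache 50.0 GiB/GPU
# FP8: weights 16.3 GiB/GPU, KV cache 25.0 GiB/GPU
```

Under these assumptions a fully loaded FP16 configuration would not fit on 80 GB GPUs, while FP8 weights plus an FP8 KV cache leave headroom. In practice the runtime allocates KV cache in pages, so real usage depends on actual sequence lengths.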

TensorRT-LLM ships its own inference runtime:

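# Streaming inference with the TensorRT-LLM LLM API
# (a sketch; check names against your installed version)
import asyncio
from tensorrt_llm import LLM, SamplingParams

async def stream(llm, prompt, sampling_params):
    printed = 0
    # Each update carries the text generated so far, so print only the new tail
    async for output in llm.generate_async(prompt, sampling_params, streaming=True):
        text = output.outputs[0].text
        print(text[printed:], end="", flush=True)
        printed = len(text)

if __name__ == "__main__":
    # The engine directory records its own tensor-parallel layout
    llm = LLM(model="llama-70b-trtllm-4gpu")
    sampling_params = SamplingParams(temperature=0.8, max_tokens=512)
    asyncio.run(stream(llm, "Explain attention mechanism", sampling_params))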
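
When evaluating a streaming setup, time to first chunk and total generation time are the two numbers usually worth recording. The helper below wraps any iterator of text chunks, so it can sit around the streaming loop above or around another engine. The names measure_stream and fake_stream are hypothetical, and the fake stream only stands in for a real model.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream and return (full_text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first chunk; report the total time in that case.
    if first_chunk_at is None:
        first_chunk_at = total
    return "".join(parts), first_chunk_at, total


def fake_stream() -> Iterator[str]:
    """Stand-in for a model: yields three chunks with a small delay each."""
    for piece in ["Attention ", "weights ", "tokens."]:
        time.sleep(0.01)
        yield piece


text, ttfc, total = measure_stream(fake_stream())
print(f"{text!r}: first chunk after {ttfc * 1000:.1f} ms, done after {total * 1000:.1f} ms")
```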

Comparison with vLLM (indicative only; actual figures depend on model, batch size, and hardware):

Metric	vLLM	TensorRT-LLM
Max throughput	High	Highest
Latency (p50)	~50ms	~20ms
Multi-GPU scaling	Good	Excellent
Model support	Broad	Optimized for Llama, GPT, Mistral
Configuration complexity	Medium	High
Update frequency	Weekly	Monthly
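
Figures like the p50 latency in the table are best reproduced on your own hardware and workload. Assuming you have collected per-request latencies in milliseconds, the standard library is enough to summarize them. Here latency_summary is an illustrative helper and the sample values are made up.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize request latencies; needs at least two samples."""
    ordered = sorted(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "p50": statistics.median(ordered),
        "p95": cuts[94],
        "max": ordered[-1],
    }


# Made-up samples: mostly fast requests with one slow outlier.
samples_ms = [18, 20, 22, 19, 21, 20, 95]
summary = latency_summary(samples_ms)
print(summary)
```

The median hides the outlier that p95 and max expose, which is why a single p50 figure is a weak basis for comparing two engines.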