12. TensorRT-LLM

Chapter 12 of 18 · 20 min

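TensorRT-LLM is NVIDIA's open-source library for high-performance LLM inference on NVIDIA GPUs. It compiles a model into an optimized TensorRT engine, applying graph optimization, layer fusion, and reduced-precision calibration along the way. Speedups of 2-5× over unoptimized implementations are commonly quoted, but the actual gain depends on the model, the GPU, the batch size, and the sequence lengths.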

Installation requires matching your CUDA version:

# Check the CUDA version (nvidia-smi reports the highest version the driver supports)
nvidia-smi | grep "CUDA Version"
# Expected: CUDA Version: 12.1 or 12.2

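# Clone the repository (needed for the example and conversion scripts)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive

# Install the prebuilt wheel from NVIDIA's package index
# (the Python package is named tensorrt_llm; tensorrtllm_backend is the separate Triton backend repository)
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com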
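
The version check above can also be scripted. The following is a minimal sketch that extracts the CUDA version from `nvidia-smi` header text; the helper name `parse_cuda_version`, the sample header line, and the set of accepted versions (taken from the comment above) are assumptions for illustration, not part of TensorRT-LLM.

```python
import re

# Versions this chapter expects; adjust to the release notes of your TensorRT-LLM version.
EXPECTED_VERSIONS = {(12, 1), (12, 2)}

def parse_cuda_version(smi_output: str):
    """Return (major, minor) parsed from nvidia-smi header text, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Illustrative header line in the format nvidia-smi prints (values are made up).
sample = "| NVIDIA-SMI 535.104.05   Driver Version: 535.104.05   CUDA Version: 12.2 |"

version = parse_cuda_version(sample)
print(version)                       # (12, 2)
print(version in EXPECTED_VERSIONS)  # True
```

In practice the text would come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` rather than a hard-coded string.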

Model compilation converts a Hugging Face checkpoint into a TensorRT-LLM engine:

# compile_model.py
# Sketch based on the high-level LLM API. Class and argument names follow
# recent TensorRT-LLM releases and may differ in yours; check the
# documentation for the version you installed.
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Quantization configuration
# FP8 needs Ada (RTX 40 series) or Hopper (H100) GPUs.
# Ampere parts (RTX 30 series, A100) have no FP8 tensor cores.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,            # 8-bit floating point weights and activations
    kv_cache_quant_algo=QuantAlgo.FP8,   # 8-bit floating point KV cache
)

# Compilation settings
build_config = BuildConfig(
    max_batch_size=128,
    max_input_len=4096,
)

# Convert, quantize, and build the engine
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    dtype="float16",
    quant_config=quant_config,
    build_config=build_config,
)
llm.save("llama-70b-trtllm")
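
To see why FP8 matters for a 70B model, it helps to work out the weight memory by hand. The sketch below is back-of-the-envelope arithmetic only: it counts weights, ignores the KV cache, activations, and runtime overhead, and uses 1 GB = 10^9 bytes. The function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

params_70b = 70e9  # nominal parameter count of a 70B model

fp16_gb = weight_memory_gb(params_70b, "fp16")
fp8_gb = weight_memory_gb(params_70b, "fp8")

print(fp16_gb)  # 140.0
print(fp8_gb)   # 70.0
```

At FP16 the weights alone exceed a single 80 GB GPU, which is one reason large models are usually quantized, sharded across GPUs, or both.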

Multi-GPU tensor parallelism:

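# Tensor parallelism across 4 GPUs
# Flag names follow recent TensorRT-LLM releases and may differ in yours;
# run each command with --help to confirm.

# Step 1: quantize to FP8 and shard the checkpoint for 4-way tensor parallelism
# (script path is relative to the cloned repository)
python examples/quantization/quantize.py \
    --model_dir meta-llama/Llama-2-70b-hf \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --tp_size 4 \
    --output_dir ./llama-70b-ckpt-fp8-tp4

# Step 2: build one engine per GPU
# max_seq_len = 4096 input tokens + 1024 generated tokens
trtllm-build \
    --checkpoint_dir ./llama-70b-ckpt-fp8-tp4 \
    --output_dir ./llama-70b-trtllm-4gpu \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 5120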
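
As an illustration of what splitting a model across 4 GPUs means, the sketch below simulates a column-parallel linear layer in plain Python: each simulated GPU holds one block of weight columns, computes its slice of the output independently, and the slices are concatenated. This is a simplified teaching model, not TensorRT-LLM's implementation; real tensor parallelism combines column-wise and row-wise splits and uses collective communication between devices. All function names here are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split w into `parts` equal column blocks, one per simulated GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across GPUs"
    width = n // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def column_parallel_matmul(x, w, parts):
    """Compute x @ w by giving each simulated GPU one column block of w."""
    shards = split_columns(w, parts)
    partial_outputs = [matmul(x, shard) for shard in shards]  # independent per GPU
    # Gather step: concatenate the per-GPU output slices in order.
    return [value for part in partial_outputs for value in part]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

result = column_parallel_matmul(x, w, parts=4)
print(result)  # [11.0, 14.0, 17.0, 20.0]
```

The sharded result matches the unsharded product, while each simulated GPU stores only a quarter of the weight matrix.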

TensorRT-LLM uses its own inference runtime:

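# Inference with the TensorRT-LLM LLM API
# Names follow recent releases and may differ in yours; verify against
# the documentation for your installed version.
import asyncio
from tensorrt_llm import LLM, SamplingParams

async def stream(llm, prompt, params):
    """Print generated text incrementally as it arrives."""
    printed = 0
    async for output in llm.generate_async(prompt, params, streaming=True):
        text = output.outputs[0].text          # cumulative text so far
        print(text[printed:], end="", flush=True)
        printed = len(text)

if __name__ == "__main__":  # guard needed because multi-GPU runs spawn worker processes
    llm = LLM(
        model="llama-70b-trtllm-4gpu",          # engine directory built earlier
        tokenizer="meta-llama/Llama-2-70b-hf",
        tensor_parallel_size=4,
    )
    params = SamplingParams(temperature=0.8, max_tokens=512)
    asyncio.run(stream(llm, "Explain attention mechanism", params))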
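
When consuming a streamed response, two numbers are usually worth recording: the time until the first chunk arrives and the overall chunk rate. The sketch below measures both for any iterable of text chunks. `fake_stream` is a hypothetical stand-in for a real streaming engine so the example runs without a GPU; its words and delays are made up.

```python
import time

def measure_stream(chunks):
    """Consume an iterable of text chunks.

    Returns (seconds_to_first_chunk, chunks_per_second, full_text).
    """
    start = time.perf_counter()
    first_chunk_time = None
    pieces = []
    for chunk in chunks:
        if first_chunk_time is None:
            first_chunk_time = time.perf_counter() - start
        pieces.append(chunk)
    total = time.perf_counter() - start
    rate = len(pieces) / total if total > 0 else 0.0
    return first_chunk_time, rate, "".join(pieces)

def fake_stream():
    """Hypothetical stand-in for a streaming engine: yields words with a delay."""
    for word in ["Attention ", "weights ", "tokens."]:
        time.sleep(0.01)  # simulated generation latency
        yield word

ttft, rate, text = measure_stream(fake_stream())
print(text)
print(f"first chunk after {ttft:.3f} s, {rate:.1f} chunks/s")
```

With a real engine, the same function can wrap whatever iterator yields the incremental text, provided each item is a string.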

Comparison with vLLM:

Metric                     vLLM      TensorRT-LLM
Max throughput             High      Highest
Latency (p50)              ~50 ms    ~20 ms
Multi-GPU scaling          Good      Excellent
Model support              Broad     Optimized for Llama, GPT, Mistral
Configuration complexity   Medium    High
Update frequency           Weekly    Monthly

EXERCISE

Compile the same model in TensorRT-LLM twice: once in FP16 and once with FP8 quantization. Measure the difference in throughput and latency, then estimate the quality impact by comparing perplexity on the same evaluation text.
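
For the perplexity part of the exercise, a minimal sketch of the calculation is shown below. Perplexity is the exponential of the negative mean per-token log-likelihood (natural log). The log-probability lists are made-up placeholder values, not measurements; in a real run they would come from scoring the same evaluation text with each engine. The helper names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability), using natural logs."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_change(baseline, candidate):
    """Fractional change of candidate relative to baseline."""
    return (candidate - baseline) / baseline

# Placeholder per-token log-probabilities (illustrative only, not real results).
fp16_logprobs = [-1.0, -2.0, -3.0]
fp8_logprobs = [-1.1, -2.1, -3.1]

ppl_fp16 = perplexity(fp16_logprobs)  # exp(2.0), about 7.39
ppl_fp8 = perplexity(fp8_logprobs)    # exp(2.1)

print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"FP8 perplexity:  {ppl_fp8:.3f}")
print(f"relative change: {relative_change(ppl_fp16, ppl_fp8):+.1%}")
```

A lower perplexity is better, so a small positive relative change for FP8 indicates a small quality loss relative to FP16.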