12. TensorRT-LLM
Chapter 12 of 18 · 20 min
TensorRT-LLM is NVIDIA's library for high-performance LLM inference on NVIDIA GPUs, often running 2-5× faster than a naive CUDA implementation of the same model. It compiles a model into an optimized TensorRT engine, applying graph optimization, layer fusion, and precision calibration automatically.
Installation requires matching your CUDA version:
# Check CUDA version
nvidia-smi | grep "CUDA Version"
# Expected: CUDA Version: 12.1 or 12.2
# (this is the highest CUDA version the driver supports, not the installed toolkit version)
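# Clone TensorRT-LLM for its example and checkpoint-conversion scripts
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
# Install the prebuilt wheel from NVIDIA's package index
# (the pip package is tensorrt_llm; tensorrtllm_backend is the separate Triton backend repository)
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com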
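The `nvidia-smi` check above can also be scripted before installing. The following is a minimal sketch, assuming the header keeps the `CUDA Version: X.Y` wording; the helper names `parse_cuda_version` and `driver_cuda_version` are illustrative and not part of any NVIDIA tool.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output: str):
    """Return (major, minor) parsed from nvidia-smi header text, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_cuda_version():
    """Run nvidia-smi if it is on PATH and parse its output; return None otherwise."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=False
    )
    return parse_cuda_version(result.stdout)
```

Calling `driver_cuda_version()` on a machine without an NVIDIA driver returns `None`, which makes it usable as a guard at the top of a setup script.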
Model compilation converts a Hugging Face checkpoint into a TensorRT-LLM engine:
# compile_model.py
# Sketch using the high-level LLM API (TensorRT backend). Import paths and class
# names have changed between releases, so check them against your installed version.
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Quantization configuration
# FP8 needs Ada Lovelace or Hopper GPUs (RTX 40 series, H100); Ampere parts such as
# the RTX 30 series and A100 have no FP8 tensor cores.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,           # 8-bit floating point weights and activations
    kv_cache_quant_algo=QuantAlgo.FP8,  # FP8 KV cache
)

# Compilation settings (TensorRT's builder optimization level is left at its default)
build_config = BuildConfig(
    max_batch_size=128,
    max_input_len=4096,
)

if __name__ == "__main__":
    # Convert, quantize, and build the engine
    llm = LLM(
        model="meta-llama/Llama-2-70b-hf",
        dtype="float16",
        quant_config=quant_config,
        build_config=build_config,
    )
    llm.save("llama-70b-trtllm")
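To make the effect of FP8 quantization concrete, the sketch below simulates rounding values to the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448) with a single per-tensor scale taken from the maximum absolute value. This is a simplified illustration under those assumptions, not TensorRT-LLM's calibration code; the function names are hypothetical and NaN encodings are ignored.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value
E4M3_MIN_EXP = -6     # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with m in [0.5, 1), so floor(log2(mag)) == e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between neighbours at this exponent
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_per_tensor(values):
    """Scale so the largest magnitude maps to 448, round to E4M3, then dequantize."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) * scale for v in values]
```

For example, `round_to_e4m3(0.3)` returns `0.3125`: with only three mantissa bits, neighbouring values near 0.3 are 0.03125 apart, which is the source of the quality loss that a perplexity evaluation is meant to detect.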
Multi-GPU tensor parallelism:
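# Tensor parallelism across 4 GPUs (run from the TensorRT-LLM repository root;
# script locations and flags vary by release, and the checkpoint directory name is illustrative)
# Step 1: quantize to FP8 and shard the checkpoint across 4 ranks
python examples/quantization/quantize.py \
    --model_dir meta-llama/Llama-2-70b-hf \
    --output_dir ./llama-70b-ckpt-fp8-tp4 \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --tp_size 4
# Step 2: build one engine per rank. --max_seq_len covers prompt plus generated
# tokens (4096 + 1024); Llama 2 was trained with a 4096-token context, so lower
# these values if the build warns about the position limit.
trtllm-build \
    --checkpoint_dir ./llama-70b-ckpt-fp8-tp4 \
    --output_dir ./llama-70b-trtllm-4gpu \
    --max_batch_size 128 \
    --max_input_len 4096 \
    --max_seq_len 5120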
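The idea behind tensor parallelism can be shown without any GPUs. The sketch below splits a weight matrix into column blocks, lets each simulated rank multiply the same input by its own block, and concatenates the partial outputs. It is a generic illustration of column-parallel matrix multiplication in plain Python, not TensorRT-LLM's implementation, and all function names are illustrative.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w given as a list of rows."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def split_columns(w, parts):
    """Split matrix w into `parts` equal column blocks, one per simulated GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across ranks"
    width = n // parts
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each rank multiplies x by its own column shard; outputs are concatenated."""
    shards = split_columns(w, parts)
    # On real hardware these products run concurrently, one per GPU.
    partial_outputs = [matmul(x, shard) for shard in shards]
    # Concatenation stands in for the all-gather collective.
    return [value for out in partial_outputs for value in out]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(column_parallel_matmul(x, w, parts=4))  # [11.0, 14.0, 17.0, 20.0]
```

Each rank stores only a quarter of the weight matrix, which is why sharding lets a model that is too large for one GPU fit across several.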
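TensorRT-LLM runs engines through its own inference runtime:
# Inference with the TensorRT-LLM LLM API (TensorRT backend); import paths vary
# by release, so check them against your installed version
import asyncio
from tensorrt_llm import LLM, SamplingParams

async def stream(llm, prompt):
    params = SamplingParams(temperature=0.8, max_tokens=512)
    printed = 0
    # Streaming inference: print only the newly generated suffix of each update
    async for output in llm.generate_async(prompt, sampling_params=params, streaming=True):
        text = output.outputs[0].text
        print(text[printed:], end="", flush=True)
        printed = len(text)

if __name__ == "__main__":  # guard needed because multi-GPU runs spawn worker processes
    # The tokenizer is loaded from the original checkpoint
    llm = LLM(model="llama-70b-trtllm-4gpu", tokenizer="meta-llama/Llama-2-70b-hf")
    asyncio.run(stream(llm, "Explain the attention mechanism"))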
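When streaming, the two numbers that matter most are the time until the first chunk arrives and the overall generation rate. The following is a minimal sketch of a timing helper that works on any iterable of text chunks; `measure_stream` is a hypothetical name, and a chunk is not necessarily one token, so treat the rate as approximate.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of chunks; report time to first chunk, total time, and rate."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival time of the first chunk
        count += 1
    total = clock() - start
    return {
        "time_to_first_chunk_s": None if first is None else first - start,
        "total_s": total,
        "chunks": count,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }
```

The `clock` parameter exists so the helper can be tested with a fake clock; in normal use it defaults to `time.perf_counter`.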
Comparison with vLLM:
| Metric | vLLM | TensorRT-LLM |
|---|---|---|
| Max throughput | High | Highest |
| Latency (p50) | ~50ms | ~20ms |
| Multi-GPU scaling | Good | Excellent |
| Model support | Broad | Optimized for Llama, GPT, Mistral |
| Configuration complexity | Medium | High |
| Update frequency | Weekly | Monthly |
EXERCISE
Compile the same model twice in TensorRT-LLM, once in FP16 and once with FP8 quantization. Measure the difference in throughput and latency between the two engines, then quantify the quality impact with a perplexity evaluation.
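For the quality comparison, perplexity is the exponential of the negative mean log-probability that the model assigns to each reference token. The following is a minimal sketch, assuming you have already collected per-token natural-log probabilities from each engine on the same evaluation text; the function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability of the reference tokens)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl
```

As a sanity check, a model that assigns probability 0.25 to every token has a perplexity of 4. Lower is better, so a positive `relative_increase` means the FP8 engine lost some quality relative to FP16.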