RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

TensorRT

TensorRT is NVIDIA's SDK for optimizing and deploying deep learning models on NVIDIA GPUs. It performs graph optimization, kernel fusion, and precision reduction (FP16, INT8, INT4) to cut latency and memory usage. Operators encounter TensorRT when they want to maximize inference throughput on RTX or data-center GPUs, typically by converting a model from PyTorch or ONNX into a TensorRT engine. The engine is hardware-specific by default: a build for an RTX 4090 won't run on an RTX 3060. Unlike llama.cpp, which loads a model file and runs it directly, TensorRT first compiles the model into an engine that its own runtime then executes.
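Kernel fusion is easiest to see in a tiny case. The sketch below folds an inference-time batch normalization into the preceding layer's weight and bias, so that "layer + batch norm + ReLU" becomes one multiply-add followed by a clamp. It is a scalar, single-channel illustration of the arithmetic only; the function names are hypothetical, and it does not reproduce TensorRT's actual fusion passes.

```python
import math

def layer_bn_relu_unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps, as an unoptimized graph would run them."""
    y = w * x + b                                          # 1x1 "convolution", one channel
    z = gamma * (y - mean) / math.sqrt(var + eps) + beta   # batch norm at inference time
    return max(0.0, z)                                     # ReLU

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm statistics into the layer's weight and bias."""
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta

def layer_bn_relu_fused(x, w_folded, b_folded):
    """One multiply-add plus a clamp: the fused equivalent."""
    return max(0.0, w_folded * x + b_folded)

# Hypothetical parameters, chosen only for illustration.
params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.25)
w_folded, b_folded = fold_batch_norm(**params)

for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
    unfused = layer_bn_relu_unfused(x, **params)
    fused = layer_bn_relu_fused(x, w_folded, b_folded)
    print(f"x={x:+.1f}  unfused={unfused:.6f}  fused={fused:.6f}")
```

Both paths produce the same output up to floating-point rounding, but the fused path touches the data once instead of three times.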

Deeper dive

TensorRT works by taking a trained model (often in ONNX format) and applying a series of optimizations. First, it fuses layers, combining operations like convolution + batch normalization + ReLU into a single kernel, to reduce launch overhead and memory traffic. Second, it selects the fastest kernel for each layer given the target GPU architecture (e.g., Ampere, Ada Lovelace). Third, it can run layers at lower precision (FP16, INT8, INT4), trading accuracy for speed; INT8 post-training quantization uses a calibration dataset to choose activation ranges, while FP16 needs no calibration. The output is a serialized engine file (.plan) that can be loaded and run with the TensorRT runtime. For LLMs, TensorRT-LLM extends this with paged attention, in-flight batching, and multi-GPU support. Operators should note that building an engine takes minutes to hours, and that by default the engine is tied to a specific CUDA version, TensorRT version, and GPU architecture. Rebuilding is required when any of those change.
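The calibration step can be illustrated with the simplest possible scheme: symmetric INT8 quantization where the scale is taken from the largest magnitude seen in a calibration set. This is a minimal sketch under that assumption; TensorRT's own calibrators are more elaborate, and the function names and sample values here are hypothetical.

```python
def calibrate_scale(calibration_values):
    """Pick a scale so the largest observed magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in q_values]

# Hypothetical activations recorded while running the calibration dataset.
calibration_values = [-2.0, -0.5, 0.0, 0.75, 1.27]
scale = calibrate_scale(calibration_values)

# The last value lies outside the calibrated range and will be clipped.
activations = [0.1, -1.9, 1.0, 3.0]
q = quantize(activations, scale)
restored = dequantize(q, scale)
errors = [abs(a - r) for a, r in zip(activations, restored)]

for a, code, r, e in zip(activations, q, restored, errors):
    print(f"value={a:+.3f}  int8={code:+4d}  restored={r:+.3f}  error={e:.3f}")
```

Values inside the calibrated range lose at most half a quantization step, while values outside it are clipped. That is why the calibration data should resemble real inputs: an unrepresentative set either clips useful values or wastes resolution.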

Practical example

An operator running Llama 3.1 8B on an RTX 4090 (24 GB VRAM) might use TensorRT-LLM to build an INT4 engine. The build process converts the Hugging Face model to TensorRT format, calibrates with a small dataset, and produces a ~5 GB engine. Inference then runs at ~200 tok/s, compared to ~100 tok/s with llama.cpp Q4_K_M on the same GPU. However, the engine is not portable: moving it to an RTX 3060 (12 GB) would require a new build targeting that GPU's compute capability (8.6 for the RTX 3060 versus 8.9 for the RTX 4090).
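The sizes in this example can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a round 8.0e9 parameters and counts only the packed weights, so it ignores quantization scales, any layers kept at higher precision, and runtime buffers. The helper names are hypothetical, and the rebuild check is a simplification that compares compute capability alone.

```python
def weight_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 8.0e9  # assumption: round figure for an "8B" model

sizes = {
    name: weight_gb(NUM_PARAMS, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

# Compute capabilities for the two GPUs named in the example.
COMPUTE_CAPABILITY = {"RTX 4090": (8, 9), "RTX 3060": (8, 6)}

def needs_rebuild(built_on, target):
    """Simplified check: a different compute capability forces a new build."""
    return COMPUTE_CAPABILITY[built_on] != COMPUTE_CAPABILITY[target]

for name, gb in sizes.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
print("Rebuild needed for RTX 3060:", needs_rebuild("RTX 4090", "RTX 3060"))
```

Packed INT4 weights come to about 4 GB under these assumptions, which is consistent with an engine of roughly 5 GB once everything stored alongside the weights is included.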

Workflow example

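In practice, an operator using TensorRT-LLM would: 1) Install TensorRT-LLM and its dependencies. 2) Convert the Hugging Face checkpoint with the convert_checkpoint.py script from the TensorRT-LLM Llama example: python convert_checkpoint.py --model_dir ./llama-8b --output_dir ./ckpt --dtype float16 --use_weight_only --weight_only_precision int4. 3) Run trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine to build the engine. 4) Serve with trtllm-serve ./engine --tokenizer ./llama-8b. Flag names have shifted between TensorRT-LLM releases, so check each command's --help output against the installed version. The build step is the main friction: it requires matching CUDA and TensorRT versions, and can fail if the GPU doesn't have enough VRAM for the intermediate representation. Operators often script this build as a one-time setup per GPU model.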
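One way to script the one-time build is to key each engine directory on everything that forces a rebuild, then build only when that directory is missing or empty. The sketch below shows the bookkeeping only and never invokes the build itself. The function names, cache layout, and version strings are hypothetical, and how the GPU name and versions are detected is left to the operator.

```python
import hashlib
from pathlib import Path

def engine_key(gpu_name, cuda_version, tensorrt_version, model_id):
    """Short, stable identifier for one (GPU, CUDA, TensorRT, model) combination."""
    raw = "|".join([gpu_name, cuda_version, tensorrt_version, model_id])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

def engine_dir(cache_root, gpu_name, cuda_version, tensorrt_version, model_id):
    """Directory where the engine for this combination would live."""
    key = engine_key(gpu_name, cuda_version, tensorrt_version, model_id)
    return Path(cache_root) / key

def needs_build(path):
    """True when the engine directory is missing or empty."""
    p = Path(path)
    return not (p.is_dir() and any(p.iterdir()))

# Hypothetical environment values; real scripts would detect these.
common = dict(cuda_version="12.4", tensorrt_version="10.3", model_id="llama-8b-int4")
dir_4090 = engine_dir("./engine-cache", "RTX 4090", **common)
dir_3060 = engine_dir("./engine-cache", "RTX 3060", **common)

print("RTX 4090 engine dir:", dir_4090)
print("RTX 3060 engine dir:", dir_3060)
print("Build needed for RTX 4090:", needs_build(dir_4090))
```

Because the GPU name is part of the key, the two cards get separate directories, and upgrading CUDA or TensorRT produces a new key instead of silently reusing a stale engine.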

Reviewed by Fredoline Eruo. See our editorial policy.
