What's the actual stack from CUDA to my chat UI? Where does each piece fit?

Reviewed May 15, 2026 · 2 min read
inference-stack · cuda · triton · vllm · tensorrt-llm · kernels

The answer

One paragraph. No hedging beyond what the data actually warrants.

5 layers. Each owns a specific concern. Most operators only interact with the top two.

┌──────────────────────────────────────────────────────────────────┐
│ Layer 5 — App                                                    │
│ (Aider, Cline, Open WebUI, Khoj, your custom Python script)      │
│ Concern: "I want to chat / code / search my docs"                │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                       OpenAI-compatible HTTP
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 4 — Serving runtime / API                                  │
│ (Ollama, vLLM, llama-cpp-server, LM Studio, MLX-server, TGI)     │
│ Concern: HTTP endpoint, request routing, batching, KV cache      │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                Library calls (C++, Python)
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 3 — Inference engine                                       │
│ (llama.cpp, TensorRT-LLM, MLX, ExLlamaV2, vLLM core)             │
│ Concern: Model loading, attention math, sampler, quantization    │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                  Kernels (.cu, .cpp, .mlir)
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 2 — Compute kernels                                        │
│ (Triton, CUTLASS, ROCm HIP, Metal MPS, MLX kernels)              │
│ Concern: Matmul, attention, fused ops on the actual hardware     │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                          Hardware API
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 1 — Hardware + driver                                      │
│ (CUDA driver, ROCm, Metal, NPU drivers)                          │
│ Concern: Tensor cores, memory bus, scheduler, power management   │
└──────────────────────────────────────────────────────────────────┘

Most operators only interact with Layers 4 and 5. You install Ollama (Layer 4), point Open WebUI at it (Layer 5), and chat. The other three layers are abstracted away.
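
As a sketch of how little Layer 5 + 4 glue that takes: assuming Ollama is running on its default port (11434) and exposing its OpenAI-compatible endpoint, a "custom Python script" at Layer 5 is just an OpenAI-shaped client pointed at it. The model tag below is a placeholder; use whatever ollama list shows on your machine.

    # Layer 5: an app that only knows the OpenAI-shaped HTTP API.
    # Layer 4: Ollama on localhost:11434 (its default port).
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # required by the client, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3.1",                      # placeholder: any model you've pulled
        messages=[{"role": "user", "content": "Which layer of the stack are you?"}],
    )
    print(resp.choices[0].message.content)

Swapping Ollama for vLLM, llama-cpp-server, or LM Studio means changing the base_url and nothing else; that is the whole point of the Layer 4/5 boundary.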

When you'd drop a layer down:

  • Drop to Layer 3 when you need a feature Ollama doesn't expose yet (e.g., NVFP4 support before Ollama ships it). Use llama.cpp directly with the right flags.

  • Drop to Layer 2 when you're writing custom kernels for a niche operation (e.g., a sparse attention variant) or doing performance tuning. Triton is the cleanest way to write custom GPU kernels from Python (see the sketch after this list).

  • Drop to Layer 1 when you're building your own inference engine or doing low-level GPU work. SASS / PTX reverse-engineering and the CUDA Oxide (Rust → CUDA) work happen here.
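
For a feel of what Layer 2 work looks like, here is roughly the Triton "hello world": a vector-add kernel, adapted from the Triton tutorials. Real Layer 2 work (fused attention, sparse variants) follows the same pattern of a grid of programs doing pointer arithmetic plus masked loads and stores; this is a sketch, not production code.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                      # which program in the grid
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                      # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
        return out

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    assert torch.allclose(add(x, y), x + y)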

The "when do you drop below TensorRT?" question from r/CUDA:

  • For inference serving in production: TensorRT-LLM (Layer 3) is usually the right floor. It compiles models down to optimized kernel stacks. Going lower (Layer 2 Triton) only makes sense for novel operations.
  • For research / kernel development: Triton (Layer 2) is the right level. Tracing through what TensorRT-LLM generates and beating it for your specific shape is real work, and it can sometimes pay off in 20-30% throughput gains.
  • For consumer / hobbyist: stay at Layer 3-4. llama.cpp + Ollama covers 95% of what you'd want without dropping below.

vLLM is a special case: it crosses the Layer 2-3-4 boundaries. PagedAttention is a Layer 2 kernel innovation, served through Layer 3-4 abstractions. That's why vLLM tends to be faster than llama.cpp under load: it has novel Layer 2 work that llama.cpp hasn't matched.
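
A rough sketch of the same engine worn two ways: below it is used as a Layer 3 library (no HTTP involved, PagedAttention running underneath), and the comment notes how the identical engine becomes a Layer 4 runtime when launched as a server. The model name is a placeholder; substitute one you actually have.

    # vLLM as a Layer 3 library: in-process, no serving runtime on top.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model id
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)

    # The same engine becomes a Layer 4 runtime when you start its
    # OpenAI-compatible server instead (e.g. `vllm serve <model>`).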

Operator decision rule:

  • If you can build your app entirely at Layer 5 + 4 (Ollama + an OpenAI-shaped client), stay there. Don't drop unless you're hitting a wall.
  • If you need a feature that's coming "soon" to a Layer 4 runtime, dropping to Layer 3 (llama.cpp directly or vLLM directly) usually gets it sooner. A minimal Layer 3 sketch follows this list.
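
What "llama.cpp directly" looks like in practice, via its Python bindings (llama-cpp-python): you load the GGUF yourself and get access to flags before any Layer 4 runtime wraps them. The model path and parameters here are placeholders.

    # Layer 3 directly: llama.cpp via its Python bindings, no serving runtime.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/your-model-q4_k_m.gguf",  # placeholder path to a GGUF
        n_gpu_layers=-1,   # offload everything the backend supports
        n_ctx=8192,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello from Layer 3"}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])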

Where we got the numbers

The layer abstraction model is standard in the ML systems literature (Triton paper, vLLM paper, TensorRT-LLM docs). PagedAttention as a Layer 2 innovation: Kwon et al., SOSP 2023. The "when do you drop below TensorRT?" question: r/CUDA, May 2026 discussion.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.