What's the actual stack from CUDA to my chat UI? Where does each piece fit?

Reviewed May 15, 2026 · 2 min read
inference-stack · cuda · triton · vllm · tensorrt-llm · kernels

The answer

One paragraph. No hedging beyond what the data actually warrants.

5 layers. Each owns a specific concern. Most operators only interact with the top two.

┌──────────────────────────────────────────────────────────────────┐
│ Layer 5 — App                                                    │
│ (Aider, Cline, Open WebUI, Khoj, your custom Python script)      │
│ Concern: "I want to chat / code / search my docs"                │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                       OpenAI-compatible HTTP
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 4 — Serving runtime / API                                  │
│ (Ollama, vLLM, llama-cpp-server, LM Studio, MLX-server, TGI)     │
│ Concern: HTTP endpoint, request routing, batching, KV cache      │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                Library calls (C++, Python)
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 3 — Inference engine                                       │
│ (llama.cpp, TensorRT-LLM, MLX, ExLlamaV2, vLLM core)             │
│ Concern: Model loading, attention math, sampler, quantization    │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                  Kernels (.cu, .cpp, .mlir)
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 2 — Compute kernels                                        │
│ (Triton, CUTLASS, ROCm HIP, Metal MPS, MLX kernels)              │
│ Concern: Matmul, attention, fused ops on the actual hardware     │
└────────────────────────────────────────────────────────────────┬─┘
                                                                 │
                                                          Hardware API
                                                                 │
┌────────────────────────────────────────────────────────────────┴─┐
│ Layer 1 — Hardware + driver                                      │
│ (CUDA driver, ROCm, Metal, NPU drivers)                          │
│ Concern: Tensor cores, memory bus, scheduler, power management   │
└──────────────────────────────────────────────────────────────────┘

Most operators only interact with Layers 4 and 5. You install Ollama (Layer 4), point Open WebUI at it (Layer 5), and chat. The other three layers are abstracted away.
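
As a sketch of how little Layer 5 + 4 glue that takes: assuming Ollama is running on its default port (11434) and exposing its OpenAI-compatible endpoint, a "custom Python script" at Layer 5 is just an OpenAI-shaped client pointed at it. The model tag below is a placeholder; use whatever ollama list shows on your machine.

    # Layer 5: an app that only knows the OpenAI-shaped HTTP API.
    # Layer 4: Ollama on localhost:11434 (its default port).
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # required by the client, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3.1",                      # placeholder: any model you've pulled
        messages=[{"role": "user", "content": "Which layer of the stack are you?"}],
    )
    print(resp.choices[0].message.content)

Swapping Ollama for vLLM, llama-cpp-server, or LM Studio means changing the base_url and nothing else; that is the whole point of the Layer 4/5 boundary.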

When you'd drop a layer down:

  • Drop to Layer 3 when you need a feature Ollama doesn't expose yet (e.g., NVFP4 support before Ollama ships it). Use llama.cpp directly with the right flags.

  • Drop to Layer 2 when you're writing custom kernels for a niche operation (e.g., a sparse attention variant) or doing performance tuning. Triton is the cleanest way to write custom GPU kernels from Python (see the sketch after this list).

  • Drop to Layer 1 when you're building your own inference engine or doing low-level GPU work. SASS / PTX reverse-engineering and the CUDA Oxide (Rust → CUDA) work happen here.
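
For a feel of what Layer 2 work looks like, here is roughly the Triton "hello world": a vector-add kernel, adapted from the Triton tutorials. Real Layer 2 work (fused attention, sparse variants) follows the same pattern of a grid of programs doing pointer arithmetic plus masked loads and stores; this is a sketch, not production code.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                      # which program in the grid
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                      # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
        return out

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    assert torch.allclose(add(x, y), x + y)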

The "when do you drop below TensorRT?" question from r/CUDA:

  • For inference serving in production: TensorRT-LLM (Layer 3) is usually the right floor. It compiles models down to optimized kernel stacks. Going lower (Layer 2 Triton) only makes sense for novel operations.
  • For research / kernel development: Triton (Layer 2) is the right level. Tracing through what TensorRT-LLM generates and beating it for your specific shape is real work, and it can sometimes pay off in 20-30% throughput gains.
  • For consumer / hobbyist: stay at Layer 3-4. llama.cpp + Ollama covers 95% of what you'd want without dropping below.

vLLM is a special case: it crosses the Layer 2-3-4 boundaries. PagedAttention is a Layer 2 kernel innovation, served through Layer 3-4 abstractions. That's why vLLM tends to be faster than llama.cpp under load: it has novel Layer 2 work that llama.cpp hasn't matched.
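
A rough sketch of the same engine worn two ways: below it is used as a Layer 3 library (no HTTP involved, PagedAttention running underneath), and the comment notes how the identical engine becomes a Layer 4 runtime when launched as a server. The model name is a placeholder; substitute one you actually have.

    # vLLM as a Layer 3 library: in-process, no serving runtime on top.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder model id
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)

    # The same engine becomes a Layer 4 runtime when you start its
    # OpenAI-compatible server instead (e.g. `vllm serve <model>`).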

Operator decision rule:

  • If you can build your app entirely at Layer 5 + 4 (Ollama + an OpenAI-shaped client), stay there. Don't drop unless you're hitting a wall.
  • If you need a feature that's coming "soon" to a Layer 4 runtime, dropping to Layer 3 (llama.cpp directly or vLLM directly) usually gets it sooner. A minimal Layer 3 sketch follows this list.
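
What "llama.cpp directly" looks like in practice, via its Python bindings (llama-cpp-python): you load the GGUF yourself and get access to flags before any Layer 4 runtime wraps them. The model path and parameters here are placeholders.

    # Layer 3 directly: llama.cpp via its Python bindings, no serving runtime.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/your-model-q4_k_m.gguf",  # placeholder path to a GGUF
        n_gpu_layers=-1,   # offload everything the backend supports
        n_ctx=8192,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Hello from Layer 3"}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])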

Where we got the numbers

The layer abstraction model is standard in the ML systems literature (Triton paper, vLLM paper, TensorRT-LLM docs). PagedAttention as a Layer 2 innovation: Kwon et al., SOSP 2023. The "when do you drop below TensorRT?" question: r/CUDA, May 2026 discussion.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.