COURSE · BLD · I016

Model Optimization for Local Inference

Learn model optimization for local inference through RunLocalAI's practical lens: optimization, quantization, pruning and speculative decoding, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

18 chapters · 12h · Builder track · By Fredoline Eruo
PREREQUISITES
  • B004
  • B012

Why this course matters

Model Optimization for Local Inference is for builders turning local models into working tools, agents and retrieval systems. It connects optimization, quantization, pruning and speculative decoding to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?
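The "what will it cost in memory" question can be roughed out before downloading anything. The sketch below is a back-of-the-envelope estimate, not a RunLocalAI tool: the function name is hypothetical, the 7-billion-parameter model is an illustrative assumption, and the result covers weights only (no KV cache, activations or runtime overhead). Real quantized files also store scales and other metadata, so their effective bits per weight are somewhat higher than the nominal figure.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width and
    nothing else (KV cache, activations, runtime buffers) is counted.
    """
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30                    # bytes -> GiB


if __name__ == "__main__":
    # Hypothetical 7B-parameter model at three common bit widths.
    for bits in (16, 8, 4):
        gib = estimate_weight_memory_gib(7e9, bits)
        print(f"7B parameters at {bits}-bit: ~{gib:.1f} GiB of weights")
```

Comparing the result against available VRAM or RAM, with headroom left for context and the runtime, gives a first answer to "will it run" that later chapters refine.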

What you will be able to do

By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.

How to use this course

Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as "Why Optimize?", "Quantization Formats Compared", "GPTQ Quantization" and "AWQ Quantization", and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.

CHAPTERS
  1. Why Optimize? · 15 min
     Optimization turns impossible deployments into practical workflows, not through magic but by addressing the specific bottlenecks that make local LLM inference infeasible.
  2. Quantization Formats Compared · 15 min
     GGUF prioritizes accessibility across hardware; GPTQ and AWQ prioritize maximum efficiency on compatible NVIDIA GPUs.
  3. GPTQ Quantization · 15 min
     GPTQ quantizes each weight matrix column by column and adjusts the remaining weights to compensate for the error introduced, which preserves model capability far better than naive round-to-nearest quantization.
  4. AWQ Quantization · 20 min
     AWQ exploits the insight that a small fraction of weights, identified from activation statistics, matters most: quantize aggressively where it costs little, preserve precision where it counts.
  5. GGUF Quantization · 20 min
     GGUF's design philosophy prioritizes accessibility, so models work everywhere from laptops to servers, but achieving peak performance requires hardware-specific tuning.
  6. Quantization Quality Tradeoffs · 15 min
     Perplexity provides a baseline, but task-specific evaluation matters more than aggregate metrics for real deployment decisions.
  7. Speculative Decoding · 20 min
     Speculative decoding trades extra compute for lower latency: a small draft model proposes tokens and the target model verifies them. With standard verification the output matches what the target model alone would produce; only a relaxed acceptance threshold trades accuracy for further speed.
  8. Draft Models · 20 min
     How closely the draft model agrees with the target is the primary determinant of speculative decoding speedup, while the target model determines how sensitive results are to the acceptance threshold.
  9. FlashAttention · 20 min
     FlashAttention exploits the fact that attention computation can be tiled: keeping intermediate results in fast on-chip memory sharply reduces the memory traffic that limits standard attention.
  10. PagedAttention · 20 min
     PagedAttention treats the KV cache like virtual memory: allocating blocks on demand and sharing them across requests turns memory waste into memory efficiency.
  11. vLLM Optimization · 20 min
     vLLM's configuration is workload-specific: throughput optimization and latency optimization require different settings, often contradictory ones.
  12. TensorRT-LLM · 20 min
     TensorRT-LLM's compilation step enables optimizations that are hard to apply at runtime: kernel fusion, operator fusion and precision calibration combine to outperform runtime-only solutions.
  13. Pruning: Structured vs Unstructured · 20 min
     The regularity of structured pruning determines its hardware acceleration potential; unstructured sparsity often provides no speedup because hardware cannot exploit the irregular pattern.
  14. Attention Sink · 20 min
     Attention sinks exploit model behavior to reduce memory requirements: models concentrate attention on the first few tokens and on recent ones, so keeping those and evicting the rest of the KV cache allows stable generation over much longer inputs.
  15. KV Cache Optimization · 20 min
     The KV cache dominates memory usage at long contexts. GQA provides immediate benefit; quantization provides additional gains; future research promises more aggressive compression.
  16. Prompt Caching · 20 min
     Prompt caching turns repetitive prefixes into nearly free computation: the more uniform your workload, the more benefit from caching.
  17. Batch Optimization · 20 min
     Batch optimization trades latency variance for throughput; knowing whether latency or throughput matters determines the right batching strategy.
  18. End-to-End Optimization Project · 25 min
     Optimization is iterative, and each technique's impact compounds with the others. The final system performs 5-10x better than the baseline through cumulative improvements.
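
The KV cache chapters claim that the cache dominates memory at long contexts and that GQA gives an immediate benefit. A rough size calculation makes that concrete. This is a minimal sketch under stated assumptions: the function name is hypothetical, the model shape (32 layers, 32 attention heads, head dimension 128) is illustrative rather than taken from the course, and the formula ignores allocator overhead and paging granularity.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes assumes an FP16/BF16 cache
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size in bytes.

    Each layer stores one key and one value vector (hence the factor 2)
    per KV head, per token, per sequence in the batch.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


if __name__ == "__main__":
    # Illustrative shape: 32 layers, head_dim 128, 8192-token context.
    full_mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=8192)
    with_gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=8192)
    print(f"32 KV heads: {full_mha / 2**30:.2f} GiB")
    print(f" 8 KV heads: {with_gqa / 2**30:.2f} GiB")
```

Under these assumptions, sharing keys and values across groups of four query heads cuts the cache to a quarter of its size, and halving bytes_per_value through cache quantization would halve it again. The cache also grows linearly with context length and batch size, which is why it overtakes the weights in long-context or high-concurrency workloads.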