Model Optimization for Local Inference
Learn model optimization for local inference through RunLocalAI's practical lens: quantization, pruning, speculative decoding, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.
Why this course matters
Model Optimization for Local Inference is for builders turning local models into working tools, agents and retrieval systems. It connects optimization, quantization, pruning and speculative decoding to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?
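The "will it run, what will it cost in memory" question can be approximated before downloading anything. The sketch below is a back-of-the-envelope estimate, not a measurement: the function names and the 20% overhead factor are illustrative assumptions, and real usage also depends on the runtime, the KV cache and the context length.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float,
                               overhead: float = 1.2) -> float:
    """Rough memory needed to hold model weights, in GiB.

    overhead is an assumed fudge factor for runtime buffers and
    quantization metadata; it is not a measured value.
    """
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes * overhead / 2**30


def fits(n_params: float, bits_per_weight: float, available_gib: float,
         overhead: float = 1.2) -> bool:
    """True if the estimated weight footprint fits in the given memory."""
    return estimate_weight_memory_gib(n_params, bits_per_weight, overhead) <= available_gib


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model on a machine with 8 GiB free.
    for bits in (16, 8, 4):
        size = estimate_weight_memory_gib(7e9, bits)
        print(f"{bits:>2}-bit: ~{size:.1f} GiB, fits in 8 GiB: {fits(7e9, bits, 8)}")
```

Treat the result as a first filter only, then confirm with the actual memory reading from your runtime once the model is loaded.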
What you will be able to do
By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.
How to use this course
Start at chapter one if the topic is new. If you already have a working stack, scan chapters such as "Why Optimize?", "Quantization Formats Compared", "GPTQ Quantization" and "AWQ Quantization", and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.
- 01 Why Optimize? (15 min): Optimization turns impossible deployments into practical workflows, not through magic but by addressing the specific bottlenecks that make local LLM inference infeasible.
- 02 Quantization Formats Compared (15 min): GGUF prioritizes accessibility across hardware; GPTQ and AWQ prioritize maximum efficiency on compatible NVIDIA GPUs.
- 03 GPTQ Quantization (15 min): GPTQ quantizes weights column by column and adjusts the remaining weights to compensate for each rounding error, which preserves model capability far better than naive round-to-nearest quantization.
- 04 AWQ Quantization (20 min): AWQ exploits the insight that most weights matter little: it uses activation statistics to find the small fraction that does, quantizes aggressively elsewhere, and preserves precision where it counts.
- 05 GGUF Quantization (20 min): GGUF's design philosophy prioritizes accessibility, so models work everywhere from laptops to servers, but achieving peak performance requires hardware-specific tuning.
- 06 Quantization Quality Tradeoffs (15 min): Perplexity provides a baseline, but task-specific evaluation matters more than aggregate metrics for real deployment decisions.
- 07 Speculative Decoding (20 min): Speculative decoding trades extra draft computation for lower latency. With standard acceptance sampling the output distribution matches the target model's, so quality changes only if the acceptance rule is relaxed.
- 08 Draft Models (20 min): Draft model quality, meaning how often the target accepts its proposals, is the primary determinant of speculative decoding speedup; how sensitive that acceptance rate is depends on the target model.
- 09 FlashAttention (20 min): FlashAttention exploits the fact that attention computation can be tiled: keeping intermediate results in fast on-chip memory avoids most of the memory traffic that bottlenecks standard attention.
- 10 PagedAttention (20 min): PagedAttention treats the KV cache like virtual memory: allocating blocks on demand and sharing them across requests turns memory waste into usable capacity.
- 11 vLLM Optimization (20 min): vLLM's configuration is workload-specific: throughput optimization and latency optimization require different, often contradictory, settings.
- 12 TensorRT-LLM (20 min): TensorRT-LLM's compilation step enables optimizations that are unavailable at runtime: kernel and operator fusion combine with precision calibration to exceed runtime-only solutions.
- 13 Pruning: Structured vs Unstructured (20 min): The regularity of structured pruning determines its hardware acceleration potential; unstructured sparsity often provides no speedup because hardware cannot exploit the irregular pattern.
- 14 Attention Sink (20 min): Attention sinks exploit model behavior to reduce memory requirements: attention concentrates on a few initial tokens and the most recent context, so keeping only those in the cache allows stable generation over much longer streams.
- 15 KV Cache Optimization (20 min): KV cache dominates memory usage at long contexts. GQA provides immediate benefit, quantization provides additional gains, and ongoing research targets more aggressive compression.
- 16 Prompt Caching (20 min): Prompt caching turns repeated prefixes into nearly free computation: the more uniform your workload, the greater the benefit.
- 17 Batch Optimization (20 min): Batch optimization trades latency predictability for throughput; knowing whether latency or throughput matters determines the right batching strategy.
- 18 End-to-End Optimization Project (25 min): Optimization is iterative, and each technique's impact compounds with the others. In the course project, the final system performs 5-10x better than the baseline through cumulative improvements.
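As a taste of the arithmetic behind the KV cache chapter, the sketch below estimates cache size for a hypothetical transformer and shows how GQA and cache quantization shrink it. The model shape (32 layers, head dimension 128, 8,192-token context) is an illustrative assumption, not a specific model covered in the course, and the formula ignores runtime-specific padding and block allocation.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading 2 accounts for one key tensor and one value tensor
    per layer. bytes_per_elem=2 corresponds to a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical shape: 32 layers, head_dim 128, 8192-token context.
    full_mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=8192)
    with_gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=8192)
    gqa_8bit = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=8192,
                              bytes_per_elem=1)
    print(f"32 KV heads, 16-bit: {full_mha / GIB:.2f} GiB")
    print(f" 8 KV heads, 16-bit: {with_gqa / GIB:.2f} GiB")
    print(f" 8 KV heads,  8-bit: {gqa_8bit / GIB:.2f} GiB")
```

Because the cache grows linearly with context length and batch size, doubling either doubles these figures, which is why long contexts make the KV cache the dominant memory cost.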