Custom LLM Architecture Design
Learn custom LLM architecture design through RunLocalAI's practical lens: transformers, attention and PyTorch, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.
Why this course matters
Custom LLM Architecture Design is for operators making local AI reliable, measurable and cheaper to run. It connects transformer architecture, attention and PyTorch to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?
What you will be able to do
By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.
How to use this course
Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as Transformer Architecture Review, Attention Mechanisms, Multi-Head Attention, and FlashAttention Implementation, and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.
- 01 Transformer Architecture Review: The transformer scales predictably with depth (more layers) and width (larger d_model), but attention's quadratic cost in sequence length limits practical context length. Architectural innovations such as FlashAttention and state space models address this limitation. (15 min)
- 02 Attention Mechanisms: Attention's power comes from differentiable, learned alignment between positions. Full attention compares every position with every other, so its O(n²) cost in sequence length is unavoidable. PyTorch's built-in `F.scaled_dot_product_attention` dispatches to optimized kernels on modern hardware. (15 min)
- 03 Multi-Head Attention: Multi-head attention enables parallel representation learning. With 32 heads, the model can learn up to 32 different attention patterns at once (for example syntax, semantics, coreference or spatial relationships) rather than forcing one attention matrix to capture everything. (15 min)
- 04 FlashAttention Implementation: FlashAttention solves attention's memory bottleneck by computing in tiles that fit in fast on-chip GPU SRAM, so the full N×N score matrix is never written to GPU main memory. This reduces attention memory from O(N²) to O(N). It recomputes some values in the backward pass, but the reduced memory traffic usually makes it faster in wall-clock time, not slower. (15 min)
- 05 Rotary Position Embedding: RoPE encodes position by rotating the query and key vectors, so the attention dot product depends only on the relative offset between positions. Models with RoPE tend to extend to longer contexts more gracefully than models with learned absolute positional embeddings, usually with additional frequency scaling or fine-tuning. (20 min)
- 06 SwiGLU Activation: SwiGLU's gating mechanism lets the network selectively pass information through the FFN: one projection, passed through SiLU (Swish), multiplies a second projection elementwise. Unlike ReLU's hard threshold, the gate is smooth, which gives smoother gradient flow across layers. (15 min)
- 07 Transformer Block Design: Pre-norm architecture with RMSNorm is the modern standard. RMSNorm omits mean centering and the bias term and normalizes only by the root mean square, which keeps training stable with less computation and fewer parameters than LayerNorm. (15 min)
- 08 Mixture of Experts: MoE scales model parameters faster than compute per token. A 16-expert MoE layer with top_k=2 holds 16x the FFN parameters of a single expert but activates only 2 experts per token, so total expert parameters are 8x the active ones. Turning that ratio into real throughput assumes well-balanced routing. (15 min)
- 09 Expert Routing: Routing is fundamentally a per-token assignment problem. The router learns to send similar inputs to the same experts, but because each decision is made per token, nothing in the mechanism itself ensures global balance. Auxiliary load-balancing losses are the usual remedy in MoE training. (15 min)
- 10 Load Balancing: Load balancing is a multi-objective optimization problem. The router must simultaneously (1) select good experts for each token and (2) distribute tokens evenly across experts. These objectives can conflict, so the weight given to the balancing term is a key hyperparameter. (15 min)
- 11 Mamba State Space Model: Mamba's "selective" property is crucial: unlike standard SSMs, where the parameters are fixed across time steps, Mamba makes B, C and the step size Δ functions of the input (A stays a learned constant, but its discretized form varies through Δ). This lets the model ignore or emphasize state based on content, enabling content-aware filtering. (15 min)
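- 12 Selective State Space: The "selective" mechanism gives Mamba an LSTM-like ability to gate history. If Δ for a token is very small, the discretized transition stays near identity (exp(ΔA) ≈ 1) and the input term shrinks (ΔB ≈ 0), so the existing state carries forward and that token's information is excluded. A large Δ does the opposite: it decays the old state and emphasizes the current token. This content-aware retention or decay is not available in time-invariant SSMs. (20 min)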
- 13 Comparing Architectures: Architecture decisions made at design time are expensive to undo after training. GQA cuts KV cache memory in proportion to the reduction in KV heads (for example, 75% with 8 KV heads for 32 query heads), typically with negligible quality impact, making it the default choice for production models above 7B parameters. (25 min)
- 14 Custom Attention Patterns: Custom attention patterns trade generality for efficiency. The key is ensuring the pattern matches the data structure: sparse for text with long-range dependencies, stride-based for spatial data, block-based for very long sequences. (25 min)
- 15 Grouped Query Attention: GQA provides a strong quality/memory tradeoff for production models. A common configuration is 8 KV heads; the right count depends on the number of query heads and how the model is sharded across devices. Quality tends to degrade as the KV head count approaches 1 (multi-query attention), so treat any reduction below 4 KV heads as typically unacceptable unless your own evaluation shows otherwise. (25 min)
- 16 Sliding Window Attention: Sliding window attention processes long sequences with a KV cache bounded by the window size rather than the sequence length. The key is that the effective receptive field grows with depth: a stack of 32 layers with window_size=512 yields a theoretical receptive field of about 16K tokens (32 × 512 = 16,384) even though each layer only attends locally. (25 min)
- 17 Training Stability: Training stability is not a one-time configuration but a continuous monitoring process. Implement gradient monitoring, checkpoint saving on anomalies, and conservative initialization. A 7B model trained for 100K steps represents thousands of dollars, so invest in stability infrastructure. (25 min)
- 18 Scaling Laws: Scaling laws provide quantitative guidance for resource allocation. The Chinchilla rule of thumb (roughly 20 training tokens per parameter) remains a useful general guideline for compute-optimal training, but custom architectures may have different optimal ratios because architectural choices affect parameter efficiency. (30 min)
- 19 Architecture Prototyping: A prototype that doesn't represent production conditions is worse than useless: it gives false confidence. Invest in prototypes that mirror the target scale, with sequence lengths, batch sizes and training durations that expose failure modes before they derail full-scale training runs. (25 min)
- 20 Benchmarking Custom Architecture: Benchmarking requires statistical rigor. Throughput can vary by 10-20% with GPU state; use 20+ warmup iterations and report variance. Quality benchmarks on generative tasks need multiple runs when sampling with nonzero temperature, because single runs are noisy. (30 min)
- 21 Model Sizing: Model sizing requires balancing multiple constraints: target quality, compute budget, memory for inference, and deployment requirements. Use parameter counting tools that include all components (embeddings, layer norms, and LM heads) to avoid surprises at scale. (30 min)
- 22 Training Infrastructure: Custom architectures require custom infrastructure. FSDP handles sharding, but custom attention patterns may need specialized kernels. Invest in gradient monitoring, checkpoint management, and training diagnostics from day one: a 100B parameter training run represents weeks of compute and millions of dollars. (30 min)
- 23 Architecture Documentation: Documentation is a first-class architectural component. Decisions without documentation are lost knowledge. Use Architecture Decision Records (ADRs), self-documenting configurations, and inline code documentation to preserve institutional knowledge. (30 min)
- 24 Custom Architecture Project: This project demonstrates the complete workflow from specification to validation. Each decision (sliding window for layers 0-15, dilated for 16-27, full for 28-31) reflects tradeoffs documented in ADRs. The architecture succeeds because decisions are explicit and validated. (35 min)
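
To make chapter 07's point about RMSNorm concrete, here is a minimal pure-Python sketch of the normalization step. It is an illustration rather than the course's implementation; the function name `rms_norm` and the `eps` default are assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by its root mean square; no mean subtraction, no bias."""
    # Mean of squares over the feature dimension (eps avoids division by zero).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # One learned gain per feature is the only parameter.
    return [w * v / rms for v, w in zip(x, weight)]


# A constant vector survives RMSNorm; LayerNorm's mean centering would zero it.
print(rms_norm([1.0, 1.0], [1.0, 1.0]))
print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Because the mean is never subtracted, a constant input keeps its sign and scale, which is the behavioral difference from LayerNorm that the chapter summary refers to.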
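
The parameter arithmetic in chapter 08 can be checked with a few lines. This sketch assumes SwiGLU experts with three bias-free projections and a linear router; the dimensions (`d_model=4096`, `d_ff=14336`) are hypothetical values chosen for illustration.

```python
def swiglu_ffn_params(d_model, d_ff):
    """Parameters in one SwiGLU FFN: gate, up and down projections (no biases assumed)."""
    return 3 * d_model * d_ff


def moe_layer_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active-per-token) parameters for one MoE layer."""
    expert = swiglu_ffn_params(d_model, d_ff)
    router = d_model * n_experts  # linear router, one logit per expert
    total = n_experts * expert + router
    active = top_k * expert + router
    return total, active


total, active = moe_layer_params(d_model=4096, d_ff=14336, n_experts=16, top_k=2)
print(f"total: {total:,}  active per token: {active:,}  ratio: {total / active:.2f}")
```

The ratio lands just under 8 because the router is always active. Memory has to hold the total, while per-token compute follows the active count.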
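
The role of Δ in chapter 12 is easier to see on a one-dimensional state. The sketch below assumes a scalar state space model with zero-order-hold discretization; `a`, `b` and the function name `ssm_step` are hypothetical, and a real Mamba layer computes Δ from the input rather than taking it as an argument.

```python
import math


def ssm_step(h_prev, x, delta, a=-1.0, b=1.0):
    """One recurrent step h = a_bar * h_prev + b_bar * x for a scalar SSM."""
    a_bar = math.exp(delta * a)       # discretized state transition
    b_bar = (a_bar - 1.0) / a * b     # zero-order-hold input weight
    return a_bar * h_prev + b_bar * x


# Small delta: state is carried forward almost unchanged, the input is nearly ignored.
h_small = ssm_step(h_prev=5.0, x=1.0, delta=0.001)
# Large delta: old state is forgotten, the new input dominates.
h_large = ssm_step(h_prev=5.0, x=1.0, delta=10.0)
print(f"small delta -> {h_small:.4f}, large delta -> {h_large:.4f}")
```

With the previous state at 5.0 and the input at 1.0, the small step stays close to 5 and the large step lands close to 1, which is the per-token keep-or-overwrite choice that selectivity provides.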
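
For the KV cache claims in chapters 13 and 15, a rough sizing helper answers "what will it cost in memory" before anything is installed. This is a sketch under stated assumptions: fp16 storage (2 bytes per element), no cache quantization or paging overhead, and a hypothetical 32-layer model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a given context length."""
    # The factor 2 covers the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def gqa_saving(n_query_heads, n_kv_heads):
    """Fraction of KV cache saved relative to full multi-head attention."""
    return 1 - n_kv_heads / n_query_heads


# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096-token context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")
print(f"saving: {gqa_saving(32, 8):.0%}")
```

The saving depends only on the ratio of KV heads to query heads, so the same helper can be rerun for any head configuration under consideration.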
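
The receptive-field figure in chapter 16 is a simple product, and the memory benefit is a simple minimum. The helper names below are hypothetical, and the receptive field is an upper bound on how far information can propagate, not a measure of how well the model actually uses distant tokens.

```python
def receptive_field(n_layers, window_size):
    """Upper bound on how many tokens back information can travel through the stack."""
    return n_layers * window_size


def cached_positions(seq_len, window_size):
    """Positions each layer must keep in its KV cache under a sliding window."""
    return min(seq_len, window_size)


print(receptive_field(n_layers=32, window_size=512))       # 16384
print(cached_positions(seq_len=100_000, window_size=512))  # 512
```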
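
Chapter 18's rule of thumb turns into a quick budget estimate. The sketch uses the 20 tokens-per-parameter ratio from the chapter summary and, as an added assumption, the widely used approximation of about 6 FLOPs per parameter per training token; both are rough guides, not exact laws.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training tokens for a dense model."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Rough training cost using the ~6 * N * D approximation (an assumption)."""
    return 6 * n_params * n_tokens


n_params = 7_000_000_000
n_tokens = chinchilla_tokens(n_params)
print(f"tokens: {n_tokens:.2e}, FLOPs: {training_flops(n_params, n_tokens):.2e}")
```

Dividing the FLOP estimate by the sustained throughput of the available hardware gives a first-order training time, which is usually enough to decide whether an experiment is feasible locally.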
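
The benchmarking habit from chapter 20 (warm up, repeat, report variance) fits in a small harness. This is a CPU-only sketch using the standard library; `benchmark` and `toy_workload` are hypothetical names. On a GPU the timed function would also need to wait for the device to finish, for example with `torch.cuda.synchronize()` in PyTorch, before the clock is read.

```python
import statistics
import time


def benchmark(fn, warmup=20, iters=50):
    """Time fn after warmup and report mean, standard deviation and relative spread."""
    for _ in range(warmup):
        fn()  # warmup runs are discarded
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    cv = stdev / mean if mean else 0.0  # coefficient of variation
    return {"mean_s": mean, "stdev_s": stdev, "cv": cv, "n": len(samples)}


def toy_workload():
    # Stand-in for a model forward pass.
    sum(i * i for i in range(10_000))


result = benchmark(toy_workload)
print(f"{result['mean_s'] * 1e3:.3f} ms +/- {result['stdev_s'] * 1e3:.3f} ms (n={result['n']})")
```

Reporting the spread alongside the mean makes it clear whether a difference between two architectures is larger than run-to-run noise.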