04. Rank Selection

Chapter 4 of 24 · 15 min

The rank parameter r in LoRA controls the expressive capacity of the adaptation. Higher ranks can capture more complex behavioral modifications but require more parameters and training data. Lower ranks are more parameter-efficient but may underfit the target behavior.

The relationship between rank and performance follows a pattern observed across many model sizes and tasks: modest ranks often achieve surprisingly good results. Ranks of 4-16 frequently capture sufficient variation for many classification and instruction-following tasks. Complex tasks requiring nuanced stylistic adaptation may benefit from higher ranks of 32-64.

Memory for the LoRA parameters scales linearly with rank, and so do the gradients and optimizer states kept for them. In mixed-precision training, the optimizer typically maintains 32-bit states for all trainable parameters regardless of the precision of the weights; Adam-style optimizers keep two such states per parameter. Doubling the rank therefore roughly doubles the optimizer memory footprint for the LoRA parameters.

A practical approach is to start with a conservative rank (8 or 16), evaluate performance, and increase the rank only if underfitting is observed. For production systems, sweeping a few values (4, 8, 16, 32) provides the data needed to choose a rank for the specific task.
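One way to turn such a sweep into a decision is to pick the smallest rank whose score is within a small tolerance of the best one. The sketch below assumes that rule; the function name `select_rank`, the `tolerance` default, and the example accuracies are hypothetical and chosen only for illustration, not measured.

```python
from typing import Dict


def select_rank(accuracy_by_rank: Dict[int, float], tolerance: float = 0.005) -> int:
    """Return the smallest rank whose accuracy is within `tolerance` of the best."""
    if not accuracy_by_rank:
        raise ValueError("accuracy_by_rank must not be empty")
    best = max(accuracy_by_rank.values())
    good_enough = [r for r, acc in accuracy_by_rank.items() if acc >= best - tolerance]
    return min(good_enough)


# Placeholder numbers for illustration only, not measured results.
example_sweep = {4: 0.881, 8: 0.902, 16: 0.905, 32: 0.906}
chosen_rank = select_rank(example_sweep)
print(f"Selected rank: {chosen_rank}")
```

A tighter tolerance favors higher ranks; a looser one favors parameter efficiency. The right value depends on how much accuracy the application can trade for a smaller adapter.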

QLoRA extends this by quantizing the frozen base model weights to 4-bit while keeping the LoRA matrices in higher precision (typically bf16). Because the base model then occupies far less memory, rank can be chosen largely on task requirements rather than on what fits in memory, although adapter and optimizer memory still grow with rank.
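A rough back-of-the-envelope calculation shows why quantizing the base model changes the memory picture. The sketch below assumes a hypothetical 7-billion-parameter base model, counts only weight storage, and ignores quantization constants, activations, and optimizer state. The function name and all sizes are illustrative assumptions.

```python
def base_weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Memory to store `num_params` weights at the given bit width, in GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7B-parameter base model (assumption for illustration).
BASE_PARAMS = 7_000_000_000

bf16_gb = base_weight_memory_gb(BASE_PARAMS, 16)
four_bit_gb = base_weight_memory_gb(BASE_PARAMS, 4)

# Adapter weights stay in bf16. Assumes one 4096 x 4096 adapted matrix
# in each of 32 layers, the same defaults as the exercise code.
for rank in (8, 64):
    adapter_params = 2 * rank * 4096 * 32
    adapter_gb = base_weight_memory_gb(adapter_params, 16)
    print(f"rank {rank:2d}: base bf16 {bf16_gb:.1f} GB, "
          f"base 4-bit {four_bit_gb:.1f} GB, adapter {adapter_gb:.4f} GB")
```

Under these assumptions the adapter is a tiny fraction of the base weights at either rank, which is why the base model's precision, not the rank, dominates the weight-memory budget.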

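The scaling factor alpha (α) in the LoRA formulation also deserves attention. The low-rank update BA is multiplied by α/r, so α controls the magnitude of the adaptation's contribution to the output. A common convention sets alpha equal to rank, which gives a scaling factor of 1. The adaptation starts at zero because B is initialized to zero, not because of the choice of α. Since α/r simply rescales the update, changing α has an effect roughly similar to changing the learning rate for the adapter.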
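To make the role of α and of the zero initialization concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear map, y = Wx + (α/r)·B(Ax). The toy dimensions and the helper names `matvec` and `lora_forward` are assumptions for illustration; real implementations use a tensor library.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x), where r is the number of rows of A."""
    rank = len(A)
    scale = alpha / rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r = 4, 4, 2  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # r x k, random init
B = [[0.0] * r for _ in range(d)]                               # d x r, zero init
x = [1.0, 2.0, 3.0, 4.0]

# With B = 0 the adapted output equals the base output for any alpha.
assert lora_forward(W, A, B, x, alpha=r) == matvec(W, x)
assert lora_forward(W, A, B, x, alpha=4 * r) == matvec(W, x)

# Once B is nonzero, alpha / r sets the size of the update.
B[0][0] = 1.0
delta_1 = lora_forward(W, A, B, x, alpha=r)[0] - matvec(W, x)[0]
delta_2 = lora_forward(W, A, B, x, alpha=2 * r)[0] - matvec(W, x)[0]
print(f"update at alpha=r: {delta_1:.4f}, at alpha=2r: {delta_2:.4f}")
```

Doubling α while holding r fixed doubles the update's contribution, which is why α is usually tuned together with the learning rate rather than independently.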

EXERCISE

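Create a small experiment comparing ranks 4, 8, 16, and 32 on a binary classification task. Measure accuracy and training time for each configuration, and check whether returns diminish at higher ranks. The scaffold below records the results and estimates parameter counts and optimizer memory; the accuracy and timing fields are placeholders to fill in from your own training runs.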

# rank_comparison.py
from dataclasses import dataclass
from typing import List

@dataclass
class RankExperiment:
    rank: int
    alpha: int
    trainable_params: int
    accuracy: float
    training_time_seconds: float

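def estimate_lora_params(
    layer_dim: int,
    rank: int,
    num_layers: int = 32
) -> int:
    """Estimate total trainable LoRA parameters across all layers.

    Assumes one adapted square (layer_dim x layer_dim) matrix per layer.
    """
    # LoRA: W + BA where B is d×r, A is r×k. With d = k = layer_dim,
    # each adapted matrix adds r * (d + k) = 2 * r * layer_dim parameters.
    return 2 * rank * layer_dim * num_layers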

def run_rank_comparison(
    layer_dim: int,
    ranks: List[int],
    num_layers: int = 32
) -> List[RankExperiment]:
    results = []
    for r in ranks:
        params = estimate_lora_params(layer_dim, r, num_layers)
        results.append(RankExperiment(
            rank=r,
            alpha=r,  # common convention
            trainable_params=params,
            accuracy=0.0,  # would be filled by actual training
            training_time_seconds=0.0
        ))
    return results

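# Compare memory requirements across ranks
def estimate_training_memory(
    rank: int,
    layer_dim: int = 4096,
    num_layers: int = 32,
    precision: str = "bf16"
) -> dict:
    """Estimate adapter weight and optimizer memory for LoRA training.

    Assumes one adapted layer_dim x layer_dim matrix per layer and an
    Adam-style optimizer. Activation memory is not modeled.
    """
    lora_params = 2 * rank * layer_dim * num_layers

    # Adapter weights are stored in the training precision
    bytes_per_weight = {"fp32": 4, "bf16": 2, "fp16": 2}[precision]
    weight_bytes = lora_params * bytes_per_weight

    # Optimizer states: two 32-bit moment estimates per trainable parameter
    # (assumes Adam or AdamW; other optimizers keep a different number of states)
    optimizer_bytes = lora_params * 2 * 4

    # Activations depend on batch size and sequence length and are omitted here

    return {
        "rank": rank,
        "lora_params": lora_params,
        "weights_mb": weight_bytes / 1e6,
        "optimizer_mb": optimizer_bytes / 1e6,
        "compression_vs_full": (layer_dim * layer_dim * num_layers) / lora_params
    }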

for r in [4, 8, 16, 32, 64]:
    mem = estimate_training_memory(r)
    print(f"Rank {r:2d}: {mem['optimizer_mb']:6.2f} MB optimizer, "
          f"{mem['compression_vs_full']:.0f}x compression")