RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Fine-Tuning with LoRA and QLoRA
  6. /Ch. 7
Fine-Tuning with LoRA and QLoRA

07. 4-bit NormalFloat

Chapter 7 of 24 · 20 min
KEY INSIGHT

NF4 quantization uses non-uniform levels optimized for normal weight distributions, concentrating precision where weights cluster near zero.

Standard integer quantization maps floating-point values to discrete integer representations. For 4-bit quantization, this provides only 16 discrete values to represent the entire range of model weights. The choice of how to distribute these 16 values critically affects quantization accuracy.

Neural network weights typically follow a roughly normal distribution centered near zero. Standard uniform quantization wastes precision where most weights cluster (near zero) to accommodate rare extreme values. The NF4 (4-bit NormalFloat) format addresses this by using non-uniform quantization levels optimized for normal distributions.

NF4 divides the quantization range into 16 levels with spacing determined by the quantile function of the normal distribution. More levels concentrate near zero where the probability density is highest, providing finer precision exactly where it's most valuable. This results in significantly lower quantization error for normally distributed weights compared to uniform quantization.

The NF4 format also includes a block-wise quantization strategy where each block of parameters (typically 64 or 32 parameters) has its own quantization constants. This allows small groups of weights with different ranges to use their local minimum and maximum values for normalization, preventing any single outlier group from forcing coarse quantization across the entire model.

Double quantization extends memory savings by quantizing the quantization constants themselves. Instead of storing block quantization constants in 32-bit floating point, they are further quantized to 8-bit or even 4-bit. The small additional error introduced is negligible compared to the memory saved.

The bitsandbytes library implements NF4 quantization with the necessary dequantization kernels optimized for GPU execution. The CUDA kernels handle on-the-fly dequantization during forward and backward passes with minimal performance overhead compared to the theoretical cost of quantization operations.

EXERCISE

Compare quantization error between uniform 4-bit and NF4 for a sample of normally distributed values. Calculate mean squared error for each approach.

# nf4_quantization_demo.py
import torch
import numpy as np
from scipy import stats

def uniform_quantize(tensor: torch.Tensor, bits: int, block_size: int = 64):
    """Quantize tensor using uniform quantization with block-wise scaling."""
    num_levels = 2 ** bits
    quantized_blocks = []
    scales = []
    
    for i in range(0, tensor.numel(), block_size):
        block = tensor[i:i+block_size]
        if block.numel() == 0:
            continue
        
        scale = (block.max() - block.min()) / (num_levels - 1)
        if scale > 0:
            quantized = torch.round((block - block.min()) / scale)
            dequantized = quantized * scale + block.min()
        else:
            quantized = torch.zeros_like(block)
            dequantized = block
        
        quantized_blocks.append(quantized)
        scales.append((block.min().item(), scale.item()))
    
    return torch.cat(quantized_blocks), scales

def nf4_quantize(tensor: torch.Tensor, block_size: int = 64):
    """Quantize tensor using NF4 (simplified demonstration)."""
    # Generate NF4 levels using quantile function
    num_levels = 16
    levels = torch.tensor([
        stats.norm.ppf((i + 0.5) / num_levels) 
        for i in range(num_levels)
    ], dtype=tensor.dtype)
    
    quantized_blocks = []
    
    for i in range(0, tensor.numel(), block_size):
        block = tensor[i:i+block_size]
        if block.numel() == 0:
            continue
        
        # Normalize block to zero mean, unit variance
        block_mean = block.mean()
        block_std = block.std()
        if block_std > 0:
            normalized = (block - block_mean) / block_std
        else:
            normalized = block - block_mean
        
        # Find nearest NF4 level
        distances = torch.cdist(normalized.unsqueeze(1), levels.unsqueeze(0))
        indices = distances.argmin(dim=1)
        
        # Dequantize
        dequantized = levels[indices] * block_std + block_mean
        quantized_blocks.append(indices)
    
    return torch.cat(quantized_blocks), levels

# Test on random normal data
torch.manual_seed(42)
test_weights = torch.randn(10000)

# Uniform quantization
uniform_q, _ = uniform_quantize(test_weights, bits=4)
# NF4 quantization  
nf4_q, _ = nf4_quantize(test_weights)

print(f"Test data: {test_weights.numel()} parameters")
print(f"Distribution: mean={test_weights.mean():.4f}, std={test_weights.std():.4f}")

# Note: Full NF4 requires specific dequantization for accurate comparison
# This simplified version demonstrates the conceptual difference
← Chapter 6
QLoRA: Quantized LoRA
Chapter 8 →
Dataset Preparation