12. Performance Profiling

Chapter 12 of 15 · 15 min

PyTorch Profiler

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    # Assumes `model` and `input_ids` are already loaded on the GPU
    output = model.generate(input_ids, max_new_tokens=100)

# Print the top 10 operations by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Time Per Token

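For inference, the relevant metric is time per generated token, not total time. The benchmark below divides the full generate() time, prompt processing included, by the number of new tokens: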

import statistics
import time

import torch

def benchmark_generation(model, tokenizer, prompt, num_runs=5):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

    times_per_token = []
    for _ in range(num_runs):
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
        start = time.perf_counter()  # monotonic, high-resolution timer
        output = model.generate(input_ids, max_new_tokens=50)
        torch.cuda.synchronize()  # wait for the generation kernels to finish
        elapsed = time.perf_counter() - start
        # generate() may stop early at EOS, so count the tokens actually produced
        tokens_generated = output.shape[1] - input_ids.shape[1]
        times_per_token.append(elapsed / tokens_generated)

    return {
        "mean_tpt": statistics.fmean(times_per_token),
        # Population standard deviation across runs
        "std_tpt": statistics.pstdev(times_per_token),
    }
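
The per-run values collected above are easier to compare across models when reported in milliseconds and as throughput. The following is a minimal sketch, not part of the original benchmark: `summarize_tpt` is a hypothetical helper name, and the sample timings are made up for illustration.

```python
import statistics

def summarize_tpt(times_per_token):
    """Summarize per-token latencies given in seconds."""
    if not times_per_token:
        raise ValueError("need at least one measurement")
    mean_tpt = statistics.fmean(times_per_token)
    return {
        "mean_ms": mean_tpt * 1000,
        # The median is less sensitive to a slow first (warm-up) run
        "median_ms": statistics.median(times_per_token) * 1000,
        "std_ms": statistics.pstdev(times_per_token) * 1000,
        # Throughput derived from the mean latency
        "tokens_per_s": 1.0 / mean_tpt,
    }

# Hypothetical measurements: the first run is slower than the rest
samples = [0.040, 0.020, 0.020, 0.020, 0.020]
summary = summarize_tpt(samples)
print(summary)
```

If the mean and median differ noticeably, as they do in this made-up sample, one slow run is probably skewing the mean. Discarding a warm-up run before measuring is a common remedy.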

Memory Profiling

# Track peak GPU memory across a block of code
import torch
from contextlib import contextmanager

@contextmanager
def memory_tracker():
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.memory_allocated()
    yield
    # Counts memory occupied by tensors, not memory reserved by the caching allocator
    peak = torch.cuda.max_memory_allocated()
    # "Memory used" is the peak increase over the allocation at block entry
    print(f"Memory used: {(peak - start) / 1e9:.2f} GB")
    print(f"Peak memory: {peak / 1e9:.2f} GB")

with memory_tracker():
    model.generate(input_ids, max_new_tokens=100)
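
To judge whether the reported "Memory used" figure is plausible, it can help to compare it against a rough key/value cache estimate. The sketch below assumes a decoder-only transformer with KV caching enabled. `kv_cache_bytes` is a hypothetical helper, and the configuration values are illustrative rather than taken from this chapter. The measured figure will normally be higher, because it also includes activations and temporary buffers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Rough key/value cache size in bytes for one generation call."""
    # Factor 2: one key tensor and one value tensor per layer
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative configuration (assumed): 32 layers, 32 KV heads of size 128,
# fp16 values (2 bytes), 28 prompt tokens + 100 generated tokens = 128 positions
est = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=128)
print(f"Estimated KV cache: {est / 1e9:.3f} GB")
```

Treat the estimate as a lower bound for a sanity check, not as a prediction of the measured peak.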

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Profile a generation call and identify the top 3 operations by time. If attention is not the top operation, investigate why. The likely causes are memory-bandwidth limits or data-transfer overhead.
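
One possible way to pull out the top operations programmatically is sketched below. `top_ops` is a hypothetical helper that works on any records with a `key` attribute and a numeric time attribute. The stand-in records and their timings are invented so the example runs without a GPU. With a real run you would pass `prof.key_averages()` and a self-time attribute such as `self_cpu_time_total`. The name of the device-time attribute differs between PyTorch versions, so check the one installed.

```python
from collections import namedtuple

def top_ops(events, time_attr, n=3):
    """Return (name, time, share_of_total) for the n largest entries."""
    rows = [(e.key, getattr(e, time_attr)) for e in events]
    total = sum(t for _, t in rows)
    rows.sort(key=lambda r: r[1], reverse=True)
    return [(name, t, t / total if total else 0.0) for name, t in rows[:n]]

# Stand-in records with hypothetical self times in microseconds
Op = namedtuple("Op", ["key", "self_time_us"])
fake_events = [
    Op("aten::mm", 600),
    Op("aten::softmax", 100),
    Op("aten::copy_", 250),
    Op("aten::add", 50),
]

top3 = top_ops(fake_events, "self_time_us")
for name, t, share in top3:
    print(f"{name:<16}{t:>8} us {share:6.1%}")
```

Self time is used here because total time also counts nested child operations, so shares computed from totals can add up to more than 100%.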