12. Performance Profiling
PyTorch Profiler
import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    output = model.generate(input_ids, max_new_tokens=100)

# Print the top 10 operations by CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
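The printed table is convenient for reading, but it can also help to see each operation's share of the total. The sketch below is a minimal illustration that works on plain (name, time) pairs. The function name `top_ops` and the sample rows are hypothetical, and how you pull such pairs out of `prof.key_averages()` depends on your PyTorch version, so that step is left out.

```python
def top_ops(rows, k=3):
    """Return the k slowest operations as (name, time_us, share_of_total)."""
    total = sum(t for _, t in rows)
    if total <= 0:
        return []
    # Sort by time, largest first, and keep the first k entries
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)[:k]
    return [(name, t, t / total) for name, t in ranked]


# Hypothetical timings in microseconds, for illustration only
sample_rows = [
    ("aten::mm", 5000.0),
    ("aten::softmax", 1000.0),
    ("aten::copy_", 3000.0),
    ("aten::add", 1000.0),
]

for name, time_us, share in top_ops(sample_rows):
    print(f"{name:<16} {time_us:>8.0f} us  {share:6.1%}")
```

Profiler totals for a parent operation generally include the time of its children, so shares computed from total times can overlap. Self times are the safer input if your profiler version exposes them.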
Time Per Token
For inference, the relevant metric is time per generated token (not total time):
import statistics
import time

import torch

def benchmark_generation(model, tokenizer, prompt, num_runs=5):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    times_per_token = []
    for _ in range(num_runs):
        # Wait for pending GPU work so it is not charged to this run
        torch.cuda.synchronize()
        start = time.perf_counter()
        output = model.generate(input_ids, max_new_tokens=50)
        # CUDA kernels run asynchronously; sync before stopping the clock
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        # Generation may stop early, so count the tokens actually produced
        tokens_generated = output.shape[1] - input_ids.shape[1]
        times_per_token.append(elapsed / tokens_generated)
    return {
        "mean_tpt": statistics.mean(times_per_token),
        # Population standard deviation across runs
        "std_tpt": statistics.pstdev(times_per_token),
    }
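The first run of a benchmark often carries one-time costs such as kernel loading and cache warm-up, which can inflate the mean. A minimal sketch of a summary step follows, assuming you have the raw per-run seconds-per-token values available as a list. `summarize_tpt` and its `warmup` parameter are hypothetical names, and the measurements are made up for illustration.

```python
import statistics


def summarize_tpt(times_per_token, warmup=1):
    """Summarize per-run seconds-per-token values, skipping warm-up runs."""
    steady = times_per_token[warmup:]
    if not steady:
        raise ValueError("need at least one run after warm-up")
    mean_tpt = statistics.mean(steady)
    return {
        "runs_used": len(steady),
        "mean_tpt": mean_tpt,
        "median_tpt": statistics.median(steady),
        # Throughput is the reciprocal of mean time per token
        "tokens_per_sec": 1.0 / mean_tpt,
    }


# Made-up measurements in seconds per token; the first run is a slow warm-up
runs = [0.080, 0.020, 0.020, 0.020, 0.020]
summary = summarize_tpt(runs)
print(summary)
```

Reporting the median next to the mean makes a single slow outlier easier to spot, and tokens per second is often the easier figure to compare across models.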
Memory Profiling
# Track GPU memory allocated while a block of code runs
import torch
from contextlib import contextmanager

@contextmanager
def memory_tracker():
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.memory_allocated()
    try:
        yield
    finally:
        # Report even if the wrapped block raises (e.g. an out-of-memory error)
        peak = torch.cuda.max_memory_allocated()
        # "Memory used" is the peak increase over the starting allocation
        print(f"Memory used: {(peak - start) / 1e9:.2f} GB")
        print(f"Peak memory: {peak / 1e9:.2f} GB")

with memory_tracker():
    model.generate(input_ids, max_new_tokens=100)
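The `torch.cuda` counters cover only GPU memory allocated through PyTorch. For host-side Python allocations, the same context-manager pattern can be sketched with the standard library's `tracemalloc`. This is an illustrative analogue rather than part of the GPU workflow above: `host_memory_tracker` is a hypothetical name, `tracemalloc` sees only allocations made through Python's allocator, and `tracemalloc.reset_peak()` needs Python 3.9 or later.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def host_memory_tracker(stats):
    """Record Python-level host memory use of the wrapped block into `stats`."""
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # Python 3.9+
    start, _ = tracemalloc.get_traced_memory()
    try:
        yield stats
    finally:
        # Fill in the results even if the wrapped block raises
        _, peak = tracemalloc.get_traced_memory()
        stats["start_bytes"] = start
        stats["peak_bytes"] = peak
        stats["peak_increase_bytes"] = peak - start
        if not already_tracing:
            tracemalloc.stop()


stats = {}
with host_memory_tracker(stats):
    buffer = bytearray(10_000_000)  # roughly 10 MB of host memory

print(f"Peak increase: {stats['peak_increase_bytes'] / 1e6:.1f} MB")
```

Returning the numbers in a dictionary instead of printing them makes it easier to log results or compare runs.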
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Profile a generation call and identify the top three operations by time. If attention is not the top operation, investigate why; likely causes are memory bandwidth limits or data transfer overhead.