27. Optimizing Python Code
Profiling showed you where the time goes. The next step is deciding what to change. The core tension in Python optimization is readability versus performance: write for clarity first, then optimize only the hot paths that profiling identifies.
Common optimizations for AI pipelines:
# SLOW: Python loop for numerical computation
def slow_square_sum(values):
    total = 0
    for v in values:
        total += v * v
    return total
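# FAST: Use numpy vectorized operations
import numpy as np
def fast_square_sum(values):
    # float64 avoids integer overflow where the default integer type is 32-bit
    arr = np.asarray(values, dtype=np.float64)
    return float(np.sum(arr ** 2))
# Benchmark
import timeit
values = list(range(100000))
arr = np.array(values, dtype=np.float64)  # Convert once, outside the timed call
slow_time = timeit.timeit(lambda: slow_square_sum(values), number=10)
fast_time = timeit.timeit(lambda: fast_square_sum(arr), number=10)
print(f"Slow (loop): {slow_time:.4f}s")
print(f"Fast (numpy): {fast_time:.4f}s")
# Often 10-100x faster when the data is already an array;
# converting a Python list on every call can erase most of the gain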
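# SLOW: String concatenation in loop (may copy the growing string on each +=)
def slow_concat(items):
    result = ""
    for item in items:
        result += item + ", "
    return result  # Note: leaves a trailing ", " that join() does not produce
# FAST: Join
def fast_concat(items):
    return ", ".join(items)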
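To check the concatenation claim on your own machine, here is a minimal benchmark sketch. The names concat_loop and concat_join are hypothetical, and the loop version is written to produce exactly the same string as join so the comparison is like for like. CPython can sometimes extend a string in place during +=, so the gap may be smaller than you expect; measure rather than assume.

```python
import timeit

def concat_loop(items):
    """Build a comma-separated string with repeated += (no trailing separator)."""
    result = ""
    for i, item in enumerate(items):
        if i:  # add the separator before every item except the first
            result += ", "
        result += item
    return result

def concat_join(items):
    """Build the same string in a single pass with str.join."""
    return ", ".join(items)

items = [str(n) for n in range(10_000)]

# Confirm both versions agree before comparing their speed
assert concat_loop(items) == concat_join(items)

loop_time = timeit.timeit(lambda: concat_loop(items), number=100)
join_time = timeit.timeit(lambda: concat_join(items), number=100)
print(f"Loop +=: {loop_time:.4f}s")
print(f"join:    {join_time:.4f}s")
```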
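List comprehensions are usually faster than equivalent explicit loops that call append: they still run as Python bytecode, but the interpreter appends with a dedicated instruction instead of looking up and calling list.append on every iteration. Generators (yield) save memory on large datasets by producing one item at a time instead of materializing the whole list. functools.lru_cache memoizes expensive function calls, provided the arguments are hashable: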
from functools import lru_cache
@lru_cache(maxsize=1024)
def expensive_embedding(text: str) -> tuple[float, ...]:
    """Simulated expensive embedding computation."""
    # In reality, this calls a slow API or model
    # Return a tuple: a cached list could be mutated by one caller and corrupt later hits
    return tuple(hash(text + str(i)) % 1000 / 1000 for i in range(10))
# Second call with same text hits cache
result1 = expensive_embedding("hello world")
result2 = expensive_embedding("hello world")  # Instant, from cache
print(expensive_embedding.cache_info())  # hits=1, misses=1 after the two calls above
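Returning to the list comprehension and generator points above, the following sketch shows one way to check them. The function names are hypothetical, and the timing difference depends on your Python version, so treat the printed numbers as local observations rather than guarantees. Note that sys.getsizeof reports only the size of the container object, not the elements it refers to.

```python
import sys
import timeit

def squares_loop(n):
    """Explicit loop: looks up and calls out.append on every iteration."""
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def squares_comprehension(n):
    """Same result, built with a list comprehension."""
    return [i * i for i in range(n)]

def squares_generator(n):
    """Yields one square at a time instead of building a list."""
    for i in range(n):
        yield i * i

n = 100_000
assert squares_loop(n) == squares_comprehension(n)

loop_time = timeit.timeit(lambda: squares_loop(n), number=20)
comp_time = timeit.timeit(lambda: squares_comprehension(n), number=20)
print(f"Explicit loop:      {loop_time:.4f}s")
print(f"List comprehension: {comp_time:.4f}s")

as_list = squares_comprehension(n)
as_gen = squares_generator(n)
print(f"List object:      {sys.getsizeof(as_list)} bytes")
print(f"Generator object: {sys.getsizeof(as_gen)} bytes")

# The generator can still feed an aggregate without storing every item
total = sum(as_gen)
assert total == sum(as_list)
```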
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Create a function that computes a rolling average over a list (each output is the mean of the current element plus the previous N-1 elements). Implement it: (1) with a Python loop, (2) using numpy convolution. Benchmark both with a list of 100,000 floats and window size 100. Show the speedup.
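As a starting point for the exercise, here is a small sketch on a five-element input that you can use as a correctness reference before benchmarking. It assumes that only full windows are reported (the first N-1 positions are skipped), which the exercise does not specify, so adjust it if you want partial windows at the start. The function names are hypothetical, and the 100,000-element benchmark is left to you.

```python
import numpy as np

def rolling_mean_loop(values, window):
    """Mean of each full window, computed with a plain Python loop."""
    out = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]  # current element plus the previous window-1
        out.append(sum(chunk) / window)
    return out

def rolling_mean_convolve(values, window):
    """Same result via convolution with a uniform kernel."""
    kernel = np.ones(window) / window
    arr = np.asarray(values, dtype=np.float64)
    # mode="valid" keeps only positions where the kernel fully overlaps the data
    return np.convolve(arr, kernel, mode="valid")

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(rolling_mean_loop(data, 3))      # [2.0, 3.0, 4.0]
print(rolling_mean_convolve(data, 3))  # approximately the same values

# Compare with a tolerance: the two methods round differently
assert np.allclose(rolling_mean_loop(data, 3), rolling_mean_convolve(data, 3))
```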