04. Temporal Reasoning
Temporal reasoning answers questions about sequence, duration, causality, and change over time. This chapter covers architectures designed to propagate information across the time dimension.
Recurrent Approaches
LSTMs and GRUs process a video one frame at a time, carrying a hidden state that accumulates temporal context from the frames seen so far.
import torch


class LSTMVideoEncoder(torch.nn.Module):
    def __init__(self, frame_encoder, hidden_size=256):
        super().__init__()
        # Per-frame feature extractor, e.g., a ResNet wrapper. Assumed to map
        # (N, C, H, W) -> (N, output_dim) and to expose an `output_dim` attribute.
        self.frame_encoder = frame_encoder
        self.lstm = torch.nn.LSTM(
            input_size=frame_encoder.output_dim,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, frames):
        # frames: (B, T, C, H, W)
        B, T, C, H, W = frames.shape
        # Flatten batch and time so every frame is encoded independently
        # (reshape also handles non-contiguous inputs, unlike view)
        frame_features = self.frame_encoder(frames.reshape(B * T, C, H, W))
        frame_features = frame_features.reshape(B, T, -1)  # (B, T, frame_dim)
        # LSTM processes the temporal sequence; h_n: (num_layers * 2, B, hidden_size)
        lstm_out, (h_n, c_n) = self.lstm(frame_features)
        # Concatenate the top layer's final forward and backward hidden states
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (B, hidden_size * 2)
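The `h_n[-2]`, `h_n[-1]` indexing in the return line above relies on how PyTorch orders the final hidden states of a stacked bidirectional LSTM. The following is a minimal sketch that checks this layout on random data; the sizes `B`, `T`, `D`, and `H` are toy values chosen for illustration, not values from this chapter.

```python
import torch

torch.manual_seed(0)

B, T, D, H = 2, 5, 8, 4  # batch, time steps, feature dim, hidden size (toy values)
lstm = torch.nn.LSTM(input_size=D, hidden_size=H, num_layers=2,
                     batch_first=True, bidirectional=True)

x = torch.randn(B, T, D)
with torch.no_grad():
    out, (h_n, c_n) = lstm(x)

# out: (B, T, 2 * H) holds the top layer's outputs at every time step.
# h_n: (num_layers * 2, B, H) holds the final state of each layer and direction.

# The top-layer forward pass ends at the last frame, so its final state
# equals the forward half of the output at t = T - 1.
fwd_matches = torch.allclose(h_n[-2], out[:, -1, :H])

# The top-layer backward pass ends at the first frame, so its final state
# equals the backward half of the output at t = 0.
bwd_matches = torch.allclose(h_n[-1], out[:, 0, H:])

summary = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (B, 2 * H)
print(out.shape, h_n.shape, summary.shape, fwd_matches, bwd_matches)
```

Note that taking `out[:, -1]` alone would pair a forward state that has seen the whole clip with a backward state that has seen only the last frame, which is why the encoder reads from `h_n` instead.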
Attention-Based Temporal Modeling
Transformers apply self-attention across frames, so any frame can attend directly to any other frame. This handles long-range dependencies better than RNNs, but the compute and memory for the attention matrix grow quadratically with sequence length.
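import torch


class TemporalTransformerEncoder(torch.nn.Module):
    def __init__(self, frame_dim, num_heads=8, num_layers=4, max_len=512):
        super().__init__()
        # frame_dim must be divisible by num_heads.
        # Assumption: a learned 1-D positional embedding over time steps,
        # with max_len as an upper bound on the number of frames T.
        self.pos_embedding = torch.nn.Parameter(torch.zeros(1, max_len, frame_dim))
        torch.nn.init.normal_(self.pos_embedding, std=0.02)
        encoder_layer = torch.nn.TransformerEncoderLayer(
            d_model=frame_dim,
            nhead=num_heads,
            dim_feedforward=frame_dim * 4,
            batch_first=True,
        )
        self.transformer = torch.nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
        )

    def forward(self, frame_features):
        # frame_features: (B, T, frame_dim)
        seq_len = frame_features.size(1)
        if seq_len > self.pos_embedding.size(1):
            raise ValueError("sequence is longer than max_len")
        # Add positional encoding for temporal position (broadcast over batch)
        pos_enc = self.pos_embedding[:, :seq_len]  # (1, T, frame_dim)
        x = frame_features + pos_enc
        # Self-attention across time
        return self.transformer(x)  # (B, T, frame_dim)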
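The quadratic cost of attention across frames can be made concrete with a small calculation. This is a rough sketch that counts only the attention weight matrices for one layer and one sample; the function name `attention_matrix_bytes` and the default of 4-byte floats are assumptions for illustration, and real implementations may use more or less memory depending on the attention kernel.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=4, num_heads=1):
    """Bytes needed to hold the (seq_len x seq_len) attention weights
    for every head of a single layer, for a single sample."""
    return seq_len * seq_len * bytes_per_element * num_heads


# Doubling the number of frames quadruples the attention memory.
for num_frames in (64, 128, 256):
    total = attention_matrix_bytes(num_frames, num_heads=8)
    print(f"T={num_frames:4d}  attention weights: {total:,} bytes")

ratio = attention_matrix_bytes(128) / attention_matrix_bytes(64)
print("memory ratio when T doubles:", ratio)
```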
Failure Mode: Temporal Aliasing
When events occur faster than your sampling rate, you can miss them entirely. A 1-second action captured at 1 FPS might appear as two unrelated static states. This is temporal aliasing: motion at frequencies above half the sampling rate is folded into lower frequencies in the sampled sequence.
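# Demonstrate temporal aliasing
import numpy as np

# Simulate 1 second of motion: a 10 Hz oscillation on a dense time grid
t_high_res = np.linspace(0, 1, 1000)
motion_high_res = np.sin(2 * np.pi * 10 * t_high_res)

# Sample at 1 FPS (the Nyquist criterion requires more than 20 FPS for 10 Hz)
t_low_res = np.linspace(0, 1, 2)  # only 2 frames, at t = 0 s and t = 1 s
motion_low_res = np.sin(2 * np.pi * 10 * t_low_res)

# Both samples land on zero crossings, so the sampled signal looks static:
# the high-res trace shows 10 full cycles, the low-res trace shows none.
print(motion_high_res.min(), motion_high_res.max())  # close to -1 and 1
print(np.round(motion_low_res, 6))  # both values are 0 to within rounding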
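The folding described above can also be computed directly: a periodic motion at frequency f, sampled uniformly at rate fs, shows up at the distance between f and the nearest integer multiple of fs. The helper names below (`apparent_frequency`, `min_sample_rate`) are hypothetical and the sketch assumes a pure sinusoidal motion, which real video content rarely is.

```python
def apparent_frequency(signal_hz, sample_rate_hz):
    """Frequency that a sinusoid at signal_hz appears to have after
    uniform sampling at sample_rate_hz (folding about multiples of the rate)."""
    nearest_multiple = round(signal_hz / sample_rate_hz) * sample_rate_hz
    return abs(signal_hz - nearest_multiple)


def min_sample_rate(signal_hz):
    """Nyquist bound: the sampling rate must strictly exceed this value."""
    return 2 * signal_hz


for fps in (1, 12, 25):
    print(f"10 Hz motion at {fps:2d} FPS appears as "
          f"{apparent_frequency(10, fps)} Hz")

print("sampling rate must exceed", min_sample_rate(10), "FPS")
```

At 1 FPS the 10 Hz motion folds all the way down to 0 Hz, which matches the static appearance in the demonstration; at 12 FPS it appears as a slow 2 Hz motion; at 25 FPS it is captured at its true frequency.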
Exercise: Given a 1-hour video sampled at 1 FPS (3600 frames), calculate the memory required to process it with a temporal transformer using full self-attention. With 512-dimensional embeddings and 8-byte floats, what is the size of a single attention matrix (one head, one layer)? Propose an alternative architecture that avoids materializing it.
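One way to sanity-check an answer to the question above is to compute the sizes directly and compare them with one possible alternative, sliding-window (local) attention, in which each frame attends only to frames within a fixed distance. This is a sketch under stated assumptions: the window of 32 frames is a hypothetical choice, the counts cover a single head in a single layer, and the windowed figure assumes a banded implementation that stores only the allowed pairs (masking a dense matrix would not save memory).

```python
def full_attention_entries(num_frames):
    """Number of (query, key) pairs when every frame attends to every frame."""
    return num_frames * num_frames


def windowed_attention_entries(num_frames, window):
    """Number of (query, key) pairs with |i - j| <= window."""
    w = min(window, num_frames - 1)
    # Each row would have 2w + 1 entries, minus those cut off at both ends.
    return num_frames * (2 * w + 1) - w * (w + 1)


num_frames = 60 * 60 * 1  # 1 hour at 1 FPS
bytes_per_value = 8       # 8-byte floats, as in the exercise
embed_dim = 512
window = 32               # hypothetical local window, in frames

embedding_bytes = num_frames * embed_dim * bytes_per_value
full_bytes = full_attention_entries(num_frames) * bytes_per_value
windowed_bytes = windowed_attention_entries(num_frames, window) * bytes_per_value

print(f"frame embeddings:        {embedding_bytes:,} bytes")
print(f"full attention matrix:   {full_bytes:,} bytes")
print(f"windowed attention band: {windowed_bytes:,} bytes")
```

Multiply the attention figures by the number of heads and layers to estimate a whole model. Other alternatives worth considering include hierarchical encoders that summarize short clips before attending across them, or a recurrent encoder like the one earlier in this chapter.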