04. Temporal Reasoning

Chapter 4 of 24 · 20 min

Temporal reasoning answers questions about sequence, duration, causality, and change over time. This chapter covers architectures designed to propagate information across the time dimension.

Recurrent Approaches

LSTMs and GRUs process a video one frame at a time, carrying a hidden state forward so that each step sees a running summary of the frames before it.

import torch


class LSTMVideoEncoder(torch.nn.Module):
    def __init__(self, frame_encoder, hidden_size=256):
        super().__init__()
        # Per-frame feature extractor, e.g., a ResNet wrapper. It is assumed to
        # expose an `output_dim` attribute and to return (N, output_dim) features.
        self.frame_encoder = frame_encoder
        self.lstm = torch.nn.LSTM(
            input_size=frame_encoder.output_dim,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, frames):
        # frames: (B, T, C, H, W)
        B, T, C, H, W = frames.shape

        # Flatten batch and time so every frame is encoded independently
        # (reshape also handles non-contiguous inputs, unlike view)
        frame_features = self.frame_encoder(frames.reshape(B * T, C, H, W))
        frame_features = frame_features.reshape(B, T, -1)  # (B, T, frame_dim)

        # LSTM processes the temporal sequence
        lstm_out, (h_n, c_n) = self.lstm(frame_features)

        # h_n: (num_layers * 2, B, hidden_size). The last two entries are the
        # top layer's final forward state and final backward state.
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (B, hidden_size * 2)
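
As a quick check on the indexing used above, the sketch below confirms how PyTorch lays out h_n for a two-layer bidirectional LSTM. The tensor sizes are toy values chosen here for illustration and are not from the chapter. It compares h_n[-2] and h_n[-1] against the per-step outputs.

```python
import torch

torch.manual_seed(0)
B, T, D, H = 2, 5, 8, 4  # toy batch, frames, feature dim, hidden size (assumed)

lstm = torch.nn.LSTM(input_size=D, hidden_size=H, num_layers=2,
                     batch_first=True, bidirectional=True)
x = torch.randn(B, T, D)  # stand-in for per-frame features

with torch.no_grad():
    out, (h_n, c_n) = lstm(x)

# out: (B, T, 2 * H), top layer only; h_n: (num_layers * 2, B, H)
fwd_final = h_n[-2]  # top layer, forward direction, after the last frame
bwd_final = h_n[-1]  # top layer, backward direction, after reaching frame 0

# The forward state matches the last time step of the output, while the
# backward state matches the first time step.
fwd_matches = torch.allclose(fwd_final, out[:, -1, :H])
bwd_matches = torch.allclose(bwd_final, out[:, 0, H:])

summary = torch.cat([fwd_final, bwd_final], dim=-1)  # (B, 2 * H)
print(tuple(out.shape), tuple(h_n.shape), tuple(summary.shape))
print(fwd_matches, bwd_matches)
```

The concatenated vector therefore combines a forward summary that has seen every frame up to the last one with a backward summary that has seen every frame down to the first.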

Attention-Based Temporal Modeling

Transformers apply self-attention across frames, so any frame can attend directly to any other frame. This typically handles long-range dependencies better than RNNs, but compute and memory for the attention weights grow quadratically with the number of frames.

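import torch


class TemporalTransformerEncoder(torch.nn.Module):
    def __init__(self, frame_dim, num_heads=8, num_layers=4, max_len=512):
        super().__init__()
        # Learned 1D positional embedding over the frame index. This is one
        # simple choice of temporal encoding; max_len is an assumed upper
        # bound on the number of frames T.
        self.pos_embedding = torch.nn.Parameter(torch.zeros(1, max_len, frame_dim))
        torch.nn.init.normal_(self.pos_embedding, std=0.02)

        # frame_dim must be divisible by num_heads
        encoder_layer = torch.nn.TransformerEncoderLayer(
            d_model=frame_dim,
            nhead=num_heads,
            dim_feedforward=frame_dim * 4,
            batch_first=True,
        )
        self.transformer = torch.nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
        )

    def forward(self, frame_features):
        # frame_features: (B, T, frame_dim), with T <= max_len
        seq_len = frame_features.size(1)

        # Add the positional encoding for each temporal position
        x = frame_features + self.pos_embedding[:, :seq_len]

        # Self-attention across time
        return self.transformer(x)  # (B, T, frame_dim)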
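
To make the quadratic cost concrete, here is a back-of-the-envelope sketch of the memory needed to hold one layer's attention weights. It assumes the weights are fully materialized as a (batch, heads, T, T) tensor of 4-byte floats; the helper name `attention_scores_bytes` and the sequence lengths are illustrative choices, not part of the chapter.

```python
def attention_scores_bytes(seq_len, num_heads=8, bytes_per_float=4, batch_size=1):
    """Bytes for one layer's attention weights: batch * heads * T * T floats."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_float


for seq_len in (64, 256, 1024):
    mib = attention_scores_bytes(seq_len) / 2**20
    print(f"T={seq_len:5d}: {mib:8.3f} MiB per layer")

# Doubling the number of frames quadruples the attention memory
ratio = attention_scores_bytes(128) / attention_scores_bytes(64)
print(f"128 vs 64 frames: {ratio:.0f}x")
```

Under these assumptions the attention weights grow from 0.125 MiB at 64 frames to 32 MiB at 1024 frames per layer, before counting activations or gradients.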

Failure Mode: Temporal Aliasing

When events unfold faster than your sampling rate can resolve, they can fall between frames and be missed entirely. A 1-second action captured at 1 FPS might appear as two unrelated static states. This is temporal aliasing: motion at frequencies above half the sampling rate either disappears or shows up as a spurious, slower motion in the samples.

# Demonstrate temporal aliasing
import numpy as np

# Simulate 1 second of a 10 Hz oscillation at high temporal resolution
t_high_res = np.linspace(0, 1, 1000)
motion_high_res = np.sin(2 * np.pi * 10 * t_high_res)

# Sample at 1 FPS (capturing 10 Hz requires more than 20 FPS by Nyquist)
t_low_res = np.linspace(0, 1, 2)  # Only 2 frames, at t=0 and t=1
motion_low_res = np.sin(2 * np.pi * 10 * t_low_res)

# Both samples land on zero crossings, so the sampled signal looks static.
# High-res shows 10 full cycles; low-res shows almost nothing.
print(motion_low_res)  # two values at or very near 0
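
Aliasing does not always make motion vanish; it can also make fast motion look like slower motion. The sketch below illustrates this under assumed values: a 10 Hz oscillation sampled at 12 FPS for 4 seconds, numbers chosen here so the alias lands exactly on an FFT bin. It estimates the dominant frequency of the sampled signal and compares it with the standard folding prediction.

```python
import numpy as np

signal_hz = 10.0   # true motion frequency (assumed for illustration)
fps = 12.0         # sampling rate, below the 20 FPS needed for 10 Hz
duration_s = 4.0   # clip length (assumed)

n = int(fps * duration_s)
t = np.arange(n) / fps
samples = np.sin(2 * np.pi * signal_hz * t)

# Dominant frequency of the sampled signal, estimated with a real FFT
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(n, d=1.0 / fps)
apparent_hz = freqs[np.argmax(spectrum)]

# Folding prediction: distance from the nearest multiple of the sampling rate
predicted_hz = abs(signal_hz - round(signal_hz / fps) * fps)

print(f"true: {signal_hz:.2f} Hz, apparent: {apparent_hz:.2f} Hz, "
      f"predicted alias: {predicted_hz:.2f} Hz")
```

With these values the sampled frames are indistinguishable from a 2 Hz oscillation, so a model that only sees the frames has no way to recover the true 10 Hz motion.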

EXERCISE

Given a 1-hour video sampled at 1 FPS (3600 frames), calculate the memory required to process it with a temporal transformer using full self-attention. With 512-dimensional embeddings and 8-byte floats, what is the attention matrix size? Propose an alternative architecture.