Model Selection for Video — Advanced Multi-Modal Systems (Chapter 14)

Model selection for video tasks balances accuracy, latency, and computational cost. Video understanding typically involves temporal reasoning, which many architectures handle poorly if temporal dimensions are flattened or ignored.

Temporal Convolutional Networks (TCNs) process frame sequences with dilated convolutions that capture multi-scale temporal patterns efficiently. The main advantage: every time step is computed in parallel with no recurrent state, so inference cost grows only linearly with sequence length. The disadvantage: the receptive field is fixed by kernel size, dilation, and depth, whereas attention can relate any two frames directly.

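# TCN building block
import torch.nn as nn


class TemporalBlock(nn.Module):
    """Residual block of two dilated 1D convolutions over the time axis."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Symmetric "same" padding (non-causal). The sequence length is preserved
        # only when (kernel_size - 1) * dilation is even; otherwise the residual
        # addition in forward() would fail on mismatched lengths.
        if (kernel_size - 1) * dilation % 2 != 0:
            raise ValueError("(kernel_size - 1) * dilation must be even")
        padding = (kernel_size - 1) * dilation // 2
        self.conv1 = nn.Conv1d(in_channels, out_channels,
                               kernel_size, padding=padding,
                               dilation=dilation)
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.conv2 = nn.Conv1d(out_channels, out_channels,
                               kernel_size, padding=padding,
                               dilation=dilation)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        # A 1x1 convolution matches channel counts on the residual path.
        self.downsample = (nn.Conv1d(in_channels, out_channels, 1)
                           if in_channels != out_channels else None)

    def forward(self, x):
        # x has shape (batch, channels, time).
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        res = x if self.downsample is None else self.downsample(x)
        return self.relu(out + res)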
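
How far back a stack of such blocks can see follows from simple arithmetic. The sketch below is an illustration rather than part of the chapter's code: `receptive_field` is a hypothetical helper, and it assumes stride 1 and two convolutions per block that share one kernel size and dilation, as in TemporalBlock. The dilation schedule that doubles per block is also an assumption chosen for the example.

```python
def receptive_field(kernel_size, dilations, convs_per_block=2):
    """Frames seen by one output step of a stack of dilated conv blocks.

    Assumes stride 1. Each convolution widens the field by
    (kernel_size - 1) * dilation frames.
    """
    field = 1
    for dilation in dilations:
        field += convs_per_block * (kernel_size - 1) * dilation
    return field


if __name__ == "__main__":
    # Dilation doubles with each block (illustrative schedule).
    for depth in range(1, 7):
        dilations = [2 ** i for i in range(depth)]
        print(depth, dilations, receptive_field(3, dilations))
```

With kernel size 3 and dilations 1, 2, 4, 8, the helper returns 61 frames. Under this doubling schedule the field grows exponentially with depth, but it remains a hard limit: frames outside it cannot influence the output.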

Video Transformers apply self-attention across temporal and spatial dimensions. These models capture long-range dependencies but scale poorly with sequence length: attention cost is O(n²) in the number of tokens n, and for video n is the number of frames multiplied by the number of patches per frame. Sparse attention patterns and linear attention approximations mitigate this cost.
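
A quick count makes the quadratic cost concrete. The sketch below compares full joint space-time attention with a factorized pattern in which each token attends only along its own time axis and within its own frame. Factorized attention is used here as one example of a sparse pattern, not as the method the chapter prescribes. The function names and the clip shape (8 frames, 196 patches per frame) are assumptions for illustration.

```python
def joint_attention_pairs(frames, patches):
    """Query-key pairs when every token attends to every other token."""
    tokens = frames * patches
    return tokens * tokens


def factorized_attention_pairs(frames, patches):
    """Query-key pairs when each token attends along time, then within its frame."""
    tokens = frames * patches
    # Temporal step: `frames` keys per token. Spatial step: `patches` keys per token.
    return tokens * (frames + patches)


if __name__ == "__main__":
    frames, patches = 8, 196  # hypothetical clip: 8 frames, 14 x 14 patches each
    joint = joint_attention_pairs(frames, patches)
    factorized = factorized_attention_pairs(frames, patches)
    print(f"joint:      {joint:,} pairs")
    print(f"factorized: {factorized:,} pairs")
    print(f"ratio:      {joint / factorized:.1f}x")
```

For this clip shape the joint pattern scores 2,458,624 pairs per head per layer and the factorized pattern 319,872. Doubling the frame count quadruples the joint figure, which is why long clips are usually subsampled or handled with a cheaper attention pattern.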

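3D CNNs such as I3D and SlowFast convolve over time as well as space, so temporal information is encoded directly in the convolution operations. SlowFast networks use two pathways: a slow pathway that samples frames at a low rate to capture semantic content, and a fast pathway that samples at a high rate to capture motion. The fast pathway is kept lightweight by using far fewer channels than the slow pathway, which is how the architecture achieves strong performance at reasonable computational cost.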
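
The two frame rates can be pictured as two strides over the same clip. The sketch below is a minimal illustration of that sampling step only, not of the network itself. `pathway_indices`, its parameters, and the example values (a 64-frame clip, slow stride 16, speed ratio 8) are assumptions chosen for the example.

```python
def pathway_indices(num_frames, slow_stride, alpha):
    """Return (slow, fast) frame indices for a two-pathway sampler.

    The slow pathway keeps every `slow_stride`-th frame. The fast pathway
    samples `alpha` times more densely, so its stride is slow_stride // alpha.
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    fast_stride = slow_stride // alpha
    slow = list(range(0, num_frames, slow_stride))
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


if __name__ == "__main__":
    slow, fast = pathway_indices(num_frames=64, slow_stride=16, alpha=8)
    print("slow pathway frames:", slow)
    print("fast pathway frame count:", len(fast))
```

With these values the slow pathway sees 4 frames (indices 0, 16, 32, 48) and the fast pathway sees 32. Every slow frame is also a fast frame, so the two streams stay aligned in time.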

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.