Video Understanding — Advanced Multi-Modal Systems (Chapter 2)

Video introduces time as a fundamental axis. Unlike images, videos require understanding what changes, what stays constant, and the causal relationships between events.

Video as a Tensor

A video clip is a 4D tensor with axes (time, height, width, channels). In practice, we sample a subset of frames and process them in order. The key challenge is a trade-off: naive frame-by-frame processing loses temporal continuity, while aggressive temporal compression loses spatial detail.

import av
import torch
from transformers import AutoModelForVideoClassification

# Load a video using PyAV
container = av.open("kitchen_activity.mp4")

# Inspect video properties
video_stream = container.streams.video[0]
print(f"Resolution: {video_stream.width}x{video_stream.height}")
print(f"FPS: {video_stream.average_rate}")
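# Common failure: forgetting the time_base conversion.
# stream.duration is expressed in time_base units, not seconds, and can be
# None for some containers. Multiply by time_base to convert to seconds.
if video_stream.duration is not None:
    actual_duration_seconds = float(video_stream.duration * video_stream.time_base)
    print(f"Duration: {actual_duration_seconds:.2f} s")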
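
The frame sampling mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's pipeline: the helper names uniform_sample_indices and stack_frames are hypothetical, and synthetic NumPy frames stand in for decoded video so the snippet runs without a file.

```python
import numpy as np


def uniform_sample_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Return num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # linspace includes both endpoints; rounding maps positions to valid indices
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)


def stack_frames(frames: list) -> np.ndarray:
    """Stack a list of (H, W, C) frames into one (T, H, W, C) array."""
    return np.stack(frames, axis=0)


# Synthetic stand-in for a decoded clip: 90 frames of 48x64 RGB (assumed sizes)
decoded = [np.full((48, 64, 3), i, dtype=np.uint8) for i in range(90)]

indices = uniform_sample_indices(len(decoded), num_samples=8)
clip = stack_frames([decoded[i] for i in indices])
print(indices)     # [ 0 13 25 38 51 64 76 89]
print(clip.shape)  # (8, 48, 64, 3)
```

Uniform sampling keeps the first and last frames and spaces the rest evenly, which is one simple way to bound compute per clip. Denser sampling preserves more temporal continuity at a higher cost.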

Spatial vs. Temporal Information

Frames capture appearance: objects, colors, textures, spatial arrangements. Optical flow captures motion: the direction and speed of each pixel between consecutive frames, with acceleration recoverable by comparing successive flow fields. Many video models use both, either by stacking optical flow as additional input channels or by processing RGB and flow in separate streams.

# Optical flow extraction with OpenCV
import cv2

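cap = cv2.VideoCapture("kitchen_activity.mp4")
ret, prev_frame = cap.read()
if not ret:
    # Without this check, a missing or unreadable file fails later in cvtColor
    raise RuntimeError("Could not read the first frame of kitchen_activity.mp4")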

while cap.isOpened():
    ret, curr_frame = cap.read()
    if not ret:
        break
    
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    
    # Compute Farneback optical flow
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray,
        None,
        pyr_scale=0.5,
        levels=3,
        winsize=15,
        iterations=3,
        poly_n=5,
        poly_sigma=1.2,
        flags=0
    )
    
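    # flow shape: (H, W, 2); the two channels hold the horizontal (u) and
    # vertical (v) displacement in pixels
    prev_frame = curr_frame

# Release the capture handle once all frames have been read
cap.release()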
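
To make the two uses of flow described above concrete, here is a small sketch that converts a flow field into per-pixel speed and direction and stacks it onto an RGB frame as extra channels. It assumes a flow array shaped (H, W, 2) like the one computed above. A synthetic constant field replaces real flow so the snippet runs on its own, and the function names are illustrative rather than part of any library.

```python
import numpy as np


def flow_to_speed_direction(flow: np.ndarray):
    """Split an (H, W, 2) flow field into speed (pixels/frame) and direction (radians)."""
    u, v = flow[..., 0], flow[..., 1]
    speed = np.hypot(u, v)
    # In image coordinates y points down, so positive v is downward motion
    direction = np.arctan2(v, u)
    return speed, direction


def stack_rgb_and_flow(frame_rgb: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Concatenate an (H, W, 3) frame and an (H, W, 2) flow field into (H, W, 5)."""
    if frame_rgb.shape[:2] != flow.shape[:2]:
        raise ValueError("frame and flow must share height and width")
    rgb = frame_rgb.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
    return np.concatenate([rgb, flow.astype(np.float32)], axis=-1)


# Synthetic inputs: a black 4x6 frame and a constant flow of (u=3, v=4)
frame = np.zeros((4, 6, 3), dtype=np.uint8)
flow = np.zeros((4, 6, 2), dtype=np.float32)
flow[..., 0] = 3.0
flow[..., 1] = 4.0

speed, direction = flow_to_speed_direction(flow)
stacked = stack_rgb_and_flow(frame, flow)
print(speed[0, 0])    # 5.0
print(stacked.shape)  # (4, 6, 5)
```

Channel stacking lets a single network see appearance and motion together, while a two-stream design would instead feed the RGB frame and the flow field to separate networks and fuse their outputs later.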

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
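
One way to capture the environment details requested above is a short script like the following. It is a sketch, not a required format: the function name environment_record is hypothetical, and the distribution names passed to it ("av" for PyAV, "opencv-python" for OpenCV) are assumptions that may differ in your installation, for example if a headless OpenCV build is used.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Collect interpreter, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# Distribution names below are assumptions; adjust them to match your setup
record = environment_record(["av", "opencv-python", "torch", "transformers"])
print(json.dumps(record, indent=2))
```

Saving this output next to the observed results makes it easier to explain differences when the same exercise is rerun on another machine.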