02. Video Understanding
Video introduces time as a fundamental axis. Unlike images, videos require understanding what changes, what stays constant, and the causal relationships between events.
Video as a Tensor
A decoded video clip can be represented as a 4D tensor with shape (time, height, width, channels). Because processing every frame is rarely affordable, we usually sample a subset of frames and feed them to the model in temporal order. The key trade-off is that naive frame-by-frame processing loses temporal continuity, while aggressive temporal compression loses spatial detail.
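As a minimal sketch of frame sampling on a (time, height, width, channels) array, the snippet below picks evenly spaced frame indices. The helper name `sample_frame_indices` and the synthetic clip are illustrative assumptions, not part of any library, and uniform sampling is only one of several possible strategies.

```python
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Return evenly spaced frame indices in [0, num_frames - 1].

    Hypothetical helper: if the clip has fewer frames than requested,
    every frame is returned once instead of repeating frames.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    num_samples = min(num_samples, num_frames)
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(np.int64)

# Synthetic stand-in for a decoded clip: 120 frames of 64x64 RGB.
video = np.zeros((120, 64, 64, 3), dtype=np.uint8)

indices = sample_frame_indices(video.shape[0], num_samples=8)
clip = video[indices]  # indexing along the time axis keeps (H, W, C) intact
print(indices, clip.shape)
```

Sampling along the first axis only reduces temporal resolution; height, width, and channels are untouched, which is the spatial side of the trade-off described above.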
import av
import torch
from transformers import AutoModelForVideoClassification

# Load a video using PyAV
container = av.open("kitchen_activity.mp4")

# Inspect video properties
video_stream = container.streams.video[0]
print(f"Resolution: {video_stream.width}x{video_stream.height}")
print(f"FPS: {video_stream.average_rate}")

# Common failure: forgetting the time_base conversion.
# stream.duration is expressed in time_base units (ticks), not seconds,
# and it can be None when the container does not report a duration.
if video_stream.duration is not None:
    actual_duration_seconds = float(video_stream.duration * video_stream.time_base)
    print(f"Duration: {actual_duration_seconds:.2f} s")
else:
    actual_duration_seconds = None
    print("Duration: not reported for this stream")
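The time_base arithmetic can be checked without a video file. The sketch below uses `fractions.Fraction`, which is the type PyAV uses for `time_base`; the 90 kHz clock and the tick count are made-up values chosen for illustration, and `ticks_to_seconds` is a hypothetical helper.

```python
from fractions import Fraction

def ticks_to_seconds(ticks: int, time_base: Fraction) -> float:
    """Convert a duration or timestamp in time_base units to seconds."""
    return float(ticks * time_base)

# Assumed values: a 90 kHz stream clock and a duration of 450000 ticks.
time_base = Fraction(1, 90000)
duration_ticks = 450000

seconds = ticks_to_seconds(duration_ticks, time_base)
print(seconds)  # 5.0
```

Multiplying before converting to float keeps the arithmetic exact until the final step, which avoids small rounding differences on long videos.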
Spatial vs. Temporal Information
Frames capture appearance: objects, colors, textures, spatial arrangements. Optical flow captures motion: direction, speed, acceleration. Many video models use both, either by stacking optical flow as additional channels or by processing RGB and flow separately.
# Optical flow extraction with OpenCV
import cv2

cap = cv2.VideoCapture("kitchen_activity.mp4")
ret, prev_frame = cap.read()
if not ret:
    raise RuntimeError("Could not read the first frame")
prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)

while cap.isOpened():
    ret, curr_frame = cap.read()
    if not ret:
        break
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Compute dense Farneback optical flow between consecutive frames
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray,
        None,
        pyr_scale=0.5,
        levels=3,
        winsize=15,
        iterations=3,
        poly_n=5,
        poly_sigma=1.2,
        flags=0,
    )
    # flow shape: (H, W, 2) - 2 channels for u, v displacement
    # Reuse the grayscale frame instead of converting it again next iteration
    prev_gray = curr_gray

cap.release()
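One way to read "stacking optical flow as additional channels" is to concatenate an (H, W, 3) frame with its (H, W, 2) flow field along the channel axis. The sketch below does this with NumPy on synthetic arrays; `stack_frame_and_flow` is a hypothetical helper, and the scaling of pixel values to [0, 1] is an assumption rather than a requirement of any particular model.

```python
import numpy as np

def stack_frame_and_flow(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Concatenate an (H, W, 3) frame and an (H, W, 2) flow field into (H, W, 5)."""
    if frame.shape[:2] != flow.shape[:2]:
        raise ValueError("frame and flow must share height and width")
    # Assumed preprocessing: scale 8-bit pixel values to [0, 1].
    # Note that OpenCV decodes frames in BGR order, not RGB.
    pixels = frame.astype(np.float32) / 255.0
    return np.concatenate([pixels, flow.astype(np.float32)], axis=-1)

# Synthetic stand-ins for one decoded frame and its flow field.
frame = np.zeros((48, 64, 3), dtype=np.uint8)
flow = np.ones((48, 64, 2), dtype=np.float32)

stacked = stack_frame_and_flow(frame, flow)

# Speed and direction can be recovered from the (u, v) displacement.
magnitude = np.hypot(flow[..., 0], flow[..., 1])    # pixels per frame
angle = np.arctan2(flow[..., 1], flow[..., 0])      # radians
print(stacked.shape, magnitude.shape, angle.shape)
```

The alternative mentioned above, processing appearance and flow separately, would keep the two arrays apart and feed each to its own network instead of concatenating them.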
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Write a Python function that extracts frames at 1 FPS from a video file using PyAV. Handle the edge case where video duration is less than 1 second. Test with a 0.5-second clip and verify behavior.