Video LLMs — Advanced Multi-Modal Systems (Chapter 5)

Video LLMs extend language model capabilities to video inputs. They answer questions about video content, summarize events, and reason about actions, all while grounding responses in the visual timeline.

Architecture Patterns

Most video LLMs follow a similar pattern: encode the video into a sequence of feature tokens, project those tokens into the language model's embedding space, and let the LLM attend over video and text tokens jointly.

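from transformers import AutoModelForCausalLM
import torch

# Placeholder token marking where video embeddings go in the prompt.
# 128256 is one past the last ID of the 128,256-entry Llama 3 vocabulary,
# so it assumes a token added for this purpose; adjust for your tokenizer.
VIDEO_TOKEN_ID = 128256

class VideoLLaMA(torch.nn.Module):
    def __init__(self, llm_name="meta-llama/Llama-3.2-3B"):
        super().__init__()
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Your video encoder; it must expose out_dim and map
        # (B, T, C, H, W) frames to (B, T, out_dim) features
        self.video_encoder = VideoEncoder()
        self.video_projector = torch.nn.Linear(
            self.video_encoder.out_dim,
            self.llm.config.hidden_size
        )

    def forward(self, video_frames, input_ids, attention_mask=None):
        # Encode video: (B, T, C, H, W) -> (B, T, video_dim)
        video_features = self.video_encoder(video_frames)

        # Project to LLM embedding space: (B, T, hidden_size)
        video_tokens = self.video_projector(video_features)

        # Find placeholder positions; the batch needs B * T of them in
        # total (T per sample) so they line up with the video embeddings
        video_mask = (input_ids == VIDEO_TOKEN_ID)
        if int(video_mask.sum()) != video_tokens.shape[0] * video_tokens.shape[1]:
            raise ValueError("number of video placeholders must equal B * T")

        # Embed the text. Placeholders are swapped for ID 0 first because
        # VIDEO_TOKEN_ID may lie outside the embedding table.
        safe_ids = input_ids.masked_fill(video_mask, 0)
        inputs_embeds = self.llm.get_input_embeddings()(safe_ids)

        # Overwrite the placeholder rows with the video embeddings, in order
        inputs_embeds = inputs_embeds.masked_scatter(
            video_mask.unsqueeze(-1).expand_as(inputs_embeds),
            video_tokens.to(inputs_embeds.dtype)
        )

        # Forward through LLM
        return self.llm(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask
        )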
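
The forward pass above only lines up if each prompt carries one placeholder token per video embedding. A minimal sketch of that expansion step, written in plain Python over token ID lists; the helper name `expand_video_placeholder` is hypothetical, and the sketch assumes the prompt contains a single placeholder marking where the video goes:

```python
def expand_video_placeholder(input_ids, video_token_id, num_video_tokens):
    """Repeat the single video placeholder so it fills num_video_tokens slots.

    Hypothetical helper: assumes exactly one placeholder per prompt.
    """
    count = input_ids.count(video_token_id)
    if count != 1:
        raise ValueError(f"expected exactly one video placeholder, found {count}")
    pos = input_ids.index(video_token_id)
    # Keep the text before and after, widen the placeholder in between
    return input_ids[:pos] + [video_token_id] * num_video_tokens + input_ids[pos + 1:]


# Toy IDs: 9 stands in for the video placeholder
prompt_ids = [1, 5, 9, 7]
expanded = expand_video_placeholder(prompt_ids, video_token_id=9, num_video_tokens=3)
print(expanded)  # [1, 5, 9, 9, 9, 7]
```

With 16 sampled frames and one embedding per frame, `num_video_tokens` would be 16, and the attention mask has to be built from the expanded sequence.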

Video Tokenization Strategies

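How you convert video to tokens determines what the LLM can perceive. The frame sampling rate sets temporal coverage, the spatial resolution sets how much detail survives in each frame, and the resulting token count has to fit within the LLM's context window.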
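
To make the trade-off concrete, the arithmetic below estimates how many video tokens a sampling configuration produces. It is a rough sketch that assumes a ViT-style encoder emitting one token per non-overlapping patch; the function name, the patch size of 14, and the optional pooling factor are illustrative assumptions, not values from this chapter. An encoder that reduces each frame to a single vector, as in the `VideoLLaMA` example, would simply yield one token per frame.

```python
def video_token_count(num_frames, resolution, patch_size=14, pool=1):
    """Estimate the token count for patch-based frame encoding.

    Assumes square frames, non-overlapping patches, and optional
    pool x pool spatial pooling of the patch grid (all illustrative).
    """
    grid = resolution // patch_size   # patches per side
    pooled = grid // pool             # patches per side after pooling
    return num_frames * pooled * pooled


print(video_token_count(16, 224))          # 16 frames * 16 * 16 = 4096
print(video_token_count(16, 224, pool=2))  # 16 frames * 8 * 8 = 1024
print(video_token_count(64, 224))          # 64 frames * 16 * 16 = 16384
```

Under these assumptions, quadrupling the frame count quadruples the token count, while 2x2 pooling cuts it by a factor of four.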

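import av
import numpy as np
from PIL import Image

class FrameSamplingVideoTokenizer:
    """Tokenize video by sampling N frames."""
    def __init__(self, num_frames=16, resolution=224):
        self.num_frames = num_frames
        self.resolution = resolution

    def tokenize(self, video_path):
        with av.open(video_path) as container:
            stream = container.streams.video[0]
            # stream.frames is the frame count reported by the container
            # (stream.duration is in time_base units, not frames). It can
            # be 0 when the container does not record a count.
            total_frames = stream.frames
            if total_frames <= 0:
                raise ValueError("container does not report a frame count")

            # Uniform temporal sampling. A set gives fast membership tests
            # and drops the duplicate indices a very short video produces.
            indices = set(
                np.linspace(0, total_frames - 1, self.num_frames, dtype=int).tolist()
            )

            frames = []
            for i, frame in enumerate(container.decode(video=0)):
                if i in indices:
                    # Resize to fixed resolution
                    frames.append(self.resize_frame(frame, self.resolution))

        # (num_sampled, resolution, resolution, 3); num_sampled can be
        # below num_frames when the video has fewer frames than requested
        return np.stack(frames)

    def resize_frame(self, frame, target_size):
        # Decode to RGB, resize with PIL, and return a NumPy array
        pil_image = Image.fromarray(frame.to_ndarray(format="rgb24"))
        resized = pil_image.resize((target_size, target_size), Image.BILINEAR)
        return np.asarray(resized)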
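
The index selection can also be written without NumPy. The sketch below picks the midpoint of each of N equal temporal segments, a variant of the `np.linspace` approach that does not always land on the very first and last frame. The function name `segment_midpoint_indices` is hypothetical, and returning every frame for short videos is a design choice of this sketch.

```python
def segment_midpoint_indices(total_frames, num_frames):
    """Return one frame index per equal-length temporal segment."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if total_frames <= num_frames:
        # Short video: keep every frame rather than repeating indices
        return list(range(total_frames))
    seg = total_frames / num_frames  # segment length in frames (> 1 here)
    return [int(seg * (k + 0.5)) for k in range(num_frames)]


print(segment_midpoint_indices(100, 4))  # [12, 37, 62, 87]
print(segment_midpoint_indices(3, 16))   # [0, 1, 2]
```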

Common Failure: Temporal Grounding Errors

Video LLMs often hallucinate temporal relationships. When asked "What happened after the car turned right?", models may describe events that precede the referenced action or invent intermediate steps.

# Example prompt that commonly causes temporal confusion
prompt = """
The video shows: [video]

Based on the video, answer these questions:
1. What action occurs at the 5-second mark?
2. What happens immediately after?
3. What action was the person doing before that?

If you are uncertain, say so rather than guessing.
"""

# Mitigation: Force explicit temporal references
structured_prompt = """
The video contains events in sequence. Answer using ONLY information visible in the video.
- Frames 1-8 show: [describe what model sees]
- Frames 9-16 show: [describe what model sees]

Question: What happens after Frame 8?
Answer from Frames 9-16 only:
"""
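
One way to supply explicit temporal references like these is to generate the prompt from the sampling metadata, so each frame range is labeled with the time span it covers. The sketch below is an illustration only: `build_temporal_prompt` is a hypothetical helper, it assumes frames were sampled uniformly across the clip, and the per-segment descriptions are left as placeholders to be filled in separately.

```python
def build_temporal_prompt(num_frames, duration_s, question, frames_per_segment=8):
    """Build a prompt whose frame ranges carry approximate timestamps."""
    lines = [
        "The video contains events in sequence. "
        "Answer using ONLY information visible in the video."
    ]
    # Under uniform sampling, each frame stands for an equal slice of time
    sec_per_frame = duration_s / num_frames
    for start in range(0, num_frames, frames_per_segment):
        end = min(start + frames_per_segment, num_frames)
        t0 = start * sec_per_frame
        t1 = end * sec_per_frame
        lines.append(
            f"- Frames {start + 1}-{end} ({t0:.1f}s to {t1:.1f}s): [description]"
        )
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt_with_times = build_temporal_prompt(16, 20.0, "What happens after Frame 8?")
print(prompt_with_times)
```

For a 20-second clip sampled at 16 frames, the 5-second mark falls inside the first range (frames 1-8, 0.0s to 10.0s), which gives the model a concrete anchor for timestamp questions.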