Evaluation Metrics — Advanced Multi-Modal Systems (Chapter 18)

Video multimodal evaluation requires metrics that capture temporal dynamics, not just per-frame accuracy. Standard image classification metrics miss critical failure modes unique to video understanding.

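Temporal consistency metrics evaluate whether model predictions remain stable over time for similar content. Flickering predictions, where the label changes from frame to frame although the content does not, indicate instability that makes a model hard to use in production. When each frame-level prediction is encoded as an integer bitmask of active labels, the Hamming distance between consecutive predictions provides a simple consistency measure.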

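def temporal_consistency_score(predictions, num_bits=None):
    """Return 1 minus the mean normalized Hamming distance between consecutive predictions.

    predictions: sequence of non-negative ints, one label bitmask per frame.
    num_bits: width of the label bitmask; inferred from the largest value if omitted.
    """
    if len(predictions) < 2:
        return 1.0

    if num_bits is None:
        # predictions[0].bit_length() alone is 0 when the first frame has no
        # active labels, which would cause a division by zero below.
        num_bits = max(max(p.bit_length() for p in predictions), 1)

    # Count the label bits that flip between each pair of adjacent frames
    transitions = sum(
        bin(p1 ^ p2).count('1')
        for p1, p2 in zip(predictions[:-1], predictions[1:])
    )
    max_transitions = len(predictions) - 1

    return 1.0 - (transitions / (max_transitions * num_bits))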

Action recognition metrics for untrimmed video evaluate temporal localization accuracy: what action occurs and when. Mean Average Precision (mAP) across action classes measures both classification and temporal boundary accuracy. A temporal IoU (intersection over union) threshold determines whether a detected segment counts as a match for a ground-truth segment.
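As a rough illustration of how temporal IoU feeds into mAP, the sketch below computes a non-interpolated average precision for one action class and averages it across classes. The function names, the (start, end, score) tuple layout, the default threshold of 0.5, and the greedy matching rule (each detection is matched to the best still-unmatched ground-truth segment) are all assumptions made for this example; published benchmarks differ in their matching and interpolation conventions.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one action class (hypothetical helper).

    detections: list of (start, end, score) tuples
    ground_truth: list of (start, end) tuples
    """
    if not ground_truth:
        return 0.0

    matched = set()  # indices of ground-truth segments already claimed
    hits = []        # 1 for a true positive, 0 for a false positive, in score order
    for start, end, _score in sorted(detections, key=lambda d: d[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            hits.append(1)
        else:
            hits.append(0)

    # Sum precision at the rank of each true positive, then divide by the
    # number of ground-truth segments so that missed segments lower the score.
    ap, true_positives = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            ap += true_positives / rank
    return ap / len(ground_truth)


def mean_average_precision(per_class, iou_threshold=0.5):
    """per_class maps class name -> (detections, ground_truth)."""
    aps = [average_precision(d, g, iou_threshold) for d, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

Raising iou_threshold makes the metric stricter about boundary placement, which is why localization results are often reported at several thresholds instead of one.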

Cross-modal consistency measures whether different modalities produce coherent interpretations. A video+audio model should not classify a clip as "speaking" when the video shows a silent person and the audio contains only ambient noise. Low cross-modal agreement can point to training instabilities or data quality issues such as misaligned or mislabeled audio tracks.
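One simple way to quantify this is an agreement rate between per-clip labels predicted from each modality separately. The sketch below assumes both modality heads predict from the same label vocabulary and that their outputs are aligned clip by clip; the function names are hypothetical and chosen only for this example.

```python
def cross_modal_agreement(video_labels, audio_labels):
    """Fraction of clips where the video-only and audio-only labels match."""
    if len(video_labels) != len(audio_labels):
        raise ValueError("label sequences must be aligned clip by clip")
    if not video_labels:
        return 1.0
    agreements = sum(v == a for v, a in zip(video_labels, audio_labels))
    return agreements / len(video_labels)


def disagreement_indices(video_labels, audio_labels):
    """Indices of clips where the two modalities disagree, for manual review."""
    return [
        i for i, (v, a) in enumerate(zip(video_labels, audio_labels))
        if v != a
    ]


# Illustrative labels only, not real model output
video = ["speaking", "silent", "speaking", "silent"]
audio = ["speaking", "silent", "silent", "silent"]
print(cross_modal_agreement(video, audio))   # 0.75
print(disagreement_indices(video, audio))    # [2]
```

Inspecting the clips returned by disagreement_indices is often more informative than the aggregate rate, since it shows whether disagreements cluster around particular classes or data sources.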

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.