01. Beyond Images
Multi-modal AI extends far beyond static image classification. Modern systems process video, audio, 3D point clouds, sensor data, and documents with interleaved text and visuals. This chapter establishes the conceptual framework for understanding why multiple modalities matter and how they connect.
The Modality Spectrum
Each modality carries distinct information. Text excels at explicit reasoning and instruction following. Images capture spatial relationships and appearance. Video adds temporal dynamics. Audio conveys emotion, prosody, and environmental sounds. Point clouds provide precise geometry. The benefit comes when these representations inform each other, because each modality supplies evidence the others lack.
Consider a kitchen robot. Vision tells it where objects are. Language tells it what to cook. Audio tells it someone is calling for help. Force sensors tell it the pot is slipping. No single modality suffices, so integration is mandatory.
Joint Embedding Spaces
Many multi-modal models work by projecting different inputs into a shared representation space, so that a query in one modality can find relevant content in another through vector similarity.
# Conceptual joint embedding
import numpy as np

# text_proj, image_proj, and audio_proj are assumed to be learned projection
# layers defined elsewhere, each mapping one encoder's output to d dimensions.
def project_to_joint_space(text_embed, image_embed, audio_embed):
    text_projected = text_proj(text_embed)     # (d,)
    image_projected = image_proj(image_embed)  # (d,)
    audio_projected = audio_proj(audio_embed)  # (d,)
    # All three now live in the same d-dimensional space
    return text_projected, image_projected, audio_projected

def compute_cross_modal_similarity(query_modality, query_vec, target_vec):
    # Cosine similarity in joint space; query_modality is not needed for the
    # computation because both vectors already share one space
    return np.dot(query_vec, target_vec) / (
        np.linalg.norm(query_vec) * np.linalg.norm(target_vec)
    )
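To make the retrieval idea concrete, here is a minimal runnable sketch of a text query ranking image candidates in a joint space. Everything specific in it is an assumption made for illustration: the dimensions, the names W_text, W_image, l2_normalize, and retrieve, and above all the random matrices, which stand in for projections that a real system would learn during training. With random weights the ranking carries no meaning; the sketch only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output sizes and joint dimension (assumptions)
TEXT_DIM, IMAGE_DIM, JOINT_DIM = 8, 12, 4

# Stand-ins for learned projection matrices: random here, trained in practice
W_text = rng.normal(size=(JOINT_DIM, TEXT_DIM))
W_image = rng.normal(size=(JOINT_DIM, IMAGE_DIM))


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def retrieve(query_vec, candidate_vecs, top_k=2):
    """Return indices and scores of the top_k candidates, best first."""
    q = l2_normalize(query_vec)
    c = l2_normalize(candidate_vecs)
    scores = c @ q                      # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Fake encoder outputs: one text query and five image candidates
text_embed = rng.normal(size=TEXT_DIM)
image_embeds = rng.normal(size=(5, IMAGE_DIM))

# Project both modalities into the same JOINT_DIM-dimensional space
query = W_text @ text_embed             # (JOINT_DIM,)
candidates = image_embeds @ W_image.T   # (5, JOINT_DIM)

top_idx, top_scores = retrieve(query, candidates)
print("best-matching image indices:", top_idx)
```

Normalizing once and then taking dot products gives the same cosine score as the formula above, and it lets one matrix product score every candidate at once.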
Alignment and Fusion
Two fundamental operations exist: alignment and fusion. Alignment matches corresponding elements across modalities (this frame matches this transcript segment). Fusion combines information into a unified representation. Many architectures do both sequentially: align first, then fuse.
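The align-then-fuse sequence can be sketched with toy data. The example below aligns video frames to transcript segments by timestamp and then fuses each frame with its segment by concatenating feature vectors. The function names, the timestamps, and the feature values are all hypothetical, and both timestamp matching and concatenation are only one simple choice each; learned alignment and attention-based fusion are common alternatives.

```python
import numpy as np


def align_frames_to_segments(frame_times, segments):
    """For each frame timestamp, return the index of the transcript
    segment whose [start, end) interval contains it, or -1 if none does."""
    aligned = []
    for t in frame_times:
        match = -1
        for i, (start, end) in enumerate(segments):
            if start <= t < end:
                match = i
                break
        aligned.append(match)
    return aligned


def fuse_by_concatenation(frame_feats, segment_feats, alignment):
    """Concatenate each frame feature with its aligned segment feature.
    Frames with no aligned segment get a zero vector in that slot."""
    seg_dim = segment_feats.shape[1]
    fused = []
    for feat, idx in zip(frame_feats, alignment):
        seg = segment_feats[idx] if idx >= 0 else np.zeros(seg_dim)
        fused.append(np.concatenate([feat, seg]))
    return np.stack(fused)


# Toy inputs (all values are made up for illustration)
frame_times = [0.5, 1.5, 2.5, 9.0]                   # seconds
segments = [(0.0, 1.0), (1.0, 3.0)]                  # transcript spans
frame_feats = np.ones((4, 3))                        # 4 frames, 3-dim features
segment_feats = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2 segments, 2-dim features

alignment = align_frames_to_segments(frame_times, segments)
fused = fuse_by_concatenation(frame_feats, segment_feats, alignment)

print(alignment)    # [0, 1, 1, -1]
print(fused.shape)  # (4, 5)
```

The last frame falls outside every segment, so it stays unaligned and is padded with zeros. A real pipeline has to decide how to treat such gaps explicitly.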
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
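One way to capture the checkpoint details is a small script that writes them out as JSON. This is a sketch under assumptions, not a required format: the helper names package_version and build_run_record, the field names, and the data path are hypothetical, and the observed output is left as a placeholder to fill in after an actual run.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_run_record(packages, data_path, observed_output, notes=""):
    """Collect runtime details for a reproducible exercise log."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,
    }


record = build_run_record(
    packages=["numpy"],
    data_path="data/example",  # hypothetical path; replace with your own
    observed_output="<paste the output you observed here>",
    notes="<e.g. CPU only, model size, available memory>",
)
print(json.dumps(record, indent=2))
```

Saving this record next to the exercise keeps any hardware or version constraint attached to the result it explains.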
Audit an existing single-modality ML pipeline. List three scenarios where adding a second modality would improve decisions. For each, identify what information the second modality provides that the first cannot.