21. Synthetic Data
Synthetic data addresses the scarcity of multimodal training data by generating labeled training examples algorithmically. Video synthesis can produce effectively unlimited variation in scenes, lighting, and action, and because the labels come from the generator itself, they are exact by construction.
Physics-based rendering in game engines (Unreal Engine, Unity) produces photorealistic video with accurate depth and surface normal annotations. Because these annotations come directly from the renderer, a model can be trained to predict rendering parameters without any manual labeling. The main limitation is the domain gap between rendered and real footage.
# Conceptual synthetic video generation
# Note: create_scene_from_prompt, animate_camera, render, extract_action_labels,
# and extract_depth_maps are placeholders for a rendering backend, not real APIs.
def generate_synthetic_clip(prompt, num_frames=16, resolution=(224, 224)):
    """
    Generate synthetic video from text prompt.
    Production systems would use trained video diffusion models
    or game engine rendering pipelines.
    """
    # In practice, this requires:
    # 1. 3D scene generation from prompt
    # 2. Camera path animation
    # 3. Lighting simulation
    # 4. Rendering at target framerate
    # 5. Post-processing effects
    scene = create_scene_from_prompt(prompt)
    camera = animate_camera(scene, num_frames)
    # Pass the target resolution through so the parameter is actually used
    frames = render(scene, camera, num_frames, resolution)
    # Synthetic labels from scene graph
    labels = extract_action_labels(scene)
    # Depth is view-dependent, so it is extracted along the same camera path
    depth = extract_depth_maps(scene, camera)
    return {'frames': frames, 'labels': labels, 'depth': depth}
Sim-to-real transfer commonly relies on domain randomization during synthesis. Varying lighting, textures, camera parameters, and background clutter across training clips encourages the model to learn domain-invariant features. The key insight: the randomized ranges should be chosen to cover the distribution of real-world variation, so that real footage looks like one more variation the model has already seen.
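As a minimal sketch of domain randomization, the snippet below draws one rendering configuration per clip from fixed ranges. The SceneParams fields, the numeric ranges, and the texture and clutter counts are all illustrative assumptions rather than values from any particular engine. In practice the ranges would be estimated from real footage.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical per-clip rendering parameters (names are illustrative)."""
    light_intensity: float     # relative brightness multiplier
    light_azimuth_deg: float   # direction of the key light
    texture_id: int            # index into an assumed texture library
    focal_length_mm: float     # camera parameter
    num_clutter_objects: int   # background distractors


# Assumed ranges; ideally estimated from statistics of real footage.
RANGES = {
    "light_intensity": (0.3, 2.0),
    "light_azimuth_deg": (0.0, 360.0),
    "focal_length_mm": (18.0, 85.0),
}
NUM_TEXTURES = 50
MAX_CLUTTER = 10


def sample_scene_params(rng: random.Random) -> SceneParams:
    """Draw one randomized configuration, sampling each parameter independently."""
    return SceneParams(
        light_intensity=rng.uniform(*RANGES["light_intensity"]),
        light_azimuth_deg=rng.uniform(*RANGES["light_azimuth_deg"]),
        texture_id=rng.randrange(NUM_TEXTURES),
        focal_length_mm=rng.uniform(*RANGES["focal_length_mm"]),
        num_clutter_objects=rng.randint(0, MAX_CLUTTER),
    )


rng = random.Random(0)  # fixed seed so the sampled set is reproducible
configs = [sample_scene_params(rng) for _ in range(4)]
for cfg in configs:
    print(cfg)
```

Uniform, independent sampling is the simplest choice. If real-world parameters are correlated (for example, low light with indoor textures), sampling them jointly would match the real distribution more closely.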
Audio-visual synthetic data can be generated by pairing isolated audio tracks with silent video clips, then mixing several such pairs to create complex scenes. Because each source is known before mixing, this approach provides ground truth for audio source separation and visual localization that real data cannot easily provide.
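The mixing step can be sketched with NumPy as follows. This is an illustration under stated assumptions: the two sine tones stand in for recorded isolated tracks, and the sample rate, gains, and clip identifiers are hypothetical. The point is that the pre-mix stems are kept alongside the mixture, so the separation targets are known exactly.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def mix_sources(sources, gains):
    """Sum gain-scaled sources; return the mixture and the scaled stems."""
    stems = np.stack([g * s for g, s in zip(gains, sources)])
    mixture = stems.sum(axis=0)
    # Rescale only if the sum would clip, applying the same factor to the
    # stems so they still add up exactly to the mixture.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        stems = stems / peak
        mixture = mixture / peak
    return mixture, stems


# Stand-in sources: two sine tones in place of recorded isolated tracks.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of audio
source_a = np.sin(2 * np.pi * 440.0 * t)
source_b = np.sin(2 * np.pi * 660.0 * t)

mixture, stems = mix_sources([source_a, source_b], gains=[0.8, 0.5])

# Each stem stays tied to the silent clip it was paired with, which is
# what supplies the visual localization label (clip ids are hypothetical).
sample = {
    "audio": mixture,
    "separation_targets": stems,
    "source_video_ids": ["clip_a", "clip_b"],
}
print(sample["audio"].shape, sample["separation_targets"].shape)
```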
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Generate synthetic video clips using a text-to-video model (if available) or image interpolation. Compare the feature distributions of synthetic and real video using Fréchet Inception Distance computed on frame-level features.
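For the comparison step, a minimal sketch of the Fréchet distance between two feature sets is shown below. It assumes the features have already been extracted as arrays of shape (N, D). Here random Gaussian arrays stand in for real Inception features, so the printed numbers only illustrate the behavior of the metric. The trace of the matrix square root is computed from eigenvalues with NumPy alone, which is one of several ways to evaluate that term.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    diff = mu_a - mu_b
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip guards against rounding noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in features: in practice these would be per-frame embeddings
# from a pretrained Inception network for real and synthetic clips.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
synth_close = rng.normal(0.0, 1.0, size=(500, 8))  # same distribution
synth_far = rng.normal(1.5, 1.0, size=(500, 8))    # shifted mean

d_close = frechet_distance(real, synth_close)
d_far = frechet_distance(real, synth_far)
print(f"close: {d_close:.3f}  far: {d_far:.3f}")
```

A lower value means the synthetic features are statistically closer to the real ones. Estimates are noisy for small N, so the same number of frames should be used when comparing different synthesis settings.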