04. Image Captioning

Chapter 4 of 18 · 20 min

KEY INSIGHT

Image captioning converts visual content into natural language descriptions. Prompt engineering significantly affects output quality: specific, structured prompts yield more consistent results than open-ended queries.

Image captioning generates textual descriptions of image content. Multi-modal models process the image through a vision encoder and produce text through autoregressive generation. Caption quality depends on prompt formulation, image resolution, and model capabilities.

Basic caption generation:

```python
from PIL import Image
import torch


def generate_caption(model, processor, image_path, max_new_tokens=100,
                     instruction="Describe this image in detail."):
    # Normalize to RGB so grayscale or RGBA files do not break preprocessing
    image = Image.open(image_path).convert("RGB")

    # One user turn: an image placeholder followed by the text instruction
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(
        images=image,
        text=prompt,
        return_tensors="pt"
    ).to(model.device)

    # Greedy decoding (do_sample=False) keeps captions repeatable across runs
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False
        )

    # Assumes a decoder-only model whose output begins with the prompt tokens;
    # slicing them off leaves only the newly generated caption.
    generated = output[:, inputs["input_ids"].shape[1]:]
    caption = processor.batch_decode(generated, skip_special_tokens=True)[0]
    return caption.strip()


# Usage (model and processor are assumed to be loaded already)
caption = generate_caption(model, processor, "photos/landscape.jpg")
print(caption)
```

Prompt variations affect output style. Each string below can be passed as the `instruction` argument:

```python
# Brief factual caption
prompt = "Provide a concise, objective caption."

# Detailed descriptive caption
prompt = "Describe all visible objects, their positions, colors, and any text present."

# Narrative caption
prompt = "Write a caption as if for a photojournalism article."
```

Failure modes in captioning:

- **Truncation**: Preprocessing may center-crop the image, losing peripheral details
- **Hallucination**: Models sometimes describe objects that are not in the image
- **Text unreadability**: Small text is often skipped entirely

Benchmark caption quality against reference captions:

```python
# datasets.load_metric was deprecated and later removed from the datasets
# library; the separate evaluate package provides the same metrics.
import evaluate

# Load captioning metrics
bleu = evaluate.load("bleu")


def evaluate_caption(prediction, references):
    # predictions: one string per sample
    # references: one list of reference strings per sample
    return bleu.compute(
        predictions=[prediction],
        references=[references]
    )
```
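To make the BLEU score above less of a black box, here is a minimal, dependency-free sketch of its core ingredient, clipped n-gram precision. Full BLEU combines these precisions for several n-gram sizes and applies a brevity penalty, which this sketch leaves out. The function names and the sample captions are illustrative only, and tokenization is a plain lowercase whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, references, n=1):
    """Clipped n-gram precision of one caption against its references."""
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    if not pred_counts:
        return 0.0

    # For each n-gram, the highest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # A predicted n-gram only earns credit up to that reference count,
    # so repeating a correct word does not inflate the score.
    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in pred_counts.items())
    return matched / sum(pred_counts.values())


# Made-up example captions, not taken from any dataset
refs = ["a dog runs on the beach", "a brown dog running along the shore"]
print(clipped_precision("a dog runs on the sand", refs, n=1))
print(clipped_precision("a dog runs on the sand", refs, n=2))
```

The only unmatched word in the prediction is "sand", so five of six unigrams and four of five bigrams receive credit. A hallucinated object lowers this precision in the same way, which is one reason reference-based metrics are useful for spotting the failure modes listed above.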

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
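One way to capture those details is a small helper that records the interpreter, platform, and package versions alongside your results. This is a sketch using only the standard library; the function name and the default package list are assumptions, so adjust them to whatever your environment actually uses.

```python
import json
import platform
from importlib import metadata


def environment_record(packages=("torch", "transformers", "pillow")):
    """Collect interpreter, OS, and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


record = environment_record()
print(json.dumps(record, indent=2))
```

Saving this record next to the generated captions makes it easier to explain differences when the same exercise is rerun on another machine or backend.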

EXERCISE

Create a captioning pipeline processing 10 images from your dataset. Generate captions with three different prompt styles and save results. Evaluate consistency across images.
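A possible starting skeleton for this exercise is sketched below. It assumes a `caption_fn(image_path, prompt_text)` callable that you supply, for example a wrapper around the chapter's `generate_caption`. The stub `fake_caption`, the file names, and the choice of Jaccard token overlap as the consistency measure are all illustrative assumptions, not a prescribed solution.

```python
import json
import tempfile
from itertools import combinations
from pathlib import Path

PROMPT_STYLES = {
    "brief": "Provide a concise, objective caption.",
    "detailed": "Describe all visible objects, their positions, colors, and any text present.",
    "narrative": "Write a caption as if for a photojournalism article.",
}


def jaccard(a, b):
    """Token-set overlap between two captions, from 0.0 to 1.0."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def run_pipeline(image_paths, caption_fn, output_path):
    """Caption each image with every prompt style and write JSONL rows."""
    rows = []
    for path in image_paths:
        captions = {style: caption_fn(path, text)
                    for style, text in PROMPT_STYLES.items()}
        # Consistency here means average pairwise overlap between styles.
        pairs = list(combinations(captions.values(), 2))
        consistency = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
        rows.append({"image": str(path), "captions": captions,
                     "consistency": consistency})
    with open(output_path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return rows


def fake_caption(path, prompt_text):
    # Stand-in for a real model call; replace with your own captioner.
    return f"{Path(path).stem} photo, {len(prompt_text.split())} prompt words"


# Hypothetical file names; the stub never opens them.
output_file = Path(tempfile.gettempdir()) / "captions.jsonl"
rows = run_pipeline(["img_01.jpg", "img_02.jpg"], fake_caption, output_file)
print(rows[0]["consistency"])
```

Token overlap is a crude measure: two captions can agree on content while sharing few words. Treat the score as a prompt to read the low-scoring captions side by side, not as a final verdict.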