RUNLOCALAI v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Multi-Modal AI: Vision and Text

04. Image Captioning

Chapter 4 of 18 · 20 min
KEY INSIGHT

Image captioning converts visual content into natural language descriptions. Prompt engineering significantly affects output quality: specific, structured prompts yield more consistent results than open-ended queries.

Image captioning generates textual descriptions of image content. Multi-modal models process the image through the vision encoder and produce text through autoregressive generation. The quality depends on prompt formulation, image resolution, and model capabilities.

Basic caption generation:

```python
from PIL import Image
import torch

def generate_caption(model, processor, image_path, max_new_tokens=100,
                     prompt_text="Describe this image in detail."):
    image = Image.open(image_path).convert("RGB")

    # Chat-style input: one image placeholder followed by the text instruction
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt_text}
            ]
        }
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    inputs = processor(
        images=image,
        text=prompt,
        return_tensors="pt"
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # greedy decoding for repeatable captions
        )

    # generate() returns prompt tokens followed by new tokens; keep only the new ones
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    caption = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return caption.strip()

# Usage (assumes model and processor are already loaded)
caption = generate_caption(model, processor, "photos/landscape.jpg")
print(caption)
```

Prompt variations affect output style:

```python
# Each string can be passed to generate_caption as prompt_text

# Brief factual caption
prompt = "Provide a concise, objective caption."

# Detailed descriptive caption
prompt = "Describe all visible objects, their positions, colors, and any text present."

# Narrative caption
prompt = "Write a caption as if for a photojournalism article."
```

Failure modes in captioning:

- **Truncation**: Images may be center-cropped during preprocessing, losing peripheral details
- **Hallucination**: Models sometimes describe objects that are not in the image
- **Text unreadability**: Small text is often skipped entirely

Benchmark caption quality with reference captions:

```python
# datasets.load_metric was removed in recent datasets releases;
# the evaluate package (pip install evaluate) provides the same metrics.
import evaluate

# Load captioning metrics
bleu = evaluate.load("bleu")

def evaluate_caption(prediction, references):
    # BLEU takes raw strings: one prediction and one list of references per prediction
    return bleu.compute(
        predictions=[prediction],
        references=[references]
    )
```
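
To make concrete what a BLEU-style score rewards, the sketch below computes only the unigram part (often called BLEU-1): clipped word precision multiplied by a brevity penalty. It is a simplification for illustration, not the library metric: the function name `bleu1` is hypothetical, tokenization is plain lowercase whitespace splitting, and full BLEU also combines 2- to 4-gram precisions.

```python
import math
from collections import Counter

def bleu1(prediction: str, references: list[str]) -> float:
    """Clipped unigram precision times brevity penalty (BLEU-1 only)."""
    pred_tokens = prediction.lower().split()
    ref_token_lists = [ref.lower().split() for ref in references]
    if not pred_tokens or not ref_token_lists:
        return 0.0

    # Clip each predicted word's count by its highest count in any single reference,
    # so repeating a correct word does not inflate the score.
    pred_counts = Counter(pred_tokens)
    max_ref_counts = Counter()
    for ref_tokens in ref_token_lists:
        for token, count in Counter(ref_tokens).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token])
                  for token, count in pred_counts.items())
    precision = clipped / len(pred_tokens)

    # Brevity penalty: compare against the reference whose length is closest
    # to the prediction (ties go to the shorter reference).
    closest_ref_len = min(
        (len(tokens) for tokens in ref_token_lists),
        key=lambda ref_len: (abs(ref_len - len(pred_tokens)), ref_len),
    )
    if len(pred_tokens) >= closest_ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - closest_ref_len / len(pred_tokens))
    return precision * brevity_penalty
```

A caption that matches a reference word for word scores 1.0, while a caption that is correct but much shorter than every reference is pulled down by the brevity penalty. Word-overlap scores like this do not detect hallucinated objects directly, so they are best read alongside a manual check of the failure modes listed above.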

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
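
One way to capture the details this checkpoint asks for is a small script that writes them out as JSON. The sketch below uses only the standard library; the function name `environment_record`, the field names, and the package list in the usage section are assumptions chosen for illustration, so adjust them to whatever your workspace actually uses.

```python
import json
import platform
import sys
from importlib import metadata

def environment_record(packages, data_path=None, observed_output=None):
    """Collect interpreter, platform, and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
        "data_path": data_path,
        "observed_output": observed_output,
    }

if __name__ == "__main__":
    # Hypothetical package list and path; replace with the ones you used.
    record = environment_record(
        ["torch", "transformers", "pillow"],
        data_path="photos/landscape.jpg",
    )
    print(json.dumps(record, indent=2))
```

The record does not include GPU or memory details; note those by hand beside the exercise if the result depends on them.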

EXERCISE

Create a captioning pipeline processing 10 images from your dataset. Generate captions with three different prompt styles and save results. Evaluate consistency across images.
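
A possible starting point for this exercise is sketched below. It assumes a hypothetical `caption_fn(image_path, prompt_text)` callable that wraps whatever model you loaded, saves one JSON line per image, and uses word-set overlap (Jaccard similarity) between the three captions as a rough stand-in for "consistency". The names `run_pipeline`, `jaccard`, and `style_agreement` are illustrative, and the overlap measure is only one of several reasonable choices.

```python
import json
from itertools import combinations
from pathlib import Path

# The three prompt styles from this chapter
PROMPT_STYLES = {
    "brief": "Provide a concise, objective caption.",
    "detailed": "Describe all visible objects, their positions, colors, and any text present.",
    "narrative": "Write a caption as if for a photojournalism article.",
}

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two captions, from 0.0 (disjoint) to 1.0 (identical sets)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

def run_pipeline(image_paths, caption_fn, output_path, styles=PROMPT_STYLES):
    """Caption every image with every style and write one JSON line per image."""
    rows = []
    with Path(output_path).open("w", encoding="utf-8") as handle:
        for image_path in image_paths:
            captions = {
                name: caption_fn(str(image_path), text)
                for name, text in styles.items()
            }
            # Mean pairwise overlap between the styles' captions for this image
            pairs = list(combinations(captions.values(), 2))
            agreement = (
                sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
            )
            row = {
                "image": str(image_path),
                "captions": captions,
                "style_agreement": agreement,
            }
            handle.write(json.dumps(row) + "\n")
            rows.append(row)
    return rows
```

Images with unusually low `style_agreement` are good candidates for manual review, since the styles may be disagreeing about what is actually in the picture rather than just about tone.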

← Chapter 3: BakLLaVA Setup
Chapter 5: Visual Question Answering →