RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Multi-Modal AI: Vision and Text

05. Visual Question Answering

Chapter 5 of 18 · 15 min
KEY INSIGHT

Visual Question Answering (VQA) combines image understanding with language generation, allowing free-form questions about visual content. Structured prompts with explicit context improve answer accuracy.

Visual Question Answering extends captioning by accepting user questions. The model must identify the relevant visual elements, reason about their relationships, and format the answer appropriately. The task is harder than captioning because each answer must address a specific query rather than describe the image in general.

```python
import torch
from PIL import Image

# `model` and `processor` are assumed to be loaded already


def answer_visual_question(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": f"Question: {question}\nAnswer concisely."},
            ],
        }
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,   # Greedy decoding: deterministic for Q&A
            temperature=None,  # Unset so it does not conflict with do_sample=False
        )

    # For decoder-only models, generate() returns prompt + answer tokens;
    # keep only the newly generated part so the prompt is not echoed back
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return answer.strip()


# Example questions
questions = [
    "What is the main subject of this image?",
    "How many people are visible in the scene?",
    "What colors dominate the background?",
    "Is there any text visible? If so, what does it say?",
    "What time of day does this image appear to show?",
]

for question in questions:
    answer = answer_visual_question(model, processor, "test.jpg", question)
    print(f"Q: {question}\nA: {answer}\n")
```

Multi-turn VQA enables follow-up questions:

```python
conversation_history = []


def multi_turn_vqa(image_path, question):
    # Uses the module-level `model` and `processor`
    image = Image.open(image_path).convert("RGB")

    # Add the image placeholder on the first turn only, so the prompt
    # contains exactly one image slot for the single image passed below
    content = [{"type": "text", "text": question}]
    if not conversation_history:
        content.insert(0, {"type": "image"})
    conversation_history.append({"role": "user", "content": content})

    # Include previous turns for context
    prompt = processor.apply_chat_template(conversation_history, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=150)

    # Decode only the new tokens, not the echoed conversation
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()

    conversation_history.append({
        "role": "assistant",
        "content": [{"type": "text", "text": response}],
    })
    return response
```

Common failure modes in VQA:

- **Ambiguous questions**: "What is this?" produces varied responses
- **Counting errors**: Models struggle with precise counts
- **Spatial reasoning**: Questions about relative positions often fail
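The key insight for this chapter is that structured prompts with explicit context improve answers, and ambiguous questions are listed among the failure modes. The following is a minimal sketch of one way to assemble such a prompt as plain text. The helper name `build_vqa_prompt`, its `context` and `answer_format` parameters, and the wording of the format hints are all illustrative assumptions, not part of any library.

```python
# Hypothetical helper: names and hint wording are illustrative assumptions.
FORMAT_HINTS = {
    "yes_no": "Answer with 'yes' or 'no' only.",
    "count": "Answer with a single number.",
    "short": "Answer concisely.",
}


def build_vqa_prompt(question, context=None, answer_format="short"):
    """Assemble a structured text prompt for one VQA question."""
    if answer_format not in FORMAT_HINTS:
        raise ValueError(f"unknown answer_format: {answer_format!r}")

    parts = []
    if context:
        # Explicit context narrows down what an otherwise vague question refers to
        parts.append(f"Context: {context.strip()}")
    parts.append(f"Question: {question.strip()}")
    parts.append(FORMAT_HINTS[answer_format])
    return "\n".join(parts)


print(build_vqa_prompt(
    "How many people are visible?",
    context="Street photo taken at a marathon.",
    answer_format="count",
))
```

The returned string can be used as the `text` entry of the user message in place of the fixed `"Question: ...\nAnswer concisely."` template. Whether a given hint actually helps depends on the model, so it is worth comparing answers with and without it.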

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
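One way to capture the details this checkpoint asks for is a small script that writes them out as JSON. This is a sketch using only the standard library; the function name `collect_run_record`, the field names, and the package list are assumptions chosen for illustration, and the placeholder strings should be replaced with what you actually observe.

```python
import json
import platform
import sys
from importlib import metadata


def collect_run_record(packages, data_path, observed_output, notes=""):
    """Gather package versions and runtime details for a reproducibility note."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Package is not installed in this environment

    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "data_path": str(data_path),
        "observed_output": observed_output,
        "notes": notes,
    }


record = collect_run_record(
    ["torch", "transformers", "pillow"],  # Assumed package list; adjust to your setup
    "test.jpg",
    "<paste the model's answer here>",
    notes="<model size, CPU/GPU backend, available memory>",
)
print(json.dumps(record, indent=2))
```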

EXERCISE

Build an interactive VQA script that maintains conversation history and allows multi-turn dialogue on a single image. Test edge cases like yes/no questions, counting, and spatial relationships.
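When testing the edge cases, free-form answers are hard to compare by eye. The sketch below shows one possible way to normalize yes/no and counting answers before checking them against what you expect. The helper names `normalize_yes_no` and `extract_count` and the small number-word table are assumptions for illustration, and the sample strings are hand-written, not model output.

```python
import re

# Small illustrative table; extend it if your test images need larger counts
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def normalize_yes_no(answer):
    """Return 'yes' or 'no' if the answer starts with one of them, else None."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def extract_count(answer):
    """Return the first number in the answer (digits or a number word), else None."""
    for token in re.findall(r"[a-z]+|\d+", answer.lower()):
        if token.isdigit():
            return int(token)
        if token in NUMBER_WORDS:
            return NUMBER_WORDS[token]
    return None


# Hand-written sample answers, used only to exercise the helpers
print(normalize_yes_no("Yes, there is a dog on the sofa."))
print(extract_count("There are three people in the scene."))
print(extract_count("Several people are visible."))
```

This is a deliberately crude heuristic: a word like "one" used as a pronoun would be read as a count, so treat mismatches as a prompt to read the raw answer. Spatial questions are harder to check automatically and are probably best reviewed by hand.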
