05. Visual Question Answering

Chapter 5 of 18 · 15 min

KEY INSIGHT

Visual Question Answering (VQA) combines image understanding with language generation, allowing free-form questions about visual content. Structured prompts with explicit context improve answer accuracy.

Visual Question Answering extends captioning by accepting user questions. The model must identify the relevant visual elements, reason about the relationships between them, and format its answer appropriately. The task is harder than captioning because each answer must address a specific query rather than describe the image in general.

```python
import torch
from PIL import Image


def answer_visual_question(model, processor, image_path, question):
    image = Image.open(image_path).convert("RGB")

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": f"Question: {question}\nAnswer concisely."},
            ],
        }
    ]

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=False,  # Greedy decoding: deterministic for Q&A
            temperature=None,
        )

    # For decoder-only models, generate() returns the prompt tokens followed by
    # the answer, so decode only the newly generated tokens.
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return answer.strip()


# Example questions (assumes `model` and `processor` are already loaded)
questions = [
    "What is the main subject of this image?",
    "How many people are visible in the scene?",
    "What colors dominate the background?",
    "Is there any text visible? If so, what does it say?",
    "What time of day does this image appear to show?",
]

for question in questions:
    answer = answer_visual_question(model, processor, "test.jpg", question)
    print(f"Q: {question}\nA: {answer}\n")
```

Multi-turn VQA enables follow-up questions:

```python
conversation_history = []


def multi_turn_vqa(image_path, question):
    image = Image.open(image_path).convert("RGB")

    # Attach the image placeholder to the first user turn only, so the single
    # image passed to the processor matches a single placeholder in the prompt.
    content = [{"type": "text", "text": question}]
    if not conversation_history:
        content.insert(0, {"type": "image"})
    conversation_history.append({"role": "user", "content": content})

    # The chat template renders all previous turns for context
    prompt = processor.apply_chat_template(conversation_history, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=150)

    # Keep only the new tokens so earlier turns are not echoed into the history
    new_tokens = output[:, inputs["input_ids"].shape[1]:]
    response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()

    conversation_history.append({
        "role": "assistant",
        "content": [{"type": "text", "text": response}],
    })
    return response
```

Common failure modes in VQA:

- **Ambiguous questions**: "What is this?" produces varied responses
- **Counting errors**: Models struggle with precise counts
- **Spatial reasoning**: Questions about relative positions often fail
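Several of these failure modes get worse when the prompt is underspecified. One way to supply explicit context is to assemble the prompt text from labeled parts. The sketch below is illustrative only: `build_vqa_prompt` and its parameters are hypothetical names, not part of any library, and whether a particular model benefits from this layout should be checked empirically.

```python
from typing import Optional


def build_vqa_prompt(question: str,
                     context: Optional[str] = None,
                     answer_format: Optional[str] = None) -> str:
    """Assemble a structured VQA prompt from labeled parts (hypothetical helper)."""
    parts = []
    if context:
        # Optional background the model cannot infer from the question alone
        parts.append(f"Context: {context.strip()}")
    parts.append(f"Question: {question.strip()}")
    if answer_format:
        # Constrain the answer shape, e.g. "A single integer." for counting
        parts.append(f"Answer format: {answer_format.strip()}")
    else:
        parts.append("Answer concisely.")
    return "\n".join(parts)


prompt_text = build_vqa_prompt(
    "How many people are visible in the scene?",
    context="Street photo taken at a crosswalk.",
    answer_format="A single integer.",
)
print(prompt_text)
```

The returned string can serve as the `"text"` value of the user turn in `answer_visual_question`, in place of the inline f-string.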

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
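A minimal sketch of how such a record could be captured with the standard library. The helper name `collect_run_record` and the package list (`torch`, `transformers`, `pillow`) are assumptions based on the imports this chapter appears to rely on; adjust them to match your environment, and add the model name and CPU/GPU backend by hand.

```python
import json
import platform
import sys
from importlib import metadata


def collect_run_record(packages, data_path, observed_output):
    """Gather environment details for a reproducibility note (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "data_path": str(data_path),
        "observed_output": observed_output,
    }


record = collect_run_record(
    packages=["torch", "transformers", "pillow"],  # assumed dependency list
    data_path="test.jpg",
    observed_output="<paste the model's answer here>",
)
print(json.dumps(record, indent=2))
```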

EXERCISE

Build an interactive VQA script that maintains conversation history and allows multi-turn dialogue on a single image. Test edge cases like yes/no questions, counting, and spatial relationships.
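As a possible starting point, the session loop can be separated from the model call so that it is testable without a GPU. This is a scaffold under stated assumptions, not a reference solution: `run_vqa_session` and `echo_answer` are hypothetical names, and the stub answer function stands in for a real call such as `multi_turn_vqa`.

```python
from typing import Callable, Iterable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs in order


def run_vqa_session(questions: Iterable[str],
                    answer_fn: Callable[[str, History], str]) -> History:
    """Feed questions to answer_fn one at a time, keeping the dialogue history."""
    history: History = []
    for raw in questions:
        question = raw.strip()
        if not question:
            continue  # Skip blank input
        if question.lower() in {"quit", "exit"}:
            break  # End the session on an explicit stop word
        answer = answer_fn(question, history)
        history.append((question, answer))
    return history


def echo_answer(question: str, history: History) -> str:
    """Stub standing in for the model call (hypothetical)."""
    return f"(turn {len(history) + 1}) stub answer to: {question}"


demo = run_vqa_session(
    ["Is there a dog?", "  ", "How many?", "quit", "ignored"],
    echo_answer,
)
for q, a in demo:
    print(f"Q: {q}\nA: {a}\n")
```

For interactive use, replace the list with a generator that yields lines read from `input()`, and replace the stub with a function that calls the model.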