Every chapter of Wave 1 quietly assumed your data is text. Most real corpora aren't. They're full of diagrams that carry the actual explanation, screenshots where the answer is a red arrow pointing at a button, scanned forms, product photos, video walkthroughs, recorded support calls. Run those through a text-only pipeline and you either lose them entirely or flatten them into a caption that misses the point. Multi-modal RAG is about retrieving over content that isn't text — and the honest truth is that it's more expensive, more fiddly, and worth it less often than the demos suggest. This chapter shows you how it works and, just as importantly, when not to bother.
There are exactly two strategies, and choosing between them is most of the decision. Either you describe the image in text and retrieve over that text with the pipeline you already built, or you embed the image directly into a vector space shared with text so a text query can match an image. The first reuses everything from Wave 1; the second needs new machinery but preserves detail the caption would lose.
Path B, embedding images directly, rests on one genuinely clever idea: models (the CLIP family and its descendants) trained so that an image and a piece of text describing it land at nearly the same point in vector space. A photo of a red sneaker and the words "red sneaker" produce close vectors, even though one is pixels and the other is letters. Once images and text live in the same space, the cosine similarity from Chapter 01 just works across modalities: a text query retrieves images, an image query retrieves text, all through the same nearest-neighbour search you already built in Chapter 05. The vector index doesn't care that some vectors came from pixels.
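To make the shared-space idea concrete, here is a minimal sketch of cross-modal nearest-neighbour search. The three-dimensional vectors are hand-made stand-ins for what a CLIP-style embedder might produce (real models emit hundreds of dimensions), and the ids, `INDEX`, and `nearest` are hypothetical names chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: toy vectors standing in for a shared text/image embedding space.
INDEX = [
    {"id": "img_red_sneaker", "modality": "image", "vec": [0.9, 0.1, 0.0]},
    {"id": "img_blue_kettle", "modality": "image", "vec": [0.0, 0.2, 0.9]},
    {"id": "txt_shoe_care", "modality": "text", "vec": [0.7, 0.6, 0.1]},
]


def nearest(query_vec, index, k=2):
    """Rank every item by cosine similarity, ignoring its modality."""
    scored = [(cosine(query_vec, item["vec"]), item["id"]) for item in index]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Assumption: pretend this is the embedding of the text query "red sneaker".
query_vec = [0.85, 0.2, 0.05]
results = nearest(query_vec, INDEX)
print(results)
```

Notice that `nearest` never looks at the `modality` field. That is the whole point: once the vectors share a space, the search code is the same one you would write for text alone.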
Path A, caption-then-embed, is the pragmatic path for most teams. A vision-capable model writes a rich description of each image at ingestion time; you index that text with the exact Wave 1 pipeline. The trick is prompting the captioner for the details your users will actually search on, not a generic "a photo of...".
```python
# Caption images with a vision model, then index the captions as text.
CAPTION_PROMPT = """Describe this image for search. Include: what it shows,
any visible text or labels, UI elements and their state, colours that carry
meaning, and what a user might ask to find it. Be specific and factual."""


def caption_image(image_bytes, vision_llm):
    return vision_llm.describe(image_bytes, prompt=CAPTION_PROMPT)


def index_image(image_id, image_bytes, vision_llm, store):
    caption = caption_image(image_bytes, vision_llm)
    # store the caption as a normal chunk, with a pointer back to the image
    store.add_chunk(
        text=caption,
        metadata={"type": "image", "image_id": image_id})  # Ch02 metadata habit


# A "Settings screenshot with the Delete Account button highlighted in red"
# now retrieves for the query "where is the delete account button", because
# the caption captured the button, its label, and its colour.
```
This works because the caption is searched with all the machinery you already trust — hybrid search, reranking, the lot. Its ceiling is the caption: anything the captioner didn't mention is unfindable. That's the trade. For screenshots, diagrams, and documents-as-images, a well-prompted caption is usually enough and is dramatically simpler than standing up a second, multi-modal index.
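The caption ceiling is easy to see with a crude check. The sketch below measures what fraction of a query's words appear in a caption. This is only a lexical proxy, not how hybrid search or a reranker actually scores, and `coverage` plus both example captions are made up for illustration.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word set for a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coverage(query, caption):
    """Fraction of query terms that literally appear in the caption."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(caption)) / len(query_terms)


# Assumption: two hypothetical captions for the same screenshot.
generic = "A screenshot of a settings page."
specific = ("Settings screenshot with the Delete Account button "
            "highlighted in red at the bottom of the page.")

query = "delete account button"
generic_score = coverage(query, generic)
specific_score = coverage(query, specific)
print(generic_score, specific_score)
```

The generic caption shares no terms with the query, so no amount of downstream machinery has anything to match on. The specific caption covers every term. Semantic embeddings soften this gap but don't remove it: a detail the captioner never wrote down isn't in the index.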
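Video and audio feel intimidating until you notice they decompose into text and images: the audio track becomes a timestamped transcript, and the picture becomes a handful of keyframe images.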
The lesson: don't reach for exotic video-native models first. Transcribe, keep timestamps, and you've converted a scary modality into the text retrieval you already do well. Add keyframe images only if your evals show the transcript alone is missing visual answers.
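As a rough sketch of "transcribe, keep timestamps": the function below groups timed transcript segments into chunks while carrying the start and end times along. The `(start_sec, end_sec, text)` segment shape and the `max_chars` budget are assumptions for illustration; adapt them to whatever your speech-to-text tool actually emits.

```python
def chunk_transcript(segments, max_chars=200):
    """Group (start_sec, end_sec, text) segments into timestamped chunks."""
    chunks = []
    current = None
    for start, end, text in segments:
        # Flush the current chunk if adding this segment would exceed the budget.
        if current is not None and len(current["text"]) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = None
        if current is None:
            current = {"start": start, "end": end, "text": text}
        else:
            current["end"] = end
            current["text"] += " " + text
    if current is not None:
        chunks.append(current)
    return chunks


# Assumption: a hypothetical three-segment transcript.
segments = [
    (0.0, 4.0, "Welcome to the setup walkthrough."),
    (4.0, 9.5, "First open the Settings page."),
    (9.5, 15.0, "Then scroll down to Delete Account."),
]
chunks = chunk_transcript(segments, max_chars=70)
for chunk in chunks:
    print(chunk["start"], chunk["end"], chunk["text"])
```

Each chunk's `text` goes through the normal text pipeline, and `start` and `end` ride along as metadata so a retrieved chunk can link straight to the right moment in the video.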
Sometimes the cleanest answer isn't multi-modal retrieval at all. If you can already narrow to the right handful of images by other means — they're attached to a known document, or filtered by metadata — you can hand the actual images to a vision-capable model at generation time and let it read them directly. No image embeddings, no second index; the grounding step from Chapter 09 just receives images alongside text. This is underused. When the retrieval problem is "which document," not "which image across millions," a vision model at generation time beats building multi-modal search infrastructure you don't need.
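A minimal sketch of that shortcut, assuming images are catalogued with the id of the document they belong to. The catalog shape, `images_for_documents`, `build_grounded_request`, and the request dictionary are all hypothetical; the real payload depends on whichever vision-capable model you call.

```python
def images_for_documents(doc_ids, image_catalog):
    """Select images attached to the documents that text retrieval returned."""
    wanted = set(doc_ids)
    return [img for img in image_catalog if img["doc_id"] in wanted]


def build_grounded_request(question, text_chunks, images, max_images=4):
    """Bundle retrieved text and a capped number of images for generation."""
    return {
        "question": question,
        "context_text": [chunk["text"] for chunk in text_chunks],
        "context_images": [img["image_id"] for img in images[:max_images]],
    }


# Assumption: a hypothetical image catalog keyed by parent document.
image_catalog = [
    {"image_id": "img-1", "doc_id": "doc-7"},
    {"image_id": "img-2", "doc_id": "doc-7"},
    {"image_id": "img-3", "doc_id": "doc-9"},
]
# Assumption: text retrieval already narrowed the answer to doc-7.
text_chunks = [{"doc_id": "doc-7", "text": "To close your account, open Settings."}]

doc_ids = [chunk["doc_id"] for chunk in text_chunks]
images = images_for_documents(doc_ids, image_catalog)
request = build_grounded_request("Where is the delete button?", text_chunks, images)
print(request["context_images"])
```

The only lookup here is a metadata filter. The cap on `max_images` is a design choice worth keeping: every image you pass costs tokens and latency at generation time.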
| Dimension | Caption-then-embed | Native multi-modal |
|---|---|---|
| Setup cost | Low — reuses Wave 1 | High — new embedder + index |
| Ingestion cost | A vision call per image | An embed call per image |
| Detail preserved | Only what the caption says | Fine visual detail retained |
| Text-query quality | Excellent (it's text) | Often weaker than text-only models |
| Best for | Screenshots, diagrams, docs-as-images | Large photo libraries, fine visual search |
My take. Start with caption-then-embed almost every time. It reuses your whole battle-tested text pipeline, it's debuggable (you can read the caption and see why something did or didn't match), and for the most common case — screenshots and diagrams in documentation — it's genuinely enough. Reach for native multi-modal embeddings only when you've measured that captions are losing detail your users search on, typically in large photo or product-image libraries. The fashionable choice is the shiny shared-embedding-space approach; the right choice is usually the boring caption.
Take an image from your corpus where the answer is visual — a screenshot with a highlighted control, a diagram. Write a caption prompt, generate a caption, and ask: would the query a user types actually match this caption? Iterate the prompt until it does. This tells you how far caption-then-embed can take you.
Take one video and transcribe it. Chunk the transcript with timestamps and try retrieving a moment by querying its content. Notice how much of the video's value was in the words — usually most of it — and where a keyframe would genuinely add something the transcript missed.
Look at your image use case and ask honestly: do you need to search images across the whole corpus, or do you already know which document's images are relevant from text retrieval? If the latter, you may be able to skip multi-modal retrieval entirely and just show the images to a vision model at answer time.
Next chapter: RAG for code — AST-aware, symbol-aware, repo-scale. Code is text, but chunking it like prose destroys it. We'll see why retrieval over a codebase is its own discipline — function boundaries, symbol graphs, and why "split every 800 characters" ruins a function.