RAG: A Field Manual for Building LLM Systems That Use Your Data

Multi-modal RAG — images, video, audio

Every chapter of Wave 1 quietly assumed your data is text. Most real corpora aren't. They're full of diagrams that carry the actual explanation, screenshots where the answer is a red arrow pointing at a button, scanned forms, product photos, video walkthroughs, recorded support calls. Run those through a text-only pipeline and you either lose them entirely or flatten them into a caption that misses the point. Multi-modal RAG is about retrieving over content that isn't text — and the honest truth is that it's more expensive, more fiddly, and worth it less often than the demos suggest. This chapter shows you how it works and, just as importantly, when not to bother.

What you'll take away from this chapter

The two fundamentally different ways to make non-text content retrievable
What a shared image-text embedding space is, and why it's the key idea
How video and audio collapse into problems you already know how to solve
When a vision model at generation time beats multi-modal retrieval entirely
The honest cost, and the test for whether multi-modal earns its complexity for you

Two ways to make an image retrievable

There are exactly two strategies, and choosing between them is most of the decision. Either you describe the image in text and retrieve over that text with the pipeline you already built, or you embed the image directly into a vector space shared with text so a text query can match an image. The first reuses everything from Wave 1; the second needs new machinery but preserves detail the caption would lose.

Path A turns the image into text and reuses your entire existing pipeline — simple, but only as good as the caption. Path B embeds the image into a space shared with text so queries match it directly — richer, but new infrastructure and often weaker on pure-text queries.

The shared embedding space

Path B rests on one genuinely clever idea: models (the CLIP family and its descendants) trained so that an image and a piece of text describing it land close together in vector space, closer than either lands to unrelated content. A photo of a red sneaker and the words "red sneaker" produce nearby vectors, even though one is pixels and the other is letters. Once images and text live in the same space, the cosine similarity from Chapter 01 works across modalities: a text query retrieves images, an image query retrieves text, all through the same nearest-neighbour search you already built in Chapter 05. The vector index doesn't care that some vectors came from pixels.
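To make the idea concrete, here is a minimal sketch of cross-modal nearest-neighbour search. The three-dimensional vectors are hand-written toy values standing in for the output of a shared image-text embedder (an assumption for illustration; a real model produces hundreds of dimensions), and the item ids are hypothetical. The point is only that one similarity function ranks images and text together.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (made up) standing in for a shared image-text embedder's output.
index = [
    {"id": "img_red_sneaker", "modality": "image", "vec": [0.9, 0.1, 0.0]},
    {"id": "img_blue_boot", "modality": "image", "vec": [0.1, 0.9, 0.1]},
    {"id": "txt_return_policy", "modality": "text", "vec": [0.0, 0.1, 0.9]},
]

def search(query_vec, index, k=2):
    """Rank every item by cosine similarity, regardless of modality."""
    scored = [(cosine(query_vec, item["vec"]), item["id"]) for item in index]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Pretend this is the embedding of the text query "red sneaker".
query_vec = [0.8, 0.2, 0.1]
top = search(query_vec, index)
print(top)  # ['img_red_sneaker', 'img_blue_boot']
```

Nothing in `search` inspects the `modality` field, which is the whole trick: once the vectors share a space, the retrieval code is unchanged.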

Caption-then-embed, in code

The pragmatic path for most teams. A vision-capable model writes a rich description of each image at ingestion time; you index that text with the exact Wave 1 pipeline. The trick is prompting the captioner for the details your users will actually search on, not a generic "a photo of...".

# Caption images with a vision model, then index the captions as text.
CAPTION_PROMPT = """Describe this image for search. Include: what it shows,
any visible text or labels, UI elements and their state, colours that carry
meaning, and what a user might ask to find it. Be specific and factual."""

def caption_image(image_bytes, vision_llm):
    # vision_llm stands in for whatever vision-capable client you use;
    # describe() represents its "image + prompt -> text" call.
    return vision_llm.describe(image_bytes, prompt=CAPTION_PROMPT)

def index_image(image_id, image_bytes, vision_llm, store):
    # Runs once per image at ingestion time, not at query time.
    caption = caption_image(image_bytes, vision_llm)
    # store the caption as a normal chunk, with a pointer back to the image
    store.add_chunk(
        text=caption,
        metadata={"type": "image", "image_id": image_id})  # Ch02 metadata habit

# A "Settings screenshot with the Delete Account button highlighted in red"
# now retrieves for the query "where is the delete account button" — because
# the caption captured the button, its label, and its colour.

This works because the caption is searched with all the machinery you already trust — hybrid search, reranking, the lot. Its ceiling is the caption: anything the captioner didn't mention is unfindable. That's the trade. For screenshots, diagrams, and documents-as-images, a well-prompted caption is usually enough and is dramatically simpler than standing up a second, multi-modal index.
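A quick way to see both the payoff and the ceiling is to run the indexing step against stand-ins. In the sketch below, `FakeVisionModel` and `InMemoryStore` are hypothetical test doubles invented for this example (the fake returns canned captions, and the store ranks by plain word overlap rather than hybrid search), so treat it as an illustration of the flow, not a real pipeline.

```python
import re

class FakeVisionModel:
    """Hypothetical stand-in: returns a canned caption instead of calling a model."""
    def __init__(self, captions):
        self.captions = captions

    def describe(self, image_bytes, prompt):
        return self.captions[image_bytes]

class InMemoryStore:
    """Hypothetical store: ranks chunks by word overlap with the query."""
    def __init__(self):
        self.chunks = []

    def add_chunk(self, text, metadata):
        self.chunks.append({"text": text, "metadata": metadata})

    def search(self, query):
        query_words = set(re.findall(r"\w+", query.lower()))
        def overlap(chunk):
            return len(query_words & set(re.findall(r"\w+", chunk["text"].lower())))
        return max(self.chunks, key=overlap)

def index_image(image_id, image_bytes, vision_llm, store):
    caption = vision_llm.describe(image_bytes, prompt="Describe this image for search.")
    store.add_chunk(text=caption, metadata={"type": "image", "image_id": image_id})

vision = FakeVisionModel({
    b"img-1": "Settings screenshot with the Delete Account button highlighted in red",
    b"img-2": "A screenshot of an application",
})
store = InMemoryStore()
index_image("settings.png", b"img-1", vision, store)
index_image("generic.png", b"img-2", vision, store)

hit = store.search("where is the delete account button")
print(hit["metadata"]["image_id"])  # settings.png
```

The specific caption wins because it shares the words "delete", "account", and "button" with the query; the generic caption shares none, which is exactly the failure a vague captioner produces.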

Video and audio are problems you've already solved

Video and audio feel intimidating until you notice they decompose into text and images:

Audio → transcript. Transcribe the audio (a support call, a podcast) and you have text — chunk it, embed it, retrieve it, exactly as in Wave 1. Keep timestamps as metadata so a retrieved chunk can link back to the moment in the recording.
Video → transcript + keyframes. A video is an audio track (→ transcript) plus a sequence of frames (→ images, handled by either path above). Most video RAG is overwhelmingly transcript retrieval, with keyframes added only when the visual matters: a UI demo, a whiteboard.

The lesson: don't reach for exotic video-native models first. Transcribe, keep timestamps, and you've converted a scary modality into the text retrieval you already do well. Add keyframe images only if your evals show the transcript alone is missing visual answers.
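The "transcribe, keep timestamps" step can be sketched as below. It assumes your transcriber returns segments as `(start_seconds, end_seconds, text)` tuples, which is a common shape but not a given; the function name, the `max_chars` budget, and the sample call are all illustrative choices rather than anything prescribed earlier in the series.

```python
def chunk_transcript(segments, max_chars=200):
    """Group consecutive (start, end, text) segments into chunks with timestamps."""
    chunks, current = [], []
    start = end = None
    for seg_start, seg_end, text in segments:
        # Flush the current chunk if adding this segment would exceed the budget.
        if current and len(" ".join(current + [text])) > max_chars:
            chunks.append({"text": " ".join(current), "start": start, "end": end})
            current, start = [], None
        if start is None:
            start = seg_start
        current.append(text)
        end = seg_end
    if current:
        chunks.append({"text": " ".join(current), "start": start, "end": end})
    return chunks

# Made-up support call segments for illustration.
segments = [
    (0.0, 4.0, "Thanks for calling support."),
    (4.0, 9.5, "I cannot find the export button."),
    (9.5, 15.0, "Open the Reports tab and click Export in the top right."),
]
chunks = chunk_transcript(segments, max_chars=70)
for chunk in chunks:
    print(chunk["start"], chunk["end"], chunk["text"])
```

Each chunk carries the start of its first segment and the end of its last, so a retrieved chunk can link straight to the right moment in the recording.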

The third option: skip retrieval, let the model look

Sometimes the cleanest answer isn't multi-modal retrieval at all. If you can already narrow to the right handful of images by other means — they're attached to a known document, or filtered by metadata — you can hand the actual images to a vision-capable model at generation time and let it read them directly. No image embeddings, no second index; the grounding step from Chapter 09 just receives images alongside text. This is underused. When the retrieval problem is "which document," not "which image across millions," a vision model at generation time beats building multi-modal search infrastructure you don't need.
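As a rough sketch of this shortcut: text retrieval picks the document, a metadata filter picks its images, and the images ride along with the text context into the generation call. The record fields (`doc_id`, `image_id`) and the list-of-parts request shape are assumptions made for this example; adapt them to whatever format your vision-capable model's client actually expects.

```python
def images_for_document(doc_id, image_records):
    """Narrow by metadata: keep only images attached to the retrieved document."""
    return [record for record in image_records if record["doc_id"] == doc_id]

def build_grounded_request(question, text_chunks, images, max_images=4):
    """Assemble a model-agnostic list of parts: text context, images, then the question."""
    parts = [{"type": "text", "text": chunk} for chunk in text_chunks]
    parts += [{"type": "image", "image_id": img["image_id"]} for img in images[:max_images]]
    parts.append({"type": "text", "text": f"Question: {question}"})
    return parts

# Hypothetical image metadata, as it might be recorded at ingestion time.
image_records = [
    {"image_id": "install-step-1.png", "doc_id": "install-guide"},
    {"image_id": "install-step-2.png", "doc_id": "install-guide"},
    {"image_id": "pricing-table.png", "doc_id": "pricing"},
]

images = images_for_document("install-guide", image_records)
request = build_grounded_request(
    "Which button starts the installer?",
    ["Step 1: download the installer. Step 2: run it."],
    images,
)
print([part["type"] for part in request])  # ['text', 'image', 'image', 'text']
```

The `max_images` cap is there because every image handed to the model costs tokens and latency; this approach only pays off when the filter leaves a handful.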

Caption-then-embed vs native, compared

Dimension	Caption-then-embed	Native multi-modal
Setup cost	Low — reuses Wave 1	High — new embedder + index
Ingestion cost	A vision call per image	An embed call per image
Detail preserved	Only what the caption says	Fine visual detail retained
Text-query quality	Excellent (it's text)	Often weaker than text-only models
Best for	Screenshots, diagrams, docs-as-images	Large photo libraries, fine visual search

My take. Start with caption-then-embed almost every time. It reuses your whole battle-tested text pipeline, it's debuggable (you can read the caption and see why something did or didn't match), and for the most common case — screenshots and diagrams in documentation — it's genuinely enough. Reach for native multi-modal embeddings only when you've measured that captions are losing detail your users search on, typically in large photo or product-image libraries. The fashionable choice is the shiny shared-embedding-space approach; the right choice is usually the boring caption.
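"Measured" can be as simple as recall@k over a small set of labelled queries, run once against each pipeline. The sketch below assumes you have, per query, the id of the image that should come back and the ranked ids each pipeline returned. All queries, ids, and rankings here are made up purely to show the calculation and say nothing about how the two approaches compare in practice.

```python
def recall_at_k(results_by_query, expected_by_query, k=5):
    """Fraction of queries whose expected image appears in the top-k results."""
    hits = sum(
        1
        for query, expected in expected_by_query.items()
        if expected in results_by_query.get(query, [])[:k]
    )
    return hits / len(expected_by_query)

# Made-up labels: the image each query should retrieve.
expected = {
    "red sneaker side view": "img_17",
    "delete account button": "img_02",
    "wiring diagram fuse box": "img_40",
    "blue boot with zipper": "img_23",
}

# Made-up ranked results from two hypothetical pipelines.
caption_results = {
    "red sneaker side view": ["img_18", "img_05", "img_31"],
    "delete account button": ["img_02", "img_09", "img_11"],
    "wiring diagram fuse box": ["img_07", "img_40", "img_12"],
    "blue boot with zipper": ["img_23", "img_17", "img_30"],
}
native_results = {
    "red sneaker side view": ["img_17", "img_18", "img_05"],
    "delete account button": ["img_09", "img_02", "img_11"],
    "wiring diagram fuse box": ["img_40", "img_07", "img_12"],
    "blue boot with zipper": ["img_23", "img_30", "img_17"],
}

caption_recall = recall_at_k(caption_results, expected, k=3)
native_recall = recall_at_k(native_results, expected, k=3)
print(caption_recall, native_recall)  # 0.75 1.0
```

The number matters less than the misses: reading the caption for each failed query tells you whether a better caption prompt would fix it or whether the detail is something a caption is unlikely to capture.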

When this fails

Generic captions. "A screenshot of an application" is useless for retrieval. The caption must capture the searchable specifics — labels, states, colours that carry meaning. Prompt the captioner deliberately, then read a sample of captions with your own eyes (the Wave 1 habit applies to captions too).
Native embeddings on text-heavy queries. Multi-modal embedders are often weaker than dedicated text embedders on pure-text retrieval. If most of your queries and corpus are text with occasional images, a text pipeline plus captioned images beats a fully multi-modal index.
Indexing every video frame. A frame every second is mostly redundant near-duplicates that bloat the index and crowd results. Sample keyframes at scene changes, and lean on the transcript.
Dropping timestamps and source links. A retrieved transcript chunk or keyframe that can't point back to the moment in the recording can't cite itself. Carry timestamp and source metadata, per Chapter 02.
Building multi-modal search when generation-time vision would do. If you can narrow to a few images cheaply, handing them to a vision model at answer time avoids an entire index. Don't build infrastructure the problem doesn't require.
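For the video-frame failure above, one simple way to avoid indexing near-duplicates is to keep a frame only when it differs enough from the last frame you kept. The sketch below uses mean absolute pixel difference on tiny made-up frames as a crude stand-in for real scene-change detection; the function names and the threshold value are illustrative assumptions, and a production system would more likely rely on a dedicated scene-detection tool.

```python
def frame_difference(a, b):
    """Mean absolute difference between two equal-length frames of pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold=30.0):
    """Keep the first frame, then any frame that differs enough from the last kept one."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if frame_difference(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept

# Six toy "frames" of four grayscale pixels each (made up): three scenes.
frames = [
    [10, 10, 10, 10],
    [12, 10, 11, 10],
    [11, 10, 10, 12],
    [200, 190, 210, 205],
    [201, 191, 209, 204],
    [50, 60, 55, 52],
]
keyframes = select_keyframes(frames)
print(keyframes)  # [0, 3, 5]
```

Six frames collapse to three keyframes, one per scene. Comparing against the last kept frame, rather than the immediately preceding one, stops slow drifts from slipping through as a chain of tiny changes.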

Practice — before you read the next chapter

Caption your hardest image

Take an image from your corpus where the answer is visual — a screenshot with a highlighted control, a diagram. Write a caption prompt, generate a caption, and ask: would the query a user types actually match this caption? Iterate the prompt until it does. This tells you how far caption-then-embed can take you.

Decompose a video

Take one video and transcribe it. Chunk the transcript with timestamps and try retrieving a moment by querying its content. Notice how much of the video's value was in the words — usually most of it — and where a keyframe would genuinely add something the transcript missed.

Find your generation-time shortcut

Look at your image use case and ask honestly: do you need to search images across the whole corpus, or do you already know which document's images are relevant from text retrieval? If the latter, you may be able to skip multi-modal retrieval entirely and just show the images to a vision model at answer time.

Takeaways

Two strategies for non-text content: caption-then-embed (reuses your text pipeline) or native multi-modal embeddings (shared image-text vector space). Choosing between them is most of the work.
A shared embedding space lets a text query match an image directly through ordinary nearest-neighbour search — the index doesn't care the vector came from pixels.
Audio becomes a transcript; video becomes a transcript plus keyframes. Both collapse into the text and image retrieval you already do.
Sometimes the answer is no retrieval at all — narrow by metadata, then let a vision model read the images at generation time.
Start with caption-then-embed. Go native only when measurements show captions are losing detail your users search on.

Next chapter: RAG for code — AST-aware, symbol-aware, repo-scale. Code is text, but chunking it like prose destroys it. We'll see why retrieval over a codebase is its own discipline — function boundaries, symbol graphs, and why "split every 800 characters" ruins a function.
