RAG: A Field Manual for Building LLM Systems That Use Your Data

Production — latency, cost, freshness, caching

The demo answers in two seconds and costs nothing because you asked it three questions. Production is a different sport. Now it's ten thousand questions an hour, each one expected back before the user gets bored, each one costing real money, against an index that's out of date the moment a document changes. The retrieval quality you spent Wave 1 perfecting is necessary and not remotely sufficient: a brilliant answer that takes nine seconds and costs a dollar is a failed product. This chapter is about the three forces that decide whether a good RAG system survives contact with real traffic (latency, cost, and freshness) and the one technique, caching, that touches all three: it cuts latency and cost directly, and it stays safe only if you invalidate it as the index changes.

What you'll take away from this chapter

Where the milliseconds actually go in a RAG query, and which ones are worth cutting
Where the dollars go, and the few levers that move cost the most
Caching — exact, semantic, and embedding — the highest-leverage production optimisation
Keeping an index fresh with incremental updates instead of nightly rebuilds
The failure modes that only appear under real load

Where the milliseconds go

You can't cut latency you haven't located. A RAG query is a chain of steps, and the time is rarely spread evenly — usually one or two stages dominate. Here's a representative breakdown of a hybrid-plus-rerank pipeline, the kind you built across Wave 1.

The lesson of every latency budget: generation dominates. Teams instinctively optimise retrieval because that's the part they built, but the user is waiting on the model writing tokens. Measure your own budget before optimising — and expect generation to be the giant.

Two consequences follow. First, the biggest latency win is usually streaming the generated answer — showing tokens as they're produced so the user sees progress in 300 milliseconds instead of staring at a spinner for two seconds. The total time is similar; the perceived time collapses. Second, every pre-retrieval LLM call from Chapter 08 is latency you're adding on top of an already-dominant generation step — which is exactly why that chapter insisted you add them only on measured need.
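Measuring the budget needs little more than a per-stage timer. The sketch below is a minimal illustration, not this tutorial's own instrumentation: `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls are placeholders for your real embed, retrieve, rerank, and generate steps.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Records wall-clock milliseconds for each named stage of one query."""

    def __init__(self):
        self.ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.ms[name] = (time.perf_counter() - start) * 1000.0

    def dominant(self):
        """Name of the slowest stage: the one worth optimising first."""
        return max(self.ms, key=self.ms.get)

    def report(self):
        total = sum(self.ms.values())
        for name, ms in sorted(self.ms.items(), key=lambda kv: -kv[1]):
            print(f"{name:<10} {ms:8.1f} ms  {100 * ms / total:5.1f}%")

timer = StageTimer()
with timer.stage("embed"):
    time.sleep(0.005)   # placeholder: embed the query
with timer.stage("retrieve"):
    time.sleep(0.010)   # placeholder: hybrid search
with timer.stage("rerank"):
    time.sleep(0.020)   # placeholder: rerank candidates
with timer.stage("generate"):
    time.sleep(0.100)   # placeholder: LLM call
timer.report()
print("dominant stage:", timer.dominant())
```

Wrap each real stage in `timer.stage(...)` and log `timer.ms` per request; averaged over real traffic, the percentages tell you which stage deserves attention.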

Where the dollars go

Cost tracks tokens, and tokens are dominated by two things: the generation call (input context + output) and, at ingestion, the embedding of your whole corpus. The biggest levers, in rough order of impact:

Context size. Every chunk you stuff into the prompt is input tokens you pay for on every query. This is the cost argument for sending few, well-reranked chunks rather than twenty mediocre ones — reranking (Chapter 07) saves money as well as improving quality.
Model choice for generation. A smaller model for routine answers and a larger one only for hard queries (routed per Chapter 08) can cut the generation bill dramatically without users noticing.
Re-embedding churn. If you re-embed more than you need to — re-processing unchanged documents — you pay for ingestion you didn't need. Embed only what changed.
Caching. The biggest lever of all, because a cached answer skips the generation call entirely. It sits last in this list only because it gets the next section to itself.
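The levers above can be made concrete with a little arithmetic. This is a sketch under loud assumptions: the per-token prices, chunk size, and traffic numbers are invented placeholders rather than any provider's real rates, and a cache hit is treated as free, which ignores the small cost of embedding the query for the lookup.

```python
def query_cost(chunks, tokens_per_chunk, prompt_overhead, output_tokens,
               price_in_per_1k, price_out_per_1k):
    """Cost of one generation call: input context plus generated output."""
    input_tokens = chunks * tokens_per_chunk + prompt_overhead
    return (input_tokens * price_in_per_1k
            + output_tokens * price_out_per_1k) / 1000.0

def monthly_cost(queries, cache_hit_rate, per_query_cost):
    """Only cache misses pay for generation (simplifying assumption)."""
    return queries * (1 - cache_hit_rate) * per_query_cost

# Hypothetical prices and sizes, for illustration only.
common = dict(tokens_per_chunk=300, prompt_overhead=200, output_tokens=300,
              price_in_per_1k=0.001, price_out_per_1k=0.002)

twenty = query_cost(chunks=20, **common)
five = query_cost(chunks=5, **common)
print(f"20 chunks: {twenty:.4f} per query")   # 0.0068
print(f" 5 chunks: {five:.4f} per query")     # 0.0023

# One million queries a month, with and without a 30% cache hit rate.
print(f"no cache:  {monthly_cost(1_000_000, 0.0, five):.0f}")   # 2300
print(f"30% hits:  {monthly_cost(1_000_000, 0.3, five):.0f}")   # 1610
```

With these made-up numbers, trimming the context from twenty chunks to five cuts the per-query cost to roughly a third, and the cache then scales whatever remains by the miss rate. Substitute your own prices and token counts before drawing conclusions.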

Caching — the highest-leverage optimisation

Real query traffic is wildly repetitive. A large fraction of questions are near-duplicates of questions already asked: "how do I reset my password," phrased a hundred ways. Every one of those that you answer from cache skips retrieval and generation, costs next to nothing, and returns in milliseconds. There are three layers, each catching a different kind of repeated work:

Cache	Keys on	Catches
Exact-match	The literal query string	Identical repeated queries. Trivial; limited.
Semantic	The query embedding (near-duplicate meaning)	"reset my password" ≈ "how do I change my password." The big win.
Embedding	Chunk text → its vector	Re-embedding unchanged chunks at ingestion. Saves ingestion cost.

The semantic cache is the powerful one, and it reuses the entire machinery you already have — it's just a tiny vector index of past question-answer pairs, queried before you run the real pipeline.

# Semantic cache: answer from a past near-identical question if one exists.
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer("BAAI/bge-base-en-v1.5")
        self.threshold = threshold        # how similar counts as "same question"
        self.questions, self.vectors, self.answers = [], [], []

    def get(self, query):
        if not self.vectors:
            return None
        q = self.model.encode(query, normalize_embeddings=True)
        sims = np.array(self.vectors) @ q
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:      # close enough → reuse the answer
            return self.answers[best]
        return None

    def put(self, query, answer):
        self.questions.append(query)
        self.vectors.append(self.model.encode(query, normalize_embeddings=True))
        self.answers.append(answer)

cache = SemanticCache(threshold=0.95)

def answer_with_cache(query, pipeline):
    hit = cache.get(query)
    if hit is not None:
        return hit, "cache"               # milliseconds, free
    answer = pipeline.run(query)          # full retrieve + generate
    cache.put(query, answer)
    return answer, "generated"

# `pipeline` is assumed to be your existing RAG pipeline: any object with a run(query) method.
print(answer_with_cache("how do I reset my password", pipeline)[1])  # generated
print(answer_with_cache("how can I change my password", pipeline)[1]) # cache

generated
cache

The second question, phrased differently, hit the cache because its embedding was within the similarity threshold of the first. That's a generation call and a full retrieval saved. The one critical knob is the threshold: too low and you'll serve the cached answer to a question that only seemed similar (a correctness bug); too high and you cache almost nothing. The right value is specific to your embedding model, because different models spread their similarity scores differently, so 0.95 is a starting point rather than a constant. Tune it against your eval set, and be conservative: a wrong cached answer is worse than a slightly slower correct one. And invalidate the cache when the underlying documents change, or you'll happily serve yesterday's answer forever.
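Invalidation is mostly bookkeeping: remember which source documents each cached answer drew on, and drop those entries when a document changes. The sketch below shows one way to do that. It is an illustration, not part of the cache above: `InvalidatingCache` is a hypothetical class, it uses exact-match keys to stay short, and it assumes your pipeline can report the ids of the documents behind each answer. The same reverse index would work alongside a semantic cache.

```python
class InvalidatingCache:
    """Exact-match answer cache that can drop entries by source document."""

    def __init__(self):
        self.entries = {}   # normalised query -> answer
        self.sources = {}   # normalised query -> set of source doc ids
        self.by_doc = {}    # doc id -> set of normalised queries using it

    @staticmethod
    def _key(query):
        # Lowercase and collapse whitespace so trivial variants share a key.
        return " ".join(query.lower().split())

    def get(self, query):
        return self.entries.get(self._key(query))

    def put(self, query, answer, doc_ids):
        key = self._key(query)
        self._drop(key)                       # replace any previous entry
        self.entries[key] = answer
        self.sources[key] = set(doc_ids)
        for doc_id in doc_ids:
            self.by_doc.setdefault(doc_id, set()).add(key)

    def _drop(self, key):
        self.entries.pop(key, None)
        for doc_id in self.sources.pop(key, set()):
            self.by_doc.get(doc_id, set()).discard(key)

    def invalidate_document(self, doc_id):
        """Remove every cached answer that relied on doc_id; return the count."""
        keys = list(self.by_doc.pop(doc_id, set()))
        for key in keys:
            self._drop(key)
        return len(keys)

doc_cache = InvalidatingCache()
doc_cache.put("How do I reset my password?", "Use Settings > Security.", ["doc-42"])
doc_cache.put("What is the refund window?", "30 days.", ["doc-7"])

print(doc_cache.get("how do I reset   my PASSWORD?"))  # Use Settings > Security.
removed = doc_cache.invalidate_document("doc-42")      # doc-42 was edited
print(removed)                                          # 1
print(doc_cache.get("How do I reset my password?"))    # None
```

Call `invalidate_document` from the same code path that re-indexes a changed document, so the cache and the index can never disagree for long.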

Freshness — incremental over rebuild

The naive approach to keeping an index current is to rebuild it nightly. It's simple and it's a trap: your index is up to twenty-four hours stale, the rebuild is expensive, and it doesn't scale as the corpus grows. The production approach is incremental — when a document changes, re-process only that document: re-chunk it, re-embed its chunks, upsert them into the index, and delete the chunks for any content that was removed. The index stays current to the minute, and you pay only for what changed.

The subtlety is deletion and update, not insertion. Adding new chunks is easy; the bugs live in making sure that when a document is edited, its old chunks are removed — otherwise stale and fresh versions of the same content coexist in the index and retrieval returns both, sometimes preferring the outdated one. Track which chunks belong to which document (metadata, again, from Chapter 02) so you can cleanly replace a document's whole chunk set on change.
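The replace-the-whole-chunk-set rule can be sketched with an in-memory index. Everything here is illustrative: `IncrementalIndex`, `chunk`, and `fake_embed` are hypothetical stand-ins for your vector store, chunker, and embedding model, and the content-hash lookup plays the role of the embedding cache, so unchanged chunks are not re-embedded.

```python
import hashlib

def chunk(text):
    """Naive chunker for illustration: one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def fake_embed(text):
    """Placeholder embedding; swap in your real model."""
    return [float(len(text))]

class IncrementalIndex:
    def __init__(self, embed):
        self.embed = embed
        self.chunks = {}        # chunk id -> {"doc_id", "text", "vector"}
        self.doc_chunks = {}    # doc id -> list of its chunk ids
        self.embed_cache = {}   # sha256 of chunk text -> vector
        self.embed_calls = 0    # counts real embedding calls

    def _vector(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in self.embed_cache:
            self.embed_cache[digest] = self.embed(text)
            self.embed_calls += 1
        return self.embed_cache[digest]

    def delete_document(self, doc_id):
        for chunk_id in self.doc_chunks.pop(doc_id, []):
            del self.chunks[chunk_id]

    def upsert_document(self, doc_id, text):
        # Delete first, so stale and fresh chunks never coexist.
        self.delete_document(doc_id)
        ids = []
        for i, piece in enumerate(chunk(text)):
            chunk_id = f"{doc_id}#{i}"
            self.chunks[chunk_id] = {"doc_id": doc_id, "text": piece,
                                     "vector": self._vector(piece)}
            ids.append(chunk_id)
        self.doc_chunks[doc_id] = ids

index = IncrementalIndex(fake_embed)
index.upsert_document("policy", "Refunds within 30 days.\n\nShipping is free.")
index.upsert_document("policy", "Refunds within 14 days.\n\nShipping is free.")

texts = [c["text"] for c in index.chunks.values()]
print(texts)               # ['Refunds within 14 days.', 'Shipping is free.']
print(index.embed_calls)   # 3: the unchanged paragraph was not re-embedded
```

In a real vector store the delete step is usually a delete-by-metadata-filter on the document id; the order of operations (delete the old set, then upsert the new one) is the part that carries over.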

My take. If you do one production optimisation, make it a semantic cache, and if you do two, add streaming. The cache attacks cost and latency together on the repetitive majority of traffic, and streaming fixes perceived latency on everything else — between them they address the two complaints users actually voice. Everything else (index tuning, model routing, quantization) is real but secondary. Optimise in the order the user feels: perceived speed, then cost, then the long tail.

When this fails

Optimising retrieval while generation dominates. Shaving milliseconds off vector search while a 1.5-second generation sits untouched is effort the user never feels. Measure the budget; optimise the giant.
A semantic cache threshold set too low. Serve a cached answer to a merely-similar question and you've shipped a confidently wrong response. Tune conservatively against your eval set; correctness beats the cache-hit rate.
Caches that never invalidate. A cached answer outlives the document it came from, so users get last month's policy forever. Invalidate on source change — a cache without invalidation is a staleness bug with good latency.
Nightly rebuilds at scale. They leave the index hours stale and grow more expensive as the corpus does. Move to incremental indexing, and get the document-replacement (delete-old) path right.
Duplicate chunks after edits. Incremental indexing that inserts new chunks without deleting the old ones leaves both versions live. Replace a document's entire chunk set on change, keyed by document id.
No load testing. A pipeline that's fine at three queries falls over at three thousand — connection pools exhaust, rerankers queue, the model API rate-limits. Load-test before launch, not after the incident.
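A first load test does not need much tooling. The sketch below fires queries at a pipeline from a thread pool and reports error count and latency percentiles. It is a rough starting point under stated assumptions: `fake_pipeline` is a hypothetical stand-in for your real entry point, the percentile is approximate, and the timings cover each call itself, not time spent waiting for a free worker. A dedicated load-testing tool is the better choice before launch.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_pipeline(query):
    """Placeholder for the real retrieve + generate call."""
    time.sleep(0.01)
    return f"answer to {query}"

def timed_call(fn, query):
    """Run one query; return (succeeded, latency in milliseconds)."""
    start = time.perf_counter()
    try:
        fn(query)
        ok = True
    except Exception:
        ok = False
    return ok, (time.perf_counter() - start) * 1000.0

def load_test(fn, queries, concurrency):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda q: timed_call(fn, q), queries))
    latencies = sorted(ms for ok, ms in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    return {
        "requests": len(results),
        "errors": errors,
        "p50_ms": statistics.median(latencies) if latencies else None,
        # Approximate 95th percentile by index into the sorted latencies.
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))] if latencies else None,
    }

queries = [f"question {i}" for i in range(50)]
load_report = load_test(fake_pipeline, queries, concurrency=10)
print(load_report)
```

Raise `concurrency` step by step against a staging deployment and watch where `errors` becomes non-zero or `p95_ms` pulls away from `p50_ms`; that is usually where a pool, queue, or rate limit has been hit.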

Practice — before you read the next chapter

Measure your real latency budget

Instrument one query end to end and record the milliseconds at each stage: query understanding, embed, retrieve, rerank, generate. Find your giant. It's almost always generation — and seeing it in your own numbers redirects your optimisation effort to where it counts.

Add a semantic cache

Drop the cache above in front of your pipeline and replay a realistic set of queries with natural repetition and rephrasing. Measure the hit rate and the latency on hits. Then probe the threshold: find a near-duplicate that shouldn't share an answer and confirm your threshold keeps them apart.

Make one incremental update

Take an indexed document, edit it, and write the update path: delete its old chunks, re-chunk, re-embed, upsert. Then query for content you removed and confirm the stale chunks are truly gone. Getting this one path right is most of what production freshness requires.

Takeaways

Generation dominates the latency budget. Optimise there first — and stream the answer so perceived latency collapses even when total time doesn't.
Cost tracks tokens: context size and generation-model choice are the big levers. Fewer, better-reranked chunks save money and improve quality at once.
Caching is the highest-leverage optimisation. A semantic cache reuses your vector machinery to answer repetitive near-duplicate questions for free — tune its threshold conservatively and invalidate on change.
Keep the index fresh incrementally, not by nightly rebuild. The hard part is cleanly replacing a changed document's chunks so stale and fresh versions never coexist.
Load-test before launch. The failures that matter only appear under real concurrency.

Next chapter: Security and compliance — injection, access control. A RAG system retrieves from your private data and feeds untrusted text to a model — two facts that make it a security surface most tutorials ignore entirely. Prompt injection through retrieved content, per-user access control, and the compliance questions that decide whether you can ship at all.
