The demo answers in two seconds and costs nothing because you asked it three questions. Production is a different sport. Now it's ten thousand questions an hour, each one expected back before the user gets bored, each one costing real money, against an index that's out of date the moment a document changes. The retrieval quality you spent Wave 1 perfecting is necessary and not remotely sufficient — a brilliant answer that takes nine seconds and costs a dollar is a failed product. This chapter is about the three forces that decide whether a good RAG system survives contact with real traffic: latency, cost, and freshness, and the one technique — caching — that helps all three at once.
You can't cut latency you haven't located. A RAG query is a chain of steps, and the time is rarely spread evenly — usually one or two stages dominate. Here's a representative breakdown of a hybrid-plus-rerank pipeline, the kind you built across Wave 1.
Two consequences follow. First, the biggest latency win is usually streaming the generated answer — showing tokens as they're produced so the user sees progress in 300 milliseconds instead of staring at a spinner for two seconds. The total time is similar; the perceived time collapses. Second, every pre-retrieval LLM call from Chapter 08 is latency you're adding on top of an already-dominant generation step — which is exactly why that chapter insisted you add them only on measured need.
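To see both effects in your own numbers, a per-stage timer and a time-to-first-token measurement are enough. The following is a minimal sketch, not a definitive implementation: `StageTimer`, `fake_stream`, and `consume_stream` are hypothetical names, and the `time.sleep` calls are placeholders standing in for your real retrieval and generation steps.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named pipeline stage."""

    def __init__(self):
        self.ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            self.ms[name] = self.ms.get(name, 0.0) + elapsed

    def slowest(self):
        # Name of the stage with the largest accumulated time.
        return max(self.ms, key=self.ms.get)


def fake_stream(n_tokens=5, delay_s=0.01):
    """Hypothetical stand-in for a streaming LLM call: yields tokens with a delay."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i} "


def consume_stream(stream):
    """Return (full text, ms until first token, ms until last token)."""
    start = time.perf_counter()
    first_ms = None
    parts = []
    for token in stream:
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000.0
        parts.append(token)
    total_ms = (time.perf_counter() - start) * 1000.0
    return "".join(parts), first_ms, total_ms


timer = StageTimer()
with timer.stage("retrieve"):
    time.sleep(0.005)  # placeholder for embed + search + rerank
with timer.stage("generate"):
    text, first_ms, total_ms = consume_stream(fake_stream())

for name, ms in timer.ms.items():
    print(f"{name:>9}: {ms:7.1f} ms")
print(f"first token after {first_ms:.1f} ms, full answer after {total_ms:.1f} ms")
print("slowest stage:", timer.slowest())
```

The gap between the first-token time and the full-answer time is what streaming hides from the user. The per-stage totals tell you which step deserves optimisation effort.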
Cost tracks tokens, and tokens are dominated by two things: the generation call (input context + output) and, at ingestion, the embedding of your whole corpus. The biggest levers, in rough order of impact:
Real query traffic is wildly repetitive. A large fraction of questions are near-duplicates of questions already asked: "how do I reset my password," phrased a hundred ways. Every one of those that you answer from cache skips generation and retrieval entirely, costs close to nothing, and returns in milliseconds. There are three layers, each catching a different kind of repeated work:
| Cache | Keys on | Catches |
|---|---|---|
| Exact-match | The literal query string | Identical repeated queries. Trivial; limited. |
| Semantic | The query embedding (near-duplicate meaning) | "reset my password" ≈ "how do I change my password." The big win. |
| Embedding | Chunk text → its vector | Re-embedding unchanged chunks at ingestion. Saves ingestion cost. |
The semantic cache is the powerful one, and it reuses the entire machinery you already have — it's just a tiny vector index of past question-answer pairs, queried before you run the real pipeline.
```python
# Semantic cache: answer from a past near-identical question if one exists.
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.model = SentenceTransformer("BAAI/bge-base-en-v1.5")
        self.threshold = threshold          # how similar counts as "same question"
        self.questions, self.vectors, self.answers = [], [], []

    def get(self, query):
        if not self.vectors:
            return None
        q = self.model.encode(query, normalize_embeddings=True)
        sims = np.array(self.vectors) @ q   # cosine similarity (vectors are normalized)
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:    # close enough → reuse the answer
            return self.answers[best]
        return None

    def put(self, query, answer):
        self.questions.append(query)
        self.vectors.append(self.model.encode(query, normalize_embeddings=True))
        self.answers.append(answer)

cache = SemanticCache(threshold=0.95)

def answer_with_cache(query, pipeline):
    hit = cache.get(query)
    if hit is not None:
        return hit, "cache"                 # milliseconds, free
    answer = pipeline.run(query)            # full retrieve + generate
    cache.put(query, answer)
    return answer, "generated"

# `pipeline` is your existing RAG pipeline object; it only needs a .run(query) method.
print(answer_with_cache("how do I reset my password", pipeline)[1])    # generated
print(answer_with_cache("how can I change my password", pipeline)[1])  # cache
```

```
generated
cache
```
The second question, phrased differently, hit the cache because its meaning was within the similarity threshold of the first. That's a generation call and a full retrieval saved. The one critical knob is the threshold: too low and you'll serve the cached answer to a question that only seemed similar (a correctness bug); too high and you cache almost nothing. Tune it against your eval set, and be conservative — a wrong cached answer is worse than a slightly slower correct one. And invalidate the cache when the underlying documents change, or you'll happily serve yesterday's answer forever.
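Invalidation becomes tractable if each cached answer records which source documents it was built from. The sketch below assumes your pipeline can report the IDs of the documents it retrieved, which the original code does not show. `DocAwareCache` and its method names are hypothetical, and the lookup is exact-match for brevity; the semantic lookup from `SemanticCache` would slot into `get` unchanged.

```python
from collections import defaultdict


class DocAwareCache:
    """Cache whose entries can be evicted when a source document changes."""

    def __init__(self):
        self.entries = {}               # query -> (answer, source doc IDs)
        self.by_doc = defaultdict(set)  # doc ID -> queries whose answers used it

    def put(self, query, answer, source_doc_ids):
        self.entries[query] = (answer, frozenset(source_doc_ids))
        for doc_id in source_doc_ids:
            self.by_doc[doc_id].add(query)

    def get(self, query):
        entry = self.entries.get(query)
        return entry[0] if entry is not None else None

    def invalidate_document(self, doc_id):
        """Evict every cached answer that depended on doc_id; return the count."""
        stale_queries = self.by_doc.pop(doc_id, set())
        evicted = 0
        for query in stale_queries:
            entry = self.entries.pop(query, None)
            if entry is None:
                continue
            evicted += 1
            # Remove the reverse links from the other documents this answer used.
            for other in entry[1]:
                if other in self.by_doc:
                    self.by_doc[other].discard(query)
        return evicted


doc_cache = DocAwareCache()
doc_cache.put("how do I reset my password", "Use Settings > Security.", ["doc-auth"])
doc_cache.put("what is the refund window", "30 days.", ["doc-billing", "doc-policy"])

# The auth document was edited, so every answer built on it must go.
evicted = doc_cache.invalidate_document("doc-auth")
print(evicted, doc_cache.get("how do I reset my password"))
print(doc_cache.get("what is the refund window"))
```

Evicting by document is deliberately coarse: it throws away some answers that would still have been correct, in exchange for never serving one built on content that no longer exists.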
The naive approach to keeping an index current is to rebuild it nightly. It's simple and it's a trap: your index is up to twenty-four hours stale, the rebuild is expensive, and it doesn't scale as the corpus grows. The production approach is incremental — when a document changes, re-process only that document: re-chunk it, re-embed its chunks, upsert them into the index, and delete the chunks for any content that was removed. The index stays current to the minute, and you pay only for what changed.
The subtlety is deletion and update, not insertion. Adding new chunks is easy; the bugs live in making sure that when a document is edited, its old chunks are removed — otherwise stale and fresh versions of the same content coexist in the index and retrieval returns both, sometimes preferring the outdated one. Track which chunks belong to which document (metadata, again, from Chapter 02) so you can cleanly replace a document's whole chunk set on change.
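The replace-the-whole-chunk-set idea can be sketched with an in-memory index. Everything here is illustrative rather than a real vector store API: `IncrementalIndex`, `chunk`, and `fake_embed` are hypothetical stand-ins, and chunk IDs are derived from a content hash so that unchanged chunks are recognised and not re-embedded.

```python
import hashlib


def chunk(text):
    """Hypothetical stand-in chunker: one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fake_embed(text):
    """Hypothetical stand-in for a real embedding model call."""
    return [float(len(text))]


class IncrementalIndex:
    def __init__(self, embed):
        self.embed = embed
        self.chunks = {}      # chunk ID -> {"doc_id", "text", "vector"}
        self.doc_chunks = {}  # doc ID -> set of chunk IDs it currently owns

    def upsert_document(self, doc_id, text):
        """Re-process one document: embed new chunks, delete removed ones."""
        old_ids = self.doc_chunks.get(doc_id, set())
        new_ids = set()
        embedded = 0
        for piece in chunk(text):
            digest = hashlib.sha256(piece.encode("utf-8")).hexdigest()[:16]
            chunk_id = f"{doc_id}:{digest}"
            new_ids.add(chunk_id)
            if chunk_id not in self.chunks:  # unchanged chunks skip embedding
                self.chunks[chunk_id] = {
                    "doc_id": doc_id,
                    "text": piece,
                    "vector": self.embed(piece),
                }
                embedded += 1
        removed = old_ids - new_ids          # content that no longer exists
        for chunk_id in removed:
            del self.chunks[chunk_id]
        self.doc_chunks[doc_id] = new_ids
        return {"embedded": embedded, "deleted": len(removed)}

    def delete_document(self, doc_id):
        for chunk_id in self.doc_chunks.pop(doc_id, set()):
            del self.chunks[chunk_id]


index = IncrementalIndex(fake_embed)
v1 = "Reset your password from Settings.\n\nContact support by phone."
v2 = "Reset your password from Settings.\n\nContact support by email."

first = index.upsert_document("help-page", v1)
second = index.upsert_document("help-page", v2)
print(first, second)
print(sorted(c["text"] for c in index.chunks.values()))
```

After the second call only the edited paragraph was re-embedded and the outdated "by phone" chunk is gone, so retrieval cannot return the stale and fresh versions side by side.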
My take. If you do one production optimisation, make it a semantic cache, and if you do two, add streaming. The cache attacks cost and latency together on the repetitive majority of traffic, and streaming fixes perceived latency on everything else — between them they address the two complaints users actually voice. Everything else (index tuning, model routing, quantization) is real but secondary. Optimise in the order the user feels: perceived speed, then cost, then the long tail.
Instrument one query end to end and record the milliseconds at each stage: query understanding, embed, retrieve, rerank, generate. Find your giant. It's almost always generation — and seeing it in your own numbers redirects your optimisation effort to where it counts.
Drop the cache above in front of your pipeline and replay a realistic set of queries with natural repetition and rephrasing. Measure the hit rate and the latency on hits. Then probe the threshold: find a near-duplicate that shouldn't share an answer and confirm your threshold keeps them apart.
Take an indexed document, edit it, and write the update path: delete its old chunks, re-chunk, re-embed, upsert. Then query for content you removed and confirm the stale chunks are truly gone. Getting this one path right is most of what production freshness requires.
Next chapter: Security and compliance — injection, access control. A RAG system retrieves from your private data and feeds untrusted text to a model — two facts that make it a security surface most tutorials ignore entirely. Prompt injection through retrieved content, per-user access control, and the compliance questions that decide whether you can ship at all.