Retrieval-Augmented Generation (RAG): Prompting with Your Own Data

RAG plugs a private knowledge base into a public model. Instead of fine-tuning, you retrieve the most relevant chunks of your data at query time and stuff them into the prompt. It is the single most important production technique in modern AI applications, and the one most often built poorly.

1. Introduction

Every business that wants to use AI on its own data runs into the same wall: the model has never seen your data, and training or fine-tuning a model on it is expensive and slow, and the result is out of date the moment a document changes. RAG solves this by inverting the problem. The model stays generic and frozen; your data lives in a separate store; at query time you retrieve the relevant slice of data and pass it into the model's context window along with the question.

The technique was introduced in a 2020 paper by Lewis et al. but only became practical at scale once embedding models and vector databases matured. Today RAG powers many internal AI tools: knowledge-base assistants, contract analysers, customer-support bots, code-search systems. This tutorial focuses on the prompting half of RAG; we will reference the retrieval half but only at the conceptual level.

2. The Concept Explained

A RAG pipeline has four stages, each with its own design decisions. The prompt sits at the end, and how you assemble it determines whether retrieval pays off or whether you end up with confidently wrong answers stitched together from irrelevant chunks.

Index time. You chunk source documents into 200–800-token passages, generate an embedding (a vector) for each chunk, and store the chunks plus their embeddings in a vector database.
Retrieval. At query time, you embed the user's question and find the K nearest chunks by vector similarity. Modern systems combine this with keyword (BM25) search for hybrid retrieval.
Re-ranking. Optionally, you pass the top 20 chunks through a small re-ranker model that scores each chunk's relevance to the query and keeps only the best 4–6.
Prompt assembly. You build the final prompt: a system prompt that tells the model what to do, the retrieved chunks (clearly delimited and cited), and the user's question.

The RAG pipeline: query → embedding → top-K retrieval → prompt assembly → grounded answer.
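
The retrieval stage can be illustrated with a minimal sketch in plain Python. The three-dimensional vectors and chunk ids below are made up for illustration; in a real system the vectors come from an embedding model and the search runs inside a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Hypothetical embeddings (assumption): real ones have hundreds of dimensions.
index = {
    "wholesale-returns": [0.9, 0.1, 0.0],
    "retail-returns": [0.6, 0.5, 0.1],
    "office-hours": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedded user question

print(top_k(query, index, k=2))
# ['wholesale-returns', 'retail-returns']
```

Note that the retail chunk ranks second even though the question is about wholesale: nearest-neighbour search returns whatever is closest, relevant or not, which is why the later stages matter.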

3. The Problem Without RAG

Ask a generic model a question about your private data and you get one of three failure modes: it apologises and says it has no information; it confabulates a plausible-sounding answer based on what similar policies look like in general; or worst of all, it answers based on outdated public information that contradicts your current policy.

No retrieval

What is our company's return policy for fragile items
purchased through the wholesale channel?

The model has no idea. It will either refuse or hallucinate a "standard" 30-day policy that may bear no resemblance to your actual contract.

4. The Solution: A Well-Assembled RAG Prompt

Grounded RAG prompt

You are a customer support assistant. Answer the user's
question using ONLY the policy excerpts in <context>.

If the answer is not in the excerpts, say:
"I don't have that information. Please contact your
account manager."

Always cite the source like [doc-id, section].

<context>
[wholesale-returns-v4, §3.2]
Fragile items in the wholesale channel may be returned
within 14 days of receipt, provided the items are
unused, in original packaging, and accompanied by the
original delivery note. Restocking fee: 8%.

[wholesale-returns-v4, §3.3]
Customer is responsible for return shipping. Damaged-on-
arrival items follow the separate DOA procedure (see
wholesale-doa-v2).

[shipping-policy-v9, §6.1]
Standard retail returns are 30 days, full refund.
</context>

User question:
What is our company's return policy for fragile items
purchased through the wholesale channel?

Given this prompt, a well-behaved model answers from the wholesale excerpts, cites them, and uses the fallback phrase when the data isn't present. The retail-policy chunk is also in context, but because the question specifies wholesale, the model should ignore it. Confirm this in testing rather than assume it.
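
One way to assemble a prompt like the one above in code is sketched below. The `Chunk` class and `build_prompt` function are hypothetical names introduced for illustration; the instruction wording is copied from the example prompt.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A retrieved passage plus the metadata needed to cite it (hypothetical type)."""
    doc_id: str
    section: str
    text: str

FALLBACK = "I don't have that information. Please contact your account manager."

def build_prompt(chunks, question):
    """Assemble instructions, delimited context, and the user question."""
    # Each chunk is prefixed with the label the model is asked to cite.
    context = "\n\n".join(f"[{c.doc_id}, {c.section}]\n{c.text}" for c in chunks)
    return (
        "You are a customer support assistant. Answer the user's question "
        "using ONLY the policy excerpts in <context>.\n\n"
        f'If the answer is not in the excerpts, say:\n"{FALLBACK}"\n\n'
        "Always cite the source like [doc-id, section].\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"User question:\n{question}"
    )

chunks = [
    Chunk("wholesale-returns-v4", "§3.2", "Fragile items may be returned within 14 days."),
    Chunk("shipping-policy-v9", "§6.1", "Standard retail returns are 30 days, full refund."),
]
print(build_prompt(chunks, "What is the wholesale return window for fragile items?"))
```

Keeping the fallback phrase in a single constant makes it easy to check for the same string later when evaluating answers.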

5. Step-by-Step Breakdown

Chunk thoughtfully. Chunk at semantic boundaries — paragraphs, sections, list items — not arbitrary character counts. Overlap chunks by 10–20% so context is not chopped mid-sentence.
Embed with care. Choose an embedding model that matches your domain language. Re-embed your corpus when you change the model — you cannot mix embeddings from different models in the same store.
Retrieve broadly, then narrow. Fetch the top 20 chunks, re-rank to 4–6. Pure top-K from vector similarity is brittle, for example on exact identifiers and rare terms; hybrid retrieval (vector + keyword) is usually more reliable.
Wrap the context in tags. Use <context>…</context> or similar so the model can clearly distinguish retrieved data from your instructions and from the user's message. Delimiters also help against prompt injection hidden in retrieved documents, but only partially: pair them with an instruction to treat the context as data, not as commands.
Insist on grounding. Tell the model to answer only from the context, and to say so when the answer isn't there. Without this rule, the model fills gaps with hallucination.
Require citations. Force every claim to be tagged with the source chunk's id. Citations make the answer auditable and turn the model into a teammate rather than a black box.
Evaluate continuously. Track retrieval quality (did the right chunk make it into the context?) separately from generation quality (given the right chunk, was the answer correct?). They fail in different ways and need different fixes.
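
The chunking step can be sketched as follows. This is a simplified illustration, not a production chunker: it assumes paragraphs are separated by blank lines, uses word counts as a rough stand-in for tokens, and overlaps by whole paragraphs rather than by a percentage. The function name `chunk_paragraphs` is hypothetical.

```python
def chunk_paragraphs(text, max_words=200, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks that aim for at most max_words words.

    The last paragraph(s) of each chunk are repeated at the start of the next
    one. A single long paragraph, or the carried-over overlap, can push a
    chunk past max_words; this sketch does not split inside paragraphs.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words_so_far = sum(len(p.split()) for p in current)
        if current and words_so_far + len(para.split()) > max_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "a b c\n\nd e f\n\ng h i"
print(chunk_paragraphs(doc, max_words=6, overlap_paragraphs=1))
# ['a b c\n\nd e f', 'd e f\n\ng h i']
```

Splitting on paragraph boundaries keeps each chunk readable on its own, and the repeated paragraph gives the second chunk the context it would otherwise lose.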

Tip: The most common RAG bug is not the model — it is retrieval missing the right chunk. Before tuning prompts, instrument retrieval. If the relevant chunk isn't in the top-K, no amount of prompt engineering will save you.
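
A minimal way to instrument retrieval is a hit rate over a small labelled set: for each test question, record which chunk should have been retrieved and check whether it appears in the top K. The sketch below assumes one known-relevant chunk id per question; the function name and the sample data are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose known-relevant chunk id appears in the top k.

    retrieved: query id -> ranked list of chunk ids returned by retrieval
    relevant:  query id -> the single chunk id that should have been found
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for q, ids in retrieved.items() if relevant[q] in ids[:k])
    return hits / len(retrieved)

# Hypothetical retrieval logs for two test questions.
retrieved = {
    "q1": ["chunk-7", "chunk-2", "chunk-9"],
    "q2": ["chunk-4", "chunk-1", "chunk-3"],
}
relevant = {"q1": "chunk-2", "q2": "chunk-8"}

print(hit_rate_at_k(retrieved, relevant, k=3))  # 0.5
print(hit_rate_at_k(retrieved, relevant, k=1))  # 0.0
```

If this number is low, the fix lies in chunking, embedding, or retrieval settings, not in the prompt.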

6. Practice Exercises

Exercise 1

Set up the smallest possible RAG: 20 short text passages in a file, an embedding API, an in-memory vector store, and a prompt that retrieves the top 3 by cosine similarity. Ask questions and inspect which chunks were retrieved for each.

Exercise 2

Deliberately ask questions that aren't in your corpus. Tune the prompt until the model reliably refuses with the fallback phrase instead of hallucinating. This is the single most important RAG behaviour to lock down.
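
To track progress on this exercise, it helps to measure how often the model actually uses the fallback phrase on out-of-corpus questions. The sketch below assumes you have already collected the model's answers as strings; `refusal_rate` is a hypothetical helper, and the substring match is deliberately crude.

```python
FALLBACK = "I don't have that information"

def refusal_rate(answers, fallback=FALLBACK):
    """Share of answers that contain the fallback phrase (case-insensitive)."""
    if not answers:
        return 0.0
    refused = sum(1 for a in answers if fallback.lower() in a.lower())
    return refused / len(answers)

# Hypothetical model outputs for three questions that are not in the corpus.
answers = [
    "I don't have that information. Please contact your account manager.",
    "Our standard policy is 30 days.",  # a hallucinated answer
    "i don't have that information.",
]
print(refusal_rate(answers))
```

For out-of-corpus questions the target is a rate of 1.0; anything lower points to answers worth reading by hand.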

Exercise 3

Add citations to your prompt. Then read 20 sample answers and check whether the cited chunk actually contains the claim. A surprisingly high fraction of "cited" answers cite the wrong chunk. Fixing this usually means adjusting retrieval, not the prompt.
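
Part of this check can be automated. The sketch below assumes citations follow the [doc-id, section] format from the example prompt and flags any citation that does not match a chunk that was actually in the context. It cannot tell whether a cited chunk supports the claim; that still needs a human read. The function name is hypothetical.

```python
import re

# Matches citations of the form [doc-id, section].
CITATION = re.compile(r"\[([^\[\],]+),\s*([^\[\]]+)\]")

def unknown_citations(answer, context_ids):
    """Return cited (doc_id, section) pairs that were not in the context."""
    cited = {(doc.strip(), sec.strip()) for doc, sec in CITATION.findall(answer)}
    return sorted(cited - set(context_ids))

context_ids = {
    ("wholesale-returns-v4", "§3.2"),
    ("wholesale-returns-v4", "§3.3"),
}
answer = (
    "Fragile items can be returned within 14 days [wholesale-returns-v4, §3.2]. "
    "A handling fee also applies [pricing-v1, §2]."
)
print(unknown_citations(answer, context_ids))
# [('pricing-v1', '§2')]
```

A non-empty result means the model cited something it was never shown, which is a generation problem worth logging separately from retrieval misses.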

7. Key Takeaways

RAG plugs private data into a generic model at query time: no fine-tuning required, and answers are as current as your index.
The pipeline has four stages: index (chunk and embed), retrieve, re-rank, assemble. Each stage has its own failure modes.
Most RAG bugs are retrieval bugs. Always measure retrieval quality separately from generation quality.
Wrap retrieved context in tags, instruct the model to answer only from context, and require citations on every claim.
Provide an explicit fallback phrase for "not in the data"; without it, the model will quietly hallucinate.
