Here is the part nobody puts in the demo. Before you can chunk a document, embed it, or retrieve from it, you have to turn it into clean text — and your data does not arrive as clean text. It arrives as PDFs with two columns and a footer on every page, as HTML wrapped in three layers of navigation menus, as scanned contracts that are really just photographs of words, as spreadsheets where the meaning lives in the layout. Teams that skip this step build a beautiful retrieval pipeline on top of garbage and then wonder why the answers are wrong. The retrieval was perfect. The input was broken.
This chapter is about the unglamorous forty percent of real RAG work: getting trustworthy text out of untrustworthy formats. It is the chapter most tutorials skip, which is exactly why most tutorials produce demos that fall over on real documents.
In Chapter 00 we noted that retrieval faithfully fetches whatever is in your corpus, good or bad. Data prep is where that corpus is made. If your extraction merges a table's columns into a word soup, the chunk that contains the answer is now unreadable, and no embedding model on earth will rank it correctly. If your PDF parser reads a two-column page straight across, every sentence is spliced to the sentence beside it, and the meaning is gone before you ever embed a thing.
The uncomfortable truth: improving your parser usually buys more accuracy than improving your embedding model or your reranker. It is the least glamorous lever and frequently the largest one. Spend your first week of any RAG project here, looking at the actual extracted text, before you touch anything downstream.
The single most useful habit in this whole series: after extraction, print the text and read a random sample of twenty chunks with your own eyes. Not the document; the extracted text. You will find merged columns, lost tables, page numbers glued into sentences, and headers repeating every 500 words. Every one of those is a retrieval bug you can fix now or debug painfully later.
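That habit is easy to make routine. The following is a minimal sketch, assuming your chunks are already a plain list of strings; the function names `sample_for_review` and `print_review_sheet` are hypothetical, not part of any library.

```python
import random


def sample_for_review(chunks, k=20, seed=None):
    """Pick up to k chunks at random and return (index, chunk) pairs
    in document order, so defects can be traced back to their source."""
    rng = random.Random(seed)  # pass a seed to get a repeatable sample
    k = min(k, len(chunks))
    picked = sorted(rng.sample(range(len(chunks)), k))
    return [(i, chunks[i]) for i in picked]


def print_review_sheet(chunks, k=20, seed=None, width=500):
    """Print the sampled chunks with a header line each, for reading by eye."""
    for i, chunk in sample_for_review(chunks, k, seed):
        print(f"=== chunk {i} ({len(chunk)} chars) ===")
        print(chunk[:width])
        print()


if __name__ == "__main__":
    demo_chunks = [f"Example chunk number {n}." for n in range(100)]
    print_review_sheet(demo_chunks, k=3, seed=0)
```

Keeping the chunk index in the printout matters more than it looks: when you spot a defect, you want to find the page it came from.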
Prep is a sequence of narrowing steps, each one turning something messy into something slightly more usable. Picture it as a funnel: raw bytes go in the top, retrieval-ready text comes out the bottom.
| Type | How it breaks | The fix in one line |
|---|---|---|
| Born-digital PDF | Multi-column read straight across; headers/footers in every page's text; ligatures mangled. | Use a layout-aware parser, not a naive text dump. |
| Scanned PDF / image | It's a picture; there is no text to extract at all. | OCR first, then treat as text — and expect errors. |
| HTML | Navigation, ads, cookie banners, and footers outweigh the article. | Extract the main content node; discard the chrome. |
| Tables (in any format) | Linearised into a word soup; row/column relationships lost. | Detect tables; serialise each row as a self-describing sentence. |
| Office docs (DOCX, PPTX) | Text is recoverable but structure (headings, slides) is easily lost. | Use a structure-preserving extractor; keep heading hierarchy. |
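The table implies a routing step: before you can pick a fix, you need to know which kind of document you are holding, and a file extension can be wrong. A rough sketch of sniffing the format from the leading bytes follows. The function name and the category labels are assumptions for illustration, and a real pipeline would need more cases.

```python
def sniff_format(data: bytes) -> str:
    """Guess a document's format from its leading bytes.

    Returns one of 'pdf', 'office', 'html', or 'unknown'.
    """
    head = data[:1024].lstrip()
    if head.startswith(b"%PDF-"):
        # Could be born-digital or scanned; that needs a second check
        # on whether the pages carry any extractable text.
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX and PPTX are ZIP containers. So is an ordinary .zip,
        # so treat this as a hint, not proof.
        return "office"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or b"<html" in lowered:
        return "html"
    return "unknown"


if __name__ == "__main__":
    print(sniff_format(b"%PDF-1.7\n..."))
    print(sniff_format(b"<!DOCTYPE html><html><body>hi</body></html>"))
```

Anything that comes back as `unknown` is worth logging instead of guessing at; silent fallbacks are how garbage gets into a corpus.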
Here is a realistic extraction step for born-digital PDFs that respects layout and strips the repeating header/footer noise. It uses pdfplumber, which exposes the position of every word so you can reason about columns rather than reading blindly across the page.
```python
# pip install pdfplumber
import re
from collections import Counter

import pdfplumber


def _normalise(line):
    """Collapse digits so 'Page 4 of 60' and 'Page 5 of 60' count as one line."""
    return re.sub(r"\d+", "#", line)


def extract_pdf(path):
    """Extract text from a born-digital PDF, preserving layout,
    with repeating headers/footers removed."""
    raw_pages = []
    line_counter = Counter()  # number of pages each normalised line appears on
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # layout=True pads the text with spaces according to word
            # positions, so the gap between columns stays visible
            # instead of the columns being glued together
            text = page.extract_text(layout=True) or ""
            lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
            raw_pages.append(lines)
            # count each distinct line once per page
            line_counter.update({_normalise(ln) for ln in lines})

    # a line that appears on most pages is almost certainly
    # a running header or footer, not real content
    n_pages = len(raw_pages)
    boilerplate = {ln for ln, c in line_counter.items()
                   if c > n_pages * 0.5 and n_pages > 2}

    pages_text = []
    for lines in raw_pages:
        kept = [ln for ln in lines if _normalise(ln) not in boilerplate]
        pages_text.append("\n".join(kept))
    return pages_text, boilerplate


pages, removed = extract_pdf("annual-report.pdf")
print(f"Extracted {len(pages)} pages")
print(f"Removed {len(removed)} boilerplate lines:")
for ln in sorted(removed)[:3]:
    print(f"  · {ln!r}")
print("\n--- page 4 preview ---")
print(pages[3][:300])
```

Running this on a typical 60-page report prints something like:

```text
Extracted 60 pages
Removed 3 boilerplate lines:
  · 'ACME Corporation — Annual Report #'
  · 'Confidential — Do Not Distribute'
  · 'Page # of #'

--- page 4 preview ---
Revenue grew 18% year over year, driven primarily by the
expansion of the enterprise segment in the EMEA region.
Operating margin improved to 23%, reflecting disciplined
cost management across the business...
```
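Two things earned their keep here. `layout=True` preserved horizontal spacing, so the gutter in the two-column financial section survived as a run of spaces rather than the columns being glued into one unbroken string. It does not reorder the text, though: each line still runs across both columns, so a true column-by-column reading means splitting on word positions. And the frequency-based boilerplate detector removed the header and footer that would otherwise appear in every single chunk, polluting embeddings with "Confidential — Do Not Distribute" over and over. Neither fix is clever; both are the difference between usable and unusable text.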
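When you need a genuine column-by-column reading order, one option is to work from word positions directly. The sketch below is a simplified illustration, not a general layout analyser. It assumes exactly two columns split at the page midpoint, and it assumes each word is a dict with `text`, `x0` and `top` keys, a shape modelled on what pdfplumber's `extract_words()` returns (worth verifying against your version). The function name and the `line_tolerance` parameter are hypothetical.

```python
def read_in_column_order(words, page_width, line_tolerance=3.0):
    """Return page text read down the left column, then down the right.

    words: list of dicts with 'text', 'x0' (left edge) and 'top' keys.
    Assumes a two-column page split at the horizontal midpoint.
    """
    midpoint = page_width / 2
    columns = ([], [])
    for w in words:
        columns[0 if w["x0"] < midpoint else 1].append(w)

    out_lines = []
    for column in columns:
        # Group words into lines: words whose 'top' is within the
        # tolerance of the line's first word share that line.
        lines = []
        for w in sorted(column, key=lambda w: w["top"]):
            if lines and abs(w["top"] - lines[-1][0]["top"]) <= line_tolerance:
                lines[-1].append(w)
            else:
                lines.append([w])
        # Within each line, read left to right.
        for line in lines:
            line.sort(key=lambda w: w["x0"])
            out_lines.append(" ".join(w["text"] for w in line))
    return "\n".join(out_lines)


if __name__ == "__main__":
    demo = [
        {"text": "Costs", "x0": 320, "top": 100.0},
        {"text": "grew", "x0": 110, "top": 100.5},
        {"text": "Revenue", "x0": 50, "top": 100.0},
        {"text": "fell.", "x0": 370, "top": 100.0},
        {"text": "strongly.", "x0": 50, "top": 115.0},
    ]
    print(read_in_column_order(demo, page_width=600))
```

A fixed midpoint breaks on full-width headings and three-column pages, so treat this as a starting point and check the result against real pages.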
HTML is mostly not content. A news article page might be 4% article and 96% navigation, ads, related-links, and footer. Embedding the whole thing buries the signal. The job is to find the main content node and throw away the rest.
```python
# pip install trafilatura
import trafilatura


def extract_article(html):
    """Pull just the main article text out of a web page,
    discarding nav, ads, boilerplate."""
    # trafilatura is purpose-built for main-content extraction
    text = trafilatura.extract(
        html,
        include_comments=False,  # drop comment sections
        include_tables=True,     # keep tables (we handle them below)
        favor_precision=True,    # prefer clean over complete
    )
    return text or ""


with open("blog-post.html", encoding="utf-8") as f:
    article = extract_article(f.read())

print(f"Article length: {len(article)} chars")
print(article[:200])
```

```text
Article length: 5840 chars
Hybrid search combines the precision of keyword matching with
the recall of semantic search. In this post we walk through
why neither approach alone is sufficient for production...
```
A general-purpose HTML-to-text converter would have returned 60,000 characters here, most of it menu labels and "subscribe to our newsletter." The purpose-built extractor returned 5,840 characters of actual article. That ratio is typical, and it is why "just strip the tags" is the wrong instinct for the web.
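To see why discarding the chrome matters, here is a deliberately crude standard-library sketch that drops text inside a few structural tags and keeps the rest. It illustrates the principle only and makes no claim about how a purpose-built extractor works internally. The tag list and the names `ChromeStripper` and `strip_chrome` are assumptions for this example.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome (an illustrative list).
CHROME_TAGS = {"nav", "footer", "aside", "script", "style", "form"}


class ChromeStripper(HTMLParser):
    """Collect visible text, skipping anything nested inside chrome tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside at least one chrome tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def strip_chrome(html):
    parser = ChromeStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><body><nav><a href='/'>Home</a></nav>"
        "<article><h1>Hybrid search</h1><p>Keyword plus semantic.</p></article>"
        "<footer>Subscribe to our newsletter</footer></body></html>"
    )
    print(strip_chrome(page))
```

Real pages rarely mark their chrome this cleanly; menus built from generic `div` elements pass straight through a filter like this, which is the argument for a dedicated extractor.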
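Tables are where naive extraction does its quietest damage. Consider a small pricing table:

| Plan | Price | Seats |
|---|---|---|
| Free | $0 | 1 |
| Pro | $20 | 5 |
| Team | $50 | 20 |

A naive extractor flattens this into a single run of words and numbers, and the row/column relationships are lost.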
The principle is to make each row self-describing: fold the column headers into every row so that, no matter how the table gets chunked later, each piece still says what it means.
```python
def serialise_table(headers, rows, caption=""):
    """Turn a table into one self-describing sentence per row.
    Each row keeps its column names so it survives chunking."""
    sentences = []
    for row in rows:
        parts = [f"{h}: {v}" for h, v in zip(headers, row)]
        line = "; ".join(parts)
        if caption:
            line = f"[{caption}] {line}"
        sentences.append(line + ".")
    return "\n".join(sentences)


headers = ["Plan", "Price", "Seats"]
rows = [["Free", "$0", "1"], ["Pro", "$20", "5"], ["Team", "$50", "20"]]
print(serialise_table(headers, rows, caption="Pricing"))
```

```text
[Pricing] Plan: Free; Price: $0; Seats: 1.
[Pricing] Plan: Pro; Price: $20; Seats: 5.
[Pricing] Plan: Team; Price: $50; Seats: 20.
```
Now a user asking "how many seats does Pro include?" retrieves a chunk that unambiguously says "Plan: Pro; Price: $20; Seats: 5." The naive version would have retrieved "Pro 20 5" and left the model to guess whether 20 was the price or the seat count. This single technique fixes a large share of "the numbers are wrong" complaints in production RAG.
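In practice the table rarely arrives with headers and rows already separated. The sketch below assumes the extractor hands back a list of rows with the header row first and empty cells as `None`. That shape is an assumption, so check what your tool actually returns. The function name `table_to_sentences` is hypothetical.

```python
def table_to_sentences(table, caption=""):
    """Serialise a raw extracted table (header row first) into one
    self-describing sentence per data row. Empty cells are left out."""
    if not table:
        return ""

    def clean(cell):
        return "" if cell is None else str(cell).strip()

    headers = [clean(h) for h in table[0]]
    sentences = []
    for row in table[1:]:
        cells = [clean(c) for c in row]
        # Keep only pairs where both the header and the value exist.
        parts = [f"{h}: {c}" for h, c in zip(headers, cells) if h and c]
        if not parts:
            continue  # nothing usable in this row
        line = "; ".join(parts)
        if caption:
            line = f"[{caption}] {line}"
        sentences.append(line + ".")
    return "\n".join(sentences)


if __name__ == "__main__":
    raw = [
        ["Plan", "Price", "Seats"],
        ["Free", "$0", "1"],
        [None, None, None],
        ["Pro", "$20", None],
    ]
    print(table_to_sentences(raw, caption="Pricing"))
```

Dropping an empty cell is a choice, not a rule. If "no value" is itself meaningful in your domain, emit something explicit such as "Seats: not stated" instead.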
A scanned PDF has no text — it is an image of text. You must run optical character recognition first. The thing to internalise is that OCR is lossy: it will misread some characters, and those errors flow downstream into your embeddings and answers. Plan for imperfection.
```python
# pip install pytesseract pdf2image  (also needs system tesseract + poppler)
import pytesseract
from pdf2image import convert_from_path


def ocr_pdf(path, min_confidence=60):
    """OCR a scanned PDF page by page. Flag low-confidence pages
    so a human can review them instead of trusting silently."""
    pages = convert_from_path(path, dpi=300)  # higher dpi = better OCR
    results = []
    for i, image in enumerate(pages, start=1):
        data = pytesseract.image_to_data(
            image, output_type=pytesseract.Output.DICT)
        # conf can arrive as an int, a float, or a string depending on
        # the pytesseract version, so parse it with float(); a value
        # of -1 marks layout boxes that are not words
        scored = [(w, float(c)) for w, c in zip(data["text"], data["conf"])
                  if w.strip() and float(c) >= 0]
        words = [w for w, c in scored if c >= min_confidence]
        avg_conf = sum(c for _, c in scored) / len(scored) if scored else 0.0
        results.append({
            "page": i,
            "text": " ".join(words),
            "avg_confidence": round(avg_conf, 1),
            "needs_review": avg_conf < 80,  # surface shaky pages
        })
    return results


pages = ocr_pdf("scanned-contract.pdf")
for p in pages[:3]:
    flag = " ⚠ REVIEW" if p["needs_review"] else ""
    print(f"Page {p['page']}: confidence {p['avg_confidence']}%{flag}")
```

```text
Page 1: confidence 94.2%
Page 2: confidence 71.5% ⚠ REVIEW
Page 3: confidence 96.1%
```
The detail that matters: capturing a confidence score and flagging weak pages. A silent OCR pipeline ingests a 71%-confidence page full of misreadings and serves wrong answers from it forever. A pipeline that surfaces "page 2 is shaky" lets a human glance at it. The cost is a few lines; the payoff is not grounding your system in a machine's misreading of a blurry fax.
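What you do with the flag is up to you. One simple policy, sketched here, is to index only the pages that passed and queue the rest for a human, worst first. The input is a list of dicts with `needs_review` and `avg_confidence` keys, matching the output of `ocr_pdf` above. The function name `triage_ocr_pages` is hypothetical.

```python
def triage_ocr_pages(results):
    """Split OCR results into pages safe to index now and pages a human
    should look at first. The review queue is sorted lowest confidence first."""
    to_index = [r for r in results if not r["needs_review"]]
    to_review = sorted(
        (r for r in results if r["needs_review"]),
        key=lambda r: r["avg_confidence"],
    )
    return to_index, to_review


if __name__ == "__main__":
    demo = [
        {"page": 1, "text": "...", "avg_confidence": 94.2, "needs_review": False},
        {"page": 2, "text": "...", "avg_confidence": 71.5, "needs_review": True},
        {"page": 3, "text": "...", "avg_confidence": 96.1, "needs_review": False},
    ]
    ok, queue = triage_ocr_pages(demo)
    print("index now:", [r["page"] for r in ok])
    print("review:", [r["page"] for r in queue])
```

Holding back a page leaves a gap in the corpus until someone reviews it. Whether a gap is better than a shaky page depends on how costly a wrong answer is in your domain.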
During prep you have information you will never be able to recover later, once the document is shredded into chunks. Capture it as structured metadata attached to every chunk:
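As one possible shape for that metadata, here is a minimal sketch using dataclasses. Every field name below is an illustrative assumption, not a fixed schema; the point is only that source position travels with the text so a citation can be built later.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata; adapt the fields to your domain."""
    source_path: str                 # where the document came from
    page: Optional[int]              # 1-based page, None if unpaginated
    section_heading: Optional[str]   # nearest heading above the chunk
    extraction_method: str           # e.g. "pdf-text", "ocr", "html"
    ocr_confidence: Optional[float] = None  # only set for OCR'd pages


@dataclass(frozen=True)
class Chunk:
    text: str
    meta: ChunkMetadata

    def citation(self) -> str:
        """Build a human-readable pointer back to the source position."""
        parts = [self.meta.source_path]
        if self.meta.page is not None:
            parts.append(f"p. {self.meta.page}")
        if self.meta.section_heading:
            parts.append(f'"{self.meta.section_heading}"')
        return ", ".join(parts)


if __name__ == "__main__":
    chunk = Chunk(
        text="Revenue grew 18% year over year.",
        meta=ChunkMetadata("annual-report.pdf", 4, "Financial highlights", "pdf-text"),
    )
    print(chunk.citation())
    print(asdict(chunk)["meta"])
```

Freezing the dataclasses is a small guard against metadata being edited by accident somewhere downstream.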
My take. Metadata feels like bureaucracy at ingestion time and turns out to be the difference between a toy and a product. The single most common regret I hear from teams six months in is "we didn't capture the source position, so now our citations just say 'somewhere in this 200-page PDF.'" Capture more than you think you need. Storage is cheap; re-ingesting fifty million documents is not.
Take five real documents from your domain — ideally a mix: a PDF, a web page, something with a table. Extract the text with any tool. Then read the raw output, all of it, slowly. Write down every defect you spot: merged columns, lost tables, repeating headers, broken characters. This list is your data-prep backlog, and it is almost always longer and more important than people expect.
Find a document with a table that matters: a pricing sheet, a spec table, a comparison. Extract it naively and see the word soup. Then serialise its rows into self-describing sentences by hand. Embed both versions (you'll learn how in Chapter 04) and notice which one a relevant query retrieves. The difference is usually stark.
Before you ingest anything for real, write down the exact fields you'll attach to every chunk. Use the list above as a starting point and add whatever your domain needs — author, document category, language, version. Committing to this schema now saves a full re-ingestion later.
Next chapter: Chunking — the hardest problem. Now that you have clean text, how do you split it? The chunk is the unit of retrieval, and how you cut determines what can ever be found. We'll measure the strategies, not just list them.