RAG: A Field Manual for Building LLM Systems That Use Your Data

The vector index — databases and the geometry

You now have millions of vectors. When a question arrives, you embed it and need to find the handful of stored vectors closest to it — and you need to do it in a few milliseconds, not by comparing against every vector one at a time. That is the job of the vector index: a data structure that makes "find the nearest neighbours" fast at a scale where the obvious approach would take minutes per query. This chapter is about how that structure works, the geometry beneath it, and the honest landscape of databases that provide it — including the unfashionable truth that you may not need a dedicated one at all.

What you'll take away from this chapter

Why exact nearest-neighbour search doesn't scale, and what "approximate" buys you
How the two dominant index types — HNSW and IVF — actually work, pictured rather than hand-waved
The real distinction between an index and a vector database, and why it matters when things break
An honest tour of the main vector databases, and when each one earns its place
The case — more common than vendors admit — for not using a dedicated vector database at all

Why you can't just compare everything

The naive way to find the nearest vectors is to compute the similarity between the query and every single stored vector, then take the top few. This is called exact, or brute-force, search. It is perfectly correct and perfectly doomed: its cost grows linearly with the number of vectors. At a thousand chunks it's instant. At fifty million chunks, every query compares against fifty million vectors, and your "few milliseconds" becomes an unacceptable wait — multiplied by every concurrent user.
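To make the linear cost concrete, here is a minimal sketch of brute-force search in NumPy. The function name, the random data, and the 64-dimension size are illustrative assumptions, not part of the chapter's pipeline; the point is only that every query touches every stored vector.

```python
import numpy as np

def brute_force_search(query, vectors, k=5):
    """Exact top-k by cosine similarity.

    Assumes `query` and every row of `vectors` are unit-normalised,
    so a dot product equals cosine similarity.
    """
    sims = vectors @ query          # one dot product per stored vector: O(N * d)
    top = np.argsort(-sims)[:k]     # indices of the k highest similarities
    return top, sims[top]

# Illustrative random data standing in for real embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = vectors[42]                 # a stored vector is its own nearest neighbour
ids, sims = brute_force_search(query, vectors, k=3)
print(ids, sims)
```

Doubling the number of rows in `vectors` doubles the work per query, which is exactly the scaling problem the rest of the chapter works around.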

The escape is to give up a tiny bit of correctness for an enormous amount of speed. Approximate nearest-neighbour (ANN) search finds almost the exact nearest neighbours — say, 98% of the ones brute force would have found — in a thousandth of the time. For RAG, that trade is almost always worth it: missing the occasional borderline chunk costs you far less than making every user wait. The index structures in this chapter are all ways of organising vectors so that ANN search can skip almost all of them.

Brute force compares the query against every point. Approximate search navigates a structure to hop straight to the right neighbourhood, checking only a handful of points. It might miss one borderline neighbour — and for RAG that's a fine trade for being a thousand times faster.

HNSW — the graph you walk

The dominant index in 2026 is HNSW, which stands for Hierarchical Navigable Small World. The name is intimidating; the idea is not. Imagine every vector is a city, and you draw roads connecting each city to its nearest neighbours. To find the city closest to a new point, you start anywhere and keep driving to whichever connected city is closer to your target, until no neighbour is closer. You've arrived, having visited a dozen cities instead of all fifty million.

The "hierarchical" part adds express layers on top, like a flight network above the roads: a sparse top layer lets you jump across the whole space in a few hops to get roughly close, then you descend to denser layers for local precision. Start with a long flight, finish with short drives. That layering is what keeps search fast even as the number of vectors grows enormous: search time grows roughly with the logarithm of the corpus size, not linearly.

HNSW's trade-offs, plainly: it is fast and accurate, which is why it's everywhere. It uses a fair amount of memory (it stores all those connections), and it is awkward to update with heavy deletions. For most RAG systems those costs are acceptable and HNSW is the right default.
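The "keep driving to the closer city" step can be sketched in a few lines. This is a deliberately simplified, single-layer illustration, not HNSW itself: the graph is a plain k-nearest-neighbour graph built by brute force, there are no express layers, and the walk tracks one current node rather than the candidate list a real implementation keeps. All names and sizes here are assumptions for illustration.

```python
import numpy as np

def build_knn_graph(vectors, m=8):
    """Link every point to its m nearest neighbours (brute force, illustration only)."""
    sq = (vectors ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :m]      # row i holds the ids of i's neighbours

def greedy_walk(query, vectors, graph, start=0):
    """Move to whichever neighbour is closest to the query until none is closer."""
    current = start
    current_dist = np.linalg.norm(vectors[current] - query)
    n_distances = 1                           # how many distances we computed
    while True:
        neighbours = graph[current]
        dists = np.linalg.norm(vectors[neighbours] - query, axis=1)
        n_distances += len(neighbours)
        best = int(np.argmin(dists))
        if dists[best] >= current_dist:       # no neighbour is closer: we have arrived
            return current, n_distances
        current, current_dist = int(neighbours[best]), dists[best]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000, 16))
query = rng.normal(size=16)

graph = build_knn_graph(vectors, m=8)
found, n_distances = greedy_walk(query, vectors, graph)
print(f"stopped at node {found} after {n_distances} distance computations")
```

A single greedy walk like this can stop at a point that is only locally closest. Real HNSW implementations reduce that risk with the layered entry points described above and by tracking several candidates at once, which is what the search-time `ef` parameter controls.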

IVF — the neighbourhoods you pre-sort

The other common approach, IVF (Inverted File index), works by clustering. Ahead of time, it groups all vectors into, say, a thousand clusters, each with a representative centre. At query time, it finds the few cluster centres closest to the query, then searches only the vectors in those clusters — ignoring the other 990-odd clusters entirely. You skip most of the corpus by deciding, up front, which neighbourhoods are worth visiting.

IVF uses less memory than HNSW and builds quickly, but it can be less accurate near cluster boundaries — a relevant vector sitting just across a cluster line from the query may be missed because its cluster wasn't searched. You tune how many clusters to probe: probe more for accuracy, fewer for speed. IVF shines at very large scale where HNSW's memory appetite becomes a problem.
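Here is a toy version of the same idea, as a sketch under stated assumptions: a few rounds of plain k-means stand in for whatever clustering a real library uses, the data is random, and the names `build_ivf`, `ivf_search`, and `nprobe` are illustrative rather than any particular library's API. It shows the probe count trading recall for work.

```python
import numpy as np

def nearest(points, centres):
    """Index of the closest centre for every point (squared Euclidean distance)."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_ivf(vectors, n_clusters=16, n_iters=10, seed=0):
    """Cluster with a few rounds of k-means; return centres and per-cluster id lists."""
    rng = np.random.default_rng(seed)
    centres = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest(vectors, centres)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):                  # leave an empty cluster's centre in place
                centres[c] = members.mean(axis=0)
    assign = nearest(vectors, centres)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centres, lists

def ivf_search(query, vectors, centres, lists, k=5, nprobe=4):
    """Search only the vectors in the nprobe clusters whose centres are closest."""
    centre_d2 = ((centres - query) ** 2).sum(axis=1)
    probed = np.argsort(centre_d2)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probed])
    d2 = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d2)[:k]]

def exact_search(query, vectors, k=5):
    d2 = ((vectors - query) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4_000, 32))
query = rng.normal(size=32)
centres, lists = build_ivf(vectors, n_clusters=16)

exact = exact_search(query, vectors, k=10)
for nprobe in (1, 4, 16):
    approx = ivf_search(query, vectors, centres, lists, k=10, nprobe=nprobe)
    recall = len(set(approx.tolist()) & set(exact.tolist())) / len(exact)
    print(f"nprobe={nprobe:2d}  recall@10={recall:.2f}")
```

Probing all 16 clusters reproduces the exact result by construction, at brute-force cost. Smaller probe counts do less work and may miss neighbours that sit across a cluster boundary.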

The index is not the database

Here is a distinction Chapter 01 insisted on, now made concrete. HNSW and IVF are index algorithms — the mathematics of organising vectors for fast search. A vector database is the production system wrapped around an index: it handles storage, replication, backups, metadata filtering, access control, concurrent updates, and an API. The index finds neighbours; the database makes that capability survivable in production.

Why care? Because when something goes wrong, the layer that's failing determines the fix. Slow search at small scale? That's an index-tuning problem. Search works but the service falls over under load, or you can't restore after an outage? Those are database problems, and no amount of index tuning touches them. Teams that conflate the two waste days tuning HNSW parameters when their actual problem is that they have no replication.

An honest tour of the databases

Option	Shape	Reach for it when
pgvector	An extension to PostgreSQL.	You already run Postgres and have up to a few million vectors. Your vectors live beside your relational data; one database to operate. The pragmatic default for most teams.
Qdrant	Open-source, purpose-built, Rust.	You want a dedicated vector DB with strong metadata filtering, self-hosted or managed, without huge operational weight.
Pinecone	Fully managed, closed-source.	You want zero infrastructure to operate and will pay for that convenience. Fast to start; you're renting, not owning.
Weaviate	Open-source, feature-rich.	You want built-in hybrid search and modules, and don't mind more moving parts.
Milvus	Open-source, built for huge scale.	You have hundreds of millions of vectors and the team to run distributed infrastructure.
Elasticsearch / OpenSearch	Search engine with vector support.	You already run it for keyword search and want hybrid in one system (handy for Chapter 06).

My take. Most teams reach for a dedicated vector database far too early. If you already run Postgres and have fewer than a few million chunks — which describes the large majority of real projects — pgvector is almost certainly the right answer. It keeps your vectors next to your application data, it's one less system to back up and secure, and it is plenty fast at that scale. Graduate to a dedicated vector database when you have measured a real reason to: tens of millions of vectors, or a filtering or scale need pgvector genuinely can't meet. "It feels more serious" is not a reason.

A working pgvector example

Here is the whole loop with pgvector: create a table with a vector column, build an HNSW index, insert chunks, and query. This is a complete, production-shaped starting point — not a toy.

# pip install psycopg sentence-transformers pgvector
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim
conn = psycopg.connect("postgresql://localhost/ragdb")
# The extension must exist before register_vector can look up the vector type.
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)

# 1. table + HNSW index. vector(768) matches the model's dimension.
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id      bigserial PRIMARY KEY,
        doc_id  text,
        content text,
        embedding vector(768)
    )
""")
# cosine distance index; m and ef_construction trade build cost for quality
conn.execute("""
    CREATE INDEX IF NOT EXISTS chunks_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

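# 2. insert a chunk (unit-normalised; cosine ignores vector length anyway,
#    but unit vectors keep scores comparable if you later switch to inner product)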
def add_chunk(doc_id, content):
    vec = model.encode(content, normalize_embeddings=True)
    conn.execute(
        "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s)",
        (doc_id, content, vec))

# 3. query: nearest neighbours by cosine distance (<=> operator)
def search(question, k=5):
    q = model.encode(question, normalize_embeddings=True)
    rows = conn.execute("""
        SELECT content, 1 - (embedding <=> %s) AS similarity
        FROM chunks
        ORDER BY embedding <=> %s
        LIMIT %s
    """, (q, q, k)).fetchall()
    return rows

add_chunk("doc1", "To cancel your subscription, open Account > Billing.")
add_chunk("doc1", "Refunds are issued within 30 days of purchase.")
conn.commit()

for content, sim in search("how do I stop being billed", k=2):
    print(f"{sim:.3f}  {content}")

0.612  To cancel your subscription, open Account > Billing.
0.287  Refunds are issued within 30 days of purchase.

Notice the query "how do I stop being billed" retrieved the cancellation chunk first, with a clearly higher similarity, even though it shares almost no words with it. That is semantic search via the embedding doing its job. (On a table this small the Postgres planner may simply scan the rows rather than use the HNSW index; the index earns its keep as the table grows.) The <=> operator is pgvector's cosine distance; 1 - distance converts it back to the cosine similarity from Chapter 01.

The operational reality vendors skip

Choosing an index is the fun part; operating it is the job. A few realities to plan for, because they don't appear in quickstart guides:

Backup and restore. Can you restore your index after a failure, and how long does it take? With pgvector, your normal Postgres backups cover it. With some dedicated databases, the backup/restore story is more involved than the "getting started" page suggests. Test a restore before you need one.
Metadata filtering. Real queries are "find chunks about billing from documents this user can see, updated this year." Filtering vector search by metadata is where databases differ sharply in both capability and speed. Test it on your real filters, not the demo's.
Updates and deletes. HNSW handles inserts gracefully but degrades with heavy deletions over time, sometimes needing a rebuild. If your corpus churns a lot, ask how the database handles it before you commit.
Memory. HNSW indexes are memory-hungry. A back-of-envelope estimate (vector count × dimension × bytes-per-number, plus graph overhead) tells you whether your index fits in RAM — and whether quantization from Chapter 04 just became necessary.

When this fails

Reaching for a dedicated DB too early. A new vector database is an entire new system to secure, back up, monitor, and pay for. At small scale, pgvector or even an in-memory index is simpler and just as fast. Don't take on the operational burden until scale demands it.
Tuning the index when the database is the problem. Endless HNSW parameter tweaking won't fix a service that falls over under load or can't restore. Diagnose which layer is failing first.
Ignoring metadata filtering until late. A database that's fast at pure vector search can crawl when you add a strict metadata filter, because the filter and the index fight each other. Test filtered queries early; they're what production actually runs.
Setting ef/probe blindly. HNSW's search-time ef and IVF's probe count directly trade recall for speed. Leaving them at defaults without measuring means you're either slower or less accurate than you needed to be. Tune against your eval set.
Letting deletions rot the index. A high-churn corpus on an un-maintained HNSW index slowly loses recall as tombstoned nodes accumulate. Schedule rebuilds if you delete a lot.
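For the "setting ef/probe blindly" failure, a minimal sketch of what tuning looks like with pgvector. It assumes the `chunks` table from the example earlier in this chapter, a psycopg connection with pgvector registered, and a NumPy query embedding. `hnsw.ef_search` is pgvector's search-time setting for HNSW (the IVFFlat counterpart is `ivfflat.probes`). The helper names are hypothetical.

```python
def search_with_ef(conn, query_vec, k=5, ef_search=100):
    """Run a nearest-neighbour query with an explicit HNSW ef_search value."""
    # set_config accepts bound parameters, unlike a plain SET statement.
    # The final `false` applies the value to the session, not just this transaction.
    conn.execute(
        "SELECT set_config('hnsw.ef_search', %s, false)",
        (str(int(ef_search)),),
    )
    return conn.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_vec, k),
    ).fetchall()

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

One way to use this: for each question in your eval set, compute the exact top-k once (for instance with the index disabled or by brute force), then sweep `ef_search` over a few values and record both `recall_at_k` and latency. Pick the smallest setting that meets your recall target.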

Practice — before you read the next chapter

Estimate your index size

Multiply your chunk count by your embedding dimension by 4 bytes (for float32), then add roughly 50% for HNSW graph overhead. Does the result fit comfortably in the RAM of the machine you're planning to use? This one calculation tells you whether you're in "pgvector on a small box" territory or "we need to think about quantization and dedicated infrastructure" territory.
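The arithmetic, written out as a small helper. The 50% graph overhead is this chapter's rule of thumb rather than a measured figure, and the function name is illustrative; real overhead depends on the index parameters and the database.

```python
def hnsw_memory_bytes(n_vectors, dim, bytes_per_value=4, graph_overhead=0.5):
    """Back-of-envelope index size: raw float32 vectors plus a graph allowance."""
    raw = n_vectors * dim * bytes_per_value
    return int(raw * (1 + graph_overhead))

GIB = 1024 ** 3
for n in (1_000_000, 50_000_000):
    size = hnsw_memory_bytes(n, dim=768)
    print(f"{n:>11,} vectors x 768 dims: {size / GIB:.1f} GiB")
```

At 768 dimensions, one million chunks comes to about 4.3 GiB and fifty million to roughly 215 GiB, which is the gap between a small box and a serious infrastructure conversation.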

Run the pgvector loop

If you have Postgres available, run the code above on fifty of your real chunks. Issue a few queries whose answers you know. Are the right chunks coming back at the top? You now have a working retrieval core in about forty lines — everything else in this series makes it better, but this is the heart.

Test a filtered query

Add a metadata column (say, a category or a date) to the table, populate it, and run a vector search that also filters on it. Time it against the unfiltered version. The difference is your first taste of the operational reality that decides database choice at scale.
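A possible starting point for this exercise, as a sketch: it assumes the `chunks` table and connection from the pgvector example, and a hypothetical `category` text column that you populate yourself. The helper names are illustrative.

```python
import time

def add_category_column(conn):
    """Add a metadata column and an ordinary B-tree index on it."""
    conn.execute("ALTER TABLE chunks ADD COLUMN IF NOT EXISTS category text")
    conn.execute("CREATE INDEX IF NOT EXISTS chunks_category ON chunks (category)")

def timed_search(conn, query_vec, k=5, category=None):
    """Return (rows, seconds). Filters on category when one is given."""
    # Only fixed SQL text is interpolated here; all values go through parameters.
    where = "WHERE category = %s" if category is not None else ""
    params = ([category] if category is not None else []) + [query_vec, k]
    sql = f"SELECT id, content FROM chunks {where} ORDER BY embedding <=> %s LIMIT %s"
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start
```

Call `timed_search` with and without `category` on the same question and compare both the timings and the number of rows returned. Run each variant several times, since the first call often pays for cold caches.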

Takeaways

Exact search doesn't scale; approximate nearest-neighbour trades a sliver of accuracy for a thousandfold speed-up, which is the right trade for RAG.
HNSW is a navigable graph with express layers — fast, accurate, memory-hungry, the sensible default. IVF clusters vectors and searches only nearby clusters — lighter, great at very large scale.
The index is the algorithm; the database is the production system around it. When something breaks, identify which layer is failing.
pgvector is the right default for most teams — already-running Postgres, up to a few million vectors. Graduate to a dedicated database only when measured need demands it.
Plan for the operational realities — backup/restore, metadata filtering, deletions, memory — because they, not the index math, decide your database choice.

Next chapter: Retrieval algorithms — vector, lexical, hybrid. Semantic search is powerful but it isn't the whole story — sometimes the exact keyword matters more than the meaning. We'll compare pure vector, keyword (BM25), and hybrid search with real numbers, and settle when each one wins.
