You now have millions of vectors. When a question arrives, you embed it and need to find the handful of stored vectors closest to it — and you need to do it in a few milliseconds, not by comparing against every vector one at a time. That is the job of the vector index: a data structure that makes "find the nearest neighbours" fast at a scale where the obvious approach would take minutes per query. This chapter is about how that structure works, the geometry beneath it, and the honest landscape of databases that provide it — including the unfashionable truth that you may not need a dedicated one at all.
The naive way to find the nearest vectors is to compute the similarity between the query and every single stored vector, then take the top few. This is called exact, or brute-force, search. It is perfectly correct and perfectly doomed: its cost grows linearly with the number of vectors. At a thousand chunks it's instant. At fifty million chunks, every query compares against fifty million vectors, and your "few milliseconds" becomes an unacceptable wait — multiplied by every concurrent user.
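To make the linear cost concrete, here is a minimal sketch of brute-force search in NumPy. The function name `brute_force_search` and the random data are illustrative assumptions, not part of any library; the vectors are assumed to be unit-normalised so that a dot product equals cosine similarity.

```python
import numpy as np

def brute_force_search(query, vectors, k=5):
    """Exact top-k by cosine similarity (assumes unit-normalised inputs)."""
    sims = vectors @ query           # one dot product per stored vector: O(N * d)
    top = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return top, sims[top]

# Toy corpus: 10,000 random unit vectors in 64 dimensions.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of stored vector 42.
query = vectors[42] + 0.01 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

ids, sims = brute_force_search(query, vectors, k=3)
```

Every call touches all rows of `vectors`, which is exactly the cost that stops scaling once the corpus reaches tens of millions of chunks.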
The escape is to give up a tiny bit of correctness for an enormous amount of speed. Approximate nearest-neighbour (ANN) search finds almost all of the true nearest neighbours (say, 98% of the ones brute force would have found) in a small fraction of the time, often orders of magnitude less at large scale. For RAG, that trade is almost always worth it: missing the occasional borderline chunk costs you far less than making every user wait. The index structures in this chapter are all ways of organising vectors so that ANN search can skip almost all of them.
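The "98% of the ones brute force would have found" figure is what is usually called recall. A minimal sketch of how you might compute it for one query, assuming you have the ids returned by an approximate index and the ids from an exact search (the helper name `recall_at_k` is hypothetical):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

# The approximate search found three of the four true neighbours.
score = recall_at_k(approx_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4])
print(score)  # 0.75
```

Averaging this over a representative set of queries gives the number you trade against latency when tuning an index.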
The dominant index in 2026 is HNSW, which stands for Hierarchical Navigable Small World. The name is intimidating; the idea is not. Imagine every vector is a city, and you draw roads connecting each city to its nearest neighbours. To find the city closest to a new point, you start anywhere and keep driving to whichever connected city is closer to your target, until no neighbour is closer. You've arrived, having visited a dozen cities instead of all fifty million.
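The road-trip picture can be sketched in a few lines. This is a toy, single-layer greedy walk under simplifying assumptions, not HNSW itself: the graph is built by brute force, the walk tracks only one current node (real HNSW keeps a list of candidates), and it can stop at a local optimum rather than the true nearest neighbour. The names `build_knn_graph` and `greedy_search` are illustrative.

```python
import numpy as np

def build_knn_graph(vectors, m=8):
    """Link each vector to its m nearest neighbours (brute force; toy only)."""
    d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a city is not its own neighbour
    return np.argsort(d, axis=1)[:, :m]  # shape (N, m): the "roads"

def greedy_search(query, vectors, graph, start=0):
    """Keep driving to whichever connected city is closer, until none is."""
    current = start
    current_dist = float(np.linalg.norm(vectors[current] - query))
    comparisons = 1
    while True:
        neighbours = graph[current]
        dists = np.linalg.norm(vectors[neighbours] - query, axis=1)
        comparisons += len(neighbours)
        best = int(np.argmin(dists))
        if dists[best] >= current_dist:   # no neighbour is closer: arrived
            return current, comparisons
        current, current_dist = int(neighbours[best]), float(dists[best])

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 8))
graph = build_knn_graph(vectors, m=8)
query = rng.normal(size=8)
found, comparisons = greedy_search(query, vectors, graph, start=0)
```

The value of `comparisons` is the number of distance computations the walk needed; compare it with `len(vectors)` to see how much of the corpus was skipped.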
The "hierarchical" part adds express layers on top, like a flight network above the roads: a sparse top layer lets you jump across the whole space in a few hops to get roughly close, then you descend to denser layers for local precision. Start with a long flight, finish with short drives. That layering is what keeps search fast even as the number of vectors grows enormous — search time grows with the logarithm of the corpus size, not linearly.
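One way to see why the upper layers are sparse: in the original HNSW design, each vector's top layer is drawn from an exponentially decaying distribution, so each layer holds roughly 1/M as many nodes as the one below it. The sketch below simulates only that layer assignment, assuming M = 16 and the commonly used normalisation 1/ln(M); the function name `assign_layer` is hypothetical and nothing here builds an actual graph.

```python
import math
import random

def assign_layer(m=16, rng=random):
    """Draw the highest layer a new vector will appear in."""
    m_l = 1.0 / math.log(m)          # level normalisation factor
    u = 1.0 - rng.random()           # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)   # floor of an exponential draw

rng = random.Random(0)
levels = [assign_layer(16, rng) for _ in range(100_000)]

# How many vectors are present in each layer (a vector in layer L is also in all layers below).
nodes_per_layer = [
    sum(1 for level in levels if level >= layer)
    for layer in range(max(levels) + 1)
]
print(nodes_per_layer)
```

Layer 0 contains every vector, and each layer above it thins out geometrically, which is what makes the first few hops cover so much ground.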
HNSW's trade-offs, plainly: it is fast and accurate, which is why it's everywhere. It uses a fair amount of memory (it stores all those connections), and it is awkward to update with heavy deletions. For most RAG systems those costs are acceptable and HNSW is the right default.
The other common approach, IVF (Inverted File index), works by clustering. Ahead of time, it groups all vectors into, say, a thousand clusters, each with a representative centre. At query time, it finds the few cluster centres closest to the query, then searches only the vectors in those clusters — ignoring the other 990-odd clusters entirely. You skip most of the corpus by deciding, up front, which neighbourhoods are worth visiting.
IVF uses less memory than HNSW and builds quickly, but it can be less accurate near cluster boundaries — a relevant vector sitting just across a cluster line from the query may be missed because its cluster wasn't searched. You tune how many clusters to probe: probe more for accuracy, fewer for speed. IVF shines at very large scale where HNSW's memory appetite becomes a problem.
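The cluster-then-probe idea fits in a short NumPy sketch. This is a toy under stated assumptions: a hand-rolled k-means with a fixed number of iterations, squared Euclidean distance, and illustrative names (`build_ivf`, `ivf_search`, `nprobe`) that are not any particular library's API.

```python
import numpy as np

def _nearest_centre(vectors, centres):
    """Index of the closest centre for every vector."""
    d = ((vectors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d, axis=1)

def build_ivf(vectors, n_clusters=32, n_iters=10, seed=0):
    """Cluster the vectors and record which vector ids fall in each cluster."""
    rng = np.random.default_rng(seed)
    centres = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):                      # plain k-means updates
        assign = _nearest_centre(vectors, centres)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centres[c] = members.mean(axis=0)
    assign = _nearest_centre(vectors, centres)    # final, consistent assignment
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centres, lists

def ivf_search(query, vectors, centres, lists, k=5, nprobe=4):
    """Search only the nprobe clusters whose centres are closest to the query."""
    probe = np.argsort(((centres - query) ** 2).sum(axis=1))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    dists = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(2_000, 16))
query = rng.normal(size=16)
centres, lists = build_ivf(vectors)
approx = ivf_search(query, vectors, centres, lists, k=5, nprobe=2)
```

Raising `nprobe` widens the search to more clusters: at `nprobe = len(centres)` the search degenerates to brute force and is exact, while small values are faster but can miss neighbours that sit just across a cluster boundary.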
Here is a distinction Chapter 01 insisted on, now made concrete. HNSW and IVF are index algorithms — the mathematics of organising vectors for fast search. A vector database is the production system wrapped around an index: it handles storage, replication, backups, metadata filtering, access control, concurrent updates, and an API. The index finds neighbours; the database makes that capability survivable in production.
Why care? Because when something goes wrong, the layer that's failing determines the fix. Slow search at small scale? That's an index-tuning problem. Search works but the service falls over under load, or you can't restore after an outage? Those are database problems, and no amount of index tuning touches them. Teams that conflate the two waste days tuning HNSW parameters when their actual problem is that they have no replication.
| Option | Shape | Reach for it when |
|---|---|---|
| pgvector | An extension to PostgreSQL. | You already run Postgres and have up to a few million vectors. Your vectors live beside your relational data; one database to operate. The pragmatic default for most teams. |
| Qdrant | Open-source, purpose-built, Rust. | You want a dedicated vector DB with strong metadata filtering, self-hosted or managed, without huge operational weight. |
| Pinecone | Fully managed, closed-source. | You want zero infrastructure to operate and will pay for that convenience. Fast to start; you're renting, not owning. |
| Weaviate | Open-source, feature-rich. | You want built-in hybrid search and modules, and don't mind more moving parts. |
| Milvus | Open-source, built for huge scale. | You have hundreds of millions of vectors and the team to run distributed infrastructure. |
| Elasticsearch / OpenSearch | Search engine with vector support. | You already run it for keyword search and want hybrid in one system (handy for Chapter 06). |
My take. Most teams reach for a dedicated vector database far too early. If you already run Postgres and have fewer than a few million chunks — which describes the large majority of real projects — pgvector is almost certainly the right answer. It keeps your vectors next to your application data, it's one less system to back up and secure, and it is plenty fast at that scale. Graduate to a dedicated vector database when you have measured a real reason to: tens of millions of vectors, or a filtering or scale need pgvector genuinely can't meet. "It feels more serious" is not a reason.
Here is the whole loop with pgvector: create a table with a vector column, build an HNSW index, insert chunks, and query. This is a complete, production-shaped starting point — not a toy.
```python
# pip install psycopg sentence-transformers pgvector
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim
conn = psycopg.connect("postgresql://localhost/ragdb")

# 1. table + HNSW index. vector(768) matches the model's dimension.
# The extension must exist before register_vector() can look up the vector type.
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        doc_id    text,
        content   text,
        embedding vector(768)
    )
""")
# cosine distance index; m and ef_construction trade build cost for quality
conn.execute("""
    CREATE INDEX IF NOT EXISTS chunks_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")

# 2. insert a chunk (normalise so cosine behaves)
def add_chunk(doc_id, content):
    vec = model.encode(content, normalize_embeddings=True)
    conn.execute(
        "INSERT INTO chunks (doc_id, content, embedding) VALUES (%s, %s, %s)",
        (doc_id, content, vec))

# 3. query: nearest neighbours by cosine distance (<=> operator)
def search(question, k=5):
    q = model.encode(question, normalize_embeddings=True)
    rows = conn.execute("""
        SELECT content, 1 - (embedding <=> %s) AS similarity
        FROM chunks
        ORDER BY embedding <=> %s
        LIMIT %s
    """, (q, q, k)).fetchall()
    return rows

add_chunk("doc1", "To cancel your subscription, open Account > Billing.")
add_chunk("doc1", "Refunds are issued within 30 days of purchase.")
conn.commit()

for content, sim in search("how do I stop being billed", k=2):
    print(f"{sim:.3f} {content}")
```

```
0.612 To cancel your subscription, open Account > Billing.
0.287 Refunds are issued within 30 days of purchase.
```
Notice the query "how do I stop being billed" retrieved the cancellation chunk first, with a clearly higher similarity, even though it shares almost no words with it — that is semantic search via the embedding doing its job, and the HNSW index finding it fast. The <=> operator is pgvector's cosine distance; 1 - distance converts it back to the cosine similarity from Chapter 01.
Choosing an index is the fun part; operating it is the job. A few realities to plan for, because they don't appear in quickstart guides:
HNSW's search-time ef parameter (hnsw.ef_search in pgvector) and IVF's probe count (ivfflat.probes in pgvector) directly trade recall for speed. Leaving them at defaults without measuring means you're either slower or less accurate than you needed to be. Tune against your eval set.

Multiply your chunk count by your embedding dimension by 4 bytes (for float32), then add roughly 50% for HNSW graph overhead. Does the result fit comfortably in the RAM of the machine you're planning to use? This one calculation tells you whether you're in "pgvector on a small box" territory or "we need to think about quantization and dedicated infrastructure" territory.
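The memory arithmetic is easy to script. A minimal sketch, assuming float32 storage and the rough 50% graph overhead quoted above (a rule of thumb, not a measured figure); the function name `estimate_index_memory_gb` is hypothetical and the result is in decimal gigabytes.

```python
def estimate_index_memory_gb(n_chunks, dim, bytes_per_value=4, graph_overhead=0.5):
    """Rough RAM estimate: raw vectors plus a fractional allowance for the HNSW graph."""
    raw_bytes = n_chunks * dim * bytes_per_value
    return raw_bytes * (1 + graph_overhead) / 1e9

small = estimate_index_memory_gb(5_000_000, 768)    # about 23 GB
large = estimate_index_memory_gb(50_000_000, 768)   # about 230 GB
print(f"{small:.2f} GB, {large:.2f} GB")
```

With 768-dimensional embeddings, five million chunks fit on a single well-provisioned machine, while fifty million push you toward quantization or dedicated infrastructure.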
If you have Postgres available, run the code above on fifty of your real chunks. Issue a few queries whose answers you know. Are the right chunks coming back at the top? You now have a working retrieval core in about forty lines — everything else in this series makes it better, but this is the heart.
Add a metadata column (say, a category or a date) to the table, populate it, and run a vector search that also filters on it. Time it against the unfiltered version. The difference is your first taste of the operational reality that decides database choice at scale.
Next chapter: Retrieval algorithms — vector, lexical, hybrid. Semantic search is powerful but it isn't the whole story — sometimes the exact keyword matters more than the meaning. We'll compare pure vector, keyword (BM25), and hybrid search with real numbers, and settle when each one wins.