Building a RAG Pipeline from a PDF Library — Chunking, Embeddings, Retrieval

2026-05-24 · 8 min read

You've converted a stack of PDFs to Markdown and now you want to ask questions across all of them — "what did the 2024 board minutes say about R&D budget?", "which papers cite the Hofmann-Walters method?", "find every contract clause about data residency". The technique that makes this work in 2026 is retrieval-augmented generation (RAG): you embed your documents into a vector index, retrieve the most relevant chunks at query time, and let the LLM answer from those chunks.

This guide walks through the pipeline end to end, with the specific choices that matter for documents that came out of PDFs.

The pipeline at a glance

A working RAG pipeline has six stages:

1. Convert PDFs to clean Markdown. See the PDF-to-LLM workflow guide for the full conversion side.
2. Chunk the Markdown into passages of a few hundred tokens.
3. Embed each chunk into a vector with an embedding model.
4. Store the vectors in an index (FAISS, Chroma, Pinecone, pgvector, etc.).
5. Retrieve the top-K chunks at query time by embedding the question and finding nearest neighbors.
6. Generate an answer with the LLM, passing the retrieved chunks as context.

Most real-world RAG failures trace back to steps 2 and 5. Steps 1, 3, 4, and 6 are well understood; chunking and retrieval are where your decisions matter.

Why PDF-sourced RAG is different

RAG on clean text (web pages, plain-text documentation) works out of the box with most off-the-shelf libraries. RAG on PDF-sourced content has extra failure modes:

Mid-sentence chunk boundaries caused by page breaks in the original PDF. A converter that doesn't strip page numbers leaves "Page 7" floating in the middle of sentences.
Tables flattened into text that no longer look like tables to the embedding model. A row of numbers without column headers is meaningless out of context.
Headers and footers repeated on every page pull each chunk's embedding toward the boilerplate instead of the content.
Figures and equations that were images in the source PDF become either alt text, OCR'd text, or nothing — each behaves differently in retrieval.
Citation noise in academic PDFs — [12, 13, 14] inline references that have no meaning without the bibliography.

Address these in the conversion step. The cleaner the Markdown, the better every downstream stage works.
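As a rough illustration of that cleanup, here is a minimal sketch that assumes your converter can give you the text of each page as a separate string. The function name `clean_pages`, the 60% repetition threshold, and the two regular expressions are illustrative assumptions, not part of any particular converter. It drops lines that recur on most pages (running headers and footers), drops "Page N" lines, and optionally strips numeric citation markers.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions): "Page 7" / "Page 7 of 90" lines, and [12] / [12, 13] markers.
PAGE_NUMBER = re.compile(r"^page\s+\d+(\s+of\s+\d+)?$", re.IGNORECASE)
CITATION = re.compile(r"\s*\[\d+(?:\s*[,\-–]\s*\d+)*\]")

def clean_pages(pages, repeat_fraction=0.6, strip_citations=True):
    """Join per-page text into one string, dropping page-level boilerplate."""
    # Count on how many pages each distinct non-empty line occurs.
    seen_on = Counter()
    for page in pages:
        seen_on.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # A line present on most pages is treated as a running header or footer.
    min_pages = max(2, int(len(pages) * repeat_fraction))
    boilerplate = {ln for ln, n in seen_on.items() if n >= min_pages}

    kept = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped in boilerplate or PAGE_NUMBER.match(stripped):
                continue
            kept.append(line.rstrip())

    # Pages are joined without a blank line, so a sentence split by a page break
    # ends up in a single Markdown paragraph again.
    text = "\n".join(kept)
    # Caution: this also removes bracketed numbers that are not citations.
    return CITATION.sub("", text) if strip_citations else text
```

On short documents the threshold is aggressive (any line seen on two pages is dropped), so check the output on a sample before running it over the whole library.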

Chunking strategy

Chunking is the highest-leverage decision in the pipeline. Two patterns work well, one is usable only in narrow cases, and one is an anti-pattern.

Fixed-token chunks (300-500 tokens, 50-token overlap) are the default in most libraries. Reliable, predictable, no surprises. The 50-token overlap is important: a sentence that straddles a chunk boundary usually still appears whole in one of the two chunks, instead of being cut in half with neither half retrievable on its own.
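The mechanics of a fixed-size window with overlap fit in a few lines. This is a minimal sketch, not any library's implementation: `chunk_tokens` is a made-up name, and it expects a list of tokens from whatever tokenizer you use (splitting on whitespace is only a crude stand-in for a real tokenizer).

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks
```

For example, `chunk_tokens(text.split(), size=400, overlap=50)` yields chunks in which the last 50 tokens of one chunk are the first 50 of the next.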

Heading-aware chunks — split at Markdown headings (#, ##, ###), then sub-split chunks longer than the token cap. This works much better for structured documents like reports, manuals, and academic papers — the headings carry meaning the embedding model can use. This is the right default when your source PDFs have a clear structure.

Sentence or paragraph chunks — split at sentence or paragraph boundaries. Theoretically clean, but produces wildly varying chunk sizes (a one-sentence paragraph vs. a 600-word legal clause), which makes retrieval results unpredictable. Avoid for general use.

Single-page chunks — the anti-pattern. Sounds reasonable ("one chunk per page") but page boundaries in PDFs are almost never semantically meaningful. A sentence routinely splits across pages; a paragraph splits even more often. Don't chunk by page.

A reasonable default in code:

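# Requires the langchain-text-splitters package (older LangChain releases
# exposed the same classes under langchain.text_splitter).
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# First pass: split on headings. Each heading is stored in the chunk's metadata as h1/h2/h3.
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
heading_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)

# Second pass: cap the size. chunk_size and chunk_overlap are counted in characters,
# not tokens: 1500 characters is very roughly 375 English tokens, 200 roughly 50.
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)

# markdown_text is the converted Markdown of one PDF.
heading_chunks = heading_splitter.split_text(markdown_text)
# split_documents (rather than split_text on page_content) keeps the heading metadata.
final_chunks = size_splitter.split_documents(heading_chunks)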

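The first pass keeps semantically related text together; the second pass caps the chunk size. Note that RecursiveCharacterTextSplitter counts characters by default, so chunk_size=1500 is roughly 375 tokens of English text: inside the 300-500 token range above and far below the input limit of the embedding models discussed next.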
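To see what the heading pass does without a library dependency, here is a standard-library sketch of the same idea. The function name `split_by_headings` and the output shape (a list of dicts with `headings` and `text`) are assumptions for illustration. It tracks the current heading path and attaches it to each section.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")  # matches #, ## and ### headings

def split_by_headings(markdown):
    """Return [{'headings': [...], 'text': ...}] sections split at Markdown headings."""
    sections, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            sections.append({"headings": list(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then record this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return sections
```

This sketch does not skip fenced code blocks, so a `#` comment at the start of a line inside a fence would be mistaken for a heading; a real splitter needs to handle that case.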

Choosing an embedding model

The embedding model converts text to a vector. The major options in 2026:

OpenAI text-embedding-3-small — 1536 dimensions, cheap (~$0.02 per million tokens), reliable for English and most major languages. The pragmatic default.
OpenAI text-embedding-3-large has 3072 dimensions and modestly better recall on hard queries, at roughly 6.5× the price (~$0.13 per million tokens at list price). Worth it for high-value retrieval (legal, medical) where missing a relevant chunk is expensive.
voyage-3-large or voyage-3-lite — competitive quality, sometimes better on technical text, separate API.
Cohere embed-multilingual-v3 — strong on non-English, the best off-the-shelf option for mixed-language corpora.
Local: bge-large-en-v1.5 or nomic-embed-text-v1.5 — runnable on a single GPU, good enough for many cases, no per-call cost. The right pick for sensitive documents that can't leave your infrastructure.
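For the hosted models, a back-of-envelope cost estimate is simple arithmetic. The sketch below uses the ~$0.02 per million tokens figure quoted above; the library size (200,000 chunks of about 400 tokens) is a made-up example, and `embedding_cost_usd` is an illustrative helper, not a vendor API.

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, usd_per_million_tokens):
    """Estimate the one-off cost of embedding a corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical library: 200,000 chunks of ~400 tokens at ~$0.02 per million tokens.
cost = embedding_cost_usd(200_000, 400, 0.02)
print(f"${cost:.2f}")
```

At these prices the embedding bill for a personal library is small, which is one reason model quality and data sensitivity usually matter more than cost.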

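A common mistake: re-embedding the entire library every time you change anything. Store the original chunk text alongside each vector. Changes to retrieval logic (filters, top-K, re-ranking) then need no re-embedding at all, a chunking tweak only requires embedding the chunks whose text actually changed, and only a switch of embedding model forces a full re-embed.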
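One way to avoid needless re-embedding is to cache vectors under a hash of the model name plus the chunk text. The sketch below is an assumption-laden illustration: `EmbeddingCache` is a made-up class, `embed_fn` stands in for whatever embedding call you use, and the in-memory dict would be a file or database table in practice.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by (model name, chunk text) so unchanged chunks are never re-embedded."""

    def __init__(self, embed_fn, model_name):
        self.embed_fn = embed_fn      # placeholder for your real embedding call
        self.model_name = model_name
        self.store = {}               # in production: a persistent key-value store

    def _key(self, text):
        # Including the model name means a model switch misses the cache, as it should.
        payload = f"{self.model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text):
        key = self._key(text)
        if key not in self.store:
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

After a chunking change, running every new chunk through `cache.embed` only calls the model for chunks whose text is new.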

Metadata: the unsung hero

The most underrated lever in RAG is metadata filtering. Each chunk should carry:

Source PDF filename and absolute path — so the user can open the original.
Page number — for citations like "see page 47".
Document date — if the corpus spans time. Critical for "what was the policy in 2023?" queries.
Section or chapter title — comes free from the heading-aware chunking step.
Document type or tag — manuals vs. policies vs. meeting notes. Lets the user filter by type.

Storing metadata is cheap; using it well is high-leverage. A query like "find policies from 2024" becomes a metadata filter (type='policy' AND date >= '2024-01-01') plus a semantic search, not pure semantic search over the whole corpus.
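The filter-then-search pattern looks roughly like this in plain Python. It is a sketch over an in-memory list, with assumed field names (`type`, `date`, `vector`) and a made-up function name `filtered_search`; a real vector store would apply the filter inside the index.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(chunks, query_vec, doc_type, not_before, top_k=8):
    """Apply the metadata filter first, then rank the survivors by similarity."""
    eligible = [c for c in chunks if c["type"] == doc_type and c["date"] >= not_before]
    ranked = sorted(eligible, key=lambda c: cosine(c["vector"], query_vec), reverse=True)
    return ranked[:top_k]
```

A call such as `filtered_search(chunks, q_vec, "policy", date(2024, 1, 1))` corresponds to the "find policies from 2024" example: the semantic ranking only ever sees 2024 policies.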

Storing vectors

The vector store choices:

FAISS (Facebook AI Similarity Search) is an in-process library rather than a database: the index lives in memory and you save it to a file. Fast, no server needed, and good for prototypes and single-user tools. It has no metadata filtering out of the box; you maintain a separate metadata sidecar.
Chroma — Python-native, file-based or server, metadata filtering built in. The pragmatic default for small-to-medium personal projects.
pgvector (Postgres extension) — if you already have Postgres, this is the lowest-friction option. SQL filters compose naturally with vector search.
Qdrant — open-source, server-based, scales to billions of vectors. Good for production deployments.
Pinecone — managed service, no ops, billed per usage. Reasonable if you're avoiding any infrastructure.

For a personal library of a few thousand PDFs, Chroma is usually right. For organizational deployments with many users and complex permissions, pgvector or Qdrant.

Retrieval

A first-pass retrieval setup:

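def retrieve(question, top_k=8):
    # `embed` and `vector_store` are placeholders for your embedding call and index client.
    # The question must be embedded with the same model used for the chunks.
    q_vec = embed(question)
    return vector_store.search(q_vec, k=top_k)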

This works for simple queries. To make it work for hard queries, layer on:

Query rewriting. Send the user's question to a fast LLM with the prompt "rewrite this as 3 different search queries that would surface relevant passages." Embed each rewrite, retrieve top-K for each, union the results. Catches the case where the user's phrasing doesn't match the document's phrasing.
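The union step can be sketched as follows. `rewrite_fn` (the LLM call that returns rewritten queries) and `search_fn` (embed the query and return `(chunk_id, score)` pairs) are placeholders you would supply. Keeping the original question in the query list and keeping each chunk's best score are choices made for this sketch, not something the technique requires.

```python
def multi_query_retrieve(question, rewrite_fn, search_fn, top_k=8):
    """Retrieve for the question and each rewrite, then union the results."""
    queries = [question] + list(rewrite_fn(question))
    best = {}  # chunk_id -> highest score seen across all queries
    for query in queries:
        for chunk_id, score in search_fn(query, top_k):
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    # Deduplicated union, best-scoring chunks first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

The union can hold up to `top_k` results per query, so it usually needs trimming or re-ranking before it goes into the prompt.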

Hybrid search. Combine vector search with keyword search (BM25). Vector search handles paraphrase; BM25 handles rare proper nouns and exact phrases. Most modern vector stores (Qdrant, Elasticsearch, Weaviate) ship hybrid search; for FAISS or Chroma you bolt on a separate BM25 index.
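If you do bolt on a separate BM25 index, you need a way to merge the two ranked lists. One common option is reciprocal rank fusion (RRF), which only needs ranks, not comparable scores. The text above does not prescribe a fusion method, so treat this as one reasonable choice; `k=60` is the conventional constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items ranked high in several lists win.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]   # ids from vector search, best first
keyword_hits = ["c", "a", "d"]  # ids from BM25, best first
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Chunks that appear in both lists ("a" and "c" here) rise above chunks found by only one retriever.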

Re-ranking. After retrieving the top 30 candidates, run them through a cross-encoder re-ranker (bge-reranker-large, Voyage rerank-2, Cohere Rerank) that scores each candidate against the question. Take the top 8 of the re-ranked list. This typically improves precision noticeably, at the cost of added latency.
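The two-stage shape is easy to sketch with a pluggable scoring function. Here `score_fn` is where a cross-encoder would go; `overlap_score` is only a toy lexical stand-in so the example runs without a model, and both names are illustrative.

```python
def rerank(question, candidates, score_fn, keep=8):
    """Score every candidate passage against the question and keep the best `keep`."""
    scored = [(score_fn(question, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]

def overlap_score(question, passage):
    """Toy scorer: fraction of question words present in the passage.
    A stand-in for a cross-encoder, not a substitute for one."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0
```

In a real pipeline you would pass the 30 retrieved passages as `candidates` and a function wrapping your re-ranker model as `score_fn`.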

For most personal RAG projects, hybrid search + re-ranking is the highest-impact addition once basic vector search is working.

Generation

The final stage is feeding the retrieved chunks to the LLM. The prompt that works:

Answer the user's question using ONLY the passages below.
Cite each fact with [source: filename, page N].
If the passages don't contain the answer, say "I don't have that information in the provided documents."
Do not use prior knowledge outside the passages.

Passages:
{retrieved chunks with their metadata}

Question:
{user question}

The "ONLY the passages below" instruction is what stops the model from helpfully hallucinating an answer based on its training data. The citation instruction is what lets the user verify any specific claim — without citations, users have no way to spot when retrieval missed the right chunk.
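Assembling that prompt from retrieved chunks might look like the sketch below. It assumes each chunk is a dict with `filename`, `page`, and `text` keys (the metadata fields described earlier); `build_prompt` is an illustrative helper name.

```python
PROMPT_TEMPLATE = """\
Answer the user's question using ONLY the passages below.
Cite each fact with [source: filename, page N].
If the passages don't contain the answer, say "I don't have that information in the provided documents."
Do not use prior knowledge outside the passages.

Passages:
{passages}

Question:
{question}
"""

def build_prompt(question, chunks):
    """Format retrieved chunks, each labeled with its source, into the final prompt."""
    passages = "\n\n".join(
        f"[source: {c['filename']}, page {c['page']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(passages=passages, question=question)
```

Labeling each passage in the same `[source: filename, page N]` form the instructions ask for gives the model a citation it can copy verbatim.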

What goes wrong, and how to debug it

Most RAG problems present as "the answer is wrong" or "the model said it didn't know." The diagnostic path:

Run the question through retrieval only. Did the right chunk come back in the top-K? If not, your retrieval is broken. Try query rewriting, hybrid search, or a different embedding model.
If the right chunk came back, look at what else came back. Are irrelevant chunks crowding out the relevant one? Add re-ranking or increase top-K with re-ranking on top.
If only the right chunks are in context but the answer is still wrong, look at the chunk content. Is the chunk truncated mid-sentence? Did the table get flattened into meaningless text? Fix the chunking or the upstream conversion.
If the chunk is clean and complete and the answer is still wrong, the model misread the chunk. This is rare with frontier models; usually it means the question was ambiguous or the chunk needs more surrounding context.

Build a small evaluation set early: 20–50 question/expected-source pairs you can re-run after every change. Without it, you're tuning blind.
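A retrieval-only evaluation loop over such a set can be very small. In this sketch `retrieve_fn(question, k)` is a placeholder that should return the source identifiers (for example filenames) of the top-k chunks, and `hit_rate_at_k` is an illustrative name.

```python
def hit_rate_at_k(eval_set, retrieve_fn, k=8):
    """Return (fraction of questions whose expected source was retrieved, missed questions)."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    misses = []
    for question, expected_source in eval_set:
        if expected_source not in retrieve_fn(question, k):
            misses.append(question)
    hits = len(eval_set) - len(misses)
    return hits / len(eval_set), misses
```

The list of missed questions is the useful part: it tells you which queries to run through the diagnostic path above.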

A note on PDFs specifically

The single highest-leverage thing you can do for PDF-based RAG is improve the conversion. A pipeline that runs OCR on every scanned page, preserves tables as structured Markdown, drops repeated headers/footers, and tags each chunk with the source page number outperforms a fancier retrieval setup over messy text every time.

Spend your first day on the conversion and the chunking. Spend your second day on retrieval tuning. Most projects that struggle with RAG quality under-invested in step 1.
