Building a RAG Pipeline from a PDF Library — Chunking, Embeddings, Retrieval

You've converted a stack of PDFs to Markdown and now you want to ask questions across all of them — "what did the 2024 board minutes say about R&D budget?", "which papers cite the Hofmann-Walters method?", "find every contract clause about data residency". The technique that makes this work in 2026 is retrieval-augmented generation (RAG): you embed your documents into a vector index, retrieve the most relevant chunks at query time, and let the LLM answer from those chunks.

This guide walks through the pipeline end to end, with the specific choices that matter for documents that came out of PDFs.

The pipeline at a glance

A working RAG pipeline has six stages:

  1. Convert PDFs to clean Markdown. See the PDF-to-LLM workflow guide for the full conversion side.
  2. Chunk the Markdown into passages of a few hundred tokens.
  3. Embed each chunk into a vector with an embedding model.
  4. Store the vectors in an index (FAISS, Chroma, Pinecone, pgvector, etc.).
  5. Retrieve the top-K chunks at query time by embedding the question and finding nearest neighbors.
  6. Generate an answer with the LLM, passing the retrieved chunks as context.

Most of the failure modes in real-world RAG come from steps 2 and 5. Steps 1, 3, 4, and 6 are well-understood; chunking and retrieval are where decisions matter.

Why PDF-sourced RAG is different

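RAG on clean text (web pages, plain-text documentation) works out of the box with most off-the-shelf libraries. RAG on PDF-sourced content has extra failure modes: scanned pages that need OCR, tables flattened into meaningless text, headers and footers repeated on every page, and sentences or paragraphs cut at page boundaries.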

Address these in the conversion step. The cleaner the Markdown, the better every downstream stage works.

Chunking strategy

Chunking is the highest-leverage decision in the pipeline. Of the four common patterns, two work well, one works only in narrow cases, and one is an anti-pattern.

Fixed-token chunks (300-500 tokens, 50-token overlap): the default in most libraries. Reliable, predictable, no surprises. The 50-token overlap matters: when a topic transition falls on a chunk boundary, the overlap keeps the surrounding sentences intact in at least one chunk instead of splitting them across two.
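To make the overlap concrete, here is a minimal sketch of fixed-size chunking over a token list. The function name `chunk_tokens` is hypothetical, and plain word-like strings stand in for the output of a real tokenizer, which is an assumption of this example.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, chunk_size=400, overlap=50)
print([len(c) for c in chunks])  # [400, 400, 300]
```

The last 50 tokens of each chunk reappear as the first 50 tokens of the next one, so a sentence that straddles a boundary is still whole in one of the two chunks.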

Heading-aware chunks — split at Markdown headings (#, ##, ###), then sub-split chunks longer than the token cap. This works much better for structured documents like reports, manuals, and academic papers — the headings carry meaning the embedding model can use. This is the right default when your source PDFs have a clear structure.

Sentence or paragraph chunks — split at sentence or paragraph boundaries. Theoretically clean, but produces wildly varying chunk sizes (a one-sentence paragraph vs. a 600-word legal clause), which makes retrieval results unpredictable. Avoid for general use.

Single-page chunks — the anti-pattern. Sounds reasonable ("one chunk per page") but page boundaries in PDFs are almost never semantically meaningful. A sentence routinely splits across pages; a paragraph splits even more often. Don't chunk by page.

A reasonable default in code:

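# Requires the langchain-text-splitters package; older LangChain releases
# exposed the same classes under langchain.text_splitter.
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

# markdown_text is assumed to hold the converted Markdown of one document.
headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
heading_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
# chunk_size and chunk_overlap are counted in characters here, not tokens;
# 1500 characters is roughly 300-400 tokens of English text.
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)

# First pass: one Document per heading section, with the headings in metadata.
heading_chunks = heading_splitter.split_text(markdown_text)
# Second pass: split oversized sections; split_documents keeps the heading
# metadata (h1/h2/h3) attached to every resulting chunk.
final_chunks = size_splitter.split_documents(heading_chunks)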

The first pass keeps semantically related text together; the second pass keeps any individual chunk from exceeding the embedding model's context window.

Choosing an embedding model

The embedding model converts text to a vector. The major options in 2026:

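A common mistake: re-embedding the entire library after every pipeline tweak. Store the original chunk text alongside the vector. Changes to retrieval logic then need no re-embedding at all, and a full re-embed is only required when you change the embedding model. If you change the chunking, only the chunks whose text actually changed need new vectors.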
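One way to avoid needless re-embedding is to cache vectors under a hash of the chunk text. The sketch below assumes an in-memory dict as the cache; `embed_chunks`, `fake_embed`, and the model name `demo-model` are hypothetical names for illustration, not part of any library.

```python
import hashlib


def embed_chunks(chunks, cache, embed, model_name):
    """Return one vector per chunk, calling `embed` only for text not yet cached."""
    vectors = []
    for text in chunks:
        # Key on model name plus text so a model change invalidates old vectors.
        key = hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = {"text": text, "vector": embed(text)}
        vectors.append(cache[key]["vector"])
    return vectors


calls = []  # records which texts actually reached the embedding function


def fake_embed(text):
    # Hypothetical stand-in for a real embedding model call.
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache = {}
embed_chunks(["alpha beta", "gamma"], cache, fake_embed, "demo-model")
embed_chunks(["alpha beta", "gamma delta"], cache, fake_embed, "demo-model")
print(calls)  # ['alpha beta', 'gamma', 'gamma delta']
```

The second run re-chunks the library, but only the one chunk whose text is new triggers an embedding call.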

Metadata: the unsung hero

The most underrated lever in RAG is metadata filtering. Each chunk should carry at least its source filename, its page number, the document type, and the document date.

Storing metadata is cheap; using it well is high-leverage. A query like "find policies from 2024" becomes a metadata filter (type='policy' AND date >= '2024-01-01') plus a semantic search, not pure semantic search over the whole corpus.
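The filter-then-search idea can be sketched in plain Python. The chunk fields (`type`, `date`, `vector`) and the function `filtered_search` are assumptions of this example; a real vector store would apply the filter inside its own query API.

```python
import math
from datetime import date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, doc_type, min_date, top_k=3):
    """Apply the metadata filter first, then rank the survivors by similarity."""
    allowed = [c for c in chunks if c["type"] == doc_type and c["date"] >= min_date]
    allowed.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return allowed[:top_k]


# Toy two-dimensional vectors; real embeddings have hundreds of dimensions.
library = [
    {"id": "policy-2023", "type": "policy", "date": date(2023, 5, 1), "vector": [1.0, 0.0]},
    {"id": "policy-2024", "type": "policy", "date": date(2024, 3, 1), "vector": [0.8, 0.6]},
    {"id": "minutes-2024", "type": "minutes", "date": date(2024, 6, 1), "vector": [1.0, 0.0]},
    {"id": "policy-2025", "type": "policy", "date": date(2025, 1, 15), "vector": [0.0, 1.0]},
]
hits = filtered_search([1.0, 0.0], library, doc_type="policy", min_date=date(2024, 1, 1))
print([h["id"] for h in hits])  # ['policy-2024', 'policy-2025']
```

Note that the 2023 policy and the 2024 minutes are the closest vectors to the query, yet neither is returned, because the filter removes them before similarity is ever computed.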

Storing vectors

The vector store choices include FAISS, Chroma, Pinecone, pgvector, and Qdrant.

For a personal library of a few thousand PDFs, Chroma is usually right. For organizational deployments with many users and complex permissions, pgvector or Qdrant.

Retrieval

A first-pass retrieval setup:

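def retrieve(question, top_k=8):
    # embed() and vector_store stand for your embedding call and index client.
    q_vec = embed(question)  # must be the same model that embedded the chunks
    candidates = vector_store.search(q_vec, k=top_k)  # nearest neighbors, best first
    return candidates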

This works for simple queries. To make it work for hard queries, layer on:

Query rewriting. Send the user's question to a fast LLM with the prompt "rewrite this as 3 different search queries that would surface relevant passages." Embed each rewrite, retrieve top-K for each, union the results. Catches the case where the user's phrasing doesn't match the document's phrasing.
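A minimal sketch of the union step, assuming two caller-supplied functions: `rewrite(question)` returns alternative queries (in practice an LLM call) and `search(query, top_k)` returns `(chunk_id, score)` pairs. Both names and both stubs are hypothetical. This version also searches the original question, which is a design choice rather than a requirement.

```python
def multi_query_retrieve(question, rewrite, search, top_k=8):
    """Retrieve for the question and each rewrite, then union results by chunk id."""
    queries = [question] + list(rewrite(question))
    best = {}  # chunk id -> highest score seen across all queries
    for query in queries:
        for chunk_id, score in search(query, top_k):
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


def demo_rewrite(question):
    # Stand-in for an LLM that rephrases the question.
    return ["R&D budget 2024", "research spending board minutes"]


def demo_search(query, top_k):
    # Stand-in for a vector search returning (chunk_id, score) pairs.
    index = {
        "what did the board say about R&D budget?": [("c1", 0.70)],
        "R&D budget 2024": [("c1", 0.90), ("c2", 0.60)],
        "research spending board minutes": [("c3", 0.80)],
    }
    return index.get(query, [])[:top_k]


results = multi_query_retrieve(
    "what did the board say about R&D budget?", demo_rewrite, demo_search
)
print(results)  # [('c1', 0.9), ('c3', 0.8), ('c2', 0.6)]
```

The original phrasing alone finds one chunk; the rewrites surface two more.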

Hybrid search. Combine vector search with keyword search (BM25). Vector search handles paraphrase; BM25 handles rare proper nouns and exact phrases. Most modern vector stores (Qdrant, Elasticsearch, Weaviate) ship hybrid search; for FAISS or Chroma you bolt on a separate BM25 index.
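The text above does not say how to merge the two result lists. One common option is reciprocal rank fusion (RRF), which needs only the rank of each chunk in each list, so vector distances and BM25 scores never have to share a scale. The sketch below is one possible merge under that assumption; the chunk ids and both rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids; each list is ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it ranks in each list.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


vector_hits = ["c2", "c1", "c4"]  # hypothetical vector-search order
bm25_hits = ["c1", "c3", "c2"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c2', 'c3', 'c4']
```

Chunks that appear in both lists (`c1`, `c2`) rise to the top, while chunks found by only one method are kept but ranked lower.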

Re-ranking. After retrieving the top 30 candidates, run them through a cross-encoder re-ranker (bge-reranker-large, Voyage rerank-1, Cohere Rerank) that scores each candidate against the question, then keep the top 8 of the re-ranked list. Because a cross-encoder reads the question and the passage together, it judges relevance more precisely than vector distance alone. The cost is latency: one model pass per candidate.
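The retrieve-wide, keep-narrow shape of re-ranking can be sketched without any model. Here `score_pair` is a hypothetical callable standing in for a cross-encoder, and `overlap_score` is a deliberately crude toy scorer used only so the example runs.

```python
def rerank(question, candidates, score_pair, keep=8):
    """Score every (question, passage) pair and keep the best `keep` passages."""
    scored = [(score_pair(question, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def overlap_score(question, passage):
    # Toy stand-in for a cross-encoder: fraction of question words in the passage.
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "The cafeteria menu changes weekly.",
    "Data residency clause: customer data stays in the EU.",
    "The residency program accepts ten doctors.",
]
top = rerank("data residency clause", candidates, overlap_score, keep=2)
print(top[0])  # Data residency clause: customer data stays in the EU.
```

Swapping `overlap_score` for a real cross-encoder call leaves `rerank` itself unchanged.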

For most personal RAG projects, hybrid search + re-ranking is the highest-impact addition once basic vector search is working.

Generation

The final stage is feeding the retrieved chunks to the LLM. The prompt that works:

Answer the user's question using ONLY the passages below.
Cite each fact with [source: filename, page N].
If the passages don't contain the answer, say "I don't have that information in the provided documents."
Do not use prior knowledge outside the passages.

Passages:
{retrieved chunks with their metadata}

Question:
{user question}

The "ONLY the passages below" instruction is what stops the model from helpfully hallucinating an answer based on its training data. The citation instruction is what lets the user verify any specific claim — without citations, users have no way to spot when retrieval missed the right chunk.
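A sketch of assembling that prompt from retrieved chunks. The chunk fields `filename`, `page`, and `text` are assumed names for this example. Each passage is labelled in the same `[source: ...]` form the model is asked to cite, so it can copy the label verbatim.

```python
PROMPT_TEMPLATE = """Answer the user's question using ONLY the passages below.
Cite each fact with [source: filename, page N].
If the passages don't contain the answer, say "I don't have that information in the provided documents."
Do not use prior knowledge outside the passages.

Passages:
{passages}

Question:
{question}"""


def build_prompt(question, chunks):
    """Render retrieved chunks, each labelled with its source, into the prompt."""
    blocks = [
        f"[source: {c['filename']}, page {c['page']}]\n{c['text']}" for c in chunks
    ]
    return PROMPT_TEMPLATE.format(passages="\n\n".join(blocks), question=question)


prompt = build_prompt(
    "What is the R&D budget?",
    [{"filename": "minutes-2024.md", "page": 3, "text": "The board approved the R&D budget."}],
)
print(prompt)
```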

What goes wrong, and how to debug it

Most RAG problems present as "the answer is wrong" or "the model said it didn't know." The diagnostic path:

  1. Run the question through retrieval only. Did the right chunk come back in the top-K? If not, your retrieval is broken. Try query rewriting, hybrid search, or a different embedding model.
  2. If the right chunk came back, look at what else came back. Are irrelevant chunks crowding out the relevant one? Add re-ranking or increase top-K with re-ranking on top.
  3. If only the right chunks are in context but the answer is still wrong, look at the chunk content. Is the chunk truncated mid-sentence? Did the table get flattened into meaningless text? Fix the chunking or the upstream conversion.
  4. If the chunk is clean and complete and the answer is still wrong, the model misread the chunk. This is rare with frontier models; usually it means the question was ambiguous or the chunk needs more surrounding context.

Build a small evaluation set early: 20–50 question/expected-source pairs you can re-run after every change. Without it, you're tuning blind.
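A minimal sketch of such an evaluation loop, assuming each retrieved chunk is a dict with a `source` field and that `retrieve(question, top_k)` returns a list of them. The metric is a simple hit rate; `hit_rate`, `demo_retrieve`, and the sample questions are hypothetical.

```python
def hit_rate(eval_set, retrieve, top_k=8):
    """Fraction of questions whose expected source appears in the top-k results."""
    hits = 0
    for question, expected_source in eval_set:
        sources = [chunk["source"] for chunk in retrieve(question, top_k)]
        if expected_source in sources:
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0


def demo_retrieve(question, top_k):
    # Stand-in retriever returning chunks tagged with their source file.
    canned = {
        "What was the 2024 R&D budget?": [{"source": "minutes-2024.md"}, {"source": "budget.md"}],
        "Which clause covers data residency?": [{"source": "handbook.md"}],
    }
    return canned.get(question, [])[:top_k]


eval_set = [
    ("What was the 2024 R&D budget?", "minutes-2024.md"),
    ("Which clause covers data residency?", "contract-acme.md"),
]
score = hit_rate(eval_set, demo_retrieve)
print(score)  # 0.5
```

Re-running this after each chunking or retrieval change shows whether the change helped or hurt, and the missed questions point at which documents to inspect.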

A note on PDFs specifically

The single highest-leverage thing you can do for PDF-based RAG is improve the conversion. A pipeline that runs OCR on every scanned page, preserves tables as structured Markdown, drops repeated headers/footers, and tags each chunk with the source page number outperforms a fancier retrieval setup over messy text every time.
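As one illustration of that cleanup, repeated headers and footers can often be detected by frequency: a line that recurs on most pages is probably not body text. The sketch below assumes each page is available as a list of text lines; the function name and the 60% threshold are arbitrary choices for this example.

```python
from collections import Counter


def drop_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that recur on at least `min_fraction` of pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page if line.strip()})
    # Require at least two pages so a single-page document is left untouched.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [[line for line in page if line.strip() not in repeated] for page in pages]


pages = [
    ["ACME Annual Report", "Revenue grew.", "Page 1"],
    ["ACME Annual Report", "Costs fell.", "Page 2"],
    ["ACME Annual Report", "Outlook is stable.", "Page 3"],
]
cleaned = drop_repeated_lines(pages)
print(cleaned[0])  # ['Revenue grew.', 'Page 1']
```

Exact matching misses footers that vary per page, such as the "Page N" lines here, so those need a separate rule.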

Spend your first day on the chunking and the conversion. Spend your second day on retrieval tuning. Most projects that struggle with RAG quality were under-invested in step 1.
