Building an End-to-End PDF-to-LLM Workflow for Research and Knowledge Work

The dream: drop PDFs into a folder, ask questions, get answers grounded in the documents with citations. The reality: there are five stages between "PDF on disk" and "useful LLM answer", and each one fails differently.

This article walks through the full pipeline with concrete tool choices, the tradeoffs that actually matter, and code sketches for each stage. The goal is a workflow you can build in a weekend that scales to thousands of documents.

The five stages

A quick overview before the deep dive:

  1. Conversion — PDFs become Markdown or plain text
  2. Cleanup — fix structural issues, normalize headers, strip junk
  3. Chunking — break long documents into model-friendly chunks
  4. Embedding and indexing — represent chunks numerically, store them in a vector database
  5. Retrieval and prompting — at query time, fetch the right chunks and ask the model

Most teams nail one or two of these and ship something mediocre. Doing all five well is the difference between a demo and a tool people actually use.

Stage 1: Conversion

The foundation of the whole pipeline. Bad input here cascades into every downstream stage.

For digital PDFs (modern reports, papers, manuals), use pymupdf4llm, marker, or the converter on this site. For scanned PDFs, add an OCR fallback; see the guide on converting scanned PDFs to text. For PDFs where tables carry critical information, route table-heavy pages through AWS Textract or a vision model; see the guide on preserving tables when converting PDF to Markdown.
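One way to wire in the OCR fallback is to look at how much text the digital extractor returned for each page and send near-empty pages to OCR. The sketch below is a heuristic only: the function names and the `min_chars` threshold are assumptions for illustration, not part of any tool named above.

```python
def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Heuristic: a page with almost no extractable text is probably a scan.

    min_chars is an assumed threshold; tune it on your own documents.
    """
    visible = "".join(page_text.split())  # ignore all whitespace
    return len(visible) < min_chars


def route_pages(page_texts: list[str], min_chars: int = 50) -> dict[str, list[int]]:
    """Split 0-based page indices into 'digital' and 'ocr' buckets."""
    routes: dict[str, list[int]] = {"digital": [], "ocr": []}
    for i, text in enumerate(page_texts):
        bucket = "ocr" if needs_ocr(text, min_chars) else "digital"
        routes[bucket].append(i)
    return routes
```

Pages in the "ocr" bucket would then go through whichever OCR tool you choose, and the rest keep the faster digital extraction.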

Output format matters: Markdown beats plain text for almost every downstream stage. The structure (headings, lists, tables) feeds directly into the next stages. See the guide on Markdown vs plain text for LLMs for the full case.

Realistic per-document time:

For batches, see bulk PDF conversion.

Stage 2: Cleanup

The most-skipped stage and one of the most valuable. The 80/20 of cleanup:

For a corpus of 10 documents, you can do this by hand in about 50 minutes. For 1000+, write a script. The investment pays off in better retrieval quality at every subsequent stage.
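For the scripted path, a cleanup pass might look like the sketch below. The specific steps (rejoining hyphenated line breaks, dropping bare page numbers, dropping lines that repeat like running headers or footers, collapsing blank lines) are assumptions about what typical converter output needs; `clean_markdown` and `repeat_threshold` are hypothetical names.

```python
import re
from collections import Counter

# Matches bare page-number lines such as "12", "3/10", or "Page 3 of 10".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+(?:\s*(?:/|of)\s*\d+)?\s*$", re.IGNORECASE)


def clean_markdown(text: str, repeat_threshold: int = 3) -> str:
    """Apply a few generic cleanup heuristics to converted Markdown."""
    # Rejoin words split across a line break: "conver-\nsion" -> "conversion".
    # Caveat: this also joins genuinely hyphenated words broken at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    lines = text.split("\n")
    # Identical lines that recur often are usually running headers or footers.
    counts = Counter(line.strip() for line in lines if line.strip())

    kept = []
    for line in lines:
        stripped = line.strip()
        if PAGE_NUMBER.match(stripped):
            continue  # bare page number
        is_markup = stripped.startswith(("#", "|", "-", "*"))  # headings, tables, lists
        if stripped and counts[stripped] >= repeat_threshold and not is_markup:
            continue  # repeated header or footer line
        kept.append(line.rstrip())

    # Collapse runs of blank lines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip() + "\n"
```

Run it on a handful of documents and diff the output before trusting it on the full corpus, since each heuristic can remove legitimate content.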

Stage 3: Chunking

The most underestimated stage. Bad chunks = bad retrieval = bad answers, no matter how good your model is.

Chunking strategies, from simplest to most sophisticated:

For converted Markdown, the recommended default is heading-aware chunking with a ~1000-token maximum and ~100-token overlap.

Common pitfalls:

The prepended-heading trick, where each chunk is prefixed with the heading of the section it came from, is cheap (a few extra tokens per chunk) and often improves retrieval quality more than any other single change.
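Heading-aware chunking with a size cap, overlap, and prepended headings can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for tokens (swap in a real tokenizer), paragraphs are never split (so one very long paragraph can exceed the cap), and `#` lines inside code blocks would be misread as headings. All function names are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading_path, body) pairs, e.g. ("Guide > Setup", "text")."""
    sections, stack, body = [], [], []  # stack holds (level, title)

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(t for _, t in stack), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:  # close same-or-deeper headings
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections


def pack_paragraphs(text: str, max_tokens: int, overlap: int) -> list[str]:
    """Greedily pack whole paragraphs; repeat trailing ones as overlap."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    pieces, current = [], []
    for para in paragraphs:
        size = sum(count_tokens(p) for p in current)
        if current and size + count_tokens(para) > max_tokens:
            pieces.append("\n\n".join(current))
            carried = []  # trailing paragraphs, up to `overlap` tokens
            for prev in reversed(current):
                if sum(map(count_tokens, carried)) + count_tokens(prev) > overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces


def chunk_markdown(markdown: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Chunk per section and prepend the heading path to every chunk."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    chunks = []
    for path, text in split_sections(markdown):
        for piece in pack_paragraphs(text, max_tokens, overlap):
            chunks.append(f"[{path}]\n{piece}" if path else piece)
    return chunks
```

Because chunks never cross a heading boundary, each one stays on a single topic, and the bracketed heading path gives the embedding model context that the chunk body alone may lack.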

Stage 4: Embedding and indexing

Embedding model choices:

Vector store choices:

The most common mistake here is over-engineering. Most personal and team RAG projects fit comfortably in FAISS or sqlite-vec on a single machine. The managed services add ops overhead that doesn't pay off until you're well past 1M chunks.

Code sketch for the simple local path:

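import faiss
import numpy as np
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Return one float32 embedding row per input text."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Index chunks. `chunks` is the list[str] produced by the chunking stage.
# For a large corpus, call embed() in batches rather than in one request.
embeddings = embed(chunks)
# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Persist the vectors. FAISS stores no text, so save `chunks` separately.
faiss.write_index(index, "chunks.faiss")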

Stage 5: Retrieval and prompting

At query time:

  1. Embed the user's question with the same model used to embed the chunks.
  2. Retrieve the top-K most similar chunks (typically K = 5–10).
  3. Optionally re-rank with a cross-encoder for better precision (Cohere Rerank, BGE Reranker).
  4. Build a prompt that includes the retrieved chunks and the question.
  5. Ask the model.
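Steps 1 and 2 above can be sketched as a small function. It assumes `index` is a FAISS index built from the chunks in the same order as the `chunks` list, and that `embed` is the same function used at indexing time and returns a float32 array of shape (n, dim). The name `retrieve` is hypothetical.

```python
from typing import Callable, Sequence


def retrieve(
    question: str,
    index,
    embed: Callable,
    chunks: Sequence[str],
    k: int = 5,
) -> list[tuple[float, str]]:
    """Return up to k (score, chunk) pairs, best match first."""
    query = embed([question])              # one row: shape (1, dim)
    scores, ids = index.search(query, k)   # each result has shape (1, k)
    results = []
    for score, i in zip(scores[0], ids[0]):
        if i == -1:  # FAISS pads with -1 when it finds fewer than k vectors
            continue
        results.append((float(score), chunks[int(i)]))
    return results
```

At query time the index can be loaded with `faiss.read_index("chunks.faiss")`. The chunk texts have to be loaded from wherever they were saved, because the index holds only vectors.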

A prompt template that works well:

You are answering questions based on the following document excerpts.

[Excerpt 1 — from {source}, section "{heading}"]
{content}

[Excerpt 2 — ...]
...

Question: {user question}

Instructions:
- Answer only using information from the excerpts above.
- Quote relevant passages and cite the source and section.
- If the excerpts don't contain the answer, say so explicitly.

The "cite the source and section" instruction is the difference between a useful answer and a black-box guess. Users need to verify; citations make verification possible.
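Filling the template from retrieved chunks is mostly string assembly. The sketch below assumes each chunk was stored with its source file and section heading; the `Excerpt` class and `build_prompt` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Excerpt:
    source: str   # e.g. file name of the original PDF
    heading: str  # section heading the chunk came from
    content: str  # chunk text


def build_prompt(question: str, excerpts: list[Excerpt]) -> str:
    """Render the excerpts, the question, and the instructions as one prompt."""
    parts = ["You are answering questions based on the following document excerpts.", ""]
    for n, ex in enumerate(excerpts, start=1):
        parts.append(f'[Excerpt {n} - from {ex.source}, section "{ex.heading}"]')
        parts.append(ex.content.strip())
        parts.append("")  # blank line between excerpts
    parts += [
        f"Question: {question}",
        "",
        "Instructions:",
        "- Answer only using information from the excerpts above.",
        "- Quote relevant passages and cite the source and section.",
        "- If the excerpts don't contain the answer, say so explicitly.",
    ]
    return "\n".join(parts)
```

Keeping the source and heading in each excerpt label is what lets the model produce citations a reader can check.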

Common failure modes:

For production use, plan to spend more time tuning retrieval than tuning any other stage. The right K, the right reranker, and the right chunking strategy interact in ways you can only discover with real queries.

Hybrid retrieval

Pure vector search can miss exact keyword matches. Searching for "ARP-2024-118" might miss a chunk that contains exactly that string, because an opaque identifier carries little semantic meaning and its embedding may not land close to the chunk that contains it.

The fix is hybrid retrieval: combine vector search with BM25 (a classic keyword-search algorithm). Run both queries, then merge the results with reciprocal rank fusion.

Most production RAG systems use hybrid retrieval. Tools that bundle it include LlamaIndex, Haystack, and Vespa. Rolling your own is straightforward: run a BM25 search via rank_bm25 alongside the vector search, then merge.
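The merge step is small enough to write by hand. Reciprocal rank fusion gives each item a score of 1 / (k + rank) in every list it appears in and sums those scores. The sketch below assumes each retriever returns chunk ids ordered best first; `k = 60` is a commonly used default, and the function name and `top_n` parameter are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 10,
) -> list[str]:
    """Merge several best-first rankings of chunk ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


vector_hits = ["c1", "c2", "c3"]  # ids from the vector search, best first
bm25_hits = ["c4", "c1", "c5"]    # ids from the BM25 search, best first
merged = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because the fusion uses only ranks, it does not matter that BM25 scores and cosine similarities live on different scales.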

A realistic cost model

For a 10,000-document corpus (averaging 20 pages each):

Total to bootstrap a 10k-doc knowledge base: under $100. Ongoing costs are dominated by query volume — at 1000 queries per day, plan for $10–50/day in model costs.
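The per-query arithmetic behind the ongoing cost is easy to rerun with your own numbers. In the sketch below every input is a placeholder: the token counts and the per-million-token prices are hypothetical values chosen for illustration, not quotes from any provider.

```python
def cost_per_query(
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Model cost in dollars for one query, given prices per 1M tokens."""
    return (prompt_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Placeholder inputs: 8 retrieved chunks of ~1000 tokens plus ~200 tokens of
# question and instructions, and a ~500-token answer.
prompt_tokens = 8 * 1000 + 200
per_query = cost_per_query(
    prompt_tokens,
    output_tokens=500,
    input_price_per_m=2.50,    # hypothetical price
    output_price_per_m=10.00,  # hypothetical price
)
daily_cost = per_query * 1000  # at 1000 queries per day
```

With these placeholder values the result is roughly $25 per day. Prompt size dominates the total, so retrieving fewer or shorter chunks is the first lever to pull.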

Conclusion

Each stage looks simple in isolation and gets ugly when you assemble them. The pipeline only works as well as its weakest stage.

Start with the highest-quality conversion you can get. Conversion is the foundation of everything downstream, and errors compound: a 90% accurate conversion can turn into 80% useful chunks, which can turn into 70% useful answers. Get conversion right first.

For getting your first PDFs into Markdown, the converter on this site is a no-install starting point. Once you've validated the per-document output, move to a local script for batch processing.
