Building an End-to-End PDF-to-LLM Workflow for Research and Knowledge Work

2026-05-04 · 7 min read

The dream: drop PDFs into a folder, ask questions, get answers grounded in the documents with citations. The reality: there are five stages between "PDF on disk" and "useful LLM answer", and each one fails differently.

This article walks through the full pipeline with concrete tool choices, the tradeoffs that actually matter, and code sketches for each stage. The goal is a workflow you can build in a weekend that scales to thousands of documents.

The five stages

A quick overview before the deep dive:

Conversion — PDFs become Markdown or plain text
Cleanup — fix structural issues, normalize headers, strip junk
Chunking — break long documents into model-friendly chunks
Embedding and indexing — represent chunks numerically, store them in a vector database
Retrieval and prompting — at query time, fetch the right chunks and ask the model

Most teams nail one or two of these and ship something mediocre. Doing all five well is the difference between a demo and a tool people actually use.

Stage 1: Conversion

The foundation of the whole pipeline. Bad input here cascades into every downstream stage.

For digital PDFs — modern reports, papers, manuals — use pymupdf4llm, marker, or the converter on this site. For scanned PDFs, add OCR fallback; see converting scanned PDFs to text. For PDFs where tables carry critical information, route table-heavy pages through AWS Textract or a vision model; see preserving tables when converting PDF to Markdown.

Output format matters: Markdown beats plain text for almost every downstream stage. The structure (headings, lists, tables) feeds directly into the next stages. See Markdown vs plain text for LLMs for the full case.

Realistic conversion time, per document unless noted:

Digital PDF: 2–30 seconds
Scanned PDF with Tesseract OCR: 1–5 minutes
Mixed content with vision-model OCR: 30–120 seconds per page

For batches, see bulk PDF conversion.
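The routing described above (digital pages go straight through, scanned pages fall back to OCR, table-heavy pages go to a table-aware extractor) can be sketched as a small decision function. This is a minimal sketch, not a definitive implementation: the `PageInfo` fields, the `MIN_TEXT_CHARS` threshold, and the route names are all assumptions made for illustration, and the per-page statistics would come from whatever PDF library you use.

```python
from dataclasses import dataclass

# Assumed threshold: a page whose text layer yields fewer characters than
# this is treated as scanned. Tune it on your own documents.
MIN_TEXT_CHARS = 50


@dataclass
class PageInfo:
    number: int       # 1-based page number
    text_chars: int   # characters found in the page's embedded text layer
    table_count: int  # tables detected on the page (0 if none)


def route_page(page: PageInfo) -> str:
    """Pick a conversion path for one page: 'ocr', 'table', or 'direct'."""
    if page.text_chars < MIN_TEXT_CHARS:
        return "ocr"      # almost no text layer, likely a scanned image
    if page.table_count > 0:
        return "table"    # send to a table-aware extractor or vision model
    return "direct"       # plain digital page, use the normal converter


# Hypothetical per-page statistics for a three-page document.
pages = [
    PageInfo(number=1, text_chars=1840, table_count=0),
    PageInfo(number=2, text_chars=12, table_count=0),
    PageInfo(number=3, text_chars=950, table_count=2),
]
routes = {p.number: route_page(p) for p in pages}
print(routes)
```

Routing per page rather than per document keeps the slow OCR and vision paths limited to the pages that need them.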

Stage 2: Cleanup

The most-skipped stage and one of the most valuable. The 80/20 of cleanup:

Strip running headers and footers. Any line that appears on more than 50% of pages is page chrome, not content. Count line frequency across pages and drop the repeats; a regex that masks page numbers lets "Page 3" and "Page 4" count as the same line.
Promote mis-detected titles to H1. Converters sometimes emit the document title as bold body text. Fix it so chunking can use the title as context.
Fix mis-leveled headings. If the converter promoted body text to H2, demote it. Headings drive chunking later.
Fix broken hyphenation across line breaks. auto-\nmation → automation. A regex on \w-\n\w covers the common case.
Normalize whitespace. Collapse runs of three or more blank lines to two.
Drop tables of contents and indexes. They confuse retrieval — they're lists of section names, not content about those sections.
Preserve metadata at the top of each file. Source filename, conversion date, page count. Useful for citations later.
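Three of the items above are mechanical enough to script with the standard library. The sketch below assumes the converter gives you one string per page; the function names `strip_page_chrome` and `fix_text` are made up for this example. Note that the hyphenation regex also joins genuinely hyphenated words that happen to break at a line end ("well-\nknown" becomes "wellknown"), which is the price of covering the common case.

```python
import re
from collections import Counter


def strip_page_chrome(pages: list[str], threshold: float = 0.5) -> list[str]:
    """Remove lines that recur on more than `threshold` of the pages."""
    def key(line: str) -> str:
        # Mask digits so "Page 3 of 40" and "Page 4 of 40" count as one line.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({key(l) for l in page.splitlines() if l.strip()})

    chrome = {k for k, n in counts.items() if n / len(pages) > threshold}
    return [
        "\n".join(l for l in page.splitlines() if key(l) not in chrome)
        for page in pages
    ]


def fix_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks and cap blank-line runs."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Three or more blank lines means four or more consecutive newlines;
    # collapse them to three newlines, i.e. two blank lines.
    return re.sub(r"\n{4,}", "\n\n\n", text)


# Hypothetical three-page document with a running header and footer.
pages = [
    "ACME Annual Report\nRevenue grew in the first quarter.\nPage 1 of 3",
    "ACME Annual Report\nCosts fell thanks to auto-\nmation.\nPage 2 of 3",
    "ACME Annual Report\nOutlook remains stable.\nPage 3 of 3",
]
cleaned = fix_text("\n".join(strip_page_chrome(pages)))
print(cleaned)
```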

For a corpus of 10 documents, do this by hand in 50 minutes. For 1000+, write a script. The investment pays off in better retrieval quality at every subsequent stage.

Stage 3: Chunking

The most underestimated stage. Bad chunks = bad retrieval = bad answers, no matter how good your model is.

Chunking strategies, from simplest to most sophisticated:

Fixed-size chunks (e.g., 500 tokens with 50-token overlap). Simple, ignores document structure, works OK on uniform content.
Heading-aware chunking. Split on H2 and H3 boundaries. Chunks align with semantic sections. The best default for converted Markdown.
Recursive chunking (LangChain's RecursiveCharacterTextSplitter). Tries large boundaries first (paragraphs), falls back to smaller boundaries (sentences) only when a chunk would be too long. Good general-purpose strategy.
Semantic chunking. Use a small model to detect topic shifts and split on those boundaries. Highest quality, highest cost, slowest setup.
Page-level chunks. One chunk per PDF page. Useful when you need page-number citations in answers.

For converted Markdown, the recommended default is heading-aware chunking with a ~1000-token maximum and ~100-token overlap.
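A minimal sketch of the heading-aware split, assuming ATX-style Markdown headings (`#`, `##`, `###`). The function name `split_by_headings` is hypothetical, the size cap and overlap are left out to keep the example short, and `#` lines inside fenced code blocks would be misread as headings, so treat it as a starting point.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str, max_level: int = 3) -> list[dict]:
    """Split Markdown into sections at headings up to `max_level`.

    Each section records the heading path above it, for example
    ["Trial Report", "Methods", "Statistical Analysis"].
    """
    sections = []
    path = []   # stack of (level, title) for the headings currently open
    body = []   # lines collected for the current section

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": [t for _, t in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m and len(m.group(1)) <= max_level:
            flush()  # close the previous section under its own heading path
            level = len(m.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling and deeper headings
            path.append((level, m.group(2).strip()))
        else:
            body.append(line)  # body text, or a heading deeper than max_level
    flush()
    return sections


doc = """# Trial Report

Intro paragraph.

## Methods

Overview of methods.

### Statistical Analysis

Studies show no significant effect.

## Results

Main findings.
"""
sections = split_by_headings(doc)
for s in sections:
    print(" > ".join(s["path"]), "|", s["text"])
```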

Common pitfalls:

Chunks that split mid-table. Table headers get separated from their rows; retrieval surfaces orphan rows that mean nothing. Fix: treat tables as atomic — never split across a chunk boundary.
Chunks that split mid-list. Bullet points become orphaned context-free fragments. Fix: keep lists intact.
Chunks that lose parent heading context. A chunk reads "studies show no significant effect" without telling you what was being studied. Fix: prepend the chunk's parent heading hierarchy ("Document Title > Methods > Statistical Analysis") to every chunk.

The prepended-heading trick is cheap (a few extra tokens per chunk) and often improves retrieval quality more than any other single change.
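The fixes above can be combined in one packing step. The sketch below makes several assumptions: a section arrives as a heading path plus its text, blocks are separated by blank lines (so a Markdown table or a tight list is one block), and word count stands in for token count. The name `chunk_section` is hypothetical, overlap is omitted, and a single block larger than the limit is kept whole instead of being split.

```python
import re


def chunk_section(path: list[str], text: str, max_words: int = 750) -> list[str]:
    """Pack blank-line-separated blocks into chunks without splitting a block.

    Every chunk starts with the heading breadcrumb, e.g.
    "Trial Report > Methods > Statistical Analysis".
    """
    breadcrumb = " > ".join(path)
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

    groups, current, size = [], [], 0
    for block in blocks:
        n = len(block.split())
        if current and size + n > max_words:
            groups.append(current)  # close the chunk before it overflows
            current, size = [], 0
        current.append(block)       # tables and lists go in whole
        size += n
    if current:
        groups.append(current)

    return [breadcrumb + "\n\n" + "\n\n".join(g) for g in groups]


# Hypothetical section: a paragraph, a table, and a closing paragraph.
text = (
    "Studies show no significant effect.\n\n"
    "| Group | N |\n| --- | --- |\n| A | 40 |\n| B | 38 |\n\n"
    "All tests were two-sided."
)
chunks = chunk_section(
    ["Trial Report", "Methods", "Statistical Analysis"], text, max_words=12
)
for c in chunks:
    print(c, end="\n---\n")
```

With the deliberately tiny `max_words=12`, the table lands in its own chunk with header and rows together, and each chunk still says which section it came from.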

Stage 4: Embedding and indexing

Embedding model choices:

OpenAI text-embedding-3-large — high quality, $0.13 per million tokens. Default if you want a single model that just works.
OpenAI text-embedding-3-small — good enough for most uses at $0.02 per million tokens. About 6.5 times cheaper, roughly 95% of the quality.
Voyage AI voyage-3 — competitive with OpenAI, sometimes better on technical content. Worth benchmarking on your specific corpus.
Local: BGE or Nomic embeddings — free, runs on a single GPU, slightly lower quality. Worth it if you're processing sensitive content or have steady volume.

Vector store choices:

Under 100k chunks: a local FAISS or sqlite-vec index. Runs on a laptop. Don't reach for a managed service yet.
Under 10M chunks: Qdrant, Weaviate, or Chroma. Single-node, fast, free to self-host.
Larger than that: managed services (Pinecone, Turbopuffer) or self-hosted clusters.

The most common mistake here is over-engineering. Most personal and team RAG projects fit comfortably in FAISS or sqlite-vec on a single machine. The managed services add ops overhead that doesn't pay off until you've outgrown a single node, which by the ranges above means well past 10M chunks.

Code sketch for the simple local path:

import faiss
import openai
import numpy as np

client = openai.OpenAI()

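def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts; returns a float32 array of shape (n, dim)."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Index chunks (`chunks` is the list of chunk strings from Stage 3).
# Large corpora must be embedded in batches: one request only carries so many inputs.
embeddings = embed(chunks)
# Inner product equals cosine similarity here because OpenAI embeddings
# are returned normalized to unit length.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)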

# Persist
faiss.write_index(index, "chunks.faiss")

Stage 5: Retrieval and prompting

At query time:

Embed the user's question with the same model used to embed the chunks.
Retrieve the top-K most similar chunks (typically K = 5–10).
Optionally re-rank with a cross-encoder for better precision (Cohere Rerank, BGE Reranker).
Build a prompt that includes the retrieved chunks and the question.
Ask the model.
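The first two steps boil down to a nearest-neighbor lookup. The sketch below uses plain Python lists and made-up three-dimensional vectors so it runs without any dependencies; in the real pipeline the query vector would come from the embedding model and a vector index would do the search. The helper names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product is a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(query_vec, chunk_vecs, k=5):
    """Return (chunk index, cosine similarity) for the k closest chunks."""
    q = normalize(query_vec)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumption for illustration).
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

hits = top_k(query_vec, chunk_vecs, k=2)
print([i for i, _ in hits])
```

Brute force like this is exact but scans every chunk on every query, which is the work a flat vector index does for you in optimized form.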

A prompt template that works well:

You are answering questions based on the following document excerpts.

[Excerpt 1 — from {source}, section "{heading}"]
{content}

[Excerpt 2 — ...]
...

Question: {user question}

Instructions:
- Answer only using information from the excerpts above.
- Quote relevant passages and cite the source and section.
- If the excerpts don't contain the answer, say so explicitly.

The "cite the source and section" instruction is the difference between a useful answer and a black-box guess. Users need to verify; citations make verification possible.
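Filling the template is plain string assembly. A minimal sketch, assuming each retrieved chunk is a dict with `source`, `heading`, and `content` keys (that shape and the `build_prompt` name are assumptions, not something the pipeline dictates):

```python
def build_prompt(question: str, excerpts: list[dict]) -> str:
    """Render retrieved excerpts and the question into one prompt string."""
    lines = [
        "You are answering questions based on the following document excerpts.",
        "",
    ]
    for number, ex in enumerate(excerpts, start=1):
        source, heading = ex["source"], ex["heading"]
        # The label is what lets the model cite source and section later.
        lines.append(f'[Excerpt {number} - from {source}, section "{heading}"]')
        lines.append(ex["content"])
        lines.append("")
    lines += [
        f"Question: {question}",
        "",
        "Instructions:",
        "- Answer only using information from the excerpts above.",
        "- Quote relevant passages and cite the source and section.",
        "- If the excerpts don't contain the answer, say so explicitly.",
    ]
    return "\n".join(lines)


# Hypothetical retrieved chunk.
prompt = build_prompt(
    "Was the effect significant?",
    [{
        "source": "trial.md",
        "heading": "Statistical Analysis",
        "content": "Studies show no significant effect.",
    }],
)
print(prompt)
```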

Common failure modes:

Top-K too small → the model misses relevant context. Bump from 5 to 10.
Top-K too large → token budget blown, model gets distracted. Reduce or add reranking.
No reranking → query matches noisy chunks that share keywords but not meaning. Add a cross-encoder reranker for the top 50, return the top 10.
Citations missing → users can't verify; trust collapses. Always include source identifiers in the prompt.
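The "rerank the top 50, return the top 10" pattern is a two-stage filter. The sketch below takes the scoring function as a parameter so it stays library-agnostic; in practice that function would wrap a cross-encoder reranker, while here a lookup table of made-up scores stands in for it. `retrieve_and_rerank` and `toy_scores` are hypothetical names.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    fetch_k: int = 50,
    final_k: int = 10,
) -> list[str]:
    """Rerank the first `fetch_k` candidates and keep the best `final_k`.

    `candidates` must already be ordered by first-stage similarity,
    best first, as returned by the vector search.
    """
    shortlist = candidates[:fetch_k]
    reranked = sorted(shortlist, key=lambda text: score(query, text), reverse=True)
    return reranked[:final_k]


# Placeholder relevance scores standing in for a real reranker's output.
toy_scores = {
    "shares keywords only": 0.31,
    "answers the question": 0.92,
    "unrelated": 0.05,
}
candidates = ["shares keywords only", "answers the question", "unrelated"]

best = retrieve_and_rerank(
    "example query", candidates, lambda q, t: toy_scores[t], final_k=2
)
print(best)
```

The first stage stays cheap and wide; the expensive scorer only ever sees `fetch_k` texts per query.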

For production use, plan to spend more time tuning retrieval than tuning any other stage. The right K, the right reranker, the right chunking strategy interact in ways you can only discover with real queries.

Hybrid retrieval

Pure vector search misses keyword matches. Searching for "ARP-2024-118" might miss a chunk that contains exactly that string because the embedding isn't semantically close to anything else in the corpus.

The fix is hybrid retrieval: combine vector search with BM25 (a classic keyword-search algorithm). Run both queries, merge the results with reciprocal rank fusion.

Most production RAG systems use hybrid retrieval. Tools that bundle it include the LlamaIndex and Haystack frameworks and the Vespa search engine. Rolling your own is straightforward: run a BM25 search via rank_bm25 alongside the vector search, then merge.
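The merge step is the only part that is not obvious. Reciprocal rank fusion scores each chunk by summing 1 / (k + rank) over every ranking it appears in, so it needs only the rank positions and never has to reconcile BM25 scores with cosine similarities. A minimal sketch, using the conventional constant k = 60 and made-up chunk IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results: "c4" holds the exact string the user typed, so
# BM25 ranks it first while the vector search misses it entirely.
vector_hits = ["c7", "c2", "c9"]
bm25_hits = ["c4", "c7", "c1"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused[:3])
```

A chunk found by both searches ("c7") rises to the top, and the keyword-only hit ("c4") still makes the shortlist instead of being lost.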

A realistic cost model

For a 10,000-document corpus (averaging 20 pages each):

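Conversion: roughly 6–80 hours of CPU time at the 2–30 seconds per digital PDF quoted in Stage 1, less in wall-clock terms when run in parallel; effectively free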
Cleanup script: 1 hour of developer time, then free
Chunking: minutes
Embedding: ~$5–20 with text-embedding-3-small
Storage: free on local FAISS; ~$70/month on Pinecone starter tier
Per-query inference: ~$0.01–0.05 with GPT-4o or Claude

Total to bootstrap a 10k-doc knowledge base: under $100. Ongoing costs are dominated by query volume — at 1000 queries per day, plan for $10–50/day in model costs.
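The arithmetic behind these figures is easy to rerun with your own numbers. The sketch below uses the per-million-token embedding prices and the per-query range quoted in this article; the 100M-token corpus size is an arbitrary round number for illustration, not an estimate for any particular corpus, and the function names are made up.

```python
# Prices per million tokens, as quoted in Stage 4.
EMBED_PRICE_PER_M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost(total_tokens: int, model: str) -> float:
    """One-off cost in dollars to embed `total_tokens` tokens."""
    return total_tokens / 1_000_000 * EMBED_PRICE_PER_M[model]


def daily_query_cost(queries_per_day: int, cost_per_query: float) -> float:
    """Ongoing model cost in dollars per day."""
    return queries_per_day * cost_per_query


tokens = 100_000_000  # illustrative corpus size (assumption)
small = embedding_cost(tokens, "text-embedding-3-small")
large = embedding_cost(tokens, "text-embedding-3-large")
print(f"small: ${small:.2f}  large: ${large:.2f}  ratio: {large / small:.1f}x")

low = daily_query_cost(1000, 0.01)
high = daily_query_cost(1000, 0.05)
print(f"1000 queries/day: ${low:.0f} to ${high:.0f} per day")
```

Embedding is paid once per corpus while query cost recurs daily, which is why query volume dominates the ongoing bill.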

Conclusion

Each stage looks simple in isolation and gets ugly when you assemble them. The pipeline only works as well as its weakest stage.

Start with the highest-quality conversion you can get. Conversion is the foundation of everything downstream, and a 90% accurate conversion turns into 80% useful chunks, which turns into 70% useful answers. Get conversion right first.

For getting your first PDFs into Markdown, the converter on this site is a no-install starting point. Once you've validated the per-document output, move to a local script for batch processing.
