Document Chunking Strategies for LLMs and RAG — Fixed, Recursive, and Semantic

Once you've converted a PDF to clean text, the next decision in almost any LLM workflow is how to split it. Chunking sounds trivial (cut the document into pieces), but it is one of the most influential and most underestimated steps in retrieval quality. Chunk badly and your RAG system retrieves the wrong passages, splits answers across boundaries, and confidently cites passages that don't answer the question. Chunk well and mediocre models start looking smart.

This guide covers the chunking strategies that matter, how to choose size and overlap, and the structure-aware approach that beats naive splitting on real documents.

Why chunking exists at all

Two hard constraints force it:

So you split the document into chunks, embed each, and retrieve the chunks most relevant to a query. The whole pipeline is covered in the guide "Building a RAG pipeline from PDFs"; this guide zooms into the chunking decision itself.

The strategies, from worst to best

1. Fixed-size chunking (the naive baseline)

Cut every N characters or tokens. Simple, fast, and the default in many tutorials.
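To make the baseline concrete, here is a minimal sketch of character-based fixed-size chunking. The function name and the sample sentence are illustrative, not taken from any library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Boundaries fall wherever the counter lands, regardless of sentences or words.
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "Refunds are issued within 30 days. Contact support to start one."
for chunk in fixed_size_chunks(sample, 25):
    print(repr(chunk))
```

With a size of 25, the first cut lands right after "within", so the "30 days" that answers a refund question ends up in a different chunk from the words that give it meaning.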

2. Recursive character splitting (the sensible default)

Split on a hierarchy of separators — paragraphs first, then sentences, then words — only descending to a finer separator when a chunk is still too big. This is what LangChain's RecursiveCharacterTextSplitter does, and it's the right default for most projects.
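The idea can be sketched in plain Python. This is a simplified illustration of the separator hierarchy, not LangChain's actual implementation; the function name, the separator list, and the character-based length limit are assumptions made for the example.

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; descend only for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # Still too big: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer. It has two sentences."
for chunk in recursive_split(text, max_len=30):
    print(repr(chunk))
```

One simplification to be aware of: a separator that falls exactly on a chunk boundary is dropped, so a sentence can lose its trailing period. A production splitter would normally keep it.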

3. Structure-aware (Markdown-aware) chunking

This is where converting to Markdown pays off. If your document is Markdown, you can split on its structure — headings, sections, list boundaries — so each chunk corresponds to a logical unit of the document.
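A minimal sketch of heading-based splitting follows. It assumes ATX-style headings (lines starting with one to six `#` characters) and triple-backtick code fences; the function name and the record layout are illustrative choices.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # a Markdown code-fence marker


def split_by_headings(markdown: str) -> list[dict]:
    """Return one {"heading", "level", "text"} record per Markdown section."""
    sections = []
    current = {"heading": "", "level": 0, "lines": []}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # a '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append(current)
            current = {"heading": match.group(2).strip(),
                       "level": len(match.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    # Drop the empty preamble record if the document starts with a heading.
    return [
        {"heading": s["heading"], "level": s["level"],
         "text": "\n".join(s["lines"]).strip()}
        for s in sections
        if s["heading"] or "".join(s["lines"]).strip()
    ]


doc = "\n".join([
    "# Guide", "Intro text.", "",
    "## Setup", "Install it.", "", FENCE, "# not a heading", FENCE, "",
    "## Usage", "Run it.",
])
for section in split_by_headings(doc):
    print(section["level"], section["heading"], "->", repr(section["text"]))
```

Each record maps to one logical section, so a chunk never begins halfway through one topic and ends halfway through the next.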

4. Semantic chunking

Split where the meaning shifts: embed sentences, measure similarity between consecutive ones, and cut where similarity drops (a topic boundary).
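The sketch below shows the mechanics under a loud assumption: `toy_embed` is a bag-of-words stand-in so the example runs without a model, and a real system would pass in an actual sentence-embedding function instead. The threshold value is likewise illustrative and would need tuning.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2,
                    embed=toy_embed) -> list[str]:
    """Start a new chunk wherever similarity between neighbours drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:  # similarity drop: treat as a topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The refund policy covers all orders.",
    "The refund is issued within thirty days.",
    "Shipping takes about a week.",
    "Shipping takes longer for heavy items.",
]
for chunk in semantic_chunks(sentences):
    print(repr(chunk))
```

Here the two refund sentences share vocabulary and stay together, while the jump to shipping shares none and triggers a cut. With real embeddings the same loop applies, but every sentence costs an embedding call, which is the main price of this strategy.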

Choosing chunk size

There's no universal best size, but there's a useful way to reason about it:

Match size to your query type and embedding model. Some embedding models are tuned for short passages and degrade on long ones; check your model's recommended input length. The honest method is to test: build the index at two or three sizes and measure retrieval quality on real questions.
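The "build at a few sizes and measure" loop can be sketched as follows. Everything here is a toy: `keyword_retrieve` is a hypothetical word-overlap retriever standing in for your embedding search, and the two labelled questions stand in for a real evaluation set.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retrieve(question: str, chunks: list[str], k: int) -> list[str]:
    """Toy retriever (an assumption): rank chunks by words shared with the question."""
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def hit_rate(chunks, labelled_questions, retrieve=keyword_retrieve, k=1) -> float:
    """Fraction of questions whose expected answer text appears in the top-k chunks."""
    hits = sum(
        any(answer in chunk for chunk in retrieve(question, chunks, k))
        for question, answer in labelled_questions
    )
    return hits / len(labelled_questions)


text = "Refunds are issued within 30 days. Shipping takes about a week."
questions = [
    ("How long do refunds take?", "30 days"),
    ("How long does shipping take?", "a week"),
]

# Rebuild the chunks at each candidate size and compare the scores.
for size in (20, 40, 80):
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    print(size, hit_rate(chunks, questions))
```

The useful part is the shape of the experiment, not the toy retriever: a fixed set of real questions with known answers, scored the same way at every chunk size.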

Overlap: how much and why

Overlap repeats some text between adjacent chunks (e.g. the last 50 tokens of one chunk start the next). It prevents an answer that straddles a boundary from being lost in the gap.

Structure-aware chunking needs less overlap, because boundaries already fall at natural breaks where ideas complete.
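Overlap is easiest to see as a sliding window whose step is smaller than its size. The sketch below works on any list of tokens; the example uses single letters for readability, and a real pipeline would use the tokens of your model's tokenizer.

```python
def chunks_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, repeating the last `overlap` tokens each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


tokens = list("abcdefghij")
for chunk in chunks_with_overlap(tokens, size=4, overlap=1):
    print("".join(chunk))
```

Each chunk begins with the last token of the previous one, so a phrase that straddles a boundary appears whole in at least one chunk as long as it is no longer than the overlap.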

The mistakes that quietly wreck retrieval

A practical recipe

For most PDF-derived corpora, this works well out of the box:

  1. Convert to clean Markdown (this site's converter or pymupdf4llm/marker).
  2. Split on Markdown headings into sections.
  3. For any section over your size limit, apply recursive character splitting within it.
  4. Prepend the heading path to each chunk.
  5. Use 10–15% overlap.
  6. Attach metadata: source, page, section.
  7. Keep tables and code blocks intact.

Then measure retrieval on real queries and adjust size before reaching for anything fancier.
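Steps 2 to 4 and part of step 6 can be sketched in one function. This is a simplified illustration under stated assumptions: ATX headings, a character-based size limit, and hypothetical names (`chunk_markdown`, `split_long`, the `source` and `section` fields). Overlap, page numbers, and protection of tables and multi-paragraph code blocks are left out to keep it short.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # a Markdown code-fence marker


def split_long(text: str, max_chars: int) -> list[str]:
    """Pack paragraphs into pieces of at most max_chars (hard cut as a last resort)."""
    pieces, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            pieces.append(current)
        current = ""
        if len(para) <= max_chars:
            current = para
        else:
            pieces.extend(para[i:i + max_chars] for i in range(0, len(para), max_chars))
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(markdown: str, source: str, max_chars: int = 1000) -> list[dict]:
    """Split on headings, sub-split oversized sections, prepend the heading path."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        body.clear()
        if not text:
            return
        heading_path = " > ".join(title for _, title in path)
        # max_chars limits the section text; the heading path is added on top.
        for piece in split_long(text, max_chars):
            chunks.append({
                "text": f"{heading_path}\n\n{piece}" if heading_path else piece,
                "source": source,
                "section": heading_path,
            })

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave deeper or sibling sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


md = "# Guide\nIntro.\n\n## Setup\nInstall it.\n\n## Usage\n" + "word " * 50
for chunk in chunk_markdown(md, source="guide.pdf", max_chars=100):
    print(chunk["section"], "|", len(chunk["text"]))
```

Prepending the heading path means a chunk cut from deep inside a long section still carries the context that says what it is about.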

Quick reference

Conclusion

Chunking is where a lot of "the model isn't finding the answer" problems actually live — and the fix is rarely a bigger model, it's smarter splitting. Structure-aware chunking on clean Markdown, sized to your queries with modest overlap and good metadata, gives most projects the biggest retrieval gain for the least effort.

Since structure-aware chunking depends on having structure to split on, it starts with a good conversion: turn your PDF into Markdown first, then split on the headings it preserves.
