Document Chunking Strategies for LLMs and RAG — Fixed, Recursive, and Semantic
Once you've converted a PDF to clean text, the next decision in almost any LLM workflow is how to split it. Chunking sounds trivial (just cut the document into pieces), but it is often the most influential and most underestimated step for retrieval quality. Chunk badly and your RAG system retrieves the wrong passages, splits answers across boundaries, and confidently cites nothing useful. Chunk well and mediocre models start looking smart.
This guide covers the chunking strategies that matter, how to choose size and overlap, and the structure-aware approach that beats naive splitting on real documents.
Why chunking exists at all
Two hard constraints force it:
- Embedding model input limits. Embedding models encode a bounded amount of text into one vector. Feed a whole document and you either exceed the limit or dilute the meaning into a vector that represents everything and therefore nothing.
- Retrieval granularity. You want to retrieve the relevant passage, not the whole 80-page document. Smaller units mean more precise retrieval and less irrelevant context stuffed into the prompt.
So you split the document into chunks, embed each one, and retrieve the chunks most relevant to a query. The whole pipeline is covered in the guide on building a RAG pipeline from PDFs; this guide zooms in on the chunking decision itself.
The strategies, from simplest to most sophisticated
1. Fixed-size chunking (the naive baseline)
Cut every N characters or tokens. Simple, fast, and the default in many tutorials.
- Problem: it cuts blindly — mid-sentence, mid-table, mid-word. A definition gets separated from the term it defines; a number lands in a different chunk from its label.
- Verdict: acceptable only as a baseline. You can do much better for little extra effort.
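To make the failure mode concrete, here is a minimal sketch of fixed-size chunking by character count. The function name `fixed_size_chunks` and the sample sentence are illustrative assumptions, not part of any library.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring sentence and word boundaries."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample text, chosen to show a number losing its label.
sample = "The warranty period is 24 months. Claims must be filed in writing."
for chunk in fixed_size_chunks(sample, 25):
    print(repr(chunk))
```

With a size of 25 characters, the first chunk ends at "24" and the word "months" lands in the next chunk, which is exactly the label-and-number separation described above.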
2. Recursive character splitting (the sensible default)
Split on a hierarchy of separators — paragraphs first, then sentences, then words — only descending to a finer separator when a chunk is still too big. This is what LangChain's RecursiveCharacterTextSplitter does, and it's the right default for most projects.
- Why it works: it respects natural boundaries when it can, so chunks tend to end at paragraph or sentence breaks rather than mid-thought.
- Verdict: the best effort-to-quality ratio. Start here.
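The separator hierarchy can be sketched in a few lines. This is a simplified illustration of the idea, not LangChain's actual implementation: the name `recursive_split`, the separator list, and the character-based limit are all assumptions, and the separator text at a chunk boundary is dropped for brevity.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; use finer separators only
    for pieces that are still longer than max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too big: descend to the next separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: two short paragraphs and one long run of words.
text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is short.\n\n" + "word " * 30
)
for chunk in recursive_split(text, 40):
    print(len(chunk), repr(chunk))
```

The two short paragraphs survive as whole chunks, and only the long third paragraph is broken down at word boundaries.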
3. Structure-aware (Markdown-aware) chunking
This is where converting to Markdown pays off. If your document is Markdown, you can split on its structure — headings, sections, list boundaries — so each chunk corresponds to a logical unit of the document.
- Split on headings so that each chunk is a coherent section, and prepend the heading path (for example "Chapter 3 > Risk Factors") to each chunk so it carries its own context.
- Keep tables and code blocks intact rather than slicing through them.
- Why it works: chunks align with how the document is actually organized, so retrieval returns self-contained, on-topic passages. This is a major reason Markdown is the preferred intermediate for LLM work — see Markdown vs plain text for LLMs.
- Verdict: the best general-purpose approach for structured documents. Combine it with recursive splitting within oversized sections.
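A minimal sketch of heading-based splitting, under some assumptions: headings use the `#` (ATX) style, code fences use backticks, and the function name `split_by_headings` and the dictionary layout are hypothetical. Because it only cuts at headings, tables and code blocks inside a section stay in one piece.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # opening or closing marker of a fenced code block


def split_by_headings(markdown: str) -> list[dict]:
    """Return one chunk per heading-delimited section, prefixed with its heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for the current section
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            prefix = " > ".join(title for _, title in path)
            chunks.append(
                {"path": prefix, "text": f"{prefix}\n\n{text}" if prefix else text}
            )
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sections at the same or a deeper level
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical document used only for illustration.
doc = "\n".join([
    "# Chapter 3",
    "Intro.",
    "## Risk Factors",
    "It increased 12% year over year.",
    FENCE,
    "# not a heading",
    FENCE,
    "## Outlook",
    "Stable.",
])
for chunk in split_by_headings(doc):
    print(chunk["path"])
```

Any section that is still over the size limit can then be passed to a recursive splitter, with the same heading path prepended to each resulting piece.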
4. Semantic chunking
Split where the meaning shifts: embed sentences, measure similarity between consecutive ones, and cut where similarity drops (a topic boundary).
- Why it works: boundaries fall at genuine topic transitions rather than arbitrary positions or even structural ones.
- Cost: more compute (you embed to decide where to split) and more complexity.
- Verdict: worth it for dense, unstructured prose where headings are sparse. Overkill for well-structured documents where heading-based splitting already aligns with topics.
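The mechanics can be sketched without a real model. In the example below, `toy_embed` is a bag-of-words stand-in for an embedding model, and the threshold of 0.1 is an arbitrary assumption; in practice you would swap in your embedding model and tune the threshold on your own text.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Start a new chunk wherever similarity between consecutive sentences
    drops below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = toy_embed(sentences[0])
    for sentence in sentences[1:]:
        current = toy_embed(sentence)
        if cosine(prev, current) < threshold:
            chunks.append([])  # similarity dropped: treat it as a topic boundary
        chunks[-1].append(sentence)
        prev = current
    return chunks


# Hypothetical sentences: two about a warranty, two about shipping.
sentences = [
    "The warranty covers the battery.",
    "The battery warranty lasts two years.",
    "Shipping takes five days.",
    "Shipping costs depend on weight.",
]
for group in semantic_chunks(sentences):
    print(group)
```

The cut falls between the second and third sentences, where the two share no vocabulary at all. A real embedding model would also catch topic shifts that word overlap misses.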
Choosing chunk size
There's no universal best size, but there's a useful way to reason about it:
- Smaller chunks (≈100–300 tokens): precise retrieval, less irrelevant text in the prompt, but each chunk may lack surrounding context. Good for fact lookup ("what's the warranty period?").
- Larger chunks (≈500–1000 tokens): more context per chunk, fewer total chunks, but retrieval is coarser and you spend more prompt budget per hit. Good for questions needing broader context ("explain the methodology").
- Common sweet spot: 300–600 tokens for mixed Q&A workloads.
Match size to your query type and embedding model. Some embedding models are tuned for short passages and degrade on long ones; check your model's recommended input length. The honest method is to test: build the index at two or three sizes and measure retrieval quality on real questions.
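The "build at several sizes and measure" step might look like the sketch below. Everything here is a simplifying assumption: sizes are counted in words rather than tokens, ranking uses plain word overlap instead of embedding search, and the names `chunk_words` and `hit_rate` are hypothetical. The shape of the experiment is the point, not the retrieval method.

```python
import re


def words_in(text: str) -> set[str]:
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk_words(text: str, size: int) -> list[str]:
    """Fixed-size chunks of `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def hit_rate(text: str, size: int, cases: list[tuple[str, str]], k: int = 1) -> float:
    """Fraction of questions whose expected answer string appears in the top-k chunks."""
    if not cases:
        raise ValueError("need at least one test case")
    chunks = chunk_words(text, size)
    hits = 0
    for question, answer in cases:
        q = words_in(question)
        ranked = sorted(chunks, key=lambda c: len(q & words_in(c)), reverse=True)
        hits += any(answer in c for c in ranked[:k])
    return hits / len(cases)


# Hypothetical corpus and (question, expected answer) pairs.
corpus = (
    "The warranty period is 24 months. "
    "Returns are accepted within 30 days. "
    "Support is available by email."
)
cases = [
    ("What is the warranty period?", "24 months"),
    ("How long are returns accepted?", "30 days"),
]
for size in (3, 6):
    print(size, hit_rate(corpus, size, cases))
```

On this toy corpus, three-word chunks separate each answer from the words that match the question, while six-word chunks keep them together. Real corpora need a larger question set, but the comparison works the same way.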
Overlap: how much and why
Overlap repeats some text between adjacent chunks (e.g. the last 50 tokens of one chunk start the next). It prevents an answer that straddles a boundary from being lost in the gap.
- Typical: 10–20% overlap (e.g. 50–100 tokens on a 500-token chunk).
- Too little: boundary-straddling facts get cut in half and retrieved by neither chunk.
- Too much: duplicated content bloats the index, returns near-identical chunks, and wastes prompt space.
Structure-aware chunking needs less overlap, because boundaries already fall at natural breaks where ideas complete.
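Overlap is just a sliding window whose step is smaller than its width. A minimal sketch, assuming the text is already tokenized into a list (the function name `chunks_with_overlap` is hypothetical):

```python
def chunks_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding window: each chunk starts `size - overlap` tokens after the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical input: 1,200 placeholder tokens, 500-token chunks, 10% overlap.
tokens = [f"t{i}" for i in range(1200)]
for chunk in chunks_with_overlap(tokens, size=500, overlap=50):
    print(chunk[0], chunk[-1], len(chunk))
```

With a size of 500 and an overlap of 50, each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk are the first 50 of the next.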
The mistakes that quietly wreck retrieval
- Splitting tables across chunks. Half a table is worse than useless — rows lose their headers. Keep tables whole; if one is huge, repeat the header row in each piece. See PDF tables to Markdown.
- Dropping the heading context. A chunk reading "It increased 12% year over year" is unretrievable and uninterpretable without knowing what increased. Prepend section headings to every chunk.
- Chunking broken text. If the PDF extraction was garbled (bad reading order, ligature errors), every chunk inherits the damage. Fix extraction first; the guides on reading order and on why text won't copy cover the common causes.
- Ignoring metadata. Store source document, page number, and section with each chunk. It enables citations ("see page 14") and lets you filter retrieval by document.
- One size for all document types. A legal contract, a chat log, and a textbook want different chunking. Route by type if your corpus is mixed.
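For the first mistake in the list, repeating the header row when a table is too large to keep whole can be sketched as follows. It assumes a well-formed pipe table whose first line is the header and whose second line is the delimiter row; the function name `split_markdown_table` is hypothetical.

```python
def split_markdown_table(table: str, max_rows: int) -> list[str]:
    """Split a Markdown table into pieces of at most `max_rows` data rows,
    repeating the header and delimiter rows in every piece."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    lines = [line for line in table.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header row and a delimiter row")
    header, delimiter, rows = lines[0], lines[1], lines[2:]
    if not rows:
        return ["\n".join(lines)]  # header-only table: nothing to split
    return [
        "\n".join([header, delimiter, *rows[i:i + max_rows]])
        for i in range(0, len(rows), max_rows)
    ]


# Hypothetical table used only for illustration.
table = "\n".join([
    "| Year | Revenue |",
    "|------|---------|",
    "| 2021 | 10 |",
    "| 2022 | 12 |",
    "| 2023 | 15 |",
])
for piece in split_markdown_table(table, max_rows=2):
    print(piece)
    print()
```

Each piece is a valid table on its own, so a retrieved row still arrives with the column names that give it meaning.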
A practical recipe
For most PDF-derived corpora, this works well out of the box:
- Convert to clean Markdown (this site's converter or pymupdf4llm/marker).
- Split on Markdown headings into sections.
- For any section over your size limit, apply recursive character splitting within it.
- Prepend the heading path to each chunk.
- Use 10–15% overlap.
- Attach metadata: source, page, section.
- Keep tables and code blocks intact.
Then measure retrieval on real queries and adjust size before reaching for anything fancier.
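One way to carry the heading path and metadata from the recipe through to retrieval is a small record type. This is a sketch only: the `Chunk` class, its field names, and the helper `filter_by_source` are assumptions for illustration, and the file name is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str     # the chunk body, without the heading path
    source: str   # source document, e.g. a file name
    page: int     # page the chunk starts on
    section: str  # heading path, e.g. "Chapter 3 > Risk Factors"

    def embedding_text(self) -> str:
        """Text to embed: the heading path is prepended so the chunk carries its context."""
        return f"{self.section}\n\n{self.text}" if self.section else self.text

    def citation(self) -> str:
        """Human-readable pointer back to the source."""
        return f"{self.source}, page {self.page}"


def filter_by_source(chunks: list[Chunk], source: str) -> list[Chunk]:
    """Restrict retrieval candidates to a single document."""
    return [c for c in chunks if c.source == source]


# Hypothetical chunk used only for illustration.
chunk = Chunk(
    text="It increased 12% year over year.",
    source="report.pdf",
    page=14,
    section="Chapter 3 > Risk Factors",
)
print(chunk.embedding_text())
print(chunk.citation())
```

Keeping the heading path as a separate field, rather than only baking it into the text, leaves room to filter or display by section later.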
Quick reference
- Default choice? Recursive splitting, upgraded to Markdown-structure-aware if your docs have headings.
- Dense prose, few headings? Consider semantic chunking.
- Chunk size? 300–600 tokens for mixed Q&A; smaller for fact lookup, larger for context-heavy questions.
- Overlap? 10–20%, less if you split on structure.
- Never: split tables, drop heading context, or chunk garbled text.
Conclusion
Chunking is where a lot of "the model isn't finding the answer" problems actually live — and the fix is rarely a bigger model, it's smarter splitting. Structure-aware chunking on clean Markdown, sized to your queries with modest overlap and good metadata, gives most projects the biggest retrieval gain for the least effort.
Since structure-aware chunking depends on having structure to split on, it starts with a good conversion: turn your PDF into Markdown here, then split on the headings it preserves.