Markdown vs Plain Text — Which Format Should You Feed ChatGPT and Claude?
You have a 40-page PDF and you want to ask Claude or ChatGPT questions about it. You can paste it as plain text, paste it as Markdown, upload the PDF directly, or chunk it and index it for retrieval. Most people pick the first option that seems to work. But the format you choose changes answer quality, token cost, and what the model actually sees on the other end.
This article compares the realistic options on the same document and explains which one to pick for which job.
Why format matters more than you'd think
LLMs read documents as token streams — flat sequences of subword pieces. Structure that's obvious to a human (headings that organize sections, lists that group related items, tables that align columns) has to be encoded into the text itself or it gets lost.
A model given a wall of plain text has to infer document structure on the fly. Sometimes it gets it right; often it doesn't. A common failure mode: the model treats a numbered list embedded in prose as the only thing that matters and ignores surrounding paragraphs that contain the actual answer. The same content in Markdown gives the model an explicit map of the document — H2 sections, ordered lists, pipe-syntax tables — and the model uses that map when deciding which parts to attend to.
This matters more in 2026 than it did two years ago. The current generation of frontier models — GPT-4o, Claude 4.x, Gemini 2.x — was trained explicitly on Markdown-flavored corpora. They parse Markdown the same way they parse code: as a structured document, not as decorated prose.
The four practical options compared
| Format | Token cost | Structure preserved | Setup effort | Best when |
|---|---|---|---|---|
| Plain text | Lowest | None | None | Short notes, single-section content |
| Markdown | Small overhead (~3%) | Headings, lists, tables, emphasis | Convert once | Most documents |
| Raw PDF upload | Variable, often 2–3× higher | Yes (model parses visually) | None | Layout carries meaning |
| Chunked + indexed (RAG) | Lowest per query | Depends on chunker | Significant | Repeated queries over large corpus |
Plain text is the path of least resistance. Copy-paste from a PDF, drop into the chat, ask a question. It works for short, structurally simple content — a memo, a single email, a few paragraphs from a longer document. It falls down on anything with headings, lists, or tables: the model loses the document's shape and starts guessing.
Markdown is the strongest default for converted PDF content. Headings give the model an explicit navigation map. Lists stay as lists. Tables stay structured. Code blocks stay intact. The token overhead is small — Markdown adds roughly 3% to a plain-text baseline because the syntax is compact (# H1, - item, |cell|). For 80% of PDF-to-LLM workflows, this is the right format.
Raw PDF upload is what Claude, ChatGPT, and Gemini all support natively now. You drop the PDF in and the model parses it (with its own vision pipeline). This is useful when visual layout carries meaning — architectural drawings, design briefs where image positioning matters, multi-column scientific papers with figures interleaved through the text. The cost is roughly 2–3× higher in tokens than a Markdown conversion of the same document, and the model's parser sometimes makes different choices than a converter would.
Chunked + indexed (RAG) is the right answer for repeated queries over a large corpus. You convert your documents to Markdown, chunk them, embed each chunk, store the embeddings in a vector database, and at query time retrieve only the chunks relevant to the question. The per-query token cost is the lowest of the four options. The setup cost is much higher. See the end-to-end PDF-to-LLM workflow for details.
Why Markdown wins for most workflows
Several reasons compound:
- Headings encode hierarchy. A
## Methodssection followed by a### Sample selectionsubsection tells the model where it is in the document. When you ask "what sample size did the authors use?", the model can locate the right section instead of scanning the whole document. - Lists preserve as lists. A bulleted list of side effects, a numbered list of steps, an ordered set of references — all keep their structure in Markdown and read as discrete items rather than collapsed prose.
- Tables stay structured. Pipe-syntax tables (
| col1 | col2 |) survive the round-trip from PDF to converter to model. The same data in plain text collapses into space-separated runs that the model has to disentangle. - Code blocks stay intact. Technical PDFs (API references, configuration guides) often have code snippets. Markdown fenced code blocks preserve them; plain text mashes indentation and breaks the snippet.
- Token overhead is negligible. Markdown adds about 3% over plain text on typical documents. HTML, by contrast, adds 20% or more.
When NOT to use Markdown
A few situations where the conversion isn't worth it:
- Pure prose with no structure. A novel chapter, a press release, a single-paragraph status update — converting to Markdown changes nothing meaningful, since there's no structure to preserve.
- Documents where visual layout is the point. Floorplans, infographics, design mockups, image-heavy product catalogs. Upload the PDF directly so the model can use vision.
- Strict token budgets near the context limit. If you're working at the edge of a model's context window, every percent counts. Strip to plain text.
- Heavy math notation. Neither Markdown nor plain text round-trips LaTeX cleanly without special handling. For papers with equations, use Pandoc with a math-aware backend, or screenshot the equation regions and let the model use vision.
Practical workflow
A repeatable pipeline for one-off PDF questions:
- Convert the PDF to Markdown. Use the converter on this site, pymupdf4llm, marker, or any tool that outputs Markdown. For tool comparison, see PDF extraction methods compared.
- Skim the output for obvious problems — broken tables, missing sections, OCR garbles if the source was scanned.
- Spend five minutes fixing the structural issues. Promote any titles that came out as bold text to
# H1. Re-flow broken tables. Strip running headers and footers. - Paste the cleaned Markdown into your model of choice with a brief system instruction: "You are reading a Markdown rendering of a PDF document. Tables use pipe syntax. Cite the section heading when you quote content."
- Ask your questions.
For repeated queries over the same document, save the cleaned Markdown locally so you don't repeat the conversion and cleanup work.
A note on token cost
Real numbers from converting a 25,000-character technical PDF (a SaaS API reference):
- Raw plain text: 6,200 tokens
- Markdown: 6,400 tokens (+3.2%)
- HTML: 7,800 tokens (+26%)
- PDF upload to Claude or GPT-4o: ~17,000 tokens (vision-pipeline overhead)
The structural benefits of Markdown cost almost nothing. The vision-based PDF upload route is over 2.5× the token count of a clean Markdown conversion, which matters when you're paying per-token or when the document is large enough to push against context limits.
Conclusion
For most PDF-to-LLM workflows, Markdown is the right default. Convert once, query many times, get better answers for negligible extra cost. Reach for raw PDF upload only when visual layout carries meaning. Reach for RAG when you'll query the same corpus repeatedly.
If you don't already have a conversion tool, the converter on this site outputs Markdown by default and includes OCR for scanned pages.
← Back to all guides