Markdown vs Plain Text — Which Format Should You Feed ChatGPT and Claude?

You have a 40-page PDF and you want to ask Claude or ChatGPT questions about it. You can paste it as plain text, paste it as Markdown, upload the PDF directly, or chunk it and index it for retrieval. Most people pick the first option that seems to work. But the format you choose changes answer quality, token cost, and what the model actually sees on the other end.

This article compares the realistic options on the same document and explains which one to pick for which job.

Why format matters more than you'd think

LLMs read documents as token streams — flat sequences of subword pieces. Structure that's obvious to a human (headings that organize sections, lists that group related items, tables that align columns) has to be encoded into the text itself or it gets lost.

A model given a wall of plain text has to infer document structure on the fly. Sometimes it gets it right; often it doesn't. A common failure mode: the model treats a numbered list embedded in prose as the only thing that matters and ignores surrounding paragraphs that contain the actual answer. The same content in Markdown gives the model an explicit map of the document — H2 sections, ordered lists, pipe-syntax tables — and the model uses that map when deciding which parts to attend to.
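To make the "explicit map" idea concrete, here is a minimal sketch that pulls the heading outline out of a Markdown string with a regular expression. It only illustrates that the structure is recoverable from the text itself; it makes no claim about how a model processes Markdown internally. The `outline` function and the sample document are hypothetical.

```python
import re

# ATX-style heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every heading outside fenced code."""
    result = []
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on code fences so '#' comments inside code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            result.append((len(match.group(1)), match.group(2)))
    return result


# Hypothetical sample document, used only for illustration.
sample = """# Refund policy

## Eligibility

1. Purchased within 30 days
2. Unused

## Exceptions

Gift cards are final sale.
"""

for level, title in outline(sample):
    print("  " * (level - 1) + title)
```

The same content pasted as plain text would have no `#` markers, so "Eligibility" and "Exceptions" would be indistinguishable from ordinary short lines.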

This matters more in 2026 than it did two years ago. The current generation of frontier models (GPT-4o, Claude 4.x, Gemini 2.x) has seen large amounts of Markdown-formatted text during training. These models tend to treat Markdown much as they treat code: as a structured document, not as decorated prose.

The four practical options compared

| Format | Token cost | Structure preserved | Setup effort | Best when |
| --- | --- | --- | --- | --- |
| Plain text | Lowest | None | None | Short notes, single-section content |
| Markdown | Small overhead (~3%) | Headings, lists, tables, emphasis | Convert once | Most documents |
| Raw PDF upload | Variable, often 2–3× higher | Yes (model parses visually) | None | Layout carries meaning |
| Chunked + indexed (RAG) | Lowest per query | Depends on chunker | Significant | Repeated queries over large corpus |

Plain text is the path of least resistance. Copy-paste from a PDF, drop into the chat, ask a question. It works for short, structurally simple content — a memo, a single email, a few paragraphs from a longer document. It falls down on anything with headings, lists, or tables: the model loses the document's shape and starts guessing.

Markdown is the strongest default for converted PDF content. Headings give the model an explicit navigation map. Lists stay as lists. Tables stay structured. Code blocks stay intact. The token overhead is small: Markdown adds roughly 3% to a plain-text baseline because the syntax is compact (# H1, - item, |cell|). For most PDF-to-LLM workflows, this is the right format.
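If you want to check the overhead on your own documents, a rough sketch like the one below can help. It strips the most common Markdown syntax and compares character counts. Two assumptions are worth stating loudly: character count is only a proxy for token count, and `strip_markdown` is a crude, hypothetical helper, not a full Markdown parser. Short, syntax-heavy samples like the one here will show a far larger percentage than a long, prose-heavy document.

```python
import re


def strip_markdown(md: str) -> str:
    """Crudely remove common Markdown syntax, leaving the bare text."""
    lines = []
    for line in md.splitlines():
        # Drop table separator rows such as |---|---| (and horizontal rules).
        if "-" in line and re.fullmatch(r"\s*\|?[\s:|-]+\|?\s*", line):
            continue
        line = re.sub(r"^\s*#{1,6}\s+", "", line)  # heading markers
        line = re.sub(r"^\s*[-*+]\s+", "", line)   # bullet markers
        line = line.replace("|", " ")              # table pipes
        line = re.sub(r"\*\*|`", "", line)         # bold and inline-code marks
        lines.append(" ".join(line.split()))       # collapse leftover spacing
    return "\n".join(lines)


def overhead_percent(md: str) -> float:
    """Extra characters Markdown adds over the stripped text, in percent."""
    md = md.strip()
    plain = strip_markdown(md)
    if not plain:
        raise ValueError("document has no text content")
    return 100.0 * (len(md) - len(plain)) / len(plain)


# Hypothetical sample, used only for illustration.
sample = """# Plans

- **Free**: 100 requests per day
- **Pro**: 10,000 requests per day

| Plan | Price |
|------|-------|
| Free | $0 |
| Pro | $20 |
"""

print(f"{overhead_percent(sample):.1f}% extra characters")
```

For a real measurement, replace the character counts with the tokenizer of the model you actually use.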

Raw PDF upload is supported natively by Claude, ChatGPT, and Gemini. You drop the PDF in and the model's own vision pipeline parses it. This is useful when visual layout carries meaning: architectural drawings, design briefs where image positioning matters, multi-column scientific papers with figures interleaved through the text. The cost is roughly 2–3× higher in tokens than a Markdown conversion of the same document, and the model's parser sometimes makes different choices than a converter would.

Chunked + indexed (RAG) is the right answer for repeated queries over a large corpus. You convert your documents to Markdown, chunk them, embed each chunk, store the embeddings in a vector database, and at query time retrieve only the chunks relevant to the question. The per-query token cost is the lowest of the four options. The setup cost is much higher. See the end-to-end PDF-to-LLM workflow for details.
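The chunk, embed, store, retrieve loop described above can be sketched in a few lines. This is a toy version under stated assumptions: chunks are split at Markdown headings, the "embedding" is a plain word-count vector standing in for a real embedding model, and the "vector database" is an in-memory list. All function names and the sample document are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical two-section document.
doc = """# Billing

Invoices are issued on the first day of each month.

# Refunds

Refunds are processed within five business days.
"""

# "Store" step: an in-memory list instead of a vector database.
index = [(chunk, embed(chunk)) for chunk in chunk_by_heading(doc)]
top = retrieve(index, "How long do refunds take?")
print(top[0])
```

Only the retrieved chunk goes into the prompt, which is why the per-query token cost stays low even when the corpus is large. Splitting at headings is one reason to convert to Markdown first: the chunk boundaries follow the document's own sections.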

Why Markdown wins for most workflows

Several reasons compound:

When NOT to use Markdown

A few situations where the conversion isn't worth it:

Practical workflow

A repeatable pipeline for one-off PDF questions:

  1. Convert the PDF to Markdown. Use the converter on this site, pymupdf4llm, marker, or any tool that outputs Markdown. For tool comparison, see PDF extraction methods compared.
  2. Skim the output for obvious problems — broken tables, missing sections, OCR garbles if the source was scanned.
  3. Spend five minutes fixing the structural issues. Promote any titles that came out as bold text to # H1. Re-flow broken tables. Strip running headers and footers.
  4. Paste the cleaned Markdown into your model of choice with a brief system instruction: "You are reading a Markdown rendering of a PDF document. Tables use pipe syntax. Cite the section heading when you quote content."
  5. Ask your questions.
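Steps 3 and 4 can be partly automated. The sketch below assumes you already have the converter's Markdown output as a string. It promotes a bold first line to an H1, drops short lines that repeat verbatim (a common sign of running headers and footers), and assembles the text to paste. The helper names, the repeat threshold of 3, and the 80-character limit are all illustrative assumptions; broken tables still need a manual pass.

```python
import re
from collections import Counter

# The instruction text quoted in step 4.
SYSTEM_INSTRUCTION = (
    "You are reading a Markdown rendering of a PDF document. "
    "Tables use pipe syntax. Cite the section heading when you quote content."
)


def promote_bold_title(md: str) -> str:
    """If the first non-blank line is a standalone **bold** line, make it an H1."""
    lines = md.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        match = re.fullmatch(r"\*\*(.+?)\*\*", line.strip())
        if match:
            lines[i] = "# " + match.group(1)
        break  # only the first non-blank line is considered
    return "\n".join(lines)


def strip_running_lines(md: str, min_repeats: int = 3) -> str:
    """Drop short lines repeated verbatim, which are likely page headers or footers."""
    lines = md.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())

    def is_running(line: str) -> bool:
        text = line.strip()
        return (
            bool(text)
            and counts[text] >= min_repeats
            and len(text) < 80
            # Never drop headings, table rows, or list items.
            and not text.startswith(("#", "|", "-"))
        )

    return "\n".join(line for line in lines if not is_running(line))


def build_prompt(md: str, question: str) -> str:
    """Combine the instruction, the cleaned document, and the question."""
    return f"{SYSTEM_INSTRUCTION}\n\n---\n\n{md}\n\n---\n\nQuestion: {question}"


# Hypothetical converter output with a bold title and a repeated footer.
raw = """**Acme API Reference**

Acme Corp Confidential

## Authentication

Send the key in the Authorization header.

Acme Corp Confidential

## Rate limits

Requests are limited per minute.

Acme Corp Confidential
"""

cleaned = strip_running_lines(promote_bold_title(raw))
print(build_prompt(cleaned, "How do I authenticate?"))
```

Skim the cleaned output before pasting it. A heuristic like this can remove a legitimately repeated line, so it is a first pass, not a replacement for step 2.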

For repeated queries over the same document, save the cleaned Markdown locally so you don't repeat the conversion and cleanup work.

A note on token cost

Real numbers from converting a 25,000-character technical PDF (a SaaS API reference):

The structural benefits of Markdown cost almost nothing. The vision-based PDF upload route is over 2.5× the token count of a clean Markdown conversion, which matters when you're paying per-token or when the document is large enough to push against context limits.
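To project these ratios onto your own documents, a back-of-the-envelope estimate is enough. The sketch below is not a measurement: it assumes roughly 4 characters per token (a common rule of thumb for English text), applies the ~3% Markdown overhead and the 2.5× upload multiplier quoted in this article, and treats the character count as the plain-text length. Actual counts depend on the tokenizer and the document.

```python
CHARS_PER_TOKEN = 4.0         # rough rule of thumb for English text; an assumption
MARKDOWN_OVERHEAD = 0.03      # the ~3% figure from this article
PDF_UPLOAD_MULTIPLIER = 2.5   # the "over 2.5x" figure, used here as a lower bound


def estimate_tokens(plain_chars: int) -> dict[str, int]:
    """Estimate token counts per format from a plain-text character count."""
    plain = plain_chars / CHARS_PER_TOKEN
    markdown = plain * (1 + MARKDOWN_OVERHEAD)
    pdf_upload = markdown * PDF_UPLOAD_MULTIPLIER
    return {
        "plain": round(plain),
        "markdown": round(markdown),
        "pdf_upload": round(pdf_upload),
    }


# Illustrative input: a 25,000-character document.
estimates = estimate_tokens(25_000)
for name, tokens in estimates.items():
    print(f"{name:>10}: ~{tokens} tokens")
```

Multiply the results by your provider's per-token price and by the number of queries to see when the difference starts to matter.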

Conclusion

For most PDF-to-LLM workflows, Markdown is the right default. Convert once, query many times, get better answers for negligible extra cost. Reach for raw PDF upload only when visual layout carries meaning. Reach for RAG when you'll query the same corpus repeatedly.

If you don't already have a conversion tool, the converter on this site outputs Markdown by default and includes OCR for scanned pages.
