Converting PDFs to EPUB and Ebook Formats — Reflowable Text from Fixed Pages
A PDF is a fixed page. An EPUB is reflowable text that adapts to any screen, font size, and orientation. Converting between them isn't a format swap — it's a fundamental change in how the content is structured. That's why so many "PDF to EPUB" conversions produce ebooks that are unreadable on a phone: tiny fixed text, broken line wrapping, page numbers stranded mid-sentence.
This guide explains why the conversion is genuinely hard and lays out an approach that produces a real reflowable ebook rather than a PDF in an EPUB wrapper.
Why PDF → EPUB is harder than it sounds
PDF and EPUB have opposite design goals:
- PDF is fixed-layout. Every character sits at a fixed coordinate on a page of fixed size. The whole point is that it looks identical everywhere.
- EPUB is reflowable. Text flows to fill whatever screen it's on. Font size, margins, and line breaks are the reader's choice, not the document's.
To convert properly you have to throw away the page layout and recover the underlying logical structure: which text is a chapter heading, which is body, where paragraphs begin and end, where one chapter stops and the next starts. PDFs don't store any of that; they store positioned glyphs, which is the same root issue behind PDF text that won't copy and reading-order problems.
The naive converters that just embed each PDF page as an image, or dump raw positioned text into one HTML blob, skip this reconstruction entirely. The result technically opens in an e-reader but doesn't reflow — defeating the purpose.
The key insight: convert to Markdown first
The cleanest path from PDF to a good EPUB goes through a structured intermediate format, and Markdown is ideal for it. The pipeline:
PDF → Markdown (recover structure) → EPUB (apply ebook formatting)
Markdown forces the content into logical structure — headings, paragraphs, lists, emphasis — exactly the structure an EPUB needs and exactly what the PDF lost. Once you have clean Markdown, generating a valid, reflowable EPUB is a solved problem. This is the same reason Markdown works as a hub format for so many workflows; see building a PDF-to-LLM workflow for the general pattern.
So the hard part is step one — getting clean, well-structured Markdown — and the rest is mechanical.
Step 1: PDF to clean Markdown
Use a converter that preserves heading hierarchy and paragraph structure (this site's converter produces page-structured Markdown; pymupdf4llm and marker are good library options). Then clean it up, because ebook readers are unforgiving of artifacts:
- Remove running headers and footers. The book title or chapter name repeated at the top/bottom of every page becomes noise scattered through the reflowed text. Strip these.
- Remove page numbers. Stranded page numbers mid-text look broken when the pages no longer exist.
- Fix hyphenation. PDFs hyphenate words at line ends (exam-\nple). In reflowed text these must be rejoined to example, or the hyphen appears mid-line.
- Rejoin paragraphs. A paragraph broken across PDF lines should be one continuous paragraph; a real paragraph break should stay. Distinguishing them is the fiddliest part.
- Verify heading levels. Make sure chapter titles are # or ## so they become navigable chapters, not body text.
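The first two cleanup items can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes you already have the text of each page as a separate string, and the function name strip_page_furniture, the min_share threshold, and the page-number pattern are all hypothetical choices made for illustration. It treats a line as a running header or footer when the same first or last line (ignoring digits) shows up on most pages.

```python
import re
from collections import Counter

# Assumption: a bare page number is a line holding only 1-4 digits,
# optionally written as "Page 12" or "- 12 -".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?-?\s*\d{1,4}\s*-?\s*$", re.IGNORECASE)


def _normalize(line: str) -> str:
    """Collapse whitespace and digits so 'My Book 41' and 'My Book 42' compare equal."""
    return re.sub(r"\d+", "<n>", " ".join(line.split())).lower()


def strip_page_furniture(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove bare page numbers and repeated first/last lines from each page."""
    # Pass 1: drop bare page-number lines everywhere.
    split_pages = [
        [ln for ln in page.splitlines() if not PAGE_NUMBER.match(ln)]
        for page in pages
    ]

    # Pass 2: count how many pages share the same first or last non-blank line.
    edge_counts: Counter = Counter()
    for lines in split_pages:
        content = [ln for ln in lines if ln.strip()]
        if content:
            # A set counts each candidate once per page.
            edge_counts.update({_normalize(content[0]), _normalize(content[-1])})

    threshold = max(2, int(min_share * len(pages)))
    repeated = {text for text, count in edge_counts.items() if count >= threshold}

    # Pass 3: drop an edge line only if it is one of the repeated ones.
    cleaned = []
    for lines in split_pages:
        nonblank = [i for i, ln in enumerate(lines) if ln.strip()]
        edges = {nonblank[0], nonblank[-1]} if nonblank else set()
        kept = [
            ln for i, ln in enumerate(lines)
            if not (i in edges and _normalize(ln) in repeated)
        ]
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

A heuristic like this can misfire: a body line that is only a number (a year on its own line, say) is removed, and chapter titles that differ only by number and sit at the top of most pages would be mistaken for running headers. Spot-check the output before moving on.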
The hyphenation and paragraph-rejoining cleanup is where most of the manual effort goes, and it's worth doing — it's the difference between a polished ebook and an obviously-converted one.
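As a rough illustration of the hyphenation and paragraph-rejoining steps, here is a sketch in Python. It rests on two assumptions that will not hold for every PDF: a hyphen at a line end followed by a lowercase letter is a line-break hyphen (so a genuine compound split across lines, such as well-known, would be wrongly fused), and a blank line marks a real paragraph break. The function names are hypothetical, and the letter check only covers unaccented a-z.

```python
import re


def fix_hyphenation(text: str) -> str:
    """Rejoin words split across lines, e.g. 'exam-' + newline + 'ple' -> 'example'."""
    # Only join when both sides are lowercase ASCII letters, to avoid touching
    # list markers, dashes between numbers, or capitalized names.
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def rejoin_paragraphs(text: str) -> str:
    """Merge hard-wrapped lines into single-line paragraphs.

    Assumption: blank lines separate real paragraphs; every other newline
    is just a PDF line wrap.
    """
    joined = []
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        # Keep headings and list items on their own lines.
        if any(ln.startswith(("#", "- ", "* ")) for ln in lines):
            joined.append("\n".join(lines))
        else:
            joined.append(" ".join(lines))
    return "\n\n".join(joined)


raw = (
    "# Chapter One\n\n"
    "This is an exam-\nple of a paragraph that was\nwrapped by the PDF.\n\n"
    "A second paragraph."
)
clean = rejoin_paragraphs(fix_hyphenation(raw))
print(clean)
```

Run the hyphenation fix first, so the word halves are fused before the remaining line wraps are replaced with spaces.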
Step 2: Markdown to EPUB
With clean Markdown, Pandoc is the standard tool:
pandoc book.md -o book.epub \
--metadata title="My Book" \
--metadata author="Author Name" \
--toc --toc-depth=2 \
--epub-cover-image=cover.jpg
This produces a valid EPUB 3 with:
- A working table of contents generated from your headings
- Reflowable text that adapts to any device
- Embedded cover and metadata
- Proper chapter navigation
--toc-depth controls how deep the navigation goes; --split-level (or --epub-chapter-level in older Pandoc) controls where the book splits into separate chapter files, which affects load performance on e-readers.
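If you script the conversion, it can help to preview which headings will end up in the navigation and to assemble the Pandoc call in one place. The sketch below is one way to do that in Python; heading_outline, pandoc_epub_command, and build_epub are hypothetical helper names, the outline parser only understands #-style headings, and --split-level assumes a recent Pandoc (swap in --epub-chapter-level on older versions).

```python
import re
import subprocess
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*#*\s*$")


def heading_outline(markdown: str, max_depth: int = 2) -> list[tuple[int, str]]:
    """List (level, title) for headings no deeper than max_depth."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        # Skip fenced code blocks so '# comment' lines are not read as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_depth:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


def pandoc_epub_command(
    source: str,
    output: str,
    title: str,
    author: str,
    cover: Optional[str] = None,
    toc_depth: int = 2,
    split_level: int = 1,
) -> list[str]:
    """Build the argument list for the Pandoc command shown above."""
    cmd = [
        "pandoc", source, "-o", output,
        "--metadata", f"title={title}",
        "--metadata", f"author={author}",
        "--toc", f"--toc-depth={toc_depth}",
        f"--split-level={split_level}",
    ]
    if cover:
        cmd.append(f"--epub-cover-image={cover}")
    return cmd


def build_epub(cmd: list[str]) -> None:
    """Run Pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(cmd, check=True)
```

Calling heading_outline on the cleaned Markdown before build_epub gives a quick check that chapter titles really are top-level headings and not body text.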
The alternative: Calibre
Calibre is the other major path and goes PDF → EPUB directly, with a built-in conversion engine and heuristics for detecting chapters and removing headers/footers. It's more convenient (GUI, one step) but the structure recovery is less controllable than the Markdown route. For a quick personal conversion, Calibre is fine; for a clean, distributable ebook, the Markdown-intermediate path gives better results because you can fix the structure before generating the EPUB.
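Calibre's "Heuristic processing" options (in the conversion dialog) help a lot: enable them to unwrap hard-wrapped lines and remove unnecessary hyphens. Running headers and footers are handled separately, with regular expressions in the dialog's "Search & replace" section. Its editor also lets you fix the EPUB after conversion.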
When the PDF is scanned
If the source is a scanned book, you need OCR before any of this — there's no text to restructure until OCR creates it. Run the scan through OCR (see scanned PDF to text), accept that OCR errors will need proofreading, then enter the Markdown pipeline above. For book-length scans, budget real proofreading time: at even 99% character accuracy, a 300-page book has thousands of errors.
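To make the proofreading estimate concrete, here is a back-of-the-envelope calculation in Python. The figure of 1,800 characters per page is an assumption chosen for illustration, not a measurement; real books vary widely.

```python
def expected_ocr_errors(pages: int, chars_per_page: int, accuracy: float) -> int:
    """Estimate the number of wrong characters at a given per-character accuracy."""
    total_chars = pages * chars_per_page
    return round(total_chars * (1.0 - accuracy))


# Assumption: roughly 1,800 characters on a typical page.
errors = expected_ocr_errors(pages=300, chars_per_page=1800, accuracy=0.99)
print(errors)  # 5400
```

Under that assumption, 99% accuracy still leaves about 18 wrong characters on every page.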
What won't convert well
Set expectations. Some PDF content doesn't survive the trip to reflowable text:
- Complex layouts (textbooks with sidebars, magazines) lose their spatial design — there's no reflowable equivalent.
- Tables are awkward in EPUB and on small screens; wide tables become near-unusable (related: PDF tables to Markdown).
- Equations need MathML to reflow properly; many converters rasterize them to images, which don't scale with font size.
- Fixed-position figures float to wherever the reflow puts them, away from their original context.
For heavily-designed content, a fixed-layout EPUB (which preserves the design but doesn't reflow) or simply keeping the PDF may be the honest answer.
Quick reference
- Want a real reflowable ebook? Go PDF → clean Markdown → Pandoc EPUB.
- Quick personal conversion? Calibre with heuristic processing enabled.
- Scanned book? OCR first, then proofread, then the Markdown pipeline.
- Heavy layout, tables, equations? Expect manual work, or consider fixed-layout EPUB.
- The real work is cleaning the Markdown: kill headers/footers/page numbers, fix hyphenation, rejoin paragraphs.
Conclusion
A good PDF-to-EPUB conversion is really a structure-recovery problem wearing a format-conversion costume. Routing through Markdown makes that explicit: it forces the content back into the logical structure the PDF discarded, and from there Pandoc turns out a clean, reflowable ebook in one command.
To get started on step one, the converter here will turn your PDF into structured Markdown — including OCR for scanned pages — giving you the intermediate file the rest of the pipeline needs.