Fixing Reading Order in Multi-Column and Magazine PDFs
You convert a two-column academic paper and the output reads like a ransom note: the first line of column one, then the first line of column two, then back to column one. Every sentence is sliced in half and interleaved with a different sentence. The text is all there — it's just in the wrong order.
This is the reading order problem, and it's one of the most common reasons a perfectly text-based PDF produces unusable output. Here's why it happens and how to fix it.
Why reading order breaks
A PDF doesn't store paragraphs or columns. It stores individual characters, each placed at an (x, y) coordinate on the page. The grouping you see — "this is column one, this is column two" — exists only in your visual perception. The file itself has no concept of it.
Worse, the characters aren't necessarily stored in reading order. A PDF's internal content stream can list glyphs in any sequence: the order they were drawn, the order fonts were loaded, or some order the generating software found convenient. A naive extractor that just dumps text in stored order gets whatever the generator happened to do.
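To make the gap between stored order and reading order concrete, here is a toy illustration. The glyph records are made up for the example (not read from a real PDF), and it assumes a coordinate system where `top` grows downward:

```python
# Hypothetical glyph records: (character, x, top), listed in the order
# an imaginary generator happened to draw them. `top` grows downward.
glyphs = [
    ("l", 30, 100),
    ("H", 0, 100),
    ("o", 40, 100),
    ("e", 10, 100),
    ("l", 20, 100),
]

# A naive extractor emits characters in stored order.
stored_order = "".join(ch for ch, _, _ in glyphs)

# A geometry-aware extractor sorts by vertical position, then horizontal.
geometric_order = "".join(
    ch for ch, _, _ in sorted(glyphs, key=lambda g: (g[2], g[1]))
)

print(stored_order)     # lHoel
print(geometric_order)  # Hello
```

Sorting by `(top, x)` is enough for a single line of text. It is also exactly the strategy that interleaves columns once a page has more than one.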
To produce correct output, an extractor has to reconstruct reading order from geometry:
- Group nearby characters into words, words into lines.
- Detect that lines fall into vertical bands (columns).
- Decide the order of those bands (left column fully, then right column).
- Handle elements that break the grid: spanning headlines, figures, captions, pull quotes, footnotes.
Each step is heuristic, and the heuristics fail on real layouts.
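The first three steps can be sketched in plain Python. This is a minimal illustration, not how any particular extractor does it: the `Word` type, the function names, and the tolerance values (`y_tol`, `max_gap`, `x_tol`) are all assumptions chosen for the example, and real pages need tuning.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page


def words_to_lines(words, y_tol=3.0, max_gap=15.0):
    """Group words into (x0, top, text) line fragments.

    Words on the same row (tops within y_tol of the row's first word) are
    joined unless the horizontal gap between them exceeds max_gap, which
    is treated as a column gutter.
    """
    rows = []
    for w in sorted(words, key=lambda w: w.top):
        if rows and abs(w.top - rows[-1][0].top) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])

    fragments = []
    for row in rows:
        row.sort(key=lambda w: w.x0)
        frag = [row[0]]
        for w in row[1:]:
            if w.x0 - frag[-1].x1 > max_gap:
                fragments.append(frag)
                frag = [w]
            else:
                frag.append(w)
        fragments.append(frag)
    return [(f[0].x0, f[0].top, " ".join(w.text for w in f)) for f in fragments]


def lines_to_columns(lines, x_tol=20.0):
    """Cluster line fragments into vertical bands by their left edge."""
    columns = []
    for line in sorted(lines, key=lambda l: l[0]):
        if columns and line[0] - columns[-1][-1][0] <= x_tol:
            columns[-1].append(line)
        else:
            columns.append([line])
    return columns


def reading_order(words):
    """Columns left to right, each column top to bottom."""
    columns = lines_to_columns(words_to_lines(words))
    return [text for col in columns for _, _, text in sorted(col, key=lambda l: l[1])]


sample = [
    Word("Left", 50, 80, 100), Word("one", 85, 105, 100),
    Word("Right", 320, 355, 100), Word("one", 360, 380, 100),
    Word("Left", 50, 80, 115), Word("two", 85, 105, 115),
    Word("Right", 320, 355, 115), Word("two", 360, 380, 115),
]
result = reading_order(sample)
```

The fourth step (spanning headlines, captions, footnotes) is deliberately absent. That is where fixed tolerances like these stop being enough.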
The failure modes you'll see
- Column interleaving. The classic. Lines alternate between columns because the extractor reads strictly left-to-right across the full page width, ignoring the column gap.
- Spanning headers absorbed mid-column. A title that spans both columns gets pulled into whichever column the extractor was reading, dropping it into the middle of a sentence.
- Sidebar bleed. A boxed sidebar or pull quote gets merged into the main flow at the wrong point.
- Caption drift. Figure and table captions land far from their figure, or interrupt a paragraph.
- Footnote injection. Bottom-of-page footnotes get spliced into the body text above them.
- Reversed columns. Right column emitted before left — common in PDFs generated from right-to-left language tools, even for the English text inside.
Recognizing the pattern tells you whether the fix is a setting, a different tool, or a layout model.
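If your extractor reports a position for each line it emits, column interleaving can be spotted numerically. The helper below is a hypothetical diagnostic, not part of any library: it counts how often two consecutive output lines sit at the same height but far apart horizontally. The quarter-page threshold is an arbitrary assumption.

```python
def interleave_ratio(lines, page_width, y_tol=3.0):
    """Fraction of consecutive line pairs that jump sideways on the same row.

    lines: (x0, top) pairs in the order the extractor emitted them.
    A high ratio suggests the extractor is reading across columns.
    """
    if len(lines) < 2:
        return 0.0
    jumps = sum(
        1
        for (xa, ta), (xb, tb) in zip(lines, lines[1:])
        if abs(ta - tb) <= y_tol and abs(xa - xb) > page_width / 4
    )
    return jumps / (len(lines) - 1)


# Output that alternates left/right on every row
interleaved = [(50, 100), (320, 100), (50, 115), (320, 115), (50, 130), (320, 130)]
# The same lines emitted column by column
column_wise = [(50, 100), (50, 115), (50, 130), (320, 100), (320, 115), (320, 130)]

bad = interleave_ratio(interleaved, page_width=600)
good = interleave_ratio(column_wise, page_width=600)
```

Correctly ordered two-column output scores close to zero, because the only sideways move is the single jump from the bottom of the left column to the top of the right, and that jump also changes height.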
How tools handle it, ranked
Reading order quality varies enormously between extractors:
- Layout-model tools (marker, AWS Textract, Azure Document Intelligence, Google Document AI): these run an ML model that detects regions (columns, figures, headers) and orders them explicitly. They are the most reliable on complex layouts because they see the page structure rather than guessing from coordinates. Best results, with cost or install weight as the trade-off.
- pymupdf4llm: uses PyMuPDF's block detection and generally handles clean two-column layouts well. It can still stumble on spanning elements and dense magazine pages.
- pdfplumber: exposes word coordinates and lets you implement column logic yourself; good if you're willing to write the sorting code, mediocre out of the box.
- PyMuPDF (get_text("blocks")): returns text blocks with bounding boxes. Sorting blocks by (column, top) yourself often fixes simple two-column cases in a few lines.
- pdfminer.six: has a LAParams layout analyzer with tunable column detection, but defaults frequently interleave; expect to tune.
- pdftotext (-layout flag): the -layout option preserves visual position using whitespace, which accidentally keeps columns side by side as ASCII; useful for eyeballing but not for clean Markdown.
Fixing it yourself with block coordinates
For simple two-column documents, you don't need a layout model. If your tool can return text blocks with bounding boxes, you can re-sort them. The logic with PyMuPDF:
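```python
import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
page = doc[0]

# Each block is (x0, y0, x1, y1, text, block_no, block_type)
blocks = page.get_text("blocks")
blocks = [b for b in blocks if b[6] == 0]  # keep text blocks, skip image blocks

# Assign each block to a column by the side of the midpoint its left edge falls on
mid_x = page.rect.width / 2
left = [b for b in blocks if b[0] < mid_x]
right = [b for b in blocks if b[0] >= mid_x]

# Read each column top-to-bottom, left column first
ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
text = "\n".join(b[4] for b in ordered)
```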
This handles the common case. To make it robust you'd detect the column boundary dynamically (cluster the x0 values rather than splitting at the midpoint) and special-case full-width blocks (where x1 - x0 is nearly the page width) as spanning headers that interrupt the column flow. But the midpoint split alone rescues a large share of two-column papers.
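Those two refinements might look like the sketch below. It works on plain tuples shaped like PyMuPDF's blocks, `(x0, y0, x1, y1, text)`, so it runs without a PDF. The function names, the `span_ratio` and `min_gap_ratio` thresholds, and the "largest gap between left edges" clustering are illustrative assumptions, and it still assumes at most two columns.

```python
def column_boundary(blocks, page_width, min_gap_ratio=0.2):
    """Find the column split from the largest gap between block left edges.

    Returns infinity (everything is one column) when no gap is wide enough.
    """
    xs = sorted({b[0] for b in blocks})
    if len(xs) < 2:
        return float("inf")
    gap, left_x, right_x = max((b - a, a, b) for a, b in zip(xs, xs[1:]))
    if gap < min_gap_ratio * page_width:
        return float("inf")
    return (left_x + right_x) / 2


def two_column_sort(blocks, boundary):
    """Left column top-to-bottom, then right column top-to-bottom."""
    left = sorted((b for b in blocks if b[0] < boundary), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= boundary), key=lambda b: b[1])
    return left + right


def order_blocks(blocks, page_width, span_ratio=0.7):
    """Order blocks, letting full-width blocks interrupt the column flow."""
    is_wide = lambda b: b[2] - b[0] >= span_ratio * page_width
    spanning = sorted((b for b in blocks if is_wide(b)), key=lambda b: b[1])
    remaining = [b for b in blocks if not is_wide(b)]
    boundary = column_boundary(remaining, page_width)

    ordered = []
    for s in spanning:
        # Finish both columns above the spanning block before emitting it.
        above = [b for b in remaining if b[1] < s[1]]
        remaining = [b for b in remaining if b[1] >= s[1]]
        ordered += two_column_sort(above, boundary) + [s]
    return ordered + two_column_sort(remaining, boundary)


# Off-center layout: the right column starts at x=260, left of the midpoint (300).
page_blocks = [
    (260, 210, 560, 300, "R2"), (40, 100, 240, 200, "L1"),
    (40, 310, 560, 330, "Wide"), (260, 340, 560, 400, "R3"),
    (40, 40, 560, 70, "Title"), (40, 340, 240, 400, "L3"),
    (260, 100, 560, 200, "R1"), (40, 210, 240, 300, "L2"),
]
order = [b[4] for b in order_blocks(page_blocks, page_width=600)]
```

On this sample a fixed midpoint split would put every block in the left column, since the right column starts at x=260. The gap-based boundary lands at x=150 instead and keeps the two columns apart.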
When to reach for a layout model
Hand-rolled column sorting breaks down on:
- Three or more columns (newspapers, dictionaries).
- Irregular grids where column count changes down the page.
- Magazine layouts with text wrapping around figures, pull quotes, and boxed asides.
- Mixed content where tables, figures, and multi-column text share a page.
For any of these, stop fighting coordinates and use a tool with a real layout model. marker is the strongest open-source option; Textract and Azure Document Intelligence are the strongest paid ones. They were built precisely for the cases where geometry heuristics give up. See cloud OCR services compared for the paid options and academic paper extraction for the two-column-paper case specifically.
A note on scanned multi-column pages
If the document is a scan, reading order and OCR interact. OCR engines like Tesseract have their own page segmentation step (the --psm mode) that tries to detect columns before recognizing text. The wrong segmentation mode scrambles a scanned two-column page just as badly as a born-digital one. Tesseract's default (--psm 3, automatic page segmentation) usually detects columns; if it interleaves, the Tesseract OCR guide covers the segmentation modes. Clean preprocessing helps the segmentation step too — see image preprocessing for OCR.
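For experimenting with segmentation modes, a thin wrapper around the `tesseract` command line is usually enough. This is a sketch under assumptions: it presumes the `tesseract` binary is on your PATH, the file names are placeholders, and the helper names are made up for the example.

```python
import subprocess

# A few of Tesseract's page segmentation modes relevant to column layouts
PSM_NOTES = {
    1: "automatic page segmentation with orientation and script detection",
    3: "fully automatic page segmentation, no OSD (default)",
    4: "assume a single column of text of variable sizes",
    6: "assume a single uniform block of text",
}


def tesseract_command(image_path, output_base, psm=3):
    """Build the argument list for one OCR run with an explicit --psm mode."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes run from 0 to 13")
    return ["tesseract", image_path, output_base, "--psm", str(psm)]


def ocr_page(image_path, output_base, psm=3):
    """Run Tesseract; the text is written to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, psm), check=True)


# Example (not executed here): ocr_page("scan_page1.png", "page1", psm=1)
```

Modes 4 and 6 tell Tesseract to treat the image as one column or one block, so on an uncropped two-column scan they are likely to read straight across the gutter. They make more sense after you have cropped each column into its own image.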
Quick reference
- Clean two-column paper, occasional interleaving? Try pymupdf4llm first; if it fails, sort blocks by column yourself.
- Need a few lines of code, not a new tool? PyMuPDF get_text("blocks") + column sort.
- Three+ columns, magazines, figure wrapping? Layout-model tool (marker / Textract / Azure).
- Scanned multi-column? Check Tesseract's page segmentation mode before blaming the extractor.
Conclusion
Scrambled multi-column output is almost never missing data — it's correctly extracted text in the wrong sequence, because the PDF never stored a sequence to begin with. The fix scales with layout complexity: a midpoint column split rescues simple papers, while genuine magazine layouts need a model that detects regions visually.
For straightforward documents, the converter here uses pymupdf4llm, which handles clean two-column layouts; for dense or irregular ones, reach for a layout-aware tool before assuming the content was lost.