Fixing Reading Order in Multi-Column and Magazine PDFs

You convert a two-column academic paper and the output reads like a ransom note: the first line of column one, then the first line of column two, then back to column one. Every sentence is sliced in half and interleaved with a different sentence. The text is all there — it's just in the wrong order.

This is the reading order problem, and it's one of the most common reasons a perfectly text-based PDF produces unusable output. Here's why it happens and how to fix it.

Why reading order breaks

A PDF doesn't store paragraphs or columns. It stores individual characters, each placed at an (x, y) coordinate on the page. The grouping you see — "this is column one, this is column two" — exists only in your visual perception. The file itself has no concept of it.

Worse, the characters aren't necessarily stored in reading order. A PDF's internal content stream can list glyphs in any sequence: the order they were drawn, one text frame at a time, or whatever order the generating software found convenient. A naive extractor that just dumps text in stored order gets whatever the generator happened to do.

To produce correct output, an extractor has to reconstruct reading order from geometry:

  1. Group nearby characters into words, words into lines.
  2. Detect that lines fall into vertical bands (columns).
  3. Decide the order of those bands (left column fully, then right column).
  4. Handle elements that break the grid: spanning headlines, figures, captions, pull quotes, footnotes.

Each step is heuristic, and the heuristics fail on real layouts.
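
Steps 1 to 3 can be sketched in plain Python. This is a minimal illustration, not how any particular extractor works: it assumes words arrive as (x0, y0, x1, y1, text) boxes, and the function names and the thresholds y_tol, max_gap and col_tol are made-up values you would tune per document. Step 4 is deliberately left out.

```python
from typing import List, Tuple

Word = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text


def _merge(seg: List[Word]) -> Word:
    """Collapse a run of adjacent words into one line-level box."""
    return (seg[0][0], min(w[1] for w in seg), seg[-1][2],
            max(w[3] for w in seg), " ".join(w[4] for w in seg))


def words_to_lines(words: List[Word], y_tol: float = 3.0,
                   max_gap: float = 20.0) -> List[Word]:
    """Step 1: group words into rows by top edge, then cut each row
    wherever the horizontal gap is wide enough to be a column gutter."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(w[1] - rows[-1][0][1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    lines: List[Word] = []
    for row in rows:
        row.sort(key=lambda w: w[0])
        seg = [row[0]]
        for w in row[1:]:
            if w[0] - seg[-1][2] > max_gap:  # gutter: start a new line
                lines.append(_merge(seg))
                seg = [w]
            else:
                seg.append(w)
        lines.append(_merge(seg))
    return lines


def order_lines(lines: List[Word], col_tol: float = 30.0) -> List[str]:
    """Steps 2-3: bucket lines into columns by left edge, then read
    each column top-to-bottom, columns left-to-right."""
    columns: List[List[Word]] = []
    for ln in sorted(lines, key=lambda l: l[0]):
        if columns and ln[0] - columns[-1][0][0] <= col_tol:
            columns[-1].append(ln)
        else:
            columns.append([ln])
    return [ln[4] for col in columns for ln in sorted(col, key=lambda l: l[1])]


# Two rows of a two-column page, as a row-by-row dump would see them
words = [
    (50, 100, 80, 110, "Left"), (85, 100, 120, 110, "one"),
    (300, 100, 340, 110, "Right"), (345, 100, 370, 110, "one"),
    (50, 115, 80, 125, "Left"), (85, 115, 120, 125, "two"),
    (300, 115, 340, 125, "Right"), (345, 115, 370, 125, "two"),
]
ordered = order_lines(words_to_lines(words))
print(ordered)  # ['Left one', 'Left two', 'Right one', 'Right two']
```

Every threshold here is a guess about the layout, which is exactly why each step can go wrong: a gutter narrower than max_gap merges the columns, and an indented paragraph wider than col_tol splits one.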

The failure modes you'll see

Recognizing the pattern tells you whether the fix is a setting, a different tool, or a layout model.

How tools handle it, ranked

Reading order quality varies enormously between extractors:

Fixing it yourself with block coordinates

For simple two-column documents, you don't need a layout model. If your tool can return text blocks with bounding boxes, you can re-sort them. The logic with PyMuPDF:

import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
page = doc[0]

# Each block is (x0, y0, x1, y1, text, block_no, block_type).
# block_type 0 is text; skip image blocks, whose "text" is only a placeholder.
blocks = [b for b in page.get_text("blocks") if b[6] == 0]

# Assign each block to a column by its left edge
mid_x = page.rect.width / 2
left  = [b for b in blocks if b[0] < mid_x]
right = [b for b in blocks if b[0] >= mid_x]

# Read each column top-to-bottom, left column first
ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
text = "\n".join(b[4].strip() for b in ordered)

This handles the common case. To make it robust you'd detect the column boundary dynamically (cluster the x0 values rather than splitting at the midpoint) and special-case full-width blocks (where x1 - x0 is nearly the page width) as spanning headers that interrupt the column flow. But the midpoint split alone rescues a large share of two-column papers.
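
The two refinements just described could look roughly like this. It is a sketch under assumptions: blocks use the same (x0, y0, x1, y1, text, ...) tuple layout as above, the helper names are hypothetical, "clustering" is reduced to finding the widest jump between left edges, and span_ratio and min_gap_ratio are illustrative thresholds rather than recommended values.

```python
from typing import List, Optional, Sequence

Block = Sequence  # (x0, y0, x1, y1, text, ...), same layout as the blocks above


def column_split_x(blocks: List[Block], page_width: float,
                   min_gap_ratio: float = 0.2) -> Optional[float]:
    """Return an x that separates two clusters of left edges,
    or None when the page looks single-column."""
    xs = sorted({round(b[0]) for b in blocks})
    if len(xs) < 2:
        return None
    # The widest jump between consecutive left edges marks the boundary
    gap, split = max((hi - lo, (lo + hi) / 2) for lo, hi in zip(xs, xs[1:]))
    return split if gap >= min_gap_ratio * page_width else None


def order_blocks(blocks: List[Block], page_width: float,
                 span_ratio: float = 0.7) -> List[Block]:
    """Read columns left-then-right, restarting after each full-width block."""
    def spans(b: Block) -> bool:
        return (b[2] - b[0]) >= span_ratio * page_width

    split = column_split_x([b for b in blocks if not spans(b)], page_width)
    ordered: List[Block] = []
    band: List[Block] = []  # column blocks seen since the last spanning block

    def flush() -> None:
        by_top = sorted(band, key=lambda b: b[1])
        if split is None:
            ordered.extend(by_top)
        else:
            ordered.extend(b for b in by_top if b[0] < split)
            ordered.extend(b for b in by_top if b[0] >= split)
        band.clear()

    for b in sorted(blocks, key=lambda b: b[1]):
        if spans(b):
            flush()            # finish the columns above the headline
            ordered.append(b)  # then emit the headline itself
        else:
            band.append(b)
    flush()
    return ordered


# Synthetic 600pt-wide page: title, two columns, a mid-page headline, two columns
sample = [
    (50, 40, 550, 70, "Title"),
    (50, 100, 280, 200, "L1"), (320, 100, 550, 200, "R1"),
    (50, 210, 280, 300, "L2"), (320, 210, 550, 300, "R2"),
    (50, 320, 550, 350, "Section"),
    (50, 360, 280, 400, "L3"), (320, 360, 550, 400, "R3"),
]
reading_order = [b[4] for b in order_blocks(sample, page_width=600)]
print(reading_order)
# ['Title', 'L1', 'L2', 'R1', 'R2', 'Section', 'L3', 'R3']
```

Treating each spanning block as a divider keeps text above a mid-page headline from being read after text below it, which a single whole-page column sort would get wrong.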

When to reach for a layout model

Hand-rolled column sorting breaks down on:

For any of these, stop fighting coordinates and use a tool with a real layout model. marker is the strongest open-source option; Textract and Azure Document Intelligence are the strongest paid ones. They were built precisely for the cases where geometry heuristics give up. See cloud OCR services compared for the paid options and academic paper extraction for the two-column-paper case specifically.

A note on scanned multi-column pages

If the document is a scan, reading order and OCR interact. OCR engines like Tesseract have their own page segmentation step (the --psm mode) that tries to detect columns before recognizing text. The wrong segmentation mode scrambles a scanned two-column page just as badly as a born-digital one. Tesseract's default (--psm 3, automatic page segmentation) usually detects columns; if it interleaves, the Tesseract OCR guide covers the segmentation modes. Clean preprocessing helps the segmentation step too — see image preprocessing for OCR.
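
One low-effort way to check whether segmentation is the culprit is to run the same scan through more than one mode and compare the output by eye. The sketch below assumes the tesseract command-line tool is installed and on your PATH; scan.png is a hypothetical file name, and the helper functions are illustrative, not part of any library. Mode 1 is automatic segmentation with orientation and script detection, which needs the OSD data file installed.

```python
import subprocess
from typing import Dict, List, Sequence


def tesseract_cmd(image_path: str, psm: int) -> List[str]:
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_with_modes(image_path: str, modes: Sequence[int] = (3, 1)) -> Dict[int, str]:
    """Run OCR once per page segmentation mode and collect the text."""
    results: Dict[int, str] = {}
    for psm in modes:
        proc = subprocess.run(tesseract_cmd(image_path, psm),
                              capture_output=True, text=True, check=True)
        results[psm] = proc.stdout
    return results


# Usage (requires Tesseract and a real image):
#   for psm, text in ocr_with_modes("scan.png").items():
#       print(f"--psm {psm}:", text[:200])
```

If every mode interleaves the columns, cropping each column to its own image before OCR sidesteps segmentation entirely.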

Quick reference

Conclusion

Scrambled multi-column output is almost never missing data — it's correctly extracted text in the wrong sequence, because the PDF never stored a sequence to begin with. The fix scales with layout complexity: a midpoint column split rescues simple papers, while genuine magazine layouts need a model that detects regions visually.

For straightforward documents, the converter here uses pymupdf4llm, which handles clean two-column layouts; for dense or irregular ones, reach for a layout-aware tool before assuming the content was lost.
