Extracting Academic Papers — Citations, References, Figures, and Equations

Academic papers are the worst common case for PDF extraction. They combine two-column layouts, dense in-line citations, equations that are sometimes images and sometimes typeset text, figures with multi-paragraph captions, footnotes in the gutter, and a bibliography whose structure follows a citation style that varies by journal. Convert one with a generic PDF-to-text tool and you'll get readable prose interleaved with garbage.

This guide focuses on what changes when the source is an academic paper, with concrete techniques for the parts that consistently break.

What you actually want out of an academic paper

Before extracting, know what you're after. The common goals each have a different optimal extraction:

The tool choice and effort budget differ across these. Quick-read extraction is fast and forgiving; replication-grade extraction is slow and demanding.

Picking the right extractor

For most academic-paper extractions, the relevant tools are:

For papers from major venues (arXiv, journal websites) the source PDFs are usually digital-text PDFs and one of the first four tools works. For older scanned papers from archives, an OCR pass is the prerequisite — see scanned PDFs to text.

Two-column layouts

The single most common source of garbled academic extraction: the converter reads across columns instead of down them, producing "left, right, left, right" zigzag prose.

How to spot it: read the first few sentences of the extracted output aloud. If the topic jumps wildly mid-sentence, you have a column-order problem.

How to fix it:
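
One possible approach, sketched here under assumptions rather than taken from any particular tool: if your extractor can return text blocks with bounding boxes, you can re-sort them yourself. The `Block` tuple shape, the `reorder_two_columns` name, and the 0.6 width threshold are all hypothetical choices for illustration. The sketch assumes y grows downward and that the page has at most two columns.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): a bounding box plus its text. Hypothetical shape,
# similar to what many PDF libraries return for text blocks.
Block = Tuple[float, float, float, float, str]


def reorder_two_columns(blocks: List[Block], page_width: float) -> List[str]:
    """Return block texts in reading order for a simple two-column page."""
    mid = page_width / 2
    full_width, left, right = [], [], []
    for block in blocks:
        x0, _, x1, _, _ = block
        if x0 < mid < x1 and (x1 - x0) > 0.6 * page_width:
            full_width.append(block)  # e.g. title or abstract spanning both columns
        elif (x0 + x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)

    def top(block: Block) -> float:
        return block[1]

    left.sort(key=top)
    right.sort(key=top)

    ordered: List[Block] = []
    # Each full-width block splits the page into bands; within a band,
    # read the left column top to bottom, then the right column.
    for wide in sorted(full_width, key=top):
        cut = wide[1]
        ordered += [b for b in left if b[1] < cut]
        ordered += [b for b in right if b[1] < cut]
        ordered.append(wide)
        left = [b for b in left if b[1] >= cut]
        right = [b for b in right if b[1] >= cut]
    ordered += left + right
    return [b[4] for b in ordered]


zigzag = [
    (50, 20, 550, 40, "Title"),
    (50, 60, 290, 100, "L1"),
    (310, 60, 550, 100, "R1"),
    (50, 110, 290, 150, "L2"),
    (310, 110, 550, 150, "R2"),
]
print(reorder_two_columns(zigzag, page_width=600))
# → ['Title', 'L1', 'L2', 'R1', 'R2']
```

This will not cope with three-column pages, figures that float across one and a half columns, or rotated text, so check a few pages by eye before trusting it.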

Handling references

The bibliography is structurally different from the rest of the paper and benefits from being extracted separately and parsed into structured entries.

The pattern that works:

  1. Locate the bibliography section. Look for headings: "References", "Bibliography", "Works Cited", "참고문헌", "Références". Usually the last section of the paper.
  2. Identify entry boundaries. Most styles separate entries by a blank line or a hanging indent. With Markdown output, you can split on \n\n after the References heading and treat each block as one entry.
  3. Parse each entry. Use a tool like anystyle, refextract, or a small LLM call with a structured-output prompt.
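
Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, assuming Markdown-like output where entries are separated by blank lines. The `split_bibliography` name and the heading pattern are hypothetical, and anything that follows the bibliography (an appendix, for example) would be swept in as extra "entries".

```python
import re
from typing import List

# Headings listed in step 1, optionally preceded by Markdown '#' marks
# or a section number such as "7." (an assumption about the output format).
HEADING = re.compile(
    r"^[ \t]*(?:#+[ \t]*)?(?:\d+\.?[ \t]*)?"
    r"(?:References|Bibliography|Works Cited|참고문헌|Références)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_bibliography(markdown: str) -> List[str]:
    """Return one whitespace-normalized string per bibliography entry."""
    matches = list(HEADING.finditer(markdown))
    if not matches:
        return []
    # The bibliography is usually the last section, so use the last heading hit.
    tail = markdown[matches[-1].end():]
    blocks = re.split(r"\n\s*\n", tail)
    # Join wrapped lines (hanging indents) into a single line per entry.
    return [" ".join(block.split()) for block in blocks if block.strip()]


doc = (
    "# Intro\n\nSome text.\n\n## References\n\n"
    "[1] A. Smith. Title one.\n    Journal, 2020.\n\n"
    "[2] B. Jones. Title two. 2021.\n"
)
for entry in split_bibliography(doc):
    print(entry)
```

Each returned string can then be handed to whichever parser you chose in step 3.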

For citation styles, the most common shapes:

For a small batch, an LLM with a prompt like "parse the following bibliography into a JSON list of {authors, year, title, venue, volume, issue, pages, doi}" works well. For thousands of papers, GROBID or anystyle are faster and don't hallucinate.

In-line citations

In-line citations, such as "[12]", "(Smith, 2023)", or "Smith et al., 2023", are noise during reading and signal for citation analysis. The extraction decision depends on your use case:

A pragmatic pass to strip in-line citations:

import re

# Square-bracket numeric citations: [12], [12, 13], [12-14], [12–14]
# (the en dash is included because PDFs often typeset ranges with it)
text = re.sub(r"\[\d+(?:[-–,\s]+\d+)*\]", "", text)

# Author-year parenthetical: (Smith, 2023), (Smith and Jones, 2023)
text = re.sub(r"\([A-Z][a-zA-Z'-]+(?:\s+(?:et al\.|and\s+[A-Z][a-zA-Z'-]+))?,?\s+\d{4}[a-z]?\)", "", text)

Test on a sample before running across the whole corpus; some papers use parentheses for things that look like citations but aren't.
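
One way to run that test is to list what the patterns would remove, with some surrounding context, before changing anything. The sketch below uses patterns in the same style as the ones above. The `preview_removals` helper and its `context` parameter are hypothetical names for illustration.

```python
import re
from typing import List, Tuple

NUMERIC = re.compile(r"\[\d+(?:[-,\s]+\d+)*\]")
AUTHOR_YEAR = re.compile(
    r"\([A-Z][a-zA-Z'-]+(?:\s+(?:et al\.|and\s+[A-Z][a-zA-Z'-]+))?,?\s+\d{4}[a-z]?\)"
)


def preview_removals(text: str, context: int = 30) -> List[Tuple[str, str]]:
    """List (match, surrounding snippet) pairs without modifying the text."""
    hits = []
    for pattern in (NUMERIC, AUTHOR_YEAR):
        for m in pattern.finditer(text):
            start = max(0, m.start() - context)
            end = min(len(text), m.end() + context)
            hits.append((m.group(0), text[start:end]))
    return hits


sample = "Gains were reported [12, 13] on x in [0, 1] (Smith et al., 2023)."
for match, snippet in preview_removals(sample):
    print(repr(match), "in:", repr(snippet))
```

On this sample the numeric pattern also flags the interval "[0, 1]", which is exactly the kind of false positive worth catching before a corpus-wide run.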

Equations

Equations in academic PDFs come in two forms, and the extraction differs:

By tool:

For papers where equations matter (machine learning, physics, theoretical CS), use Nougat or marker. For papers where equations are incidental (most empirical work in many fields), pymupdf4llm with a note that some equations may be missing is fine.

Figures and captions

Figures in PDFs are usually embedded images with separate text for the caption. Extraction has to do three things: extract the image, extract the caption, and link them together.

For papers where figures don't matter (text-only analysis, RAG over prose), strip figure images and keep the captions only — the captions usually contain the figure's main point in compact form.
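
Keeping only the captions can be sketched as a paragraph scan. This is a rough illustration, assuming captions start a paragraph with "Figure N:" or "Fig. N." and that paragraphs are separated by blank lines. The `extract_captions` name is hypothetical, and only the first paragraph of a multi-paragraph caption is captured.

```python
import re
from typing import Dict

# "Figure 3: ..." or "Fig. 3. ..." at the start of a paragraph. Requiring the
# ':' or '.' after the number skips body sentences like "Figure 3 shows ...".
CAPTION_START = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)\s*[:.]\s*(.+)$", re.IGNORECASE)


def extract_captions(text: str) -> Dict[int, str]:
    """Map figure number to caption text (first paragraph of the caption)."""
    captions: Dict[int, str] = {}
    for paragraph in re.split(r"\n\s*\n", text):
        flat = " ".join(paragraph.split())  # undo line wrapping
        m = CAPTION_START.match(flat)
        if m:
            # Keep the first caption seen for each figure number.
            captions.setdefault(int(m.group(1)), m.group(2))
    return captions


page = (
    "Intro paragraph.\n\n"
    "Figure 1: Accuracy versus\nmodel size.\n\n"
    "Figure 1 shows the trend.\n\n"
    "Fig. 2. Ablation results."
)
print(extract_captions(page))
# → {1: 'Accuracy versus model size.', 2: 'Ablation results.'}
```

The figure number is also the natural key for linking a caption back to an extracted image, if you decide to keep the images.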

Footnotes

Footnotes appear at the bottom of a page in the source PDF but are often referenced mid-paragraph. Extraction tools handle them inconsistently:

For LLM input, stripping footnotes entirely is often the right call — they're rarely necessary for understanding and they crowd the context window.

Headers, footers, and page numbers

Nearly every page of an academic paper carries a running title and a page number, and often the journal name, volume, and date as well. These show up in the extracted text as repeated noise.

The patterns that strip them:

For RAG pipelines especially, leaving this boilerplate in is a quiet quality killer — it pollutes embeddings and surfaces in retrieved chunks.
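
One way to remove this boilerplate, sketched under assumptions: if you have the text of each page separately, lines that recur at the top or bottom of most pages are probably running headers or footers. The `strip_running_lines` name and the `edge` and `min_share` defaults are hypothetical choices, not tuned values.

```python
import re
from collections import Counter
from typing import List


def _normalize(line: str) -> str:
    # Collapse digit runs so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages: List[str], edge: int = 2, min_share: float = 0.6) -> List[str]:
    """Drop top/bottom lines that repeat on at least min_share of the pages."""
    counts: Counter = Counter()
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        # A set, so each candidate is counted once per page.
        counts.update({_normalize(ln) for ln in lines[:edge] + lines[-edge:]})
    threshold = max(2, min_share * len(pages))
    boilerplate = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = page.splitlines()
        nonblank = [i for i, ln in enumerate(lines) if ln.strip()]
        edge_idx = set(nonblank[:edge] + nonblank[-edge:])
        # Only edge lines are ever removed; body text in the middle is untouched.
        kept = [
            ln for i, ln in enumerate(lines)
            if not (i in edge_idx and _normalize(ln) in boilerplate)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Journal of Examples 12 (2023)\nFirst page body text.\nMore body.\nAnother line.\n1",
    "Journal of Examples 12 (2023)\nSecond page body text.\nMore content here.\nYet another.\n2",
    "Journal of Examples 12 (2023)\nThird page body.\nStuff.\nThings.\n3",
]
print(strip_running_lines(pages)[0])
```

Because only edge lines are candidates, a sentence that happens to repeat in the body is left alone. A short equation number or list index sitting at the very bottom of several pages could still be removed, so spot-check the output.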

A practical pipeline for a folder of papers

A workflow that holds up:

  1. Filter to text-extractable PDFs. Run pymupdf's text extraction first; pages with no extractable text need OCR.
  2. Run marker or pymupdf4llm. Marker for higher quality, pymupdf4llm for speed.
  3. Strip headers, footers, and in-line citations. Use the patterns above.
  4. Extract and structure the bibliography separately. GROBID for accuracy, an LLM for ergonomics.
  5. Save the cleaned Markdown alongside the original PDF. Always keep the original for re-extraction when your pipeline improves.
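
The triage in step 1 can be sketched as a small routing function. It assumes you already have one string of extracted text per page (for example from a per-page text call in your PDF library). The function names and both thresholds are hypothetical and should be adjusted after looking at your own corpus.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return zero-based indices of pages with little or no extractable text."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def route(page_texts: List[str], max_empty_share: float = 0.1) -> str:
    """Return 'ocr' if too many pages lack text, otherwise 'text'."""
    if not page_texts:
        return "ocr"
    empty = pages_needing_ocr(page_texts)
    return "ocr" if len(empty) / len(page_texts) > max_empty_share else "text"


digital = ["Abstract. We study the extraction of text from papers.", "1 Introduction. Prior work has shown that layout matters."]
scanned = ["", "  \n", ""]
print(route(digital), route(scanned))
# → text ocr
```

Documents routed to "text" continue to step 2; the rest go through OCR first.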

For a small personal library, doing this one paper at a time with the converter on this site plus light manual cleanup is faster than building infrastructure. For thousands of papers, invest in marker or GROBID and the per-paper time drops dramatically.
