Extracting Academic Papers — Citations, References, Figures, and Equations
Academic papers are the worst common case for PDF extraction. They combine two-column layouts, dense in-line citations, equations that are sometimes images and sometimes typeset text, figures with multi-paragraph captions, footnotes in the gutter, and a bibliography whose structure follows a citation style that varies by journal. Convert one with a generic PDF-to-text tool and you'll get readable prose interleaved with garbage.
This guide focuses on what changes when the source is an academic paper, with concrete techniques for the parts that consistently break.
What you actually want out of an academic paper
Before extracting, know what you're after. The common goals each have a different optimal extraction:
- Quick read — clean prose with sections; you can ignore broken figures and tables.
- Literature review build-out — main text plus the bibliography parsed into structured entries (author, year, title, journal).
- Citation graph — only the bibliography; pair with the paper's own metadata to build a graph of who cites whom.
- Replication of analysis — figures, tables, and the methods section preserved with high fidelity.
- LLM ingestion for Q&A — everything in clean Markdown, ready for a RAG pipeline.
The tool choice and effort budget differ across these. Quick-read extraction is fast and forgiving; replication-grade extraction is slow and demanding.
Picking the right extractor
For most academic-paper extractions, the relevant tools are:
- pymupdf4llm — fast, free, Markdown output, decent column handling. The pragmatic starting point.
- marker — ML-based, much better on complex layouts, slower (10–30 sec/page on CPU). The right pick when quality matters and budget allows.
- GROBID — academic-paper-specialized; outputs TEI XML with explicit fields for title, abstract, authors, sections, references. Best-in-class for structured bibliography extraction.
- Nougat — Meta's academic-paper transformer, optimized for math and tables; outputs Markdown with LaTeX. Slow but produces very clean output on math-heavy papers.
- Cloud OCR services — when the paper is scanned or has been re-distilled in a way that breaks text extraction.
For papers from major venues (arXiv, journal websites) the source PDFs are usually digital-text PDFs and one of the first four tools works. For older scanned papers from archives, an OCR pass is the prerequisite; see the guide on scanned PDFs to text.
Two-column layouts
The single most common source of garbled academic extraction: the converter reads across columns instead of down them, producing "left, right, left, right" zigzag prose.
How to spot it: read the first few sentences of the extracted output aloud. If the topic jumps wildly mid-sentence, you have a column-order problem.
How to fix it:
- pymupdf4llm and marker both handle columns correctly by default. If you're using pymupdf directly (not pymupdf4llm), call `page.get_text(sort=True)` and accept that the heuristic isn't perfect.
- Manually verify on the first page. If column ordering looks correct on page 1, it's likely correct throughout, because the layout is consistent within a paper.
- For pathological multi-column layouts (3+ columns, irregular widths), use marker or fall back to a vision model. The heuristic-based tools fail too often.
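To make "down the columns, not across them" concrete, here is a minimal sketch of a two-column reading-order sort. It assumes text blocks arrive as `(x0, y0, x1, y1, text)` tuples (similar in shape to the block tuples pymupdf can return) and that any block wider than 60% of the page is a full-width element such as the title. The function name `order_two_column` and the 0.6 threshold are illustrative assumptions, not part of any library.

```python
from typing import List, Tuple

Block = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


def order_two_column(blocks: List[Block], page_width: float,
                     full_width_ratio: float = 0.6) -> List[str]:
    """Return block texts in reading order for a two-column page."""
    mid = page_width / 2
    ordered: List[str] = []
    left: List[Block] = []
    right: List[Block] = []

    def flush() -> None:
        # Emit the whole left column, then the whole right column.
        for column in (left, right):
            ordered.extend(b[4] for b in sorted(column, key=lambda b: b[1]))
            column.clear()

    for block in sorted(blocks, key=lambda b: b[1]):  # top to bottom
        x0, _, x1, _, text = block
        if (x1 - x0) > full_width_ratio * page_width:
            # Full-width element (title, wide figure): close the columns above it.
            flush()
            ordered.append(text)
        elif (x0 + x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)
    flush()
    return ordered
```

This is the kind of heuristic that breaks on three or more columns or irregular widths, which is why those cases are better handed to marker or a vision model.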
Handling references
The bibliography is structurally different from the rest of the paper and benefits from being extracted separately and parsed into structured entries.
The pattern that works:
- Locate the bibliography section. Look for headings: "References", "Bibliography", "Works Cited", "참고문헌", "Références". Usually the last section of the paper.
- Identify entry boundaries. Most styles separate entries by a blank line or a hanging indent. With Markdown output, you can split on `\n\n` after the References heading and treat each block as one entry.
- Parse each entry. Use a tool like `anystyle`, `refextract`, or a small GPT call with a structured-output prompt.
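The first two steps can be sketched in a few lines of standard-library Python. This is a rough sketch that assumes Markdown input where the bibliography heading sits on its own line (with or without `#` markers) and entries are separated by blank lines; `split_bibliography` is a hypothetical helper name.

```python
import re
from typing import List

# Heading names taken from the list above; extend for other languages as needed.
_HEADING = re.compile(
    r"^#{0,6}[ \t]*(References|Bibliography|Works Cited|참고문헌|Références)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_bibliography(markdown: str) -> List[str]:
    """Return bibliography entries found after the last matching heading."""
    matches = list(_HEADING.finditer(markdown))
    if not matches:
        return []
    tail = markdown[matches[-1].end():]
    # Blank lines separate entries; collapse wrapped lines inside each entry.
    blocks = re.split(r"\n\s*\n", tail)
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

Everything after the heading is treated as bibliography, so a paper with appendices after its references would need an extra cut-off. Each returned string can then go to whichever parser you chose for the third step.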
For citation styles, the most common shapes:
- APA — Author, A. A. (Year). Title. Journal, volume(issue), pages.
- MLA — Author, A. A. "Title." Journal, vol. X, no. Y, Year, pp. Z.
- Chicago/Turabian — Notes vs. author-date variants. Notes style uses footnotes; the bibliography parser can usually ignore those.
- IEEE — [N] A. A. Author, "Title," Journal, vol. X, no. Y, pp. Z, Year.
- Vancouver — N. Author A, Author B. Title. Journal. Year;volume(issue):pages.
For a small batch, an LLM with a prompt like "parse the following bibliography into a JSON list of {authors, year, title, venue, volume, issue, pages, doi}" works well. For thousands of papers, GROBID or anystyle are faster and don't hallucinate.
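When a whole corpus uses one known style, a plain regex is a third option alongside an LLM or GROBID. The sketch below covers only the IEEE shape listed above and assumes every field (volume, issue, pages, year) is present. The pattern and the name `parse_ieee_entry` are illustrative, not taken from anystyle or GROBID.

```python
import re
from typing import Dict, Optional

# Matches the IEEE shape:
# [N] A. A. Author, "Title," Journal, vol. X, no. Y, pp. Z, Year.
_IEEE = re.compile(
    r'^\[(?P<number>\d+)\]\s+'
    r'(?P<authors>.+?),\s+'
    r'["“](?P<title>.+?),?["”],?\s+'
    r'(?P<venue>.+?),\s+'
    r'vol\.\s*(?P<volume>\w+),\s+'
    r'no\.\s*(?P<issue>\w+),\s+'
    r'pp\.\s*(?P<pages>[\d\u2013-]+),\s+'
    r'(?P<year>\d{4})\.?$'
)


def parse_ieee_entry(entry: str) -> Optional[Dict[str, str]]:
    """Return the entry's fields, or None if it does not fit the pattern."""
    match = _IEEE.match(entry.strip())
    return match.groupdict() if match else None
```

Entries that come back as `None` (conference papers, books, missing issue numbers) are the ones worth routing to a more general parser.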
In-line citations
In-line citations ([12], (Smith, 2023), Smith et al., 2023) are noise during reading and signal for citation analysis. The extraction decision depends on your use case:
- For reading or LLM input: keep them, but make sure they survive extraction without getting mangled. `[12, 13, 14]` should stay as one token, not get split across columns.
- For citation analysis: extract them as structured references that link to the bibliography entries. GROBID does this natively; with other tools you'll write a regex pass.
- For RAG retrieval: strip them. The embedding model wastes capacity on the citation noise, and "Smith et al., 2023" doesn't help semantic search.
A pragmatic pass to strip in-line citations:
```python
import re

# Square-bracket numeric citations: [12], [12, 13], [12-14], [12–14]
# The leading [ \t]* also removes the space left in front of the citation.
text = re.sub(r"[ \t]*\[\d+(?:[-–,\s]+\d+)*\]", "", text)

# Author-year parenthetical: (Smith, 2023), (Smith et al., 2023), (Smith and Jones, 2023)
text = re.sub(r"[ \t]*\([A-Z][a-zA-Z'-]+(?:\s+(?:et al\.|and\s+[A-Z][a-zA-Z'-]+))?,?\s+\d{4}[a-z]?\)", "", text)
```
Test on a sample before running across the whole corpus; some papers use parentheses for things that look like citations but aren't.
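For the citation-analysis case, the regex pass runs the other way: instead of deleting numeric citations, collect the bibliography numbers they point to. A minimal sketch, assuming square-bracket numeric citations and treating `12-14` (hyphen or en dash) as an inclusive range; `cited_numbers` is a hypothetical name.

```python
import re
from typing import List

_NUMERIC = re.compile(r"\[(\d+(?:\s*[-\u2013,]\s*\d+)*)\]")


def cited_numbers(text: str) -> List[int]:
    """Return every bibliography number cited in text, expanding ranges."""
    numbers: List[int] = []
    for match in _NUMERIC.finditer(text):
        for part in match.group(1).split(","):
            bounds = [int(n) for n in re.split(r"[-\u2013]", part)]
            if len(bounds) == 2:
                # "12-14" means 12, 13, and 14.
                numbers.extend(range(bounds[0], bounds[1] + 1))
            else:
                numbers.extend(bounds)
    return numbers
```

The returned numbers index into the parsed bibliography, which is what turns in-line markers into edges of a citation graph. Author-year styles need a separate matching step against author names and years.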
Equations
Equations in academic PDFs come in two forms, and the extraction differs:
- Image equations — common in older papers and some journal templates. The equation is a rendered image embedded in the page; extraction requires either OCR or a math-aware vision model.
- Typeset equations — the equation is positioned glyphs (sometimes from MathML, sometimes from STIX or Computer Modern fonts). Extraction tools handle these unevenly.
By tool:
- pymupdf4llm — extracts typeset equations as text-with-spaces, which is unusable. Drops image equations entirely.
- marker — converts both kinds to LaTeX. Quality is high for typeset; variable for image.
- Nougat — purpose-built for equation extraction; outputs LaTeX for both kinds with high accuracy.
- Vision models — given the right prompt ("transcribe this equation as LaTeX"), GPT-4o and Claude produce reliable LaTeX for both kinds.
For papers where equations matter (machine learning, physics, theoretical CS), use Nougat or marker. For papers where equations are incidental (most empirical work in many fields), pymupdf4llm with a note that some equations may be missing is fine.
Figures and captions
Figures in PDFs are usually embedded images with separate text for the caption. Extraction has to do three things: extract the image, extract the caption, and link them together.
- Image extraction: pymupdf lists a page's embedded images with `page.get_images()`, and `doc.extract_image(xref)` returns the bytes of each one so you can write it to disk. Note that this gets you the raw image, not the figure-with-caption.
- Caption extraction: look for paragraphs starting with "Figure N", "Fig. N", "Table N". Match by figure number to link back to the image.
- Linking: save the image as `figure-N.png` and store its caption with a reference. Most Markdown-aware extractors do this automatically with reasonable defaults.
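The caption step can be sketched as a small lookup builder. It assumes the extracted text has already been split into paragraphs and that a caption is any paragraph that begins with "Figure N", "Fig. N", or "Table N"; `collect_captions` is a hypothetical helper, not part of any extractor.

```python
import re
from typing import Dict, List, Tuple

_CAPTION = re.compile(r"^(Figure|Fig\.|Table)\s+(\d+)[.:]?\s*(.*)", re.DOTALL)


def collect_captions(paragraphs: List[str]) -> Dict[Tuple[str, int], str]:
    """Map ("figure" or "table", N) to the caption text of that item."""
    captions: Dict[Tuple[str, int], str] = {}
    for paragraph in paragraphs:
        match = _CAPTION.match(paragraph.strip())
        if not match:
            continue
        kind = "table" if match.group(1) == "Table" else "figure"
        text = " ".join(match.group(3).split())  # collapse wrapped lines
        # Keep the first hit per number; later hits are often body-text mentions.
        captions.setdefault((kind, int(match.group(2))), text)
    return captions
```

With this mapping, the caption for `figure-3.png` is `captions.get(("figure", 3))`. A body paragraph that happens to open with "Figure 3 shows" would be picked up as a caption, so spot-check the results on a sample.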
For papers where figures don't matter (text-only analysis, RAG over prose), strip figure images and keep the captions only — the captions usually contain the figure's main point in compact form.
Footnotes
Footnotes appear at the bottom of a page in the source PDF but are often referenced mid-paragraph. Extraction tools handle them inconsistently:
- pymupdf — inserts footnotes inline at their position on the page, which usually means mid-sentence in the body text. Disorienting in the output.
- pymupdf4llm — tends to push footnotes to the end of the page chunk. Better but still surprising in places.
- marker — handles footnotes correctly, with markers in the body text linking to footnote text at the end of the section.
- GROBID — same as marker; structures footnotes explicitly.
For LLM input, stripping footnotes entirely is often the right call — they're rarely necessary for understanding and they crowd the context window.
Headers, footers, and page numbers
Every page of an academic paper has at least the running title and page number, often the journal name, volume, and date. These show up in the extracted text as repeated noise.
The patterns that strip them:
- Detect repetition. Find lines that appear on more than 70% of pages; they're almost always boilerplate.
- Position-based filtering. pymupdf can give you bounding boxes; lines in the top or bottom 5% of the page are usually headers/footers.
- Manual templates. For a specific journal you process often, hard-code the header/footer patterns. Faster and more accurate than heuristics.
For RAG pipelines especially, leaving this boilerplate in is a quiet quality killer — it pollutes embeddings and surfaces in retrieved chunks.
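The repetition heuristic is straightforward to sketch with the standard library. This version assumes you have one text string per page, and it adds one assumption of its own: digits are masked before counting, so "Page 3" and "Page 4" count as the same line. `strip_repeated_lines` is a hypothetical name.

```python
import re
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[str], threshold: float = 0.7) -> List[str]:
    """Drop lines that recur on more than `threshold` of the pages."""
    def normalize(line: str) -> str:
        # Mask digits so running page numbers collapse to one pattern.
        return re.sub(r"\d+", "#", line.strip())

    counts: Counter = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(l) for l in page.splitlines() if l.strip()})

    boilerplate = {line for line, n in counts.items() if n / len(pages) > threshold}
    return [
        "\n".join(l for l in page.splitlines() if normalize(l) not in boilerplate)
        for page in pages
    ]
```

Digit masking can also catch body lines that consist only of numbers, such as table rows, so on table-heavy papers it may be safer to combine this with the position-based filter.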
A practical pipeline for a folder of papers
A workflow that holds up:
- Filter to text-extractable PDFs. Run pymupdf's text extraction first; pages with no extractable text need OCR.
- Run marker or pymupdf4llm. Marker for higher quality, pymupdf4llm for speed.
- Strip headers, footers, and in-line citations. Use the patterns above.
- Extract and structure the bibliography separately. GROBID for accuracy, an LLM for ergonomics.
- Save the cleaned Markdown alongside the original PDF. Always keep the original for re-extraction when your pipeline improves.
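The first step of this workflow reduces to a per-page check. The sketch below assumes you already have the extracted text of each page as a string (for example from pymupdf's `page.get_text()`). The 25-character cut-off and the name `pages_needing_ocr` are illustrative assumptions to tune on your own corpus.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 25) -> List[int]:
    """Return 0-based indices of pages with too little extractable text."""
    return [
        index
        for index, text in enumerate(page_texts)
        # Ignore whitespace so a page of blank lines still counts as empty.
        if len("".join(text.split())) < min_chars
    ]
```

An empty result means the PDF can go straight to marker or pymupdf4llm; a non-empty one means those pages need the OCR pass first.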
For a small personal library, doing this one paper at a time with the converter on this site plus light manual cleanup is faster than building infrastructure. For thousands of papers, invest in marker or GROBID and the per-paper time drops dramatically.