Guides

Practical guides on PDF text extraction, OCR, Markdown workflows, and feeding documents to large language models. Read these alongside the converter to get more out of your PDFs.

Extracting Structured JSON Data from PDFs — Schemas, Tools, and Validation
2026-06-06 · 6 min read

How to turn unstructured PDF content into clean JSON. Schema design, regex vs layout vs LLM extraction, and validating the output you get back.
Fixing Reading Order in Multi-Column and Magazine PDFs
2026-06-05 · 5 min read

Why two-column papers and magazine layouts come out scrambled when converted, how reading order detection works, and how to get coherent text back.
Measuring and Improving OCR Accuracy — CER, WER, and What "95%" Really Means
2026-06-04 · 5 min read

How OCR accuracy is actually measured, why character and word error rates differ, how to benchmark a tool on your own documents, and how to push accuracy up.
Extracting Images and Figures from PDFs — Embedded Bitmaps vs Rendered Pages
2026-06-03 · 5 min read

How to pull images, charts, and figures out of a PDF. The difference between embedded image extraction and page rendering, resolution gotchas, and tools.
Extracting Data from Invoice PDFs at Scale — Fields, Tools, and Accuracy
2026-06-02 · 5 min read

A practical guide to pulling structured data from invoice PDFs. Which fields to target, prebuilt vs custom models, line-item extraction, and validation.
Converting PDFs to EPUB and Ebook Formats — Reflowable Text from Fixed Pages
2026-06-01 · 5 min read

How to turn a fixed-layout PDF into a reflowable EPUB that works on e-readers. Why it's hard, the Markdown-as-intermediate approach, tools, and cleanup.
Extracting Bookmarks and Tables of Contents from PDFs
2026-05-31 · 4 min read

How to pull a PDF's outline, bookmarks, and table of contents — the difference between the real outline and a printed TOC, tools, and turning it into Markdown.
Document Chunking Strategies for LLMs and RAG — Fixed, Recursive, and Semantic
2026-05-30 · 6 min read

How to split converted documents into chunks for embeddings and retrieval. Chunk size, overlap, structure-aware splitting, and the mistakes that hurt recall.
Reading, Editing, and Stripping PDF Metadata — Document Info, XMP, and Hidden Data
2026-05-29 · 4 min read

What metadata PDFs carry, how to read and edit it, and how to strip hidden data before sharing. Document info dictionary vs XMP, tools, and privacy risks.
Extracting Tables from PDFs into CSV and Excel — A Practical Workflow
2026-05-25 · 8 min read

How to get tables out of PDFs and into a spreadsheet without losing rows, merging cells, or scrambling columns. Tools, scripts, and post-cleanup techniques.
Building a RAG Pipeline from a PDF Library — Chunking, Embeddings, Retrieval
2026-05-24 · 8 min read

A practical guide to retrieval-augmented generation over PDFs. How to chunk, embed, store, and retrieve documents so an LLM can answer questions over your library.
Cloud OCR Services Compared — AWS Textract, Azure Document Intelligence, Google Document AI
2026-05-23 · 7 min read

A side-by-side look at the three major cloud OCR services. Accuracy on real documents, table and form support, pricing, and how to pick for your workload.
Multi-Language OCR — Handling Non-English and Mixed-Script Documents
2026-05-22 · 7 min read

How to OCR documents in non-English languages and mixed-script pages. Tesseract language packs, vision models, right-to-left scripts, and CJK gotchas.
Extracting Academic Papers — Citations, References, Figures, and Equations
2026-05-21 · 7 min read

A focused guide on getting research papers out of PDF into clean text. Handling references, in-line citations, equations, figures, and two-column layouts.
Image Preprocessing for OCR — DPI, Deskew, Contrast, Binarization
2026-05-20 · 6 min read

The image-prep steps that determine OCR accuracy. Resolution, deskew, denoise, binarization, and which transformations help vs hurt for each engine.
PDF Accessibility — Tagged PDFs, Screen Readers, and Making Output Inclusive
2026-05-19 · 8 min read

How PDFs work for screen readers, what "tagged PDF" means, and how to convert PDFs into accessible Markdown and HTML that work for everyone.
PDF Redaction Done Right — Why Black Boxes Aren't Enough
2026-05-18 · 7 min read

How to redact sensitive content in PDFs so it actually disappears. Why visual blackouts leak, proper redaction tools, and verification techniques.
Extracting Data from PDF Forms — AcroForms, XFA, and Scanned Forms
2026-05-17 · 8 min read

How PDF forms work, why some are easy to extract and others impossible, and the workflows for each kind. AcroForms, XFA, and OCR-based form extraction.
Extracting Highlights, Notes, and Comments from Annotated PDFs
2026-05-16 · 7 min read

How to pull annotations out of a PDF — highlights, sticky notes, comments, and underlines — into Markdown, JSON, or a notes app. Tool comparison and scripts.
How to Convert Scanned PDFs to Searchable Text — A Complete Guide
2026-05-14 · 8 min read

Step-by-step guide to extracting text from scanned PDFs. When OCR is needed, tool comparison, accuracy tips, and post-processing techniques.
Markdown vs Plain Text — Which Format Should You Feed ChatGPT and Claude?
2026-05-13 · 6 min read

A practical comparison of Markdown, plain text, and other formats for feeding PDFs to large language models. Token usage, accuracy, and structural fidelity.
Tesseract OCR Explained — Strengths, Weaknesses, and Tuning Tips
2026-05-12 · 6 min read

A practical guide to Tesseract OCR. How the engine works, where it shines, where it fails, and how to get better results from it on real documents.
Converting Research Papers to Markdown for Obsidian, Notion, and Logseq
2026-05-11 · 6 min read

A practical workflow for getting PDFs out of your downloads folder and into your note-taking app of choice. Covers conversion, cleanup, linking, and search.
PDF Text Extraction Methods Compared — PyMuPDF, pdfplumber, pdfminer, and OCR
2026-05-10 · 6 min read

A side-by-side comparison of the major Python PDF extraction libraries plus OCR fallback. Output quality, speed, table handling, and when to use each.
Why Your PDF Text Won't Copy — Encoding, Fonts, and Image-Only Pages Explained
2026-05-09 · 6 min read

Why some PDFs let you copy text and others don't. Covers font encoding, ToUnicode maps, image-only pages, and how to fix the most common cases.
OCRing Handwritten Documents — Workflow, Tools, and Realistic Accuracy
2026-05-08 · 6 min read

How to extract text from handwritten notes, letters, and journals. Why traditional OCR fails, when vision models succeed, and how to get usable output.
The Best Markdown Editors in 2026 for Working with Converted PDF Content
2026-05-07 · 5 min read

A practical roundup of Markdown editors built for handling long-form converted documents. Features, performance, sync, and which to pick for which workflow.
How to Bulk-Convert a Folder of PDFs — CLI, Scripts, and Batch Workflows
2026-05-06 · 5 min read

Practical scripts and tools for converting hundreds or thousands of PDFs in one pass. Concurrency, error handling, and watching directories for new files.
Preserving Tables When Converting PDF to Markdown — Why It's Hard and How to Fix It
2026-05-05 · 6 min read

A deep dive on extracting tables from PDFs into Markdown. Why most converters mangle them, which tools do it well, and manual cleanup techniques.
Building an End-to-End PDF-to-LLM Workflow for Research and Knowledge Work
2026-05-04 · 7 min read

A complete pipeline for turning a folder of PDFs into a searchable, queryable knowledge base. Conversion, chunking, embeddings, retrieval, and prompting.
PDF Privacy and Security — What Happens to Your Document When You Convert It Online
2026-05-03 · 6 min read

An honest look at the privacy implications of online PDF converters. What gets logged, who sees your files, and how to make safe choices for sensitive documents.