How to Bulk-Convert a Folder of PDFs — CLI, Scripts, and Batch Workflows

Converting one PDF in the browser is fine. Converting 500 is a different problem entirely. This guide covers practical approaches for batch jobs: command-line tools, shell pipelines, Python scripts, and watched-folder setups — with realistic guidance on concurrency and error handling so you don't kill your laptop or burn through your API quota.

When you actually need to bulk-convert

A few realistic scenarios:

The constants across all of these: you want automation, you need failure recovery, and you don't want to babysit it.

The simplest approach: pdfs2txt CLI

The pdfs2txt.py script handles a folder of PDFs in one invocation:

python pdfs2txt.py /path/to/pdfs/ -o /path/to/output/

Add --image-processor tesseract for OCR on scanned pages. Output is one .md file per input PDF, preserving the original filename.

This is the right tool when:

The limits show up at scale: no built-in parallelism, no resume-on-failure, no per-file error logging. Fine for a few hundred PDFs; less fine for 10,000.

Adding parallelism

For N CPU cores, you can run N conversions in parallel with GNU parallel:

find /path/to/pdfs -name "*.pdf" | \
  parallel -j 4 python pdfs2txt.py {} -o /path/to/output/

Pick -j based on CPU cores and memory: each pymupdf process uses 200–500 MB on a medium PDF. Don't over-parallelize OCR jobs. Tesseract is CPU-heavy, and four parallel Tesseract jobs already saturate a four-core machine.

For cloud-API-based OCR (OpenAI, Gemini), parallelism is bounded by API rate limits, not local cores. Most providers give 50–500 requests per minute on free tiers and higher limits on paid plans. Start with -j 4 and watch for 429 responses before increasing.
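One common way to react to 429 responses is to retry with exponential backoff. The sketch below is provider-agnostic and makes assumptions: RateLimited is a hypothetical exception that your own API wrapper would raise on a 429, and call_with_backoff is an illustrative helper, not part of any SDK.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception: raise it when the API answers HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries: let the caller log the failure
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

If the provider returns a Retry-After header, honoring it is usually a better choice than a computed delay.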

A Python script for finer control

A complete script that walks a directory tree, processes PDFs in parallel, skips already-converted files (resumable on restart), and logs results per file:

import logging
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import pymupdf4llm

logging.basicConfig(
    filename="conversion.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


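def convert_one(pdf_path: Path, output_dir: Path) -> tuple[Path, str | None]:
    """Convert one PDF; return (path, None) on success, (path, "skipped") if the
    output already exists, or (path, error message) on failure."""
    out_path = output_dir / pdf_path.with_suffix(".md").name
    if out_path.exists():
        return pdf_path, "skipped"
    try:
        md = pymupdf4llm.to_markdown(str(pdf_path))
        # Atomic write: write to .tmp first, then move into place.
        # replace() overwrites an existing target on every platform; rename() raises on Windows.
        tmp_path = out_path.with_suffix(".tmp")
        tmp_path.write_text(md, encoding="utf-8")
        tmp_path.replace(out_path)
        return pdf_path, None
    except Exception as e:
        return pdf_path, f"{type(e).__name__}: {e}"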


def main(input_dir: Path, output_dir: Path, workers: int = 4) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    pdf_files = list(input_dir.rglob("*.pdf"))
    logging.info("Converting %d files with %d workers", len(pdf_files), workers)

    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert_one, p, output_dir): p for p in pdf_files}
        for fut in as_completed(futures):
            pdf, err = fut.result()
            if err is None:
                logging.info("converted %s", pdf.name)
            elif err == "skipped":
                logging.info("skipped %s (output exists)", pdf.name)
            else:
                logging.error("failed %s: %s", pdf.name, err)


if __name__ == "__main__":
    import sys
    main(Path(sys.argv[1]), Path(sys.argv[2]))

This script handles the three things the simple CLI doesn't: parallelism, resume-on-restart (via the existence check), and per-file error logging. If a process crashes mid-batch, restart the script and it picks up where it left off.
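One caveat: the script names each output after the PDF's file name only, while rglob descends into subdirectories. Two PDFs with the same name in different folders therefore map to the same output file, and the second is reported as skipped. A possible fix, sketched here with a hypothetical helper, is to mirror the input tree under the output directory.

```python
from pathlib import Path


def output_path_for(pdf_path: Path, input_dir: Path, output_dir: Path) -> Path:
    """Mirror the PDF's location under input_dir inside output_dir, with a .md suffix."""
    relative = pdf_path.relative_to(input_dir)
    return output_dir / relative.with_suffix(".md")


if __name__ == "__main__":
    a = output_path_for(Path("in/2023/report.pdf"), Path("in"), Path("out"))
    b = output_path_for(Path("in/2024/report.pdf"), Path("in"), Path("out"))
    print(a)  # the 2023 report lands under out/2023/
    print(b)  # the 2024 report lands under out/2024/
```

To use it, convert_one would need input_dir as an extra argument and would have to call out_path.parent.mkdir(parents=True, exist_ok=True) before writing.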

Watching a folder for new files

For pipelines where PDFs arrive continuously, use watchdog (Python) or entr (shell) to trigger conversion on new files.

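A quick entr setup. entr only watches the files it is given at startup, so the -d flag (exit when a file is added to a watched directory) and the surrounding loop are needed to pick up new PDFs:

while sleep 1; do ls /inbox/*.pdf | entr -d python pdfs2txt.py /inbox/ -o /processed/; done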

For production, a small Python daemon using watchdog is more robust:

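import os
import subprocess
import sys
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class PDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Ignore directories and anything that is not a PDF
        if event.is_directory or not event.src_path.lower().endswith(".pdf"):
            return
        # Wait for the file to finish copying
        self.wait_for_stable_size(event.src_path)
        subprocess.run([sys.executable, "pdfs2txt.py", event.src_path, "-o", "/processed/"])

    def wait_for_stable_size(self, path, stable_for=2.0):
        # Poll until two consecutive reads, stable_for seconds apart, report the same size
        last = -1
        while True:
            size = os.path.getsize(path)
            if size == last:
                return
            last = size
            time.sleep(stable_for)


if __name__ == "__main__":
    # Watch /inbox/ (top level only) until interrupted with Ctrl+C
    observer = Observer()
    observer.schedule(PDFHandler(), "/inbox/", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()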

The wait_for_stable_size step is important: a PDF being uploaded or copied is not ready to convert. Watch for file size to stabilize for 2–3 seconds before processing.
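A stalled upload would keep a naive size check waiting forever, so a timeout is worth considering. The standalone sketch below is one way to do it. The parameter names and defaults (interval, stable_checks, timeout) are illustrative assumptions, not settings of watchdog or pdfs2txt.

```python
import os
import time


def wait_for_stable_size(path, interval=1.0, stable_checks=3, timeout=300.0):
    """Return True once the size is unchanged for stable_checks consecutive polls.

    Returns False if the file disappears or the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file was removed or is unreadable
        if size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            unchanged = 0  # size changed, so start counting again
            last_size = size
        time.sleep(interval)
    return False
```

With the defaults, a file counts as ready after its size has held steady for about three seconds. A False return lets the caller log the file and move on rather than block the event handler.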

Cost and time estimates

Concrete numbers for common batch sizes (your hardware will vary):

For cloud OCR specifically, do a 10-PDF dry run first to validate cost projections before throwing the full batch at the API.
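The dry-run arithmetic can be sketched as follows, assuming you record the billed cost of each sampled PDF. The function name is hypothetical, and the pessimistic figure simply assumes every file costs as much as the most expensive one in the sample.

```python
def project_batch_cost(sample_costs: list[float], total_files: int) -> tuple[float, float]:
    """Extrapolate (expected, pessimistic) total cost from per-file costs in a dry run.

    expected uses the sample mean; pessimistic assumes every file costs as much
    as the most expensive file in the sample.
    """
    if not sample_costs:
        raise ValueError("need at least one sampled file")
    mean_cost = sum(sample_costs) / len(sample_costs)
    return mean_cost * total_files, max(sample_costs) * total_files


if __name__ == "__main__":
    # Hypothetical per-file costs from a dry run, in your billing currency.
    expected, pessimistic = project_batch_cost([0.02, 0.04, 0.06], total_files=1000)
    print(f"expected: {expected:.2f}, pessimistic: {pessimistic:.2f}")
```

A small sample can easily miss the longest or most image-heavy documents, so treat the pessimistic number as the one to budget against.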

Error handling patterns

The most common failures and how to handle them:

Resume-on-failure as a first-class feature

For batches large enough to take hours, design for restartability from the start:

These four practices turn a fragile batch script into a reliable one. Without them, a single OOM kill at hour 4 wastes the previous hours of work.
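One housekeeping step pairs with the atomic-write pattern in the conversion script: a process killed between the write and the rename leaves a .tmp file behind. The sketch below clears those before a restart. It assumes nothing else writes .tmp files into the output directory and that no conversion is running while it executes. The helper name is hypothetical.

```python
from pathlib import Path


def remove_stale_tmp(output_dir: Path) -> list[Path]:
    """Delete leftover *.tmp files under output_dir and return the paths removed."""
    removed = []
    # Materialize the listing first so deletion does not disturb the directory walk.
    for tmp in sorted(output_dir.rglob("*.tmp")):
        if tmp.is_file():
            tmp.unlink()
            removed.append(tmp)
    return removed
```

Calling this at the top of main, before the pool starts, keeps the output directory limited to finished .md files.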

Conclusion

Start with the simple pdfs2txt directory-input CLI. Add parallel when sequential becomes the bottleneck. Move to a Python script when you need resume-on-failure, watched directories, or per-file error reporting. Test your pipeline on 10 PDFs before throwing 10,000 at it — the edge cases that show up in the first hundred usually predict the next thousand.

For batch jobs where you don't want to maintain a Python environment, the converter on this site handles single PDFs with the same underlying tools. The local CLI from the same project handles directories.
