How to Bulk-Convert a Folder of PDFs — CLI, Scripts, and Batch Workflows
Converting one PDF in the browser is fine. Converting 500 is a different problem entirely. This guide covers practical approaches for batch jobs: command-line tools, shell pipelines, Python scripts, and watched-folder setups — with realistic guidance on concurrency and error handling so you don't kill your laptop or burn through your API quota.
When you actually need to bulk-convert
A few realistic scenarios:
- Archival projects. Digitizing a paper library, a research lab's accumulated reports, or a small organization's document trove.
- Building a RAG corpus. Feeding internal documents into a retrieval-augmented LLM workflow. See the end-to-end PDF-to-LLM workflow.
- Periodic ingestion. A folder where vendors drop reports weekly, or where a scanner deposits new files daily.
- One-time migration. Pulling everything out of an old document management system before switching to a new one.
- Personal cleanup. Years of accumulated Downloads and email attachments.
The constants across all of these: you want automation, you need failure recovery, and you don't want to babysit it.
The simplest approach: pdfs2txt CLI
The pdfs2txt.py script handles a folder of PDFs in one invocation:
python pdfs2txt.py /path/to/pdfs/ -o /path/to/output/
Add --image-processor tesseract for OCR on scanned pages. Output is one .md file per input PDF, preserving the original filename.
This is the right tool when:
- You have a one-shot batch
- All PDFs need the same processing settings
- You're OK with sequential processing (one PDF at a time)
The limits show up at scale: no built-in parallelism, no resume-on-failure, no per-file error logging. Fine for a few hundred PDFs; less fine for 10,000.
Adding parallelism
For N CPU cores, you can run N conversions in parallel with GNU parallel:
find /path/to/pdfs -name "*.pdf" -print0 | \
  parallel -0 -j 4 python pdfs2txt.py {} -o /path/to/output/
Pick -j based on CPU cores and memory: each pymupdf process uses 200–500 MB on a medium PDF. Don't over-parallelize OCR jobs. Tesseract is CPU-heavy, and four parallel Tesseract jobs on a four-core machine will saturate the CPU.
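One way to turn that guidance into a number is to take the smaller of the core count and the number of processes that fit in memory. This is a minimal sketch, assuming the 500 MB upper figure as the per-process budget; `pick_workers` and its parameters are hypothetical names, not part of pdfs2txt or GNU parallel.

```python
import os


def pick_workers(cpu_cores: int, free_mem_mb: int, per_proc_mb: int = 500) -> int:
    """Return a worker count bounded by both CPU cores and available memory.

    per_proc_mb is an assumed per-process budget (the upper end of the
    200-500 MB range); measure your own PDFs before trusting it.
    """
    by_memory = free_mem_mb // per_proc_mb
    # Always allow at least one worker, even on a tiny machine.
    return max(1, min(cpu_cores, by_memory))


# Example: feed the result to `parallel -j` or a process pool's worker count.
cores = os.cpu_count() or 1
print(pick_workers(cores, free_mem_mb=4096))
```

The free-memory figure is passed in by hand here because the standard library has no portable way to query it.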
For cloud-API-based OCR (OpenAI, Gemini), parallelism is bounded by API rate limits, not local cores. Most providers give 50–500 requests per minute on free tiers and higher limits on paid plans. Start with -j 4 and watch for 429 responses before increasing.
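If you know your quota, a client-side throttle can keep you under it instead of relying on 429s alone. The sketch below spaces requests evenly across threads in one process. `RateLimiter` is a hypothetical helper rather than any provider's API, and the `clock` and `sleep` parameters exist only so it can be tested without real waiting.

```python
import threading
import time


class RateLimiter:
    """Space calls roughly 60 / per_minute seconds apart across threads."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def wait(self) -> float:
        """Block until the next slot is free; return the seconds slept."""
        with self.lock:
            now = self.clock()
            # Reserve the next free slot, then release the lock before sleeping.
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)
        return max(delay, 0.0)
```

Call `limiter.wait()` immediately before each API request. The lock is per-process, so this coordinates threads in one Python process. Separate processes started by `parallel -j` would each need their own share of the quota.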
A Python script for finer control
A complete script that walks a directory tree, processes PDFs in parallel, skips already-converted files (resumable on restart), and logs results per file:
import logging
import sys
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

import pymupdf4llm

logging.basicConfig(
    filename="conversion.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


def convert_one(pdf_path: Path, output_dir: Path) -> tuple[Path, str | None]:
    """Convert one PDF. Returns (path, None), (path, "skipped"), or (path, error)."""
    # Output is flat: same-named PDFs in different subfolders share one .md name
    out_path = output_dir / pdf_path.with_suffix(".md").name
    if out_path.exists():
        return pdf_path, "skipped"
    try:
        md = pymupdf4llm.to_markdown(str(pdf_path))
        # Atomic write: write to .tmp first, then rename
        tmp_path = out_path.with_suffix(".tmp")
        tmp_path.write_text(md, encoding="utf-8")
        tmp_path.replace(out_path)  # replace() also overwrites on Windows
        return pdf_path, None
    except Exception as e:
        return pdf_path, f"{type(e).__name__}: {e}"


def main(input_dir: Path, output_dir: Path, workers: int = 4) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    pdf_files = sorted(input_dir.rglob("*.pdf"))
    logging.info("Converting %d files with %d workers", len(pdf_files), workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert_one, p, output_dir): p for p in pdf_files}
        for fut in as_completed(futures):
            try:
                pdf, err = fut.result()
            except Exception as e:  # the worker process itself died
                pdf, err = futures[fut], f"{type(e).__name__}: {e}"
            if err is None:
                logging.info("converted %s", pdf.name)
            elif err == "skipped":
                logging.info("skipped %s (output exists)", pdf.name)
            else:
                logging.error("failed %s: %s", pdf.name, err)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit(f"usage: {sys.argv[0]} INPUT_DIR OUTPUT_DIR")
    main(Path(sys.argv[1]), Path(sys.argv[2]))
This script handles the three things the simple CLI doesn't: parallelism, resume-on-restart (via the existence check), and per-file error logging. If a process crashes mid-batch, restart the script and it picks up where it left off.
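Because every failure is logged with a fixed prefix, the log can double as a retry list. This is a minimal sketch that pulls failed names out of the log text, assuming the `%(asctime)s %(levelname)s %(message)s` format and the `failed %s: %s` message used by the script. `failed_files` is a hypothetical helper, and a file name that itself contains a colon followed by a space would be cut short.

```python
import re

# Matches lines such as:
# 2024-05-01 12:00:01,000 ERROR failed report.pdf: ValueError: bad xref
FAILED_RE = re.compile(r" ERROR failed (?P<name>.+?): ")


def failed_files(log_text: str) -> list[str]:
    """Return names logged as failed, in first-seen order, without duplicates."""
    seen: dict[str, None] = {}
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            seen.setdefault(match.group("name"), None)
    return list(seen)
```

To use it, read `conversion.log` as UTF-8 text and pass the contents in. Move the listed files aside for inspection before re-running the batch.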
Watching a folder for new files
For pipelines where PDFs arrive continuously, use watchdog (Python) or entr (shell) to trigger conversion on new files.
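A quick entr setup. entr only watches the files it is given at startup, so the -d flag (exit when a file is added to a watched directory) plus a restart loop is needed to pick up new arrivals. Each restart re-runs the conversion over the whole inbox:
while sleep 1; do
  ls /inbox/*.pdf | entr -d python pdfs2txt.py /inbox/ -o /processed/
done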
For production, a small Python daemon using watchdog is more robust:
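import os
import subprocess
import sys
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class PDFHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.lower().endswith(".pdf"):
            return
        # Wait for the file to finish copying
        self.wait_for_stable_size(event.src_path)
        subprocess.run(
            [sys.executable, "pdfs2txt.py", event.src_path, "-o", "/processed/"]
        )

    def wait_for_stable_size(self, path, stable_for=2.0):
        # Return once two size checks, stable_for seconds apart, agree
        last = -1
        while True:
            size = os.path.getsize(path)
            if size == last:
                return
            last = size
            time.sleep(stable_for)


if __name__ == "__main__":
    observer = Observer()
    # /inbox mirrors the entr example; point this at your own drop folder
    observer.schedule(PDFHandler(), "/inbox", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()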
The wait_for_stable_size step is important: a PDF being uploaded or copied is not ready to convert. Watch for file size to stabilize for 2–3 seconds before processing.
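A stricter version of the stability check can also give up on files that never settle or that vanish mid-copy. The following is a sketch under assumed defaults (three matching polls one second apart, a five-minute timeout). `wait_until_stable` is a hypothetical name, and the `getsize`, `sleep` and `clock` parameters are only there to make it testable.

```python
import os
import time


def wait_until_stable(path, checks=3, interval=1.0, timeout=300.0,
                      getsize=os.path.getsize, sleep=time.sleep,
                      clock=time.monotonic) -> bool:
    """Return True once `checks` consecutive polls match the previous size.

    Returns False if the file disappears or `timeout` seconds pass first.
    """
    deadline = clock() + timeout
    last, stable = -1, 0
    while clock() < deadline:
        try:
            size = getsize(path)
        except OSError:
            return False  # file was removed or is unreadable
        # Count consecutive polls with an unchanged size; reset on any change.
        stable = stable + 1 if size == last else 0
        if stable >= checks:
            return True
        last = size
        sleep(interval)
    return False
```

With the defaults, a file has to hold the same size for about three seconds before it is handed to the converter. A `False` result is a signal to log the path and move on.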
Cost and time estimates
Concrete numbers for common batch sizes (your hardware will vary):
- 100 digital PDFs, no OCR: about 5 minutes on an M1 MacBook
- 100 PDFs with Tesseract OCR: 30–60 minutes
- 1000 PDFs with cloud vision API: 3–5 hours wall time, ~$30–50 in API costs depending on page count
- 10,000-PDF archive: budget a weekend and a robust restart-on-failure script
For cloud OCR specifically, do a 10-PDF dry run first to validate cost projections before throwing the full batch at the API.
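The projection from a dry run is simple arithmetic: divide what the dry run cost by the pages it covered, then multiply by the pages in the full batch. This is a minimal sketch, assuming cost scales linearly with page count, which is an assumption to check against your provider's pricing. `project_cost` is a hypothetical helper.

```python
def project_cost(dry_run_cost: float, dry_run_pages: int, total_pages: int) -> float:
    """Scale the measured dry-run cost linearly by page count."""
    if dry_run_pages <= 0:
        raise ValueError("dry run must cover at least one page")
    cost_per_page = dry_run_cost / dry_run_pages
    return cost_per_page * total_pages
```

As an illustration with made-up numbers, a dry run that covers 120 pages and costs $0.36 projects to roughly $36 for a 12,000-page batch.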
Error handling patterns
The most common failures and how to handle them:
- Corrupt PDFs. Catch the exception per-file and continue; log to a separate failures.log for later inspection. Don't let one bad PDF kill the batch.
- Out-of-memory on huge PDFs. Very large documents (1000+ pages) can blow up pymupdf. Cap per-process memory with resource.setrlimit (Unix), or skip files above a page-count threshold.
- API rate limits. Catch HTTP 429 responses, sleep with exponential backoff, retry. The tenacity library handles this cleanly.
- Disk full. Check free space before starting. OCR output is typically 5–15× smaller than input PDFs, so this rarely bites, but it does happen on small VMs.
- Partial output. Use atomic writes (write to .tmp, rename when done) so consumers don't pick up half-converted files.
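The rate-limit pattern from the list can be sketched with the standard library alone. This hand-rolled loop stands in for tenacity to keep the example dependency-free. `RateLimited` is a hypothetical exception you would raise from your own API wrapper on an HTTP 429, and the delay values are assumptions, not provider recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception: raise it when the API answers HTTP 429."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 sleep=time.sleep):
    """Run call(); on RateLimited, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller log the failure
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from workers that failed together.
            sleep(delay + random.uniform(0, delay / 2))
```

Wrap only the network call, not the whole per-file conversion. That way a retry does not redo local work, and a final failure still reaches the per-file error log.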
Resume-on-failure as a first-class feature
For batches large enough to take hours, design for restartability from the start:
- Check whether the output file exists before converting (the script above does this)
- Log the input path and timestamp for every successful conversion
- Log failures separately with the exception details
- Use atomic writes so a kill -9 doesn't leave half-files
These four practices turn a fragile batch script into a reliable one. Without them, a single OOM kill at hour 4 wastes the previous hours of work.
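Two of these practices fit in a few lines each. The sketch below shows an atomic write that cleans up after itself and an append-only success manifest, using only the standard library. `atomic_write_text`, `record_success` and the tab-separated manifest layout are hypothetical choices, not part of pdfs2txt.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def atomic_write_text(out_path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename into place."""
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
        os.replace(tmp_name, out_path)  # atomic when on the same filesystem
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def record_success(manifest: Path, pdf_path: Path) -> None:
    """Append one 'UTC timestamp<TAB>input path' line per converted file."""
    stamp = datetime.now(timezone.utc).isoformat()
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(f"{stamp}\t{pdf_path}\n")
```

The temp file lives next to its destination because a rename is only atomic within one filesystem. Note that `mkstemp` creates files readable only by their owner, so adjust permissions afterwards if other users consume the output.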
Conclusion
Start with the simple pdfs2txt directory-input CLI. Add parallel when sequential becomes the bottleneck. Move to a Python script when you need resume-on-failure, watched directories, or per-file error reporting. Test your pipeline on 10 PDFs before throwing 10,000 at it — the edge cases that show up in the first hundred usually predict the next thousand.
For batch jobs where you don't want to maintain a Python environment, the converter on this site handles single PDFs with the same underlying tools. The local CLI from the same project handles directories.