Image Preprocessing for OCR — DPI, Deskew, Contrast, Binarization
OCR accuracy is more sensitive to image quality than to engine choice. The same Tesseract installation can produce 60% accuracy on a poorly-prepped image and 97% accuracy on the same content scanned and processed properly. Modern vision models are more forgiving but still benefit substantially from clean inputs.
This guide walks through the preprocessing steps that consistently improve OCR output and the ones that look helpful but actually hurt.
DPI: the most important variable
Resolution is the single biggest lever. Tesseract, in particular, is calibrated for content that's roughly 30–40 pixels tall per character — equivalent to body text rendered at 300 DPI. Below that, the engine's character segmenter starts failing. Above 600 DPI, you're spending compute without accuracy gain.
The practical guidance:
- Born-digital PDFs: render to 300 DPI for OCR (using pymupdf: page.get_pixmap(dpi=300)). The PDF doesn't care what DPI it's "at"; you choose the rendering resolution.
- Scans: ideally re-scan at 300–400 DPI. If you're stuck with a 150 DPI scan, upscale to 300 with bicubic or Lanczos interpolation. Don't upscale beyond ~2×; past that, you're adding artifacts, not information.
- Photos of documents: higher is usually better, up to a point. A 4000×3000 phone photo in which a single letter-size page fills the frame works out to roughly 350 DPI, which is fine.
- Handwriting: 600 DPI minimum, even for born-digital documents. The fine detail of pen strokes carries information that's lost at lower resolutions. See handwritten OCR.
For batch jobs, the rendering cost scales with DPI squared. 600 DPI uses 4× the memory and compute of 300 DPI. Don't oversample if you don't need to.
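As a rough sketch of the arithmetic above, the upscale factor can be capped at 2× and the relative cost of a resolution estimated from the DPI ratio squared. The function names are illustrative, not from any library.

```python
def upscale_factor(source_dpi: float, target_dpi: float = 300.0,
                   max_factor: float = 2.0) -> float:
    """Scale factor to reach target_dpi, never below 1x and never above max_factor."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return min(max(target_dpi / source_dpi, 1.0), max_factor)


def relative_cost(dpi: float, baseline_dpi: float = 300.0) -> float:
    """Pixel count (so roughly memory and compute) relative to the baseline DPI."""
    return (dpi / baseline_dpi) ** 2


print(upscale_factor(150))  # 2.0
print(upscale_factor(100))  # 2.0 (capped, so the result is 200 DPI rather than 300)
print(relative_cost(600))   # 4.0
```

The factor can be passed to cv2.resize as fx and fy, with interpolation=cv2.INTER_CUBIC or cv2.INTER_LANCZOS4.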
Deskew
Documents fed through a scanner or photographed by hand are rarely perfectly aligned. Even a 1–2° rotation hurts OCR accuracy because the engine's character segmenter assumes horizontal baselines.
Detection: most OCR engines can detect skew but won't correct it for you. Use a separate deskew step:
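import cv2
import numpy as np

def deskew(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Threshold so only ink is nonzero; inverting alone leaves nearly every pixel > 0
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # np.where yields (row, col); flip to (x, y) and cast to a dtype minAreaRect accepts
    coords = np.column_stack(np.where(ink > 0))[:, ::-1].astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # The reported range depends on the OpenCV version ([-90, 0) or (0, 90]);
    # fold it so the smaller of the two equivalent corrections is applied
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)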
For more robust deskew across varied documents, the OpenCV approach above is OK but deskew (the Python package) or ImageMagick's -deskew 40% produce slightly better results on tricky cases.
When not to deskew: documents where the rotation is intentional (a landscape page, a rotated form). Most deskew algorithms can't tell the difference and will rotate the whole page. Add a check: if the detected angle is greater than ~20°, leave it alone.
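One way to express that check, as a minimal sketch: the helper below is hypothetical and assumes the detector reports angles with a 90° ambiguity, as cv2.minAreaRect does. The 20° cutoff is the rule of thumb from the paragraph above.

```python
def rotation_to_apply(detected_angle: float, max_angle: float = 20.0) -> float:
    """Return the correction angle in degrees, or 0.0 if the rotation looks intentional."""
    angle = detected_angle
    # Fold into (-45, 45] so that, e.g., -88 degrees is treated as a 2 degree skew
    while angle > 45.0:
        angle -= 90.0
    while angle <= -45.0:
        angle += 90.0
    # Large residual angles are more likely a deliberate layout than scanner skew
    return angle if abs(angle) <= max_angle else 0.0


print(rotation_to_apply(1.5))    # 1.5
print(rotation_to_apply(-88.0))  # 2.0
print(rotation_to_apply(30.0))   # 0.0
```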
Denoising
Scans often have noise: paper texture, JPEG compression artifacts, dust speckles, residual stamps. Denoising helps for traditional OCR engines but is risky.
For Tesseract:
- Gaussian blur with a 1–2 pixel radius removes high-frequency noise without blurring character shapes meaningfully. Helpful on noisy scans.
- Median filter is better than Gaussian for salt-and-pepper noise (pepper dots, JPEG hot pixels) — it removes specks without softening edges.
- Bilateral filter preserves edges while smoothing flat regions — overkill for most OCR but useful for poor-quality phone photos.
For vision models:
- Mild denoising is fine; aggressive denoising can hurt. Vision models are robust to texture and noise but sensitive to characters whose edges have been softened.
- When in doubt, send the original image. Vision models are trained on real-world noisy images and often perform better on slight noise than on over-cleaned input.
The anti-pattern: applying heavy denoising on already-clean digital documents. Removes nothing useful, softens character edges, hurts accuracy.
Binarization
Binarization converts a grayscale or color image to pure black-and-white. Critical for Tesseract (which works best on binary input), irrelevant or even harmful for vision models.
Two main approaches:
- Global threshold (Otsu's method). Picks a single threshold for the whole image, separating dark pixels from light. Works on documents with uniform background.
- Adaptive threshold. Uses a local threshold computed per region. Necessary for documents with uneven lighting (phone photos, faded ink, dark page corners).
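import cv2

# Load as single-channel grayscale ("page.png" is a placeholder path)
gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Otsu: uniform background
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive: uneven lighting (block size 31, offset 10)
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)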
Tune the block size (31 above) to roughly 2–3× the height of a character; OpenCV requires it to be odd. The constant offset (10) is subtracted from the local weighted mean, so it controls how aggressively the threshold separates ink from paper.
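A small helper can derive the block size from an estimated character height. The function name and the 2.5× default are assumptions chosen to sit inside the 2–3× range given above.

```python
def adaptive_block_size(char_height_px: float, multiple: float = 2.5) -> int:
    """Odd block size for cv2.adaptiveThreshold, scaled from the character height."""
    size = max(3, round(char_height_px * multiple))
    # adaptiveThreshold needs an odd neighborhood so it has a center pixel
    return size if size % 2 == 1 else size + 1


print(adaptive_block_size(12))       # 31
print(adaptive_block_size(20, 3.0))  # 61
```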
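For images going to a vision model, skip binarization. The vision model uses grayscale and color information; throwing it away hurts. Tesseract is the main engine that benefits from pre-binarized input: it will binarize a grayscale or color image itself, but its default global (Otsu) threshold copes poorly with uneven lighting.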
Contrast and levels
For faded ink, poor scans, or printed material with low contrast:
# ImageMagick: stretch the histogram, pushing the lightest to white and darkest to black
convert input.png -auto-level output.png
# More aggressive: stretch with explicit cutoffs
convert input.png -level 20%,80% output.png
# Increase contrast without clipping
convert input.png -contrast -contrast output.png
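The -level 20%,80% form is the workhorse. It maps intensities at or below 20% of the full range to black and those at or above 80% to white, stretching everything in between linearly. These are intensity levels, not percentiles of the image's histogram; for percentile-style clipping, ImageMagick offers -contrast-stretch. Effective for most documents that look "dim" or "washed out".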
For very faded documents (carbon copies, old thermal-paper receipts), pushing the levels harder (-level 35%,75%) recovers more text but starts to introduce false strokes from paper texture. Test on a sample before batch-processing.
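For pipelines that stay in Python, the same kind of linear levels stretch can be approximated with NumPy. This is a sketch of the idea, not ImageMagick's exact implementation; stretch_levels is a hypothetical name and it assumes an 8-bit grayscale input.

```python
import numpy as np


def stretch_levels(gray: np.ndarray, black: float = 0.20, white: float = 0.80) -> np.ndarray:
    """Map intensities <= black to 0 and >= white to 255, linearly in between.

    black and white are fractions of the full 0-255 range, not percentiles.
    """
    if not 0.0 <= black < white <= 1.0:
        raise ValueError("need 0 <= black < white <= 1")
    lo, hi = black * 255.0, white * 255.0
    scaled = (gray.astype(np.float32) - lo) / (hi - lo)
    return (np.clip(scaled, 0.0, 1.0) * 255.0).round().astype(np.uint8)


# A washed-out strip: nothing darker than 51 or lighter than 204
faded = np.array([[51, 90, 128, 170, 204]], dtype=np.uint8)
stretched = stretch_levels(faded)  # endpoints become 0 and 255
```

Raising black and lowering white (for example 0.35 and 0.75) corresponds to the harder push described above, with the same risk of amplifying paper texture.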
Cropping
OCR engines don't waste time on white space, but they do sometimes pick up content from page edges (binder shadows, ruler marks, fingertips in phone photos) that isn't part of the document.
Practical cropping:
- Border crop: Trim 1–2% from each edge. Catches most binder shadows and scanner edge artifacts.
- Content-bounded crop: Detect the document's actual rectangle and crop to it. Useful for phone photos where the page sits on a desk.
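The border crop is a few lines of array slicing. A minimal sketch, with crop_border as an illustrative name and 1.5% as an assumed default inside the 1–2% range above:

```python
import numpy as np


def crop_border(image: np.ndarray, fraction: float = 0.015) -> np.ndarray:
    """Trim the given fraction of the height and width from each edge."""
    if not 0.0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    h, w = image.shape[:2]
    dy, dx = round(h * fraction), round(w * fraction)
    # Slicing works for both grayscale (h, w) and color (h, w, 3) arrays
    return image[dy:h - dy, dx:w - dx]


page = np.zeros((1000, 800), dtype=np.uint8)
print(crop_border(page, 0.02).shape)  # (960, 768)
```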
For phone photos specifically, a "perspective correction" step that detects the four corners of the document and warps it to a flat rectangle has dramatic effects on OCR accuracy. The OpenCV approach:
import cv2
import numpy as np

def perspective_correct(image, corners):
    """corners: 4 (x, y) points in TL, TR, BR, BL order."""
    h, w = 1100, 850  # target output size
    src = np.array(corners, dtype=np.float32)
    dst = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    # Homography that maps the four detected corners onto the target rectangle
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, M, (w, h))
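Corner detectors rarely return points in the TL, TR, BR, BL order that the transform expects. One common heuristic, sketched here under the assumption that the page is rotated by less than about 45°, sorts them by coordinate sum and difference. The name order_corners is illustrative.

```python
import numpy as np


def order_corners(points) -> np.ndarray:
    """Return 4 (x, y) points ordered TL, TR, BR, BL."""
    pts = np.asarray(points, dtype=np.float32)
    if pts.shape != (4, 2):
        raise ValueError("expected four (x, y) points")
    s = pts.sum(axis=1)        # x + y: smallest at top-left, largest at bottom-right
    d = pts[:, 1] - pts[:, 0]  # y - x: smallest at top-right, largest at bottom-left
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]], dtype=np.float32)


detected = [(800, 1000), (10, 20), (790, 15), (5, 990)]
ordered = order_corners(detected)
# rows: (10, 20), (790, 15), (800, 1000), (5, 990)
```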
Mobile document scanners (Adobe Scan, Microsoft Lens, Apple Notes) do this automatically. If you're using one of them as the capture path, the perspective is already corrected.
Color handling
Most traditional OCR engines convert to grayscale internally; passing a color image just slows them down. But:
- Highlighter marks (yellow, pink, green) become invisible when desaturated to grayscale. If you need to preserve highlights, keep color and post-process separately.
- Colored text on white background OCRs fine after grayscale conversion.
- White text on colored background (some headers, some signs) comes out light-on-dark after grayscale conversion. Give Tesseract -c tessedit_do_invert=1 or invert the image first.
For vision models, send color. They use it.
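For the Tesseract path, inverting a light-on-dark region yourself takes little code. The sketch below uses a simple heuristic that is an assumption, not something any engine guarantees: a region whose median pixel is dark is treated as light text on a dark background. It works best on a cropped region such as a header band rather than a whole mixed page.

```python
import numpy as np


def ensure_dark_on_light(gray: np.ndarray) -> np.ndarray:
    """Invert an 8-bit grayscale region if it looks like light text on a dark background."""
    # Assumed heuristic: the background dominates, so a dark median means dark background
    if np.median(gray) < 128:
        return 255 - gray
    return gray


# Dark header band with a light stroke across it
header = np.zeros((20, 100), dtype=np.uint8)
header[8:12, 10:90] = 255
fixed = ensure_dark_on_light(header)  # background becomes white, stroke black
```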
What the engines do internally
Knowing what the engine does automatically tells you what not to do:
- Tesseract internally: converts to grayscale, applies its own binarization, segments lines. Replace its built-in preprocessing with your own only when its results are unsatisfactory.
- EasyOCR internally: works on color images, does its own denoising. Adding aggressive preprocessing hurts.
- PaddleOCR internally: similar to EasyOCR — works on color, does light preprocessing.
- Cloud OCR services (Textract, Azure DI, Document AI): do extensive internal preprocessing. Send the original at a reasonable resolution.
- Vision models (GPT-4o, Claude, Gemini): trained on a vast variety of natural images. Pre-resize to 1024–2048 pixels on the long side, otherwise send the original. Over-preprocessing usually hurts.
The high-level rule: traditional OCR engines (Tesseract being the main one) benefit from aggressive preprocessing; modern ML-based engines do not.
A pragmatic preprocessing pipeline
For Tesseract on scanned documents:
- Upscale to 300 DPI if lower.
- Convert to grayscale.
- Light Gaussian blur (1 px radius) for noisy scans.
- Deskew.
- Adaptive threshold.
For vision models:
- Resize to 1024–2048 px on the long side.
- Send as-is. Skip everything else.
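The target size for the vision-model resize is simple arithmetic. In this sketch the function name is illustrative, and the choice to leave already-small images alone rather than upscale them is an assumption.

```python
def fit_long_side(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """New (width, height) with the long side at most max_side, aspect ratio preserved."""
    long_side = max(width, height)
    if long_side <= max_side:
        return width, height  # already small enough; do not upscale
    scale = max_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_long_side(4000, 3000))  # (2048, 1536)
print(fit_long_side(800, 600))    # (800, 600)
```

The resulting size can be passed to cv2.resize with interpolation=cv2.INTER_AREA, which is generally a good choice when shrinking.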
For the converter on this site, preprocessing happens automatically before OCR. If you're running OCR locally and getting poor results, the highest-leverage fixes are checking the DPI and turning off any preprocessing that isn't earning its keep.