Cloud OCR Services Compared — AWS Textract, Azure Document Intelligence, Google Document AI
When a document is too messy for Tesseract, too sensitive for a public vision-model API, or too high-volume for one-by-one processing, the three major cloud OCR services become the practical choice. They're enterprise-grade, handle scale, and ship with features (tables, forms, signatures, key-value pairs) that the free tooling can't match.
This guide compares AWS Textract, Azure Document Intelligence (formerly Form Recognizer), and Google Document AI on the dimensions that matter for real workloads.
What these services actually do
All three offer a layered set of capabilities:
- Plain OCR — text extraction with bounding boxes. Comparable to Tesseract but trained on a much wider variety of documents.
- Layout analysis — paragraphs, lists, headings, columns, reading order. The difference between "soup of words" and "structured Markdown."
- Table extraction — cells, rows, headers, merged cells. The capability most people are paying for.
- Form extraction — key-value pairs from filled forms ("Name: John Smith" → {name: "John Smith"}).
- Pre-built models — specialized extractors for invoices, receipts, ID cards, tax forms, bank statements. Trained on millions of real-world documents in each category.
- Custom models — train an extractor on your own document layouts when none of the pre-built models fit.
The capability matrix is similar across the three services. The differences show up in accuracy, pricing, and developer experience.
AWS Textract
Strengths. The strongest table extractor of the three on financial documents and reports. Good at preserving complex table structure including spans and nested headers. Mature, stable API. Deep integration with the rest of AWS — S3, Lambda, Step Functions all play nicely.
Weaknesses. Form extraction is competent but the key-value detection is more brittle than Azure's. Layout analysis is less polished than Document AI's. Multi-page PDFs and TIFFs require the asynchronous API, which adds engineering complexity (start a job, then poll until it is ready). Region coverage is narrower than Azure's.
Pricing (2026 list). Roughly $1.50 per 1,000 pages for plain text, $15 per 1,000 pages for tables-plus-forms. Volume discounts available above 1M pages/month.
When to pick it. You're already on AWS, your documents are financial or reporting-heavy with complex tables, and you don't need exotic language support.
Code sketch:
import boto3

# Synchronous call: works for images and single-page PDFs only.
client = boto3.client("textract")
with open("doc.pdf", "rb") as f:
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "FORMS"],
    )
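The response from a call like this is a flat list of `Blocks`, not ready-made tables. Below is a minimal sketch of rebuilding one table as rows of cell text, assuming the documented Textract block fields (`BlockType`, `Relationships`, `RowIndex`, `ColumnIndex`, `Text`). The helper name `table_to_rows` is hypothetical, and the sketch ignores merged cells and selection elements.

```python
def table_to_rows(blocks, table_block):
    """Rebuild one Textract TABLE block as a list of rows of cell text."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Relationships of type CHILD list the Ids of contained blocks.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    cells = {}
    for cell_id in child_ids(table_block):
        cell = by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [
            by_id[i]["Text"]
            for i in child_ids(cell)
            if by_id[i]["BlockType"] == "WORD"
        ]
        # RowIndex and ColumnIndex are 1-based.
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    # Fill any position Textract did not report with an empty string.
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]
```

You would call it once per block whose `BlockType` is `"TABLE"`, for example `[table_to_rows(blocks, b) for b in blocks if b["BlockType"] == "TABLE"]` with `blocks = response["Blocks"]`.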
For multi-page documents you'd use start_document_analysis against an S3 path and poll get_document_analysis until the job completes.
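A minimal sketch of that polling loop follows. It assumes the documented `JobStatus` values (`IN_PROGRESS`, `SUCCEEDED`, `PARTIAL_SUCCESS`, `FAILED`) and `NextToken` pagination. The helper name `wait_for_analysis`, the bucket and key in the usage comment, and the timing defaults are hypothetical.

```python
import time


def wait_for_analysis(client, job_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_document_analysis until the job finishes, then collect all blocks."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(page.get("StatusMessage", "Textract job failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Textract job {job_id} is still {status}")
        time.sleep(poll_seconds)

    # Results are paginated: follow NextToken until it is absent.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks


# Usage (needs boto3 and AWS credentials; bucket and key are placeholders):
#   client = boto3.client("textract")
#   job = client.start_document_analysis(
#       DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "doc.pdf"}},
#       FeatureTypes=["TABLES", "FORMS"],
#   )
#   blocks = wait_for_analysis(client, job["JobId"])
```

Polling is the simplest option to reason about. For higher volumes, a completion notification (Textract can publish to an SNS topic) avoids the idle requests.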
Azure Document Intelligence
Strengths. The best key-value extraction of the three — pulls structured data out of filled forms cleanly. Strongest pre-built models in the invoice, receipt, and ID-card categories. Layout API output is high-quality Markdown that drops nicely into downstream pipelines. Custom model training is the most ergonomic of the three.
Weaknesses. Table extraction is competent but trails Textract on the hardest financial documents. Pricing tiers can confuse — the "read" tier (plain OCR) is cheap, the "layout" tier (structured) is more, the "prebuilt" tiers are more again. Easy to overspend without realizing it.
Pricing (2026 list). Tiered from ~$1 per 1,000 pages for Read up to $50 per 1,000 pages for some prebuilt models. The Layout tier is around $10 per 1,000 pages.
When to pick it. You're on Azure, your documents are filled forms (invoices, receipts, applications), you want structured Markdown output for downstream RAG, or you need to train custom models with the lowest engineering overhead.
Code sketch:
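from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

# ENDPOINT and KEY are placeholders for your resource's endpoint URL and API key.
client = DocumentIntelligenceClient(endpoint=ENDPOINT, credential=AzureKeyCredential(KEY))
with open("doc.pdf", "rb") as f:
    # Markdown must be requested explicitly; the default content format is plain text.
    poller = client.begin_analyze_document(
        "prebuilt-layout", body=f, output_content_format="markdown"
    )
    result = poller.result()
markdown = result.content  # Markdown string covering the whole document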
Getting clean Markdown straight from the Layout API (once you request the Markdown content format) is a real ergonomic advantage if you're feeding a RAG pipeline.
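As an illustration of that downstream step, here is a minimal sketch that splits a Markdown string into heading-scoped chunks for indexing. It assumes the Markdown uses `#`-style headings, and the function name `split_by_heading` is hypothetical. It does not handle `#` lines inside fenced code blocks, so treat it as a starting point rather than a full chunker.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown):
    """Split Markdown into (heading, body) chunks at each heading line."""
    chunks = []
    heading, lines = "", []

    def flush():
        # Keep a chunk only if it has a heading or some non-blank body text.
        if heading or any(line.strip() for line in lines):
            chunks.append((heading, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Text that appears before the first heading is returned with an empty heading, so nothing is silently dropped.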
Google Document AI
Strengths. The best general OCR accuracy across the widest variety of documents, especially non-Latin scripts and handwriting. Strongest layout analysis — reading order, paragraph grouping, and section detection are noticeably cleaner than the other two on academic and multi-column documents. Excellent at multilingual documents (see multi-language OCR).
Weaknesses. Most complex pricing of the three — separate processors with separate per-page costs that can add up surprisingly. Custom processor training requires more labeled data than Azure. Less integrated with non-Google clouds, so if you're on AWS or Azure, you're crossing a cloud boundary.
Pricing (2026 list). Generic OCR processor around $1.50 per 1,000 pages. Form Parser and specialized processors $20–$60 per 1,000 pages depending on type. Custom Extractor billed separately on training + inference.
When to pick it. You're on Google Cloud, your documents are multilingual or have non-Latin scripts, your priority is high-quality general layout and reading-order recovery, or your downstream is a Google Cloud-hosted application.
Code sketch:
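from google.cloud import documentai_v1 as documentai

# PROCESSOR_NAME is a placeholder for the processor's full resource name:
# projects/<project>/locations/<location>/processors/<processor-id>
client = documentai.DocumentProcessorServiceClient()
with open("doc.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")
request = documentai.ProcessRequest(name=PROCESSOR_NAME, raw_document=raw_document)
result = client.process_document(request=request)
text = result.document.text  # full plain text; layout elements index into this string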
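The layout results do not carry their own text. Each element points into `document.text` through a text anchor made of `start_index`/`end_index` segments. Here is a minimal sketch of resolving paragraphs, assuming that documented shape; the helper names are hypothetical and the `SimpleNamespace` object is only a stand-in so the helpers can be tried without calling the API.

```python
from types import SimpleNamespace as NS


def layout_text(document, layout):
    """Return the text a layout element covers, via its text anchor."""
    return "".join(
        document.text[int(seg.start_index):int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )


def paragraphs(document):
    """Yield (page_number, paragraph_text) in the order the API returns them."""
    for page in document.pages:
        for para in page.paragraphs:
            yield page.page_number, layout_text(document, para.layout).strip()


# Stand-in with the same attribute shape as a processed document.
doc = NS(
    text="Title\nFirst paragraph.\n",
    pages=[NS(page_number=1, paragraphs=[
        NS(layout=NS(text_anchor=NS(text_segments=[NS(start_index=0, end_index=6)]))),
        NS(layout=NS(text_anchor=NS(text_segments=[NS(start_index=6, end_index=23)]))),
    ])],
)
for page_number, paragraph in paragraphs(doc):
    print(page_number, repr(paragraph))
```

With a real response you would pass `result.document` instead of the stand-in.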
Accuracy comparison on real documents
Rough characterization from running the same set of representative documents through each service:
| Document type | Textract | Azure DI | Document AI |
|---|---|---|---|
| Clean digital report | 99% | 99% | 99% |
| Scanned report (300 DPI, clean) | 98% | 98% | 99% |
| Scanned report (200 DPI, noisy) | 92% | 94% | 95% |
| Bordered financial table | 95% | 92% | 91% |
| Borderless table (whitespace-separated) | 88% | 85% | 90% |
| Filled invoice (key-value) | 87% | 95% | 92% |
| Handwritten form fields | 75% | 80% | 87% |
| Mixed-language document (en + zh) | 90% | 92% | 97% |
| Academic paper (2-column with figures) | 90% | 93% | 96% |
Numbers are characteristic, not absolute — your documents will differ. The pattern is robust across document samples: Textract leads on bordered tables, Azure leads on forms and key-value, Document AI leads on layout and multilingual.
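To score your own documents, you need a metric. The table above does not say which one was used, so the following is an assumption: character accuracy, defined as 1 minus the edit distance between ground truth and OCR output, divided by the ground-truth length. It is a common choice for plain text, and table or key-value accuracy would need a structure-aware measure instead.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using one row of memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_accuracy(reference, hypothesis):
    """Character accuracy in [0, 1]; reference is the ground-truth text."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

For example, `char_accuracy("Total: 1,250.00", "Total: 1,250.0O")` counts a single substituted character against the 15 characters of the reference.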
Practical decision flow
A pragmatic way to choose:
- Are you already in one cloud? Pick that cloud's service unless there's a specific feature gap. Cross-cloud OCR has data-egress costs and operational complexity that outweigh most accuracy gaps.
- What's the dominant document type? Forms → Azure. Financial tables → Textract. Multilingual or academic → Document AI.
- Do you need to train custom models? Azure's custom training has the lowest activation energy. Google's is more powerful but harder. AWS's is the most engineering-heavy.
- What's your volume? All three offer enterprise pricing above ~1M pages/month; negotiate before you sign up. Below that, list pricing is what you'll pay.
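The first two questions can be written down as a small rule function. This is only a sketch of the flow above: the function name, the cloud keys, and the document-type labels are hypothetical, and the custom-model and volume questions are left to human judgment.

```python
CLOUD_SERVICE = {
    "aws": "AWS Textract",
    "azure": "Azure Document Intelligence",
    "gcp": "Google Document AI",
}

TYPE_SERVICE = {
    "forms": "Azure Document Intelligence",
    "financial_tables": "AWS Textract",
    "multilingual": "Google Document AI",
    "academic": "Google Document AI",
}


def pick_service(current_cloud=None, dominant_type=None, feature_gap=False):
    """Apply the first two questions of the decision flow, in order."""
    # 1. Stay in your current cloud unless there is a specific feature gap.
    if current_cloud in CLOUD_SERVICE and not feature_gap:
        return CLOUD_SERVICE[current_cloud]
    # 2. Otherwise let the dominant document type decide.
    #    Returns None when neither rule applies.
    return TYPE_SERVICE.get(dominant_type)
```

Putting the cloud check first encodes the claim that cross-cloud overhead usually outweighs the accuracy differences.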
What about vision models?
Frontier vision models (GPT-4o, Claude, Gemini) have closed much of the gap with cloud OCR services in the last 18 months. They're often better on handwriting and pathological layouts. The trade-offs:
- Latency. Vision models take 5–30 seconds per page. Cloud OCR services return in 1–5 seconds per page (async, batched).
- Cost. Vision models cost $0.005–$0.03 per page; cloud OCR runs $0.001–$0.06 per page depending on features (the per-1,000-page list prices above, divided by 1,000). On comparable feature sets, they're in the same ballpark.
- Consistency. Cloud OCR returns deterministic, structured output — bounding boxes, cell coordinates, page numbers. Vision models return free-form text that you have to parse, and the same image can produce slightly different outputs across calls.
- Hallucination risk. Vision models occasionally invent plausible content. Cloud OCR services may miss content but don't fabricate it.
For batch jobs with structured output requirements, cloud OCR services are still the better choice. For one-off jobs or the hardest cases, vision models often win.
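To make the cost bullet concrete, here is a small arithmetic sketch. The 200,000-page monthly volume is hypothetical. The vision range is the per-page figure above converted to a per-1,000-page price, and the cloud OCR figure reuses the $15 per 1,000 pages quoted earlier for Textract tables-plus-forms.

```python
def batch_cost(pages, usd_per_1000_pages):
    """Cost in USD of processing `pages` pages at a per-1,000-page price."""
    return pages / 1000 * usd_per_1000_pages


pages = 200_000  # hypothetical monthly volume

# Vision models: $0.005 to $0.03 per page is $5 to $30 per 1,000 pages.
vision_low = batch_cost(pages, 5)
vision_high = batch_cost(pages, 30)

# Cloud OCR with tables and forms at $15 per 1,000 pages.
ocr_structured = batch_cost(pages, 15)

print(f"vision models: ${vision_low:,.0f} to ${vision_high:,.0f}")
print(f"cloud OCR (tables + forms): ${ocr_structured:,.0f}")
```

At this volume the cloud OCR figure falls inside the vision-model range, which is the "same ballpark" point made above.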
Privacy and data residency
A consideration most people miss until late: where is the document processed?
- AWS Textract runs in the region you select. Data stays in that region. Subject to AWS DPA terms.
- Azure Document Intelligence same — runs in your chosen region.
- Google Document AI same. The EU multi-region option processes in EU data centers only.
For sensitive documents (legal, medical, HR), all three offer Business Associate Agreements (HIPAA) and equivalent EU compliance documentation. Cloud OCR is a reasonable choice for documents you can't send to public vision-model APIs — see PDF privacy and security for the broader framing.
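Region choice shows up in code as well as in contracts. For Document AI, the location appears both in the API endpoint host and in the processor's resource name, and the two have to agree. Here is a minimal sketch of building both; the helper name and the example IDs are hypothetical.

```python
def documentai_target(project_id, location, processor_id):
    """Return (api_endpoint, processor_name) for a Document AI location such as 'eu'."""
    endpoint = f"{location}-documentai.googleapis.com"
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    return endpoint, name


endpoint, processor_name = documentai_target("my-project", "eu", "abc123")

# Usage (needs google-cloud-documentai and credentials):
#   client = documentai.DocumentProcessorServiceClient(
#       client_options={"api_endpoint": endpoint}
#   )
#   request = documentai.ProcessRequest(name=processor_name, raw_document=raw_document)
#
# The Textract equivalent is the client's region:
#   boto3.client("textract", region_name="eu-central-1")
# For Azure, the region is fixed by the resource behind your endpoint URL.
```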
What it looks like in production
A working production setup typically combines:
- A free-tier extractor (pymupdf4llm or pdfplumber) for cheap pages where text is already present.
- One of the cloud OCR services for pages that need real OCR.
- A vision-model fallback for the small percentage of documents where the cloud OCR output is clearly wrong (very low confidence scores, suspicious row counts, etc.).
This three-tier setup keeps cost low (most pages stay in the free tier) while still handling the long-tail messy documents. Build the routing logic per-page, not per-document — a 100-page report with two scanned pages should only pay OCR pricing for those two pages.
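A minimal sketch of that per-page routing, under stated assumptions: the input is the text a local extractor returned for each page (empty or `None` for scanned pages), and OCR confidences have been normalized to the range 0 to 1 (Textract reports 0 to 100). The function names and both thresholds are hypothetical and need tuning on your own documents.

```python
def route_pages(page_texts, min_chars=50):
    """Return {page_index: 'local' or 'ocr'} from locally extracted page text."""
    routes = {}
    for index, text in enumerate(page_texts):
        # Pages with almost no embedded text are treated as scans.
        usable = len((text or "").strip())
        routes[index] = "local" if usable >= min_chars else "ocr"
    return routes


def needs_vision_fallback(confidences, min_mean=0.80):
    """Flag an OCR'd page whose mean confidence (0 to 1) is suspiciously low."""
    if not confidences:
        return True  # no recognized content at all is itself suspicious
    return sum(confidences) / len(confidences) < min_mean


# Usage with pdfplumber (needs the library installed):
#   with pdfplumber.open("report.pdf") as pdf:
#       texts = [page.extract_text() or "" for page in pdf.pages]
#   routes = route_pages(texts)
```

Only the pages routed to `'ocr'` are sent to the cloud service, and only the OCR'd pages that `needs_vision_fallback` flags go on to the vision model.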