PDF OCR API - Document Extraction

Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.

PDF OCR API

"Extract Text from Any PDF for Pennies" by John Rippy | johnrippy.link
🏆 2025 Zapier Automation Hero of the Year | Project Phoenix: A 95-step AI sales pipeline cutting development time by 50%.

---

Stop Paying for Expensive OCR Services

You're currently paying: Adobe Acrobat Pro ($22.99/mo), ABBYY FineReader ($199/year), Google Document AI ($1.50/1000 pages), Amazon Textract ($1.50/1000 pages). What if you could extract text for a fraction of the cost?

The PDF OCR API extracts text from any PDF, including scanned documents, image-based PDFs, and multi-page files.

Pay only for what you use. No monthly subscriptions. No minimum commitments.

---

Why Choose This Over Traditional OCR Services

1. Pay-Per-Page, Not Per-Month

Traditional tools: $20-$200/month for your business.

This actor: Pay per page processed. Process 100 pages for ~$5. Process 1,000 for ~$40.

At the ~$40 per 1,000 pages rate, 500 pages/month costs about $20, still less than an Adobe subscription ($22.99/mo).
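As a rough check on these figures, the sketch below computes the monthly page count at which pay-per-page cost matches a flat subscription. The per-page rates are only inferred from the approximate totals quoted above (~$5 per 100 pages, ~$40 per 1,000 pages), not published prices, and `breakEvenPages` is a hypothetical helper.

```javascript
// Approximate per-page rates implied by the figures above (assumptions).
const RATE_AT_100_PAGES = 5 / 100;    // ~$0.05 per page
const RATE_AT_1000_PAGES = 40 / 1000; // ~$0.04 per page
const ADOBE_MONTHLY = 22.99;          // subscription price quoted earlier

// Largest whole number of pages per month that still costs no more
// than the flat monthly fee at the given per-page rate.
function breakEvenPages(monthlyFee, ratePerPage) {
  return Math.floor(monthlyFee / ratePerPage);
}

console.log(breakEvenPages(ADOBE_MONTHLY, RATE_AT_100_PAGES));  // 459
console.log(breakEvenPages(ADOBE_MONTHLY, RATE_AT_1000_PAGES)); // 574
```

Under these assumed rates, the break-even point sits somewhere between roughly 460 and 575 pages per month, so whether 500 pages comes in below the subscription price depends on which rate applies.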

2. Support for Any PDF

3. 14 Languages Supported

4. Table Detection

Preserve table structure from scanned documents. Get rows and columns as structured data.

---

Quick Start Examples

Example 1: Extract Text from URL

{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "eng",
  "outputFormat": "text"
}

Example 2: Process Specific Pages

{
  "pdfUrl": "https://example.com/document.pdf",
  "pageRange": "1-5",
  "language": "eng",
  "detectTables": true
}

Example 3: Multi-Language Document

{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "spa",
  "outputFormat": "json"
}

Example 4: With Webhook

{
  "pdfUrl": "https://example.com/document.pdf",
  "webhookUrl": "https://hooks.zapier.com/hooks/catch/12345/abcdef/"
}

---

Input Parameters

*Either pdfUrl or pdfBase64 is required
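The source requirement can be checked on the client before a run is started. The following is a minimal sketch: `buildInput` is a hypothetical helper, the field names come from the examples in this document, and rejecting input that carries both sources is an assumption (the actor may simply prefer one of them).

```javascript
// Build an actor input object, enforcing that a PDF source is present.
// Assumption: exactly one of pdfUrl / pdfBase64 should be supplied.
function buildInput({ pdfUrl, pdfBase64, ...options }) {
  const hasUrl = typeof pdfUrl === 'string' && pdfUrl.length > 0;
  const hasBase64 = typeof pdfBase64 === 'string' && pdfBase64.length > 0;

  if (!hasUrl && !hasBase64) {
    throw new Error('Provide either pdfUrl or pdfBase64');
  }
  if (hasUrl && hasBase64) {
    throw new Error('Provide only one of pdfUrl or pdfBase64');
  }
  // Pass the remaining options (language, outputFormat, ...) through unchanged.
  return hasUrl ? { pdfUrl, ...options } : { pdfBase64, ...options };
}
```

For example, `buildInput({ pdfUrl: 'https://example.com/document.pdf', language: 'eng' })` returns an object that can be passed to the actor as-is, while `buildInput({ language: 'eng' })` throws before any page is billed.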

---

Output Format

{
  "success": true,
  "fileName": "document.pdf",
  "totalPages": 5,
  "processedPages": 5,
  "language": "eng",
  "processingTime": 2.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "This is the extracted text from page 1...",
      "confidence": 95.2,
      "wordCount": 342,
      "hasImages": true,
      "tables": [
        {
          "rows": 5,
          "columns": 3,
          "data": [["Header1", "Header2", "Header3"], ...]
        }
      ]
    }
  ],
  "fullText": "Complete document text concatenated...",
  "wordCount": 1250,
  "averageConfidence": 94.5
}
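Because each table arrives as a `data` array of rows, it can be turned into CSV with a few lines of code. This is a sketch that assumes the output shape shown above; `tableToCsv`, `escapeCell`, and `collectTables` are hypothetical helper names, not part of the actor.

```javascript
// Quote a cell if it contains a comma, double quote, or line break.
function escapeCell(cell) {
  const text = String(cell ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Convert one table object ({ rows, columns, data }) into CSV text.
function tableToCsv(table) {
  return table.data
    .map((row) => row.map(escapeCell).join(','))
    .join('\n');
}

// Collect every table across all pages, tagged with its page number.
// Pages without a `tables` array are skipped.
function collectTables(result) {
  return result.pages.flatMap((page) =>
    (page.tables ?? []).map((table) => ({
      pageNumber: page.pageNumber,
      csv: tableToCsv(table),
    }))
  );
}
```

Calling `collectTables(items[0])` on a result item would yield one `{ pageNumber, csv }` entry per detected table, ready to write to disk or forward to a spreadsheet step.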

---

Pay-Per-Event Pricing

You only pay for what you use. No monthly fees. No minimums.

Cost Examples

For low-to-medium volume, save compared to subscriptions. For high volume, competitive with cloud APIs.

---

Use Cases

Document Digitization

Data Extraction

Research & Academia

Legal & Compliance

Developers

---

Confidence Scores

Each page includes a confidence score (0-100%):

Low confidence usually indicates:
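One way to act on per-page confidence is to flag pages for manual review. The sketch below assumes the `pages` array from the output format above; `pagesNeedingReview` is a hypothetical helper and the threshold of 80 is an arbitrary cut-off, not a value recommended by the actor.

```javascript
// Return the page numbers whose OCR confidence is below the threshold.
// Pages with a missing or non-numeric confidence are flagged as well.
function pagesNeedingReview(result, threshold = 80) {
  return result.pages
    .filter((page) => typeof page.confidence !== 'number' || page.confidence < threshold)
    .map((page) => page.pageNumber);
}
```

A stricter threshold may suit invoices or contracts, where a single misread digit matters more than it does in free-form text.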

---

API Integration

Using the Apify API (JavaScript)

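// Requires an ES module context (top-level await) and the apify-client package.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('localhowl/pdf-ocr-api').call({
  pdfUrl: 'https://example.com/document.pdf',
  language: 'eng',
  outputFormat: 'json',
});

// Results are stored in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].fullText);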

Using cURL

curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrl": "https://example.com/document.pdf",
    "language": "eng"
  }'

Base64 Upload (for local files)

# Convert PDF to base64 on a single line (wrapped output would break the JSON string)
base64 < document.pdf | tr -d '\n' > document_b64.txt

# Send to API
curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfBase64": "'"$(cat document_b64.txt)"'",
    "language": "eng"
  }'
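The same upload can be prepared in Node.js without a shell step. This is a minimal sketch: `buildBase64Input` is a hypothetical helper, and the check for the `%PDF-` signature is an extra safeguard added here, not something the actor is known to require.

```javascript
// Build a pdfBase64 input from raw PDF bytes (a Buffer or Uint8Array).
// In practice the bytes would typically come from fs.readFileSync('document.pdf').
function buildBase64Input(pdfBytes, language = 'eng') {
  const bytes = Buffer.from(pdfBytes);

  // PDF files normally begin with the ASCII signature "%PDF-".
  if (bytes.subarray(0, 5).toString('latin1') !== '%PDF-') {
    throw new Error('Input does not look like a PDF');
  }
  // Buffer's base64 encoding is emitted as one line, so it is safe inside JSON.
  return { pdfBase64: bytes.toString('base64'), language };
}
```

The returned object can be passed to `client.actor(...).call(...)` in place of an input that uses `pdfUrl`.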

---

Webhook Integration (Zapier, Make, n8n)

Webhook Payload Format

{
  "event": "ocr_completed",
  "timestamp": "2025-12-23T12:00:00.000Z",
  "actor": "pdf-ocr-api",
  "runId": "abc123",
  "totalPages": 10,
  "processedPages": 10,
  "averageConfidence": 92.5,
  "fullText": "...",
  "pages": [...]
}
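On the receiving side, a webhook consumer usually reduces this payload to the few fields it needs. The sketch below assumes the payload shape shown above; `summarizeWebhook` is a hypothetical helper, and the 200-character preview length is arbitrary.

```javascript
// Summarise a webhook payload shaped like the example above.
// Returns null for anything other than an "ocr_completed" event.
function summarizeWebhook(payload) {
  if (!payload || payload.event !== 'ocr_completed') return null;

  const { runId, totalPages, processedPages, averageConfidence } = payload;
  return {
    runId,
    // True when every page of the document was processed.
    complete: processedPages === totalPages,
    averageConfidence,
    // Short preview of the text, e.g. for a chat or email notification.
    preview: String(payload.fullText ?? '').slice(0, 200),
  };
}
```

A handler in Express, a serverless function, or a code step in an automation tool could call this on the parsed request body and route incomplete runs to a retry or review queue.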

Common Automations

---

Limitations

---

Support

---

Built by John Rippy | johnrippy.link

---

Keywords

pdf ocr, pdf text extraction, ocr api, scanned pdf to text, document digitization, pdf scraper, image to text, optical character recognition, pdf parser, document processing, invoice ocr, form extraction, adobe alternative, abbyy alternative, tesseract ocr, multi-language ocr
