PDF OCR API
"Extract Text from Any PDF for Pennies" by John Rippy | johnrippy.link
🏆 2025 Zapier Automation Hero of the Year — Project Phoenix: A 95-step AI sales pipeline cutting development time by 50%. Read more →
---
Stop Paying for Expensive OCR Services
You're currently paying: Adobe Acrobat Pro ($22.99/mo), ABBYY FineReader ($199/year), Google Document AI ($1.50/1000 pages), Amazon Textract ($1.50/1000 pages). What if you could extract text for a fraction of the cost?
The PDF OCR API extracts text from any PDF - scanned documents, image-based PDFs, and multi-page files:
- Scanned document support (OCR)
- Multi-page processing (any length)
- Support for 14 languages (English, Spanish, French, German, Chinese, Japanese, and more)
- Table structure preservation
- Multiple output formats (text, JSON, Markdown)
- Confidence scores per page
- Page-by-page results
---
Why Choose This Over Traditional OCR Services
1. Pay-Per-Page, Not Per-Month
Traditional tools: $20-$200/month for your business.
This actor: Pay per page processed. Process 100 pages for ~$5. Process 1,000 for ~$40.
Process 500 pages/month and still pay less than an Adobe subscription.
2. Support for Any PDF
- Scanned PDFs: Image-based documents from scanners
- Digital PDFs: Native text extraction (faster, more accurate)
- Mixed PDFs: Pages with both text and images
- Multi-page: No limit on document length
3. 14 Languages Supported
4. Table Detection
Preserve table structure from scanned documents. Get rows and columns as structured data.
---
Quick Start Examples
Example 1: Extract Text from URL
{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "eng",
  "outputFormat": "text"
}
Example 2: Process Specific Pages
{
  "pdfUrl": "https://example.com/document.pdf",
  "pageRange": "1-5",
  "language": "eng",
  "detectTables": true
}
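The `pageRange` value in Example 2 is a string such as "1-5". For checking a range on the client before starting a run, here is a minimal sketch. It assumes ranges are 1-based and inclusive, and it accepts only the `start-end` form from the example (or a single page number). Whether the actor accepts other forms, such as comma-separated lists, is not stated in this README. `parsePageRange` is a hypothetical helper, not part of the actor.

```javascript
// Hypothetical helper: expand a "start-end" page range into page numbers.
// Assumes 1-based, inclusive ranges; any other syntax is rejected.
function parsePageRange(range, totalPages = Infinity) {
  const match = /^(\d+)(?:-(\d+))?$/.exec(String(range).trim());
  if (!match) {
    throw new Error(`Unsupported page range: ${range}`);
  }
  const start = Number(match[1]);
  const end = match[2] === undefined ? start : Number(match[2]);
  if (start < 1 || end < start) {
    throw new Error(`Invalid page range: ${range}`);
  }
  // Clamp to the document length when it is known.
  const last = Math.min(end, totalPages);
  const pages = [];
  for (let page = start; page <= last; page += 1) {
    pages.push(page);
  }
  return pages;
}

console.log(parsePageRange('1-5').join(', ')); // 1, 2, 3, 4, 5
```

Passing the document's page count as the second argument clamps the range, so `parsePageRange('4-9', 6)` yields pages 4 through 6.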
Example 3: Multi-Language Document
{
  "pdfUrl": "https://example.com/document.pdf",
  "language": "spa",
  "outputFormat": "json"
}
Example 4: With Webhook
{
  "pdfUrl": "https://example.com/document.pdf",
  "webhookUrl": "https://hooks.zapier.com/hooks/catch/12345/abcdef/"
}
---
Input Parameters
*Either pdfUrl or pdfBase64 is required
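The "either `pdfUrl` or `pdfBase64`" rule can be checked before a run is started, so that a malformed request fails locally. The sketch below is only an illustration: `buildOcrInput` is a hypothetical helper, the field names are taken from the Quick Start examples, and the http/https check is an assumption about what the actor can fetch. How the actor behaves when both fields are supplied is not documented here, so the helper passes both through unchanged.

```javascript
// Hypothetical helper: validate actor input before starting a run.
// Field names come from the Quick Start examples; the checks are assumptions.
function buildOcrInput({ pdfUrl, pdfBase64, ...options }) {
  if (!pdfUrl && !pdfBase64) {
    throw new Error('Either pdfUrl or pdfBase64 is required');
  }
  if (pdfUrl) {
    // new URL() throws a TypeError on malformed URLs.
    const { protocol } = new URL(pdfUrl);
    if (protocol !== 'http:' && protocol !== 'https:') {
      throw new Error(`Unsupported URL protocol: ${protocol}`);
    }
  }
  // Drop undefined fields so only explicit settings are sent.
  const input = { pdfUrl, pdfBase64, ...options };
  return Object.fromEntries(
    Object.entries(input).filter(([, value]) => value !== undefined)
  );
}

const input = buildOcrInput({
  pdfUrl: 'https://example.com/document.pdf',
  language: 'eng',
  outputFormat: 'text',
});
console.log(input);
```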
---
Output Format
{
  "success": true,
  "fileName": "document.pdf",
  "totalPages": 5,
  "processedPages": 5,
  "language": "eng",
  "processingTime": 2.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "This is the extracted text from page 1...",
      "confidence": 95.2,
      "wordCount": 342,
      "hasImages": true,
      "tables": [
        {
          "rows": 5,
          "columns": 3,
          "data": [["Header1", "Header2", "Header3"], ...]
        }
      ]
    }
  ],
  "fullText": "Complete document text concatenated...",
  "wordCount": 1250,
  "averageConfidence": 94.5
}
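Because each page carries its tables as nested arrays, the output can be turned into CSV with a few lines of code. The sketch below assumes every table's `data` field is an array of rows, each row an array of cell values, as in the sample output. `tablesToCsv` is a hypothetical helper and the quoting follows common CSV conventions.

```javascript
// Hypothetical helper: flatten every detected table into CSV text.
// Assumes each table's `data` is an array of rows of cell values.
function tablesToCsv(result) {
  const escapeCell = (cell) => {
    const text = String(cell ?? '');
    // Quote cells containing commas, quotes, or line breaks.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const output = [];
  for (const page of result.pages ?? []) {
    (page.tables ?? []).forEach((table, index) => {
      output.push({
        pageNumber: page.pageNumber,
        tableIndex: index,
        csv: table.data.map((row) => row.map(escapeCell).join(',')).join('\n'),
      });
    });
  }
  return output;
}

const sample = {
  pages: [
    {
      pageNumber: 1,
      tables: [{ rows: 2, columns: 2, data: [['Item', 'Total'], ['Paper, A4', '12.50']] }],
    },
  ],
};
console.log(tablesToCsv(sample)[0].csv);
```

Pages without a `tables` array are skipped, so the helper can run on output produced without `detectTables`.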
---
Pay-Per-Event Pricing
You only pay for what you use. No monthly fees. No minimums.
Cost Examples
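As a rough worked example, the arithmetic below uses the approximate figures quoted earlier in this README (about $5 per 100 pages and about $40 per 1,000 pages). It assumes these correspond to flat rates of 5 and 4 cents per page. The actual price schedule may differ, so treat the numbers as illustrative only.

```javascript
// Illustrative arithmetic only. The per-page rates are derived from the
// approximate figures quoted in this README (~$5 per 100 pages, ~$40 per
// 1,000 pages) and are assumptions, not an official price list.
const CENTS_PER_PAGE = { lowVolume: 5, highVolume: 4 };
const ADOBE_MONTHLY_CENTS = 2299; // Adobe Acrobat Pro, $22.99/mo

// Cost in dollars for a page count at a flat per-page rate (in cents).
function costInDollars(pages, centsPerPage) {
  return (pages * centsPerPage) / 100;
}

// Largest monthly page count that costs no more than a subscription.
function breakEvenPages(subscriptionCents, centsPerPage) {
  return Math.floor(subscriptionCents / centsPerPage);
}

console.log(costInDollars(100, CENTS_PER_PAGE.lowVolume)); // 5
console.log(costInDollars(1000, CENTS_PER_PAGE.highVolume)); // 40
console.log(breakEvenPages(ADOBE_MONTHLY_CENTS, CENTS_PER_PAGE.highVolume)); // 574
```

Under these assumed rates, pay-per-page stays below the $22.99 subscription up to roughly 459 to 574 pages per month, depending on which rate applies.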
For low-to-medium volume, save compared to subscriptions. For high volume, competitive with cloud APIs.
---
Use Cases
Document Digitization
- Archive processing: Make historical documents searchable
- Paper to digital: Convert scanned documents to text
- Record keeping: Digitize contracts, invoices, receipts
Data Extraction
- Invoice processing: Extract line items, totals, dates
- Form processing: Pull data from scanned forms
- Contract analysis: Extract key terms and clauses
Research & Academia
- Academic papers: Extract text from PDF research papers
- Book scanning: Digitize book chapters and pages
- Citation extraction: Pull references from documents
Legal & Compliance
- Legal discovery: Process large document sets
- Contract review: Extract text for analysis
- Compliance audits: Digitize paper records
Developers
- API integration: RESTful JSON responses
- Webhook support: Async processing for large documents
- Multiple formats: Text, JSON, or Markdown output
---
Confidence Scores
Each page includes a confidence score (0-100%).
Low confidence usually indicates:
- Poor scan quality
- Unusual fonts
- Handwritten text
- Low resolution images
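One way to act on these scores is to flag pages for manual review. The sketch below filters the `pages` array from the output by its `confidence` field. `findLowConfidencePages` is a hypothetical helper, and the default threshold of 80 is an arbitrary assumption rather than a documented cutoff, so tune it against your own documents.

```javascript
// Hypothetical helper: list pages whose confidence falls below a threshold.
// The default of 80 is an arbitrary assumption, not a documented cutoff.
function findLowConfidencePages(result, threshold = 80) {
  return (result.pages ?? [])
    .filter((page) => typeof page.confidence === 'number' && page.confidence < threshold)
    .map(({ pageNumber, confidence }) => ({ pageNumber, confidence }));
}

const result = {
  pages: [
    { pageNumber: 1, confidence: 95.2 },
    { pageNumber: 2, confidence: 61.7 },
  ],
};
// Only page 2 falls below the default threshold.
console.log(findLowConfidencePages(result));
```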
---
API Integration
Using the Apify API (JavaScript)
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('localhowl/pdf-ocr-api').call({
    pdfUrl: 'https://example.com/document.pdf',
    language: 'eng',
    outputFormat: 'json'
});

// Results are stored in the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].fullText);
Using cURL
curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrl": "https://example.com/document.pdf",
    "language": "eng"
  }'
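The `/runs` endpoint starts a run and responds with run metadata rather than the extracted text. The Apify API also provides a `run-sync-get-dataset-items` endpoint that waits for the run and returns the dataset items in a single response, which may suit short documents. The sketch below only assembles such a request. `buildSyncRequest` is a hypothetical helper, and sending the token in an `Authorization` header rather than the query string is a choice made for this example.

```javascript
// Hypothetical helper: build a request for Apify's synchronous run endpoint.
// It only assembles the URL and fetch options; nothing is sent here.
function buildSyncRequest(token, input, actorId = 'localhowl~pdf-ocr-api') {
  const url = `https://api.apify.com/v2/acts/${actorId}/run-sync-get-dataset-items`;
  return {
    url,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Keeps the token out of URLs, which tend to end up in logs.
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(input),
    },
  };
}

const { url, options } = buildSyncRequest('YOUR_API_TOKEN', {
  pdfUrl: 'https://example.com/document.pdf',
  language: 'eng',
});
console.log(url);
// To send it (Node 18+): const items = await (await fetch(url, options)).json();
```

Synchronous endpoints are subject to a time limit, so long documents are probably better served by the asynchronous run or a webhook.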
Base64 Upload (for local files)
# Convert PDF to base64 (strip line breaks so the value is valid inside JSON)
base64 < document.pdf | tr -d '\n' > document_b64.txt
# Send to API
curl -X POST "https://api.apify.com/v2/acts/localhowl~pdf-ocr-api/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfBase64": "'"$(cat document_b64.txt)"'",
    "language": "eng"
  }'
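The same upload can be prepared in Node.js, where `Buffer` produces base64 without line breaks. The sketch below is a minimal illustration: `toBase64Input` is a hypothetical helper, and the demo encodes a few placeholder bytes instead of a real file.

```javascript
// Hypothetical helper: turn raw PDF bytes into an actor input object.
// Buffer is a Node.js global; in a script, the bytes would typically come
// from fs.readFileSync('document.pdf').
function toBase64Input(pdfBytes, language = 'eng') {
  const pdfBase64 = Buffer.from(pdfBytes).toString('base64');
  return { pdfBase64, language };
}

// Demo with a few placeholder bytes instead of a real file.
const demo = toBase64Input(Buffer.from('%PDF-1.7'));
console.log(demo.pdfBase64); // JVBERi0xLjc=
```

The resulting object can be passed as the input to `client.actor(...).call(...)` or serialized as the request body of the cURL call.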
---
Webhook Integration (Zapier, Make, n8n)
Webhook Payload Format
{
  "event": "ocr_completed",
  "timestamp": "2025-12-23T12:00:00.000Z",
  "actor": "pdf-ocr-api",
  "runId": "abc123",
  "totalPages": 10,
  "processedPages": 10,
  "averageConfidence": 92.5,
  "fullText": "...",
  "pages": [...]
}
Common Automations
- Google Drive: Save extracted text alongside PDFs
- Notion/Coda: Create searchable document database
- Slack: Notify when processing completes
- CRM: Attach extracted text to records
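For a self-hosted receiver, an automation such as the Slack notification could start from a function like the one sketched here. It reads only fields that appear in the webhook payload format. `summarizeOcrWebhook` is a hypothetical name, and both the message wording and the review threshold of 80 are assumptions.

```javascript
// Hypothetical handler: turn a webhook payload into a short notification.
// Field names follow the webhook payload format; the message wording and
// the review threshold of 80 are assumptions.
function summarizeOcrWebhook(payload, reviewBelow = 80) {
  if (!payload || payload.event !== 'ocr_completed') {
    return null; // Ignore events this handler does not understand.
  }
  const { runId, processedPages, totalPages, averageConfidence } = payload;
  const needsReview = averageConfidence < reviewBelow;
  return {
    runId,
    needsReview,
    text:
      `OCR run ${runId} finished: ${processedPages}/${totalPages} pages, ` +
      `average confidence ${averageConfidence}%` +
      (needsReview ? ' (manual review suggested)' : ''),
  };
}

const message = summarizeOcrWebhook({
  event: 'ocr_completed',
  runId: 'abc123',
  totalPages: 10,
  processedPages: 10,
  averageConfidence: 92.5,
});
console.log(message.text);
// OCR run abc123 finished: 10/10 pages, average confidence 92.5%
```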
---
Limitations
- File Size: Maximum 50MB per PDF
- Handwriting: Limited support for handwritten text
- Complex Layouts: Multi-column layouts may merge incorrectly
- Image Quality: Low-resolution scans reduce accuracy
- Encrypted PDFs: Password-protected PDFs not supported
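Some of these limits can be checked locally before a file is uploaded. The sketch below is a rough pre-flight check, not a description of how the actor validates input. `preflightPdf` is a hypothetical helper, "50MB" is assumed to mean 50 * 1024 * 1024 bytes, and scanning for the `/Encrypt` marker is only a heuristic that can produce false positives.

```javascript
// Hypothetical pre-flight check based on the limits listed in this README.
// Assumes "50MB" means 50 * 1024 * 1024 bytes. The "/Encrypt" scan is a
// rough heuristic and can produce false positives.
const MAX_PDF_BYTES = 50 * 1024 * 1024;

function preflightPdf(pdfBytes) {
  const buffer = Buffer.isBuffer(pdfBytes) ? pdfBytes : Buffer.from(pdfBytes);
  const problems = [];
  if (buffer.length > MAX_PDF_BYTES) {
    problems.push('File exceeds the 50MB limit');
  }
  if (buffer.subarray(0, 5).toString('latin1') !== '%PDF-') {
    problems.push('File does not start with a PDF header');
  }
  if (buffer.includes('/Encrypt')) {
    problems.push('File appears to be encrypted');
  }
  return problems;
}

// An empty list means none of the checks found a problem.
console.log(preflightPdf(Buffer.from('%PDF-1.7\n...')));
```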
---
Support
- Email: john@johnrippy.link
- GitHub: Report issues on the repository
---
Built by John Rippy | johnrippy.link
---
Keywords
pdf ocr, pdf text extraction, ocr api, scanned pdf to text, document digitization, pdf scraper, image to text, optical character recognition, pdf parser, document processing, invoice ocr, form extraction, adobe alternative, abbyy alternative, tesseract ocr, multi-language ocr