Extraction and OCR

The extraction side of the PDF Toolkit is designed for text mining, document intelligence, and form processing workflows.

Extraction endpoints

Endpoint                      Purpose
POST /v1/pdf/text             Extract text page by page
POST /v1/pdf/metadata         Read or update metadata
POST /v1/pdf/table-extract    Extract tabular data
POST /v1/pdf/form-fields      Read or fill AcroForm fields
POST /v1/pdf/info             Get page count and structure
POST /v1/pdf/ocr              OCR scanned or image-based PDFs
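
The endpoints listed above appear to share one request shape: a JSON body sent by POST with an X-API-Key header. The sketch below builds such a request with only the Python standard library, without sending it. The helper name build_request and the BASE_URL constant are illustrative and not part of the Toolkit API, and the {"url": ...} body is taken from the metadata example on this page; other endpoints may accept additional fields.

```python
import json
import urllib.request

# Host taken from the curl examples on this page.
BASE_URL = "https://pdf.toolkitapi.io"


def build_request(endpoint: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one extraction endpoint.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    return urllib.request.Request(
        url=f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "/v1/pdf/metadata",
    {"url": "https://toolkitapi.io/report.pdf"},
    "YOUR_KEY",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without a real API key.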

REST API examples

Extract text

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/text" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io/report.pdf", "pages": "1-3"}'

const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/text", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://toolkitapi.io/report.pdf", pages: "1-3" }),
});
const data = await resp.json();
data.pages.forEach(p => console.log(`Page ${p.page}: ${p.text.substring(0, 100)}...`));

Get PDF metadata

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/metadata" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io/report.pdf"}'

Extract tables

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/table-extract" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io/financials.pdf", "pages": "1-5"}'

OCR a scanned PDF

curl -X POST "https://pdf.toolkitapi.io/v1/pdf/ocr" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io/scanned-doc.pdf"}'

Python SDK examples

Extract text

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.extract_text(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(result["total_word_count"])
    print(result["pages"][0]["text"][:500])

Read metadata

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(url="https://toolkitapi.io/report.pdf")
    print(result["metadata"])

Update metadata

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(
        url="https://toolkitapi.io/report.pdf",
        update={
            "title": "Q2 Investor Update",
            "author": "Toolkit API",
        },
    )
    print(result["page_count"])

Extract tables

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.table_extract(
        url="https://toolkitapi.io/financials.pdf",
        pages="2-4",
    )
    print(result["total_tables"])

Read or fill form fields

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    fields = pdf.form_fields(url="https://toolkitapi.io/form.pdf")
    print(fields["fields"])

    filled = pdf.form_fields(
        url="https://toolkitapi.io/form.pdf",
        fill={"first_name": "Chris", "email": "chris@example.com"},
    )
    print(filled["total_fields"])

Structural info

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    info = pdf.info(url="https://toolkitapi.io/report.pdf")
    print(info["page_count"])
    print(info["is_encrypted"])

OCR a scanned PDF

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.ocr(
        url="https://toolkitapi.io/scanned.pdf",
        pages="1-2",
        language="eng",
        dpi=300,
    )
    print(result["total_word_count"])

When to use OCR

Use OCR when the document is image-based or when normal text extraction returns little or no content.
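
One way to apply this rule is to try normal text extraction first and call OCR only if the result looks empty. The sketch below assumes a client exposing extract_text and ocr with the keyword arguments used in the Python examples on this page, and that both results carry total_word_count. The MIN_WORDS cutoff and the function name text_or_ocr are hypothetical choices for illustration, not part of the SDK.

```python
from typing import Any, Protocol


class PDFClient(Protocol):
    """Structural type for the two SDK calls this sketch relies on."""

    def extract_text(self, *, url: str, pages: str) -> dict[str, Any]: ...

    def ocr(self, *, url: str, pages: str, language: str, dpi: int) -> dict[str, Any]: ...


# Hypothetical cutoff: fewer words than this across the requested pages is
# treated as "little or no content". Tune it for your own documents.
MIN_WORDS = 20


def text_or_ocr(pdf: PDFClient, url: str, pages: str = "1-3") -> tuple[str, dict[str, Any]]:
    """Return ("text", result) if extraction found enough words, else ("ocr", result)."""
    result = pdf.extract_text(url=url, pages=pages)
    if result.get("total_word_count", 0) >= MIN_WORDS:
        return "text", result
    # Extraction came back nearly empty, so assume the pages are image-based.
    return "ocr", pdf.ocr(url=url, pages=pages, language="eng", dpi=300)
```

With the SDK this would be called inside the usual context manager, for example source, result = text_or_ocr(pdf, "https://toolkitapi.io/scanned.pdf"). Trying extraction first avoids running OCR on documents that already contain a text layer.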