Extraction and OCR

The extraction side of the PDF Toolkit is designed for text mining, document intelligence, and form processing workflows.

Extraction endpoints

Endpoint                     Purpose
POST /v1/pdf/text            Extract text page by page
POST /v1/pdf/metadata        Read or update metadata
POST /v1/pdf/table-extract   Extract tabular data
POST /v1/pdf/form-fields     Read or fill AcroForm fields
POST /v1/pdf/info            Get page count and structure
POST /v1/pdf/ocr             OCR scanned or image-based PDFs

Python SDK examples

Extract text

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.extract_text(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(result["total_word_count"])
    print(result["pages"][0]["text"][:500])
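
For downstream text mining it is often convenient to have the whole document as a single string. The helper below is a minimal sketch that assumes the result has the shape used in the example above (a "pages" list whose items carry a "text" key). The function name join_page_text and the separator argument are illustrative and not part of the SDK.

```python
def join_page_text(result, separator="\n\n"):
    """Concatenate per-page text from an extract_text-style result.

    Assumes result["pages"] is a list of dicts with a "text" key, as in the
    example above. Missing keys are treated as empty rather than raising.
    """
    pages = result.get("pages", [])
    return separator.join(page.get("text", "") for page in pages)


# Stand-in for a real response, so the sketch runs without network access.
sample = {
    "total_word_count": 4,
    "pages": [{"text": "Page one."}, {"text": "Page two."}],
}
print(join_page_text(sample))
```

With a live client, the same helper would be called on the value returned by pdf.extract_text(...).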

Read metadata

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(url="https://toolkitapi.io/report.pdf")
    print(result["metadata"])

Update metadata

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(
        url="https://toolkitapi.io/report.pdf",
        update={
            "title": "Q2 Investor Update",
            "author": "Toolkit API",
        },
    )
    print(result["page_count"])

Extract tables

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.table_extract(
        url="https://toolkitapi.io/financials.pdf",
        pages="2-4",
    )
    print(result["total_tables"])

Read or fill form fields

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    fields = pdf.form_fields(url="https://toolkitapi.io/form.pdf")
    print(fields["fields"])

    filled = pdf.form_fields(
        url="https://toolkitapi.io/form.pdf",
        fill={"first_name": "Chris", "email": "chris@example.com"},
    )
    print(filled["total_fields"])

Structural info

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    info = pdf.info(url="https://toolkitapi.io/report.pdf")
    print(info["page_count"])
    print(info["is_encrypted"])
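
The page count can also drive batch processing of long documents. The sketch below turns a page count into "start-end" strings in the same format as the pages argument used in the other examples. The helper name page_ranges and the chunk size are hypothetical, and whether the API accepts a single-page range such as "21-21" is an assumption to verify.

```python
def page_ranges(page_count, chunk_size=10):
    """Split pages 1..page_count into "start-end" range strings."""
    if page_count < 0 or chunk_size < 1:
        raise ValueError("page_count must be >= 0 and chunk_size >= 1")
    ranges = []
    for start in range(1, page_count + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, page_count)
        ranges.append(f"{start}-{end}")
    return ranges


print(page_ranges(25, chunk_size=10))
```

Each string could then be passed as pages=... to extract_text, table_extract, or ocr, using info["page_count"] as the input.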

OCR a scanned PDF

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.ocr(
        url="https://toolkitapi.io/scanned.pdf",
        pages="1-2",
        language="eng",
        dpi=300,
    )
    print(result["total_word_count"])

When to use OCR

Use OCR when the document is image-based or when normal text extraction returns little or no content.
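
The rule above can be sketched as a fallback: try normal text extraction first, and run OCR only when the word count is too low. This is an illustration, not an SDK feature. The function name extract_with_ocr_fallback and the min_words threshold are assumptions, and pdf is any client object exposing extract_text() and ocr() as in the examples above.

```python
def extract_with_ocr_fallback(pdf, url, pages=None, min_words=20):
    """Return (method, result), where method is "text" or "ocr".

    min_words is an illustrative threshold for "little or no content";
    tune it for your documents.
    """
    kwargs = {"url": url}
    if pages is not None:
        kwargs["pages"] = pages

    result = pdf.extract_text(**kwargs)
    if result.get("total_word_count", 0) >= min_words:
        return "text", result

    # Little or no embedded text: the document is probably image-based.
    return "ocr", pdf.ocr(**kwargs)
```

With the SDK client this might be called as extract_with_ocr_fallback(pdf, "https://toolkitapi.io/scanned.pdf", pages="1-2") inside the usual with PDF(...) block. OCR options such as language and dpi are left at their defaults here.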