Extraction and OCR
The extraction side of the PDF Toolkit is designed for text mining, document intelligence, and form processing workflows.
Extraction endpoints

| Endpoint | Purpose |
|---|---|
| POST /v1/pdf/text | Extract text page by page |
| POST /v1/pdf/metadata | Read or update metadata |
| POST /v1/pdf/table-extract | Extract tabular data |
| POST /v1/pdf/form-fields | Read or fill AcroForm fields |
| POST /v1/pdf/info | Get page count and structure |
| POST /v1/pdf/ocr | OCR scanned or image-based PDFs |
Python SDK examples

Extract text

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.extract_text(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(result["total_word_count"])
    # Preview the first 500 characters of the first returned page
    print(result["pages"][0]["text"][:500])
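When the text is needed as a single string, the per-page results can be joined on the client. The helper below is a minimal sketch, not part of the SDK: `join_page_text` is a hypothetical name, and it assumes only the response shape used above (a `pages` list whose items carry a `text` key).

```python
def join_page_text(result, separator="\n\n"):
    """Concatenate per-page text from an extract_text-style response.

    Assumes result["pages"] is a list of dicts with a "text" key, as in
    the example above. Pages with missing or empty text are skipped.
    """
    texts = []
    for page in result.get("pages", []):
        text = (page.get("text") or "").strip()
        if text:
            texts.append(text)
    return separator.join(texts)


# Stand-in response for illustration; real responses may carry more keys.
sample = {"pages": [{"text": "First page."}, {"text": ""}, {"text": "Third page."}]}
print(join_page_text(sample))
```

Skipping empty pages keeps the joined string free of stray separators, which is convenient when the text is passed on to a tokenizer or search index.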
Read metadata

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(url="https://toolkitapi.io/report.pdf")
    print(result["metadata"])
Update metadata

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(
        url="https://toolkitapi.io/report.pdf",
        update={
            "title": "Q2 Investor Update",
            "author": "Toolkit API",
        },
    )
    print(result["page_count"])
Extract tables

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.table_extract(
        url="https://toolkitapi.io/financials.pdf",
        pages="2-4",
    )
    print(result["total_tables"])
Read or fill form fields

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    # Without `fill`, the call only reads the existing fields
    fields = pdf.form_fields(url="https://toolkitapi.io/form.pdf")
    print(fields["fields"])

    # With `fill`, the given values are written to the named fields
    filled = pdf.form_fields(
        url="https://toolkitapi.io/form.pdf",
        fill={"first_name": "Chris", "email": "[email protected]"},
    )
    print(filled["total_fields"])
Structural info

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    info = pdf.info(url="https://toolkitapi.io/report.pdf")
    print(info["page_count"])
    print(info["is_encrypted"])
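For long documents, the reported `page_count` can drive batched requests. The sketch below is an illustration rather than SDK functionality: `page_batches` is a hypothetical helper, and it assumes that pages are numbered from 1 and that the `pages` parameter accepts a `start-end` string for every batch, including a one-page batch such as `21-21`.

```python
def page_batches(page_count, batch_size=10):
    """Yield "start-end" page ranges that together cover pages 1..page_count.

    Assumes 1-based page numbers and the "start-end" range format seen in
    the SDK examples; neither is confirmed beyond those examples.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, page_count + 1, batch_size):
        end = min(start + batch_size - 1, page_count)
        yield f"{start}-{end}"


# Example: a 25-page document split into batches of at most 10 pages.
print(list(page_batches(25, batch_size=10)))
```

Each yielded string could then be passed as `pages=` to a call such as `pdf.extract_text`, which keeps individual responses small.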
OCR a scanned PDF

from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.ocr(
        url="https://toolkitapi.io/scanned.pdf",
        pages="1-2",
        language="eng",
        dpi=300,
    )
    print(result["total_word_count"])
When to use OCR
Use OCR when the document is image-based or when normal text extraction returns little or no content.
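One way to apply this rule is to try normal extraction first and fall back to OCR when the word count is low. The following is a rough sketch under stated assumptions: `extract_with_ocr_fallback`, the `min_words` threshold of 20, and the `_StubClient` stand-in are all hypothetical, and the code relies only on the `extract_text`, `ocr`, and `total_word_count` names shown in the SDK examples.

```python
def extract_with_ocr_fallback(pdf, url, pages=None, min_words=20, ocr_options=None):
    """Return (result, used_ocr), falling back to OCR when text is sparse.

    `pdf` is any client exposing extract_text() and ocr() as in the SDK
    examples. `min_words` is an arbitrary illustrative threshold.
    """
    kwargs = {"url": url}
    if pages is not None:
        kwargs["pages"] = pages

    result = pdf.extract_text(**kwargs)
    if result.get("total_word_count", 0) >= min_words:
        return result, False

    # Little or no embedded text: treat the document as image-based.
    return pdf.ocr(**kwargs, **(ocr_options or {})), True


class _StubClient:
    """Stand-in for the SDK client so the sketch runs without network access."""

    def extract_text(self, url, pages=None):
        return {"total_word_count": 0, "pages": []}

    def ocr(self, url, pages=None, **options):
        return {"total_word_count": 412, "pages": []}


result, used_ocr = extract_with_ocr_fallback(
    _StubClient(),
    "https://toolkitapi.io/scanned.pdf",
    ocr_options={"language": "eng", "dpi": 300},
)
print(used_ocr, result["total_word_count"])
```

With the real SDK, the `pdf` object from `with PDF(api_key="tk_...") as pdf:` would take the place of the stub. The threshold is worth tuning per document type, since a cover page or a scanned form with a thin text layer can legitimately contain very few words.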