Extraction and OCR¶
The extraction side of the PDF Toolkit is designed for text mining, document intelligence, and form processing workflows.
Extraction endpoints¶
| Endpoint | Purpose |
|---|---|
| POST /v1/pdf/text | Extract text page by page |
| POST /v1/pdf/metadata | Read or update metadata |
| POST /v1/pdf/table-extract | Extract tabular data |
| POST /v1/pdf/form-fields | Read or fill AcroForm fields |
| POST /v1/pdf/info | Get page count and structure |
| POST /v1/pdf/ocr | OCR scanned or image-based PDFs |
REST API Examples¶
Extract text¶
curl -X POST "https://pdf.toolkitapi.io/v1/pdf/text" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io/report.pdf", "pages": "1-3"}'
// JavaScript (fetch) equivalent of the curl request above
const resp = await fetch("https://pdf.toolkitapi.io/v1/pdf/text", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://toolkitapi.io/report.pdf", pages: "1-3" }),
});
const data = await resp.json();
// Print the first 100 characters of each extracted page
data.pages.forEach(p => console.log(`Page ${p.page}: ${p.text.substring(0, 100)}...`));
Get PDF metadata¶
curl -X POST "https://pdf.toolkitapi.io/v1/pdf/metadata" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io/report.pdf"}'
Extract tables¶
curl -X POST "https://pdf.toolkitapi.io/v1/pdf/table-extract" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io/financials.pdf", "pages": "1-5"}'
OCR a scanned PDF¶
curl -X POST "https://pdf.toolkitapi.io/v1/pdf/ocr" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io/scanned-doc.pdf"}'
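For callers who are not using an SDK, the same requests can be assembled with Python's standard library. The following is a minimal sketch rather than an official client: `build_request` is a hypothetical helper, and it assumes that every extraction endpoint accepts a JSON body and the `X-API-Key` header exactly as in the curl examples above.

```python
import json
import urllib.request

BASE_URL = "https://pdf.toolkitapi.io"  # host taken from the curl examples


def build_request(path: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for an extraction endpoint.

    Hypothetical helper: assumes a JSON body and X-API-Key authentication,
    as in the curl examples.
    """
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "/v1/pdf/text",
    "YOUR_KEY",
    {"url": "https://toolkitapi.io/report.pdf", "pages": "1-3"},
)

# Sending is left to the caller, for example:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     data = json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.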
Python SDK Examples¶
Extract text¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.extract_text(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(result["total_word_count"])
    print(result["pages"][0]["text"][:500])
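For text-mining pipelines it is often convenient to flatten the per-page result into a single string. The sketch below assumes the response shape implied by the examples on this page (a `pages` list whose items carry `page` and `text` keys). `join_pages` and the sample dictionary are illustrative only and are not part of the SDK.

```python
def join_pages(result: dict, separator: str = "\n\n") -> str:
    """Concatenate page texts in page order, skipping empty pages."""
    pages = sorted(result.get("pages", []), key=lambda p: p.get("page", 0))
    texts = [p.get("text", "").strip() for p in pages]
    return separator.join(t for t in texts if t)


# Hand-written sample mimicking the assumed response shape (not real API output).
sample = {
    "pages": [
        {"page": 2, "text": "Second page. "},
        {"page": 1, "text": "First page."},
        {"page": 3, "text": ""},
    ]
}
combined = join_pages(sample)
```

In real use, the dictionary returned by `pdf.extract_text(...)` would take the place of `sample`.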
Read metadata¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(url="https://toolkitapi.io/report.pdf")
    print(result["metadata"])
Update metadata¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.metadata(
        url="https://toolkitapi.io/report.pdf",
        update={
            "title": "Q2 Investor Update",
            "author": "Toolkit API",
        },
    )
    print(result["page_count"])
Extract tables¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.table_extract(
        url="https://toolkitapi.io/financials.pdf",
        pages="2-4",
    )
    print(result["total_tables"])
Read or fill form fields¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    # Read the form's existing fields
    fields = pdf.form_fields(url="https://toolkitapi.io/form.pdf")
    print(fields["fields"])

    # Fill selected fields
    filled = pdf.form_fields(
        url="https://toolkitapi.io/form.pdf",
        fill={"first_name": "Chris", "email": "[email protected]"},
    )
    print(filled["total_fields"])
Structural info¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    info = pdf.info(url="https://toolkitapi.io/report.pdf")
    print(info["page_count"])
    print(info["is_encrypted"])
OCR a scanned PDF¶
from toolkitapi import PDF

with PDF(api_key="tk_...") as pdf:
    result = pdf.ocr(
        url="https://toolkitapi.io/scanned.pdf",
        pages="1-2",
        language="eng",
        dpi=300,
    )
    print(result["total_word_count"])
When to use OCR¶
Use OCR when the document is image-based or when normal text extraction returns little or no content.
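The rule above can be sketched as a simple fallback: run normal text extraction first and switch to OCR only when too few words come back. This is an illustration under stated assumptions, not an SDK feature. `extract_with_ocr_fallback` and the `min_words` threshold are hypothetical, the `total_word_count` key is taken from the SDK examples on this page, and the sketch assumes `pages` may be omitted from both calls.

```python
from typing import Any, Dict


def extract_with_ocr_fallback(pdf: Any, url: str, min_words: int = 20) -> Dict[str, Any]:
    """Try normal text extraction first; fall back to OCR if it finds too little.

    `pdf` is any client exposing extract_text(url=...) and ocr(url=...),
    such as the SDK's PDF object. `min_words` is an arbitrary threshold
    chosen for illustration and should be tuned per document type.
    """
    result = pdf.extract_text(url=url)
    if result.get("total_word_count", 0) >= min_words:
        # Enough embedded text was found, so OCR is not needed.
        return {"source": "text", "result": result}
    # Little or no embedded text: treat the document as image-based.
    return {"source": "ocr", "result": pdf.ocr(url=url)}
```

Inside a `with PDF(api_key="tk_...") as pdf:` block, this would be called as `extract_with_ocr_fallback(pdf, "https://toolkitapi.io/report.pdf")`. The returned `source` value records which path produced the text.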