Extraction and AI
The unified scrape API is designed to return more than raw HTML. You can ask for structured fields, cleaned content, metadata, and LLM-friendly outputs in the same request.
Output modes
| Output | Best for |
|---|---|
| html | raw parsing and archival |
| markdown | LLM and RAG pipelines |
| text | NLP and search indexing |
| clean | article-like readable output |
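As a rough illustration of the table above, the mapping can live in a small lookup so calling code picks an output value by use case. The helper name `pick_output`, the use-case labels, and the `markdown` fallback are hypothetical; only the four output values come from the table.

```python
# Hypothetical helper: maps a use case to one of the four output modes
# listed in the table. The use-case labels are invented for illustration.
OUTPUT_FOR_USE_CASE = {
    "archive": "html",
    "parse": "html",
    "llm": "markdown",
    "rag": "markdown",
    "nlp": "text",
    "search_index": "text",
    "reader_view": "clean",
}


def pick_output(use_case: str, default: str = "markdown") -> str:
    """Return the output mode for a use case, falling back to a default."""
    return OUTPUT_FOR_USE_CASE.get(use_case.lower(), default)


print(pick_output("rag"))      # markdown
print(pick_output("archive"))  # html
```

The returned string would then be passed as the `output` argument shown in the later examples.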
Article extraction
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.extract_article(
        url="https://toolkitapi.io/blog/post",
        include_links=True,
    )
    print(result.get("article"))
    print(result["content"])
Selector-based extraction
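from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io/product/123",
        render_js=True,  # enable only when the page needs JavaScript rendering
        extract={
            "selectors": {
                # Short form: field name mapped to a CSS selector
                "title": "h1",
                "price": ".price",
                # Long form: read the src attribute from every matching element
                "image_urls": {
                    "selector": ".gallery img",
                    "attr": "src",
                    "multiple": True,
                },
            }
        },
    )
    print(result["selectors"])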
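Selector values typically come back as raw page text, so a price such as `$1,299.00` usually needs normalizing before it can be compared or stored. The sketch below shows one way to do that with the standard library. The sample `selectors` dictionary is an assumed response shape, not one confirmed by the API, and `parse_price` is a hypothetical helper that assumes a dot as the decimal separator.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Turn text like '$1,299.00' into a Decimal; return None if unparseable."""
    if not raw:
        return None
    # Keep digits and the decimal point; drop currency symbols, commas, spaces.
    cleaned = re.sub(r"[^0-9.]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


# Assumed shape of result["selectors"] for the request above (not confirmed).
selectors = {
    "title": "Example Widget",
    "price": "$1,299.00",
    "image_urls": ["/img/a.jpg", "/img/b.jpg"],
}

print(parse_price(selectors.get("price")))  # 1299.00
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are later summed or compared.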
Extract links, images, and metadata together
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io",
        output="markdown",
        extract={
            "links": True,
            "images": True,
            "meta_tags": True,
            "structured_data": True,
            "headers": True,
            "link_preview": True,
        },
    )
    print(result.get("links"))
    print(result.get("meta_tags"))
    print(result.get("structured_data"))
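Extracted links often mix relative paths, duplicates, and off-site URLs. The following sketch resolves and filters them to same-host URLs, which is a common step before crawling further. It assumes `links` is a list of URL strings; the actual response shape is not documented here, and `internal_links` is a hypothetical helper.

```python
from urllib.parse import urljoin, urlparse


def internal_links(links, base_url):
    """Resolve links against base_url and keep unique same-host URLs, in order."""
    base_host = urlparse(base_url).netloc.lower()
    seen = set()
    kept = []
    for href in links or []:
        absolute = urljoin(base_url, href)
        # Skip anything that points at a different host.
        if urlparse(absolute).netloc.lower() != base_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept


# Sample data standing in for result.get("links"); the shape is an assumption.
sample = ["/docs", "https://toolkitapi.io/pricing", "https://example.com/", "/docs"]
print(internal_links(sample, "https://toolkitapi.io"))
# ['https://toolkitapi.io/docs', 'https://toolkitapi.io/pricing']
```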
Chunk content for RAG
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io/docs",
        output="markdown",
        chunk={
            "enabled": True,
            "method": "heading",
            "max_tokens": 500,
        },
    )
    print(result.get("chunks"))
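To give a feel for what heading-based chunking with a token cap means, here is a local sketch of the general idea. It is not the service's algorithm: it counts whitespace-separated words instead of real tokens, collapses whitespace inside each chunk, and would mis-split code blocks containing lines that start with `#`. The function name `chunk_by_heading` is hypothetical.

```python
import re


def chunk_by_heading(markdown: str, max_tokens: int = 500):
    """Split markdown at headings, then cap each chunk at max_tokens words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    # Start a new section before every line that begins with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        # Whitespace-separated words stand in for tokens in this sketch.
        words = section.split()
        for start in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks


doc = "# Intro\none two three\n## Setup\nfour five"
print(chunk_by_heading(doc, max_tokens=3))
# ['# Intro one', 'two three', '## Setup four', 'five']
```

Splitting on headings first keeps each chunk within a single topic, and the size cap only applies when a section is too long on its own.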
Schema-based AI extraction
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.ai_extract(
        url="https://toolkitapi.io/pricing",
        prompt="Extract plan names, prices, and short descriptions.",
        schema={
            "type": "object",
            "properties": {
                "plans": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "summary": {"type": "string"},
                        }
                    }
                }
            }
        },
    )
    print(result.get("ai_extract"))
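Model-generated output may not always match the requested schema, so it can be worth checking the result before using it downstream. This page does not say whether the service validates on its side, so the sketch below performs a minimal local check. `conforms` is a hypothetical helper covering only the `object`, `array`, and `string` types used above; a full validator such as the `jsonschema` package would be the more complete choice.

```python
def conforms(value, schema) -> bool:
    """Check value against a small JSON Schema subset: object, array, string."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Only properties that are present get checked; none are required here.
        return all(conforms(value[k], sub) for k, sub in props.items() if k in value)
    if kind == "array":
        if not isinstance(value, list):
            return False
        return all(conforms(item, schema.get("items", {})) for item in value)
    if kind == "string":
        return isinstance(value, str)
    # Unknown or missing type: accept, since this sketch covers three types only.
    return True


plan = {"type": "object", "properties": {
    "name": {"type": "string"},
    "price": {"type": "string"},
    "summary": {"type": "string"},
}}
schema = {"type": "object", "properties": {"plans": {"type": "array", "items": plan}}}

# Sample payloads standing in for result.get("ai_extract"); shapes are assumed.
good = {"plans": [{"name": "Pro", "price": "$29", "summary": "For teams"}]}
bad = {"plans": [{"name": "Pro", "price": 29}]}

print(conforms(good, schema))  # True
print(conforms(bad, schema))   # False
```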
Best practices
- Use markdown for AI-facing workflows
- Add render_js only when the target site truly needs it
- Combine selectors and metadata extraction in one request to reduce round-trips
- Use chunking when documents will be indexed or embedded downstream
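The practices above can be sketched as one helper that assembles the arguments for a single request. `build_fetch_kwargs` is a hypothetical function; the keys it emits are the ones used in the examples on this page, and it assumes that `selectors` and the metadata flags may share one `extract` dictionary.

```python
def build_fetch_kwargs(url, selectors=None, metadata=(), output="markdown",
                       render_js=False, chunk_tokens=None):
    """Assemble keyword arguments for one combined fetch request."""
    kwargs = {"url": url, "output": output}

    # Selectors and metadata flags go into the same extract block,
    # so one request covers both.
    extract = {}
    if selectors:
        extract["selectors"] = dict(selectors)
    for flag in metadata:
        extract[flag] = True
    if extract:
        kwargs["extract"] = extract

    # Only ask for JavaScript rendering when the caller opts in.
    if render_js:
        kwargs["render_js"] = True

    # Chunk only when the content will be indexed or embedded downstream.
    if chunk_tokens:
        kwargs["chunk"] = {"enabled": True, "method": "heading",
                           "max_tokens": chunk_tokens}
    return kwargs


kwargs = build_fetch_kwargs(
    "https://toolkitapi.io/product/123",
    selectors={"title": "h1", "price": ".price"},
    metadata=["meta_tags", "links"],
    chunk_tokens=500,
)
print(kwargs["extract"])
```

The resulting dictionary is intended to be passed on as `scrape.fetch(**kwargs)`.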