Extraction and AI

The unified scrape API is designed to return more than raw HTML. You can ask for structured fields, cleaned content, metadata, and LLM-friendly outputs in the same request.

Output modes

Output	Best for
html	raw parsing and archival
markdown	LLM and RAG pipelines
text	NLP and search indexing
clean	article-like readable output

Article extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.extract_article(
        url="https://toolkitapi.io/blog/post",
        include_links=True,
    )

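    # .get() returns None instead of raising if the response has no "article" key
    print(result.get("article"))
    # Direct indexing raises KeyError if "content" is missing
    print(result["content"])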

Selector-based extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io/product/123",
        render_js=True,
        extract={
            "selectors": {
                "title": "h1",
                "price": ".price",
                "image_urls": {
                    "selector": ".gallery img",
                    "attr": "src",
                    "multiple": True,
                },
            }
        },
    )

    print(result["selectors"])
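
The shape of result["selectors"] is not spelled out above. As a minimal sketch, assuming it is a dict keyed by the names used in the request, with price as a display string such as "$1,299.00" and image_urls as a list of possibly relative URLs, the values could be normalized as follows. The helper name normalize_product and the sample payload are hypothetical and not part of the toolkitapi client.

```python
import re
from decimal import Decimal, InvalidOperation
from urllib.parse import urljoin


def normalize_product(selectors, base_url):
    """Turn raw selector output into typed values (hypothetical helper).

    Assumes `selectors` is a dict with optional "title", "price",
    and "image_urls" keys, mirroring the request above.
    """
    raw_price = selectors.get("price") or ""
    # Keep digits and the decimal point; this drops currency symbols and
    # thousands separators, so it only suits "1,299.00"-style formatting.
    digits = re.sub(r"[^0-9.]", "", raw_price)
    try:
        price = Decimal(digits) if digits else None
    except InvalidOperation:
        price = None  # e.g. "1.2.3" or a lone "."

    # Resolve relative image paths against the page URL.
    images = [urljoin(base_url, src) for src in selectors.get("image_urls") or []]

    return {
        "title": (selectors.get("title") or "").strip(),
        "price": price,
        "image_urls": images,
    }


# Hypothetical payload standing in for result["selectors"].
sample = {
    "title": "  Widget ",
    "price": "$1,299.00",
    "image_urls": ["/img/a.png", "https://cdn.example.com/b.png"],
}
product = normalize_product(sample, "https://toolkitapi.io/product/123")
print(product)
```

Parsing the price into a Decimal rather than a float avoids rounding surprises if the value is later summed or compared.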

Extract links, images, and metadata together

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io",
        output="markdown",
        extract={
            "links": True,
            "images": True,
            "meta_tags": True,
            "structured_data": True,
            "headers": True,
            "link_preview": True,
        },
    )

    print(result.get("links"))
    print(result.get("meta_tags"))
    print(result.get("structured_data"))

Chunk content for RAG

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io/docs",
        output="markdown",
        chunk={
            "enabled": True,
            "method": "heading",
            "max_tokens": 500,
        },
    )

    print(result.get("chunks"))
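
How the service splits content is not documented here. To illustrate what heading-based chunking with a token cap means, the following local sketch (not the service's implementation) splits markdown at heading lines and then caps each chunk by whitespace-separated word count, a rough stand-in for tokens. The function name chunk_by_heading and the returned dict keys are hypothetical.

```python
import re

# ATX-style markdown headings: one to six "#" characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown, max_tokens=500):
    """Split markdown into heading-scoped chunks of at most max_tokens words.

    Simplifications: words approximate tokens, headings without body text
    are dropped, and "#" lines inside code fences are treated as headings.
    """
    sections = []  # list of (heading, words) pairs
    heading, words = None, []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if words:
                sections.append((heading, words))
            heading, words = match.group(1).strip(), []
        else:
            words.extend(line.split())
    if words:
        sections.append((heading, words))

    chunks = []
    for heading, words in sections:
        # Slice long sections into consecutive windows of max_tokens words.
        for start in range(0, len(words), max_tokens):
            window = words[start:start + max_tokens]
            chunks.append({"heading": heading, "text": " ".join(window)})
    return chunks


doc = "# Intro\none two three\n\n## Setup\nfour five six seven eight\n"
chunks = chunk_by_heading(doc, max_tokens=3)
for chunk in chunks:
    print(chunk)
```

Keeping the heading alongside each chunk gives a retriever some context even when a long section is split into several pieces.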

Schema-based AI extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.ai_extract(
        url="https://toolkitapi.io/pricing",
        prompt="Extract plan names, prices, and short descriptions.",
        schema={
            "type": "object",
            "properties": {
                "plans": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "summary": {"type": "string"},
                        }
                    }
                }
            }
        },
    )

    print(result.get("ai_extract"))
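
The schema above declares no required fields, so a response could omit name, price, or summary and still conform. As a minimal sketch, assuming result.get("ai_extract") returns a dict shaped like the schema, a stricter local check could keep only complete plan entries before they are used downstream. The helper valid_plans and the sample payload are hypothetical.

```python
PLAN_FIELDS = ("name", "price", "summary")


def valid_plans(data):
    """Return plan entries where every expected field is a string.

    Stricter than the schema above, which does not mark any field as
    required. Hypothetical helper, not part of the client library.
    """
    if not isinstance(data, dict):
        return []
    plans = data.get("plans")
    if not isinstance(plans, list):
        return []
    return [
        plan
        for plan in plans
        if isinstance(plan, dict)
        and all(isinstance(plan.get(field), str) for field in PLAN_FIELDS)
    ]


# Hypothetical payload standing in for result.get("ai_extract").
sample = {
    "plans": [
        {"name": "Starter", "price": "$9/mo", "summary": "For small projects"},
        {"name": "Pro", "price": 49},  # wrong type and missing summary
    ]
}
plans = valid_plans(sample)
print(plans)
```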

Best practices

Use markdown output for LLM and RAG workflows
Set render_js=True only when the target page needs JavaScript rendering to produce its content
Combine selectors and metadata extraction in one request to reduce round-trips
Use chunking when documents will be indexed or embedded downstream
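
The examples on this page pass either selectors or the metadata flags in extract, never both. Assuming the two can share one extract dict, as the third practice implies, a combined payload could be assembled as sketched below. build_extract is a hypothetical helper; the code only constructs the dict and does not call the API.

```python
def build_extract(selectors=None, **flags):
    """Build one extract payload from selectors plus boolean flags.

    Flag names such as links or meta_tags mirror the keys shown on this
    page; whether the API accepts them alongside selectors is an assumption.
    """
    # Include only the flags that are switched on.
    extract = {name: True for name, enabled in flags.items() if enabled}
    if selectors:
        extract["selectors"] = dict(selectors)
    return extract


extract = build_extract(
    {"title": "h1", "price": ".price"},
    links=True,
    meta_tags=True,
    images=False,
)
print(extract)
```

The resulting dict would be passed as extract= to scrape.fetch, so the page is fetched once rather than once per kind of data.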
