Extraction and AI

The unified scrape API is designed to return more than raw HTML. You can ask for structured fields, cleaned content, metadata, and LLM-friendly outputs in the same request.

Output modes

Output     Best for
html       raw parsing and archival
markdown   LLM and RAG pipelines
text       NLP and search indexing
clean      article-like readable output
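
When the output mode is chosen programmatically, the table above reduces to a small lookup. A minimal sketch; the helper name and the use-case keys are illustrative, not part of the toolkitapi client:

```python
# Illustrative helper (not part of the toolkitapi client): map a
# downstream use case to the matching output mode from the table.
OUTPUT_FOR_USE_CASE = {
    "archival": "html",      # raw parsing and archival
    "rag": "markdown",       # LLM and RAG pipelines
    "search": "text",        # NLP and search indexing
    "reading": "clean",      # article-like readable output
}

def output_mode(use_case: str) -> str:
    """Return the output mode for a use case, defaulting to raw HTML."""
    return OUTPUT_FOR_USE_CASE.get(use_case, "html")
```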

Article extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.extract_article(
        url="https://toolkitapi.io/blog/post",
        include_links=True,
    )

    print(result.get("article"))
    print(result.get("content"))

Selector-based extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io/product/123",
        render_js=True,
        extract={
            "selectors": {
                "title": "h1",
                "price": ".price",
                "image_urls": {
                    "selector": ".gallery img",
                    "attr": "src",
                    "multiple": True,
                },
            }
        },
    )

    print(result["selectors"])

Metadata and link extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io",
        output="markdown",
        extract={
            "links": True,
            "images": True,
            "meta_tags": True,
            "structured_data": True,
            "headers": True,
            "link_preview": True,
        },
    )

    print(result.get("links"))
    print(result.get("meta_tags"))
    print(result.get("structured_data"))

Chunk content for RAG

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.fetch(
        url="https://toolkitapi.io/docs",
        output="markdown",
        chunk={
            "enabled": True,
            "method": "heading",
            "max_tokens": 500,
        },
    )

    print(result.get("chunks"))
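
Returned chunks typically feed straight into an embedding or indexing step. A sketch of that downstream loop, using a simulated response in place of a live call; the exact chunk shape (`heading`, `content` keys) is an assumption for illustration:

```python
# Simulated chunked response; a real one would come from scrape.fetch
# with chunking enabled. The field names here are assumptions.
result = {
    "chunks": [
        {"heading": "Getting started", "content": "Install the client..."},
        {"heading": "Authentication", "content": "Pass your API key..."},
    ]
}

records = [
    {
        "id": i,
        # Prefix each chunk with its heading so the embedded text
        # keeps its surrounding context.
        "text": f"{chunk['heading']}\n\n{chunk['content']}",
    }
    for i, chunk in enumerate(result["chunks"])
]

for record in records:
    print(record["id"], record["text"].splitlines()[0])
```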

Schema-based AI extraction

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    result = scrape.ai_extract(
        url="https://toolkitapi.io/pricing",
        prompt="Extract plan names, prices, and short descriptions.",
        schema={
            "type": "object",
            "properties": {
                "plans": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "string"},
                            "summary": {"type": "string"},
                        }
                    }
                }
            }
        },
    )

    print(result.get("ai_extract"))
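
The extracted data should conform to the schema you passed in, but a light sanity check before using it downstream is cheap. A sketch against a simulated payload; the response content shown here is invented for illustration:

```python
# Simulated ai_extract payload matching the schema above; a real
# payload would come from the API response.
payload = {
    "plans": [
        {"name": "Free", "price": "$0", "summary": "Hobby projects"},
        {"name": "Pro", "price": "$29", "summary": "Production workloads"},
    ]
}

required = {"name", "price", "summary"}
valid_plans = [
    plan for plan in payload.get("plans", [])
    if required <= plan.keys()  # keep only plans with every expected field
]

print(len(valid_plans))
```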

Best practices

  • Use markdown for AI-facing workflows
  • Add render_js only when the target site truly needs it
  • Combine selectors and metadata extraction in one request to reduce round-trips
  • Use chunking when documents will be indexed or embedded downstream