# Web Scraping Toolkit
The Web Scraping Toolkit gives you a unified scraping API with richer structured outputs. Use it to fetch HTML, Markdown, or plain text, render JavaScript when needed, extract metadata and selectors, and run SEO or crawl workflows without maintaining your own browser infrastructure.
It is especially useful for teams who want the convenience of a “single scrape endpoint” model but also want better downstream ergonomics for LLM ingestion, data extraction, and content QA.
## Base URL

```
https://scrape.toolkitapi.io
```
## Start here: the unified scrape endpoint

For most jobs, start with:

```
POST /v1/scrape
```

This one endpoint supports:

- `output`: `html`, `markdown`, `text`, or `clean`
- `render_js` for client-side rendered pages
- `wait_for`, `wait_until`, and `scroll` for timing control
- `extract.links`, `extract.images`, `extract.meta_tags`, `extract.structured_data`, `extract.headers`, and `extract.link_preview`
- `extract.selectors` for CSS-selector-based structured extraction
- `ai_extract` for schema-driven AI extraction
- `chunk` for LLM/RAG chunking
- `proxy`, `headers`, `cookies`, and `session_name` for more advanced networking flows (see the sketch after this list)
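Since the networking options do not appear in the examples further down, here is a minimal sketch. The exact value shapes, such as the `cookies` structure and the `proxy` string format, are assumptions to verify against the networking subpage:

```bash
# Sketch only: the header/cookie/session value shapes below are
# assumptions, not confirmed by these docs.
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://example.com/account",
    "headers": { "Accept-Language": "en-US" },
    "cookies": { "session": "abc123" },
    "session_name": "login-flow",
    "output": "html"
  }'
```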
## Browse by topic

To make the examples easier to work through, the scrape docs are split into focused subpages.
## First example
```bash
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://www.python.org",
    "output": "markdown",
    "extract": {
      "meta_tags": true,
      "link_preview": true,
      "selectors": {
        "headline": "h1"
      }
    }
  }'
```
The response returns the converted content alongside each extracted field:

```json
{
  "url": "https://www.python.org/",
  "status_code": 200,
  "content_type": "text/html; charset=utf-8",
  "js_rendered": false,
  "content": "# Welcome to Python.org\n...",
  "output_format": "markdown",
  "word_count": 220,
  "char_count": 3133,
  "meta_tags": {
    "title": "Welcome to Python.org"
  },
  "selectors": {
    "headline": "Welcome to Python.org"
  }
}
```
## Output formats

| Output | Use it when you need | Result |
|---|---|---|
| `html` | raw parsing or archival | HTML content |
| `markdown` | LLM/RAG ingestion | clean Markdown |
| `text` | search, classification, NLP | plain text |
| `clean` | article-like readable extraction | noise-reduced content |
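The only change between formats is the `output` field; the rest of the request stays the same. For example, to get a reader-style version of a page:

```bash
# Same endpoint as the first example; only "output" differs.
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{ "url": "https://www.python.org", "output": "clean" }'
```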
## Common scraping patterns

### 1. LLM-ready content

Use Markdown or clean text when the response will be embedded, chunked, or summarized.

```json
{
  "url": "https://toolkitapi.io/docs",
  "output": "markdown",
  "chunk": { "enabled": true, "method": "heading", "max_tokens": 500 }
}
```
### 2. Structured extraction with selectors

This is the closest equivalent to ScrapingBee-style `extract_rules`:

```json
{
  "url": "https://toolkitapi.io/product/123",
  "extract": {
    "selectors": {
      "title": "h1",
      "price": ".price",
      "images": { "selector": ".gallery img", "attr": "src", "multiple": true }
    }
  }
}
```
### 3. JavaScript-heavy pages

```json
{
  "url": "https://quotes.toscrape.com/js/",
  "render_js": true,
  "wait_until": "networkidle",
  "block_resources": ["image", "font"],
  "stealth": true,
  "output": "text"
}
```
### 4. SEO and monitoring workflows
You can audit a single URL, compare multiple pages, or check keyword density and page speed without wiring up a second service.
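For instance, a single-page audit is one GET request. Passing the target page as a `url` query parameter is an assumption here; check the endpoint reference for the exact signature:

```bash
# Assumption: the page under audit is passed as a "url" query parameter.
curl -G "https://scrape.toolkitapi.io/v1/scrape/audit" \
  -H "X-API-Key: YOUR_KEY" \
  --data-urlencode "url=https://www.python.org"
```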
## Specialised endpoints

| Endpoint | Purpose |
|---|---|
| `GET /v1/scrape/audit` | Full SEO audit for one page |
| `GET /v1/scrape/keyword-density` | Readability and top-keyword analysis |
| `GET /v1/scrape/mobile-friendly` | Mobile-readiness checks |
| `GET /v1/scrape/broken-links` | Link health and status validation |
| `GET /v1/scrape/pagespeed` | TTFB, compression, and response size |
| `POST /v1/scrape/bulk-audit` | SEO audit for multiple URLs |
| `POST /v1/scrape/compare` | Side-by-side page comparison |
| `POST /v1/scrape/pdf` | Extract text from remote PDFs |
| `GET /v1/scrape/sitemap` | Parse sitemap files or sitemap indexes |
| `GET /v1/scrape/robots` | Parse robots.txt rules |
| `POST /v1/scrape/crawl` | Start an async same-domain crawl |
| `GET /v1/scrape/crawl/{job_id}` | Poll crawl job status and progress |
| `POST /v1/screenshot` | Capture a full-page screenshot |
| `POST /v1/screenshot/element` | Capture a specific element screenshot |
| `POST /v1/screenshot/pdf` | Capture a page as PDF |
| `GET /v1/screenshot/download/{object_name}` | Download a previously generated screenshot artifact |
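The crawl endpoints are asynchronous: you start a job, then poll it by ID. The flow below is a sketch; body fields other than `url` (such as `max_pages`) are assumptions to verify against the crawl subpage:

```bash
# 1. Start an async same-domain crawl.
#    "max_pages" is an assumed option, not confirmed by these docs.
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{ "url": "https://toolkitapi.io/docs", "max_pages": 50 }'

# 2. Poll status and progress using the job_id returned by the first call.
curl "https://scrape.toolkitapi.io/v1/scrape/crawl/JOB_ID" \
  -H "X-API-Key: YOUR_KEY"
```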
## If you are migrating from ScrapingBee

Here is the rough feature map:

| ScrapingBee concept | Toolkit equivalent |
|---|---|
| `render_js=true` | `render_js: true` |
| `return_page_markdown=true` | `output: "markdown"` |
| `return_page_text=true` | `output: "text"` |
| `extract_rules` | `extract.selectors` |
| `wait_for` | `wait_for` |
| `wait_browser` | `wait_until` |
| `block_resources` | `block_resources` |
| `premium_proxy` / `country_code` | `proxy` |
| AI extraction rules | `ai_extract.schema` + optional `prompt` |
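Concretely, a ScrapingBee-style call built from query parameters becomes a single JSON body. The "before" line is abbreviated and illustrative rather than an exact ScrapingBee request; the "after" request maps each parameter per the table above:

```bash
# Before (ScrapingBee-style, abbreviated and illustrative):
#   GET https://app.scrapingbee.com/api/v1/?api_key=KEY&url=...&render_js=true&wait_browser=networkidle
# After (Toolkit equivalent, per the mapping above):
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://quotes.toscrape.com/js/",
    "render_js": true,
    "wait_until": "networkidle",
    "output": "markdown"
  }'
```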
See the full migration guide for the complete parameter reference.
## Why teams use this toolkit
- One request instead of many — scrape content and metadata in a single call
- Cleaner LLM input — Markdown, text, chunking, and structured output are first-class features
- SEO-ready — audit, compare, and performance endpoints are built in
- Crawl support — move from single-page fetches to async multi-page jobs when needed
## Comparison mindset
If you are evaluating this against more general scraping providers, the main value is not just “can it fetch the page?” — it is how much useful structure you get back immediately.
| Workflow need | Typical generic scrape flow | Toolkit API flow |
|---|---|---|
| LLM-ready article extraction | Fetch HTML → clean it → convert it | `POST /v1/scrape` with `output: "markdown"` |
| CSS-based field extraction | Fetch HTML → run selector parser | `extract.selectors` in the same request |
| Metadata enrichment | Separate preview/meta request | `extract.meta_tags` and `extract.link_preview` |
| Site QA | Use another SEO tool | built-in `/audit`, `/pagespeed`, `/broken-links` |
| Multi-page discovery | Build custom crawler | `POST /v1/scrape/crawl` |
Scraping still depends on the target site's accessibility, rate limits, and bot protections. Respect legal and platform policies, and prefer the crawl and robots endpoints for site-aware automation.