Web Scraping Toolkit

The Web Scraping Toolkit gives you a unified scraping API with richer structured output than a bare HTML fetch. Use it to fetch HTML, Markdown, or plain text, render JavaScript when needed, extract metadata and selectors, and run SEO or crawl workflows without maintaining your own browser infrastructure.

It is especially useful for teams who want the convenience of a “single scrape endpoint” model but also want better downstream ergonomics for LLM ingestion, data extraction, and content QA.

Base URL

https://scrape.toolkitapi.io

Start here: the unified scrape endpoint

For most jobs, start with:

POST /v1/scrape

This one endpoint supports:

  • output: html, markdown, text, or clean
  • render_js for client-side rendered pages
  • wait_for, wait_until, and scroll for timing control
  • extract.links, extract.images, extract.meta_tags, extract.structured_data, extract.headers, extract.link_preview
  • extract.selectors for CSS-selector-based structured extraction
  • ai_extract for schema-driven AI extraction (see the sketch after this list)
  • chunk for LLM/RAG chunking
  • proxy, headers, cookies, and session_name for more advanced networking flows
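
Most of these options appear in the patterns below; ai_extract and session_name do not, so here is a minimal combined sketch. The schema and prompt field names follow the migration table at the bottom of this page, but the exact schema value format and the session semantics are assumptions rather than confirmed contract:

{
  "url": "https://toolkitapi.io/product/123",
  "ai_extract": {
    "schema": { "title": "string", "price": "string" },
    "prompt": "Extract the product title and its listed price."
  },
  "session_name": "catalog-run-1"
}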

Browse by topic

To make the examples easier to work through, the scrape docs are split into focused subpages.

First example

curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://www.python.org",
    "output": "markdown",
    "extract": {
      "meta_tags": true,
      "link_preview": true,
      "selectors": {
        "headline": "h1"
      }
    }
  }'

Example response:

{
  "url": "https://www.python.org/",
  "status_code": 200,
  "content_type": "text/html; charset=utf-8",
  "js_rendered": false,
  "content": "# Welcome to Python.org\n...",
  "output_format": "markdown",
  "word_count": 220,
  "char_count": 3133,
  "meta_tags": {
    "title": "Welcome to Python.org"
  },
  "selectors": {
    "headline": "Welcome to Python.org"
  }
}

Output formats

Output     Use it when you need               Result
html       raw parsing or archival            HTML content
markdown   LLM/RAG ingestion                  clean Markdown
text       search, classification, NLP        plain text
clean      article-like readable extraction   noise-reduced content
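
For example, to request a reader-style version of a page (the URL is illustrative):

{
  "url": "https://toolkitapi.io/blog",
  "output": "clean"
}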

Common scraping patterns

1. LLM-ready content

Use Markdown or clean text when the response will be embedded, chunked, or summarized.

{
  "url": "https://toolkitapi.io/docs",
  "output": "markdown",
  "chunk": { "enabled": true, "method": "heading", "max_tokens": 500 }
}

2. Structured extraction with selectors

This is the closest equivalent to ScrapingBee-style extract_rules:

{
  "url": "https://toolkitapi.io/product/123",
  "extract": {
    "selectors": {
      "title": "h1",
      "price": ".price",
      "images": { "selector": ".gallery img", "attr": "src", "multiple": true }
    }
  }
}

3. JavaScript-heavy pages

{
  "url": "https://quotes.toscrape.com/js/",
  "render_js": true,
  "wait_until": "networkidle",
  "block_resources": ["image", "font"],
  "stealth": true,
  "output": "text"
}

4. SEO and monitoring workflows

You can audit a single URL, compare multiple pages, or check keyword density and page speed without wiring up a second service.
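
A minimal sketch of a single-page audit, assuming the endpoint takes the target page as a url query parameter:

curl -G "https://scrape.toolkitapi.io/v1/scrape/audit" \
  -H "X-API-Key: YOUR_KEY" \
  --data-urlencode "url=https://www.python.org"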

Specialised endpoints

Endpoint                                     Purpose
GET  /v1/scrape/audit                        Full SEO audit for one page
GET  /v1/scrape/keyword-density              Readability and top-keyword analysis
GET  /v1/scrape/mobile-friendly              Mobile-readiness checks
GET  /v1/scrape/broken-links                 Link health and status validation
GET  /v1/scrape/pagespeed                    TTFB, compression, and response size
POST /v1/scrape/bulk-audit                   SEO audit for multiple URLs
POST /v1/scrape/compare                      Side-by-side page comparison
POST /v1/scrape/pdf                          Extract text from remote PDFs
GET  /v1/scrape/sitemap                      Parse sitemap files or sitemap indexes
GET  /v1/scrape/robots                       Parse robots.txt rules
POST /v1/scrape/crawl                        Start an async same-domain crawl (example below)
GET  /v1/scrape/crawl/{job_id}               Poll crawl job status and progress
POST /v1/screenshot                          Capture a full-page screenshot
POST /v1/screenshot/element                  Capture a specific element screenshot
POST /v1/screenshot/pdf                      Capture a page as PDF
GET  /v1/screenshot/download/{object_name}   Download a previously generated screenshot artifact
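
A typical crawl flow pairs the two crawl endpoints. The request body beyond url and the JOB_ID placeholder are assumptions about the job contract, not confirmed fields:

# Start an async same-domain crawl
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{"url": "https://toolkitapi.io/docs"}'

# Poll progress with the job_id returned above
curl "https://scrape.toolkitapi.io/v1/scrape/crawl/JOB_ID" \
  -H "X-API-Key: YOUR_KEY"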

If you are migrating from ScrapingBee

Here is the rough feature map; a combined example follows the table:

ScrapingBee concept             Toolkit equivalent
render_js=true                  render_js: true
return_page_markdown=true       output: "markdown"
return_page_text=true           output: "text"
extract_rules                   extract.selectors
wait_for                        wait_for
wait_browser                    wait_until
block_resources                 block_resources
premium_proxy / country_code    proxy
AI extraction rules             ai_extract.schema + optional prompt
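
Put together, a ScrapingBee call that combined render_js=true, return_page_markdown=true, and extract_rules maps to a single request body like this (the selectors are illustrative):

{
  "url": "https://toolkitapi.io/product/123",
  "render_js": true,
  "output": "markdown",
  "extract": {
    "selectors": {
      "title": "h1",
      "price": ".price"
    }
  }
}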

See the full migration guide for the complete parameter mapping.

Why teams use this toolkit

  • One request instead of many — scrape content and metadata in a single call
  • Cleaner LLM input — Markdown, text, chunking, and structured output are first-class features
  • SEO-ready — audit, compare, and performance endpoints are built in
  • Crawl support — move from single-page fetches to async multi-page jobs when needed

Comparison mindset

If you are evaluating this against more general scraping providers, the main value is not just “can it fetch the page?” — it is how much useful structure you get back immediately.

Workflow need                 Typical generic scrape flow          Toolkit API flow
LLM-ready article extraction  Fetch HTML → clean it → convert it   POST /v1/scrape with output: "markdown"
CSS-based field extraction    Fetch HTML → run selector parser     extract.selectors in the same request
Metadata enrichment           Separate preview/meta request        extract.meta_tags and extract.link_preview
Site QA                       Use another SEO tool                 Built-in /audit, /pagespeed, /broken-links
Multi-page discovery          Build custom crawler                 POST /v1/scrape/crawl

Scraping still depends on the target site's accessibility, rate limits, and bot protections. Respect legal and platform policies, and prefer the crawl and robots endpoints for site-aware automation.
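
For example, a quick robots pre-flight check before starting a crawl, again assuming a url query parameter:

curl -G "https://scrape.toolkitapi.io/v1/scrape/robots" \
  -H "X-API-Key: YOUR_KEY" \
  --data-urlencode "url=https://www.python.org"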