# Web Scraping Toolkit
The Web Scraping Toolkit gives you a unified scraping API with richer structured outputs. Use it to fetch HTML, Markdown, or plain text, render JavaScript when needed, extract metadata and selectors, and run SEO or crawl workflows without maintaining your own browser infrastructure.
It is especially useful for teams who want the convenience of a “single scrape endpoint” model but also want better downstream ergonomics for LLM ingestion, data extraction, and content QA.
## Base URL

```
https://scrape.toolkitapi.io
```
## Start here: the unified scrape endpoint

For most jobs, start with:

```
POST /v1/scrape
```

This one endpoint supports:

- `output`: `html`, `markdown`, `text`, or `clean`
- `render_js` for client-side rendered pages
- `wait_for`, `wait_until`, and `scroll` for timing control
- `extract.links`, `extract.images`, `extract.meta_tags`, `extract.structured_data`, `extract.headers`, and `extract.link_preview`
- `extract.selectors` for CSS-selector-based structured extraction
- `ai_extract` for schema-driven AI extraction
- `chunk` for LLM/RAG chunking
- `proxy`, `headers`, `cookies`, and `session_name` for more advanced networking flows (see the sketch after this list)
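Since the networking options do not appear in the examples further down, here is a minimal sketch. The exact value shapes, such as the `cookies` structure and the `proxy` string format, are assumptions to verify against the networking subpage:

```bash
# Sketch only: the header/cookie/session value shapes below are
# assumptions, not confirmed by these docs.
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://example.com/account",
    "headers": { "Accept-Language": "en-US" },
    "cookies": { "session": "abc123" },
    "session_name": "login-flow",
    "output": "html"
  }'
```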
## Browse by topic

To make the examples easier to work through, the scrape docs are split into focused subpages.
## First example
```bash
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://www.python.org",
    "output": "markdown",
    "extract": {
      "meta_tags": true,
      "link_preview": true,
      "selectors": {
        "headline": "h1"
      }
    }
  }'
```
The response returns the converted content alongside each extracted field:

```json
{
  "url": "https://www.python.org/",
  "status_code": 200,
  "content_type": "text/html; charset=utf-8",
  "js_rendered": false,
  "content": "# Welcome to Python.org\n...",
  "output_format": "markdown",
  "word_count": 220,
  "char_count": 3133,
  "meta_tags": {
    "title": "Welcome to Python.org"
  },
  "selectors": {
    "headline": "Welcome to Python.org"
  }
}
```
## Output formats

| Output | Use it when you need | Result |
|---|---|---|
| `html` | raw parsing or archival | HTML content |
| `markdown` | LLM/RAG ingestion | clean Markdown |
| `text` | search, classification, NLP | plain text |
| `clean` | article-like readable extraction | noise-reduced content |
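The only change between formats is the `output` field; the rest of the request stays the same. For example, to get a reader-style version of a page:

```bash
# Same endpoint as the first example; only "output" differs.
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{ "url": "https://www.python.org", "output": "clean" }'
```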
## Common scraping patterns

### 1. LLM-ready content

Use Markdown or clean text when the response will be embedded, chunked, or summarized.

```json
{
  "url": "https://toolkitapi.io/docs",
  "output": "markdown",
  "chunk": { "enabled": true, "method": "heading", "max_tokens": 500 }
}
```
### 2. Structured extraction with selectors

This is the closest equivalent to ScrapingBee-style `extract_rules`:

```json
{
  "url": "https://toolkitapi.io/product/123",
  "extract": {
    "selectors": {
      "title": "h1",
      "price": ".price",
      "images": { "selector": ".gallery img", "attr": "src", "multiple": true }
    }
  }
}
```
### 3. JavaScript-heavy pages

```json
{
  "url": "https://quotes.toscrape.com/js/",
  "render_js": true,
  "wait_until": "networkidle",
  "block_resources": ["image", "font"],
  "stealth": true,
  "output": "text"
}
```
### 4. SEO and monitoring workflows
You can audit a single URL, compare multiple pages, or check keyword density and page speed without wiring up a second service.
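For instance, a single-page audit is one GET request. Passing the target page as a `url` query parameter is an assumption here; check the endpoint reference for the exact signature:

```bash
# Assumption: the page under audit is passed as a "url" query parameter.
curl -G "https://scrape.toolkitapi.io/v1/scrape/audit" \
  -H "X-API-Key: YOUR_KEY" \
  --data-urlencode "url=https://www.python.org"
```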
## Specialised endpoints

| Endpoint | Purpose |
|---|---|
| `GET /v1/scrape/audit` | Full SEO audit for one page |
| `GET /v1/scrape/keyword-density` | Readability and top-keyword analysis |
| `GET /v1/scrape/mobile-friendly` | Mobile-readiness checks |
| `GET /v1/scrape/broken-links` | Link health and status validation |
| `GET /v1/scrape/pagespeed` | TTFB, compression, and response size |
| `POST /v1/scrape/bulk-audit` | SEO audit for multiple URLs |
| `POST /v1/scrape/compare` | Side-by-side page comparison |
| `POST /v1/scrape/pdf` | Extract text from remote PDFs |
| `GET /v1/scrape/sitemap` | Parse sitemap files or sitemap indexes |
| `GET /v1/scrape/robots` | Parse robots.txt rules |
| `POST /v1/scrape/crawl` | Start an async same-domain crawl |
| `GET /v1/scrape/crawl/{job_id}` | Poll crawl job status and progress |
| `POST /v1/screenshot` | Capture a full-page screenshot |
| `POST /v1/screenshot/element` | Capture a specific element screenshot |
| `POST /v1/screenshot/pdf` | Capture a page as PDF |
| `GET /v1/screenshot/download/{object_name}` | Download a previously generated screenshot artifact |
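The crawl endpoints are asynchronous: you start a job, then poll it by ID. The flow below is a sketch; body fields other than `url` (such as `max_pages`) are assumptions to verify against the crawl subpage:

```bash
# 1. Start an async same-domain crawl.
#    "max_pages" is an assumed option, not confirmed by these docs.
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/crawl" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{ "url": "https://toolkitapi.io/docs", "max_pages": 50 }'

# 2. Poll status and progress using the job_id returned by the first call.
curl "https://scrape.toolkitapi.io/v1/scrape/crawl/JOB_ID" \
  -H "X-API-Key: YOUR_KEY"
```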
## If you are migrating from ScrapingBee

Here is the rough feature map:

| ScrapingBee concept | Toolkit equivalent |
|---|---|
| `render_js=true` | `render_js: true` |
| `return_page_markdown=true` | `output: "markdown"` |
| `return_page_text=true` | `output: "text"` |
| `extract_rules` | `extract.selectors` |
| `wait_for` | `wait_for` |
| `wait_browser` | `wait_until` |
| `block_resources` | `block_resources` |
| `premium_proxy` / `country_code` | `proxy` |
| AI extraction rules | `ai_extract.schema` + optional `prompt` |
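Concretely, a ScrapingBee-style call built from query parameters becomes a single JSON body. The "before" line is abbreviated and illustrative rather than an exact ScrapingBee request; the "after" request maps each parameter per the table above:

```bash
# Before (ScrapingBee-style, abbreviated and illustrative):
#   GET https://app.scrapingbee.com/api/v1/?api_key=KEY&url=...&render_js=true&wait_browser=networkidle
# After (Toolkit equivalent, per the mapping above):
curl -X POST "https://scrape.toolkitapi.io/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://quotes.toscrape.com/js/",
    "render_js": true,
    "wait_until": "networkidle",
    "output": "markdown"
  }'
```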
See the full migration guide for the complete parameter reference.
## Why teams use this toolkit
- One request instead of many — scrape content and metadata in a single call
- Cleaner LLM input — Markdown, text, chunking, and structured output are first-class features
- SEO-ready — audit, compare, and performance endpoints are built in
- Crawl support — move from single-page fetches to async multi-page jobs when needed
## Comparison mindset
If you are evaluating this against more general scraping providers, the main value is not just “can it fetch the page?” — it is how much useful structure you get back immediately.
| Workflow need | Typical generic scrape flow | Toolkit API flow |
|---|---|---|
| LLM-ready article extraction | Fetch HTML → clean it → convert it | `POST /v1/scrape` with `output: "markdown"` |
| CSS-based field extraction | Fetch HTML → run selector parser | `extract.selectors` in the same request |
| Metadata enrichment | Separate preview/meta request | `extract.meta_tags` and `extract.link_preview` |
| Site QA | Use another SEO tool | built-in `/audit`, `/pagespeed`, `/broken-links` |
| Multi-page discovery | Build custom crawler | `POST /v1/scrape/crawl` |
Scraping still depends on the target site's accessibility, rate limits, and bot protections. Respect legal and platform policies, and prefer the crawl and robots endpoints for site-aware automation.