Crawl and Site Discovery

Site-level scraping helpers: sitemap discovery, robots.txt parsing, asynchronous crawl jobs, and PDF text extraction.

Parse a sitemap

cURL
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/sitemap" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io/sitemap.xml", "limit": 100, "discover_links": true}'

Python
import requests

resp = requests.post(
    "https://scrape.toolkitapi.io/v1/scrape/sitemap",
    headers={"X-API-Key": "YOUR_KEY"},
    json={"url": "https://toolkitapi.io/sitemap.xml", "limit": 100, "discover_links": True},
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
resp.raise_for_status()  # surface 4xx/5xx errors before parsing the body
data = resp.json()
print(f"Found {data['url_count']} URLs")
print(data["urls"][:5])

JavaScript
const resp = await fetch("https://scrape.toolkitapi.io/v1/scrape/sitemap", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://toolkitapi.io/sitemap.xml", limit: 100, discover_links: true }),
});
const data = await resp.json();
console.log(`${data.url_count} URLs found`);
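
Once the URL list is back, a typical next step is narrowing it to the section you plan to crawl. The following is a minimal sketch, assuming the `urls` field is a list of absolute URL strings (the examples above print it but do not show its exact shape). `filter_urls_by_prefix` is a hypothetical helper, not part of the API.

```python
from urllib.parse import urlparse


def filter_urls_by_prefix(urls, path_prefix):
    """Keep URLs whose path starts with path_prefix, de-duplicated in order."""
    seen = set()
    kept = []
    for url in urls:
        path = urlparse(url).path or "/"
        if path.startswith(path_prefix) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Sample data shaped like the `urls` field (assumed shape, not real output).
sample_urls = [
    "https://toolkitapi.io/",
    "https://toolkitapi.io/docs/",
    "https://toolkitapi.io/docs/toolkits/dns/record-lookups/",
    "https://toolkitapi.io/docs/",
    "https://toolkitapi.io/pricing",
]
docs_urls = filter_urls_by_prefix(sample_urls, "/docs")
print(docs_urls)
```

The match is a plain string prefix on the URL path, so `/docs` would also match a path such as `/docs-archive`; pass `/docs/` if you need a stricter boundary.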

Parse robots.txt

cURL
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/robots" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io"}'

Python
import requests

resp = requests.post(
    "https://scrape.toolkitapi.io/v1/scrape/robots",
    headers={"X-API-Key": "YOUR_KEY"},
    json={"url": "https://toolkitapi.io"},
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
resp.raise_for_status()  # surface 4xx/5xx errors before parsing the body
data = resp.json()
print(f"Agents: {data['agents']}")
print(f"Sitemaps: {data['sitemaps']}")
# Each rule groups the allow/disallow entries for one user agent.
for rule in data.get("rules", []):
    print(f"  {rule['agent']}: allow={rule.get('allow')}, disallow={rule.get('disallow')}")

JavaScript
const resp = await fetch("https://scrape.toolkitapi.io/v1/scrape/robots", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://toolkitapi.io" }),
});
const data = await resp.json();
console.log(data.agents, data.sitemaps);
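
To decide locally whether a given path is crawlable, the parsed rules can be evaluated on the client. The sketch below assumes each entry in `rules` carries `agent`, `allow`, and `disallow`, with the latter two being lists of path prefixes; the examples above show the keys but not the value types, so treat that shape as an assumption. It applies the longest-match precedence commonly used for robots.txt and does not handle `*` or `$` wildcards. `is_path_allowed` is a hypothetical helper.

```python
def is_path_allowed(rules, agent, path):
    """Return True if `path` may be fetched by `agent` under `rules`.

    Longest matching prefix wins; on a tie, allow wins.
    """
    # Prefer groups that name the agent; otherwise fall back to the "*" group.
    groups = [r for r in rules if (r.get("agent") or "").lower() == agent.lower()]
    if not groups:
        groups = [r for r in rules if r.get("agent") == "*"]

    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for group in groups:
        for prefix in group.get("allow") or []:
            if path.startswith(prefix) and len(prefix) >= best_len:
                best_len, allowed = len(prefix), True
        for prefix in group.get("disallow") or []:
            # An empty Disallow value means "allow everything", so skip it.
            if prefix and path.startswith(prefix) and len(prefix) > best_len:
                best_len, allowed = len(prefix), False
    return allowed


# Sample rules in the assumed response shape (not real output).
sample_rules = [
    {"agent": "*", "allow": ["/admin/public/"], "disallow": ["/admin/"]},
    {"agent": "BadBot", "allow": [], "disallow": ["/"]},
]
print(is_path_allowed(sample_rules, "MyCrawler", "/docs/"))
print(is_path_allowed(sample_rules, "MyCrawler", "/admin/settings"))
```

Setting `respect_robots: true` on a crawl job leaves enforcement to the service; a local check like this is mainly useful for filtering a URL list before you submit it.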

Start a crawl job

cURL
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/crawl" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://toolkitapi.io/docs",
    "max_pages": 20,
    "max_depth": 2,
    "output": "markdown",
    "clean": true,
    "respect_robots": true
  }'

Python
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(
        start_url="https://toolkitapi.io/docs",
        max_pages=20,
        max_depth=2,
        output="markdown",
        clean=True,
        respect_robots=True,
    )
    print(f"Job ID: {job['job_id']}")

JavaScript
const resp = await fetch("https://scrape.toolkitapi.io/v1/scrape/crawl", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://toolkitapi.io/docs",
    max_pages: 20,
    max_depth: 2,
    output: "markdown",
    clean: true,
    respect_robots: true,
  }),
});
const data = await resp.json();
console.log(`Job: ${data.job_id} (${data.status})`);

Response
{
  "job_id": "crawl_abc123",
  "status": "queued",
  "estimated_pages": 20
}

Poll crawl status

cURL
curl "https://scrape.toolkitapi.io/v1/scrape/crawl/crawl_abc123" \
  -H "X-API-Key: YOUR_KEY"

Python
from toolkitapi import Scrape
import time

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(start_url="https://toolkitapi.io/docs")

    while True:
        result = scrape.get_crawl_job(job["job_id"])
        print(f"Status: {result['status']}, Pages: {result['pages_crawled']}")
        # Stop once the job reaches a terminal state.
        if result["status"] in {"completed", "failed"}:
            break
        time.sleep(2)  # wait before the next poll

JavaScript
const jobId = "crawl_abc123";
const resp = await fetch(`https://scrape.toolkitapi.io/v1/scrape/crawl/${jobId}`, {
  headers: { "X-API-Key": "YOUR_KEY" },
});
const data = await resp.json();
console.log(`${data.status}: ${data.pages_crawled} pages`);

Response
{
  "job_id": "crawl_abc123",
  "status": "running",
  "pages_crawled": 8,
  "pages_total": 20,
  "current_url": "https://toolkitapi.io/docs/toolkits/dns/record-lookups/"
}
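
A bare `while True` loop polls forever if a job stalls. The sketch below wraps polling in a helper with an overall timeout and a gentle backoff. `wait_for_job` and its parameters are hypothetical; `fetch_status` stands in for whatever call returns the status JSON shown above (for example the SDK's `get_crawl_job`), and the backoff factor and limits are arbitrary illustrative values.

```python
import time


def wait_for_job(fetch_status, job_id, timeout=300.0, interval=2.0,
                 max_interval=10.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(job_id) until the job finishes or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        result = fetch_status(job_id)
        # "completed" and "failed" are the terminal states used in the docs above.
        if result["status"] in {"completed", "failed"}:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"Job {job_id} still {result['status']} after {timeout}s"
            )
        sleep(interval)
        # Back off gradually so long jobs are polled less often.
        interval = min(interval * 1.5, max_interval)


# Simulated status responses so the example runs without network access.
_responses = iter([
    {"job_id": "crawl_abc123", "status": "queued", "pages_crawled": 0},
    {"job_id": "crawl_abc123", "status": "running", "pages_crawled": 8},
    {"job_id": "crawl_abc123", "status": "completed", "pages_crawled": 20},
])
final = wait_for_job(lambda job_id: next(_responses), "crawl_abc123",
                     sleep=lambda seconds: None)
print(final["status"], final["pages_crawled"])  # prints: completed 20
```

Injecting `sleep` and `clock` keeps the helper easy to test; in real use you would leave both at their defaults.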

Extract text from PDFs

cURL
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/pdf" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://toolkitapi.io/report.pdf", "pages": "1-3"}'

Python
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    pdf = scrape.pdf_extract(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(pdf["metadata"])
    print(pdf["text"][:1000])

Common parameters

POST /v1/scrape/crawl

Parameter       Type      Required  Description
url             string    Yes       Starting URL for the crawl
max_pages       integer   No        Max pages to crawl. Default: 25
max_depth       integer   No        Max link depth from start. Default: 2
output          string    No        One of html, markdown, text, clean. Default: markdown
clean           boolean   No        Extract clean article content. Default: false
respect_robots  boolean   No        Honor robots.txt rules. Default: true
include_paths   string[]  No        URL patterns to include
exclude_paths   string[]  No        URL patterns to exclude
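
To keep request bodies consistent with the crawl parameters listed above, it can help to build them in one place. The sketch below mirrors the documented names and defaults. `build_crawl_payload` is a hypothetical helper, the numeric checks are client-side sanity checks rather than documented API limits, and the pattern strings in the example are illustrative only, since the pattern syntax for `include_paths` and `exclude_paths` is not specified here.

```python
ALLOWED_OUTPUTS = {"html", "markdown", "text", "clean"}


def build_crawl_payload(url, *, max_pages=25, max_depth=2, output="markdown",
                        clean=False, respect_robots=True,
                        include_paths=None, exclude_paths=None):
    """Build a JSON-serializable body for POST /v1/scrape/crawl."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    if output not in ALLOWED_OUTPUTS:
        raise ValueError(f"output must be one of {sorted(ALLOWED_OUTPUTS)}")
    # Sanity checks only; the API's real limits are an unknown here.
    if max_pages < 1 or max_depth < 0:
        raise ValueError("max_pages must be >= 1 and max_depth >= 0")

    payload = {
        "url": url,
        "max_pages": max_pages,
        "max_depth": max_depth,
        "output": output,
        "clean": clean,
        "respect_robots": respect_robots,
    }
    # Send the optional filters only when the caller supplied them.
    if include_paths:
        payload["include_paths"] = list(include_paths)
    if exclude_paths:
        payload["exclude_paths"] = list(exclude_paths)
    return payload


# Pattern strings are assumptions about the syntax, shown for illustration.
payload = build_crawl_payload(
    "https://toolkitapi.io/docs",
    max_pages=20,
    include_paths=["/docs/*"],
    exclude_paths=["/docs/changelog/*"],
)
print(payload)
```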

POST /v1/scrape/sitemap

Parameter       Type     Required  Description
url             string   Yes       Sitemap URL (or domain to auto-discover)
limit           integer  No        Max URLs to return. Default: 100
discover_links  boolean  No        Follow sitemap index entries. Default: false

When to use each tool

Need                                 Endpoint
Discover known site URLs             POST /v1/scrape/sitemap
Inspect crawl policy                 POST /v1/scrape/robots
Extract many pages on the same site  POST /v1/scrape/crawl
Process published documents          POST /v1/scrape/pdf

Practical workflow

A common site-level workflow:

  1. Parse the sitemap to discover known URLs.
  2. Check robots.txt rules to respect the site's crawl policy.
  3. Start a crawl for the sections you care about, with respect_robots: true.
  4. Poll the job status every 2-3 seconds until it reports completed or failed.
  5. Process the returned markdown/text for indexing, search, or AI pipelines.
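
The steps above can be sketched as a single function. This is an illustration under stated assumptions, not a reference implementation: `run_site_workflow` is hypothetical, `post(path, payload)` and `get(path)` are caller-supplied callables that return parsed JSON, and the sitemap location `{site}/sitemap.xml` is assumed. The example does not guess at the fields that hold the crawled content, so it returns the final job status for downstream processing.

```python
import time


def run_site_workflow(post, get, site, start_url,
                      poll_seconds=2.0, max_polls=150, sleep=time.sleep):
    """Run sitemap -> robots -> crawl -> poll and return a summary dict."""
    # 1. Discover known URLs (sitemap path is an assumption).
    sitemap = post("/v1/scrape/sitemap",
                   {"url": f"{site}/sitemap.xml", "limit": 100})
    # 2. Inspect the site's crawl policy.
    robots = post("/v1/scrape/robots", {"url": site})
    # 3. Start the crawl with robots.txt enforcement on.
    job = post("/v1/scrape/crawl",
               {"url": start_url, "output": "markdown", "respect_robots": True})
    # 4. Poll until the job reaches a terminal state or we give up.
    for _ in range(max_polls):
        status = get(f"/v1/scrape/crawl/{job['job_id']}")
        if status["status"] in {"completed", "failed"}:
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job['job_id']} did not finish in time")
    # 5. Hand the results back for indexing, search, or other processing.
    return {
        "url_count": sitemap.get("url_count"),
        "sitemaps": robots.get("sitemaps"),
        "job": status,
    }


# One way to wire real HTTP calls with the requests library:
#   BASE = "https://scrape.toolkitapi.io"
#   HEADERS = {"X-API-Key": "YOUR_KEY"}
#   post = lambda path, body: requests.post(
#       BASE + path, headers=HEADERS, json=body, timeout=30).json()
#   get = lambda path: requests.get(
#       BASE + path, headers=HEADERS, timeout=30).json()
```

Passing the transport in as callables keeps the workflow independent of any one HTTP client and makes it possible to exercise the logic with canned responses.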