Crawl and Site Discovery¶
Site-level scraping helpers: sitemap discovery, robots.txt parsing, asynchronous crawl jobs, and PDF text extraction.
Parse a sitemap¶
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/sitemap" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io/sitemap.xml", "limit": 100, "discover_links": true}'
import requests

resp = requests.post(
    "https://scrape.toolkitapi.io/v1/scrape/sitemap",
    headers={"X-API-Key": "YOUR_KEY"},
    json={"url": "https://toolkitapi.io/sitemap.xml", "limit": 100, "discover_links": True},
)
resp.raise_for_status()  # fail fast on HTTP errors before parsing the body
data = resp.json()
print(f"Found {data['url_count']} URLs")
print(data["urls"][:5])
const resp = await fetch("https://scrape.toolkitapi.io/v1/scrape/sitemap", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://toolkitapi.io/sitemap.xml", limit: 100, discover_links: true }),
});
const data = await resp.json();
console.log(`${data.url_count} URLs found`);
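Once the sitemap response is back, it can help to see which parts of the site the URLs fall under before deciding what to crawl. The sketch below is one way to do that, assuming `urls` is a list of plain URL strings (the examples above print it but do not show its shape). `count_by_section` is a hypothetical helper, and the `/pricing` URL in the sample data is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(urls):
    """Count URLs per first path segment ('' stands for the site root)."""
    counts = Counter()
    for url in urls:
        # Keep only non-empty path segments so trailing slashes do not matter.
        segments = [s for s in urlparse(url).path.split("/") if s]
        counts[segments[0] if segments else ""] += 1
    return counts


# Sample data standing in for data["urls"] (assumed shape: list of strings).
sample_urls = [
    "https://toolkitapi.io/",
    "https://toolkitapi.io/docs/",
    "https://toolkitapi.io/docs/toolkits/dns/record-lookups/",
    "https://toolkitapi.io/pricing",  # hypothetical URL
]

sections = count_by_section(sample_urls)
for section, count in sections.most_common():
    print(f"/{section}: {count}")
```

A section with many URLs is a natural candidate for a crawl `url`, while a small one may be cheaper to fetch page by page.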
Parse robots.txt¶
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/robots" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io"}'
import requests

resp = requests.post(
    "https://scrape.toolkitapi.io/v1/scrape/robots",
    headers={"X-API-Key": "YOUR_KEY"},
    json={"url": "https://toolkitapi.io"},
)
resp.raise_for_status()  # fail fast on HTTP errors before parsing the body
data = resp.json()
print(f"Agents: {data['agents']}")
print(f"Sitemaps: {data['sitemaps']}")
# Print the allow/disallow rules for each user agent
for rule in data.get("rules", []):
    print(f"  {rule['agent']}: allow={rule.get('allow')}, disallow={rule.get('disallow')}")
const resp = await fetch("https://scrape.toolkitapi.io/v1/scrape/robots", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({ url: "https://toolkitapi.io" }),
});
const data = await resp.json();
console.log(data.agents, data.sitemaps);
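To turn the parsed rules into a yes/no answer for a given path, something like the following could work. It assumes each entry in `rules` carries an `agent` string plus `allow` and `disallow` lists of path prefixes; the response shape is not documented in this section, so treat that as a guess. `is_allowed` is a hypothetical helper that applies longest-prefix matching (ties go to allow) and deliberately ignores wildcards such as `*` and `$`.

```python
def is_allowed(rules, path, agent="*"):
    """Return True if `path` may be fetched by `agent` under `rules`.

    Assumed rule shape: {"agent": str, "allow": [prefix, ...], "disallow": [prefix, ...]}.
    """
    def rules_for(name):
        return [r for r in rules if r.get("agent", "").lower() == name.lower()]

    # Prefer rules written for this agent; otherwise fall back to the "*" group.
    group = rules_for(agent) or rules_for("*")

    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for rule in group:
        for verdict, key in ((True, "allow"), (False, "disallow")):
            for prefix in rule.get(key) or []:
                if not prefix or not path.startswith(prefix):
                    continue  # an empty prefix matches nothing
                # The longest matching prefix wins; on a tie, allow wins.
                if len(prefix) > best_len or (len(prefix) == best_len and verdict):
                    best_len, allowed = len(prefix), verdict
    return allowed


# Made-up rules in the assumed shape, for illustration only.
example_rules = [
    {"agent": "*", "allow": ["/docs/public"], "disallow": ["/docs", "/admin"]},
]

for candidate in ("/docs/public/page", "/docs/internal", "/blog"):
    print(candidate, is_allowed(example_rules, candidate))
```

This is only a local pre-check. Setting `respect_robots: true` on the crawl itself remains the simpler way to honor the site's policy.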
Start a crawl job¶
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/crawl" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
  "url": "https://toolkitapi.io/docs",
  "max_pages": 20,
  "max_depth": 2,
  "output": "markdown",
  "clean": true,
  "respect_robots": true
}'
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    # Queue the crawl; the call returns immediately with a job ID
    job = scrape.crawl(
        start_url="https://toolkitapi.io/docs",
        max_pages=20,
        max_depth=2,
        output="markdown",
        clean=True,
        respect_robots=True,
    )
    print(f"Job ID: {job['job_id']}")
const resp = await fetch("https://scrape.toolkitapi.io/v1/scrape/crawl", {
  method: "POST",
  headers: { "X-API-Key": "YOUR_KEY", "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://toolkitapi.io/docs",
    max_pages: 20,
    max_depth: 2,
    output: "markdown",
    clean: true,
    respect_robots: true,
  }),
});
const data = await resp.json();
console.log(`Job: ${data.job_id} (${data.status})`);
Response
{
  "job_id": "crawl_abc123",
  "status": "queued",
  "estimated_pages": 20
}
Poll crawl status¶
curl "https://scrape.toolkitapi.io/v1/scrape/crawl/crawl_abc123" \
-H "X-API-Key: YOUR_KEY"
import time

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(start_url="https://toolkitapi.io/docs")
    # Poll until the job reaches a terminal state
    while True:
        result = scrape.get_crawl_job(job["job_id"])
        print(f"Status: {result['status']}, Pages: {result['pages_crawled']}")
        if result["status"] in {"completed", "failed"}:
            break
        time.sleep(2)
const jobId = "crawl_abc123";
const resp = await fetch(`https://scrape.toolkitapi.io/v1/scrape/crawl/${jobId}`, {
  headers: { "X-API-Key": "YOUR_KEY" },
});
const data = await resp.json();
console.log(`${data.status}: ${data.pages_crawled} pages`);
Response
{
  "job_id": "crawl_abc123",
  "status": "running",
  "pages_crawled": 8,
  "pages_total": 20,
  "current_url": "https://toolkitapi.io/docs/toolkits/dns/record-lookups/"
}
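The polling loop above runs until the job finishes, however long that takes. A bounded variant might look like the sketch below. `wait_for_crawl` and `progress` are hypothetical helpers, not part of the SDK; `fetch_status` is any callable that returns the status payload for a job ID (for example `scrape.get_crawl_job`), and the terminal states are taken from the Python example above. The demo uses canned responses so that it runs without network access.

```python
import time

TERMINAL_STATES = {"completed", "failed"}


def wait_for_crawl(fetch_status, job_id, interval=2.0, timeout=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status(job_id)` until a terminal state or until `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        result = fetch_status(job_id)
        if result["status"] in TERMINAL_STATES:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"crawl job {job_id} still {result['status']} after {timeout}s")
        sleep(interval)


def progress(result):
    """Fraction of pages crawled so far (0.0 when the total is unknown)."""
    total = result.get("pages_total") or 0
    return result.get("pages_crawled", 0) / total if total else 0.0


# Canned responses standing in for the API (values are illustrative).
_responses = iter([
    {"job_id": "crawl_abc123", "status": "queued", "pages_crawled": 0, "pages_total": 20},
    {"job_id": "crawl_abc123", "status": "running", "pages_crawled": 8, "pages_total": 20},
    {"job_id": "crawl_abc123", "status": "completed", "pages_crawled": 20, "pages_total": 20},
])

final = wait_for_crawl(lambda job_id: next(_responses), "crawl_abc123",
                       sleep=lambda seconds: None)  # skip real sleeping in the demo
print(final["status"], f"{progress(final):.0%}")
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test; in real use the defaults apply and only `fetch_status` and `job_id` are needed.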
Extract PDF text¶
curl -X POST "https://scrape.toolkitapi.io/v1/scrape/pdf" \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://toolkitapi.io/report.pdf", "pages": "1-3"}'
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    # Extract text from the first three pages only
    pdf = scrape.pdf_extract(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(pdf["metadata"])
    print(pdf["text"][:1000])
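Extracted PDF text is often too long to index or embed in one piece. One common approach is to split it into overlapping chunks first; the sketch below assumes `pdf["text"]` is a single string, as the example above suggests. `chunk_text` is a hypothetical helper, and the default sizes are arbitrary.

```python
def chunk_text(text, size=1000, overlap=100):
    """Split `text` into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of them.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    chunks = []
    start = 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# Stand-in for pdf["text"]; real input would come from the extraction call.
sample_text = "a" * 2500
chunks = chunk_text(sample_text)
print([len(chunk) for chunk in chunks])
```

Character-based splitting is the simplest option; splitting on paragraphs or sentences usually gives cleaner chunks when the text structure allows it.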
Common parameters¶
POST /v1/scrape/crawl¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Starting URL for the crawl |
| max_pages | integer | No | Max pages to crawl. Default: 25 |
| max_depth | integer | No | Max link depth from start. Default: 2 |
| output | string | No | html, markdown, text, clean. Default: markdown |
| clean | boolean | No | Extract clean article content. Default: false |
| respect_robots | boolean | No | Honor robots.txt rules. Default: true |
| include_paths | string[] | No | URL patterns to include |
| exclude_paths | string[] | No | URL patterns to exclude |
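The parameters above can be assembled and sanity-checked locally before a request is sent. The sketch below is one way to do it: `build_crawl_payload` is a hypothetical helper, the defaults mirror the table, and the range checks are local assumptions rather than documented API limits. The pattern syntax for `include_paths` and `exclude_paths` is not specified here, so the value in the demo is only a placeholder.

```python
import json

ALLOWED_OUTPUTS = {"html", "markdown", "text", "clean"}


def build_crawl_payload(url, max_pages=25, max_depth=2, output="markdown",
                        clean=False, respect_robots=True,
                        include_paths=None, exclude_paths=None):
    """Build the JSON body for POST /v1/scrape/crawl, using the documented defaults."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    if output not in ALLOWED_OUTPUTS:
        raise ValueError(f"output must be one of {sorted(ALLOWED_OUTPUTS)}")
    # Local sanity checks only; the API may enforce different limits.
    if max_pages < 1 or max_depth < 0:
        raise ValueError("max_pages must be >= 1 and max_depth >= 0")

    payload = {
        "url": url,
        "max_pages": max_pages,
        "max_depth": max_depth,
        "output": output,
        "clean": clean,
        "respect_robots": respect_robots,
    }
    # Optional filters are sent only when provided.
    if include_paths:
        payload["include_paths"] = list(include_paths)
    if exclude_paths:
        payload["exclude_paths"] = list(exclude_paths)
    return payload


payload = build_crawl_payload(
    "https://toolkitapi.io/docs",
    max_pages=20,
    include_paths=["/docs/"],  # placeholder pattern; the syntax is an assumption
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, in the same way as the sitemap and robots.txt examples.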
POST /v1/scrape/sitemap¶
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Sitemap URL (or domain to auto-discover) |
| limit | integer | No | Max URLs to return. Default: 100 |
| discover_links | boolean | No | Follow sitemap index entries. Default: false |

| Need | Endpoint |
|---|---|
| Discover known site URLs | POST /v1/scrape/sitemap |
| Inspect crawl policy | POST /v1/scrape/robots |
| Extract many pages on the same site | POST /v1/scrape/crawl |
| Process published documents | POST /v1/scrape/pdf |
Practical workflow¶
A common site-level workflow:
- Parse the sitemap to discover known URLs.
- Check robots.txt rules to respect the site's crawl policy.
- Start a crawl for the sections you care about, with respect_robots: true.
- Poll the job status every 2-3 seconds until it reaches completed or failed.
- Process the returned markdown/text for indexing, search, or AI pipelines.
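The steps above can be strung together roughly as follows. This is a sketch under stated assumptions: `run_site_workflow` is a hypothetical function, and `post` and `get` are caller-supplied transport callables (for example thin wrappers around `requests.post` and `requests.get` that add the `X-API-Key` header and return parsed JSON). How the pages of a completed job are delivered is not shown in this section, so the sketch stops at the final job status. The demo wires in fake transports with made-up responses so that it runs offline.

```python
import time

BASE = "https://scrape.toolkitapi.io/v1/scrape"


def run_site_workflow(post, get, site, start_url, interval=2.0,
                      max_polls=150, sleep=time.sleep):
    """Sitemap -> robots.txt -> crawl -> poll; returns a small summary dict."""
    # 1. Discover known URLs.
    sitemap = post(f"{BASE}/sitemap", {"url": site, "limit": 100})
    # 2. Inspect the crawl policy.
    robots = post(f"{BASE}/robots", {"url": site})
    # 3. Start the crawl, letting the service honor robots.txt.
    job = post(f"{BASE}/crawl", {"url": start_url, "output": "markdown",
                                 "respect_robots": True})
    # 4. Poll until the job reaches a terminal state, with an upper bound.
    for _ in range(max_polls):
        status = get(f"{BASE}/crawl/{job['job_id']}")
        if status["status"] in {"completed", "failed"}:
            break
        sleep(interval)
    else:
        raise TimeoutError(f"job {job['job_id']} did not finish in {max_polls} polls")
    return {"known_urls": sitemap.get("url_count"),
            "sitemaps": robots.get("sitemaps"),
            "job": status}


# --- Fake transports with made-up responses, for an offline demo ---
def fake_post(url, payload):
    if url.endswith("/sitemap"):
        return {"url_count": 2, "urls": ["https://toolkitapi.io/", "https://toolkitapi.io/docs/"]}
    if url.endswith("/robots"):
        return {"agents": ["*"], "sitemaps": ["https://toolkitapi.io/sitemap.xml"], "rules": []}
    if url.endswith("/crawl"):
        return {"job_id": "crawl_abc123", "status": "queued", "estimated_pages": 20}
    raise ValueError(f"unexpected URL: {url}")


_states = iter(["running", "completed"])


def fake_get(url):
    return {"job_id": "crawl_abc123", "status": next(_states),
            "pages_crawled": 20, "pages_total": 20}


summary = run_site_workflow(fake_post, fake_get, "https://toolkitapi.io",
                            "https://toolkitapi.io/docs", sleep=lambda seconds: None)
print(summary["known_urls"], summary["job"]["status"])
```

Keeping the transport separate from the workflow makes the sequence easy to test and lets the same function run against either raw HTTP calls or the SDK.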