# Crawl and Site Discovery
When you need more than a single page, the scrape toolkit includes helpers for sitemap discovery, robots.txt parsing, asynchronous crawl jobs, and PDF text extraction.
## Parse a sitemap

```python
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    sitemap = scrape.parse_sitemap(
        "https://toolkitapi.io/sitemap.xml",
        limit=100,
        discover_links=True,
    )
    print(sitemap["url_count"])
    print(sitemap["urls"][:5])
```
## Parse robots rules

```python
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    robots = scrape.parse_robots_txt("https://toolkitapi.io")
    print(robots["agents"])
    print(robots["sitemaps"])
```
## Start a crawl job

```python
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(
        start_url="https://toolkitapi.io/docs",
        max_pages=20,
        max_depth=2,
        output="markdown",
        clean=True,
        respect_robots=True,
    )
    print(job["job_id"])
```
## Poll the crawl result

```python
import time

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(start_url="https://toolkitapi.io/docs")
    while True:
        result = scrape.get_crawl_job(job["job_id"])
        print(result["status"], result["pages_crawled"])
        if result["status"] in {"completed", "failed"}:
            break
        time.sleep(2)
```
## Extract text from remote PDFs

```python
from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    pdf = scrape.pdf_extract(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )
    print(pdf["metadata"])
    print(pdf["text"][:1000])
```
## When to use each tool

| Need | Recommended helper |
|---|---|
| Discover known site URLs | `parse_sitemap` |
| Inspect the crawl policy | `parse_robots_txt` |
| Extract many pages from the same site | `crawl` |
| Process published documents | `pdf_extract` |
## Practical workflow

A common site-level workflow looks like this (a combined sketch follows the list):
- Parse the sitemap
- Check robots rules
- Start a crawl for the sections you care about
- Store the returned markdown or text for indexing, search, or AI pipelines
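
Putting the steps together, a minimal sketch of that workflow might look like the following. It only uses the helpers shown above (`parse_sitemap`, `parse_robots_txt`, `crawl`, `get_crawl_job`); the final step assumes the completed job exposes its per-page output under a `pages` key, which is not documented in this section, so adjust the field names to match the actual crawl-job response.

```python
import json
import time

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    # 1. Parse the sitemap to see which URLs the site publishes.
    sitemap = scrape.parse_sitemap("https://toolkitapi.io/sitemap.xml", limit=100)
    doc_urls = [u for u in sitemap["urls"] if "/docs" in u]
    print(f"{len(doc_urls)} documentation URLs listed in the sitemap")

    # 2. Check the crawl policy before fetching anything.
    robots = scrape.parse_robots_txt("https://toolkitapi.io")
    print(robots["agents"], robots["sitemaps"])

    # 3. Crawl only the section you care about, respecting robots rules.
    job = scrape.crawl(
        start_url="https://toolkitapi.io/docs",
        max_pages=20,
        max_depth=2,
        output="markdown",
        respect_robots=True,
    )

    # Poll until the job finishes.
    while True:
        result = scrape.get_crawl_job(job["job_id"])
        if result["status"] in {"completed", "failed"}:
            break
        time.sleep(2)

    # 4. Store the returned markdown for indexing, search, or AI pipelines.
    # NOTE: the shape of the per-page results ("pages" with page content) is
    # an assumption; check the real crawl-job response for the field names.
    if result["status"] == "completed":
        with open("crawl_output.json", "w") as fh:
            json.dump(result.get("pages", []), fh)
```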