Crawl and Site Discovery

When you need more than a single page, the scrape toolkit includes helpers for sitemap discovery, robots.txt parsing, asynchronous crawl jobs, and PDF text extraction.

Parse a sitemap

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    sitemap = scrape.parse_sitemap(
        "https://toolkitapi.io/sitemap.xml",
        limit=100,
        discover_links=True,
    )

    print(sitemap["url_count"])
    print(sitemap["urls"][:5])
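
If you only need one section of the site, the discovered URLs can be filtered before any crawling. A minimal sketch, continuing from the example above and assuming each entry in `urls` is an absolute URL string:

# Keep only documentation pages from the sitemap.
docs_urls = [u for u in sitemap["urls"] if u.startswith("https://toolkitapi.io/docs")]
print(len(docs_urls), "docs URLs found")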

Parse robots rules

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    robots = scrape.parse_robots_txt("https://toolkitapi.io")
    print(robots["agents"])
    print(robots["sitemaps"])
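
Because `sitemaps` lists the sitemap URLs the site advertises, it can be fed straight back into parse_sitemap. A short sketch, assuming the entries are absolute URLs:

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    robots = scrape.parse_robots_txt("https://toolkitapi.io")

    # Parse every sitemap advertised in robots.txt.
    for sitemap_url in robots["sitemaps"]:
        sitemap = scrape.parse_sitemap(sitemap_url, limit=100)
        print(sitemap_url, sitemap["url_count"])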

Start a crawl job

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(
        start_url="https://toolkitapi.io/docs",
        max_pages=20,          # cap the number of pages fetched
        max_depth=2,           # cap link depth from the start URL
        output="markdown",
        clean=True,
        respect_robots=True,   # honor the site's robots.txt rules
    )

    print(job["job_id"])

Poll the crawl result

from toolkitapi import Scrape
import time

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(start_url="https://toolkitapi.io/docs")

    # Poll every two seconds until the job reaches a terminal status.
    while True:
        result = scrape.get_crawl_job(job["job_id"])
        print(result["status"], result["pages_crawled"])
        if result["status"] in {"completed", "failed"}:
            break
        time.sleep(2)
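
If a job stalls, the loop above never exits. One way to make unattended scripts fail fast is to add a deadline; the 60-second budget below is arbitrary, not a toolkit requirement:

from toolkitapi import Scrape
import time

with Scrape(api_key="tk_...") as scrape:
    job = scrape.crawl(start_url="https://toolkitapi.io/docs")

    deadline = time.monotonic() + 60  # arbitrary 60-second budget
    while True:
        result = scrape.get_crawl_job(job["job_id"])
        if result["status"] in {"completed", "failed"}:
            break
        if time.monotonic() > deadline:
            raise TimeoutError("crawl did not finish within 60 seconds")
        time.sleep(2)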

Extract text from remote PDFs

from toolkitapi import Scrape

with Scrape(api_key="tk_...") as scrape:
    pdf = scrape.pdf_extract(
        url="https://toolkitapi.io/report.pdf",
        pages="1-3",
    )

    print(pdf["metadata"])
    print(pdf["text"][:1000])
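
The extracted text is plain Python data, so persisting it for later indexing is a single write. Continuing from the example above:

from pathlib import Path

# Store the extracted text on disk for indexing or later processing.
Path("report.txt").write_text(pdf["text"], encoding="utf-8")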

When to use each tool

Need                                   Recommended helper
Discover known site URLs               parse_sitemap
Inspect crawl policy                   parse_robots_txt
Extract many pages on the same site    crawl
Process published documents            pdf_extract

Practical workflow

A common site-level workflow looks like this (a combined sketch follows the list):

  1. Parse the sitemap
  2. Check robots rules
  3. Start a crawl for the sections you care about
  4. Store the returned markdown or text for indexing, search, or AI pipelines
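
Put together, the workflow might look like the sketch below. Note that the per-page fields on the finished job (pages, url, markdown) are assumptions about the response shape, not something documented above; check the actual result before relying on them:

from toolkitapi import Scrape
import time

with Scrape(api_key="tk_...") as scrape:
    # 1. Discover the site's URLs and 2. confirm crawling is allowed.
    sitemap = scrape.parse_sitemap("https://toolkitapi.io/sitemap.xml", limit=100)
    robots = scrape.parse_robots_txt("https://toolkitapi.io")

    # 3. Crawl the docs section, honoring robots rules.
    job = scrape.crawl(
        start_url="https://toolkitapi.io/docs",
        max_pages=20,
        output="markdown",
        respect_robots=True,
    )

    # Wait for the job to reach a terminal status.
    while True:
        result = scrape.get_crawl_job(job["job_id"])
        if result["status"] in {"completed", "failed"}:
            break
        time.sleep(2)

    # 4. Store the markdown for indexing. The "pages" field and its
    # keys are assumed here; verify the real response shape first.
    for page in result.get("pages", []):
        print(page["url"], len(page["markdown"]))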