robots.txt and Sitemap.xml: Best Practices for Crawl Control

robots.txt Fundamentals

robots.txt is a plain-text file at the root of your domain that tells web crawlers which paths they may and may not crawl. Compliance is voluntary: well-behaved crawlers honor it, but it is not an access control mechanism.

User-agent: *
Disallow: /admin/
Disallow: /internal/
Allow: /

User-agent: Googlebot
Disallow: /staging/

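Key points:

- * applies to every crawler that has no more specific group. A crawler follows only the most specific group that matches its name, so in the example above Googlebot obeys just Disallow: /staging/ and ignores the rules under *. Repeat any shared rules inside the named group if they should apply there too.
- Disallow: / blocks all crawling, which is useful for staging environments.
- The file must be served from the root of the host, for example https://yourdomain.com/robots.txt.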
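To see how group selection plays out for the sample file, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are inlined rather than fetched, and "ExampleBot" is a made-up crawler name used only to exercise the * group. One caveat: this parser applies the first matching rule within a group, whereas Google uses the longest matching path, so results can differ for files that list Allow before a more specific Disallow. The sample file is ordered so that both approaches agree.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from this section, inlined so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal/
Allow: /

User-agent: Googlebot
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if `agent` may crawl `path` under the sample rules."""
    return parser.can_fetch(agent, "https://example.com" + path)


# "ExampleBot" is a hypothetical crawler that falls back to the * group.
for agent in ("Googlebot", "ExampleBot"):
    for path in ("/admin/", "/staging/", "/blog/"):
        print(f"{agent:10} {path:10} allowed={allowed(agent, path)}")
```

Running this shows that Googlebot is allowed into /admin/ because it reads only its own group, while ExampleBot is blocked from /admin/ but free to crawl /staging/.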

Sitemap.xml Fundamentals

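A sitemap lists the URLs you want search engines to discover, along with optional metadata such as when each page last changed. These values are hints, not commands; Google, for one, documents that it ignores changefreq and priority: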

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
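If you generate the sitemap from a list of pages, something along these lines may work as a starting point. It is a sketch using only the standard library; the function name build_sitemap and the (loc, lastmod) tuple shape are illustrative choices, not part of any sitemap tooling. It emits only loc and lastmod and leaves out the optional changefreq and priority fields.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML (as UTF-8 bytes) from (loc, lastmod) pairs.

    `lastmod` is a YYYY-MM-DD string, or None to omit the element.
    """
    # Serialize the sitemap namespace as the default xmlns, not as ns0:.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://example.com/", "2026-05-01"),
    ("https://example.com/about", None),
])
print(xml_bytes.decode("utf-8"))
```

Writing xml_bytes to sitemap.xml in binary mode keeps the declared encoding and the file contents in agreement.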

Validating via API

curl -X POST https://api.toolkitapi.io/v1/seo/validate-robots \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Example response:

{
  "robots_txt_found": true,
  "sitemap_declared": true,
  "sitemap_url": "https://example.com/sitemap.xml",
  "sitemap_urls_count": 142,
  "blocked_paths": ["/admin/", "/internal/"],
  "issues": []
}
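The same call can be scripted, for instance as a check in a deployment pipeline. The sketch below builds the request with Python's standard library and turns a response like the one above into a list of problems. The endpoint, the X-API-Key header, and the field names come from the example; the helper names build_request and summarize, the Content-Type header, and the assumption that issues holds plain strings are mine.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.toolkitapi.io/v1/seo/validate-robots"


def build_request(site_url, api_key):
    """Prepare (but do not send) the validation request."""
    body = json.dumps({"url": site_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def summarize(report):
    """Collect problems from a parsed response dict.

    Assumes `issues` is a list of strings, as the sample suggests.
    """
    problems = list(report.get("issues", []))
    if not report.get("robots_txt_found"):
        problems.append("robots.txt not found")
    if not report.get("sitemap_declared"):
        problems.append("sitemap not declared in robots.txt")
    return problems
```

To send the request, pass the result of build_request to urllib.request.urlopen, decode the body with json.load, and hand the resulting dict to summarize. An empty list means nothing was flagged.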

Common Mistakes

Mistake | Consequence
Blocking CSS/JS via robots.txt | Google can't render pages correctly
Disallow: / on production | Crawling stops site-wide, and pages drop out of search results over time
Sitemap not declared in robots.txt | Crawlers may not discover it unless it is submitted directly
Sitemap contains non-canonical URLs | Confuses crawlers about the preferred URL
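Several of these mistakes can be caught before deployment with a small local check. The following is a rough sketch built on urllib.robotparser, not a complete linter. The function name audit_robots and the default asset paths (/static/app.css and /static/app.js) are hypothetical and should be replaced with real asset URLs from your site.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt,
                 asset_paths=("/static/app.css", "/static/app.js"),
                 agent="Googlebot"):
    """Return a list of findings for a robots.txt given as text.

    `asset_paths` are placeholder CSS/JS paths; pass your own.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    findings = []
    # A blocked homepage usually means a blanket "Disallow: /".
    if not parser.can_fetch(agent, "/"):
        findings.append("homepage is blocked for " + agent)
    # Blocked CSS/JS prevents correct rendering.
    for path in asset_paths:
        if not parser.can_fetch(agent, path):
            findings.append("asset is blocked: " + path)
    # site_maps() returns None when no Sitemap line is present.
    if not parser.site_maps():
        findings.append("no Sitemap directive found")
    return findings


production_leak = "User-agent: *\nDisallow: /\n"
for finding in audit_robots(production_leak):
    print(finding)
```

The check does not cover non-canonical URLs inside the sitemap itself, since that requires comparing each listed URL against the page's canonical tag.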

Linking Sitemap in robots.txt

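Always declare your sitemap in robots.txt, using an absolute URL. The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file: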

Sitemap: https://example.com/sitemap.xml
