robots.txt and Sitemap.xml: Best Practices for Crawl Control
Technical SEO
robots.txt Fundamentals
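robots.txt is a plain-text file at the root of your domain that tells
web crawlers which paths they may and may not crawl. It is advisory:
well-behaved crawlers honor it, but it is not access control, and it
governs crawling rather than indexing.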
User-agent: *
Disallow: /admin/
Disallow: /internal/
Allow: /

User-agent: Googlebot
Disallow: /staging/
Key points:
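- * applies to any crawler that has no more specific group; a crawler that matches a named group follows only that group, so Googlebot above obeys Disallow: /staging/ but not the /admin/ and /internal/ rules
- Disallow: / blocks all crawling (useful for staging environments), though blocked URLs can still be indexed if other sites link to them
- The file must be served from the root of the host, e.g. https://yourdomain.com/robots.txt; each subdomain needs its own file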
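
To see how group selection plays out, the rules above can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_allowed and the crawler name ExampleBot are illustrative assumptions, not part of any tool mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, with a blank line between groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal/
Allow: /

User-agent: Googlebot
Disallow: /staging/
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may crawl path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    checks = [
        ("ExampleBot", "/admin/settings"),  # hypothetical crawler, uses the * group
        ("ExampleBot", "/blog/post"),
        ("Googlebot", "/admin/settings"),   # uses only the Googlebot group
        ("Googlebot", "/staging/build"),
    ]
    for agent, path in checks:
        print(agent, path, is_allowed(ROBOTS_TXT, agent, path))
```

Note that Python's parser applies the first matching rule in file order, while RFC 9309 and Google pick the most specific (longest) matching path. The two approaches agree for this file, but they can differ when Allow and Disallow rules overlap.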
Sitemap.xml Fundamentals
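A sitemap lists the URLs you want search engines to discover and crawl, with optional hints such as the last modification date. Treat changefreq and priority as hints at most: Google states that it ignores both and relies on lastmod only when it is consistently accurate.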
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
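
A sitemap in this format is straightforward to read programmatically, which helps when auditing what you actually publish. The sketch below uses Python's standard xml.etree.ElementTree; the function name list_sitemap_urls and the prefix sm are illustrative choices, and the sample document is the one shown above.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol; "sm" is an arbitrary local prefix.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
"""


def list_sitemap_urls(xml_bytes):
    """Return a list of (loc, lastmod) tuples from a urlset sitemap."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        # lastmod is optional in the protocol, so it may be None.
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


if __name__ == "__main__":
    for loc, lastmod in list_sitemap_urls(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

This handles a plain urlset only; a sitemap index file (sitemapindex) would need a second pass over its child sitemaps.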
Validating via API
curl -X POST https://api.toolkitapi.io/v1/seo/validate-robots \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
Example response:
{
  "robots_txt_found": true,
  "sitemap_declared": true,
  "sitemap_url": "https://example.com/sitemap.xml",
  "sitemap_urls_count": 142,
  "blocked_paths": ["/admin/", "/internal/"],
  "issues": []
}
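
A response like this is easy to turn into pass/fail checks, for example in a deployment pipeline. The sketch below assumes only the field names visible in the sample response; the function summarize_report is hypothetical, and the shape of individual entries in issues is unknown, so they are simply converted to strings.

```python
import json

# The sample response from this section, used here as test input.
SAMPLE_RESPONSE = """
{
  "robots_txt_found": true,
  "sitemap_declared": true,
  "sitemap_url": "https://example.com/sitemap.xml",
  "sitemap_urls_count": 142,
  "blocked_paths": ["/admin/", "/internal/"],
  "issues": []
}
"""


def summarize_report(report):
    """Return a list of warning strings for a validation report dict."""
    warnings = []
    if not report.get("robots_txt_found", False):
        warnings.append("robots.txt was not found")
    if not report.get("sitemap_declared", False):
        warnings.append("no sitemap is declared in robots.txt")
    elif report.get("sitemap_urls_count", 0) == 0:
        warnings.append("declared sitemap contains no URLs")
    # Assumption: each issue can be rendered meaningfully with str().
    for issue in report.get("issues", []):
        warnings.append("reported issue: " + str(issue))
    return warnings


if __name__ == "__main__":
    report = json.loads(SAMPLE_RESPONSE)
    problems = summarize_report(report)
    print("OK" if not problems else "\n".join(problems))
```

An empty list means the report raised no warnings; a pipeline could fail the build whenever the list is non-empty.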
Common Mistakes
| Mistake | Consequence |
|---|---|
| Blocking CSS/JS via robots.txt | Google can't render pages correctly |
| Disallow: / on production | Site stops being crawled and can drop out of search results |
| Sitemap not declared in robots.txt | Crawlers may not find it |
| Sitemap contains non-canonical URLs | Confuses crawlers about the preferred URL |
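
The first two mistakes in the table can be caught before deployment with a small local check. This is a rough sketch built on Python's standard urllib.robotparser; the function lint_robots and the asset paths are made up for illustration, and you would substitute the CSS and JavaScript paths your pages actually load.

```python
from urllib.robotparser import RobotFileParser


def lint_robots(robots_txt, asset_paths, agent="Googlebot"):
    """Return a list of problems found in robots_txt for the given crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    # Mistake: a blanket Disallow that also blocks the homepage.
    if not parser.can_fetch(agent, "/"):
        problems.append("homepage is blocked for " + agent)
    # Mistake: CSS/JS needed for rendering is blocked.
    blocked = [path for path in asset_paths if not parser.can_fetch(agent, path)]
    if blocked:
        problems.append("render-critical assets blocked: " + ", ".join(blocked))
    return problems


if __name__ == "__main__":
    staging_rules = "User-agent: *\nDisallow: /\n"
    for problem in lint_robots(staging_rules, ["/static/app.css", "/static/app.js"]):
        print(problem)
```

Running a check like this against the production robots.txt is a cheap guard against shipping staging rules by accident.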
Linking Sitemap in robots.txt
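Always declare your sitemap with an absolute URL. The Sitemap line is independent of any User-agent group and can appear anywhere in the file: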
Sitemap: https://example.com/sitemap.xml
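
To confirm that a robots.txt file actually declares a sitemap, you can read the directive back. A minimal sketch, assuming Python 3.8 or later, where RobotFileParser.site_maps() is available; the helper name declared_sitemaps is illustrative.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in robots_txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    rules = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Sitemap: https://example.com/sitemap.xml\n"
    )
    print(declared_sitemaps(rules))
```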