Should I block the AI crawlers (GPTBot, ClaudeBot)?

It depends on your strategy. Blocking training crawlers such as GPTBot keeps your content out of model training; blocking search-oriented crawlers such as OAI-SearchBot also keeps it from being cited in AI answers. Many publishers now let them crawl but watch attribution patterns.

Is robots.txt case-sensitive?

Paths are case-sensitive (Disallow: /Admin won't block /admin). Directives (User-agent, Disallow) are not.

Robots.txt: The 2026 Best Practices Guide

Never block CSS or JS

Google needs to render your page the way a user's browser does. Disallowing /js/ or /css/ leaves the rendered page incomplete and can hurt rankings. If a broader Disallow rule covers those directories, add explicit Allow rules for them.

Use Disallow for crawl budget, not security

Robots.txt blocks crawling, not indexing. URLs blocked in robots.txt can still appear in search results if other sites link to them - they'll just have no snippet. To keep a page out of the index, use a noindex meta tag and leave the page crawlable so the tag can be read.

Reference your sitemap

Add 'Sitemap: https://example.com/sitemap.xml' at the bottom. Helps every crawler - Google, Bing, DuckDuckGo - find your URL inventory.

Test before you deploy

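Search Console's robots.txt report (the successor to the retired robots.txt Tester) flags errors and warnings in the file Google fetched. Validate every change with a parser before it reaches production. One typo can cost you weeks of traffic.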

In-depth guide

A longer, practitioner-level breakdown of robots.txt best practices - written for readers who want the full picture, not just the summary above.

What robots.txt does and does not do

Robots.txt is a set of instructions to compliant web crawlers about which URLs they may fetch. Google, Bing, and every other major search engine respect it. It does not enforce anything - it is voluntary compliance, not a security layer. Bad actors ignore robots.txt entirely, so it is not a mechanism for protecting sensitive URLs.

Critically, robots.txt controls crawling, not indexing. A URL blocked in robots.txt can still appear in Google's index if other sites link to it - the search result will just have no snippet because Google could not crawl the page to generate one. To prevent indexing, use a noindex meta tag on the page itself. To prevent both indexing and crawling, block in robots.txt after the noindex has been discovered and processed.

Robots.txt lives at the root of each host (example.com/robots.txt; a subdomain needs its own file). Crawlers re-fetch it regularly rather than before every request; Google generally caches it for up to 24 hours. A single character error can cost you weeks of traffic - the file deserves careful editing and pre-deploy validation, not casual updates.
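As a small illustration of where the file lives, the sketch below derives the robots.txt URL that governs any given page URL. The function name robots_url and the example URLs are made up for this example; the only assumption is that the file sits at /robots.txt on the same scheme and host as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```

The second call shows why a subdomain is not covered by the main site's file: it resolves to a different robots.txt URL.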

The rules that always apply

Never block CSS or JavaScript. Google needs to render your page like a user does, and blocking rendering resources makes the rendered HTML incomplete. This is the most common self-inflicted rankings wound we see. If your robots.txt disallows /js/ or /css/ or any wildcard that could catch rendering resources, remove the block immediately.

Always allow Googlebot access to your images, unless you have a specific reason to keep them out of Google Image Search. Image traffic is real, especially for visual categories (recipes, products, tutorials). Blocking /images/ by default is a common mistake in default hosting templates.

Always reference your sitemap at the bottom of robots.txt with a Sitemap: directive. This is the fastest way for crawlers (Google, Bing, DuckDuckGo, and others) to find your URL inventory. Multiple Sitemap: lines are valid if you split by content type.
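To see how a crawler-side parser reads multiple Sitemap: lines, here is a minimal sketch using Python's standard urllib.robotparser (site_maps() requires Python 3.8 or later). The sitemap file names are placeholders, not a recommended naming scheme.

```python
from urllib.robotparser import RobotFileParser

# Example file content; the sitemap URLs are illustrative placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None if the file has no Sitemap line.
sitemaps = parser.site_maps()
print(sitemaps)
```

A check like this in a deploy script can confirm that the Sitemap: lines survived an edit and still point at the URLs you expect.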

Disallow rules for crawl budget, not security

The right use of Disallow is to save crawl budget on URLs that have no SEO value. Faceted navigation combinations that generate millions of URLs, session-ID parameters, calendar archives extending to arbitrary years, and internal search results are all common candidates.

The wrong use of Disallow is to hide URLs you consider private. Blocked URLs can still appear in the index via inbound links, and the block signals to attackers exactly which URLs you consider sensitive - a robots.txt file is often the first thing a security researcher reads on a new target.

For genuinely private URLs, use authentication (a login wall). For URLs that are public but should stay out of search results, use the X-Robots-Tag: noindex HTTP header, which only works if the URL is not blocked from crawling. Neither approach advertises the existence of the URL to anyone reading your robots.txt file.
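One way to attach the X-Robots-Tag header is at the application layer. The following is a minimal WSGI sketch, not a production setup: the /internal-reports/ prefix, demo_app, add_noindex, and headers_for are all hypothetical names for illustration. Note that the header only keeps a URL out of search results; it does not restrict access, so private content still needs authentication.

```python
NOINDEX_PREFIXES = ("/internal-reports/",)  # hypothetical path prefix


def demo_app(environ, start_response):
    """Stand-in for the real application."""
    body = b"quarterly report"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def add_noindex(app, prefixes=NOINDEX_PREFIXES):
    """WSGI middleware: append X-Robots-Tag: noindex for matching paths."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.startswith(prefixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)
    return middleware


def headers_for(app, path):
    """Call the app directly (no network) and return its response headers."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    app({"REQUEST_METHOD": "GET", "PATH_INFO": path}, start_response)
    return captured


wrapped = add_noindex(demo_app)
print(headers_for(wrapped, "/internal-reports/q3").get("X-Robots-Tag"))  # noindex
print(headers_for(wrapped, "/blog/").get("X-Robots-Tag"))                # None
```

The same header can usually be set in web server configuration instead; the middleware form is used here only because it is easy to run and test in isolation.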

Wildcards, path matching, and case sensitivity

Paths in robots.txt are case-sensitive. 'Disallow: /Admin' does not block '/admin'. Directives (User-agent, Disallow, Allow) are not case-sensitive but paths always are. When in doubt, use lowercase paths matching your actual URL structure.
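The case rules can be checked quickly with Python's standard urllib.robotparser. In this sketch the directive names are deliberately written in mixed case to show that they still parse, while the path comparison stays case-sensitive. The URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Directive names in odd casing on purpose; the path is capitalised.
ROBOTS_TXT = """\
user-AGENT: *
DISALLOW: /Admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/Admin"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/admin"))  # True
```

The lowercase /admin stays crawlable because the rule only matches the exact casing it was written with.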

Wildcards are supported: * matches any character sequence and $ matches the end of the URL. 'Disallow: /*?' blocks any URL containing a query string. 'Disallow: /*.pdf$' blocks any URL ending in .pdf. These are powerful for pattern-based blocks but unforgiving of typos - one misplaced wildcard can block your entire site.

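For Google and other crawlers that follow RFC 9309, Allow and Disallow rules in the same User-agent group are resolved by specificity, not order: the rule with the longest matching path wins, and when an Allow and a Disallow match equally, Allow wins. 'Disallow: /downloads/' with 'Allow: /downloads/public/' on the next line blocks the parent but allows the child.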
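The matching behaviour described in this section can be sketched in a few lines of Python. This is a simplified illustration of longest-match resolution with * and $ for a single User-agent group, not a full parser: it ignores percent-encoding and other edge cases, and pattern_to_regex and is_allowed are names invented for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text; each '*' becomes "any sequence of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, pattern) pairs from one User-agent group.

    path: URL path plus query string, e.g. "/docs/a.pdf?x=1".
    """
    best_len, allowed = -1, True  # no matching rule means the URL is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("Disallow", "/downloads/"),
    ("Allow", "/downloads/public/"),
    ("Disallow", "/*.pdf$"),
]

print(is_allowed(RULES, "/downloads/private/a.zip"))     # False
print(is_allowed(RULES, "/downloads/public/a.zip"))      # True
print(is_allowed(RULES, "/docs/guide.pdf"))              # False
print(is_allowed(RULES, "/docs/guide.pdf?download=1"))   # True
print(is_allowed(RULES, "/downloads/public/guide.pdf"))  # True
```

The last case is the instructive one: three rules match, and the longest of them ('Allow: /downloads/public/') decides the outcome even though a PDF block also applies. The fourth case shows that $ anchors to the very end of the URL, so a trailing query string escapes the .pdf rule.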

AI crawler user agents in 2026

Beyond traditional search engines, a robots.txt file must now account for AI crawler user agents: GPTBot (OpenAI training), OAI-SearchBot (ChatGPT Search index), ChatGPT-User (on-demand fetches during chat), PerplexityBot (Perplexity index), ClaudeBot (Anthropic), Google-Extended (Gemini training; a control token honored by Google's existing crawlers rather than a separate bot), CCBot (Common Crawl), and others.

The decision for each is strategic. Allowing them makes your content eligible for AI training and/or citation. Blocking removes you from those systems. Blocking GPTBot but allowing OAI-SearchBot keeps you out of training but in search. Blocking Google-Extended keeps you out of Gemini training but does not affect Google Search.
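A per-bot policy like the one described above can be written as separate User-agent groups and verified locally. The sketch below uses Python's standard urllib.robotparser; the policy itself (block GPTBot and Google-Extended, allow OAI-SearchBot) and the article URL are only an example, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Example policy: opt out of training, stay eligible for AI search citation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ARTICLE = "https://example.com/articles/robots-guide"
access = {
    agent: parser.can_fetch(agent, ARTICLE)
    for agent in ("GPTBot", "Google-Extended", "OAI-SearchBot", "Googlebot")
}
print(access)
# {'GPTBot': False, 'Google-Extended': False, 'OAI-SearchBot': True, 'Googlebot': True}
```

Googlebot has no group of its own here, so it falls through to the * group, which only blocks /search.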

Most publishers should allow the citation-oriented bots (OAI-SearchBot, PerplexityBot, ChatGPT-User) and think carefully before blocking the training-oriented ones. Being included in training data may improve your chances of being cited in future answers, though the effect is hard to measure. Sites that blanket-block all AI crawlers risk losing more citation traffic than they gain in principled abstention.

Testing before you deploy

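Search Console's robots.txt report shows which robots.txt files Google found for your site, when each was last fetched, and any errors or warnings. To check a specific URL, use the URL Inspection tool, which reports whether crawling of that URL is blocked by robots.txt. Both tools read the live file, so validate changes with a parser before pushing to production and check the report after deploying. The test takes 30 seconds and prevents most disasters.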

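For batch testing, use a script that evaluates each of your top 50 URLs against the robots.txt rules for Googlebot's user agent token. Fetching the pages with curl is not enough, because robots.txt never changes the HTTP response: a blocked URL still returns 200. Any URL the parser allows that should be blocked, or blocks that should be crawled, is a mismatch to investigate.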
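A minimal sketch of such a batch check follows, evaluating the rules with Python's standard urllib.robotparser rather than relying on HTTP status codes. The function find_mismatches, the sample file, and the expectation list are hypothetical. One caveat: urllib.robotparser uses plain prefix matching in file order and does not interpret * or $, so results for wildcard rules can differ from Googlebot's.

```python
from urllib.robotparser import RobotFileParser


def find_mismatches(robots_txt, expectations, user_agent="Googlebot"):
    """Return (url, expected_allowed, actual_allowed) for every disagreement."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = []
    for url, should_allow in expectations.items():
        actual = parser.can_fetch(user_agent, url)
        if actual != should_allow:
            mismatches.append((url, should_allow, actual))
    return mismatches


# Candidate file with an accidental rule that blocks the whole blog.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /blog
"""

# URL -> should Googlebot be allowed to crawl it?
EXPECTED = {
    "https://example.com/": True,
    "https://example.com/search?q=shoes": False,
    "https://example.com/blog/robots-guide": True,
}

for url, expected, actual in find_mismatches(ROBOTS_TXT, EXPECTED):
    print(f"MISMATCH {url}: expected allowed={expected}, got allowed={actual}")
```

Run against the candidate file before deploying, a non-empty result is a reason to stop the release. In practice the expectation list would come from your top URLs plus the paths you intend to block.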

Version-control your robots.txt file. Every change should be a commit with a message explaining why. When traffic drops mysteriously three weeks after a robots.txt edit, the commit history is the fastest way to identify the culprit and roll back cleanly.

Common robots.txt mistakes and their symptoms

Mistake one: a site-wide block ('Disallow: /') that never gets removed after launch. Symptom: the entire site can drop out of the index within days. Fix: remove the block immediately and request re-crawl of the sitemap. Recovery typically takes one to four weeks after the fix.

Mistake two: blocking /wp-admin/ or /admin/ with a prefix or wildcard pattern (for example 'Disallow: /admin' without the trailing slash) that also catches /admin-guide/. Symptom: seemingly unrelated content pages become invisible. Fix: use more specific paths or Allow overrides for the collateral damage.
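The collateral damage in mistake two is easy to reproduce. The sketch below compares a rule without the trailing slash against a scoped one, using Python's standard urllib.robotparser; the helper name allowed and the URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots_txt and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


TOO_BROAD = "User-agent: *\nDisallow: /admin\n"  # prefix also matches /admin-guide/
SCOPED = "User-agent: *\nDisallow: /admin/\n"    # only the /admin/ directory

GUIDE = "https://example.com/admin-guide/"

print(allowed(TOO_BROAD, GUIDE))                          # False
print(allowed(SCOPED, GUIDE))                             # True
print(allowed(SCOPED, "https://example.com/admin/users")) # False
```

Because rules are prefix matches, the trailing slash is what limits the block to the directory itself.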

Mistake three: pointing Sitemap: at the wrong URL after a domain change. Symptom: Google reports zero URLs discovered from your sitemap. Fix: update the Sitemap: line to the current sitemap URL and re-submit in Search Console.