Your Robots.txt Is Probably Hiding You from AI

We audited a SaaS company last quarter that had solid SEO, clean structured data, and a well-maintained blog. But when we asked ChatGPT and Claude about their product category, they were invisible. Not mentioned. Not cited. Not even close.

The culprit was two lines in their robots.txt file. They'd blocked GPTBot and ClaudeBot a year earlier after reading a blog post about protecting their content from AI training. Fair enough. But the AI crawlers had changed since then. Those two lines were now blocking search retrieval bots too, not just training crawlers. Their content was locked out of every AI answer.

A two-minute edit fixed it. They were showing up in AI-powered search within weeks.

Here's exactly what you need to know about robots.txt and AI crawlers, and the specific lines to add or change.

Why Robots.txt Matters for AI Visibility

Robots.txt has been around since 1994. It tells web crawlers what they can and can't access on your site. It's a simple text file that sits at your root domain (yoursite.com/robots.txt), and every well-behaved bot reads it before doing anything else.
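If you've never opened one, here's the basic shape (a generic illustration, not a recommended configuration): a User-agent line names a bot, and the Allow and Disallow lines that follow tell it which paths it may touch.

User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml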

The problem: AI companies now run multiple crawlers, each with a different purpose. OpenAI alone operates three. Anthropic has three. Google has a separate one just for AI training. And the user-agent strings change. If you set a blanket block a year ago, you're almost certainly blocking things you don't intend to.

Worse, many sites have no AI-specific rules at all. The default robots.txt most CMS platforms generate doesn't mention GPTBot, ClaudeBot, or any of the newer crawlers. That means you're leaving it up to chance whether your content shows up in ChatGPT's recommendations or Claude's answers.

Training Bots vs. Search Bots: The Distinction That Matters

This is where most guides get it wrong. They treat all AI crawlers the same. Block them all or allow them all. But AI crawlers now fall into two categories, and treating them identically is a mistake.

Training crawlers download your content to include in future model training datasets. If you block these, your content won't be used to train the next version of GPT or Claude. It won't disappear from existing models, but it won't appear in new ones either.

Search and retrieval crawlers fetch your content in real time to answer user questions right now. Block these, and your pages won't show up when someone asks ChatGPT or Perplexity a question that your content could answer. This is the one that directly affects your visibility today.

The smart move for most businesses: block training crawlers if you want to protect your intellectual property, but allow search and retrieval bots so your content can still be cited in AI-powered answers. If you're trying to improve your AI visibility, blocking retrieval bots is like taking your site offline for an entire search channel.

Every AI Crawler You Need to Know

Here's the current list of major AI crawlers, what they do, and the exact user-agent string for your robots.txt. This is accurate as of early 2026, but these change, so check back quarterly.

OpenAI (ChatGPT)

OpenAI runs three crawlers:

  • GPTBot - Training crawler. Downloads content for model training. Block this if you don't want your content used to train future GPT models.
  • OAI-SearchBot - Search retrieval. Powers ChatGPT's search results and citations. Block this and your pages disappear from ChatGPT search answers.
  • ChatGPT-User - User-initiated. Fetches pages when a ChatGPT user specifically asks about a URL or topic that requires live web access.

Anthropic (Claude)

Anthropic also runs three, mirroring OpenAI's structure:

  • ClaudeBot - Training crawler. Collects content for Claude's model training.
  • Claude-SearchBot - Search retrieval. Indexes your content so Claude can cite it in search-powered answers.
  • Claude-User - User-initiated. Fires when a Claude user requests live web content.

Note: Anthropic previously used anthropic-ai and Claude-Web as user agents. Both are deprecated but worth keeping in your robots.txt for backward compatibility.
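If you want that backward compatibility, the two legacy entries look like this (the Disallow here is just an example; match it to whatever choice you make for ClaudeBot):

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /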

Google (Gemini)

  • Google-Extended - Training crawler for Gemini models and Vertex AI. Blocking this prevents your content from being used in Gemini training, but does not affect your Google Search rankings. Googlebot is separate.

Others Worth Including

  • PerplexityBot - Perplexity AI's crawler for both indexing and retrieval.
  • Bytespider - ByteDance's crawler, used to train AI models.
  • CCBot - Common Crawl's bot. Many AI models use Common Crawl data for training.
  • Amazonbot - Amazon's crawler, used for Alexa and AI features.
  • Applebot-Extended - Apple's AI training crawler, separate from Applebot which handles Siri and Spotlight.
  • cohere-ai - Cohere's training crawler.

The Robots.txt Snippets You Actually Need

Here are three ready-to-use configurations. Pick the one that matches your situation and paste it into your robots.txt file.

Option 1: Allow Everything (Maximum AI Visibility)

Use this if your primary goal is showing up in AI answers and you're not concerned about training data usage.

# AI Crawlers - Allow All
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Allow: /

Option 2: Block Training, Allow Search (Recommended for Most Sites)

This is what we recommend for most businesses. It keeps your content out of future training datasets while making sure you still appear in AI-powered search results.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow AI search and retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

Option 3: Selective Access (Allow AI Crawlers on Some Pages Only)

If you want AI bots to index your blog and public docs but stay away from proprietary content, use path-level rules.

# Allow AI search bots on blog and public docs only
User-agent: OAI-SearchBot
Allow: /blog/
Allow: /docs/
Disallow: /

User-agent: Claude-SearchBot
Allow: /blog/
Allow: /docs/
Disallow: /

User-agent: PerplexityBot
Allow: /blog/
Allow: /docs/
Disallow: /

# Block all training crawlers entirely
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

How to Update Your Robots.txt (The Actual Two-Minute Fix)

This is not complicated. Here's the process:

1. Find your robots.txt file. It lives at your site root. Check yoursite.com/robots.txt in a browser. If you see a file, great. If you get a 404, you need to create one.

2. Check what's already there. Look for any existing User-agent lines mentioning GPTBot, ClaudeBot, or similar. If you see Disallow: / for bots you want to allow, that's your problem.

3. Add the new rules. Copy the snippet from Option 1, 2, or 3 above and paste it into your file. Put AI-specific rules below your existing Googlebot and general rules.

4. Test it. Google Search Console's robots.txt report (the old standalone tester has been retired) will show you what Google fetched and flag parse errors. Or just reload yoursite.com/robots.txt in a browser and make sure the formatting looks clean.

5. Wait. Crawlers don't re-read your robots.txt instantly. Give it a few days to a few weeks for changes to take effect. OpenAI and Anthropic don't publish exact re-crawl schedules, but we typically see changes reflected within two to four weeks.
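You don't have to wait to sanity-check your rules, though. Python's standard library ships a robots.txt parser, so a short script can tell you exactly which AI user agents your current file allows. A minimal sketch (the sample rules and example.com URL are placeholders; paste in your own file and the bots you care about):

```python
from urllib.robotparser import RobotFileParser

def check_ai_access(robots_txt: str, bots: list[str], url: str) -> dict[str, bool]:
    """Return {user_agent: allowed} for each bot, given a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}

# A fragment mirroring Option 2 above: block the training bot, allow the search bot.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

for bot, ok in check_ai_access(
    sample, ["GPTBot", "OAI-SearchBot"], "https://example.com/blog/post"
).items():
    print(f"{bot}: {'allowed' if ok else 'blocked'}")
```

To test the live file instead of a pasted string, use parser.set_url("https://yoursite.com/robots.txt") followed by parser.read(), which fetches it over HTTP before you call can_fetch.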

What Robots.txt Can't Do

A few things to be honest about.

Robots.txt is a request, not a wall. It tells crawlers what you'd prefer. Well-behaved bots from OpenAI, Google, and Anthropic respect it. But there's no technical enforcement. A rogue scraper can ignore it entirely.
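Whether bots actually comply is something you can check from your own access logs: every request carries a user-agent string, so a quick scan for the crawler names listed earlier shows who's really visiting. A minimal sketch (assumes your log format includes the user-agent, as the common nginx/Apache combined format does; the log path in the comment is a typical location, not a guarantee):

```python
from collections import Counter

# User-agent substrings for the AI crawlers covered above.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "Google-Extended", "PerplexityBot", "Bytespider",
    "CCBot", "Amazonbot", "Applebot-Extended", "cohere-ai",
]

def count_ai_hits(log_lines):
    """Tally requests per AI crawler across access-log lines."""
    hits = Counter()
    for line in log_lines:
        for agent in AI_AGENTS:
            if agent in line:
                hits[agent] += 1
                break  # one crawler per request line
    return hits

# Typical usage:
#   with open("/var/log/nginx/access.log") as f:
#       print(count_ai_hits(f).most_common())
```

Run it against a week of logs and compare the counts to your robots.txt rules; a training bot that keeps showing up after you've disallowed it is exactly the "request, not a wall" problem in action.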

Also, newer AI browser agents, like OpenAI's ChatGPT Atlas browser, use standard Chrome user-agent strings. They look like regular web traffic, so your robots.txt rules for GPTBot won't catch them. There's currently no clean solution for this, and it's worth knowing.

And blocking a training crawler doesn't retroactively remove your content from existing models. If GPT-4 was trained on your content before you blocked GPTBot, that content is already in the model. Blocking prevents future training runs from including new content, but it doesn't rewind anything.

For deeper context on how to make your content readable to AI systems, not just crawlable, check our piece on what llms.txt actually does. The two files solve different problems.

What to Do Right Now

Open your robots.txt. It will take you less time to fix it than it took to read this article.

If you have no AI crawler rules, add them. Option 2 above is the right default for most businesses. If you have blanket blocks from a year ago, update them to distinguish between training and search bots. And if you're unsure what your current AI visibility looks like, test it.

The robots.txt fix is the lowest-effort, highest-impact change you can make for AI visibility. It's not the only thing that matters. Your content quality, structured data, and brand authority all play a role. But if the door is locked, none of that matters. Open the door first.


Run the free AI visibility scan to check if AI crawlers can actually reach your site. Takes 60 seconds.