I audited a luxury hotel group's technical setup last year and found a single line in their robots.txt that was blocking GPTBot, ClaudeBot, and PerplexityBot from accessing the entire site. Their content team had spent months building AEO-optimized pages. None of it mattered because AI crawlers could not read a single page. The block had been added by a developer during a security update, and no one on the marketing team knew it was there.
This is more common than you would expect. Paul Calvano's analysis of HTTP Archive data found that over 560,000 websites now include AI bot directives in their robots.txt files, and a significant portion of those are disallow rules. Playwire's publisher guide notes that GPTBot saw a 305% increase in request volume between May 2024 and May 2025, meaning AI crawlers are actively looking for your content. If they cannot access it, you are invisible to the fastest-growing discovery channel in digital marketing.
The Bots You Need to Know
Each AI platform operates its own fleet of crawlers, and each serves a different purpose. Understanding the distinction matters because you may want to allow some and restrict others.
OpenAI runs three crawlers: GPTBot collects content for model training, OAI-SearchBot indexes content for ChatGPT's search features, and ChatGPT-User activates only when a human requests your content through the interface. Anthropic operates ClaudeBot for training and citation fetching, plus Claude-Web for broader web crawling. Perplexity uses PerplexityBot to build its search index and Perplexity-User for on-demand page fetching when someone clicks a citation. Google uses Google-Extended as a control token for Gemini AI data collection, separate from the standard Googlebot that handles traditional search indexing.
Qwairy's crawler guide notes that each of these bots reflects a different knowledge base and crawl pattern. Blocking one does not block them all. And the user-triggered bots like ChatGPT-User and Perplexity-User may not fully respect robots.txt directives because they activate in response to a direct human request.
Configuring Your robots.txt
The minimum configuration for AEO visibility explicitly allows the major AI crawlers. Your robots.txt should include allow rules for GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, Perplexity-User, and Google-Extended. BotRank's configuration guide emphasizes that the file must not be empty and should contain, at minimum, a wildcard User-agent rule with Allow directives as a baseline, followed by specific rules for each AI bot.
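Here is a minimal sketch of that configuration, assuming the user-agent strings above match what each vendor currently documents; adapt it to your own policy before deploying.

    # Baseline: allow compliant crawlers by default
    User-agent: *
    Allow: /

    # Explicit rules for the major AI crawlers
    User-agent: GPTBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: Google-Extended
    Allow: /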
If you want AI platforms to cite your content in search responses but prefer not to have it used for model training, you can use a mixed policy. Allow OAI-SearchBot and ChatGPT-User while blocking GPTBot. Allow PerplexityBot while monitoring volume. The distinction gives you control without sacrificing visibility.
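A sketch of that mixed policy, with citation and user-triggered bots allowed and the training crawler blocked, might look like this:

    # Allow search indexing and user-triggered fetching
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    # Opt out of model training
    User-agent: GPTBot
    Disallow: /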
GenRank's optimization guide recommends using path-specific rules if you want AI crawlers to access public content but not sensitive directories. For example, allowing access to /blog/ and /guides/ while disallowing /internal/ or /members/. This keeps your best content accessible while protecting what should remain private.
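Because anything not explicitly disallowed is crawlable by default, the simplest version of a path-specific policy only needs Disallow rules for the private directories. The paths below are placeholders; substitute your own.

    User-agent: GPTBot
    Disallow: /internal/
    Disallow: /members/

    User-agent: ClaudeBot
    Disallow: /internal/
    Disallow: /members/

    User-agent: PerplexityBot
    Disallow: /internal/
    Disallow: /members/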
The Cloudflare Problem
A specific and increasingly common issue: Cloudflare's security settings can default to blocking AI bots even when your robots.txt explicitly allows them. The Am I Cited audit template flags this directly, noting that many site owners discovered their content became invisible to ChatGPT and Perplexity after Cloudflare updates they did not initiate.
If you use Cloudflare, check your WAF (Web Application Firewall) rules. Perplexity's own documentation provides configuration guidelines for whitelisting their bots through Cloudflare's custom rules. The same applies to AWS WAF and other firewall services. Your robots.txt can say "welcome" while your firewall says "blocked," and the firewall wins.
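As an illustration rather than Perplexity's exact recipe, a Cloudflare custom rule that exempts AI crawlers from bot blocking typically matches on the user-agent and applies a Skip action. Verify the field names and match strings against Cloudflare's current documentation before relying on this.

    (http.user_agent contains "GPTBot")
    or (http.user_agent contains "OAI-SearchBot")
    or (http.user_agent contains "ClaudeBot")
    or (http.user_agent contains "PerplexityBot")

User-agent strings can be spoofed, so if your firewall supports it, pair this with the IP ranges each vendor publishes for verification.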
JavaScript Rendering and Server-Side Content
Robots.txt access is necessary but not sufficient. AI crawlers have limited ability to execute JavaScript. If your key content loads dynamically through client-side scripts, AI bots may crawl the page and find nothing useful. View your page source directly. If the content is not in the raw HTML, AI cannot see it.
Server-side rendering or static HTML ensures your content is available in the initial page load. Infinite scroll, JavaScript-triggered content sections, and dynamically loaded FAQ accordions are all patterns that can hide content from AI crawlers. Check your most important pages by disabling JavaScript in your browser. If the core content disappears, it is invisible to AI.
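A quick command-line version of the same check fetches the raw HTML without executing any JavaScript. The URL and search phrase below are placeholders, and the user-agent is a simplified stand-in for the full GPTBot string.

    # Fetch the page as a non-JS client and count occurrences of content you expect to see
    curl -s -A "GPTBot" "https://www.example.com/your-key-page/" \
      | grep -i -c "a phrase that should appear in the page"
    # A count of 0 means the phrase is injected client-side and invisible to most AI crawlers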
Monitoring What Bots Actually See
Allowing access and confirming access are different steps. Check your server logs for the specific user-agent strings associated with AI crawlers. Momentic's crawler guide recommends a simple grep command against your access logs filtering for GPTBot, ClaudeBot, and PerplexityBot to verify they are actually reaching your site.
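A sketch of that check, assuming an nginx-style access log; adjust the path and bot list for your server.

    # Show the most recent AI crawler requests
    grep -E "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot" /var/log/nginx/access.log | tail -20

    # Count hits per bot
    for bot in GPTBot OAI-SearchBot ClaudeBot PerplexityBot; do
      printf "%s: %s\n" "$bot" "$(grep -c "$bot" /var/log/nginx/access.log)"
    done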
If a bot you expect is missing from the logs, something is still blocking it, in robots.txt or at the firewall, or the bot simply has not discovered your content yet. Ensure your XML sitemap is current and submitted. AI crawlers use sitemaps the same way traditional search bots do: to discover new and updated content faster.
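Referencing the sitemap from robots.txt is the simplest way to make sure every crawler that reads the file can find it; the URL below is a placeholder.

    Sitemap: https://www.example.com/sitemap.xml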
Review your configuration quarterly. AI companies frequently introduce new crawlers and consolidate existing ones. Anthropic merged their earlier bot identifiers into ClaudeBot, which temporarily gave the new bot unrestricted access on sites that had only blocked the old names. Staying current prevents both accidental blocks and unintended access.
Sources:
Paul Calvano, "AI Bots and Robots.txt" (2025)
Qwairy, "Understanding AI Crawlers: Complete Guide" (2025)
BotRank, "Robots.txt Guide for AI Ranking" (2025)
GenRank, "Optimizing Your Robots.txt for Generative AI Crawlers" (2025)
Playwire, "How to Block AI Bots with robots.txt: The Complete Publisher's Guide" (2025)
Momentic, "List of Top AI Search Crawlers + User Agents" (2025)
