
If AI crawlers cannot read your site, nothing else you do for generative engine optimization matters. Not your content structure. Not your schema markup. Not your link building. The most common — and most fixable — barrier to GenAI visibility is that the crawlers are blocked, often without anyone realizing it.

This guide walks through the two most important AI crawlers to unblock, why they get blocked in the first place, and exactly how to fix it.

The Two Crawlers That Matter Most

CCBot (Common Crawl)

CCBot powers the largest open web crawl feeding LLM training data. 64% of analyzed LLMs used Common Crawl data. If CCBot cannot access your domain, your Harmonic Centrality drops to zero and LLMs will not encounter your brand during pre-training.

CCBot is also the most widely blocked AI crawler among the top 1,000 websites — frequently by accident. Cloudflare’s “AI Scrapers and Crawlers” toggle, introduced in 2024, blocks CCBot with a single click. Many site owners enabled it to stop aggressive scrapers without realizing they were also cutting themselves off from the primary LLM training pipeline.

GPTBot (OpenAI)

GPTBot crawls content for OpenAI’s model training. It is separate from OAI-SearchBot (real-time ChatGPT Search retrieval) and ChatGPT-User (user-triggered page fetches). Blocking GPTBot prevents your content from entering OpenAI’s training pipeline, but does not block you from ChatGPT Search — that depends on OAI-SearchBot access. After OpenAI announced GPTBot, hundreds of major sites promptly blocked it. Many smaller sites followed without evaluating whether blocking made strategic sense.

Why Sites Get Blocked Without Knowing

Three common scenarios cause unintentional blocking:

Cloudflare’s bot management defaults. The “AI Scrapers and Crawlers” toggle blocks a broad list of AI bots with one click. If your team enabled it, CCBot and GPTBot are blocked at the CDN level — before they reach your server. Your robots.txt may say “Allow,” but the request never gets through.

Overly broad robots.txt rules. A wildcard User-agent: * with Disallow: / blocks everything, including AI crawlers. Some sites add this during staging and forget to remove it. Others inherit it from CMS templates or security plugins.

WAF and security rules. Web Application Firewalls can rate-limit or block AI crawlers based on behavior patterns, especially when they crawl faster than traditional search bots.

How to Audit and Fix Crawler Access

Step 1: Check Your robots.txt

Visit yourdomain.com/robots.txt and search for these user agents:

| User Agent | Platform | Purpose |
| --- | --- | --- |
| CCBot | Common Crawl | Training data for most LLMs |
| GPTBot | OpenAI | Training data for GPT models |
| OAI-SearchBot | OpenAI | Real-time ChatGPT Search retrieval |
| ClaudeBot | Anthropic | Training data for Claude |
| PerplexityBot | Perplexity | Real-time search and citations |
| Google-Extended | Google | Training data for Gemini |

If any are listed with Disallow: /, change to Allow: /. If none are listed and your wildcard rule blocks all bots, add explicit allow rules for each AI crawler.
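As a sketch, a robots.txt that keeps a restrictive wildcard rule for unknown bots but explicitly allows the AI crawlers above might look like this (the /admin/ path is a placeholder; per the Robots Exclusion Protocol, a crawler follows the most specific user-agent group that matches it and ignores the wildcard group):

```
# Explicit allow rules for AI crawlers — each named group
# overrides the wildcard group below for that crawler
User-agent: CCBot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# All other bots still follow the wildcard rules
User-agent: *
Disallow: /admin/
```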

Step 2: Check Your CDN and WAF

If you use Cloudflare, navigate to Security > Bots > AI Scrapers and Crawlers and verify the toggle is off (or selectively configured). If you use Akamai or another CDN, review bot management rules for AI crawler user agent strings.

The CDN check is critical because it operates at a layer above your robots.txt. A request blocked by the CDN never reaches your server, so your robots.txt rules are irrelevant.
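One quick, if imperfect, spot check is to request a page with an AI crawler's user-agent string and compare the response to a normal request (yourdomain.com is a placeholder). Note that Cloudflare verifies genuine crawlers by IP range, so a spoofed user agent may be treated more harshly than the real bot — a 403 here is a strong hint of an edge-level block, not proof:

```shell
# Placeholder domain; compare the status line of the two responses
curl -sI -A "CCBot/2.0 (https://commoncrawl.org/faq/)" https://yourdomain.com/ | head -1
curl -sI https://yourdomain.com/ | head -1
```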

Step 3: Verify with Server Logs

The definitive test is not what your configuration says — it is what your logs show. Search for these user agent strings in your access logs: CCBot, GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended. Look for HTTP 200 responses. If you see 403s, 429s, or no entries at all, the crawlers are being blocked somewhere in your stack.
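The log search described above can be sketched in a few lines of Python. The sample log lines here are hypothetical (combined log format); in practice you would read your real access log instead:

```python
import re

# Hypothetical access-log lines in combined log format, for illustration only.
SAMPLE_LOG = """\
66.249.66.1 - - [10/May/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
52.70.240.5 - - [10/May/2025:10:00:02 +0000] "GET /pricing HTTP/1.1" 403 123 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"
20.171.206.1 - - [10/May/2025:10:00:03 +0000] "GET /blog HTTP/1.1" 200 8901 "-" "Mozilla/5.0; compatible; GPTBot/1.0"
"""

AI_CRAWLERS = ["CCBot", "GPTBot", "OAI-SearchBot",
               "ClaudeBot", "PerplexityBot", "Google-Extended"]

def crawler_status(log_text):
    """Map each AI crawler seen in the log to the HTTP status codes it received."""
    results = {}
    for line in log_text.splitlines():
        # Status code appears right after the quoted request line
        m = re.search(r'" (\d{3}) ', line)
        if not m:
            continue
        status = int(m.group(1))
        for bot in AI_CRAWLERS:
            if bot in line:
                results.setdefault(bot, set()).add(status)
    return results

for bot, statuses in crawler_status(SAMPLE_LOG).items():
    print(bot, sorted(statuses))
```

With the sample lines above, CCBot shows a 403 (blocked somewhere in the stack) while GPTBot shows a 200; a crawler that never appears at all, like ClaudeBot here, either is not crawling you or is being stopped before it reaches the server.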

Tools like CrawlerCheck and CheckAIBots can test specific URLs against multiple AI crawlers instantly.

The Strategic Middle Ground

Not every site needs to allow every crawler. You can make granular decisions:

  • Allow search bots, block training bots. Permit OAI-SearchBot and PerplexityBot (for AI search results) while blocking GPTBot and ClaudeBot (to prevent model training). This preserves retrieval-layer visibility while opting out of the training data layer.
  • Allow everything. For maximum GenAI visibility across both layers, allow all AI crawlers. This is the approach that maximizes long-term compounding.
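The "allow search bots, block training bots" policy from the first bullet could be expressed in robots.txt roughly like this (a sketch — remember that an edge-level block at your CDN would override these rules):

```
# Retrieval-layer bots: allowed (AI search citations)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-layer bots: blocked (opt out of model training)
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```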

The tradeoff is real. Blocking training crawlers protects your content from being used in model training, but it also means the model will not build associations with your brand during pre-training. For brands that want to be cited in AI responses, the visibility benefit of allowing crawler access typically outweighs the content protection concern.

After You Unblock: What Happens Next

Once AI crawlers can access your site, the timeline depends on the layer:

  • Retrieval layer (OAI-SearchBot, PerplexityBot): Your pages can start appearing in AI search citations within days to weeks of being indexed.
  • Training data layer (CCBot, GPTBot): Your content enters the next crawl cycle, but LLMs will not reflect it until their next retraining — a lag measured in months.

Unblocking crawlers is the prerequisite. What you do with that access — through content structure, web graph positioning, and syndication — determines whether the models actually cite you. Platforms like PhantomRank track whether unblocking translates into actual citations.

For the complete technical audit, see The Technical Checklist to Optimize for GenAI Crawler Access. For the full discipline, explore our guide to generative engine optimization.