If AI crawlers cannot read your site, nothing else you do for generative engine optimization matters. Not your content structure. Not your schema markup. Not your link building. The most common — and most fixable — barrier to GenAI visibility is that the crawlers are blocked, often without anyone realizing it.
This guide walks through the two most important AI crawlers to unblock, why they get blocked in the first place, and exactly how to fix it.
The Two Crawlers That Matter Most
CCBot (Common Crawl)
CCBot builds Common Crawl, the largest open web corpus feeding LLM training data; 64% of analyzed LLMs used Common Crawl data. If CCBot cannot access your domain, your harmonic centrality in the crawl's web graph drops to zero and LLMs will not encounter your brand during pre-training.
CCBot is also the most widely blocked AI crawler among the top 1,000 websites — frequently by accident. Cloudflare’s “AI Scrapers and Crawlers” toggle, introduced in 2024, blocks CCBot with a single click. Many site owners enabled it to stop aggressive scrapers without realizing they were also cutting themselves off from the primary LLM training pipeline.
GPTBot (OpenAI)
GPTBot crawls content for OpenAI’s model training. It is separate from OAI-SearchBot (real-time ChatGPT Search retrieval) and ChatGPT-User (user-triggered page fetches). Blocking GPTBot prevents your content from entering OpenAI’s training pipeline, but does not block you from ChatGPT Search — that depends on OAI-SearchBot access. After OpenAI announced GPTBot, hundreds of major sites promptly blocked it. Many smaller sites followed without evaluating whether blocking made strategic sense.
Why Sites Get Blocked Without Knowing
Three common scenarios cause unintentional blocking:
Cloudflare’s bot management defaults. The “AI Scrapers and Crawlers” toggle blocks a broad list of AI bots with one click. If your team enabled it, CCBot and GPTBot are blocked at the CDN level — before they reach your server. Your robots.txt may say “Allow,” but the request never gets through.
Overly broad robots.txt rules. A wildcard User-agent: * with Disallow: / blocks everything, including AI crawlers. Some sites add this during staging and forget to remove it. Others inherit it from CMS templates or security plugins.
WAF and security rules. Web Application Firewalls can rate-limit or block AI crawlers based on behavior patterns, especially when they crawl faster than traditional search bots.
How to Audit and Fix Crawler Access
Step 1: Check Your robots.txt
Visit yourdomain.com/robots.txt and search for these user agents:
| User Agent | Platform | Purpose |
|---|---|---|
| CCBot | Common Crawl | Training data for most LLMs |
| GPTBot | OpenAI | Training data for GPT models |
| OAI-SearchBot | OpenAI | Real-time ChatGPT Search retrieval |
| ClaudeBot | Anthropic | Training data for Claude |
| PerplexityBot | Perplexity | Real-time search and citations |
| Google-Extended | Google | Training data for Gemini |
If any are listed with Disallow: /, change to Allow: /. If none are listed and your wildcard rule blocks all bots, add explicit allow rules for each AI crawler.
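Before deploying a robots.txt change, you can sanity-check it with Python's built-in urllib.robotparser. The sketch below parses an illustrative file that combines a wildcard block with explicit allow rules, then reports which AI crawlers may fetch a page (the file contents and URL are placeholders, not a recommended policy):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: a wildcard block with explicit
# allow rules carved out for two AI crawlers.
robots_txt = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ["CCBot", "GPTBot", "OAI-SearchBot", "ClaudeBot"]:
    allowed = parser.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
# CCBot and ClaudeBot fall through to the wildcard rule and
# are blocked; GPTBot and OAI-SearchBot match their own rules.
```

Note that bots without an explicit rule inherit the wildcard entry, which is exactly how a leftover staging Disallow: / silently blocks every AI crawler at once.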
Step 2: Check Your CDN and WAF
If you use Cloudflare, navigate to Security > Bots > AI Scrapers and Crawlers and verify the toggle is off (or selectively configured). If you use Akamai or another CDN, review bot management rules for AI crawler user agent strings.
The CDN check is critical because it operates at a layer above your robots.txt. A request blocked by the CDN never reaches your server, so your robots.txt rules are irrelevant.
Step 3: Verify with Server Logs
The definitive test is not what your configuration says — it is what your logs show. Search for these user agent strings in your access logs: CCBot, GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended. Look for HTTP 200 responses. If you see 403s, 429s, or no entries at all, the crawlers are being blocked somewhere in your stack.
Tools like CrawlerCheck and CheckAIBots can test specific URLs against multiple AI crawlers instantly.
The Strategic Middle Ground
Not every site needs to allow every crawler. You can make granular decisions:
- Allow search bots, block training bots. Permit OAI-SearchBot and PerplexityBot (for AI search results) while blocking GPTBot and ClaudeBot (to prevent model training). This preserves retrieval-layer visibility while opting out of the training data layer.
- Allow everything. For maximum GenAI visibility across both layers, allow all AI crawlers. This is the approach with the greatest long-term compounding benefit, since training-layer brand associations and retrieval-layer citations reinforce each other.
The tradeoff is real. Blocking training crawlers protects your content from being used in model training, but it also means the model will not build associations with your brand during pre-training. For brands that want to be cited in AI responses, the visibility benefit of allowing crawler access typically outweighs the content protection concern.
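As an illustration, the middle-ground policy (search bots allowed, training bots blocked) might look like this in robots.txt, using the user agent strings from the table above — verify current names against each vendor's documentation before deploying:

```
# Retrieval-layer bots: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-data bots: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Keep in mind that robots.txt is advisory: compliant crawlers honor it, but it blocks nothing at the network level, which is why the CDN and server-log checks still matter.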
After You Unblock: What Happens Next
Once AI crawlers can access your site, the timeline depends on the layer:
- Retrieval layer (OAI-SearchBot, PerplexityBot): Your pages can start appearing in AI search citations within days to weeks of being indexed.
- Training data layer (CCBot, GPTBot): Your content enters the next crawl cycle, but LLMs will not reflect it until their next retraining — a lag measured in months.
Unblocking crawlers is the prerequisite. What you do with that access — through content structure, web graph positioning, and syndication — determines whether the models actually cite you. Platforms like PhantomRank track whether unblocking translates into actual citations.
For the complete technical audit, see The Technical Checklist to Optimize for GenAI Crawler Access. For the full discipline, explore our guide to generative engine optimization.