AI Opt-Out Controls
Comprehensive templates and server-side controls to limit AI crawling, training data collection, and content referencing from your website.
Reality Check
- robots.txt is widely respected by reputable crawlers, but not universally enforced.
- ai.txt and llms.txt are emerging proposals; adoption varies across vendors.
- Some crawlers may spoof or rotate user-agents. Consider server-side controls and monitoring.
- "Training" vs. "inference" are different activities—decide which you want to allow or block.
- Recent reports indicate some bots evade blocks or ignore robots.txt entirely.
Training vs. On-Demand Fetch
Training Crawlers
- GPTBot: OpenAI model training
- CCBot: Common Crawl corpus used for training
- ClaudeBot: Anthropic training data
- Usually respect robots.txt
- Bulk, systematic crawling
Inference/User Crawlers
- ChatGPT-User: real-time browsing
- PerplexityBot: search results
- Claude-User: user-requested content
- May ignore robots.txt for user requests
- Targeted, on-demand fetching
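The training/inference split above can be sketched as a simple user-agent classifier. The crawler names come from the lists above; the substring matching is a simplified illustration, not production bot detection (user-agents can be spoofed):

```python
import re

# Crawler names taken from the lists above; real detection needs more signals.
TRAINING_BOTS = re.compile(r"GPTBot|CCBot|ClaudeBot", re.IGNORECASE)
INFERENCE_BOTS = re.compile(r"ChatGPT-User|PerplexityBot|Claude-User", re.IGNORECASE)

def classify_crawler(user_agent: str) -> str:
    """Return 'training', 'inference', or 'other' for a User-Agent string."""
    # Check inference names first: none of them collide with the training names,
    # but explicit ordering keeps the intent clear.
    if INFERENCE_BOTS.search(user_agent):
        return "inference"
    if TRAINING_BOTS.search(user_agent):
        return "training"
    return "other"

print(classify_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))   # training
print(classify_crawler("ChatGPT-User/1.0"))                       # inference
print(classify_crawler("Googlebot/2.1"))                          # other
```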
robots.txt Templates
Place at /robots.txt. These examples target common AI-related crawlers. Adjust to your policy.
Block Specific AI Crawlers
# Block selected AI crawlers from all content
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
# Google-Extended is used for AI training access signals
User-agent: Google-Extended
Disallow: /
# Allow everyone else
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
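Before deploying a template like this, you can sanity-check it with Python's standard urllib.robotparser. The rules string below is a minimal excerpt of the template above, parsed in-memory:

```python
import urllib.robotparser

# Minimal excerpt of the template above, parsed without fetching anything.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# GPTBot is disallowed everywhere; other agents fall through to the * group.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/article"))       # False
print(rp.can_fetch("SomeBrowser", "https://yourdomain.com/article"))  # True
```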
Block All Except Major Search
# Allow conventional search engines, block common AI training crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: *
Disallow: /
ai.txt (Emerging)
Place at /ai.txt. The ai.txt file communicates site AI usage policies. It is not a universal standard, but some vendors check it.
# ai.txt — Site policy for AI usage
# Scope: training (model development) vs inference (runtime usage)
owner: Your Company Name
contact: legal@yourdomain.com
policy: https://yourdomain.com/policies/ai-usage
# Explicit preferences
allow: inference
disallow: training
# Notes for crawlers
note: Please respect robots.txt and do not store content for model training.
updated: 2025-01-01
Tip: Keep your policy URL live and human-readable. Log requests to /ai.txt to see which clients check it.
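One way to see who checks the file is to scan your access log for /ai.txt hits. A minimal sketch, assuming the common "combined" log format; the sample lines and user-agents are invented for illustration:

```python
import re
from collections import Counter

# Combined log format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LOG_RE = re.compile(r'"(?:GET|HEAD) /ai\.txt[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def ai_txt_clients(log_lines):
    """Count user-agents that requested /ai.txt."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET /ai.txt HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(ai_txt_clients(sample))  # Counter({'GPTBot/1.0': 1})
```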
Server-Side Controls (Optional)
Add mitigations beyond policy files. Examples shown for Nginx; adapt to your stack.
Block by User-Agent
# nginx.conf
map $http_user_agent $block_ai {
default 0;
~*(GPTBot|PerplexityBot|ClaudeBot|CCBot|ChatGPT-User|Google-Extended) 1;
}
server {
if ($block_ai) { return 403; }
# ... rest of config
}
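The map regex above can be previewed offline before touching your server. This sketch applies the same case-insensitive pattern to sample user-agents to show which requests would receive a 403 (the sample strings are illustrative only):

```python
import re

# Same case-insensitive pattern as the nginx map above.
BLOCK_AI = re.compile(
    r"GPTBot|PerplexityBot|ClaudeBot|CCBot|ChatGPT-User|Google-Extended",
    re.IGNORECASE,
)

def response_status(user_agent: str) -> int:
    """Return the HTTP status the nginx rule above would produce."""
    return 403 if BLOCK_AI.search(user_agent) else 200

print(response_status("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # 403
print(response_status("Mozilla/5.0 (Windows NT 10.0)"))         # 200
```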
Rate Limit Suspicious Bots
# nginx rate limit example (reuses the $block_ai map above)
# Note: limit_req is not valid inside an "if" block, so key the zone on a
# variable instead: an empty key exempts the request from the limit.
map $block_ai $bot_limit_key {
    default "";
    1 $binary_remote_addr;
}
limit_req_zone $bot_limit_key zone=botlimit:10m rate=1r/s;
server {
    location / {
        limit_req zone=botlimit burst=5 nodelay;
    }
}
Warning: User-agents can be spoofed. Combine with IP reputation, behavior analysis, and WAF rules if this is critical.
Monitoring & Verification
- Log access to /robots.txt, /ai.txt, and high-value content paths.
- Alert on spikes in known AI user-agents (e.g., GPTBot, PerplexityBot, ClaudeBot, CCBot, Google-Extended).
- Watch for mismatches between user-agent and behavior (e.g., excessive parallel fetches).
- Document policy changes and include lastmod in your sitemap.xml for transparency.
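The alerting idea above can be prototyped as a per-window counter over log-derived user-agents. The threshold and the simulated window below are arbitrary assumptions; tune both to your traffic:

```python
from collections import Counter

# Known AI crawler names from this guide.
AI_AGENTS = ("GPTBot", "PerplexityBot", "ClaudeBot", "CCBot", "Google-Extended")

def spike_alerts(user_agents, threshold=100):
    """Flag AI crawlers whose request count in one log window exceeds threshold."""
    counts = Counter()
    for ua in user_agents:
        for bot in AI_AGENTS:
            if bot.lower() in ua.lower():
                counts[bot] += 1
    return [bot for bot, n in counts.items() if n > threshold]

# Simulated window: 150 GPTBot hits (spike) vs. 3 PerplexityBot hits (normal).
window = ["GPTBot/1.0"] * 150 + ["PerplexityBot/1.0"] * 3
print(spike_alerts(window))  # ['GPTBot']
```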
Training vs. Inference
"Training" means using your content to create or update models. "Inference" means models consult your content at runtime. You may choose to block training but allow inference (or vice versa). Policies and enforcement are evolving—review vendor docs and legal guidance regularly.
References & Further Reading
- robots.txt documentation (Google, Bing)
- OpenAI GPTBot policy
- PerplexityBot documentation
- Common Crawl (CCBot) info
- Google-Extended announcement for AI training controls
- Coverage on emerging ai.txt and llms.txt proposals