AI Opt-Out Controls
Comprehensive templates and server-side controls to limit AI crawling, training data collection, and content referencing from your website.
Reality Check
- robots.txt is widely respected by reputable crawlers, but not universally enforced.
- ai.txt and llms.txt are emerging proposals; adoption varies across vendors.
- Some crawlers may spoof or rotate user-agents. Consider server-side controls and monitoring.
- "Training" vs. "inference" are different activities—decide which you want to allow or block.
- Recent reports indicate some bots evade blocks or ignore robots.txt entirely.
Training vs. On-Demand Fetch
Training Crawlers
- GPTBot: OpenAI model training
- CCBot: Common Crawl corpus used for training
- ClaudeBot: Anthropic training data
- Usually respect robots.txt
- Bulk, systematic crawling
Inference/User Crawlers
- ChatGPT-User: real-time browsing
- PerplexityBot: search results
- Claude-User: user-requested content
- May ignore robots.txt for user requests
- Targeted, on-demand fetching
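The training/inference split above can be sketched as a simple user-agent classifier. The crawler names come from the lists above; the substring matching is a simplified illustration, not production bot detection (user-agents can be spoofed):

```python
import re

# Crawler names taken from the lists above; real detection needs more signals.
TRAINING_BOTS = re.compile(r"GPTBot|CCBot|ClaudeBot", re.IGNORECASE)
INFERENCE_BOTS = re.compile(r"ChatGPT-User|PerplexityBot|Claude-User", re.IGNORECASE)

def classify_crawler(user_agent: str) -> str:
    """Return 'training', 'inference', or 'other' for a User-Agent string."""
    # Check inference names first: none of them collide with the training names,
    # but explicit ordering keeps the intent clear.
    if INFERENCE_BOTS.search(user_agent):
        return "inference"
    if TRAINING_BOTS.search(user_agent):
        return "training"
    return "other"

print(classify_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))   # training
print(classify_crawler("ChatGPT-User/1.0"))                       # inference
print(classify_crawler("Googlebot/2.1"))                          # other
```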
robots.txt Templates
Place at /robots.txt. These examples target common AI-related crawlers. Adjust to your policy.
Block Specific AI Crawlers
# Block selected AI crawlers from all content
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
# Google-Extended is used for AI training access signals
User-agent: Google-Extended
Disallow: /
# Allow everyone else
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
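Before deploying a template like this, you can sanity-check it with Python's standard urllib.robotparser. The rules string below is a minimal excerpt of the template above, parsed in-memory:

```python
import urllib.robotparser

# Minimal excerpt of the template above, parsed without fetching anything.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# GPTBot is disallowed everywhere; other agents fall through to the * group.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/article"))       # False
print(rp.can_fetch("SomeBrowser", "https://yourdomain.com/article"))  # True
```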
Block All Except Major Search
# Allow conventional search engines, block common AI training crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: *
Disallow: /
ai.txt (Emerging)
Place at /ai.txt. The ai.txt file communicates site AI usage policies. It is not a universal standard, but some vendors check it.
# ai.txt — Site policy for AI usage
# Scope: training (model development) vs inference (runtime usage)
owner: Your Company Name
contact: legal@yourdomain.com
policy: https://yourdomain.com/policies/ai-usage
# Explicit preferences
allow: inference
disallow: training
# Notes for crawlers
note: Please respect robots.txt and do not store content for model training.
updated: 2025-01-01
Tip: Keep your policy URL live and human-readable. Log requests to /ai.txt to see which clients check it.
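One way to see who checks the file is to scan your access log for /ai.txt hits. A minimal sketch, assuming the common "combined" log format; the sample lines and user-agents are invented for illustration:

```python
import re
from collections import Counter

# Combined log format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LOG_RE = re.compile(r'"(?:GET|HEAD) /ai\.txt[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def ai_txt_clients(log_lines):
    """Count user-agents that requested /ai.txt."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET /ai.txt HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(ai_txt_clients(sample))  # Counter({'GPTBot/1.0': 1})
```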
Server-Side Controls (Optional)
Add mitigations beyond policy files. Examples shown for Nginx; adapt to your stack.
Block by User-Agent
# nginx.conf
map $http_user_agent $block_ai {
default 0;
~*(GPTBot|PerplexityBot|ClaudeBot|CCBot|ChatGPT-User|Google-Extended) 1;
}
server {
if ($block_ai) { return 403; }
# ... rest of config
}
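The map regex above can be previewed offline before touching your server. This sketch applies the same case-insensitive pattern to sample user-agents to show which requests would receive a 403 (the sample strings are illustrative only):

```python
import re

# Same case-insensitive pattern as the nginx map above.
BLOCK_AI = re.compile(
    r"GPTBot|PerplexityBot|ClaudeBot|CCBot|ChatGPT-User|Google-Extended",
    re.IGNORECASE,
)

def response_status(user_agent: str) -> int:
    """Return the HTTP status the nginx rule above would produce."""
    return 403 if BLOCK_AI.search(user_agent) else 200

print(response_status("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # 403
print(response_status("Mozilla/5.0 (Windows NT 10.0)"))         # 200
```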
Rate Limit Suspicious Bots
# nginx rate limit example (reuses the $block_ai map above)
# Note: limit_req is not valid inside an "if" block, so key the zone on a
# variable instead: an empty key exempts the request from the limit.
map $block_ai $bot_limit_key {
    default "";
    1 $binary_remote_addr;
}
limit_req_zone $bot_limit_key zone=botlimit:10m rate=1r/s;
server {
    location / {
        limit_req zone=botlimit burst=5 nodelay;
    }
}
Warning: User-agents can be spoofed. Combine with IP reputation, behavior analysis, and WAF rules if this is critical.
Monitoring & Verification
- Log access to /robots.txt, /ai.txt, and high-value content paths.
- Alert on spikes in known AI user-agents (e.g., GPTBot, PerplexityBot, ClaudeBot, CCBot, Google-Extended).
- Watch for mismatches between user-agent and behavior (e.g., excessive parallel fetches).
- Document policy changes and include lastmod in your sitemap.xml for transparency.
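The alerting idea above can be prototyped as a per-window counter over log-derived user-agents. The threshold and the simulated window below are arbitrary assumptions; tune both to your traffic:

```python
from collections import Counter

# Known AI crawler names from this guide.
AI_AGENTS = ("GPTBot", "PerplexityBot", "ClaudeBot", "CCBot", "Google-Extended")

def spike_alerts(user_agents, threshold=100):
    """Flag AI crawlers whose request count in one log window exceeds threshold."""
    counts = Counter()
    for ua in user_agents:
        for bot in AI_AGENTS:
            if bot.lower() in ua.lower():
                counts[bot] += 1
    return [bot for bot, n in counts.items() if n > threshold]

# Simulated window: 150 GPTBot hits (spike) vs. 3 PerplexityBot hits (normal).
window = ["GPTBot/1.0"] * 150 + ["PerplexityBot/1.0"] * 3
print(spike_alerts(window))  # ['GPTBot']
```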
Training vs. Inference
"Training" means using your content to create or update models. "Inference" means models consult your content at runtime. You may choose to block training but allow inference (or vice versa). Policies and enforcement are evolving—review vendor docs and legal guidance regularly.
References & Further Reading
- robots.txt documentation (Google, Bing)
- OpenAI GPTBot policy
- PerplexityBot documentation
- Common Crawl (CCBot) info
- Google-Extended announcement for AI training controls
- Coverage on emerging ai.txt and llms.txt proposals