AI Robots & Crawlers
Managing AI bot access to your content is a critical part of answer engine optimization (AEO). Understanding how the different AI crawlers work, and configuring your robots.txt and llms.txt files accordingly, can significantly impact your visibility.
The New Crawler Landscape
In the early years of search engines, many media companies locked out Google's indexing bots—a decision many came to regret. Today, the same pattern is emerging with AI crawlers.
Key question: Can you afford to stay invisible to AI?
Known AI Crawlers
| Bot Name | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | Real-time browsing on behalf of users |
| ClaudeBot | Anthropic | Claude training |
| PerplexityBot | Perplexity | Search indexing |
| Googlebot | Google | Search, including Gemini/AI Overviews |
| Bingbot | Microsoft | Search, including Copilot |
| CCBot | Common Crawl | General training data |
| Applebot | Apple | Siri and AI features |
| cohere-ai | Cohere | LLM training |
| FacebookBot | Meta | AI training |
robots.txt Configuration
Allowing AI Crawlers
To maximize AI visibility, allow all major AI bots:
# Allow all AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: CCBot
Allow: /
# Sitemap location
Sitemap: https://yourdomain.com/sitemap.xml
Selective Access
If you want to allow AI search but block training:
# Allow real-time search browsing
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
# Block training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
Protecting Specific Content
Block sensitive areas while allowing general access:
User-agent: GPTBot
Allow: /
Disallow: /private/
Disallow: /admin/
Disallow: /api/
User-agent: *
Allow: /
Disallow: /private/
Disallow: /admin/
The llms.txt Standard
llms.txt is a proposed standard (llmstxt.org) aimed at LLM consumers of your site. Where robots.txt controls which crawlers may access what, llms.txt gives them a curated map of your most important content and the context to understand it.
Basic llms.txt Structure
The proposal specifies plain Markdown rather than key-value pairs: an H1 with the site name, a blockquote summary, optional context paragraphs, and H2 sections of annotated links:
# Your Site Name
> Brief description of your site and what it offers.
Optional paragraphs of background an LLM should know up front.
## Documentation
- [Getting Started](https://yourdomain.com/docs/getting-started): Installation and first steps
- [API Reference](https://yourdomain.com/api): Endpoint documentation
## Guides
- [Guides](https://yourdomain.com/guides): How-to articles and tutorials
## Optional
- [Changelog](https://yourdomain.com/changelog): Secondary links that can be skipped when context is tight
llms.txt Best Practices
- Place at site root — /llms.txt
- Keep updated — Reflect current site structure
- Include key entry points — Help AI find important content
- Describe your site clearly — Help AI understand context
Technical Crawling Challenges for AI
JavaScript Rendering Issues
Most AI crawlers fetch raw HTML and do not execute JavaScript, a limitation traditional search crawlers like Googlebot have largely overcome:
What AI crawlers struggle with:
- JavaScript-rendered content
- Dynamic content loading
- Single Page Application (SPA) content
- Third-party scripts (like Google Tag Manager)
- Client-side routing
The Problem
// This content may not be visible to AI crawlers: it is fetched
// client-side, after the initial HTML has already been sent
import { useEffect, useState } from 'react';

function Article() {
  const [content, setContent] = useState(null);
  useEffect(() => {
    fetch('/api/content')
      .then((res) => res.json())
      .then((data) => setContent(data));
  }, []);
  return <article>{content ? content.body : 'Loading…'}</article>;
}
Solutions
Server-Side Rendering (SSR):
// Next.js (pages router): runs on every request, so crawlers
// receive fully rendered HTML without executing any JavaScript
export async function getServerSideProps() {
  const content = await fetchContent(); // your data-fetching helper
  return { props: { content } };
}
Static Generation:
// Next.js (pages router): runs once at build time, so the page
// is served as static HTML
export async function getStaticProps() {
  const content = await fetchContent(); // your data-fetching helper
  return { props: { content } };
}
Progressive Enhancement:
- Serve the core content as plain HTML
- Layer JavaScript on top for interactivity
- Ensure the content still makes sense with JS disabled (see the sketch below)
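A minimal sketch of the idea, with illustrative content: everything that matters is in the HTML itself, and the script only adds behavior:
<!-- Core content ships as plain HTML; crawlers that never run
     JavaScript still see everything that matters -->
<article id="pricing">
  <h2>Pricing</h2>
  <p>Starter plan: $10/month. Pro plan: $25/month.</p>
</article>
<script>
  // Enhancement only: the content above is already readable without this
  document.getElementById('pricing').classList.add('interactive');
</script>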
JSON-LD and AI Crawlers
JSON-LD structured data is increasingly important for AI visibility:
Why JSON-LD Matters
- Provides clean, structured context
- Helps AI understand entities and relationships
- Works without JavaScript execution
- Machine-readable format
Ensuring JSON-LD is Accessible
<!-- Include in HTML head - doesn't require JS -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Organization",
"name": "Your Company",
"url": "https://yourdomain.com"
}
</script>
Avoid: Injecting JSON-LD via JavaScript
// This may never be seen by AI crawlers, because the markup only
// exists after JavaScript runs in a browser
const script = document.createElement('script');
script.type = 'application/ld+json';
script.textContent = JSON.stringify(schema);
document.head.appendChild(script);
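If you build pages with React/Next.js, one way to keep JSON-LD in the initial HTML is to render the script tag on the server. A minimal sketch (pages router; `schema` is whatever object you construct server-side):
import Head from 'next/head';

export default function Page({ schema }) {
  return (
    <Head>
      {/* Rendered on the server, so the JSON-LD is in the raw HTML */}
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(schema) }}
      />
    </Head>
  );
}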
Verifying Crawler Access
Testing Your Configuration
- Use robots.txt testers — Verify rules work as intended (a quick script sketch follows this list)
- Check server logs — Look for AI bot visits
- Manual testing — Query AI platforms and check citations
- Structured data validators — Ensure markup is valid
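As a quick automated check, a small script can fetch your robots.txt and report which AI bots have explicit rules. A minimal sketch (Node 18+; the bot list and output wording are illustrative):
const AI_BOTS = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'CCBot'];

async function checkRobots(origin) {
  const res = await fetch(new URL('/robots.txt', origin));
  const text = (await res.text()).toLowerCase();
  for (const bot of AI_BOTS) {
    const hasRules = text.includes(`user-agent: ${bot.toLowerCase()}`);
    console.log(`${bot}: ${hasRules ? 'explicit rules found' : 'falls back to User-agent: *'}`);
  }
}

checkRobots('https://yourdomain.com');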
Server Log Analysis
Look for these user-agent tokens in your access logs (a small counting sketch follows):
GPTBot/1.0
ClaudeBot/1.0
PerplexityBot/1.0
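To get a rough picture of crawl volume, you can count requests per bot. A minimal sketch (Node; assumes an access log at ./access.log with the user agent recorded on each line):
import { readFileSync } from 'node:fs';

const AI_BOTS = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'CCBot'];
const lines = readFileSync('./access.log', 'utf8').split('\n');

for (const bot of AI_BOTS) {
  const hits = lines.filter((line) => line.includes(bot)).length;
  console.log(`${bot}: ${hits} requests`);
}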
Common Issues
| Issue | Symptom | Solution |
|---|---|---|
| Blocked crawlers | No AI platform traffic | Check robots.txt |
| JS-only content | Content not cited | Implement SSR |
| Slow responses | Incomplete crawling | Optimize performance |
| Invalid schema | Reduced visibility | Validate structured data |
Content Delivery Considerations
CDN and Caching
Ensure AI crawlers receive cacheable, consistent responses:
Cache-Control: public, max-age=3600
Rate Limiting
Take care that rate limiting doesn't lock out legitimate AI bots (a middleware sketch follows this list):
- Whitelist known AI crawler IPs
- Use appropriate rate limits
- Monitor for crawl errors
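One way to implement this in an Express app is to exempt known crawler user agents from the limiter. A hedged sketch using express-rate-limit; note that user agents can be spoofed, so for stricter verification check requests against the IP ranges the vendors publish:
import express from 'express';
import rateLimit from 'express-rate-limit';

// Known AI crawler user-agent tokens (illustrative; keep this list current)
const AI_BOT_PATTERN = /GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|CCBot/i;

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1-minute window
  max: 100,            // 100 requests per IP per window
  skip: (req) => AI_BOT_PATTERN.test(req.get('user-agent') || ''),
});

const app = express();
app.use(limiter);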
Geographic Restrictions
If using geo-restrictions:
- Consider crawler origin locations
- Whitelist crawler IP ranges
- Test from multiple locations
Future Considerations
The AI crawler landscape is evolving rapidly:
- New bots emerge regularly — Monitor announcements
- Standards are developing — llms.txt may evolve
- Regulations vary — EU AI Act affects some crawlers
- Best practices change — Stay current with developments
Action Checklist
- Audit current robots.txt for AI bot rules
- Create/update llms.txt file
- Test JavaScript rendering accessibility
- Verify JSON-LD is in static HTML
- Check server logs for AI crawler visits
- Monitor AI platform citations
- Keep crawler rules updated