AI Robots & Crawlers

Managing AI bot access to your content is a critical part of AEO. Understanding how different AI crawlers work and properly configuring your robots.txt and llms.txt files can significantly impact your visibility.

The New Crawler Landscape

In the early years of search engines, many media companies locked out Google's indexing bots—a decision many came to regret. Today, the same pattern is emerging with AI crawlers.

Key question: Can you afford to stay invisible to AI?

Known AI Crawlers

| Bot Name | Company | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | ChatGPT training and search |
| ChatGPT-User | OpenAI | Real-time browsing |
| ClaudeBot | Anthropic | Claude training |
| PerplexityBot | Perplexity | Search indexing |
| Googlebot | Google | Search, including Gemini/AI Overviews |
| Bingbot | Microsoft | Search, including Copilot |
| CCBot | Common Crawl | General training data |
| Applebot | Apple | Siri and AI features |
| cohere-ai | Cohere | LLM training |
| FacebookBot | Meta | AI training |

robots.txt Configuration

Allowing AI Crawlers

To maximize AI visibility, allow all major AI bots:

# Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: CCBot
Allow: /

# Sitemap location
Sitemap: https://yourdomain.com/sitemap.xml

Selective Access

If you want to allow AI search but block training:

# Allow real-time search browsing
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Protecting Specific Content

Block sensitive areas while allowing general access:

User-agent: GPTBot
Allow: /
Disallow: /private/
Disallow: /admin/
Disallow: /api/

User-agent: *
Allow: /
Disallow: /private/
Disallow: /admin/

The llms.txt Standard

llms.txt is a proposed standard aimed at LLM crawlers and assistants. Where robots.txt controls which bots may access your site, llms.txt is a markdown file that summarizes your site and points AI systems to its most important pages.

Basic llms.txt Structure

# Your Site Name

> Brief description of your site

Contact: contact@yourdomain.com

## Docs

- [Getting Started](https://yourdomain.com/docs/getting-started): Setup and first steps
- [API Reference](https://yourdomain.com/api): API documentation

## Guides

- [Guides](https://yourdomain.com/guides): How-to guides and tutorials

## Optional

- [Sitemap](https://yourdomain.com/sitemap.xml): Full list of pages

The H1 title is the only required element. The blockquote summary and the H2 link sections are optional, and links under an Optional heading mark secondary material that AI tools can skip when context is limited.

llms.txt Best Practices

  1. Place at site root — /llms.txt (see the quick check after this list)
  2. Keep updated — Reflect current site structure
  3. Include key entry points — Help AI find important content
  4. Describe your site clearly — Help AI understand context
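
A quick way to confirm the file is actually served from the site root (assumes Node 18+ or a browser console for the built-in fetch):

// Confirm /llms.txt resolves at the site root and returns 200
fetch('https://yourdomain.com/llms.txt')
  .then((res) => console.log(res.status, res.headers.get('content-type')));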

Technical Crawling Challenges for AI

JavaScript Rendering Issues

AI crawlers have limitations compared to traditional search crawlers:

What AI crawlers struggle with:

  • JavaScript-rendered content
  • Dynamic content loading
  • Single Page Application (SPA) content
  • Third-party scripts (like Google Tag Manager)
  • Client-side routing

The Problem

// This content may not be visible to AI crawlers: it is fetched
// and rendered only after client-side JavaScript runs
const [content, setContent] = useState(null);

useEffect(() => {
  fetch('/api/content')
    .then((res) => res.json())
    .then((data) => setContent(data));
}, []);

Solutions

Server-Side Rendering (SSR):

// Next.js example - content rendered server-side
export async function getServerSideProps() {
  const content = await fetchContent();
  return { props: { content } };
}

Static Generation:

// Next.js example - content generated at build time
export async function getStaticProps() {
  const content = await fetchContent();
  return { props: { content } };
}

Progressive Enhancement:

  • Serve core content as HTML
  • Enhance with JavaScript
  • Ensure content works without JS (see the sketch below)
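
A small sketch of the pattern: the content already exists in the server-rendered HTML, and JavaScript only layers optional behavior on top (the element IDs here are illustrative):

// The article content is already in the server-sent HTML, so crawlers
// that skip JavaScript still see it; this script adds convenience only
document.addEventListener('DOMContentLoaded', () => {
  const toggle = document.querySelector('#toc-toggle');      // illustrative ID
  const toc = document.querySelector('#table-of-contents');  // illustrative ID
  if (!toggle || !toc) return; // the page works fine without this enhancement
  toggle.addEventListener('click', () => {
    toc.hidden = !toc.hidden;
  });
});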

JSON-LD and AI Crawlers

JSON-LD structured data is increasingly important for AI visibility:

Why JSON-LD Matters

  • Provides clean, structured context
  • Helps AI understand entities and relationships
  • Works without JavaScript execution
  • Machine-readable format

Ensuring JSON-LD is Accessible

<!-- Include in HTML head - doesn't require JS -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Company",
  "url": "https://yourdomain.com"
}
</script>

Avoid: Injecting JSON-LD via JavaScript

// This markup only exists after client-side JavaScript runs,
// so AI crawlers that skip JS never see it
const script = document.createElement('script');
script.type = 'application/ld+json';
script.textContent = JSON.stringify(schema);
document.head.appendChild(script);
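
If you need to generate JSON-LD dynamically, render it on the server so the tag is part of the HTML the crawler receives. A sketch in the style of the earlier Next.js examples (the schema prop is illustrative):

// Rendered server-side: the JSON-LD is in the initial HTML, visible
// even to crawlers that never execute JavaScript
export default function Page({ schema }) {
  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(schema) }}
      />
      <main>{/* page content */}</main>
    </>
  );
}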

Verifying Crawler Access

Testing Your Configuration

  1. Use robots.txt testers — Verify rules work as intended (see the scripted check after this list)
  2. Check server logs — Look for AI bot visits
  3. Manual testing — Query AI platforms and check citations
  4. Structured data validators — Ensure markup is valid
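
For a scripted check, the robots-parser package on npm can evaluate your rules against specific URLs and user agents (a sketch; the rules shown match the earlier example):

// Check whether a given bot may fetch a URL under your robots.txt rules
const robotsParser = require('robots-parser');

const robotsTxt = `
User-agent: GPTBot
Allow: /
Disallow: /private/
`;

const robots = robotsParser('https://yourdomain.com/robots.txt', robotsTxt);
console.log(robots.isAllowed('https://yourdomain.com/docs', 'GPTBot'));      // true
console.log(robots.isAllowed('https://yourdomain.com/private/x', 'GPTBot')); // false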

Server Log Analysis

Look for these user-agent tokens in your logs (each appears inside a longer user-agent string):

GPTBot/1.0
ClaudeBot/1.0
PerplexityBot/1.0
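
To quantify visits, a minimal Node sketch that counts hits per bot in a plain-text access log (the log path and bot list are assumptions; adapt them to your setup):

// Count requests per AI crawler in an access log
const { readFileSync } = require('node:fs');

const BOTS = ['GPTBot', 'ClaudeBot', 'PerplexityBot', 'CCBot'];
const lines = readFileSync('/var/log/nginx/access.log', 'utf8').split('\n');

const counts = Object.fromEntries(BOTS.map((bot) => [bot, 0]));
for (const line of lines) {
  for (const bot of BOTS) {
    if (line.includes(bot)) counts[bot] += 1;
  }
}
console.log(counts); // e.g. { GPTBot: 42, ClaudeBot: 17, ... }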

Common Issues

| Issue | Symptom | Solution |
| --- | --- | --- |
| Blocked crawlers | No AI platform traffic | Check robots.txt |
| JS-only content | Content not cited | Implement SSR |
| Slow responses | Incomplete crawling | Optimize performance |
| Invalid schema | Reduced visibility | Validate structured data |

Content Delivery Considerations

CDN and Caching

Ensure AI crawlers receive cacheable, consistent responses:

Cache-Control: public, max-age=3600
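
If you deploy with Next.js as in the earlier examples, one way to send this header on every route is through next.config.js (a sketch; tune max-age to how often your content changes):

// next.config.js: attach a Cache-Control header to all routes
module.exports = {
  async headers() {
    return [
      {
        source: '/:path*',
        headers: [{ key: 'Cache-Control', value: 'public, max-age=3600' }],
      },
    ];
  },
};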

Rate Limiting

Take care not to throttle legitimate AI bots (a sketch follows this list):

  • Whitelist known AI crawler IPs
  • Use appropriate rate limits
  • Monitor for crawl errors
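
A hedged sketch of the idea as Express-style middleware (the app and rateLimiter objects are assumed, not part of any specific library setup; user-agent strings can be spoofed, so pair this with IP verification where vendors publish ranges):

// Bypass the general rate limiter for known AI crawler user agents
const AI_BOT_PATTERN = /GPTBot|ClaudeBot|PerplexityBot|CCBot|Applebot/i;

app.use((req, res, next) => {
  const ua = req.headers['user-agent'] || '';
  if (AI_BOT_PATTERN.test(ua)) {
    return next(); // recognized crawler: skip throttling
  }
  return rateLimiter(req, res, next); // hypothetical general limiter
});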

Geographic Restrictions

If using geo-restrictions:

  • Consider crawler origin locations
  • Whitelist crawler IP ranges
  • Test from multiple locations

Future Considerations

The AI crawler landscape is evolving rapidly:

  • New bots emerge regularly — Monitor announcements
  • Standards are developing — llms.txt may evolve
  • Regulations vary — EU AI Act affects some crawlers
  • Best practices change — Stay current with developments

Action Checklist

  • Audit current robots.txt for AI bot rules
  • Create/update llms.txt file
  • Test JavaScript rendering accessibility
  • Verify JSON-LD is in static HTML
  • Check server logs for AI crawler visits
  • Monitor AI platform citations
  • Keep crawler rules updated
