llms.txt is the trendy file of 2025. We've recommended it ourselves. But every week or two we audit a site whose owner has carefully crafted an llms.txt while their robots.txt silently blocks the AI crawlers they want to attract. This is a short guide to why robots.txt still matters more, and how to make the two files work together.

Quick recap on what each does

  • robots.txt is access control. It tells crawlers which paths they're allowed to fetch. Honored by polite crawlers; ignored by abusers. The original spec dates from 1994, and it is still the most powerful file on your domain.
  • llms.txt is citation guidance. It tells AI crawlers what your site is and how to cite you. Not access control. Proposed in late 2024, increasingly recognized by ChatGPT, Claude, and Perplexity.

Note: llms.txt is a hint, robots.txt is a rule. When they conflict, robots.txt wins for every crawler that respects either.

The trap we see most often

A surprising number of CMS defaults and security plugins block AI crawler user agents in robots.txt, often without the site owner realizing. Common culprits:

  • WordPress security plugins that block "AI scraping bots" by default
  • Cloudflare's "Block AI Scrapers" toggle, which blocks those bots at the edge and so behaves like a robots.txt Disallow you never wrote
  • Custom robots.txt files copied from blog posts in early 2024 that recommended blocking GPTBot
  • Migration accidents where the production robots.txt is the dev/staging robots.txt

If you've carefully written an llms.txt to court AI engines and your robots.txt blocks them, the llms.txt is decorative. The AI crawler never gets to read it.
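To make the trap concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt contents and the example.com URL are hypothetical stand-ins for what a security plugin might generate, not output from any specific plugin.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind a plugin or old blog post might leave behind.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin
"""


def can_read_llms_txt(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots_txt lets user_agent fetch /llms.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, "https://example.com/llms.txt")


for bot in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    print(bot, can_read_llms_txt(ROBOTS_TXT, bot))
# GPTBot False
# ClaudeBot False
# PerplexityBot True
```

The two blocked crawlers are not allowed to request /llms.txt at all, so nothing written in that file reaches them.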

The bots that matter

At the time of writing (mid-2025), the AI crawlers worth knowing by name:

GPTBot           # OpenAI training crawler
ChatGPT-User     # OpenAI live-fetch (ChatGPT browsing)
OAI-SearchBot    # OpenAI SearchGPT
ClaudeBot        # Anthropic training crawler
Claude-Web       # Anthropic live-fetch
PerplexityBot    # Perplexity AI search
Perplexity-User  # Perplexity live-fetch
Bytespider       # ByteDance/TikTok AI training
Google-Extended  # Google's Bard/Gemini training opt-out flag
CCBot            # Common Crawl (used as training source by many)

A reasonable default robots.txt

For an SEO-focused site that wants to be cited but not used as training data:

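User-agent: *
Disallow: /admin
Disallow: /api/
Allow: /

# Allow live-fetch and search AI crawlers (they cite, don't train).
# A crawler obeys only the most specific group that names it, so the
# Disallow rules from the * group are repeated here.
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: OAI-SearchBot
Disallow: /admin
Disallow: /api/
Allow: /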

# Block training crawlers (uncomment to enable)
# User-agent: GPTBot
# Disallow: /
#
# User-agent: ClaudeBot
# Disallow: /
#
# User-agent: CCBot
# Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml

Keep the training-bot block commented unless you have a strong reason. For most companies, being in training data is a long-term moat. If your content ends up in a model's parametric knowledge, you'll be referenced even without retrieval.
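One detail of group matching is worth checking in any file like this: a crawler that matches a named group follows that group only and ignores the * group, so path rules such as Disallow: /admin have to be repeated in each named group. A minimal sketch with Python's urllib.robotparser, using two hypothetical files and a hypothetical /pricing path:

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The named group carries only "Allow: /", so the * rules never apply to it.
NOT_REPEATED = """\
User-agent: *
Disallow: /admin

User-agent: ChatGPT-User
Allow: /
"""

# Same file with the Disallow repeated inside the named group.
# Disallow is listed before Allow because Python's parser applies the
# first matching rule rather than the longest match.
REPEATED = """\
User-agent: *
Disallow: /admin

User-agent: ChatGPT-User
Disallow: /admin
Allow: /
"""

print(allowed(NOT_REPEATED, "ChatGPT-User", "/admin"))  # True
print(allowed(REPEATED, "ChatGPT-User", "/admin"))      # False
print(allowed(REPEATED, "ChatGPT-User", "/pricing"))    # True
```

Listing Disallow lines ahead of a blanket Allow is harmless for longest-match parsers and keeps first-match parsers, such as the one used here, from letting everything through.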

How to check yours

  • Fetch your live robots.txt from a clean browser (not your CMS preview): https://yourdomain.com/robots.txt. Confirm the contents are what you expect.
  • Test each AI crawler user-agent against your robots.txt with a parser. Search Console's robots.txt report shows what Google fetched but does not test other user agents, so use a standalone parser or one of the free online testers.
  • Check Cloudflare / Vercel / your CDN for AI-blocking toggles. These don't modify your robots.txt file but they DO block at the edge, with the same effect.
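The parser step above can be scripted. The sketch below runs every crawler name from our list against a robots.txt string using Python's urllib.robotparser. The sample file and the audit helper are hypothetical, and it only checks robots.txt rules: edge blocking at a CDN will not show up here.

```python
from urllib.robotparser import RobotFileParser

# Crawler names from the list earlier in this post.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Perplexity-User", "Bytespider", "Google-Extended", "CCBot",
]


def audit(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Map each AI crawler name to whether robots_txt lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_CRAWLERS}


# Hypothetical file: paste the contents of your live robots.txt here instead.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin
"""

report = audit(SAMPLE)
for bot, ok in report.items():
    print(f"{bot:16} {'allowed' if ok else 'BLOCKED'}")
```

To audit a live site, RobotFileParser can also fetch the file itself via set_url() and read(). Treat that as a convenience only: the fetch goes out with Python's default user agent, and a 401 or 403 response makes the parser report everything as disallowed.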

Then, and only then, write your llms.txt

Once you've confirmed AI crawlers can actually reach your site, the llms.txt becomes useful. Don't skip the prerequisite. Our llms.txt guide covers the format.

Order of operations: robots.txt → confirm allowance → llms.txt → schema → content. Build from the ground up. The reverse is decoration on a closed door.