llms.txt is the trendy file of 2025. We've recommended it ourselves. But every week or two we'll audit a site whose owner has carefully crafted an llms.txt while their robots.txt is silently blocking the AI crawlers they want to attract. A short guide to why robots.txt still matters more, and how to make them work together.
Quick recap on what each does
- robots.txt is access control. It tells crawlers which paths they're allowed to fetch. Enforced by polite crawlers; ignored by abusers. The original spec dates from 1994, and it's still the most powerful file on your domain.
- llms.txt is citation guidance. It tells AI crawlers what your site is and how to cite you. Not access control. Proposed in late 2024, increasingly recognized by ChatGPT, Claude, and Perplexity.
Note: llms.txt is a hint; robots.txt is a rule. When they conflict, robots.txt wins for every crawler that respects either.
The trap we see most often
A surprising number of CMS defaults and security plugins block AI crawler user agents in robots.txt — often without the site owner realizing. Common culprits:
- WordPress security plugins that block "AI scraping bots" by default
- Cloudflare's "Block AI Scrapers" toggle, which blocks those crawlers at the edge, with the same effect as extra rules in your robots.txt
- Custom robots.txt files copied from blog posts in early 2024 that recommended blocking GPTBot
- Migration accidents where the production robots.txt is the dev/staging robots.txt
If you've carefully written an llms.txt to court AI engines and your robots.txt blocks them, the llms.txt is decorative. The AI crawler never gets to read it.
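To make the trap concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `agents_blocked_from` helper and the sample file are hypothetical, written for illustration; the crawler tokens come from the list in this article. Python's parser is simpler than what production crawlers run, so treat the result as a quick check rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens taken from the list in this article.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "PerplexityBot",
]


def agents_blocked_from(robots_txt, path="/llms.txt", agents=AI_AGENTS):
    """Return the agents that this robots.txt forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Hypothetical robots.txt of the kind a plugin or an old blog post might produce.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

print(agents_blocked_from(SAMPLE))  # prints ['GPTBot', 'ClaudeBot']
```

With this sample, GPTBot and ClaudeBot are never allowed to request /llms.txt, so whatever that file says goes unread by them.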
The bots that matter
At the time of writing (mid-2025), the AI crawlers worth knowing by name:
    GPTBot            # OpenAI training crawler
    ChatGPT-User      # OpenAI live-fetch (ChatGPT browsing)
    OAI-SearchBot     # OpenAI SearchGPT
    ClaudeBot         # Anthropic training crawler
    Claude-Web        # Anthropic live-fetch
    PerplexityBot     # Perplexity AI search
    Perplexity-User   # Perplexity live-fetch
    Bytespider        # ByteDance/TikTok AI training
    Google-Extended   # Google's Bard/Gemini training opt-out flag
    CCBot             # Common Crawl (used as training source by many)
A reasonable default robots.txt
For an SEO-focused site that wants to be cited, with an optional block for owners who don't want their content used as training data:
    User-agent: *
    Allow: /
    Disallow: /admin
    Disallow: /api/

    # Allow live-fetch AI crawlers (they cite, don't train)
    User-agent: ChatGPT-User
    Allow: /

    User-agent: Claude-Web
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    # Block training crawlers (uncomment to enable)
    # User-agent: GPTBot
    # Disallow: /
    #
    # User-agent: ClaudeBot
    # Disallow: /
    #
    # User-agent: CCBot
    # Disallow: /

    Sitemap: https://yourdomain.com/sitemap.xml
Keep the training-bot block commented unless you have a strong reason. For most companies, being in training data is a long-term moat. If your content ends up in a model's parametric knowledge, you'll be referenced even without retrieval.
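One detail of the file above is worth checking before you ship it. Under the Robots Exclusion Protocol (RFC 9309), a crawler that finds a group naming it follows only that group; the `*` group applies only to crawlers with no group of their own. The sketch below illustrates that selection step with a deliberately tiny parser. `parse_groups` and `rules_for` are hypothetical helpers, they match tokens exactly rather than the way a real crawler would, and `ROBOTS` is a trimmed copy of the default above.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    agents: list[str] = []  # user-agent lines of the group being read
    seen_rule = False       # True once the current group has a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a user-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def rules_for(robots_txt: str, agent: str) -> list[tuple[str, str]]:
    """Rules from the agent's own group, or from the * group if it has none."""
    groups = parse_groups(robots_txt)
    return groups.get(agent.lower(), groups.get("*", []))


# Trimmed copy of the default file from this article.
ROBOTS = """\
User-agent: *
Allow: /
Disallow: /admin
Disallow: /api/

User-agent: ChatGPT-User
Allow: /

# User-agent: GPTBot
# Disallow: /
"""

print(rules_for(ROBOTS, "ChatGPT-User"))  # only its own group: [('allow', '/')]
print(rules_for(ROBOTS, "GPTBot"))        # no group of its own, so the * rules
```

In other words, the /admin and /api/ lines in the `*` group do not carry over to the named live-fetch crawlers. If you want those paths kept from them too, repeat the Disallow lines inside each named group.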
How to check yours
- Fetch your live robots.txt from a clean browser (not your CMS preview): https://yourdomain.com/robots.txt. Confirm the contents are what you expect.
- Test each AI crawler user-agent against your robots.txt with a parser. Google has one in Search Console; there are also free online ones.
- Check Cloudflare / Vercel / your CDN for AI-blocking toggles. These don't modify your robots.txt file, but they DO block at the edge, with the same effect.
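The edge check in the last step can be roughed out in code: request the same URL once with a browser-like User-Agent and once with a crawler token, then compare status codes. This is a sketch under assumptions. `fetch_status` and `edge_verdict` are hypothetical helpers, real crawlers send longer User-Agent strings than the bare token, and some CDNs identify bots by IP address rather than by header, so a clean result here does not prove the real crawler gets through.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises on 4xx/5xx; the code is still useful


def edge_verdict(browser_status: int, bot_status: int) -> str:
    """Compare a browser-like request with a crawler-like one."""
    if browser_status >= 400:
        return "inconclusive"  # the page fails for everyone, not just bots
    if bot_status >= 400:
        return "likely blocked at the edge"
    return "no User-Agent block seen"
```

To use it, call `fetch_status("https://yourdomain.com/robots.txt", "Mozilla/5.0")` and `fetch_status("https://yourdomain.com/robots.txt", "GPTBot")`, then pass both results to `edge_verdict`. A 200 for the browser and a 403 for the bot token points at a CDN or firewall rule rather than at your robots.txt file.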
Then, and only then, write your llms.txt
Once you've confirmed AI crawlers can actually reach your site, the llms.txt becomes useful. Don't skip the prerequisite. Our llms.txt guide covers the format.
Order of operations: robots.txt → confirm allowance → llms.txt → schema → content. Build from the ground up. The reverse is decoration on a closed door.