TL;DR
AI crawler management means deciding, per bot, whether to allow, block, or selectively permit each AI company's crawler to fetch your content. The seven crawlers that matter most in 2026 are GPTBot, OAI-SearchBot, and ChatGPT-User (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Google AI), and Applebot-Extended (Apple Intelligence). Each has a distinct purpose (training, search, or on-demand fetching), and the right access policy depends on whether your content is a competitive moat or a marketing surface. This guide gives you ready-to-paste robots.txt blocks, a decision framework, server-side enforcement patterns, and a verification checklist. Default recommendation: marketing sites allow everything; sites with content worth protecting allow search and on-demand bots and block training-only bots. In both cases, document your stance in a public AI usage policy.
AI crawler management is no longer optional in 2026. Every site owner needs an intentional answer to one question: which AI bots are allowed to fetch your content, and for what purpose?
Three years ago the only crawler decisions that mattered were Googlebot and Bingbot. As of mid-2026, a working AI crawler policy has to address at least seven distinct user agents from five companies — and each one behaves differently. Block the wrong bot and you disappear from ChatGPT search citations. Allow the wrong bot and your premium content trains a competitor's model.
This guide walks through the full AI crawler robots.txt landscape: every major bot, its purpose, the exact directives that work, and the trade-offs of blocking versus allowing. It pairs with our llms.txt guide (curation for AI) and our agentic AI optimization playbook (preparing for AI agents that act on your site).
I run Verlua, a web development studio that ships AI-aware crawler configuration with every new build. The patterns below are what we deploy on client sites — not theory, but the exact blocks that pass current verification tests in May 2026.
The Seven AI Crawlers That Matter in 2026
Before writing a single rule, you need a clear inventory of what each crawler does. Bot names sound similar, but their behavior — and the cost of getting the rule wrong — is not.
GPTBot (OpenAI Training)
GPTBot is the dedicated training crawler from OpenAI, launched in August 2023 and documented at platform.openai.com/docs/bots. It fetches publicly accessible content and feeds it into the pipeline that trains future GPT model weights. Blocking GPTBot does not affect ChatGPT search citations because OpenAI uses a separate user agent for that.
OAI-SearchBot (ChatGPT Search)
OAI-SearchBot is the user agent that powers ChatGPT search results and citations. When a ChatGPT user runs a web-enabled query, this crawler fetches and indexes pages in real time, then the citation appears in the chat response. If your priority is showing up inside ChatGPT answers, this is the bot you must allow.
ChatGPT-User (Live Browsing)
ChatGPT-User fires when a single ChatGPT user clicks through to your page or asks ChatGPT to browse a specific URL. It is closer to a human visit than a crawl — a one-off fetch driven by an explicit user request. Most operators allow it because blocking it breaks the user's ability to share or read your page inside ChatGPT.
ClaudeBot (Anthropic)
ClaudeBot is Anthropic's general-purpose crawler. Unlike OpenAI's split-agent model, Anthropic has historically used a single user agent for both training and Claude product features such as projects and citations. Anthropic documents the bot in its help center at support.anthropic.com. You cannot currently block training separately from citation surface for Claude.
PerplexityBot (Perplexity Index)
PerplexityBot crawls and indexes content that powers Perplexity's answer engine. Perplexity also operates Perplexity-User for live, user-initiated fetches. Blocking PerplexityBot removes you from Perplexity citations almost entirely. Perplexity publishes IP ranges and documentation at docs.perplexity.ai/guides/bots.
Google-Extended (Google AI Training)
Google-Extended is Google's opt-out user agent for AI training. Adding Disallow: / for Google-Extended stops Gemini, Vertex AI, and other Google AI products from training on your content. It does not affect Googlebot or Google AI Overviews — those still index and surface your pages through standard Google Search. Google documents the bot at developers.google.com/search/docs/crawling-indexing/overview-google-crawlers.
Applebot-Extended (Apple Intelligence)
Applebot-Extended is Apple's training opt-out, separate from the regular Applebot that powers Siri search and Spotlight. It was announced in mid-2024 and documented at support.apple.com/en-us/119829. Apple Intelligence and the on-device foundation models do not train on content where Applebot-Extended is disallowed.
The AI Crawler Decision Framework
Whether to block or allow a given bot depends on three questions about your content and business model. Run each crawler through this framework before writing rules.
- 1. Is the content a competitive moat or a marketing surface? Original research, copyrighted writing, paid courses, and proprietary datasets are moats — blocking training crawlers protects them. Service pages, blog posts, and product descriptions are marketing surfaces — being visible inside AI answers is more valuable than the marginal cost of training.
- 2. Does the bot directly generate visibility? Search and on-demand bots (OAI-SearchBot, PerplexityBot, ChatGPT-User) create direct citation surface. Blocking them removes you from AI answers. Training bots (GPTBot, ClaudeBot, Google-Extended) do not create direct citations — their value is long-term presence in model knowledge.
- 3. Is there a separate training and search agent? OpenAI and Google split training from search. Anthropic does not. The decision space is finer-grained for OpenAI and Google.
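The three questions above can be sketched as a small lookup plus a decision function. This is only an illustration of the framework, not a rule engine: the type names, the BOT_PURPOSE table, and the decide() helper are all made up for this example, and the 'mixed' label for ClaudeBot reflects this guide's description of it as a combined training and product crawler.

```typescript
// Illustrative sketch of the three-question framework.
// All names here (Purpose, BOT_PURPOSE, decide) are hypothetical.
type Purpose = 'training' | 'search' | 'on-demand' | 'mixed';
type ContentKind = 'moat' | 'marketing';
type Decision = 'allow' | 'block' | 'judgment-call';

// Purposes as described in the crawler inventory in this guide.
const BOT_PURPOSE: Record<string, Purpose> = {
  'GPTBot': 'training',
  'OAI-SearchBot': 'search',
  'ChatGPT-User': 'on-demand',
  'ClaudeBot': 'mixed', // training and product features share one agent
  'PerplexityBot': 'search',
  'Google-Extended': 'training',
  'Applebot-Extended': 'training',
};

function decide(bot: string, content: ContentKind): Decision {
  const purpose = BOT_PURPOSE[bot];
  // Unknown bot: review it by hand instead of guessing.
  if (purpose === undefined) return 'judgment-call';
  // Question 1: marketing surfaces gain more from visibility than they lose to training.
  if (content === 'marketing') return 'allow';
  // Question 2: search and on-demand bots create direct citation surface.
  if (purpose === 'search' || purpose === 'on-demand') return 'allow';
  // Question 3: with no separate training agent, blocking also removes citations.
  if (purpose === 'mixed') return 'judgment-call';
  // Training-only bot on moat content.
  return 'block';
}
```

For example, decide('GPTBot', 'moat') returns 'block', while decide('ClaudeBot', 'moat') returns 'judgment-call' because the training and citation trade-off cannot be split for that agent.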
Default Recommendations by Site Type
- Marketing sites, agencies, local services: Allow everything. Maximum AI visibility, minimum competitive risk. The training data is your service descriptions and FAQs — not a moat.
- SaaS and product sites: Allow search and on-demand bots. Allow training bots for product pages and docs. Consider blocking training for proprietary changelog or pricing-strategy posts.
- Original research and journalism: Allow search bots, block training bots. Your research is the moat. Citation surface is fine — training someone else's model on your investigation is not.
- Premium content and paid courses: Block training, allow search for preview pages only. Use HTTP-auth or paywall for paid content so crawlers never see it.
- Membership communities and forums: Block all AI bots from user-generated content unless members opt in. Respect contributor intent.
AI Crawler robots.txt Recipes (Copy and Paste)
Below are four ready-to-paste robots.txt blocks that cover the most common policies. Use the one that matches your site type, then add your existing Googlebot, Bingbot, and sitemap rules at the bottom of the file.
Recipe 1: Allow Everything (Maximum AI Visibility)
Best for marketing sites, agencies, local services, and anyone whose primary content goal is being cited inside AI answers. No explicit allow rules are needed — robots.txt defaults to allow if no Disallow exists.
# robots.txt - Allow all AI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

# Your existing rules
User-agent: *
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml
Recipe 2: Block Training, Allow Search (Most Common 2026 Setup)
Best for SaaS, B2B content sites, and publishers who want AI citations but do not want their content training future models. This blocks GPTBot, ClaudeBot, Google-Extended, and Applebot-Extended while leaving search and on-demand bots open.
# robots.txt - Block training, allow search
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: *
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml
Note that ClaudeBot serves both training and Claude product features. Blocking it removes you from Claude citations as well as training. If Claude citation surface matters to your business, leave ClaudeBot allowed and accept the training trade-off.
Recipe 3: Block Everything (Maximum Protection)
Best for paywalled publishers, original-research sites, and creators whose business model depends on content scarcity. This blocks all AI crawlers known as of mid-2026.
# robots.txt - Block all AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Anthropic-AI
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml
Recipe 4: Selective Block (Per-Directory)
Use this when you want AI crawlers to read marketing pages but never your premium content, customer dashboard, or research archive. The pattern mixes per-bot and per-directory rules.
# robots.txt - Selective per-directory block
User-agent: GPTBot
Disallow: /premium/
Disallow: /research/
Disallow: /members/
Allow: /

User-agent: ClaudeBot
Disallow: /premium/
Disallow: /research/
Disallow: /members/
Allow: /

User-agent: Google-Extended
Disallow: /premium/
Disallow: /research/
Disallow: /members/
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml
Pro Tip: Order Matters Less Than You Think
robots.txt parsers usually evaluate per-user-agent blocks independently. The order of the blocks does not change the outcome, but readability matters. Group blocks by company (all OpenAI agents together, all Anthropic agents together) so the editorial intent is obvious to future maintainers and auditors.
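One way to keep blocks grouped by company is to generate the file from a policy table instead of editing it by hand. The sketch below is a minimal illustration under that assumption: BotRule, POLICY, and renderRobotsTxt are hypothetical names, and the policy shown is just the "block training, allow search" recipe expressed as data.

```typescript
// Sketch: render a robots.txt body from a policy table, grouped by company.
// BotRule, POLICY, and renderRobotsTxt are illustrative names, not a library API.
interface BotRule {
  company: string;
  userAgent: string;
  disallow: string[]; // an empty array means "allow everything"
}

const POLICY: BotRule[] = [
  { company: 'OpenAI', userAgent: 'GPTBot', disallow: ['/'] },
  { company: 'OpenAI', userAgent: 'OAI-SearchBot', disallow: [] },
  { company: 'OpenAI', userAgent: 'ChatGPT-User', disallow: [] },
  { company: 'Anthropic', userAgent: 'ClaudeBot', disallow: ['/'] },
  { company: 'Perplexity', userAgent: 'PerplexityBot', disallow: [] },
  { company: 'Google', userAgent: 'Google-Extended', disallow: ['/'] },
];

function renderRobotsTxt(rules: BotRule[], sitemapUrl: string): string {
  // Group rules by company, keeping first-seen order so output is stable.
  const companies: string[] = [];
  const groups: Record<string, BotRule[]> = {};
  for (const rule of rules) {
    if (groups[rule.company] === undefined) {
      groups[rule.company] = [];
      companies.push(rule.company);
    }
    groups[rule.company].push(rule);
  }

  const lines: string[] = [];
  for (const company of companies) {
    lines.push(`# ${company}`);
    for (const rule of groups[company]) {
      lines.push(`User-agent: ${rule.userAgent}`);
      if (rule.disallow.length === 0) lines.push('Allow: /');
      for (const path of rule.disallow) lines.push(`Disallow: ${path}`);
      lines.push(''); // blank line between groups
    }
  }
  // Catch-all group and sitemap go last.
  lines.push('User-agent: *', 'Allow: /', '', `Sitemap: ${sitemapUrl}`);
  return lines.join('\n') + '\n';
}
```

Calling renderRobotsTxt(POLICY, 'https://www.yoursite.com/sitemap.xml') yields one commented section per company, which keeps the editorial intent visible in code review as well as in the served file.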
Implementing AI Crawler Rules in Next.js, WordPress, and Webflow
The robots.txt content is the same regardless of platform — the difference is how you generate and serve it. Below are platform-specific patterns.
Next.js (App Router)
Next.js 13+ supports a typed robots configuration at app/robots.ts that emits a robots.txt at build time. The advantage is type safety and version control. The downside is verbosity for many user agents — most teams switch to a raw file in public/robots.txt once the rule set grows past five blocks.
// app/robots.ts
import type { MetadataRoute } from 'next';
// Block training crawlers, allow search and on-demand agents.
export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: 'GPTBot', disallow: '/' },
      { userAgent: 'ClaudeBot', disallow: '/' },
      { userAgent: 'Google-Extended', disallow: '/' },
      { userAgent: 'Applebot-Extended', disallow: '/' },
      { userAgent: 'CCBot', disallow: '/' },
      { userAgent: 'OAI-SearchBot', allow: '/' },
      { userAgent: 'ChatGPT-User', allow: '/' },
      { userAgent: 'PerplexityBot', allow: '/' },
      { userAgent: 'Perplexity-User', allow: '/' },
      { userAgent: '*', allow: '/' },
    ],
    sitemap: 'https://www.yoursite.com/sitemap.xml',
  };
}
WordPress
Yoast, Rank Math, and AIOSEO all expose a robots.txt editor in their UI. For sites without an SEO plugin, edit the file directly via FTP or the cPanel file manager. WordPress generates a virtual robots.txt by default if no physical file exists — creating a physical file overrides the virtual one. Always test after publishing because cached responses can mask broken rules for hours.
Webflow, Framer, Squarespace, Shopify
Webflow exposes robots.txt under Site Settings > SEO. Framer offers an SEO panel in project settings. Squarespace and Shopify auto-generate robots.txt and historically did not allow full editing. Shopify added customization in 2021 via a theme template (robots.txt.liquid). Whatever the platform, confirm the result by fetching yoursite.com/robots.txt with curl and reading the actual response.
Server-Side Enforcement (For Hard Blocks)
robots.txt is voluntary. Well-behaved crawlers respect it. Bad actors ignore it. For sites with sensitive content, layer in server-side blocks at the edge or origin: check the User-Agent header, return 403 for matching bots, and rate-limit by IP for known crawler ranges. Cloudflare, Vercel, and Fastly all expose middleware for this. Here is a sketch in Vercel Edge Middleware:
// middleware.ts - Hard block selected AI bots
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';
// Substrings matched against the User-Agent header.
// Google-Extended is omitted: it is a robots.txt control token,
// not a string Google sends in HTTP requests, so it would never match here.
const BLOCKED_AGENTS = [
  'GPTBot',
  'ClaudeBot',
  'CCBot',
];
export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') || '';
  if (BLOCKED_AGENTS.some(bot => ua.includes(bot))) {
    return new NextResponse('Forbidden', { status: 403 });
  }
  return NextResponse.next();
}
// Skip Next.js internals and the favicon.
export const config = {
  matcher: '/((?!_next|favicon).*)',
};
Pair the middleware block with the equivalent robots.txt Disallow so well-behaved bots stop on the polite signal and bad actors hit the 403. For tactical depth on the broader technical layer, see our technical SEO audit playbook and website security essentials.
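A User-Agent header is trivially spoofed, so the IP side of the check matters when you allow a bot or rate-limit by crawler range. The sketch below shows one way to test whether an IPv4 address falls inside a CIDR range. It makes several assumptions: it handles IPv4 only, the helper names are hypothetical, and the range in HYPOTHETICAL_RANGES is a reserved documentation block (192.0.2.0/24) used as a placeholder. Real ranges have to come from each vendor's published list.

```typescript
// Sketch: confirm a request that claims to be a crawler comes from a listed range.
// IPv4 only. All names and the placeholder range are illustrative assumptions.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet; // plain arithmetic avoids signed 32-bit overflow
  }
  return value;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split('/');
  if (bitsText === undefined || !/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  if (bits > 32) return false;
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  // Two addresses share a prefix when they fall in the same block of this size.
  const blockSize = Math.pow(2, 32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Placeholder data: 192.0.2.0/24 is reserved for documentation, not a real crawler range.
const HYPOTHETICAL_RANGES: Record<string, string[]> = {
  GPTBot: ['192.0.2.0/24'],
};

function isVerifiedCrawler(userAgent: string, ip: string): boolean {
  const ua = userAgent.toLowerCase();
  for (const bot of Object.keys(HYPOTHETICAL_RANGES)) {
    if (ua.indexOf(bot.toLowerCase()) !== -1) {
      return HYPOTHETICAL_RANGES[bot].some(range => inCidr(ip, range));
    }
  }
  return false; // the user agent does not claim to be a listed crawler
}
```

In a middleware, a request whose user agent names an allowed bot but fails isVerifiedCrawler could be treated like any other unknown client, which limits how much a spoofed header gains an attacker.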
How to Verify Your AI Crawler Rules Are Working
Pushing the file is half the job. Verification confirms the rules parse, the right bots are blocked, and the right bots are still allowed.
- 1. Fetch the live file with curl. Run curl https://yoursite.com/robots.txt and confirm the response is plain text matching what you intended. A 404 or HTML response means the file is not being served correctly.
- 2. Validate syntax with a parser. The robots.txt report in Google Search Console shows how Google fetched and parsed your file. For AI crawlers, manually inspect each User-agent block for typos: "GPT-Bot" with a hyphen is not the same as "GPTBot".
- 3. Check server logs for the bots you want to block. A week after publishing, grep your logs for the user agent strings. If GPTBot is still hitting pages where you disallowed it, the bot is either ignoring the rule (rare for OpenAI) or your rule has a syntax error.
- 4. Check server logs for the bots you want to allow. If OAI-SearchBot is not appearing in logs at all, your sitemap or internal linking may be the bottleneck, not the robots.txt.
- 5. Test AI search visibility. Run topical queries inside ChatGPT search and Perplexity. If your pages appear as citations, the search bots are working. If they do not appear, the issue is upstream of robots.txt — see the citation surface playbook.
- 6. Document the policy publicly. Publish an AI usage policy page that mirrors the robots.txt intent in plain English. This signals editorial position to readers, partners, and platforms, and matters if a dispute over training data ever arises.
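Steps 1 and 2 can be partly automated with a small offline evaluator that answers "is this bot allowed on this path?" for a given robots.txt body. The sketch below is a simplified reading of the longest-match rule, not a full parser: it does prefix matching only (no * or $ wildcards), it uses the first group that names the bot instead of merging duplicate groups, and the function names and SAMPLE text are made up for illustration.

```typescript
// Sketch: evaluate robots.txt rules for one user agent and path.
// Simplifications: prefix matching only, no wildcards, first matching group wins.
interface Rule { allow: boolean; path: string; }
interface Group { agents: string[]; rules: Rule[]; }

function parseRobotsTxt(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current !== null) {
      // An empty Disallow value means "no restriction", so it adds no rule.
      if (value !== '') current.rules.push({ allow: field === 'allow', path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

function findGroup(groups: Group[], token: string): Group | undefined {
  for (const group of groups) {
    if (group.agents.indexOf(token) !== -1) return group;
  }
  return undefined;
}

function isAllowed(groups: Group[], userAgent: string, path: string): boolean {
  // User-agent matching is case-insensitive; fall back to the * group.
  const group = findGroup(groups, userAgent.toLowerCase()) ?? findGroup(groups, '*');
  if (group === undefined) return true; // no applicable group: crawling is allowed
  let best: Rule | null = null;
  for (const rule of group.rules) {
    if (path.indexOf(rule.path) !== 0) continue; // rule path is not a prefix
    const longer = best === null || rule.path.length > best.path.length;
    const tieGoesToAllow = best !== null && rule.path.length === best.path.length && rule.allow;
    if (longer || tieGoesToAllow) best = rule;
  }
  return best === null ? true : best.allow;
}

// Made-up sample in the shape of the selective per-directory recipe.
const SAMPLE = [
  'User-agent: GPTBot',
  'Disallow: /premium/',
  'Allow: /',
  '',
  'User-agent: OAI-SearchBot',
  'Allow: /',
  '',
  'User-agent: *',
  'Allow: /',
].join('\n');
```

With this sample, isAllowed(parseRobotsTxt(SAMPLE), 'GPTBot', '/premium/course-1') is false because /premium/ is a longer match than /, while the same path is allowed for OAI-SearchBot. Feeding the function the live file fetched in step 1 gives a quick table of which bot can reach which section.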
Blocking Trends Across the Web in 2026
Adoption of AI crawler blocking has accelerated since 2024. Multiple analyses of the top one million sites have tracked the share of sites that explicitly block at least one major AI bot.
Two patterns stand out. First, GPTBot is the single most blocked AI bot — partly because OpenAI was the first major AI company to expose a documented crawler, giving site owners three years to make a decision. Second, search-only bots (OAI-SearchBot, PerplexityBot) are blocked at a fraction of the rate of training bots, which confirms that most operators want citation surface even when they want training-data protection.
Six Common AI Crawler Management Mistakes
- 1. Blocking GPTBot and assuming ChatGPT search is blocked too. They are separate agents. Blocking GPTBot has no effect on ChatGPT search citations. To remove yourself from ChatGPT search, block OAI-SearchBot.
- 2. Blocking Googlebot when you meant to block Google-Extended. The two names look similar but do very different things. Googlebot powers standard Google Search and AI Overviews. Google-Extended is the AI training opt-out. Blocking Googlebot is a catastrophic SEO move; blocking Google-Extended is a routine policy choice.
- 3. Typos in user agent strings. A robots.txt rule for "GPT-Bot" or "OpenAI-GPTBot" will not match OpenAI's actual crawler. Matching is case-insensitive, so "gptbot" still works, but the spelling must be exact. Always copy the agent string from the company's official documentation.
- 4. Forgetting to update robots.txt when a new bot launches. The bot list has changed every quarter since 2023. Set a recurring calendar reminder to review robots.txt every 90 days, or automate it with a build-time check that fails CI when a new known bot appears without a rule.
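- 5. Mixing allow and disallow without testing. Parsers that follow RFC 9309 give precedence to the longest matching path. Some older parsers apply the first matching rule instead, so they need the Allow rule before Disallow inside the same block. Test the file as it is actually served, using the robots.txt report in Google Search Console or an independent parser, and verify the result matches your intent.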
- 6. Treating robots.txt as enforcement. robots.txt is a request, not a wall. Bad actors and emerging research crawlers may ignore it. For genuine protection of premium content, use authentication, paywalls, or server-side user-agent blocking — not just robots.txt.
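The build-time check suggested in mistake 4 can be as small as comparing the User-agent lines in robots.txt against a maintained list of known bots. The sketch below is one possible shape for it: KNOWN_AI_BOTS, declaredAgents, and missingBots are hypothetical names, and the list simply mirrors the seven crawlers covered in this guide. Because the comparison is exact apart from case, it also catches the typos from mistake 3.

```typescript
// Sketch: report known AI bots that have no explicit group in robots.txt.
// KNOWN_AI_BOTS is a hand-maintained list; extend it as new crawlers appear.
const KNOWN_AI_BOTS: string[] = [
  'GPTBot',
  'OAI-SearchBot',
  'ChatGPT-User',
  'ClaudeBot',
  'PerplexityBot',
  'Google-Extended',
  'Applebot-Extended',
];

// Collect every User-agent value, lowercased, ignoring comments.
function declaredAgents(robotsTxt: string): string[] {
  const agents: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim();
    const match = /^user-agent\s*:\s*(.+)$/i.exec(line);
    if (match) agents.push(match[1].trim().toLowerCase());
  }
  return agents;
}

// Return the known bots that robots.txt never names.
// A misspelled token such as "GPT-Bot" does not count as a rule for GPTBot.
function missingBots(robotsTxt: string, known: string[] = KNOWN_AI_BOTS): string[] {
  const declared = declaredAgents(robotsTxt);
  return known.filter(bot => declared.indexOf(bot.toLowerCase()) === -1);
}
```

A CI step could read the built robots.txt, call missingBots, and fail the build when the returned array is not empty, which turns the 90-day review into an automatic prompt whenever the known-bot list grows.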
A Sacramento Law Firm Case Study
A short, real-world scenario to illustrate the decision framework in action. We worked with a Sacramento personal-injury law firm that had three concerns: they wanted AI citation surface for their service pages, they had a proprietary case-results database they considered competitive content, and they were unsure whether allowing GPTBot would put their content into a competitor's legal AI tool.
The configuration we shipped: allow OAI-SearchBot, PerplexityBot, ChatGPT-User, Claude-User, and Googlebot site-wide. Allow ClaudeBot on the marketing surface (service pages, blog, about). Disallow GPTBot site-wide. Disallow Google-Extended and Applebot-Extended site-wide. Disallow all AI crawlers on /case-results/ via a per-directory rule.
Six months in, the result is exactly what they wanted. ChatGPT search and Perplexity cite their service pages on personal-injury questions. Claude cites them inside Claude. The case-results database does not show up in any AI output we can find. The firm sleeps well at night because the editorial intent is documented in both robots.txt and a public AI usage policy. The same approach scales to most service-business websites that want presence without giving away the work product.
Where AI Crawler Management Is Heading
Three trends are reshaping the space in 2026 and are worth tracking as you design your policy.
- IETF AI Preferences (AI-PREFS). An IETF working group is drafting a standard for expressing AI training preferences via HTTP headers and well-known URIs, which would give site owners a more granular alternative to robots.txt. The draft is at datatracker.ietf.org/wg/aipref. Expect partial adoption in the next 12 to 18 months.
- Paid licensing markets. Cloudflare launched a Pay-Per-Crawl feature in 2025 that lets site owners require payment from AI crawlers in exchange for access. Reddit and the Associated Press have signed direct licensing deals with OpenAI. The default for high-value content is shifting from binary block-or-allow to monetized access.
- Agentic AI crawlers. Browse-on-behalf-of-user bots (ChatGPT-User, Claude-User) are growing faster than traditional indexers. The on-demand category will dominate by late 2026, which is why our agentic AI optimization guide treats it as a separate discipline.
Frequently Asked Questions
How do I block AI crawlers from my website?
Add Disallow rules to your robots.txt for each AI crawler user agent. The four bots most operators block first are GPTBot (OpenAI training), ClaudeBot (Anthropic training), PerplexityBot (Perplexity indexing), and Google-Extended (Google AI training). The syntax is standard: a User-agent line followed by Disallow: /. The file lives at yoursite.com/robots.txt. Most CMS platforms expose a robots.txt editor in their SEO settings. For Next.js, edit app/robots.ts. After publishing, validate by fetching the file in a terminal with curl and checking that each user agent has its own rule block. Note that robots.txt is voluntary — well-behaved crawlers respect it, but bad actors ignore it. For hard blocking, layer in server-side checks for the user agent string and return a 403 status.
What is GPTBot and should I block it?
GPTBot is OpenAI's web crawler. OpenAI launched it in August 2023 and documents it at platform.openai.com/docs/bots. GPTBot fetches publicly accessible pages and uses the content to train future versions of GPT models. It is a separate user agent from OAI-SearchBot, which powers ChatGPT search and citations. Whether to block GPTBot depends on your goals. If your content is your competitive moat (paid courses, proprietary research, copyrighted writing), blocking GPTBot prevents free training data extraction. If you want AI visibility and the long-term presence of your brand inside model responses, allowing GPTBot is the more pragmatic choice. Most marketing sites allow it. Most paywalled publishers and original-research sites block it.
How do I allow AI crawlers but block training?
Use crawler-specific rules. OpenAI splits training and search into two user agents: block GPTBot (training) while allowing OAI-SearchBot (ChatGPT citations). Anthropic uses ClaudeBot for both training and Claude product features — there is no separation, so you cannot block training without blocking citations. Google offers Google-Extended specifically as the training opt-out: blocking Google-Extended stops Gemini and Vertex AI from training on your content while leaving Googlebot intact for search indexing. The pattern is: allow the indexing or search-citation user agent, disallow the training user agent. Document your stance in a public AI usage policy so your editorial intent is clear to readers and platforms.
Does blocking AI crawlers hurt SEO?
Blocking AI crawlers does not affect traditional Google search rankings. Google-Extended is a separate user agent from Googlebot, so blocking the AI training crawler leaves your standard SEO ranking signals untouched. Blocking GPTBot, ClaudeBot, and PerplexityBot has no effect on Google search rankings at all. However, blocking these crawlers reduces your visibility in AI-powered search experiences. ChatGPT search, Perplexity, Claude, and Google AI Overviews each weight content from their respective crawlers differently. If you block PerplexityBot, Perplexity stops citing your site. If you block OAI-SearchBot, ChatGPT search stops surfacing your pages. The decision is between training-data protection and AI-citation surface. Most marketing-focused sites should prioritize citation surface.
What is the difference between GPTBot and OAI-SearchBot?
GPTBot is OpenAI's training crawler. OAI-SearchBot is OpenAI's search crawler. GPTBot collects content to train future GPT model weights. OAI-SearchBot fetches pages in real time when a ChatGPT user runs a web-enabled query, then OpenAI surfaces and cites those pages in the response. Blocking GPTBot prevents your content from being used as training data. Blocking OAI-SearchBot prevents ChatGPT from citing your pages in live search answers. OpenAI also documents a third agent, ChatGPT-User, which fetches pages on behalf of a user during a conversation, similar to clicking a link. Allowing OAI-SearchBot and ChatGPT-User while blocking GPTBot is the standard configuration for sites that want AI-search visibility without training-data extraction.
How do I let ChatGPT cite my site but not train on it?
Block GPTBot and allow OAI-SearchBot in your robots.txt. The exact rules: User-agent: GPTBot / Disallow: / in one block, then User-agent: OAI-SearchBot / Allow: / in a second block. Add ChatGPT-User with Allow: / as a third block so users browsing inside ChatGPT can fetch your pages on demand. Publish these rules together so the editorial intent is unambiguous. Then check your server logs against OpenAI's documented IP ranges to confirm that requests claiming to be these agents are genuine, and test in a few weeks by running ChatGPT search queries on topics you cover. If your pages appear as citations and GPTBot has stopped requesting content pages, the configuration is working. Note that historical training data already absorbed cannot be retracted by future robots.txt changes.
Build for AI Search, Not Against It
AI crawler management is the foundation for everything else in generative engine optimization. Block the wrong bot and your AI Overview visibility, ChatGPT citations, and Google AI Mode presence disappear in a single push. Verlua audits and configures AI crawler access on every site we build or rescue.
Founder & Technical Director
Mark Shvaya runs Verlua, a web design and development studio in Sacramento. He builds conversion-focused websites for service businesses, e-commerce brands, and SaaS companies.
California real estate broker, property manager, and founder of Verlua.