Major News Publishers Block AI Bots: Impacts on Training and Content Retrieval
Major news publishers are increasingly restricting AI access, with 79% of top news sites blocking AI training bots while 71% also block retrieval bots used for real-time citations, according to a new BuzzStream analysis of 100 leading news sites across the US and UK.
This widespread blocking affects not only how AI models are trained but also whether these publishers appear in AI-generated answers when users ask questions. Because training and retrieval bots are distinct, a news source may be absent from AI citations even if its content informed the underlying models. For businesses that rely on AI tools for information retrieval, that gap is one of the more concrete risks of building on these systems.
How publishers are restricting AI access
BuzzStream's analysis revealed significant patterns in how news organizations manage AI bot access through their robots.txt files. These text files serve as directives to tell crawlers which parts of a website they can and cannot access.
Among training bots, Common Crawl's CCBot faced the highest rejection rate at 75%, followed closely by anthropic-ai at 72%, ClaudeBot at 69%, and OpenAI's GPTBot at 62%. Google-Extended, the token that controls whether content can be used to train Google's Gemini models, had the lowest block rate at 46% overall, though with a striking regional difference: 58% of US publishers blocked it compared to just 29% of UK publishers.
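For illustration, a robots.txt policy that turns away the training crawlers named above would look something like the excerpt below. The directives are standard, but the file itself is a hypothetical sketch rather than any specific publisher's policy.

```
# Hypothetical robots.txt excerpt: refuse AI training crawlers site-wide
User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```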
Harry Clarkson-Bennett, SEO Director at The Telegraph, explained the publishers' reasoning: "Publishers are blocking AI bots using the robots.txt because there's almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive."
The blocking extends beyond training bots to those used for retrieving information during user queries. Claude-Web was blocked by 66% of sites, while OpenAI's OAI-SearchBot (powering ChatGPT's live search) faced restrictions from 49% of publishers. ChatGPT-User was blocked by 40%, and Perplexity-User had the lowest blocking rate at 17%.
The enforcement challenge
A critical limitation of these blocking strategies is that robots.txt files function as polite requests rather than technical barriers. As the report acknowledges, bots can simply ignore these directives if programmed to do so.
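A minimal sketch makes the point: a well-behaved crawler consults robots.txt before fetching a page, as in the Python example below (the site URL and user agent are placeholders). Nothing in the protocol forces that check, so a bot that skips it still receives the page unless the server blocks it by other means.

```python
# Minimal sketch of a "polite" crawler that honors robots.txt (hypothetical URL).
# Compliance is entirely client-side: a bot that skips can_fetch() still gets the page.
import urllib.robotparser
import urllib.request

USER_AGENT = "ExampleAIBot"  # placeholder user agent, not a real crawler
TARGET = "https://news.example.com/some-article"

robots = urllib.robotparser.RobotFileParser("https://news.example.com/robots.txt")
robots.read()  # fetch and parse the publisher's robots.txt

if robots.can_fetch(USER_AGENT, TARGET):
    request = urllib.request.Request(TARGET, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        html = response.read()
else:
    # The directive asks us to stay out; honoring it is a choice, not an enforcement.
    html = None
```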
This enforcement gap was highlighted when Cloudflare documented Perplexity using what they described as stealth crawling behaviors to circumvent robots.txt restrictions. According to Cloudflare, Perplexity rotated IP addresses, changed ASNs, and disguised its user agent to appear as a regular browser—tactics that led Cloudflare to delist it as a verified bot and actively block it. Perplexity has disputed these claims in a published response.
Clarkson-Bennett reinforced this concern in the BuzzStream report: "The robots.txt file is a directive. It's like a sign that says please keep out, but doesn't stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives."
For publishers determined to keep AI systems away from their content, measures beyond simple robots.txt instructions, such as CDN-level blocking or bot fingerprinting, may be necessary. This technical arms race underscores how unsettled the relationship between artificial intelligence and content creators and distributors remains in the digital economy.
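One building block of such measures, and of "verified bot" programs generally, is validating a crawler's claimed identity at the network level instead of trusting its User-Agent header. A common approach is a reverse DNS lookup on the connecting IP confirmed by a forward lookup; the sketch below uses an illustrative allowlist of operator domains, which a real deployment would replace with the suffixes (or published IP ranges) each crawler operator documents.

```python
# Sketch of crawler verification via reverse + forward DNS (illustrative only).
# The domain suffixes below are assumptions; real deployments use the suffixes
# each crawler operator publishes, or published IP ranges where rDNS is not offered.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(client_ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)        # reverse lookup
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)   # forward confirmation
        return client_ip in forward_ips                         # must round-trip
    except OSError:
        return False  # no PTR record or failed lookup: treat as unverified

# A spoofed User-Agent header alone passes no check here: the IP must resolve
# back to the operator's domain, and that domain must resolve back to the IP.
```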
Technical countermeasures emerging
More sophisticated publishers are now implementing additional technical safeguards beyond robots.txt, including:
- JavaScript challenges that bots must solve before accessing content
- Rate limiting measures that detect and block excessive crawling behavior (a simplified limiter is sketched after this list)
- Content fingerprinting to identify when material has been scraped despite prohibitions
- Dynamic content rendering that makes automated extraction more difficult
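As a concrete illustration of the rate-limiting item, the sketch below keeps a sliding window of request timestamps per client IP and rejects clients that exceed an assumed budget. Real deployments typically run at the CDN or reverse proxy and combine this with other signals, so treat the thresholds and the IP-only keying as placeholders.

```python
# Minimal sliding-window rate limiter keyed by client IP (illustrative thresholds).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # length of the sliding window
MAX_REQUESTS = 120    # assumed request budget per client per window

_request_log: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str) -> bool:
    """Return True if this client is still under its request budget."""
    now = time.monotonic()
    history = _request_log[client_ip]

    # Drop timestamps that have fallen out of the window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()

    if len(history) >= MAX_REQUESTS:
        return False  # looks like aggressive crawling; block or challenge it

    history.append(now)
    return True
```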
According to a recent MIT Technology Review report, these advanced protective measures are becoming increasingly necessary as AI training operations grow more sophisticated in their data collection approaches.
Implications for publishers and readers
The high percentage of publishers blocking retrieval bots is particularly significant. While training blocks affect future AI models, retrieval blocks impact whether content appears in AI answers right now. This creates a paradoxical situation where publishers' content might have already been incorporated into AI models during training, but the same publishers won't be cited when users ask related questions.
OpenAI and other AI companies have established distinct crawlers for different functions—GPTBot gathers training data, while OAI-SearchBot enables live search in ChatGPT. Similarly, Perplexity distinguishes between PerplexityBot for indexing and Perplexity-User for retrieval. Blocking one doesn't automatically block the other.
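In robots.txt terms, that split looks something like the hypothetical excerpt below: the training and indexing crawlers are refused site-wide while the retrieval agents are left unmentioned, and an agent with no matching rules (and no wildcard group) is allowed by default.

```
# Hypothetical excerpt: refuse training/indexing crawlers, leave retrieval agents alone
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# No rules for OAI-SearchBot, ChatGPT-User, or Perplexity-User,
# so they remain free to fetch pages for live answers and citations.
```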
The regional disparity in blocking Google-Extended—with US publishers nearly twice as likely to block it compared to UK counterparts—raises questions about different risk assessments regarding Gemini's growth or varying business relationships with Google across markets.
Economic considerations for publishers
Publishers face significant economic challenges when determining their AI access policies:
- Traffic diversion: AI-generated summaries may satisfy users without them clicking through to source sites
- Subscription impact: If premium content is summarized by AI tools, potential subscribers might find less value in paying
- Advertising revenue reduction: Fewer page views translate directly into lower ad impressions and revenue
- Content value dilution: When AI systems aggregate information from multiple sources, the distinctive value of individual publishers may diminish
These economic factors must be weighed against the potential business benefits of working with AI providers through licensing or other strategic partnerships.
How this affects information access
For readers and information seekers, these blocking practices mean that AI tools may provide answers without citing major news sources, even when those sources would be the most authoritative or relevant. This could potentially create information gaps or reduce the diversity of perspectives represented in AI-generated responses.
Publishers face a difficult balancing act. By blocking AI training, they protect their content from being used without compensation. However, by also blocking retrieval, they risk becoming invisible in an increasingly AI-mediated information landscape.
Reader experience considerations
From the reader's perspective, several important consequences emerge:
- Information quality concerns: When major publishers are excluded from AI responses, answers may lack critical context or authoritative sources
- Citation gaps: AI systems may present information originally sourced from blocked publishers without proper attribution
- Reduced source diversity: Overreliance on unblocked sources could create echo chambers or bias in AI-generated responses
- Inconsistent knowledge access: The same query might yield substantially different results depending on which publishers block which AI systems
Looking forward
As AI systems continue to evolve, the tension between publishers seeking fair compensation for their content and AI companies needing diverse training material and citation sources will likely intensify. Cloudflare's Year in Review confirmed that GPTBot, ClaudeBot, and CCBot received the highest number of full disallow directives across top domains.
The report noted that most publishers use partial blocks for Googlebot and Bingbot rather than complete blocks, reflecting the dual role these crawlers play in traditional search indexing and AI training.
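The difference shows up directly in the directives. In the hypothetical excerpt below, the full block refuses an agent everywhere, while the partial block only fences off particular sections (the paths are placeholders).

```
# Full block: the agent is refused site-wide
User-agent: CCBot
Disallow: /

# Partial block: the agent may crawl everything except the listed sections
User-agent: Googlebot
Disallow: /premium/
Disallow: /newsletters/
```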
How readers can adapt to this changing landscape
- Verify information from multiple sources when using AI assistants
- Be aware that AI-generated answers may exclude major news publishers due to these blocking practices
- Consider going directly to news sites for comprehensive coverage of important topics
For publishers, the retrieval bot category represents the front line in the visibility battle. While training blocks might shape future AI models, retrieval blocks determine whether content appears in AI answers today, potentially affecting reader discovery and publisher relevance in an increasingly AI-driven information ecosystem.