Google’s AI Training Defense: Fair Use Controversy and Future of Copyright Law
Google Defends AI Training as Fair Use in New Governance Paper
Google has staked out a firm position in the growing battle over AI and copyright law. In a policy paper published June 25, 2026, the tech giant argues that training AI models on publicly available web data constitutes a "transformative, non-expressive use" protected under U.S. fair use doctrine.
The paper arrives at a flashpoint for the AI industry. Publishers, regulators, and digital rights advocates are pushing back hard against how AI companies scrape and use web content, demanding clearer attribution, compensation, and in some cases permission before any scraping takes place. Google's response is detailed but leaves significant questions unanswered.
Google's Case for Fair Use
In its paper titled "A Pragmatic Approach to AI Governance in America," Google frames AI training on public web content much like an artist learning from observation. The company draws a direct comparison: training an AI model is like "an art student taking inspiration from walking through a gallery." The implication is that consuming publicly visible content for learning — rather than reproducing it — does not constitute infringement.
The legal framing determines how much weight the rest of Google's argument can carry. Fair use in U.S. copyright law hinges on four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the market for the original. Google's "transformative, non-expressive use" argument leans primarily on the first factor, positioning AI training as a fundamentally different activity from reproduction or distribution. Whether courts ultimately agree remains to be seen, and several active lawsuits are already testing these boundaries. Because AI introduces complex legal and operational risks for businesses, the stakes of this debate extend well beyond any single policy paper.
Google also argues that similar protections should extend internationally through text-and-data-mining exceptions, suggesting the company wants a global framework that aligns with its current practices.
For publishers and site owners who disagree, Google points to machine-readable opt-out controls as the primary remedy. Specifically, the paper recommends the Google-Extended control, a user-agent token that rules in a site's robots.txt file can target. This allows webmasters to signal that their content should be excluded from AI training pipelines.
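As a rough illustration of the opt-out described above, the sketch below pairs a sample robots.txt with a check of how a standard parser reads it. The file contents and the example.com URL are assumptions made for illustration, and the check relies on Python's standard-library urllib.robotparser. It shows only how the rules parse, not how Google's own systems behave.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not taken from Google's paper):
# block the Google-Extended token site-wide, leave other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def can_agent_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/some-story"  # hypothetical URL
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, "allowed:", can_agent_fetch(ROBOTS_TXT, agent, page))
```

With this sample file, the parser reports Google-Extended as blocked and Googlebot as allowed, because the Disallow rule sits in a group that names only the Google-Extended token.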
When AI outputs do reproduce existing work too closely, Google says the answer lies in existing notice-and-takedown processes rather than developing new filtering systems to evaluate output similarity. The paper does not propose new legal mechanisms or enforcement tools.
The Opt-Out Burden and Its Limitations
The opt-out model Google proposes places the responsibility squarely on rights holders to act — and to act correctly. A webmaster who is unaware of the Google-Extended directive, or who implements it incorrectly, may inadvertently leave their content exposed to training pipelines they never consented to. This is not a hypothetical concern; it is an operational reality for the thousands of smaller publishers who lack dedicated technical teams.
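One way an incorrect implementation can slip through is a misspelled token: a group addressed to "GoogleExtended" does not target Google-Extended, and a parser will typically fall back to whatever wildcard rules exist. The sketch below is a hypothetical lint, not an official tool. The function name and the 0.8 similarity cutoff are arbitrary assumptions, and it uses only Python's standard-library difflib.

```python
import difflib

EXPECTED_TOKEN = "Google-Extended"


def suspicious_agent_lines(robots_txt: str, cutoff: float = 0.8) -> list[str]:
    """Return User-agent values that resemble, but do not equal, the expected token.

    The comparison is case-insensitive. The cutoff is an arbitrary
    assumption chosen for illustration.
    """
    suspects = []
    target = EXPECTED_TOKEN.lower()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "user-agent":
            continue
        value = value.strip()
        if value.lower() == target:
            continue  # exact match: nothing to flag
        ratio = difflib.SequenceMatcher(None, value.lower(), target).ratio()
        if ratio >= cutoff:
            suspects.append(value)
    return suspects
```

Running it on a file whose only group is "User-agent: GoogleExtended" returns that misspelled value, while a file naming Google-Extended and Googlebot correctly returns an empty list. A check like this catches near-miss spellings only; it says nothing about whether the rules beneath the token match the publisher's intent.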
Critics argue this is a structural imbalance by design. Placing the compliance burden on content creators — rather than on the entity doing the scraping — keeps the default setting firmly in Google's favour.
Where Publishers and Regulators Push Back
Not everyone is satisfied with an opt-out model — and the opposition is growing louder.
Digital Content Next, a trade association representing major digital publishers, recently sent a cease-and-desist letter to the Common Crawl Foundation. The letter made its position unambiguous: "copyright law is not an opt-out regime." Publishers argue that scrapers must obtain permission before using content rather than placing the burden on rights holders to request exclusion after the fact.
This framing strikes at the core of Google's governance position. If courts or legislators side with the permission-first view, the opt-out framework Google is defending could face serious legal challenges.
Regulators are also moving. The UK's Competition and Markets Authority introduced a new conduct requirement this month giving websites the option to opt out of AI search features while also requiring Google to attribute publisher content. The CMA framed the measure as a tool to strengthen publishers' bargaining power against large AI platforms.
Google has begun testing an opt-out toggle in response but has not yet provided publishers with click-level data — information that would help rights holders make informed decisions about whether opting out is worth the potential traffic loss.
The Permission-First Standard Gaining Ground
The permission-first argument is not simply a legal technicality — it represents a fundamentally different model of how AI systems should be built. Under a permission-first standard, AI developers would need to establish licensing relationships before training on content, rather than after the fact. This would likely slow development timelines and increase costs, which is precisely why large AI platforms have resisted it.
However, the music industry's experience with digital streaming offers a cautionary parallel. Artists saw their catalogues widely distributed with minimal return until licensing structures were forced into clarity through litigation and regulatory pressure. Publishers appear determined not to repeat that experience, and the momentum behind permission-first frameworks suggests they may have the leverage to avoid it.
The broader ethical questions raised here are ones the technology sector has encountered before. The ethics of big data collection and use have long been contested territory, and AI training scraping sits at the intersection of data ethics, intellectual property law, and platform accountability.
Paid Partnerships, Compensation, and What Publishers Need Now
The Value Exchange Question
Google does acknowledge that opt-outs alone may not be sufficient for all scenarios. The paper references two emerging approaches for content that goes beyond standard public web scraping.
First, Google mentions partnerships with websites whose content helps keep AI responses current and accurate — a nod to grounding partnerships where fresh or authoritative content is particularly valuable. Second, the paper references paid deals for access to specialised non-public content, suggesting that some publishers may negotiate direct compensation arrangements.
However, the paper provides no program names, payment terms, eligibility criteria, or timelines. These remain hypothetical value-exchange mechanisms rather than concrete commitments. As Search Engine Journal noted in its coverage, these are "policy positions, not product commitments."
For publishers hoping to understand what compensation might look like, the paper offers a direction without a destination. Whether Google links specific figures or formal programs to this language in future documents remains an open question worth monitoring closely.
The Traffic Erosion Problem
The broader context matters here. Since Google launched AI Overviews, the relationship between AI systems and the publishers whose content feeds them has grown increasingly strained. Some publishers report declining referral traffic even as their content continues to inform AI-generated answers — a dynamic that makes the compensation question feel urgent rather than theoretical.
This tension sits at the heart of a structural problem: AI systems are becoming more capable of delivering answers without delivering audiences. For publishers whose business models depend on traffic, this is an existential concern, not an abstract policy debate. Understanding how AI systems function and what they are actually doing with web content is increasingly essential for anyone managing a content-dependent business.
What This Means for Publishers and Digital Professionals
Google's governance paper offers a clear window into the company's regulatory strategy heading into what may be a defining period for AI policy in the U.S. and abroad. Here is how professionals can use this information.
Immediate Actions Worth Taking
- Site owners and publishers should audit their robots.txt files now to ensure Google-Extended rules reflect their current preferences on AI training. Waiting for regulatory clarity before acting could mean months of uncontrolled data use.
- Digital marketers and SEO professionals should track whether Google begins publishing click-level reporting tied to AI features. That data will be essential for measuring the real traffic impact of opting in or out of AI search experiences.
- Businesses with proprietary or specialised content should assess whether their assets qualify for the kind of non-public content licensing deals Google references, and should consult legal counsel before assuming fair use applies in both directions.
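For the robots.txt audit suggested in the first item above, one possible approach is to write down the intended policy per path and compare it with what a parser actually reports. The sketch below assumes a hypothetical sample file, hypothetical paths, and an example.com base URL. The function name audit_policy is invented for illustration, and the parsing uses Python's standard-library urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only one section for Google-Extended.
SAMPLE = """\
User-agent: Google-Extended
Disallow: /premium/

User-agent: *
Allow: /
"""


def audit_policy(
    robots_txt: str,
    intended: dict[str, bool],
    agent: str = "Google-Extended",
    base: str = "https://example.com",  # placeholder origin
) -> dict[str, bool]:
    """Return {path: actual_allowed} for paths where the file disagrees with intent.

    `intended` maps a path to True (should be fetchable by `agent`)
    or False (should be blocked).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for path, should_allow in intended.items():
        actual = parser.can_fetch(agent, base + path)
        if actual != should_allow:
            mismatches[path] = actual
    return mismatches


if __name__ == "__main__":
    policy = {
        "/premium/report.html": False,  # meant to be excluded
        "/blog/post.html": False,       # meant to be excluded
        "/about": True,                 # fine to include
    }
    print(audit_policy(SAMPLE, policy))
```

With this sample, the audit flags /blog/post.html: the publisher intended to exclude it, but the file only disallows /premium/. To check a live site instead of a pasted string, the same parser class can load a file through its set_url and read methods.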
Watching the Regulatory Horizon
The next few months will likely determine whether Google's opt-out framework survives regulatory scrutiny or whether a permission-first standard takes hold. Either outcome will reshape the economics of content creation online.
The organisations best positioned to navigate this shift will be those that treat it as a strategic issue now, rather than a compliance problem later. Monitoring decisions from the UK's CMA, active U.S. litigation involving AI training data, and any formal licensing announcements from major AI platforms will be essential signals in the months ahead.
The U.S. Copyright Office's AI policy resources offer an authoritative, continuously updated reference on AI governance and copyright developments, with primary source documentation directly relevant to these debates.