AI data contracts, content and search dilemmas

On July 1st, Reddit updated its robots.txt file to exclude crawling of the site by anyone other than those with commercial agreements or for non-profit/education purposes. What are the lessons for content sites on the open web?

Finally, some longer posts coming (thank you summer break!).

On July 1st, Reddit updated its robots.txt file to exclude crawling of the site by anyone other than those with commercial agreements or for non-profit/education purposes (the actual file is here).

The company says the update is not related to a recent contract with Google that allowed the use of Reddit data in training AI models. The announcement, though, clearly identifies an uptick in the crawling of Reddit by multiple commercial entities as a contributing factor in making the change. The most likely culprits are AI crawlers, such as those run by OpenAI, Anthropic, and Perplexity, all of which have been accused of ignoring existing crawl rules to harvest training data.

Reddit results still appear in a wide range of search engines, but it's unclear if this is just previously crawled data or if there are agreements in place.

Reddit and other content sites will probably continue to try to be friendly to normal search indexes. This does drive their site traffic, after all. What is also clear, though, is that they won't allow crawling for AI training outside of a commercial agreement.
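To make the "yes to search, no to AI training" stance concrete, here is a minimal sketch of what such a policy could look like in robots.txt, checked with Python's standard-library parser. This is not Reddit's actual file. The user-agent tokens (Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot) are the names I believe these crawlers announce, but treat them as assumptions and check each vendor's documentation. The URL is a placeholder. Keep in mind that robots.txt is only a request: it does nothing against a crawler that chooses to ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy for illustration only: welcome search indexers,
# turn away AI-training crawlers, and deny everything else by default.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /
"""

# Placeholder URL, not a real page.
TEST_URL = "https://example.com/some-post"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "SomeOtherBot" is a made-up name that falls through to the "*" rule.
for agent in ("Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "SomeOtherBot"):
    verdict = "allowed" if parser.can_fetch(agent, TEST_URL) else "blocked"
    print(f"{agent}: {verdict}")
```

Under this policy the two search crawlers are allowed and everything else is blocked, including crawlers the file never names. The deny-by-default last rule is the design choice that matters: new AI crawlers appear faster than anyone can list them.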

Content sites, search, and AI

The questions content sites have to answer are:

  • Do I want to be found by users? (Search)
  • Do I want my point of view or content to be represented? (AI)

Saying yes to the first question and no to the second seems like the logical choice if your business model involves monetizing content. The answer will start to get complicated, though, because AI search summaries may increasingly replace pure search. Google's Q2 earnings results seemed to show that the search business was robust and healthy, but Ben Thompson at Stratechery picked up on the accelerating decline in Google's network advertising revenue, from -1% in Q1 to -5% in Q2 [Paywalled]. This part of Google's revenue comes from display advertising in apps (AdMob) and on other publishers' sites (AdSense). Ben speculates that one reason is the lingering effects of Apple's ATT changes (AdMob), and the other may be that Google's AI-generated search answers are already sending less traffic to websites (AdSense).

If that is the reason, being "included in search results" might lead to diminishing gains over time, and referrals from AI results might be the only way to compensate for the lost traffic.

Even if this isn't happening yet, it seems inevitable that AI-generated results will start to be a meaningful share of search results on Google and elsewhere.

Why do you want search referrals anyway?

Most content sites don't have the clout of Reddit and are unlikely to be able to get AI model creators to pay a premium to access their content for training. That leaves them with the choice to either take relatively low-value deals, simply allow free access to content for training, or block training entirely. Which path a content site takes really comes down to why the content is published in the first place. The primary reasons are:

  1. Getting a point of view heard.
  2. Capturing traffic that can be monetized with advertising.
  3. Capturing traffic that can be monetized in some other way (e.g., selling goods, converting users to subscribers, etc.)

All of these are intertwined, but they are different.

Cultural and scientific sites may care more about their content simply being represented in AI training data.

Advertising-based sites will want not only representation but actual user traffic. They may be most hurt if AI search summaries become a meaningful part of search results. The referrals within AI search summaries will likely end up being paid partnerships, so this will look less like SEO and more like paid search. This is also the reason that Google's search business may not fall off a cliff: as long as the user habit is still to search with Google, Google can monetize that search. For the advertising-based content provider, though, paying increasing amounts to acquire traffic in order to then monetize it with advertising doesn't make a ton of sense.
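A rough back-of-the-envelope sketch of why paid traffic is a hard sell for ad-funded sites. Every number here is hypothetical and picked only to show the shape of the arithmetic: `rpm_usd` is ad revenue per 1,000 page views, `pages_per_visit` is how many pages an acquired visitor reads, and `cost_per_click_usd` is what the referral costs. None of these figures come from Reddit, Google, or any real publisher.

```python
def break_even_cpc(rpm_usd: float, pages_per_visit: float) -> float:
    """Highest price per acquired visit that ad revenue can still cover."""
    return rpm_usd / 1000 * pages_per_visit


def margin_per_visit(rpm_usd: float, pages_per_visit: float,
                     cost_per_click_usd: float) -> float:
    """Ad revenue earned from one acquired visit minus what that visit cost."""
    return break_even_cpc(rpm_usd, pages_per_visit) - cost_per_click_usd


# Hypothetical inputs: $10 RPM, 2 pages per visit, $0.05 per paid referral.
rpm, pages, cpc = 10.0, 2.0, 0.05

print(f"break-even cost per click: ${break_even_cpc(rpm, pages):.3f}")
print(f"margin per acquired visit: ${margin_per_visit(rpm, pages, cpc):.3f}")
```

With these made-up inputs the site earns about two cents per visit and pays five, so each acquired visitor loses money. The break-even point moves only if RPM or pages per visit rise, which is why this path favors sites that monetize in some way other than display ads.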

Sites monetizing by selling goods, information, or services will likely be incentivized to provide as much content as possible for AI training. The more AI knows about your product, the more likely it is to surface it in an answer to a question. This looks like a much more complex version of SEO.

Exclusivity

As Ben Thompson highlights in his piece, it isn't public whether Reddit's AI training contract with Google is exclusive. An exclusive deal would mean that only Google would be able to train on Reddit data, and over time, Reddit could switch that exclusive right to the highest bidder. The implications of an exclusive deal would be very serious, however:

  • Reddit's "point of view" would be absent from AI models built by others.
  • Given the value of Reddit's data, other AI providers would be at a disadvantage in their model building.

We should certainly hope content deals are not exclusive, since the world of AI models would otherwise end up as a patchwork of data deals with everyone using different baseline data. Exclusive deals would probably also get major antitrust scrutiny (which is very much top of mind for Google at the moment - see below).

Even if deals are not exclusive, some of the data access may be prohibitively expensive for some AI model builders. Hopefully, data providers will seek pricing models that enable cheaper access for smaller players; otherwise, we'll still see large players locking in data advantages.

The impact of robots.txt

For individual content publishers, from small sites to large publishers, there are difficult choices to make around what access to allow for free and what to try to monetize (or outright block). A few conclusions from the current state of play are:

  • It almost certainly still makes sense to allow crawling for search indexes, and search engine providers should be pushed to decouple search-indexing agreements from AI-training agreements. "Allow me to train my AI on your content, or I won't index you" is a bullying tactic.
  • Google's search business may well thrive in an AI world. However, traffic to the rest of the open web from Google (and other search engines) will drop significantly. Content sites will be forced to consider whether they need to open up content for AI training in the hope that this becomes a new stream of visitors. Google's network revenue (AdSense, AdMob, etc.) may also decline further and would have to be made up from AI search.
  • Content sites that monetize via ads and rely on organic search traffic have particularly challenging times ahead.
  • If exclusive deals to provide training data to only a single model builder become common, they will lead to an extremely fragmented set of models and may also lead to antitrust challenges.

Hopefully, some reasonable common best practices will begin to emerge so that content providers don't have to each individually craft a strategy.

Google's TAC Antitrust loss

All these issues are further sharpened by the US district court's ruling last week that Google holds a monopoly over Internet search and is illegally perpetuating this monopoly through its partnership deals with companies like Apple. The ruling centers on the fact that Google pays a number of web browsers and platforms, including Apple, Samsung, Firefox, and others, for the privilege of being the default search engine for web searches. The largest payment, to Apple, is reportedly around $20B per year.

Google will almost certainly appeal the ruling and may successfully overturn it.

However, if that does not happen, the ruling will have far-reaching consequences. The remedies the court imposes could vary widely, but they would almost certainly include a ban on paying these fees. For a time, Google would still be the default search engine, but the change could indeed unleash competition.

Some of the beneficiaries are themselves large companies: would Apple adding its own search engine as a default not also be "bundling"? Is having Microsoft Bing simply replace Google good for search engine innovation?

Even if some of the non-Google entities pay, it seems unlikely they will pay anything close to what Google was paying, since they monetize those searches far less effectively.

One thing the case will almost certainly do is make search engine and AI model makers cautious about "exclusive" content deals if they have any scale. They will want to make them exclusive in order to differentiate, but if they reach a certain scale, they will risk Google-like judgments against them. Overall, this may also depress the price of unique content. Hopefully, content providers can make it up in volume.
