Listen Up, Publishers - Block the GPT Bot!

No content ingestion without consent!

Feb 13, 2024

Consent matters, folks.

There’s been a massive backlash against AI over the last year - and rightfully so. Here’s a smattering of both legal and opinionated blowback headlines and accompanying content against AI companies:

The CEO of News Corp, Robert Thomson, had a mighty fine quote regarding the issues swirling around OpenAI:

“Courtship is Preferable to Courtrooms” - Robert Thomson

I do want to preface the rest of this post by saying that I am, indeed, in the tech industry. I also (very often) leverage AI tools and services. There is a place for them to exist. But there is a very unhealthy and unethical relationship that these AI companies (OpenAI, Google, Midjourney, The Browser Company, etc.) have fostered.

What jumpstarted this article was a concerning release of a new product from The Browser Company coupled with an interview with their CEO, Josh Miller, on The Verge’s podcast (The Vergecast).

The product that was released was another mobile web browser (Arc Search). But in this circumstance, it’s not just another mobile web browser. Some concerning quotes from Josh Miller really wrap up my problems in a nutshell:

“My belief…anything that transforms our world…has very positive things that happen and very negative things that happen. Airbnb is one of my favorite products and experiences - it’s ruining a lot of cities. The same is true for almost everything that changes how we live our lives…for LLM…the same is true here…it’s absolutely true that Arc Search…is objectively good for the vast majority of people and it’s absolutely true it breaks something…the value exchange.”

“I think the answer is it will do more positive than negative…we need to find a way to get content creators and publishers paid.”

Josh Miller, CEO of The Browser Company

The problem is they don’t have an answer on how to pay content creators or publishers. They’re marking that as part of the product lifecycle for them to eventually figure out. They’re kicking the can while acknowledging it’s a problem.

This continues to be a problem with OpenAI, Midjourney, and others. In particular, they aren’t transparent whatsoever about their training data, nor are they transparent about how both public and protected (see: requires a subscription, authentication, etc.) content is regurgitated. They’ve taken a forgiveness-over-permission approach. And while the New York Times has both the legal and financial power to push back, starving artists and other content creators don’t.

Now, back to Arc Search and The Browser Company. A big issue is that the valuation of their company is presumably going up, while the traffic on the sites they steal content from is going down and the advertising dollars are dropping. And the only thing Arc Search is doing about it? Giving attribution. The problem is attribution is not consent. It’s not monetization. Here are the parts of the value prop they’re messing with:

They’re not giving traffic back to the sites
They presumably index certain content, meaning even their bots aren’t picked up when multiple users request similar content
They’re not giving any money - no advertising dollars, no click throughs - zero!
They’re blocking ads by default

And Josh Miller is probably right - it’s better for users. Users want the information immediately, they don’t like ads, they don’t like scrolling past the page break to find their answer under sponsored content. But there has to be some model where content creators and publishers get a kickback every time an article is ingested and regurgitated.

But until there’s an answer, I strongly implore content creators to block the GPT bot:

From Ars Technica:

According to OpenAI's documentation, GPTBot will be identifiable by the user agent token "GPTBot," with its full string being "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)".
The OpenAI docs also give instructions about how to block GPTBot from crawling websites using the industry-standard robots.txt file, which is a text file that sits at the root directory of a website and instructs web crawlers (such as those used by search engines) not to index the site.
It's as easy as adding these two lines to a site's robots.txt file:
User-agent: GPTBot
Disallow: /
OpenAI also says that admins can restrict GPTBot from certain parts of the site in robots.txt with different tokens:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
Additionally, OpenAI has provided the specific IP address blocks from which the GPTBot will be operating, which could be blocked by firewalls as well

More can be found at Ars: Sites scramble to block ChatGPT web crawler after instructions emerge
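
If you want to sanity-check your rules before you deploy them, here’s a minimal sketch using Python’s standard-library `urllib.robotparser`. The `example.com` URLs, the `SomeOtherBot` name, and the `is_allowed` helper are placeholders I made up for illustration, and the extra `User-agent: *` section is my own addition to show that other crawlers stay unaffected. Keep in mind this only tells you what a crawler that honors robots.txt would conclude from the file; it doesn’t enforce anything on its own.

```python
from urllib.robotparser import RobotFileParser

# Block GPTBot everywhere; the "*" section is an illustrative addition
# showing that every other crawler is still allowed.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# The partial-access variant quoted above.
BLOCK_PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs on a placeholder domain.
gptbot_blocked = not is_allowed(BLOCK_ALL, "GPTBot", "https://example.com/posts/hello")
other_bot_ok = is_allowed(BLOCK_ALL, "SomeOtherBot", "https://example.com/posts/hello")
dir1_ok = is_allowed(BLOCK_PARTIAL, "GPTBot", "https://example.com/directory-1/page")
dir2_blocked = not is_allowed(BLOCK_PARTIAL, "GPTBot", "https://example.com/directory-2/page")

print(gptbot_blocked, other_bot_ok, dir1_ok, dir2_blocked)
```

To check your own site, paste the contents of your real robots.txt in place of the strings above and swap in a few of your actual URLs.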

Now, the issue here is that this doesn’t fix content that OpenAI and companies that leverage GPTBot have already scraped/stolen/trained on/indexed/whatever. And it doesn’t fix other AI companies that don’t have a mechanism to prevent their bots from scraping/stealing/indexing content. But it’s a start.

As for Arc Search? I’d really like Josh Miller and/or their company to offer a way for content creators to at least protect their content from being stolen and their monetization models from being zapped. And I have a feeling it won’t happen until there’s severe backlash from the content creation community.

Naming Conventions with Stephen Paul Adams
