Fighting bots is fighting humans

One advantage of having worked on freely licensed projects for over a decade is that I was forced to grapple with this decision long before mass scraping for AI training.

In my personal view, option 1 is almost strictly better. Option 2 is never as simple as "only allow actual human beings access" because determining who's a human is hard. In practice, it means putting a barrier in front of the website that makes it harder for everyone to access it: gathering personal data, CAPTCHAs, paywalls, etc.

This is not to say a website owner shouldn't implement, say, DDoS protection (I do). It's simply to remind you that "only allow humans to access" is just not an achievable goal. Any attempt at limiting bot access will inevitably allow some bots through and prevent some humans from accessing the site. The real decision is where you want to set the cutoff. I fear that media outlets and other websites, in attempting to "protect" their material from AI scrapers, will go too far in the anti-human direction.

I guess there are only two options left:
  1. Accept the fact that some dickheads will do whatever they want because that’s just the world we live in
  2. Make everything private and only allow actual human beings access to our content
Bookmark by Ben Werdmuller on
I've been struggling with this. I'm not in favor of the 404 Media approach, which is to stick an auth wall in front of your content, forcing everyone to register before they can load your article. That isn't a great experience for anyone, and I don't think it's sustainable for a publisher in the long run. At the same time, I think it's fair to try and prevent some bot access at the moment. ...
Reply by Doug Jones on
Molly’s perspective really resonates with me. I like the comparison to open source software, where a freely licensed project could always be used by companies in for-profit products. However, what’s missing from the open web is some standards around licensing with regard to AI models. Just because something is free to read, doesn’t mean it’s free to use in any way, including ingestion into an LLM. ...