The Web Scraping Consent Model Was Always Broken. AI Just Made It Obvious.
When you publish something to the web, you implicitly consent to it being scraped by bots. That is how the web has always worked, and we have never seriously questioned it. AI training has exploited that assumption, breaking the social contract that made nonconsensual web scraping at least somewhat acceptable until now.
robots.txt is a standard created in 1994 to build consent into web scraping. It is a plain text file you place at the root of your site that tells bots not to crawl certain pages. Keep in mind, though, that it is a social contract, i.e. entirely voluntary. There is no enforcement mechanism; bots that want to ignore it simply do. The existence of robots.txt is less a consent mechanism and more an acknowledgment that bots were always going to crawl the web, and the best we could do was ask nicely. That is the foundation the modern web is built on: there is no opt-out for web scraping of your digital works.
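The voluntary nature of robots.txt is visible in how crawlers consume it: a well-behaved bot checks the file before fetching a page, but nothing forces it to. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt content and bot name here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all bots to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler consults the parser before every fetch...
print(parser.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
# ...but a crawler that skips this check faces no technical consequence.
```

The entire mechanism is a lookup the crawler performs against itself. The site never learns whether the check happened.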
This was tolerable for a long time because the scraping was, broadly, in service of humans finding and preserving the original content. Search engines crawled your website so people could discover it. Archivists mirrored pages so they would not disappear. The scrape and the visit were part of the same pipeline: taking your content was how people got directed to your content. It is still bad that consent could not be withdrawn, but it was not obviously predatory at the time.
That deal is no longer on the table. Large language models are trained on enormous quantities of text, code, and art; anything that can be scraped will be scraped. Once training is done, the source is no longer needed. The model does not point back to you, send traffic your way, preserve your original work with attribution, or honor the licenses your work is bound by. It consumed your work completely and has no further use for you. The relationship between scraping and visiting has been severed, and the social contract that made nonconsensual scraping even somewhat tolerable was built on that relationship existing. This allows concentrated corporate power to extract collective labor and convert it into proprietary advantage without contributing back to the commons. I saw the effects of this firsthand in a LinkedIn thread I commented on, where people who contribute to the open culture of the web were deciding to close off their knowledge because of nonconsensual web scraping for AI training.
Copylefted code is a useful place to see how badly the old assumptions have broken down. When you release code under a copyleft license like the GPL, you are not giving it away for any use. You are licensing it under specific terms: anyone can use, modify, and share it, but derivatives must carry the same license, guaranteeing the same user freedoms. It is a mechanism for ensuring that the commons of code keeps growing as a public good. Unfortunately, incorporating code into AI model weights does not legally count as redistribution: a model trained on GPL code can produce outputs that are not bound by the GPL. The mechanism that was supposed to ensure open source stayed open has been broken by web scraping for AI training. Hong Minhee’s article “Is legal the same as legitimate: AI reimplementation and the erosion of copyleft” goes into greater depth on the ramifications and justifications associated with AI training on copylefted code.
This is not just a problem for free software developers. This blog is licensed under CC BY-SA 4.0. I chose that license because I believe knowledge sharing should be open and should benefit the commons rather than be privatized. The “ShareAlike” clause is the mechanism that enforces the “benefit the commons” part: anything built on this work has to remain open too. But a bot can scrape this page and feed it into an AI model, which internalizes the writing implicitly in its weights. The model can then produce something nearly identical to the original without giving anything back. The work I released to grow the commons gets absorbed into a closed system that hurts the very commons it was trained on.
Another thing: the web is not a static archive. It depends on people continuing to create things. If releasing your work publicly means training the systems that continue to privatize knowledge, fewer things will be released publicly. For many people, publishing to the web carried an implicit context: this is for humans to read, find, use, and share. The models are being trained on a snapshot of a world where people had enough reasons to share their digital works openly. That world may soon cease to exist (as the growing relevance of the Dead Internet Theory suggests), creating a feedback loop in which AI is increasingly trained on its own “slop”.
To be clear, this is not an argument that AI models should never be trained on web data. Some people genuinely do not mind. Some actively want their work ingested by AI models. Some might want their work ingested as long as they receive attribution or compensation and their licenses are honored. Many creators want their works to remain freely accessible to humans but want protection or compensation when automated systems extract value at scale. These are all valid positions to hold about your own work. The problem is that a single position is currently assumed for everyone, with no mechanism to hold a different one.
Although this post is about nonconsensual web scraping and not AI training specifically, I have been using the latter as a running example because it is arguably the starkest demonstration of how much value can be extracted from the commons at scale through nonconsensual scraping. One possible consent model for AI-training scrapes specifically is the condition that the resulting model be open. How to define “open” in this context probably deserves its own post, but at a bare minimum, the model should have open weights and open training data. I would also like to see guarantees that the model’s outputs be released under equally or more open licenses so that they feed back into the commons.
I have been thinking about what a real solution looks like, and I do not think it is one thing.
The technical piece is making nonconsensual web scraping meaningfully harder. It will never be perfect, but raising the cost makes bypassing consent less economical, which in turn encourages scrapers to actually seek consent. One way to achieve this is to let humans access content freely and easily while requiring bots to pay per crawl via HTTP 402 Payment Required.
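A minimal sketch of what such a pay-per-crawl gate could look like. The user-agent heuristic and the `payment_token` header are illustrative assumptions, not an existing standard; real bot detection and token validation would be far more involved:

```python
from typing import Optional

# Hypothetical substrings used to guess that a request comes from a crawler.
KNOWN_BOT_SUBSTRINGS = ("bot", "crawler", "spider")

def gate(user_agent: str, payment_token: Optional[str]) -> int:
    """Return an HTTP status code: humans pass freely, bots must pay."""
    is_bot = any(s in user_agent.lower() for s in KNOWN_BOT_SUBSTRINGS)
    if not is_bot:
        return 200  # human visitors browse freely
    if payment_token is not None:
        return 200  # paying crawler gets the content (token validation elided)
    return 402      # Payment Required: no token, no crawl

print(gate("Mozilla/5.0 (Windows NT 10.0)", None))  # 200
print(gate("ExampleBot/1.0 (crawler)", None))       # 402
print(gate("ExampleBot/1.0 (crawler)", "tok_123"))  # 200
```

The point of the sketch is the asymmetry: the same URL stays free for people while automated access becomes a priced, and therefore negotiated, transaction.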
There also needs to be a legal piece to enforce this. The technical approaches are useful precisely because the legal landscape has not caught up. Creating enforceable standards for automated access to digital works is not new: DMCA Section 1201 already makes it unlawful to circumvent technological measures that control access to copyrighted works. A new RFC and a DMCA amendment establishing a standard for how bots must identify themselves and request consent before scraping copyrighted content would immediately shift the norm of nonconsensual web scraping at the legal level.
None of this requires closing the web or abandoning the free and open culture the web was built from. It requires the opposite: fixing the consent model before its absence erodes the motivation for open culture at all. The web was built on the premise that sharing was worth the loss of control because the thing accessing your work was another person who might learn from it, build on it, link back to it, and keep the commons growing, or who at least compensated you if they consensually privatized your works.
The web never asked whether any of this was okay. It assumed it was. That assumption deserves to be challenged, and it deserves to be challenged now—before the people who would have shared things publicly decide it is not worth it anymore.