The Web Scraping Consent Model Was Always Broken. AI Just Made It Obvious.
When you publish something to the web, you implicitly consent to it being scraped by bots. That is how the web has always worked, and we have never seriously questioned it. AI training has exploited that assumption, breaking the social contract that made nonconsensual web scraping at least somewhat acceptable until now.
robots.txt is a standard created in 1994 to build consent into web scraping. It is a file you place at the root of your website that tells bots not to crawl certain pages. Keep in mind, though, that it is a social contract, i.e. entirely voluntary. There is no enforcement mechanism; bots that want to ignore it simply do. The existence of robots.txt is less a consent mechanism and more an acknowledgment that bots were always going to crawl the web, and the best we could do was ask nicely. That is the foundation the modern web is built on: there is no opt-out for web scraping of your digital works.
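For illustration, here is a minimal robots.txt (the paths and bot name are hypothetical). Note that nothing about it is enforced: a scraper that chooses to ignore it faces no technical barrier at all.

```text
# Ask all bots to stay out of /drafts/; ask one specific bot to stay out entirely.
# This is a polite request, not a lock. Compliance is entirely voluntary.
User-agent: *
Disallow: /drafts/

User-agent: ExampleBot
Disallow: /
```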
This was tolerable for a long time because the scraping was, broadly, in service of humans finding and preserving the original content. Search engines crawled your website so people could discover it. Archivists mirrored pages so they would not disappear. The scrape and the visit were part of the same pipeline—taking your content was how people got directed to your content. It is still bad that consent could not be taken away, but it was not obviously predatory at the time.
That deal is no longer on the table. Large language models are trained on enormous quantities of text, code, and art—anything that can be scraped will be scraped. Once training is done, the source is no longer needed. The model does not point back to you, send traffic your way, preserve your original work with attribution, or honor the licenses that your work is bound by. It consumed your work completely and has no further use for you. The relationship between scraping and visiting has been severed, and the social contract that made nonconsensual scraping even somewhat tolerable was built on that relationship existing.
Copylefted code is a useful place to see how badly the old assumptions have broken down. When you release code under a copyleft license like the GPL, you are not giving it away for any use. You are licensing it under specific terms: anyone can use, modify, and share it, but derivatives must carry the same license, guaranteeing the same user freedoms. It is a mechanism for ensuring that the commons of code as a public good keeps growing. Unfortunately, incorporating code into AI model weights does not legally count as redistribution. A model trained on GPL code can produce outputs that are not bound by the GPL. The mechanism that was supposed to keep open source open has been broken by web scraping for AI training. Hong Minhee’s article “Is legal the same as legitimate: AI reimplementation and the erosion of copyleft” goes into greater depth on the ramifications and justifications associated with AI training on copylefted code.
This is not just a problem for free software developers. This blog is licensed under CC BY-SA 4.0. I chose that license because I believe knowledge sharing should be open and benefit the commons rather than privatized. The “ShareAlike” clause in this license is the mechanism that enforces that “benefit the commons” part—anything built on this work has to remain open too. But a bot can scrape this page and feed it into an AI model which will internalize the writing implicitly in its weights. It can then produce something nearly identical to the original work without giving back to the commons that the work was built on. The work I released to grow the commons gets absorbed into a closed system and hurts the very commons that the AI was trained on.
Another thing: the web is not a static archive. It depends on people continuing to create things. If releasing your work publicly means training the systems that privatize knowledge, fewer things will be released publicly. For many people, publishing to the web carried an implicit context: this is for humans to read, find, use, and share. The models are being trained on a snapshot of a world where people had enough reasons to share their digital works openly. That world may soon no longer exist (as the growing relevance of the Dead Internet Theory suggests), creating a self-reinforcing loop in which AI is increasingly trained on its own “slop”.
To be clear, this is not an argument that AI models should never be trained on web data. Some people genuinely do not mind. Some actively want their work ingested by AI models. Some want their work ingested only with attribution, compensation, and their licenses honored. These are all valid positions to hold about your own work. The problem is that the web currently assumes a single position for everyone, with no mechanism to hold a different one.
Although this post is about nonconsensual web scraping rather than AI training specifically, I have been using the latter as a running example because it is arguably the starkest demonstration of how much can be taken from the commons at scale through nonconsensual scraping. One consent model for AI-training scraping specifically would be the condition that the model be open. How to define “open” in this context probably deserves its own post, but at a bare minimum, the model should have open weights and open training data. I would also like to see guarantees that the model only be used to generate content that will be licensed under similarly open licenses, so that it feeds back into the commons.
I have been thinking about what a real solution looks like, and I do not think it is one thing.
The technical piece is making nonconsensual web scraping meaningfully harder. It will never be perfect, but added friction makes bypassing consent less economical, which encourages web scrapers to actually seek consent. One approach is building private, anonymized human verification into the infrastructure of content distribution: humans can access the content freely, while bots must seek consent, which could be granted for a per-crawl fee, denied outright, or made subject to some other condition.
Here’s what this technical piece could look like:
Human-Gated Repository & Content Distribution Protocol
A human-first system for distributing digital works while requiring consent for automated extraction and enabling fair compensation.
Overview
I propose building a self-hostable, human-first repository hosting platform focused on protecting digital works from large-scale automated extraction while keeping access free and open for humans. The platform will likely use git for repository distribution and be built on top of a git hosting engine like Forgejo, extended with additional bot detection and private, anonymized human-verification layers. People and organizations can operate their own hosted instance, which can be monetized to support sustainability. Monetization can come from creators paying per unit of data stored in the instance and/or by taking a cut from pay-per-crawl described later in this document.
Although the system will initially focus on hosting code repositories, the broader goal is to support all forms of digital works: art, datasets, media, blogs, and more.
The platform integrates bot detection, rate limiting, and private, anonymized human verification directly into the hosting layer, rather than relying on unenforceable conventions like robots.txt. The goal is not perfect prevention but meaningful friction against automated scraping at scale; making nonconsensual bot scraping uneconomical achieves much the same effect as perfect prevention.
Purpose
Protect human knowledge from extractive automation
- Openly shared creative works are increasingly absorbed into closed AI systems without meaningful reciprocity. This allows concentrated corporate power to extract collective labor and convert it into proprietary advantage, without contributing back to the commons.
- This project aims to make nonconsensual automated mass extraction harder, slower, and economically unviable, while keeping human access open, free, and easy.
Enable fair compensation without restricting humans
- Many creators want their work to remain freely accessible to humans, but also want protection or compensation when automated systems extract value at scale. This platform enables an optional pay-per-crawl API for bots, allowing creators to monetize automated access without restricting public human access.
- I aim to create a de facto standard for bots and AI companies to identify themselves and ethically license content for ingestion. By providing a clear, machine-readable path to licensed access, we shift the norm from “scrape until blocked” to “seek permission and compensate.”
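To make that machine-readable path concrete, it might look like a manifest a bot fetches before crawling. This is purely a sketch: the path /.well-known/crawl-terms.json and every field name below are assumptions for illustration, not an existing standard.

```json
{
  "human_access": "free",
  "automated_access": "payment-required",
  "price_per_request_usd": 0.01,
  "payment_endpoint": "/api/pay-per-crawl",
  "content_license": "CC-BY-SA-4.0",
  "ai_training": "open-weights-and-open-training-data-only"
}
```

A bot that honors the manifest knows exactly what permission costs; a bot that ignores it is unambiguously acting without consent, which is what the legal piece discussed later could attach consequences to.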
Preserve user freedoms and copylefted/ShareAlike works
- Once a human obtains a work, they retain all rights granted by that work’s license. Access controls apply only to how content is obtained, not what users may do with it afterward.
- If copylefted/ShareAlike works are ingested by an AI model, the model can generate outputs that are not legally bound by the original license, because incorporating a work into a model’s weights does not legally count as redistribution. Human-gated access helps ensure that copylefted/ShareAlike works continue to propagate freedoms in practice, rather than being exploited to produce derivatives that shed those licenses without contributing back to the commons.
Encourage sustainable open culture
- By reducing fear of uncompensated AI extraction, creators may feel safer releasing valuable work publicly for humans instead of retreating behind private distribution, restrictive licenses, or closed platforms.
High-level Technical Concept
Human-first access
- Humans only need to pass a verification when creating an account on the instance’s website. Verification must be private and anonymized. There should be technical measures to prove that data that could be used to identify the user are not shared with the instance.
- After an account is created, users may browse any digital works on the instance’s website through a browser.
- Accounts may have SSH public keys associated with them to allow users to `git clone`, `git pull`, and `git push` digital works that live on the instance.
Built-in bot resistance
- Automated scraping and bulk ingestion are actively discouraged through:
- CAPTCHAs
- Anubis
- Rate limiting
- Private, anonymized behavioral analysis
- Progressive challenges
- Abuse detection
- The system targets non-human behavior without meaningfully degrading normal human usage.
Pay-per-crawl
- Bots may optionally request structured access via a paid interface.
- Pay-per-crawl will be enforced through HTTP 402.
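A minimal sketch of how that HTTP 402 gate could behave. The price and the header names are purely illustrative assumptions; no such standard exists yet.

```python
# Hypothetical pay-per-crawl gate: verified humans and paying bots get the
# content; unpaid bots receive HTTP 402 Payment Required with pricing hints.
PRICE_PER_CRAWL_USD = 0.01  # assumed price; a real instance would configure this


def handle_request(is_verified_human: bool, has_paid_crawl_token: bool) -> tuple[int, dict]:
    """Return an (HTTP status code, extra headers) pair for an incoming request."""
    if is_verified_human or has_paid_crawl_token:
        return 200, {}
    # 402 signals that automated access requires payment; these header names
    # are illustrative, not an established convention.
    return 402, {
        "X-Crawl-Price": f"{PRICE_PER_CRAWL_USD} USD",
        "X-Crawl-Payment-Endpoint": "/api/pay-per-crawl",  # hypothetical endpoint
    }
```

The key design point is that the gate never conditions anything on *what* the content is, only on *how* it is being accessed, which keeps human access unrestricted.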
Initial Product Direction
The initial product will be a git hosting platform with built-in bot detection and human verification.
This provides:
- A realistic and well-respected deployment surface (`git` is a very mature piece of software and provides credibility to the protocol).
- Immediate usefulness to developers and open-source communities.
- A strong testbed for adversarial scraping behavior.
- A natural on-ramp for hosting many types of digital works, as `git` is not specific to code.
Over time, the system may expand beyond git-specific workflows, but git provides the fastest path to development and adoption.
There also needs to be a legal piece to enforce this. The technical approaches are useful precisely because the legal landscape has not caught up. Creating enforceable standards for automated access to digital works is not new: DMCA Section 1201 makes it unlawful to circumvent technological measures that control access to copyrighted works. A new RFC and a DMCA amendment establishing a standard for how bots must identify themselves and request consent before scraping copyrighted content would immediately shift the norm of nonconsensual web scraping at the legal level.
None of this requires closing the web or abandoning the free and open culture that the web was built from. It requires the opposite: fixing the consent model before it erodes the motivation to have open culture at all. The web was built on the premise that sharing was worth the loss of control because the thing accessing your work was another person who might learn from it, build on it, link back to it, and keep the commons growing—or at least compensated you if they consensually privatized your works.
The web never asked whether any of this was okay. It assumed it was. That assumption deserves to be challenged, and it deserves to be challenged now—before the people who would have shared things publicly decide it is not worth it anymore.