Ask HN: Should AI trained on publicly available data be banned?
The creators of DALL-E and GitHub Copilot -- and people in favor of training AIs on publicly available data in general -- argue that people have been studying and making derivative works since the dawn of history, so there should be no problem training these models on public art and open-source software. Many IP and copyright laws and precedent cases address whether a derivative work is allowed, and the images and code snippets these AIs generate seem to satisfy all of them. They are trained on publicly available sources, are not identical copies, are not clearly attributable to any original author (unless explicitly prompted to be), and the AI model is incapable of forming a clear intention to copy when the work is generated. These are all valid points, and the lawyers at Microsoft and OpenAI can certainly argue in favor of them better than I can.
The problem is that these laws were created when producing derivative works was not scalable.
Just a year ago, to create a painting you had to spend years learning how to paint, studying the works of the old masters, and making and improving your own works. Each new painting takes less time than the last, but the hours dedicated to creating a new work, whether a masterpiece or a book cover, are never insignificant. Creating a variation takes almost as much time. To change to a new style you have to start the learning process all over again. It is hard, but you put your personal time and effort into it. If your work was inspired by others, then as long as it is not a blatant copy, the original authors will likely feel appreciated, even if they do not like the result.
Now all it takes is a text prompt and a few clicks, and you can create images in any style, from any author, with as many variations as you want. Code snippets generated by Copilot are not currently in the same league, but I think we can all agree it is only a matter of time before a whole project can be created from a requirements prompt, in the same style as the authors of the individual packages the AI was trained on.
The difference is industrial-scale harvesting of creative effort. Somebody admiring your works and spending time and effort trying to learn from you is okay: if they become famous and make a lot of money, you will be credited as their inspiration and at least an influence on their success. An AI blindly scanning, blending, and mass-producing altered works to make a corporation rich is not; in the corporation's eyes the original authors, from the old masters to modern amateurs, are no different from unpaid slaves producing textiles in a mill somewhere, to be fed into a machine that churns out mass-produced clothing.
Sure, all the training data are taken from publicly available sources with permissive licenses, but those licenses were created when we still thought the creative arts would be the last AI frontier, which wasn't even that long ago. If you could have foreseen that the license you chose back then would allow a corporation to profit from your work, would you have made the same choice?
So I would argue that from now on, training AIs on publicly available data must be banned, except where the work is in the public domain or where the license explicitly allows the work to be used as training data for AI models. If such a law is not passed, current permissive licenses such as Creative Commons should be revised to include a clause addressing this point, letting the author decide whether or not to let AIs train on their work.
I feel that having such a law or licensing convention established as soon as possible will greatly benefit humanity in the long term. In the short term we all want more free content, but once AIs start taking over all the jobs now performed by humans, the creative arts will be the only source of meaning left for us.

The solution lies in denying authorship rights (i.e., natural rights of authorship) to programs. This would prevent claiming copyright on AI-generated content. A fundamental issue with granting authorship rights over AI-generated content to the AI or program itself is: what happens when two people use the same AI and, with similar prompts, generate two different works that also look similar? Are they each infringing on the other? This issue is unsettled, and the case law is evolving in various jurisdictions. Case law from India:

Aug 2021: India recognises AI as co-author of copyrighted artwork
https://www.managingip.com/article/2a5czmpwixyj23wyqct1c/exc...

Dec 2021: Indian Copyright Office issues withdrawal notice to AI co-author
https://www.managingip.com/article/2a5d0jj2zjo7fajsjwwlc/exc...

Regarding your last paragraph: it is possible that a new law or licensing convention will benefit some, but I can't think of any existing law or licensing convention that has greatly benefited humanity. In all the examples I can come up with, the bureaucrats enforcing the regulations seem to gain the most. As for AI taking over jobs, that argument has been made many times before, with each new technology. Think of the farriers and buggy-whip manufacturers displaced by the automobile, or the typing pools that vanished with fax machines and later email. More specific to your concerns: in the past, the printing press, photocopiers, cameras, and many other inventions were seen as threats to the artist, but the creative ones found a way to work with the new tech and did just fine. I think your heart is in the right place, maybe leaning a bit further to the left than mine, but in any case I wouldn't be too worried about the creative arts. When art becomes common and its value is diminished, creative people find a new way!

The first side effect that comes to mind is Google. It scrapes public data and uses some AI in search. But I think you might be on to something with derivative work: perhaps redefine "derivative" to include AI-generated content. Shakespeare or the Mona Lisa would be fine. It might allow training AI search-engine spiders, but not training AI content generation.

What if someone writes Harry Potter fan fiction but puts an open license on it? An AI might not be trained on Harry Potter itself, but it could be trained on millions of semi-legal fan fictions. That's frequently the source of "X in the style of Y" prompts.

I think banning is not a great option. Banning will curb the creativity of the thinker. The problem is that we need to provide a mechanism that rewards contributors monetarily and gives them a form of recognition too.
We need to find a way to extend benefits to all the data contributors. I find the Gitcoin way of developing a product a great way to engage the community and empower folks with an equity-enabled token.
Something similar should happen with data providers too. Foundations, NGOs, or public dataset orgs could just sell the token on the market to people who want to use the AI and benefit from it.

I would be against this. It would push monetization into every little and trivial thing, and the net has developed negatively because of that. There are some exceptions, and it enabled certain content creators to really take off, but it had downsides too. Theoretically you could crawl code on GitHub or elsewhere and develop your own AI; monetization would put that behind a wall. Think of YouTube hobby videos getting banned for music playing in the background as soon as people monetized their videos and the DMCA became widely enforced. It is too abstract a problem for most people to understand the value of data, because they just look at the presumably uninteresting data they share. But you can consent to such usages, of course. I don't want foundations or NGOs to own the data either; that would be exceptionally horrible. Perhaps that's because I have horrible foundations in mind right now, but there is a lot of bullshit in this space. Overall it would be a worse situation than the status quo.

I agree that banning isn't an option either. But a mechanism to withdraw consent for your data being used in AI training seems reasonable. Gitcoin would probably work for developers, who are highly paid already and whose code has high potential of being used in profitable projects. It would work less well for artists, who earn a fraction of developers' salaries and whose art is already being treated like a commodity.

I am not sure it will work, but it is worth experimenting with. Probably there would be certain limitations on free usage before the tokens kick in. I feel there has to be an easy way to earn digital rights, digital equity, and digital money for the artistic minds in this fast-proliferating digital era.

I think this would have some significant unintended side effects.
For example, is there any real difference between creating an index of millions of pictures and using ML to combine them into a new picture, and creating an index of millions of websites and using ML to combine them into a search results page? Without being very careful about how you regulate the use of "public data", you could end up accidentally killing the internet.

The differences I can see are these: with websites you can set a robots.txt file to opt out of indexing, and if a crawler keeps indexing anyway there are mechanisms to block it, both technical (e.g. IP restrictions) and legal (you can sue the indexer). The average website owner is also more technical and can understand the benefits and mechanisms, so presumably if you let your site get indexed it's because you want to. And even then, what you want is for the search results page to drive traffic back to your site, not for another site to be generated somewhere else and profited from by others. With images, the artists are generally not technical and savvy enough to know how to opt out, there is currently no mechanism to opt out of indexing once an image is uploaded to the internet, and the generated images do not increase awareness of the artist or drive traffic to their website in any way (if they even have one); plus the explicit goal of the resulting images is to make money for the model owner and the prompter.
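To make the robots.txt opt-out concrete: a well-behaved crawler checks the site's rules before fetching anything. A minimal sketch using Python's standard-library `urllib.robotparser` (the rules and URLs below are made-up examples, not from any real site):

```python
# Sketch: honoring a robots.txt opt-out before crawling a page.
# The rules and URLs here are illustrative placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A real crawler would call rp.set_url("https://example.com/robots.txt")
# followed by rp.read(); here we parse the rules directly so the
# example is self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
```

Nothing technically forces a crawler to run this check, which is exactly the asymmetry described above: websites at least have a convention to express "do not index me", while individual images have no equivalent signal once uploaded.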