Cessation of public development of Kefir C compiler
kefir.protopopov.lv
> Yet, this shift made me re-evaluate the open source code publishing. Prior to that, I have been positive about free and open software, and considered this to be the default mode for work such as kefir. I did not require any justifications from myself to publish something. Now, however, I feel more and more that the main beneficiaries of my unpaid work are companies scraping the internet to train large language models. Currently accepted status quo in this area goes against my own intentions in licensing this work under GNU GPLv3. Publication has ceased to be the "null hypothesis" for me, and requires explicit mental justification which I am not able to provide.
I feel this pain, one of my small donation driven sites has been destroyed by crawlers who just ignore robots.txt and burn the site into the ground.
Sort of jokingly I proposed an update to the "spam fax" law:
This is essentially the digital world transforming from a high trust society into a low trust one. Sad to see.
I don't think the digital world was ever high trust. I mean, everyone above a certain age is trained to never click on the biggest download button on a page and to uncheck any checkboxes during an installation. A certain open source forge used to bundle malware in downloads. You can't walk two steps without hitting cloudflare. All email providers consider random VPS IP ranges to be spam farms. All web servers with public IPs must be up to date or you get pwned instantly and assimilated into a bot farm.
I can go on and on about how many safety measures we have been taking online for ages and how little trust we have for anything that comes through an Ethernet port. I have never needed such levels of vigilance in real life even though I live somewhere with higher crime rates compared to the national average.
Not even just digital; much of the world is shifting from high trust to low trust as well: https://social.desa.un.org/sites/default/files/inline-files/...
This is kind of crazy. The digital world has never been a "society" except perhaps for the first few years after ARPANET was invented, and it certainly hasn't been a high-trust one for almost as long - we've had spam filters, user account registration required to comment, various authentication methods, moderation, and various things you get in a low-trust environment for decades now. To think otherwise is a bit delusional.
There are currently a lot of people in the upper echelons of our society who repeatedly and vigorously abuse the high-trust digital world.
We based all of this on gentlemen's agreements and handshakes. That let quite a few people get only very wealthy, instead of hyper-wealthy. Thus those agreements have to be shredded.
AP mentions this in the link:
> Section 227(g)(4). Enforcement. Statutory damages of not less than $500 per server request made in violation of this section, consistent with the per-violation damages established under the original Act for unsolicited facsimile transmissions.
While this is at least something, it's not going to dissuade a startup from doing this sort of thing. They'll find ways to hide the origin of traffic, or just soak up the costs with more VC money.
You need to start throwing people in prison for long periods of time (10+ years) for this sort of thing to stick.
To whom would you attribute the greater part of that reduction in trust: the people using FOSS to train LLMs, or the people trying to block them?
People who break the social contract are the ones responsible for breaking the social contract, not the ones who take steps in response to the social contract being broken.
So the questions here are (a) is any generally accepted social contract actually being broken, and (b) if so, who are the ones who are breaking it?
The contract behind open source was something like (GPL):
"If you copy my work, you should share your work too."
or at minimum (MIT):
"If you copy my work, you should credit me."
I think it is no longer under dispute that the legal contract is satisfied by LLMs. The AI companies won and will continue to win.
But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way. What did the authors mean by "copy"? Did they mean literally CTRL+C, CTRL+V or something broader?
This is a matter of opinion which only each individual creator can answer. For me, copying meant something like:
"To reproduce the function of my work, dependent on my having published it, without effort nor understanding of your own"
Ten years ago this basically required doing a CTRL+C, CTRL+V so there was no need to be more specific. Anybody who did enough work to, say, rewrite in another language (with that language's idioms), met the bar of clause 3. Now AI enables a form of "copying" that matches my definition, without the user even being aware of whose works they are copying. It perfectly launders the origins of its output. It can write an FFmpeg clone in Rust for you that would appear to be a novel work.
Of course, I cannot say that my own little bits and pieces of open source code would make a scratch in AI's capability, were it removed.
But I do strongly believe that if all the code that was published by authors with the same mindset was unavailable, Claude would be a far weaker developer.
> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.
Perhaps this illustrates a fissure that was always lurking under the surface, then. The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate, and that licenses like the GPL were meant to use copyright law to achieve something that resembles the state of affairs that might exist if copyright didn't exist in the first place. That's what the whole concept of "copyleft" always seemed to imply.
Now we have a new class of technologies that is admittedly fraught with a wide range of risks and pitfalls, but also a lot of promise to enable people to actually put the "four freedoms" into practice in ways they couldn't before, and we're seeing people who have normative opinions about AI derived from other, unrelated principles trying to circle the wagons and exclude those use cases. That is what seems like a breach of the social contract as I've always understood it.
> Did they mean literally CTRL+C, CTRL+V or something broader?
Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else. "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to, and the whole point of "copyleft" was to lessen the restrictions on even that.
> "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to
This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.
A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a "derivative work".
A training set is just an anthology, and the training process is condensation. That makes the weights a derivative work of every work in the training set.
Now, there's a separate discussion to be had about whether that derivative work meets the criteria for fair use, but that's its own tangent.
> This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.
A derivative work is a work that itself includes copyrighted content from the original work.
That is to say that for something to be a derivative work, some measure of its content must be "CTRL-C, CTRL-V" from the originating work.
Something that's merely inspired by another work, or draws underlying themes or factual knowledge from it, is not a derivative work.
> A training set is just an anthology,
Which might make the training set itself a derivative work, but works created by using the model trained on that anthology are a different matter.
> and the training process is condensation.
No, it isn't. It's the creation of a new work that represents patterns extrapolated or interpolated from the data set, without the resulting model actually including any of the copyrighted elements of the work.
The underlying ideas and facts in the original work were never protected by copyright. Only the specific fixed form of expression is copyrightable.
Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he examined. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.
> A derivative work is a work that itself includes copyrighted content from the original work.
If you put a GPL C program through Emscripten to run in a browser the output doesn't include the original C code but it's surely a derivative work.
> Someone who looks at a dozen code examples in public repos to learn how to do e.g. a quick sort, then upon understanding the logic flow of the quick sort algorithm, writes his own quick sort implementation is not creating a derivative work of the code in the repos he examined. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.
This is undoubtedly the core of the disagreement. Humans can learn from what they have seen, appreciate it, understand it, and draw on that experience in what they create. They do this without being considered ripoff artists, so why not machines that simulate the "same" thing automatically?
To me the answer is simply that humans are special. Human thought and human effort makes it creativity when a human does it, copying when a machine does it. It's a double standard I am perfectly willing to accept. I am unabashedly biased in this regard.
That may seem remarkably unfair to the machines, or like a cop-out. I just carved out a hardcoded special case for humans, and my whole philosophical reasoning is "because I said so". But how fair do we want to be? After all, if you want to treat a machine exactly like a human who learns from prior art to create new art, then the ownership of the new art would also belong to the machine. Not to the person who prompts it.
> If you put a GPL C program through Emscripten to run in a browser the output doesn't include the original C code but it's surely a derivative work.
Because it does include content from the original work -- this is just a translation, and isn't comparable to how LLMs work.
> To me the answer is simply that humans are special.
I don't disagree, but I also view LLMs as tools that extend human capacities and not autonomous entities unto themselves. LLMs are still just software, and can't really be regarded as anything other than instruments that humans use to broaden their capacity to see, appreciate, understand, and draw on that experience in what they create.
> That may seem remarkably unfair to the machines, or like a cop-out.
No, it's unfair to the humans. The machines are just tools that they use. The "double standard" is really a set of inconsistent standards applied to the same underlying moral agents.
> After all, if you want to treat a machine exactly like a human who learns from prior art to create new art, then the ownership of the new art would also belong to the machine. Not to the person who prompts it.
No, it always belongs to the person who prompts it. The machine is not a conscious entity, bears no intentions, and has no capacity to act on its own initiative. The machine is always just a tool that extends human capacity, as all machines always have.
For a good comparison here, we've never not credited a photographer as the author of a photograph. But the photographer is in a sense merely prompting the camera by framing the shot, selecting the exposure, adjusting the lighting, etc. -- the hard work in actually creating the photograph is being done by the camera itself, with the photographer playing no role in directly constructing the final image, and with many of the qualities of the final image being determined by pre-existing features of the camera's functional design and components that the photographer also played no role in defining, apart from choosing which camera to use.
LLMs are like cameras in this way. And the fact that they rely on external data for model training no more disclaims the user as the author of the resulting work than looking things up in a dictionary or encyclopedia does the same for the author of an essay.
Perhaps the future will be less Idiocracy and more Futurama, with humans and robots living socially together.
> Perhaps this illustrates a fissure that was always lurking under the surface, then(...)
Yes, I do think there has always been such a fissure. People publish OSS code for many reasons, often a blend of multiple reasons. There are selfish reasons such as the desire for one's work to be recognized, or even the hope of getting better employment through showing one's skill or making something companies will pay for support on. There are social reasons like the desire to collaborate with others. There are altruistic benefit-of-all-mankind reasons, as Richard Stallman put it: "...restrictions reduce the amount and the ways that the program can be used. This reduces the amount of wealth that humanity derives from the program."
It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style. But even adherents of that style, I think, are not too happy with AI consumption of their code. For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared. And for two, if your idea of OSS is that we all put our contributions into the great shared river of human achievements to benefit the world, it is disappointing to see that river funneled into a giant waterwheel of profit for a half dozen trillion dollar companies charging rent for its bounty.
> Given that FOSS licenses were always constructed to function within applicable copyright law, I don't see how they could mean anything else.
I agree from a legal standpoint. I cannot enforce my personal definition of copying nor do I expect that to become possible. It was just conveniently aligned with the reality of how copying software worked in the past, and no longer is and never will be again. That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.
> It sounds like your view of things is limited mostly to that last version of FOSS, the copyleft style.
No, I'm well aware of the different motivations for and approaches to FOSS. I'm mostly focusing on the copyleft/GNU GPL side of the discussion here because that's the side of the house where most of the ideas of a social contract and the desire to see a specific ecosystem develop have been located. People on the MIT/BSD side of things, which has always had a much more direct "do whatever you want" ethos, are not the ones I'd expect to be making these arguments in the first place.
> For one, it allows laundering of the copyleft license so their work goes into closed-source products that are never shared.
I'd agree that someone using an LLM to create a deterministic transcription of someone else's work is indeed violating the license. But I think the argument goes beyond that, into using LLMs in any way at all.
> That doesn't mean I will be writing OSS software with a new made-up unenforceable license. It just means, like OP, I'll weigh differently whether I want to bother releasing stuff at all.
That's a reasonable position, and from the perspective of examining whether the current LLM climate is sapping motivation to participate in FOSS, I can understand where you're coming from.
But to that point, I'd argue that if your motivation was to gain recognition, participate in a community, etc. then you're going to lose those things by keeping your code private anyway, whereas you won't necessarily lose those things just because an LLM was trained on your code. If you contribute to a popular project, people were almost certainly already using your work to do things you don't approve of -- if that didn't take away your motivation, why would LLMs do much worse?
> The social contract that I've personally always attributed to FOSS communities was that attempting to restrict how people downstream of you use code is illegitimate,
That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits")?
Which part of which open source licenses gave you the impression that there were no restrictions?
> That's wrong. What on earth gave you that impression when the licenses specifically set constraints on what downstream can do (from "release derivatives as open" to "put me in the credits")?
These are restrictions on redistribution, not use. And they're there to make sure that derivative works can't themselves impose restrictions on use.
One correction: the point of copyleft was to exploit the restrictions in order to ensure that it would be possible for everyone to copy the software.
> "If you copy my work, you should share your work too."
Not exactly. The GPL way is that you should share my work under the same terms if you want to share it, even if modifying it.
You are not required to share anything if you don't actually share anything, and just run it yourself. That's where all the criticism towards cloud providers who freely use FLOSS is directed.
> But we are talking about a social contract, which is not quite the same thing. The social contract is what leads some devs who previously enjoyed publishing their work openly to no longer feel the same way.
There is clearly a misalignment in expectations from some FLOSS enthusiasts. The main FLOSS licenses focus exclusively on distribution, but their expectations somehow extend well beyond distribution. We hear those FLOSS enthusiasts criticize and attack companies for using software exactly according to their terms, and somehow that is framed as abuse if said users happen to be bigger than some arbitrary boundary.
No one consented to training LLMs. As the OP clearly implies, if they had been asked, they would have declined. So would all of the many copyright holders who are in the process of suing the model companies.
Are you asking how AI coding agents, the companies selling them and the individuals using them break the FOSS social contract (copyleft, attribution, upstreaming), or are you disputing that they do?
Both would resolve to the same question, no?
There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code. I've yet to encounter any strong arguments substantiating this premise as a general principle, and my own suspicion is that it is not valid as a general principle, given the nature of how LLMs operate.
It's certainly possible that specific instances of LLMs lazily copy-pasting code from public repos may exist, and the extent to which this is happening is something that can be substantiated by empirical examples, so if you have any to point to, I'd be interested in looking at them. However, where this is happening, it ought to be regarded as a failure modality of LLMs, and not something that implicates the underlying nature of LLMs, given that their intended purpose is to function as stochastic generators that do not merely copy-paste input data.
My initial feeling here is that using open-source code to train LLMs is not per se a violation of the generally accepted FOSS social contract, but rather that attempting to restrict specific use cases of FOSS-licensed code on the basis of normative opinions unrelated to the license terms is a violation, or at least a rejection, of that social contract. I'm not fully committed to this position, though, and would welcome well-reasoned arguments to the contrary.
> Both would resolve to the same question, no?
Yes, but my answer would be different. It can be either about what coding agents do (and you'll see that it breaks the social contract), or it can be about what the FOSS social contract is (and you'll argue that coding agents don't break it). Lo and behold, it was the latter.
> There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code.
Not any work. But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.
There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?
The social contract of open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintenance, as well as give each other the assurance that the software will remain available to them and any future users.
AI gives the user all the benefits of using open source software with none of the obligations that come from using open source software. The developer gains nothing from going open source. It makes no sense for any developer to go open source. The social contract breaks down, and it's all because AI users didn't hold up their half of the bargain.
> But if a specific work was generated based on a specific open source work, then according to the social contract that binds non-AI code generators such as transpilers, the output is derivative and should follow the license of that open source work.
I don't disagree with the premise that any LLM that is cloning code wholesale from a third-party repo is creating a derivative work, and the license terms apply to it.
But I also don't agree that non-AI code generators such as transpilers are in the same category as LLMs -- a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.
> There's also the question of whether the model itself is a redistribution. For every other lossy compression algorithm in history, the answer is a resounding yes. Is a model meaningfully different from a hypercompressed corpus of its learning data?
The model isn't a lossy compression archive that merely represents a collection of pre-existing works in parallel to each other. It's a probability matrix that relates together uniquely isolatable units of data to each other across the entire collection.
If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.
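To make the Markov chain example concrete, here is a minimal Python sketch of a word-level (bigram) chain. The short excerpt, the function names `build_chain` and `generate`, and the seed are all assumptions chosen for illustration, not taken from any real tool. It only shows the mechanism being argued about and settles nothing about the legal question.

```python
import random
from collections import defaultdict

# Short public-domain excerpt standing in for the full text of Hamlet
# (assumption: lowercased and stripped of punctuation for simplicity).
TEXT = (
    "to be or not to be that is the question "
    "whether tis nobler in the mind to suffer "
    "the slings and arrows of outrageous fortune "
    "or to take arms against a sea of troubles"
)


def build_chain(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, seed=None):
    """Random-walk the chain from `start`, emitting at most `length` words."""
    rng = random.Random(seed)
    out = [start]
    # Stop early if the last word was never followed by anything in the source.
    while len(out) < length and out[-1] in chain:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)


chain = build_chain(TEXT)
sentence = generate(chain, "to", 8, seed=1)
print(sentence)
```

Every adjacent word pair in the output occurs somewhere in the source text, while the sentence as a whole may or may not appear there.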
> The social contract of open source (not to be confused with the legal contract of GPL, MIT etc.) is that developers give users software that they can use and modify in any way they want, and in exchange the users give the developer recognition and help with development and maintenance, as well as give each other the assurance that the software will remain available to them and any future users.
I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed back to the projects whose software they use.
There's always been both a social and legal obligation to properly attribute authors and abide by license terms when redistributing or forking FOSS code, but neither obligation has ever applied when learning programming techniques from FOSS code in order to write your own software. And the way LLMs are designed to work is more similar to the latter than to the former.
But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.
> a deterministic process that is simply parsing input from a single source and outputting it in a new form is not the same thing as a stochastic process that interpolates patterns from multiple sources and then uses those patterns to generate novel outputs.
There are stochastic compression algorithms (e.g. https://github.com/kaydotdev/sqic) and it would be insane to claim they don't produce derivative works. And as a general rule, a work based on multiple other works is derivative of all of them.
> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.
No, but your generated text is also useless if you want to read Hamlet. The danger I'm speaking of is people generating Hamlets but paraphrased - that's a derivative, especially if you use an automated tool that got original Hamlet as its input. Except the Hamlet in question is the Linux kernel but not bound by GPL. Also, your Markov chain itself is a derivative work.
> I don't think that is generally true. There's always been a hope and expectation that some subset of users would contribute back to the project in the ways you're describing, but never a sense of there being any obligation to do so. Only a fraction of FOSS users have ever contributed back to the projects whose software they use.
True, but that fraction of a huge number is still big enough to be meaningful help. Plus the recognition. Most users respect the attribution clause. AI legally-distinct clones drop the fraction of helpers and the number of attributions straight down to 0. That changes the equation: what previously made sense now straight up doesn't.
> But in cases where LLMs actually are acting in ways similar to the former, I agree that they should be held accountable both socially and legally.
And because OpenAI et al. hold all the money and all the lawyers, the only way to hold them accountable is to stop publishing open source altogether. That's the only leverage the OSS community has.
> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.
If you write "To see or not to see, that is the question" about a person named Eyelet, who is going blind, how can you argue that it is NOT derivative of / borrowed from Hamlet? Yet that sentence is not in the work. Isn't that what LLMs essentially do? Tokenize, then substitute in new values for certain tokens, while retaining the general structure?
> If I build a Markov chain based on a statistical analysis of word sequences in Hamlet, and then use it to produce a new sentence that isn't found in the text of that work, I have not created a derivative work of Hamlet under any applicable sense of that term.
Uh, that is exactly what a derivative work is. You literally specify that Hamlet is an input to your work. I believe you're conflating derivative with transformative. You're certainly creating a transformative derivation of Hamlet, but you are by definition creating a derivative work by training a Markov chain on the text of Hamlet.
The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under. Others argue that there's not an exact copy of the original source in the LLM's weights so by definition it must be a transformative work. I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.
Personally, I don't care one way or the other. I'm one of the folks that thinks software shouldn't be copyright-able in the first place.
> Uh, that is exactly what a derivative work is.
No, it isn't. A derivative work isn't something based on extracting underlying ideas or patterns from another work, it's something that includes copyrighted portions of the other work.
An annotated edition of Hamlet is a derivative work. A Cliff's Notes summary of Hamlet is a derivative work.
Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet. A list of word counts of popular works of literature that includes an entry for Hamlet is also not a derivative work. The Markov chain described above is not a derivative work.
> The obvious follow up here is whether an LLM is creating transformative derivations or not. A lot of folks argue that yes, an LLM spitting out statistically sampled code that matches existing code is not transformative and is (or might be) infringing the terms of the license it was released under.
And I would agree with them. An LLM that actually is outputting non-trivial code that matches a public project's code verbatim is engaging in copying, and not stochastic inference.
> I think it's a pretty obvious "somewhere in the middle" that is gonna make a bunch of lawyers a whole lot of money.
It's a shame that the same fundamental questions have to be relitigated over and over again just because the contextual formalities and modes of expression have changed. I wonder how many of the legal cases are going to be copies or derivative works of previous ones.
> Strange Brew and The Lion King are not derivative works of Hamlet simply because they include literary themes and plot points that originated in Hamlet.
But try to write your own story of a lion cub chased away by his uncle and living in a jungle until his childhood friend finds him and convinces him to reclaim his kingdom, and you'll quickly hear from Disney's lawyers how non-derivative it really is.
OSS devs aren't worried about Hamlet reinterpretations. They're worried about legally-distinct-but-functionally-identical software clones. Unlike Disney, they don't have millions in their pockets to fight the legal battle. You know who does have millions? The people they'd be fighting against, who are going to use every single one of your arguments to claim their AI-generated reimplementation of Kefir is not bound by GPL (or even by BSD 3-clause in case of runtime). No share-alike, no attribution, no nothing. If they are right, then the OSS social contract is dead. Even if they're not right, but behave as if they're right because they have lawyers and OSS devs don't, the social contract is just as dead.
> But try to write your own story of a lion cub chased away by his uncle and living in a jungle until his childhood friend finds him and convinces him to reclaim his kingdom, and you'll quickly hear from Disney's lawyers how non-derivative it really is.
I'd expect them to say "we don't like this, but since it's not actually a derivative work, we can't do anything about it". As long as you're not directly copying things like characters, dialogue, etc., it's not a derivative work.
That's why Armageddon is not a derivative work of Deep Impact, the Shark Attack series is not a derivative work of Jaws, the more famous Titanic is not a derivative work of 1979's S.O.S. Titanic, and the Harry Potter series is not a derivative work of Teen Witch.
Using the same story themes, plot points, and setting as another work does not implicate that other work's copyright. Only substantial copying of specifics does.
Yes, and obviously: bots crushing servers in strict contravention of the robots.txt rules.
“No, no, what was she wearing?”
People who take steps in response to social contract being broken are the ones responsible for the steps they've taken, not the ones who break the social contract.
It's definitely the ones DDOSing websites while giving no attribution in any way to the original creators.
DDOSing websites seems to be an unrelated problem, and one that has traditionally been solved through response throttling and IP blocking.
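For readers who haven't seen response throttling spelled out, here is a minimal sketch of a per-client token bucket, one common way to do it. The class name `TokenBucket`, its parameters, and the idea of keying on an IP string are illustrative assumptions, not anything a specific server or the sites discussed here actually use.

```python
import time


class TokenBucket:
    """Per-client token bucket (hypothetical helper): each request costs one token."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size per client
        self.clock = clock        # injectable clock, handy for testing
        self.buckets = {}         # client id -> (tokens, last_seen_time)

    def allow(self, client):
        """Return True if the client may proceed, False if it should be throttled."""
        now = self.clock()
        tokens, last = self.buckets.get(client, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, now)
            return True
        self.buckets[client] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=1.0, capacity=5)
    # "203.0.113.7" is a documentation-range address used as a stand-in client id.
    print([limiter.allow("203.0.113.7") for _ in range(7)])
```

A limiter like this only helps against clients that keep a stable identity. Crawlers that rotate through many addresses would each get a fresh bucket, which is part of why site operators in this thread report that the traditional approach is not enough on its own.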
Attribution is often required even on MIT or BSD licenses where code is being redistributed, either in original or modified versions, but that would relate to this discussion only to the extent that one regards using LLMs whose training data included a certain bit of code as itself constituting redistribution of that specific code -- but that in turn is a very debatable premise which really ought to be argued for, and not merely argued upon as though it is already generally recognized as true.
Why? You stole my stuff and now are pretending I need to argue for you to stop stealing it. It's a joke.
This is the very question under debate. Training LLMs on publicly available data is a novel situation, and neither law nor social opinion have settled a consensus on the subject.
Copyright maximalists like to borrow unearned moral weight for their position by conflating copyright infringement with "stealing", but this is not actually true in any legal sense. It's not clear that training an AI on publicly available data should even constitute copyright infringement, much less "stealing".
What? What is being "stolen" from you?
Are you now layering the old and tired "copyright infringement = stealing" argument on top of the still unsubstantiated premise that all LLM training is copyright infringement?
> The sender pays, not the receiver.
You have a hole here. Your web server is sending the response and the bot is receiving.
Fix that and … profit? :-)
oh good point got that backwards… OMG my fax brain didn’t even think about it.
I'm trying to compose a better wording, but my attempts aren't working. The best I've got is:
> The initiator of the communication pays, not the server operator.
Really hate to say it, but I’ve stopped publishing my work too for this reason. I spend most of my time now building my own little software ark, and I aspire to no longer think of programming in the next few years. I feel like the creative economy in general will be unrecognizable in the near future, maybe nonexistent. I wonder what modes of collaboration on ideas might form in the next few years.
Here is what the purveyors of AI don't seem to realise. You can bend copyright law all you want in order to train your models on whatever you can grab, but in the absence of genuine protection of their creative work authors are simply not going to be publishing at all.
I think they see it all too well. They still think they can make bank today while it lasts, whatever comes after is some other shareholder's problem. And if we're talking about open source, killing it might be a positive side effect, they'll be ready to sell you a closed source alternative when you no longer have options.
I don't think we're going back to closed source. I think we're going back to guilds. Aka. closed knowledge.
Furthermore, if people not only stop publishing, but also take down already published works, it will create a moat around already existing Language Models
And the more they DDOS small websites — instead of respectfully scraping once — the more realistic my conspiracy theory looks.
People who are making stuff because they want to share it are still going to be publishing. And fighting to be noticed in an unending torrent of slop.
Without any material or immaterial benefits? And with one's work being ground up and turned into weights for the next version of the machine that's threatening one's employment?
I personally am sharing stuff because I want people to read my comics, and maybe join my crowdfunding campaigns.
If I could put everyone pushing all this AI crap into a meat grinder, I would.
> People who are making stuff because they want to share it are still going to be publishing.
Those people who do that are too few and far between to make a difference. The majority of open source devs aren't giving away the source without a license. That license is how they specify what they want in return.
> The majority of open source devs aren't giving away the source without a license.
100% of open source devs aren’t giving away the source without a license, since a license (the grant of permissions for what is otherwise exclusive to the author under the law) is what makes something open source.
> That license is how they specify what they want in return.
No, the license is how they legally give away permission to use material that is legally subject to their exclusive rights by virtue of creation. The license may be a contract license that, as you suggest, involves mutual exchange of value, but for many (especially permissive) open source licenses it is a gratuitous bounded grant of permission which has limits but does not involve giving something of value back to the creator.
> No, the license is how they legally give away permission to use material that is legally subject to their exclusive rights by virtue of creation. The license may be a contract license that, as you suggest, involves mutual exchange of value, but for many (especially permissive) open source licenses it is a gratuitous bounded grant of permission which has limits but does not involve giving something of value back to the creator.
Wrong. What they want in return is either credit or derivatives of the software. It's disingenuous to suggest that all these authors specifying, in a legal document, the exact mechanism by which to pay them back don't know what they are asking.
If you're not happy with that trade, then don't make it.
Great. More work for AI then.
The sad thing is I feel trapped on all sides of the debate. I wrote a book about LLMs and human creativity (spoiler: humans win for a long time). I was going to do it as a blog series, but instead I published https://www.amazon.com/dp/B0GXCSY4W8 because I felt I might at least get a bit back for the literally hundreds of hours of my life I poured into the book, and for my editor and the friends who read it and provided reviews.
And I push a lot of open source code, including a ton for the SWGEmu project, but now I’m of two minds about whether to stop pushing anything public. I can’t decide. Am I talking out of both sides of my mouth? It’s a confusing time to navigate for sure.
Indeed sad, congrats on publishing your book though. I’ve certainly felt a bit of that same angst myself.
I think SWGEmu (cool project, just learned of it from you!) does represent some optimism though. Maybe these sorts of passion projects will take over the space?
> Really hate to say it, but I’ve stopped publishing my work too for this reason.
Me too; not that I've published a lot, but definitely more than most. That won't be happening anymore.
Incredibly rich to complain about LLM scraping with an LLM-generated article.
This project in particular has been unconcerned with new coding practices so far, primarily, because I derive pleasure from hand-written implementations of my ideas, and believe that overcoming challenges the hard way is the main value I get from it.
This is 100% the same for me. Outside of work, where speed is more important than quality and I work with people that use AI, I don't use AI at all on my own projects. It poisons the mind and the soul. Ok, that sounds dramatic, but I felt down up until the point where I started hand writing everything again. Software engineering is still fun and powerful, and the hell with where the world is going.
I'm also very hesitant to release any new works (code, artworks, etc.) to the public. I usually release code under the GPL or AGPL, but I don't think any of those choices are properly respected by the AI crawlers, and subsequent "mixing into" those models.
Multiple times I got partially broken "citations" of GPL licensed code out of the models as answers to basic research questions (aka prompts) w/o any mention of the original license applied to the code. Just adding some random bugs every 10th line doesn't make it not a direct derivative. Image generators happily generated Sonics or Bart Simpsons (w/o directly prompting for that either). No mentions that those are copyrighted characters either.
I have gone the other way, I used to release things under MIT licence, but have switched to public domain or unlicenced.
I mostly make things because I felt they should be made. I am fine with what I produce being used by others provided they don't take it away from anyone else.
I was never very happy with the selfishness of the GPL, which is why I tended to prefer MIT, but the stances taken by people in recent years made me realise that nobody owns ideas, and even attribution is commoditised.
I am ok with voluntary attribution so that it may be used as a means to confirm additional information. I don't like the idea that if I think of something, someone else is not allowed to think about it without my permission.
Citation farming is a problem that happened because the value of the idea was placed on the names attached to it. That generated motivation to attach names to ideas as a way to gain power or prestige. Taking credit for someone else's idea can only occur because people have put the credit value onto the person and not the idea. Many of those names are of no use when it comes to verifying if the idea is sound; it's creating a denial of service attack on the ability to validate.
I understand the realities of commerce and academia that put these things in place, and how those who work within those frameworks have to do so in a way that is compatible with them.
I don't like it though, I think it makes the world less informed and less free. I don't have to create under those frameworks myself, so I made the decision to make any idea I have to not be bound to my will or identity.
Seems to me LLMs have changed some things. I'm not sure how it's best put, but it used to be:
- Seeing code (or a blogpost or whatever) was a result from effort where thought had gone into it. The writer paid effort so the reader didn't have to.
- There'd be some level of attachment to what you've put effort into.
With LLMs, that's undermined: it's easy to produce thoughtless imitations. Code or comments where thought didn't go into it. So, seeing some result isn't an indication of skill, but also not even an indication thought went into it.
I guess there's still something lost if someone isn't going to share code they've put thought into. -- But on the other hand, if it's just for me & I don't have to share it with a wider audience, getting LLMs to write out code isn't so expensive.. so code itself isn't necessarily something to value so much.
But LLMs don’t seem particularly good at inventing new ways to code (or write, or…). It’s literally all derivative. So what happens in 10 years? Are we headed for a great stagnation?
> But LLMs don’t seem particularly good at inventing new ways to code (or write, or…). It’s literally all derivative.
I think the key part is how much thought goes into something.
Optimistically, LLMs are good at taking unstructured input, and (probably) producing the intended output from that. -- This allows for an interesting new way of coding: a set of instructions don't need to be as rigorous as a shell script, but can be natural language.
That part surely extends creativity. An LLM will be familiar with domain ideas I'm not, even if an LLM is completely disinterested in doing things.
Pessimistically, I think it's still not clear what the right way of interacting online with all of this is (other than clear expectations of "no AI")... in some sense LLM output is worthless to share, in the sense that I'm just as capable of asking the LLM to output something as anyone else is.
Let LLMs ingest their own output, and everything past 2022 will be increasingly hallucinatory self-regurgitation.
That’s because they cannot invent anything. They’re reductive, not creative.
It’s like arguing that nobody is going to invent new ways to ride horses in the age of the automobile.
If the way humanity advances were via new ways to ride horses, then yes.
You made me curious. Has anyone invented new ways to ride horses in the age of the automobile?
Best I could find: https://www.science.org/doi/10.1126/science.1174605
There was a relatively big shift in riding style right around the same time of the first mass production of vehicles.
I don't know... I've been writing code for a good twenty years (15 professionally).
First, I think it's the best time to write software since so much boring stuff can be automated. I can put my thoughts into what I'm trying to achieve instead of how. To put it another way, I think about the big picture much more than about mundane details like dealing with the particularities of a programming language.
Second, most people were using SO to solve just about any issue they had. The number of developers producing truly original code was minimal even 10 years ago.
One of the very few small compilers which passes the full gcc torture tests. But for me kefir is good enough as the reference small compiler. Not as fast as tcc, but more correct
I've been taking a look at the source and it's a work of art :O
Surprised no one has yet linked to the source https://sr.ht/~jprotopopov/kefir/
People in other professions are jumping on this bandwagon - Tony Gilroy decided not to publish Andor TV show scripts to prevent AI companies using them for training.
see https://variety.com/2025/tv/news/andor-creator-refuses-publi...
So how big is the community around this project?
If it's a one-person show, would closing it up effectively kill it? Or (re?)turn it into a hobby project developed at a snail's pace?
If some community exists: fork coming up?
One-person show. Effectively, it is dead, since it has now become the proprietary toy of its author. The author is entitled to do what he wants with his own creation, however.
I put my site behind a username/password wall, to block LLM bots.
Spambots learned to autoregister 30 years ago. Do LLMs not do that? Crazy.
User has to email me for access.
Same, it's not worth having 100GB of content scraped every other day.
It was nice hearing about it. If this is a healthy direction for the project, then so be it. At least source to previous versions is still available.
I'm finding it hard to be motivated to continue on language dev work. I feel it may also have to do with AI. Not so much the predatory aspect of it, like this author, but something else: shall we say, certain revelations about the nature of the target audience.
What a well-rounded, nicely written announcement that touches on all parts of the argument without any rage baiting or flexing. It would be easy to just ramble against AI and how it's the end of the world, but the author focused on a point that's not even related to the use or misuse of AI in software, but rather how we have made it acceptable that large corporations can skirt copyright without any issue and make rivers of money with it. This problem extends not only to coding but to other industries as well.
Instead of a derivative work we have a machine that creates derivative works. I fail to see how this is fair use.
Same situation some time ago with Solar assembler
People taking your work and not giving anything back was ALWAYS the risk you took when writing free software. LLM training doesn't change that much. That the US military is no doubt using gcc to compile embedded software for their ICBMs surely irks the GNU people. But you can't have it any other way. "You can only use my software for good things" just is not consistent with "free software".
Yeah, I really can't comprehend these sentiments as anything other than an "I don't like AI" argument. FOSS has always been about just writing code and putting it out into the world where others can do as they please with it.
I see a lot of risks involved in people surrendering their own decision-making to LLMs, but that's a question of how they're used, not how they're trained. The idea that using FOSS software to train LLMs is somehow a violation of FOSS norms just doesn't seem valid.
> FOSS has always been about just writing code and putting it out into the world where others can do as they please with it.
That is wrong. How can you write that with a straight face? There are projects that are put into the public domain (one major one comes to mind), but the clear majority of FOSS projects have strings attached which make the intention of the authors absolutely clear.
IOW, if you're not happy with what the cost of the product is, then just don't use it.
I mean, the most restrictive license, the GPL, was conceived specifically to protect the "four freedoms" and prevent subsequent modifications from violating them. The "copyleft" concept was specifically designed to create an ecosystem that behaved as if copyright didn't apply in the first place.
I don't know how you can imply with a straight face that it did anything else.
I don't know how you can possibly argue that non-redistributive usage of software could ever violate the GPL -- and the other common FOSS licenses don't even have the copyleft provision, and literally are saying "do whatever you want, but I'm not responsible".
> The "copyleft" concept was specifically designed to create an ecosystem that behaved as if copyright didn't apply in the first place.
And if copyright didn't exist in the first place we wouldn't be having this conversation, because the models created by all the token providers will be open to all for whatever use that anyone wanted.
But it does exist, and within this framework, the creator gets to say how you may redistribute their IP, and "We compressed it very much" isn't an out.
> But it does exist, and within this framework, the creator gets to say how you may redistribute their IP,
Right. And the way the creator gets to exercise that say is by releasing their work under a license. If you release your work under a FOSS license, you're saying "you are free to copy this work and use it for your own purposes".
Complaining that people are using it for purposes you don't like after you've already given permission to them to use it for whatever purposes they please seems a bit disingenuous.
> and "We compressed it very much" isn't an out.
It's not, but I don't think we're discussing that. We're talking about LLMs, not people redistributing zip files containing someone else's work. If you're trying to imply that LLMs are merely a form of compression, that's a position you've got to argue for, because I'm definitely not seeing any similarity between the two.
> that behaved as if copyright didn't apply in the first place.
If copyright didn't exist then the share-alike and anti-tivoization clauses wouldn't work, FOSS in general wouldn't even protect attribution. Copyleft ecosystems depend on some amount of copyright law to uphold themselves.
> FOSS has always been about just writing code and putting it out into the world where others can do as they please with it.
Not true. Most FOSS licenses require attribution and many require derivatives to be released under the same license.
Sure, but I guess I'm not seeing the relevance here. Are we seeing some greater-than-normal wave of people redistributing FOSS code without attribution, or creating derivative works without adhering to the license terms? LLM training doesn't seem to be either of these things.
We are seeing megacorporations (SlopenAI, Antslopic, Microslop, etc.) distributing derivatives of open-source code (their LLMs) without attribution.
Can you point to some specific examples of products shipped by the companies I assume you're referring to here that are in fact unattributed derivative works of GPL-licensed software?
Or are you saying that you think anything generated by an LLM qualifies as a derivative work of anything included in its training data?
The latter.
It's a tool, if using data is necessary to make the tool work, then its output derives from the data.
If the LLM generation is not derivative of its training data, then why would it need the training data in the first place?
> It's a tool, if using data is necessary to make the tool work, then its output derives from the data.
That's simply not correct within the applicable meaning of "derives" as understood in copyright law. In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not.
Even creating works that merely draw on a single source of data, but express the ideas drawn from that in a new or transformative way, are not considered derivative works (see the ruling in Google v. Oracle, for example), let alone works based on patterns extrapolated by relating together ideas sourced from many distinct works, which is what LLMs are principally doing.
If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case.
> That's simply not correct within the applicable meaning of "derives" as understood in copyright law.
Would be rather hard to write a definition that handles it properly back when LLMs didn't exist; not that laws particularly have anything to do with intent/desires behind FOSS anyway - intent is clearly there: you get code, under the condition that if you use it for anything, I get credited; else, you get nothing.
> In fact, data per se is not even within the scope of copyright protection in the first place: specific published works are copyrighted, but the underlying ideas and facts that they convey are not.
Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.
> If you applied the principle you're proposing here to human developers, you'd conclude that any code written by someone who learned to program by studying techniques used in FOSS software would in turn be a derivative work of that software. No one has ever regarded this to be the case.
Depending on intent, that very much can happen; it's called plagiarism. Good luck proving an LLM's intent. (Not to mention the obvious differentiating factor of LLMs having arbitrarily-good memory unlike humans.)
> under the condition that if you use it for anything, I get credited; else, you get nothing.
But this has never been a condition in the FOSS world, as far as I'm aware. I've only ever seen attribution requirements attach to redistribution of source, not usage of the software.
I understand that the crux of the debate here is whether training an LLM is redistribution of the underlying code, but to me, it seems to be fairly clear that it is not.
> Luckily, FOSS is specific published works, and unless LLMs actually reasonably-provably do such decomposing into ideas/facts (good luck reasoning about that), that part is also irrelevant.
That's literally all LLMs do. That's what tokenization is. And it's trivially provable, since if you compare LLM models with the copyrighted works you're claiming they replicate, all you'll see on the LLM side is probability matrices representing correlations between decomposed units of knowledge aggregated across the entire dataset as an integrated whole.
> Depending on intent, that very much can happen; it's called plagiarism. Good luck proving an LLM's intent.
The only intent ever in play is that of the user. LLMs are just software.
There's an almost intergalactic level of irony in the extent to which open source has benefited giant corporations and the military at the expense of individuals, and ultimately contributed to the commercialised enclosure of software IP.
I suppose you could argue it also indirectly led to the empowerment of non-developers to create their own vibe coded solutions. But we're not quite there yet.
And the AI IP that makes that possible is still enclosed rather than open.
> There's an almost intergalactic level of irony in the extent to which open source has benefited giant corporations and the military at the expense of individuals, and ultimately contributed to the commercialised enclosure of software IP.
Could you perhaps explain that irony a bit more explicitly?
Can you provide any examples of "commercialized enclosure of software IP" somehow backwashing into the FOSS ecosystem and closing things up that are already open?
Aren't open-weight models sort of returning the favor?
Sure, Free Software hasn't been the vehicle for societal change that RMS and others certainly hoped. I remember being flamed out in a user group for suggesting that our conference shouldn't be held in a "non-free" country such as Morocco, Turkey, or China because it's counter-productive to freedom. Very few people actually got it. But it's orthogonal to LLM trainers also using free software in "non-approved" ways.
> it also indirectly led to the empowerment of non-developers to create their own vibe coded solutions.
Nobody is empowered to do that because the models to do that aren't free.
> But we're not quite there yet.
Judging from the number of projects I've seen from people who aren't software developers, we're there enough.
Before LLMs, you could use the GNU GPL or other copyleft licenses to protect your code from being used to develop non-free software. Unfortunately, the courts have decided that LLMs are free to ignore licenses.
Copyleft is about republishing. You can't prevent anyone from using your compiler or text editor to develop non-free software.
The gcc torture tests are no joke. I skimmed them once thinking I’d write a toy C compiler. Thousands of test cases covering edge cases I’d never even thought about. Respect to anyone who gets through the full suite.
> I also do not want my future work to be exploited for naught in commercial purposes.
Other people using your code to enrich their lives or businesses doesn't exploit you in any way, as it doesn't cost you a thing. This is irrational.
Also irrational because just as others benefit from his code, he benefits from theirs. LLMs fulfill the promise of Open Source, they don't violate it.
As long as they are universally available, that is. That's the part people should be concerned about.
I have many GPL projects (e.g. https://github.com/rochus-keller/Oberon, https://github.com/rochus-keller/Luon, https://github.com/rochus-keller/Micron) and spend a significant amount of time on them. GPL has always explicitly permitted commercial use; that's a feature, not a bug, dating back to Stallman's original vision. Any person or company can use my code (or Kefir code) under the terms of the GPL, as I use code given away by companies under GPL or even more liberal licences for free. That's the deal. GPL is a license explicitly designed to maximize use, so it doesn't make sense to object to a specific form of use. The claim that AI companies are somehow violating GPL by training on GPL code is legally baseless (I studied law here in Switzerland and had lectures about international IP law); also the FSF itself has not claimed otherwise; even if it were prohibited, it would be a copyright enforcement problem, and not a reason to stop publishing. I don't know Kefir, but it looks like a great (even optimizing) compiler. So it's really a pity that its development is no longer open source.
The GPL, unlike the BSD and such, intends to prevent the closing of distributed derivative works. LLMs trained on GPL code can produce derivative works without any enforcement mechanism.
You may be fine with that, but the GPL is not a public domain license, and LLM training treats all things as if they were public domain.
> LLMs trained on GPL code can produce derivative works
This confuses two completely separate things. GPL governs distribution of derivative works. An LLM trained on GPL code does not distribute that code. The model weights are not a copy, a derivative, or a distribution of the training data in any legally recognizable sense; "influenced by" is not "derived from". The enforcement argument is a non sequitur; the GPL has never had a technical enforcement mechanism; it's always been legally enforced after the fact by copyright holders who discover violations. So if the LLM would indeed produce output sufficiently similar to my code and someone would publish it in violation of GPL, I have the same legal means to enforce my rights as if the code was copied by a human.
> An LLM trained on GPL code does not distribute that code.
You can't simply make that assertion. You'll have to prove that LLMs do not actually contain encoded copies of copyrighted code and that they are incapable of reproducing such code verbatim.
There is no evidence for such a claim, and so your entire argument is completely baseless.
> You'll have to prove that LLMs do not actually contain encoded copies
In law, the presumption is that an act is lawful unless proven otherwise. The burden lies on whoever claims a violation occurred. I already went into the case of sufficiently similar reproduction in my previous response.
I mean… it's been common knowledge for a while that they do in fact contain the original data.
https://www.reddit.com/r/programming/comments/oc9qj1/copilot...
You can disagree all you want, but there's ample evidence of this.
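Since the disagreement above turns on whether model output reproduces existing code verbatim, here is a rough sketch of how one might measure that for a single snippet: find the longest run of tokens that a generated snippet shares with a reference source. The function name `longest_shared_run` and the whitespace tokenization are assumptions made for illustration. It says nothing about what counts as infringement, and how long a shared run has to be before it is "non-trivial" is a judgment call.

```python
from difflib import SequenceMatcher


def longest_shared_run(generated, reference):
    """Return the longest contiguous token run present in both texts.

    Tokens are whitespace-separated words, a deliberate simplification;
    a real comparison would probably use a language-aware tokenizer.
    """
    a = generated.split()
    b = reference.split()
    # autojunk=False keeps difflib from discarding frequent tokens such as "{".
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


if __name__ == "__main__":
    # Both strings are made-up examples, not taken from any real project.
    reference = "int add(int a, int b) { return a + b; }"
    generated = "static int add(int a, int b) { return a + b; } // sum"
    run = longest_shared_run(generated, reference)
    print(len(run), "shared tokens:", " ".join(run))
```

A long shared run is evidence of copying in the sense discussed earlier in the thread, while a short one is what you would expect from any two programs in the same language. Whichever side of the argument you are on, a check like this can only be applied to specific outputs after the fact.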