How do you make a language model? Goes like this: erect a trellis of code, then allow the real program to grow, its development guided by a grueling training process, fueled by reams of text, mostly scraped from the internet. Now. I want to take a moment to think together about a question with no remaining practical importance, but persistent moral urgency:
Is that okay?
The question doesn’t have any practical importance because the AI companies have already settled it for themselves: the models are trained, and nobody is going to untrain them.
The question does still have moral urgency because, at its heart, it’s a question about the things people all share together: the hows and the whys of humanity’s common inheritance. There’s hardly anything bigger.
And, even if the companies and the enthusiasts rampage ahead, there are still plenty of us who have to make personal decisions about this stuff every day. You gotta take care of your own soul, and I’m writing this because I want to clarify mine.
A few ground rules.
First, if you (you engineer, you AI acolyte!) think the answer is obviously “yes, it’s okay”, or if you (you journalist, you media executive!) think the answer is obviously “no, it’s not okay”, then I will suggest that you are not thinking with sufficient sensitivity and imagination about something truly new on Earth. Nothing here is obvious.
Second, I’d like to proceed by depriving each side of its best weapon.
On the side of “yes, it’s okay”, I will insist that the analogy to human learning is not admissible. “Don’t people read things, and learn from them, and produce new work?” Yes, but speed and scale always influence our judgments about safety and permissibility, and the speed and scale of machine learning is off the charts. No human, no matter how well-read, could ever field requests from a million other people, all at once, forever.
On the side of “no, it’s not okay”, I will set aside any arguments grounded in copyright law. Not because they are irrelevant, but because … well, I think modern copyright is flawed, so a victory on those grounds would be thin, a bit sad. Instead, I’ll defer to deeper precedents: the intuitions and aspirations that gave rise to copyright in the first place. To promote the Progress of Science and useful Arts, remember?
I hope partisans of both sides will agree this is a fair swap. Put down your weapons, and let’s think together.
I want to go carefully, step by step.
I’ll begin with my sense of what language models are doing. Here it is: language models collate and precipitate all the diverse reasons for writing, across a huge swath of human activity and aspiration. Count off those reasons: to inform, to persuade, to sell this stupid alarm clock, to dump the CUSTOMERS table into a CSV file … and you realize it’s a vast field of desire and action, impossible to hold in your head.
The language models have many heads.
To make this work, you need a trove of text enormous enough to capture all those reasons for writing.
In fact, the only trove known to produce noteworthy capabilities is: the entire internet, or close enough. The whole browsable commons of human writing. From here on out, we’ll call it Everything, which is short for Everything That Can Be Accessed Online, Plus Some Extra Pirated Stuff, Probably.
This is what makes these language models new: there has never, in human history, been a way to operationalize Everything. There’s never been anything close.
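If it helps to see the mechanism at toy scale, here is a sketch. It assumes PyTorch, and every name, size, and corpus in it is mine, invented for illustration, nowhere near any lab’s actual recipe. The trellis is the code you can read below; the real program is the weights, which take shape only as text flows through the loop.

```python
# A toy "trellis": a few fixed definitions, plus a loop. The interesting
# part, the weights, grows only as text flows through. All names, sizes,
# and the corpus here are invented for illustration.
import torch
import torch.nn as nn

VOCAB, WIDTH = 256, 64               # byte-level vocabulary; toy model width

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, WIDTH)
        self.rnn = nn.GRU(WIDTH, WIDTH, batch_first=True)
        self.head = nn.Linear(WIDTH, VOCAB)

    def forward(self, tokens):                   # tokens: (batch, seq)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                 # logits over the next byte

def train(model, corpus, steps, lr):
    """Next-byte prediction over one long byte string."""
    data = torch.tensor(list(corpus)).unsqueeze(0)      # (1, seq) of byte values
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = model(data[:, :-1])                    # predict each following byte
        loss = loss_fn(logits.reshape(-1, VOCAB), data[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
    return model

corpus = b"reams of text, mostly scraped from the internet. " * 100
model = train(TinyLM(), corpus, steps=200, lr=1e-3)
```

Scale the weights and the text up by many orders of magnitude, swap the toy recurrent core for a transformer, and replace my one-line corpus with Everything: that, roughly, is the thing whose okay-ness we’re debating.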
Just as, above, I set copyright aside, I want also to set aside fair use and the public domain. Again, not because they are irrelevant, but because those intuitions and frameworks all assume we are talking about using some part of the commons. Nothing in them anticipates the use of the whole thing.
I mean: ALL of it!
If language models worked like cartoon villains, slurping up Everything and tainting it with techno-ooze, our judgment would be easy. But of course, digitization is trickier than that: the airy touch of the copy complicates the scenario.
The language model reads Everything, and leaves Everything unchanged; every page stays right where it was found, yet the model carries away a working impression of the whole.
Is that okay?
As we begin to feel our way across truly new terrain, we can inquire: how much of the value of these models comes from Everything? If the fraction was just one percent, or even ten, then we wouldn’t have much more to say.
But the fraction is, for sure, larger than that.
What goes into a language model? Data and compute.
For the frontier models like Claude, data means: Everything.
Compute combines two pursuits:
- software: the trellises and applications that support the development and deployment of these models, and
- hardware: the vast sultry data centers, stocked with chips, that give them room to run
There’s a lot of value in those pursuits; I don’t take either of them for granted, nor the labor they require. The experience you get using a model like Claude depends on an ingenious scaffolding. Truly! At the same time: I believe anyone who works on these models has to concede that the trellises and the chips, without data, are empty vessels. Inert.
Reasonable people can disagree about how the value breaks down. While I believe the relative value of Everything in this mix is something close to 90%, I’m willing to concede a 50/50 split.
And here is the important thing: there is no substitute.
You’ve probably heard about the race to generate novel training data, and all the interesting effects such data can have. It is sometimes lost in those discussions that these sophisticated new curricula can only be provided to a language model already trained on Everything. That training is what allows it to make sense of the new material.
Also, it is often the case that the novel data is itself generated, or at least filtered and graded, by language models that were themselves trained on Everything.
It’s Everything, all the way down.
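To see that ordering concretely, here is a continuation of the toy sketch from earlier. Both corpora below are stand-ins I invented; the point is only the sequence, in which the curriculum lands on weights already shaped by the larger corpus, never on a blank slate.

```python
# Continuing the toy sketch from above: stage 2 is applied to weights
# already shaped by stage 1. Both "corpora" are invented stand-ins.
everything = b"the whole browsable commons of human writing. " * 400
curriculum = b"Q: 17 * 24? Step by step: 17 * 24 = 408. A: 408\n" * 50

base  = train(TinyLM(), everything, steps=300, lr=1e-3)  # stage 1: pretrain on "Everything"
tuned = train(base, curriculum, steps=60, lr=1e-4)       # stage 2: the novel curriculum
```

Run stage 2 alone, on fresh weights, and the curriculum has nothing to stick to.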
Would it be possible to commission a fresh body of work, Everything’s equal in scale and diversity, without any of the encumbrances of the commons? If you could do it, and you trained a clean-room model on that writing alone, I concede that my question would be moot. (There would be other questions! Just not this one.) Certainly, with as much money as the AI companies have now, you’d expect they might try. We know they are already paying to produce new content, lots of it, across all sorts of business and technical domains.
But this still wouldn’t match the depth and richness of Everything. I have a hypothesis, which naturally might be wrong: that it is precisely the naivete of Everything, the fact that its writing was actually produced for all those different reasons, that makes it so valuable. Composing a fake corporate email, knowing it will be used to train a language model, you’re not doing nothing, but you’re not doing the same thing as the real email-writer. Your document doesn’t have the same … what? The same grain. The same umami.
Maybe one of these companies will spend ten billion dollars to commission a whole new internet’s worth of text and prove me wrong. However, I think there are information-theoretic reasons to believe the results of such a project would disappoint them.
So! Understanding that these models are reliant on Everything, and derive a large fraction of their value from it, one judgment becomes clear:
If their primary application is to produce writing and other media that crowds out human composition, human production: no, it’s not okay.
For me, this is intuitively, almost viscerally, obvious. Here is the ultimate act of pulling the ladder up behind you, a giant “fuck you” to every human who ever wanted to accomplish anything, who matched desire to action, in writing, part of Everything. Here is a technology founded in the commons, working to undermine it. Immanuel Kant would like a word.
Fine. But what if that isn’t the primary application? What if language models, by collating and precipitating all the diverse reasons for writing, become flexible general-purpose reasoners, and most of their “output” is never actually read by anyone, instead running silent like the electricity in your walls?
It’s possible that language models could go on broadening and deepening in this way, and eventually become valuable aids to science and technology, to medicine and more.
This is tricky, because that vision is also the AI companies’ sales pitch, the story they tell to raise money and calm critics.
They can do that! They probably will. And the claim might still be true.
If super science is a possibility, even just a possibility, then the operationalization of the commons might be a price worth paying.
(I know that seems awfully consequentialist. Would I sacrifice anything, or everything, for super science? No. But art and media can find new forms. That’s what they do.)
Obviously, this scenario is especially appealing if the super science, like Everything at its foundation, flows out into the commons. It should.
So, what are the odds?
For my part, I think the chance of super science is below fifty percent, owing mostly to the friction of the real physical world, which the language models have, so far, avoided. But, I also think the chance is above ten percent, so, I remain curious.
It’s not unreasonable to find this wager suspicious, but if you do, I might ask: is there any possible-but-unproven technology that you think is worth pursuing even at the cost of itchy uncertainty in the present? If the answer is “yes, just not this one”: fair enough. If the answer is “no”: aha! I see you’ve answered the question at the top of this page for yourself already.
Where does this leave us?
I suppose it’s not surprising, in the end:
If an AI application delivers some profound public good, or even if it might, it’s probably okay that its value is rooted in this unprecedented operationalization of the commons.
If an AI application simply replicates Everything, it’s probably not okay.
I’ll sketch out my current opinions more specifically:
I think the image generation models, trained on the Everything of pictures, are: probably not okay. They don’t do anything except make more images. They pee in the pool.
I think the frontier models like Claude are: probably okay. If it seemed, a couple of years ago, that they were going to be used mainly to barf out text, that impression has faded. It’s clear their applications are diverse, and often have more to do with processes than end products.
The case of translation is compelling. If language models are, indeed, the Babel fish, they might justify the operationalization of the commons even without super science.
I think the case of code is especially clear, and, for me, basically settled. That’s both (1) because of where code sits in the creative process, as an intermediate product, the thing that makes the thing, and (2) because the commons of open-source code has carried the expectation of rich and surprising reuse for decades. I think this application has, in fact, already passed the threshold of “profound public good”: opening up programming to whole new groups of people.
But, again, it’s important to say: the code only works because of Everything. Take that data away, train a model using GitHub alone, and you’ll get a far less useful tool.
Maybe (it turns out) I’m less interested in litigating my foundational question and more interested in simply insisting on the overwhelming, irreplaceable contribution of this great central treasure: all of us, writing, for every conceivable reason; desire and action, impossible to hold in your head.
Did we make progress here? I think so. It’s possible my question, at the outset, seemed broad. In fact, it’s fairly narrow, about this core mechanism, the operationalization of the commons: whether I can live with it, or not.
One extreme: if these machines churn through all media, and then, in their deployment, blow away any prospect for a healthy market for human-made media, I’d say, no, that’s not what we want from technology, or from our future.
Another extreme: if these machines churn through all media, and then, in their deployment, discover several superconductors and cure all cancers, I’d say, okay … we’re good.
What if they do both? Well, it would be a bummer for media, but on balance I’d take it. There will always be ways for artists to get out ahead again. More on that in another post.
I also think there are some potential policy remedies that would even out the allocation of value here, but they deserve their own discussion.
In this discussion, I set copyright and fair use aside. I should say, however, that I’m not at all interested in clearing the air for AI companies, legally. They’ve chosen to plunge ahead into new terrain, and they can find their own way through whatever legal weather awaits them there.