Is it okay?

February 11, 2025
Macbeth Consulting the Witches, 1825, Eugène Delacroix

How do you make a language model? Goes like this: erect a trellis of code, then allow the real program to grow, its development guided by a grueling training process, fueled by reams of text, mostly scraped from the internet. Now. I want to take a moment to think together about a question with no remaining practical importance, but persistent moral urgency:

Is that okay?
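(A brief aside for concreteness. Here is a minimal sketch of that trellis-and-training recipe: a toy next-token predictor in PyTorch, with one sentence standing in for the reams of scraped text. Everything in it is illustrative, and nobody's actual stack looks this small, but the shape of the loop is the same: read text, predict the next token, adjust, repeat.)

```python
import torch
import torch.nn as nn

# A toy corpus: one of the "diverse reasons for writing," standing in
# for the reams of scraped text. (Illustrative only.)
text = "to inform, to persuade, to sell this stupid alarm clock"
tokens = torch.tensor([[ord(c) for c in text]])  # crude byte-level tokens

class TinyLM(nn.Module):
    """The trellis: a tiny recurrent next-token predictor."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# The "grueling training process," miniaturized: at each position,
# predict the next character given everything before it.
for step in range(200):
    logits = model(tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 256), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

(Scale the corpus up to Everything and the loop keeps its shape; only the trellis and the chips grow.)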

The question doesn’t have any practical importance because the AI companies — and not only the companies, but the enthusiasts, all over the world — are going to keep doing what they’re doing, no matter what.

The question does still have moral urgency because, at its heart, it’s a question about the things people all share together: the hows and the whys of humanity’s common inheritance. There’s hardly anything bigger.

And, even if the companies and the enthusiasts rampage ahead, there are still plenty of us who have to make personal decisions about this stuff every day. You gotta take care of your own soul, and I’m writing this because I want to clarify mine.


A few ground rules.

First, if you (you engineer, you AI acolyte!) think the answer is obviously “yes, it’s okay”, or if you (you journalist, you media executive!) think the answer is obviously “no, it’s not okay”, then I will suggest that you are not thinking with sufficient sensitivity and imagination about something truly new on Earth. Nothing here is obvious.

Second, I’d like to proceed by depriving each side of its best weapon.

On the side of “yes, it’s okay”, I will insist that the analogy to human learning is not admissible. “Don’t people read things, and learn from them, and produce new work?” Yes, but speed and scale always influence our judgments about safety and permissibility, and the speed and scale of machine learning is off the charts. No human, no matter how well-read, could ever field requests from a million other people, all at once, forever.

On the side of “no, it’s not okay”, I will set aside any arguments grounded in copyright law. Not because they are irrelevant, but because … well, I think modern copyright is flawed, so a victory on those grounds would be thin, a bit sad. Instead, I’ll defer to deeper precedents: the intuitions and aspirations that gave rise to copyright in the first place. To promote the Progress of Science and useful Arts, remember?

I hope partisans of both sides will agree this is a fair swap. Put down your weapons, and let’s think together.


I want to go carefully, step by step — yet I want to do so with brevity. Language models produce so … many … WORDS, and they seem to coax just as many out of their critics. Logorrhea begets logorrhea. We can do better.

I’ll begin with my sense of what language models are doing. Here it is: language models collate and precipitate all the diverse reasons for writing, across a huge swath of human activity and aspiration. Count off those reasons: to inform, to persuade, to sell this stupid alarm clock, to dump the CUSTOMERS table into a CSV file … and you realize it’s a vast field of desire and action, impossible to hold in your head.

The language models have many heads.

To make this work — you already know this, but I want to underscore it — only a truly rich trove of writing suffices. Train a language model on all of Shakespeare’s works and you won’t get anything useful, just a brittle Shakespeare imitator.

In fact, the only trove known to produce noteworthy capabilities is: the entire internet, or close enough. The whole browsable commons of human writing. From here on out, we’ll call it Everything, which is short for Everything That Can Be Accessed Online, Plus Some Extra Pirated Stuff, Probably.

This is what makes these language models new: there has never, in human history, been a way to operationalize Everything. There’s never been anything close.

Just as, above, I set copyright aside, I want also to set aside fair use and the public domain. Again, not because they are irrelevant, but because those intuitions and frameworks all assume we are talking about using some part of the commons — not all of it.

I mean: ALL of it!

If language models worked like cartoon villains, slurping up Everything and tainting it with techno-ooze, our judgment would be easy. But of course, digitization is trickier than that: the airy touch of the copy complicates the scenario.

The language model reads Everything, and leaves Everything unchanged — yet suddenly this new thing exists, with strange and formidable powers.

Is that okay?


As we begin to feel our way across truly new terrain, we can inquire: how much of the value of these models comes from Everything? If the fraction were just one percent, or even ten, then we wouldn’t have much more to say.

But the fraction is, for sure, larger than that.

What goes into a language model? Data and compute.

For the frontier models like Claude, data means: Everything.

Compute combines two pursuits:

  1. software: the trellises and applications that support the development and deployment of these models, and

  2. hardware: the vast, sultry data centers, stocked with chips, that give them room to run.

There’s a lot of value in those pursuits; I don’t take either for granted, or the labor they require. The experience you get using a model like Claude depends on an ingenious scaffolding. Truly! At the same time: I believe anyone who works on these models has to concede that the trellises and the chips, without data, are empty vessels. Inert.

Reasonable people can disagree about how the value breaks down. While I believe the relative value of Everything in this mix is something close to 90%, I’m willing to concede a 50/50 split.

And here is the important thing: there is no substitute.

You’ve probably heard about the race to generate novel training data, and all the interesting effects such data can have. It is sometimes lost in those discussions that these sophisticated new curricula can only be provided to a language model already trained on Everything. That training is what allows it to make sense of the new material.

Also, it is often the case — not always, but often — that the novel training data is generated by … a language model … which has itself been trained on … you guessed it.

It’s Everything, all the way down.

Would it be possible to commission a fresh body of work, Everything’s equal in scale and diversity, without any of the encumbrances of the commons? If you could do it, and you trained a clean-room model on that writing alone, I concede that my question would be moot. (There would be other questions! Just not this one.) Certainly, with as much money as the AI companies have now, you’d expect they might try. We know they are already paying to produce new content, lots of it, across all sorts of business and technical domains.

But this still wouldn’t match the depth and richness of Everything. I have a hypothesis, which naturally might be wrong: that it is precisely the naivete of Everything, the fact that its writing was actually produced for all those different reasons, that makes it so valuable. Composing a fake corporate email, knowing it will be used to train a language model, you’re not doing nothing, but you’re not doing the same thing as the real email-writer. Your document doesn’t have the same … what? The same grain. The same umami.

Maybe one of these companies will spend ten billion dollars to commission a whole new internet’s worth of text and prove me wrong. However, I think there are information-theoretic reasons to believe the results of such a project would disappoint them.


So! Understanding that these models are reliant on Everything, and derive a large fraction of their value from it, one judgment becomes clear:

If their primary application is to produce writing and other media that crowds out human composition, human production: no, it’s not okay.

For me, this is intuitively, almost viscerally, obvious. Here is the ultimate act of pulling the ladder up behind you, a giant “fuck you” to every human who ever wanted to accomplish anything, who matched desire to action, in writing, part of Everything. Here is a technology founded in the commons, working to undermine it. Immanuel Kant would like a word.

Fine. But what if that isn’t the primary application? What if language models, by collating and precipitating all the diverse reasons for writing, become flexible general-purpose reasoners, and most of their “output” is never actually read by anyone, instead running silent like the electricity in your walls?

It’s possible that language models could go on broadening and deepening in this way, and eventually become valuable aids to science and technology, to medicine and more.

This is tricky — it’s so, so tricky — because the claim is both (1) true, and (2) convenient. One wishes it weren’t so convenient. Can’t these companies simply promise, with every passing year, that AI super science is just around the corner … and meanwhile, wreck every creative industry, flood the internet with garbage, grow rich on the value of Everything? Let us cook — while culture fades into a sort of oatmeal sludge.

They can do that! They probably will. And the claim might still be true.

If super science is a possibility — if, say, Claude 13 can help deliver cures to a host of diseases — then, you know what? Yes, it is okay, all of it. I’m not sure what kind of person could insist that the maintenance of a media status quo trumps the eradication of, say, most cancers. Couldn’t be me. Fine, wreck the arts as we know them. We’ll invent new ones.

(I know that seems awfully consequentialist. Would I sacrifice anything, or everything, for super science? No. But art and media can find new forms. That’s what they do.)

Obviously, this scenario is especially appealing if the super science, like Everything at its foundation, flows out into the commons. It should.

So — is super science really on the menu? We don’t have any way of knowing; not yet. Things will be clearer in a few years, I think. There will either be real undeniable glimmers, reported by scientists putting language models to work, or there will still only be visions.

For my part, I think the chance of super science is below fifty percent, owing mostly to the friction of the real physical world, which the language models have, so far, avoided. But, I also think the chance is above ten percent, so, I remain curious.

It’s not unreasonable to find this wager suspicious, but if you do, I might ask: is there any possible-but-unproven technology that you think is worth pursuing even at the cost of itchy uncertainty in the present? If the answer is “yes, just not this one”: fair enough. If the answer is “no”: aha! I see you’ve answered the question at the top of this page for yourself already.


Where does this leave us?

I suppose it’s not surprising, in the end:

If an AI application delivers some profound public good, or even if it might, it’s probably okay that its value is rooted in this unprecedented operationalization of the commons.

If an AI application simply replicates Everything, it’s probably not okay.

I’ll sketch out my current opinions more specifically:

I think the image generation models, trained on the Everything of pictures, are: probably not okay. They don’t do anything except make more images. They pee in the pool.

I think the frontier models like Claude are: probably okay. If it seemed, a couple of years ago, that they were going to be used mainly to barf out text, that impression has faded. It’s clear their applications are diverse, and often have more to do with processes than end products.

The case of translation is compelling. If language models are, indeed, the Babel fish, they might justify the operationalization of the commons even without super science.

I think the case of code is especially clear, and, for me, basically settled. That’s both (1) because of where code sits in the creative process, as an intermediate product, the thing that makes the thing, and (2) because the commons of open-source code has carried the expectation of rich and surprising reuse for decades. I think this application has, in fact, already passed the threshold of “profound public good”: opening up programming to whole new groups of people.

But, again, it’s important to say: the code only works because of Everything. Take that data away, train a model using GitHub alone, and you’ll get a far less useful tool.

Maybe (it turns out) I’m less interested in litigating my foundational question and more interested in simply insisting on the overwhelming, irreplaceable contribution of this great central treasure: all of us, writing, for every conceivable reason; desire and action, impossible to hold in your head.


Did we make progress here? I think so. It’s possible my question, at the outset, seemed broad. In fact, it’s fairly narrow, about this core mechanism, the operationalization of the commons: whether I can live with it, or not.

One extreme: if these machines churn through all media, and then, in their deployment, blow away any prospect for a healthy market for human-made media, I’d say, no, that’s not what we want from technology, or from our future.

Another extreme: if these machines churn through all media, and then, in their deployment, discover several superconductors and cure all cancers, I’d say, okay … we’re good.

What if they do both? Well, it would be a bummer for media, but on balance I’d take it. There will always be ways for artists to get out ahead again. More on that in another post.

I also think there are some potential policy remedies that would even out the allocation of value here — although, these days, imagining interesting policy is a sort of fantastical entertainment. Even so, I’ll post about those later, too.

In this discussion, I set copyright and fair use aside. I should say, however, that I’m not at all interested in clearing the air for AI companies, legally. They’ve chosen to plunge ahead into new terrain — so let them enjoy the fog of war, Civ-style. Let them cook!
