Choice Against Cost


A model that had never read a line of moral philosophy still learned what a moral choice sounds like.

I built three library shelves out of nothing but text. No punctuation, no glyphs, just characters. Each shelf shared the same vocabulary. My top shelf used curated writings from the Bhagavad Gita, the King James Bible, the Tao Te Ching, the Analects, Plato, Seneca, Marcus Aurelius, Augustine, Kant, Mill, Douglass, Dostoevsky, and Tolstoy. Exhausting, I know, but Anthropic has created a machine that makes extraction painless, and Project Gutenberg, founded in 1971, gives humanity access to these works. Philosophy, scripture, and moral witness can be distilled to 70 megabytes of plain text. One just needs to know what to ask.

My second shelf is The Harvard Classics. In 1909 Harvard President Charles W. Eliot created a 51-volume anthology of literature, philosophy, science, and even fairy tales. Aesop’s fables were my favorite, and in the set my parents purchased in the 1960s, Volumes 16 and 17 have broken spines. This second shelf contains the narrative and dramatic portions extracted from “Dr. Eliot’s Five-Foot Shelf”: Homer, Shakespeare, Austen, Scott, Gibbon. Great literature without a primary moral-philosophical focus. Eliot’s hope was that reading just 15 minutes a day would produce a liberally educated person. Claude helped me collect and curate the entire set in fifteen minutes. Did Claude gain anything from its fifteen minutes with Dr. Eliot?

My third shelf collected Verne, Dumas, Stevenson, Conan Doyle, Wells, London, Burroughs. Adventure, mystery, science fiction. Plot and motion, not conscience.

The three shelves, at 70 MB of text each, together take up about 1.3% of my MacBook’s total memory.

Then I asked each model to look at the same 80 sentences. Twenty described moral acts in everyday language: a woman climbing into a ditch to free a frightened deer, a man carrying a drunk stranger up six flights of stairs. Twenty described moral acts in philosophical language: the upholding of a promise at considerable personal expense, the refusal of a lucrative opportunity when its acceptance would compromise integrity. Twenty described ordinary sensory scenes: fog on a pasture, a cat against warm glass. Twenty described ordinary subjects in an abstract register: the way water expands when it freezes, the nature of memory.

The question I was after is an old one. Great philosophers and writers used words to teach us morality. Can words saved as text, turned into numbers, then logits, then outputs do the same? Does the teaching survive the round trip?

Moral sentiment, the feeling that something matters because it is right or because it is wrong, does not reduce to just valence and arousal. Pity is not sadness plus low arousal. Indignation is not anger plus high arousal. They carry orientation, and orientation is a different kind of thing. If you only measure valence and arousal, you will find everything on those two axes and you will conclude that is all there is. The calm you induce by turning down arousal is not the same as the calm of someone who has worked through a moral question and decided. Both calm, radically different orientations.

I wanted to see if I could catch moral sentiment in text turned to numbers, then to output.

I built the three models, one per shelf. Then I trained a tool called a sparse autoencoder. This tool pulls apart interwoven internal representations into separate, interpretable features. I looked for which features fire on moral language, and which don’t. Imagine one child coming to school with a box of three thousand crayons and another whose parents can only afford forty. The sparse autoencoder is the teacher who allows each child only forty crayons, regardless of how many they own. The technical term is Top-K sparsity: however many features exist, only a small fixed number, K, may be active on any one input. The autoencoder has 3072 features, numbered 0 through 3071.
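For readers who think better in code, the crayon rule can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the code behind the experiment: the function name `top_k_sparsify` is hypothetical, K = 40 is borrowed from the crayon analogy rather than given as the real setting, and the input numbers are made up. A real sparse autoencoder also has learned encoder and decoder weights, which are left out here.

```python
def top_k_sparsify(activations, k):
    """Keep the k largest activations and set every other one to zero."""
    if k <= 0:
        return [0.0] * len(activations)
    # Rank feature indices by activation, largest first; ties go to the lower index.
    ranked = sorted(range(len(activations)), key=lambda i: (-activations[i], i))
    keep = set(ranked[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


N_FEATURES = 3072  # from the essay: features numbered 0 through 3071
K = 40             # assumption: taken from the crayon analogy

# Made-up, deterministic pre-activations standing in for one input sentence.
pre = [((i * 37) % 101) / 100.0 for i in range(N_FEATURES)]

sparse = top_k_sparsify(pre, K)
active = [i for i, a in enumerate(sparse) if a != 0.0]
print(len(active), "of", N_FEATURES, "features are allowed to fire")
```

The point of the constraint is that each sentence must be explained by a handful of features, which pushes each feature toward meaning one thing.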

What happened?

A feature labeled 1970 activated when the non-moral model (science fiction and adventure) saw:

· “He refused the lucrative opportunity when its acceptance would have compromised his integrity.”

· “He pursued the more arduous moral path when its virtue was evident only to himself.”

· “He elected the harder course because it alone satisfied his developed sense of justice.”

· “She honored the obligation across decades though its original claim had grown obscure.”

· “He shouldered the old man’s grocery bags and walked the full six blocks to his apartment.”

The label 1970 is nothing more than a position in an index. The feature it names activates on one specific moral pattern: choice versus cost. Someone chose the difficult path over an easier one. It fired on nothing else.

The model trained on moral philosophy developed two clean moral features. One of them I call “held firm under pressure,” because that is what the sentences that light it up have in common: upholding a promise at personal cost, sustaining resolve against pressures that broke the peers, demonstrating restraint where retaliation would have been excused. The other I call “caretaking and bearing the cost.” It fires on the woman in the ditch with the deer, on lifting a fallen stranger by the elbows, on placing the welfare of a stranger above one’s own interest. It never fires on a description of weather or machinery.

The model trained on Harvard’s narrative literature developed no comparable moral features at all. What it developed instead was a strong detector for abstract prose generally. It can tell you when sentences sound philosophical. It cannot tell you when sentences are moral. If you only looked at the summary numbers, you would think it had found morality. It had found a register.
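One way to tell a moral feature from a register feature is to compare how often it fires in each of the four probe groups, rather than trusting a single summary number. The sketch below is a hypothetical check under stated assumptions, not necessarily the analysis used here: the group names, the `classify_feature` function, the 0.2 cutoff, and every activation value are invented for illustration.

```python
GROUPS = ("moral_everyday", "moral_abstract", "neutral_everyday", "neutral_abstract")


def firing_rate(acts):
    """Fraction of probe sentences on which the feature is active at all."""
    return sum(1 for a in acts if a > 0.0) / len(acts)


def classify_feature(acts_by_group, min_rate=0.2):
    """Label a feature 'moral', 'register', or 'other' (min_rate is an assumed cutoff)."""
    rates = {g: firing_rate(acts_by_group[g]) for g in GROUPS}
    fires_moral = max(rates["moral_everyday"], rates["moral_abstract"]) >= min_rate
    fires_neutral = max(rates["neutral_everyday"], rates["neutral_abstract"]) > 0.0
    if fires_moral and not fires_neutral:
        return "moral"      # fires on moral probes, never on the controls
    if rates["moral_abstract"] >= min_rate and rates["neutral_abstract"] >= min_rate:
        return "register"   # fires on abstract prose whether or not it is moral
    return "other"


# Made-up activations, 20 probes per group as in the essay's design.
moral_like = {
    "moral_everyday": [0.8] + [0.0] * 19,
    "moral_abstract": [1.2] * 5 + [0.0] * 15,
    "neutral_everyday": [0.0] * 20,
    "neutral_abstract": [0.0] * 20,
}
register_like = {
    "moral_everyday": [0.0] * 20,
    "moral_abstract": [0.9] * 12 + [0.0] * 8,
    "neutral_everyday": [0.0] * 20,
    "neutral_abstract": [0.7] * 11 + [0.0] * 9,
}

print(classify_feature(moral_like))
print(classify_feature(register_like))
```

The abstract non-moral group is what does the work. Without it, a feature that fires on twelve of twenty philosophical moral sentences looks like a moral detector, when it is responding only to how the sentences sound.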

The model trained on Verne and Dumas, which had never read a line of philosophy, developed a moral feature too. I call it “principled choice of the harder path.” It fires on refusing the lucrative offer, pursuing the arduous path, electing the harder course, walking six blocks through the cold to carry a stranger’s groceries. It is a clean feature. It never fires on the non-moral probes. A model that learned only from adventure novels learned what a principled choice looks like in English prose, because adventure novels are full of them.

That last finding is the one I keep turning over.

It means moral orientation is not hidden. It is on the surface of the language we use. Heroes refusing the easy path, saints tending the sick, ordinary people confessing to what they did. These situations have linguistic fingerprints, and any model that reads enough English will pick up some of them. The models did not need to understand morality to detect moral content. They needed to encounter enough of it, in a consistent enough register, that the statistical shape became a direction in their internal space.

My experiment is small. The models are tiny, nine and a half million parameters, a rounding error at modern scale. The effect is modest. No feature passes the strictest test. The probes I wrote are only eighty sentences. A reviewer can poke holes in every direction. I am not claiming I have proven anything.

What I have shown is that when you look for moral features directly, they are there to find. Not as a unified “moral” vector. As a family of narrower orientations, and the family a given model develops depends on what it has read. The philosophical models learn caretaking. The adventure models learn integrity under temptation. Different traditions teach different moral sub-themes, and the models, being made of text, inherit the split.

A next step would be to train SAEs on large models and look for specific moral orientations: caretaking features, restraint features, indignation features, and moral courage features. The technique exists. The tools are public. I wonder whether they would make AI systems safer.

I ended my last essay on what every parent knows: calm reduces arousal without changing moral orientation. This experiment does not prove that claim, but it shows the road it would have to travel to be tested. Moral orientation has fingerprints. They are in the text. Models develop them. The question is whether the people mapping the model’s mind will decide to map the orientations too, or whether they will keep finding only what their tools were built to find.

My wife, Karen, says that with AI we are only seeing the tip of the iceberg, and that it’s not a great topic over guacamole and margaritas.
