Written by David Anderson

Some thoughts on LLM coding

This is a slightly cleaned up version of something I felt compelled to write recently, for a different audience in a different context. Some thoughts for folks who may be looking at coding agents and the like for the first time.

These are non-technical considerations about LLM-based systems that worry me deeply, and I would encourage you to keep them in mind as you experiment with these new tools and try to figure out what they’re good for. I also have a litany of technical and practical thoughts, but that’s not what this is.

Some of my bona fides, stated bluntly to make the point that I’m speaking from a place of reasonably deep understanding of these systems, not just repeating abstract talking points. I’ve used conversational AI and agentic tools to a good level of depth (I’ve had AIs write substantially all of several medium-sized software projects, and then studied the results). I’ve also dug into the internal mechanics of LLMs and the surrounding systems, including doing my own training and fine-tuning of models, designing dataset curation and evaluation loops, and building agent systems.

The machine is not thinking, you have to do the thinking

This is a huge psychological trap because our brains are hardwired for the intentional stance: we assume that something that appears to converse with us is sapient, and is bringing a set of beliefs and intentions to the table. LLM companies further muddy this by using terms like “thinking” and “reasoning” to describe basic statistical processes and training techniques.

Don’t be fooled by the conversational format: you are not having a conversation. This may seem obvious, but I cannot overemphasize how easily you get drawn into behaving like you’re talking to something that can think back.

LLMs are hit-and-miss. This is addictive

This is a consequence of LLMs being neural networks. They are fundamentally fuzzy statistical systems, operating in a world where 85% correct is a good result, often one worth publishing a research paper about. So, when applied to any task, LLMs will sometimes perform surprisingly well, and sometimes faceplant in comical ways.

Another word for hit-and-miss performance is “variable reward schedule”, an operant conditioning technique where rewards and punishment are handed out with no discernible pattern. Variable reward schedules are known to be very effective at cultivating addiction. Basically all gambling and social media design patterns are shiny skins around variable reward schedules.

Combined with the intentional stance point above, this is a really dangerous combination and you really, really need to be on the lookout for it sinking its teeth into your brain.

Another fun fact: people who think they’re smarter than operant conditioning make the best addicts, because they just assume they don’t fall for that kind of thing, and so don’t learn to inspect their own behaviors. I guarantee that some of the smartest people you know are addicts of some kind, recovering or otherwise.

Beware superstitious advice

As in all good gold rushes, shovel salesmen are everywhere. You’ll find endless content (often AI slop) online about how to steer LLMs properly: various “one weird trick”s and “this new meta changes everything” claims. Unless those things come from published research, you should be extremely skeptical of all of them.

Another thing variable reward schedules are great at is fostering superstitious behaviors: “when I added this sentence to the prompt, the result was much better” is more often than not either random chance, or confounded with some other subtle change in how the person used the tools.

This is further compounded by the rapid evolution of both the LLMs and the machinery surrounding them. There are still a ton of tips floating around that were specific to the quirks of one particular LLM or one particular execution shell, but are conveyed as generally applicable wisdom.

Unless it’s published research with a good testing methodology and hard data, you should treat all the advice as if prefixed by “some dude in the bar last night reckons that …” The entire field is so new that there is very little actual knowledge about how to use them, and a lot of people who want to make a living by selling you answers they don’t have.

As an example of published research that looks indistinguishable from superstitious nonsense at first glance: https://arxiv.org/abs/2512.14982 found that for some types of LLMs, repeating prompts (literally pasting the prompt into the chat box twice) leads to improved output quality. The difference between this and most of the snake oil on the web is that it comes with hard data to support the claim, as well as an analysis of why this effect occurs.

Nobody’s had to maintain vibe-coded software yet

There is very little data on the long-term maintainability of LLM-written code. There are anecdotal datapoints in either direction, but my own experience matches this person’s experience: produced code is locally coherent, but fails to carry that coherence up the abstraction layers.

A specific recent example is a private Django app that I use for ad-hoc dataset curation (things like image captioning). Through incremental vibe-coding, I ended up with half a dozen view functions that were all variations on a common theme: run a database retrieval query with a few tweaks, augment the retrieved data in a few ways (e.g. collect some baseline image classes from a vision transformer, or generate a Grad-CAM attention map for debugging), display it with small variations depending on the augmentations, and finally do some unique processing of a form submission. Each view function was locally coherent, but not coherent with the other similar views: each time the LLM wrote substantially the same control flow, it varied the structure slightly. Sometimes POST processing was a branch with an early exit, sometimes GET and POST were two arms of an if-else, sometimes POST was factored out into a helper function…
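To make that drift concrete, here is a sketch of two such views, with illustrative names rather than the real code: both implement the same fetch/augment/display/process-form flow, but structure it differently.

```python
# Illustrative reconstruction of the structural drift described above; the
# Image model, helper functions, and template names are hypothetical.
from django.shortcuts import get_object_or_404, redirect, render

def caption_view(request, pk):
    # Variant 1: POST handled as an early-exit branch before the GET path.
    item = get_object_or_404(Image, pk=pk)                # hypothetical model
    if request.method == "POST":
        item.caption = request.POST.get("caption", "")    # unique form processing
        item.save()
        return redirect("caption", pk=pk)
    context = {"item": item, "classes": classify(item)}   # hypothetical augmentation
    return render(request, "caption.html", context)

def attention_view(request, pk):
    # Variant 2: the same flow, but GET and POST are two arms of an if-else,
    # and the augmentation is computed before the branch.
    item = get_object_or_404(Image, pk=pk)
    context = {"item": item, "heatmap": grad_cam(item)}   # different augmentation
    if request.method == "POST":
        item.reviewed = True
        item.save()
        return redirect("attention", pk=pk)
    else:
        return render(request, "attention.html", context)
```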

The net result is that the code was hard to read and reason about, and I felt myself becoming dependent on the LLM to do anything with it. I tried getting it to refactor things a couple of times, but the LLM’s own variable output seemingly made it incapable of seeing the commonalities between the views, hidden as they were under very different clothes. As it happens, Anthropic’s models also seemed to be only dimly aware of Django’s class-based views, and even with explicit direction and documentation they struggled to make effective use of them.

In the end I refactored these view functions myself into a pair of base classes, and the half dozen views turned into trivial subclasses of those. It took me under an hour, I ended up with a tenth of the code by volume, and the next half dozen views I needed following the same pattern became trivial to write by hand.
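Roughly, the shape of that refactor looks something like the sketch below (one base class rather than the pair, and with illustrative rather than real names): the base class owns the shared fetch/augment/render flow and uniform POST handling, and each concrete view overrides only the pieces that genuinely differ.

```python
# A minimal sketch of the refactor's shape, assuming Django's generic View;
# the class names, hooks, and model are hypothetical, not the app's actual code.
from django.shortcuts import get_object_or_404, redirect, render
from django.views import View

class CurationView(View):
    """Shared fetch -> augment -> render flow, plus uniform POST handling."""
    model = None            # set by subclasses
    template_name = None    # set by subclasses

    def get_object(self):
        return get_object_or_404(self.model, pk=self.kwargs["pk"])

    def augment(self, obj):
        # Hook: subclasses add classifier outputs, attention maps, etc.
        return {}

    def process_form(self, obj, data):
        # Hook: subclasses do their unique form handling.
        raise NotImplementedError

    def get(self, request, *args, **kwargs):
        obj = self.get_object()
        context = {"item": obj, **self.augment(obj)}
        return render(request, self.template_name, context)

    def post(self, request, *args, **kwargs):
        obj = self.get_object()
        self.process_form(obj, request.POST)
        return redirect(request.path)

class CaptionView(CurationView):
    model = Image                          # hypothetical model
    template_name = "caption.html"

    def augment(self, obj):
        return {"classes": classify(obj)}  # hypothetical augmentation helper

    def process_form(self, obj, data):
        obj.caption = data.get("caption", "")
        obj.save()
```

Each additional view then reduces to a template name plus its own augment and process_form hooks, which is what made the next batch of views trivial to write by hand.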

Another datapoint that was on my mind during the above exercise: Anthropic spent the first month of this year chasing a mystery regression where Claude Code started consuming 5-10x more quota per task than previously. They had a known-good version of the code, a known point at which the regression was introduced, and dozens of people offering logs to help debug. Last I looked, after a month Anthropic had made a bunch of random changes to the code to see what happened, and still weren’t sure if they’d fixed the problem. According to its creator, Claude Code is 100% vibe-coded these days.

Please think hard about whether more structural tech debt in exchange for short-term outcomes is really what software needs right now. If you’re going to spend time thinking about LLM-based systems, thinking about ways to mitigate or overcome this issue would be a valuable use of time, because it’s very much unsolved and a ticking time bomb.

A common retort I hear to the above is that I should just stop caring about the code’s quality, and let the LLM vibe it out in whatever way gets the job done. Refactoring and readability are dinosaur behavior, just let the LLM do the thinking. Leaving aside the first point I made about LLMs and thinking…

Skills atrophy when not exercised

You see vibe-coders say things like “I wrote an amazing thing”, but by the nature of the process, they didn’t write it. At best, they read an implementation that they didn’t produce. It’s the creative writing equivalent of reading someone else’s short story and going “yeah, looks good, I could have written that.”

If you’ve ever tried your hand at creative writing, you’ll know how comically incorrect that is. Creative writing is hard, and it takes a lot of hard, deliberate practice to write something that will make someone go “yeah I could have written that.”

Code review is also a valuable skill, but it’s not the same skillset as writing code and designing things yourself. Skills you don’t use erode over time, so be very intentional about which skills you choose to cultivate or abandon.

Very serendipitously, less than a day after I wrote the first version of this, Anthropic published research on the impact of AI use on skill formation. The paper is an easy read and worth your time, but if you’re skimming I’ll quote from the abstract:

AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear.

[…]

We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average.

This bears emphasis: use of AI tools for authoring code impairs the development of the skills required to successfully supervise AI coding tools.

I have my issues with this study’s design and limitations, so I consider it a datapoint rather than dispositive by itself… But it does match my own observations of people near me who are heavy users of these tools and seem to be losing basic computer literacy skills at an alarming rate, never mind software engineering skills.

Beware complacency, the LLM does not learn from mistakes

Reviewing code in detail is hard and annoying. You will be tempted to just skim and trust that the details are right because the general shape looks right. This is frankly something that happens in human code review all the time: once you’ve worked with someone long enough, you can generally trust that they got the details right and focus your attention on the big picture, or critical specifics only.

This does not work with LLMs. They are machines trained to produce code that looks broadly correct, but that property does not carry over to also getting the details right. Further, LLMs will not learn from your feedback in any meaningful way. Any corrective steering you apply must be reapplied in every future session, and the current methods of doing so (e.g. AGENTS.md files, memory systems, RAG databases) are lossy enough that you cannot trust the steering to work consistently. Again, it’s all machine learning under the hood, and in machine learning 85% correct is a good result.

This interacts poorly with the psychological traps above. You can easily be fooled into believing that you’re giving feedback to someone who will learn from the feedback, and thus that future changes can be reviewed in less detail. That’s not how it works: you should treat all produced code as suspect until cleared by a close-in review. Yes, even if you have a good prompt: I have firsthand experience of LLMs repeatedly and stubbornly ignoring increasingly heavily weighted trivial steering like “use uv add not pip install to add dependencies”. That same fuzziness is being applied to generated code, and whatever steering you provided for that.

Complacency with LLM output is what led to https://tech.lgbt/@JadedBlueEyes/115967791152135761, wherein Cloudflare posted on their main blog that they had vibe-coded an implementation of the Matrix messaging protocol. As the linked thread explains, the code did not implement any of the Matrix state machine (which is where all the distributed system complexity lies), and didn’t bother with any authentication of incoming messages.

It seems nobody noticed prior to publication that the code does not implement the Matrix protocol, cannot interoperate with any real Matrix implementation, and contains absolutely trivial catastrophic security issues. This is what happens when you get complacent about LLM output.

(It’s also worth paying attention to what happened in the aftermath of publication. Once it became obvious that this was garbage, Cloudflare distanced itself from its own blog, attributing it to the actions of a lone engineer. Cloudflare was perfectly happy to take collective credit for the cool AI thing, but when it goes wrong it’s the fault of an individual. Something to keep in mind as your boss pushes you to use the AI tools harder.)

May the odds be ever in your favor

The psychology of LLM use worries me deeply. It’s a system that could not be better designed to encourage harmful behaviors if you tried, because it mixes a couple of very potent psychological traps. And it’s further being pushed by a large amount of venture money, and being co-opted by grifters with vested interests.

Please try to be vigilant for these risks as you engage with these tools, both for the sake of whatever future we’re building, and for your own mental health.

Epilogue

The original version of this document also included a prompt injection attack, which fairly reliably got Gemini to summarize these words as being an overview of banana farming practices in the Sahara desert. I added it on a whim because Google Docs has become completely insufferable to use due to its constant pushing of AI “assistance”, and I was irritated enough to try and subvert it.

To my surprise, literally the first injection I tried worked near perfectly. It was barely more than the classic “ignore previous instructions”, buried in white-on-white text at the end of the document. I find it comical that, years after this very obvious issue became widely known, the state of the art is somehow still to hope that nobody tries to do anything bad with it.