The imminent risk of vibe coding


A few months ago I noticed a dramatic uptick in the amount of Cursor-y code1 going up for review on the repo I help oversee. The problem was that the code had a distinct set of smells, but it was tricky to articulate what its defining characteristics were. Was it bad comments? An unnatural level of verbosity? Tough to say. In any case, it was disconcerting but not worrying.

Now, in the year of our lord 2026, we’re still dealing with these PRs, but we have far more data points about what’s going on and what the consequences are.

Let’s go over a few core facts about augmented (agentic) coding that are true at the time of writing:

  1. LLMs will always take the shortest path to accomplishing the given prompt. If you ask the LLM to implement a feature, it will not refactor your code to make more sense if it’s able to implement the feature without refactoring (a sketch of what this looks like follows this list).

  2. LLMs are lazy. They’ll explore exactly as much context as they think they need to solve the problem in front of them and no more. They will not hunt for other examples in the codebase and will almost never look for code to deduplicate. When faced with a large task, LLMs frequently give up after a relatively small amount of work2. Unless prompted, LLMs will often not run their own tests3.

  3. LLMs are bad at proving their work. I have never encountered an LLM that’s able to consistently write robust, exhaustive tests. The tests usually miss some edge cases, and they frequently include cases that are redundant. add = (a: number, b: number) => a + b might see tests for two integers, one negative and one positive number, two negative numbers, two floats, a float and an integer, NaN, probably a bunch of // @ts-expect-error comments, a call to add(1, add(2, 3)) to prove you can call it multiple times, and more4. That’s not exhaustiveness; it’s throwing shit at the wall (see the reconstruction after this list). More complicated functions will often get fewer tests, even though they arguably deserve more.

  4. LLMs do a poor job of evaluating their code subjectively. The code they write sprawls. Similar pieces of code end up in separate files. The models don’t like to trace how code is used, so many functions end up with huge try-catch blocks that swallow errors that should bubble up (sketched below). It’s very difficult to guide an LLM to have high coding standards without enumerating every single anti-pattern that you want it to avoid5. Some patterns can’t be prompted away.
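
To make that first point concrete, here’s the shape of shortest-path output I mean. This is a hypothetical TypeScript example I wrote for illustration, not code from a real PR:

```ts
// Before (hypothetical): a USD-only price formatter.
const formatPrice = (cents: number) => `$${(cents / 100).toFixed(2)}`;

// After prompting "add EUR support": the shortest path is a near-copy,
// not a refactor around a shared currency table. The two now drift apart.
const formatPriceEur = (cents: number) => `€${(cents / 100).toFixed(2)}`;
```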
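
For the third point, here’s a reconstruction of the kind of scattershot suite I mean, in Vitest style. I wrote this for illustration; it’s not verbatim model output:

```ts
import { describe, expect, it } from "vitest";

const add = (a: number, b: number) => a + b;

describe("add", () => {
  it("adds two integers", () => expect(add(2, 3)).toBe(5));
  it("adds a negative and a positive", () => expect(add(-2, 3)).toBe(1));
  it("adds two negatives", () => expect(add(-2, -3)).toBe(-5));
  it("adds two floats", () => expect(add(0.1, 0.2)).toBeCloseTo(0.3));
  it("adds a float and an integer", () => expect(add(0.5, 1)).toBe(1.5));
  it("propagates NaN", () => expect(add(NaN, 1)).toBeNaN());
  it("can be nested", () => expect(add(1, add(2, 3))).toBe(6));
  it("rejects strings at the type level", () => {
    // @ts-expect-error: a type "test" that proves nothing at runtime
    expect(add("1", 2)).toBe("12");
  });
});
```

Every assertion passes, and none of it tells you whether the function handles what callers actually throw at it.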
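
And for the fourth point, the swallowed-error pattern usually looks something like this (loadUserSettings and its endpoint are hypothetical names):

```ts
// A minimal sketch of the error-swallowing anti-pattern.
async function loadUserSettings(
  userId: string,
): Promise<Record<string, unknown>> {
  try {
    const res = await fetch(`/api/users/${userId}/settings`);
    return await res.json();
  } catch (err) {
    // Logged, then discarded: the caller can no longer tell
    // "user has no settings" apart from "the request failed".
    console.error("Failed to load settings", err);
    return {};
  }
}
```

The fix is usually to let the error bubble up to whoever can actually handle it, but the model rarely knows who that is because it never traced the call sites.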

The model is working for you, but it requires hand-holding to get things right. It’s like a self-driving car: you still have to keep your hands on the wheel and your eyes open, because even though the failure rate is quite low, the cost of those failures is quite high. At the time of writing, models do not reliably produce good code in one shot.

Here’s what I’ve observed watching people use LLMs to code:

  • Developers don’t understand the code that the LLM produces as well as the code they write by hand. I know this is true because I’ve heard it from multiple people, and I’ve experienced it myself. If you didn’t spend the time to really internalize all the lines of code you’ve produced, you don’t have an exhaustive understanding of the code.

  • It now takes far less cognitive effort to produce a PR than it does to review it. The work that a reviewer is expected to do is exactly the same work that an author should be doing: understanding the code. If the author isn’t fully internalizing what the LLM wrote and the reviewer is doing their job, the reviewer is putting in more work than the author.

  • The incentive to refactor is lower. The LLM doesn’t care how messy the code is. As an author, you have to go out of your way to clean up smelly code if you want refactoring to happen. Or, you can ignore the code smells and have the LLM apply patches on patches.

  • The more patches you have the LLM apply, the sloppier the codebase gets. At no point does anything get cleaned up or fixed properly on its own; the smells accumulate and the codebase gets worse. The LLM will sooner throw up its hands at the amount of work in front of it, proceeding in bite-sized chunks only when nudged, than take any initiative of its own.

  • The sloppier the codebase gets, the harder it is to refactor. The more work a refactor involves, the less effective the LLM is. LLMs are lazy and want to hang the Mission Accomplished banner as quickly as possible. Refactoring often means prompting every single minor change you want to see, which is tedious and hard to do, because LLMs can’t reliably evaluate code against a prompt like “Make the error handling more sensible” since it’s subjective.

This leads to the creation of some painful (nested) feedback loops:

  1. The more you vibe code, the more you vibe code.

    1. Vibe code

    2. Don’t fully understand the output, some code is worse than when you started

    3. Code lands in the codebase

    4. Making changes manually becomes harder

    5. Vibe coding intensifies

  2. The more vibe code, the harder it is to review.

    1. The more vibe code, the more patches

    2. The more patches, the more code smells

    3. The more code smells, the longer it takes to review

    4. The longer it takes to review, the more time reviewers spend reviewing

    5. The more time reviewers spend reviewing, the lower their standards will naturally become (or: the more likely they are to miss problems)

Even if only 5% of PRs introduce serious quality issues, those quality issues do and will compound, especially if there’s no ongoing effort to perform cleanup. The codebase isn’t going to clean itself up, after all, and the LLMs just aren’t a lot of help.

Reviewer time is precious. Being a diligent code reviewer is also draining, and if you’re trying to balance reviewing PRs with your own work6, you’re going to start missing things. I don’t even want to blame it on letting standards slip: if you do the same thing hour after hour, you’re going to make mistakes.

So how do we fix this? The short answer is “I don’t know how.” But I have some ideas.

As engineers, we should be holding ourselves to high standards. “Don’t be lazy” is a bad takeaway, though, so I’ll offer some thoughts:

  • If someone else can tell which lines of code were written by an LLM, the code shouldn’t make it to peer review. Hard stop. Good code doesn’t betray its origin, and it doesn’t need to: good code is good code. If it’s not good code, it shouldn’t be going up for review in the first place. If you can tell it’s written by an LLM, that means it’s doing something a human author wouldn’t do.

  • Time should be allocated in advance for cleanup work. Whether you’re using an LLM or not, if you never clean things up, the codebase will necessarily trend towards mess. Scheduling that time at least helps keep things from deteriorating to the point of intractability.

Changes need to be made to code reviews as well:

  • Stop reviewing if the code is bad. You’ve got three strikes. If I find three things wrong with your PR that are indicative of you not reviewing your own code before sending it to me, I’m not finishing the review; I’m clicking “Request changes” and leaving you to sit on my comments until tomorrow. The average engineer doesn’t have all day to sit around trying to understand code that you didn’t write and didn’t take the time to read.

  • Invest in more automated tools. I’m quite fond of Cursor’s Bugbot, which feels like the most robust automated code review offering I’ve used. More often than not, it catches real issues, though it errs on the pedantic side. It also misses things, so treat it as a second line of defense rather than a replacement for human review. Either way, if you’re going to burn tokens writing the code, you can pay to burn tokens reviewing it too: get a review tool.

    • Building on that, it should be policy that all automated comments are either addressed or responded to by the time another human looks at the PR.

  • Limit PR size. If someone sends me a 4,000-line PR, I know it probably wasn’t written by hand. Unless there’s a lot of autogen or test snapshots or a package-lock.json change or you upgraded the linter, there’s no way you wrote and internalized all of that. PRs should have a reasonable line limit unless there’s an especially good reason; a sketch of an automated gate follows this list.

  • Require paired reviews for PRs above a certain size. If you cap PRs at 1500 lines, any PR over, say, 750 lines should require a Zoom call or in-person code review. Walk me through it. Show me that you understand your work. It’s not my job to give you a list of comments that you copy and paste to Gemini before sending it back to me.
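
A cap like that is easy to automate. Here’s a minimal sketch of a CI gate in TypeScript; the BASE_REF environment variable is a hypothetical name you’d adapt to your CI provider, and the 1500-line cap is the example number from above:

```ts
// check-pr-size.ts: fail CI when a PR's diff exceeds a line cap.
import { execSync } from "node:child_process";

const MAX_CHANGED_LINES = 1500; // the cap discussed above
const base = process.env.BASE_REF ?? "origin/main"; // hypothetical env var

// git diff --shortstat prints something like:
//   " 3 files changed, 120 insertions(+), 45 deletions(-)"
const stat = execSync(`git diff --shortstat ${base}...HEAD`, {
  encoding: "utf8",
});

// Sum insertions and deletions into a single changed-line count.
const changed = [...stat.matchAll(/(\d+) (?:insertion|deletion)/g)].reduce(
  (sum, match) => sum + Number(match[1]),
  0,
);

console.log(`Changed lines: ${changed}`);
if (changed > MAX_CHANGED_LINES) {
  console.error(`PR exceeds the ${MAX_CHANGED_LINES}-line cap; split it up.`);
  process.exit(1);
}
```

In practice you’d exempt lockfiles, snapshots, and generated code from the count, per the caveats above.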

I’ve also gotten the suggestion that agent transcripts should be attached to PRs. That’s interesting, but I’m not sure it would be effective7. As a reviewer, the transcript doesn’t tell me the code is bad, only that I need to be more diligent. Perhaps if there were a way to know which lines were AI-written and which were written by hand, that could be a different story.

What I don’t want is stuff that feels like punishment. These are tools; you shouldn’t be shamed for using them. Responsible LLM use shouldn’t require you to self-flagellate to get your commit into main, or at least the process shouldn’t feel like self-flagellation.

I’m not 100% sure of what to think. The cat is out of the bag with LLMs. Engineering leaders want us to use them. Engineers want to use them. I’m a firm believer that LLMs do—when used properly—make writing software a more pleasant and productive experience. “Used properly,” though, is doing a lot of work here, and taking a hardline prescriptive stance of “if it’s not an engineering best practice without LLMs, it’s not allowed with LLMs” is ignoring all of the factors outside the academic exercise of writing good code that affect how we work.

In truth, I can’t fault my coworkers who have put up some ugly PRs. I know some of my PRs are real stinkers, and I know some of yours are too, reader. Changes are governed by the needs of the business, social pressure, time, and human fallibility. Software engineers have been chided for making Not Great Decisions since the first vacuum tube was installed, and using a new, ubiquitous tool shouldn’t be cause for us to load up each other’s actions with shame.

Hopefully, model improvements over the next 6-9 months obviate these problems. Maybe Opus 4.7 is going to get self-conscious about not taking enough time to clean up the codebase. Or sub-agents will evolve enough to productively fan out large chunks of challenging work. Or maybe Steve Yegge will turn Gas Town into something that’s not so strongly steeped in hubris.

In the meantime, though, I’ll be losing sleep thinking about how to keep the wheels from falling off my repos.
