I once worked at a startup that had fallen from tech debt into tech bankruptcy. We managed to get it back on the right track, but it made me rethink the concept of tech debt and how we ship software, especially in hostile environments.
A hostile environment, for me, is any place where software engineering is seen as the implementing workforce of ideas that come from outside, with little to no autonomy. A product team or similar owns 100% of engineering time, and you have to claw back time to work on the core stuff. That includes tackling tech debt.
Here’s the definition of tech debt I like:
Tech debt is the implied cost of an easy solution over a slower and better approach, accumulated over time.
Tech debt is an API that returns a list of results without pagination. You started with twenty results and figured you’d never go over a hundred, but three years later you have thousands of elements in a 10MB response.
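Pagination is cheap if you build it in from the start. A minimal sketch of cursor-style pagination in Python, just to illustrate the shape of the fix (all names here are hypothetical, not from any codebase in this story):

```python
# Minimal sketch of cursor-based pagination (all names hypothetical).
# Instead of returning every row, return one page plus a cursor for the next.

def list_results(rows, cursor=0, page_size=100):
    """Return one page of rows and the cursor for the next page (or None)."""
    page = rows[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(rows) else None
    return {"items": page, "next_cursor": next_cursor}

def fetch_all(rows):
    """A client walks the pages instead of downloading a 10MB blob in one shot."""
    cursor, items = 0, []
    while cursor is not None:
        page = list_results(rows, cursor)
        items.extend(page["items"])
        cursor = page["next_cursor"]
    return items
```

The point isn't the ten lines of code; it's that retrofitting a cursor onto an API that already has clients is far more expensive than shipping it on day one.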
It’s the fragile code that everything runs through. You dare not touch it, afraid of breaking things or introducing unexpected behaviors.
It’s the parts of the codebase that nobody wants to touch. You know which part I’m talking about. Everyone has at least one. If you get a ticket saying something’s wrong with user messaging and everyone backs away from their desk and starts staring at the floor, that’s a measurable cue that it needs focused effort.
It’s the entire system that has become too complex to change or deprecate, and you’re stuck with it because no reasonable amount of effort would be enough to fix or replace it.
It’s the broken tools and processes. Even worse, it’s a lack of confidence in the build and deploy process. Good builds should always build successfully, good deploys should always deploy successfully, and breakages should never roll out.
It’s what you wish you could change but can’t afford to.
There are a million reasons why tech debt happens and accrues. It comes from insufficient upfront definition, tight coupling of components, lack of attention to the foundations, and evolution over time. When was the last time you had the full requirements for a project before any implementation work began, and they stayed the same until delivery?
It’s when you see a pull request with a hard-coded credential or an API key and ask that credentials be kept out of the codebase. You get pushback because there’s a cron job from four years ago with a hard-coded SQL connection string, password included.
That’s tech debt: people pushing back on good solutions. There are a million excuses for it. “We’re a tiny startup, we can’t afford to have perfect code.” “We’re still trying to prove product-market fit.” Or even: “When we make some money, things will be different.”
It may even sound reasonable in the beginning. But you don’t want your golden moment, when your business takes off, to also be the moment of complete technical breakdown.
And it’s not just about bad code, tools, or processes. It’s about people learning the bad behavior and passing it on to the new hires. If the old hires perpetuating the bad ways outnumber the newcomers, the newcomers have no chance.
Unaddressed tech debt breeds more tech debt. Once some amount of it is allowed to survive and thrive, people are more likely to contribute lower-quality solutions and be more offended when asked to improve them.
This is the broken windows theory applied to codebases. If you have a bad neighborhood, you chuck your gum on the floor because it’s already littered and nobody cares. If you’re on a nice street with no garbage and no gum stuck to the pavement, you’re going to behave. Don’t allow your codebase to become a bad neighborhood, because once you’re surrounded by bad neighbors, why should you be any different?
Over time, it becomes a vicious, self-perpetuating cycle. Productivity drops and deadlines slip, for manifold reasons. The cognitive load of carefully treading through the existing jungle to make changes is too high. Quality takes a nosedive, and you start to see a clear separation between the new, clean stuff that was just shipped and code only six months old that already looks terrible.
You end up with no clear root cause for issues or outages. There’s nothing you can pinpoint and say: “This is what we should have done six months ago to avoid this situation.”
Morale among tech staff tanks. People are demotivated from working on things that are horrible. They start avoiding any unnecessary improvements and abandon the Boy Scout rule of leaving things better than you found them.
It’s a death-by-a-thousand-cuts situation. One big pile of sadness for people who are supposed to champion new and exciting work.
Let me give you a real-life example. A regular startup, looking for the right product to sell to the right people, with a few years of unchecked tech debt. Nobody there was particularly clueless or evil.
Within days of joining them, my alarms start going off. Very nice people all around, clearly talented engineers, hiring the best from the top schools. But the tech stack and tooling are so weird I can’t wrap my head around them.
There’s a massive monolithic 10-gigabyte Git repository hosted in the office. Fast for office folks on the local network, but the company also has remote workers. Some person on a shitty DSL halfway around the world is going to have a very bad day if they need to re-clone the repo.
There’s no concept of “stable.” Touching anything triggers the rollout of everything because there’s no way for the build server to know what’s affected by any change. The safest thing to do is just roll out the universe, which means one commit takes an hour and a half to deploy. Four commits are your deploy pipeline for the workday.
Rollbacks take just as long because there are no real rollbacks. You just commit a fix, and it joins the same queue. If something breaks at noon and you commit a fix immediately, it’s not going out until the end of the day unless you go into the build server and manually kill every other job in the queue. And then you don’t know what else you were supposed to deploy, because you’ve just killed the deploy jobs for the entire universe.
There’s also a handcrafted build server, a Jenkins box hosted in the office, with no record of how it’s provisioned or configured. If something happens to it, the way you build software is simply lost. Each job on it is subtly different, even for the same tech: one Android codebase builds three different apps, each in its own way.
No local dev environments exist, so everyone works directly on production systems. A great way to ensure people don’t experiment because they’ll get into trouble just for working on legitimate stuff.
People have to use the VPN for everything, even non-technical stuff like support and product. A VPN failure becomes a long coffee break for the entire company.
Code that has just been written is hitting the master database. There’s no database schema versioning. Changes are done directly on the master database with honor-system accounting, and half the changes don’t get recorded because people forget. There’s no way to tell what the database looked like a month ago, and consequently, no way to have a test or staging environment that matches production.
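The usual cure for this is versioned migrations: every schema change is a numbered script in the repo, and a table in the database records which scripts have run. A minimal sketch of the idea, using `sqlite3` from the standard library (the table and scripts are hypothetical):

```python
import sqlite3

# Minimal sketch of schema migrations: each change is a numbered script,
# and the database itself records which ones have been applied.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn):
    """Apply any migrations newer than the database's recorded version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()
```

Run the same function against dev, test, and production and they converge on the same schema, and the scripts go through code review like any other change. No honor system required.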
Half the servers aren’t deployable from scratch. This almost guarantees that servers which should be identical are different, and you can’t tell how, because you have no way to enforce that they match. Their deployability is unknown and untested, so you may as well assume it doesn’t work. The code review tool is bug-ridden, unsupported, self-hosted abandonware.
Everything people use to develop software imposes some limit on them. Outages become a daily occurrence. Individual outages aren’t even worth a postmortem because there’s no reasonable expectation of uptime.
Everyone is focused on shipping features. And you get that because you can’t just refactor eight years of bad decisions. You start approaching the point of rewrites, which are almost always a bad idea. And every time you skip refactoring to push out a feature and say “just this once,” it’s another step in the wrong direction.
I call this state tech bankruptcy. It’s the point where people don’t even know how to move forward. Every task is big because it takes hours to get into the context of how careful you need to be.
At the time, the infrastructure team was staffed with rebels. They were happy to work in the shadows, with the blessing of a small part of leadership, so I joined their team.
It took us over a year and a half to get to the point where it wasn’t completely terrible. We started by writing everything down. Every terrible thing became a ticket. It became a hidden project in our task-tracking system called Monsters Under the Bed, and whenever we’d have a few minutes, we’d open the Monsters, contemplate one of them, and find a novel way to kill it.
The team worked to unblock software developers and help them ship quality software. Most of the work was done in the shadows, with double accounting for time spent.
The build server was rebuilt from scratch, with Ansible in the cloud, so it could be scaled up or migrated. We now had a recipe for the build server, a codified description of how our software is built and deployed.
Build and deploy jobs were defined in code, with no editing via web UI. Since they were in code, there was inheritance. If there were differences between builds, you extended a job and defined the differences.
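In our case the job definitions lived in Jenkins configuration, but the inheritance idea can be sketched in a few lines of Python (the class and job names are hypothetical, not the real definitions):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of build jobs defined as code. A base job holds the
# shared steps; a variant overrides only what differs, instead of someone
# hand-editing a web UI for each of three subtly different Android builds.

@dataclass
class BuildJob:
    name: str
    steps: list = field(default_factory=lambda: ["checkout", "build", "test", "deploy"])
    env: dict = field(default_factory=dict)

@dataclass
class AndroidJob(BuildJob):
    flavor: str = "release"

    def __post_init__(self):
        # The only per-job difference is declared here, in code, under review.
        self.env["FLAVOR"] = self.flavor

jobs = [
    AndroidJob("android-main"),
    AndroidJob("android-beta", flavor="beta"),
]
```

The payoff is that a difference between two builds is now a visible, reviewable line of code instead of a checkbox someone once flipped in a web form.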
We split the monolithic repo into 40 smaller ones, and even that first iteration was chopped again into even smaller repos. There were three proposals for killing the monorepo with an all-or-nothing approach that would require either pausing all development for a week or cutting losses, losing all history, and starting fresh.
Instead, we built an incremental approach. Split out a tiny chunk, paused development for a single team for an hour, and moved them to a new repo with their history intact. Infrastructure went first, showing the path to other teams. We set up a system where changes triggered a build and deploy only on the affected project. Commit to live was measured in seconds, not hours.
Some teams initially opted out and stayed in the monorepo. They joined a few months later, after seeing what the other teams were doing.
All servers were rebuilt and redeployed with Ansible. This used to be some 80 machines with 20 different roles. We did all this under the guise of upgrading the fleet to Ubuntu 16. Nobody understood what that meant or asked how long it would take. Whenever someone asked about a server whose name had changed, we’d just say: “Oh, it’s the new Ubuntu 16 box.”
In the background, we wrote fresh Ansible to deploy a server that kind of did what we needed it to do and iterated on it until it could actually do what needed to be done. Then we killed the old hand-woven nonsense and replaced it with our Ansible solution.
We migrated to modern code review software and away from self-hosted Git hosting to GitHub.
VPN was no longer needed for day-to-day work. You only had to connect for the master database, and nobody had write access anyway.
We created local dev environments. No more reviewing stuff that didn’t even build because people were afraid to touch it. No more running untested code against production. There was now a code review process for SQL scripts, and a method of deploying them that kept dev, test, and production databases in sync.
There are two morals to this story.
One: don’t wait for permission to do your job. It’s always easier to beg for forgiveness. If you see something broken, fix it. If you don’t have time to fix it, write it down, but come back when you can steal a minute. Even if it takes months to make progress, it’s worth doing.
The team here was well aware of how broken things were, but they thought that was the best they could do. It wasn’t. If we had pushed the change as a single massive project, one that would take a year and a significant number of full-time engineers to have any measurable impact, it would never have happened. The company simply couldn’t afford that. Instead, we turned a small team into a red team and just did it.
Two: it should never have been like this. This is not a playbook on tackling tech debt. It was a horrible way to do it, even if we did manage in the end.
A team in the company was directly subverting the established processes because the processes were failing them. And the managers were giving us the thumbs up and protecting us while pretending to the rest of the world they didn’t know what we were doing. That’s not how things should work.
Situations like that happen because tech debt work is very difficult to sell. It’s an unmeasurable amount of pain that grows in unmeasurable ways. And if you put in the effort to tackle it, you get unmeasurable gains.
Even the name, tech debt, implies we have control over it, that we can choose to pay it down when it suits us, like a credit card. But it’s not like a credit card where you can make a payment plan. With tech debt, there’s no number. It’s a pain index with no upper bound, and it can double by the next time you check your balance.
It’s incredibly difficult to schedule work to address tech debt because nobody explicitly asks for it. Everyone wants something visible and measurable, and tackling tech debt directly takes time away from shipping features.
But to quote a cleaning equipment manufacturer: “If you don’t schedule time for maintenance, your equipment will schedule it for you.” Things that are not regularly maintained will break at the worst possible time.
I came across an article that changed how I think about this: Sprints, Marathons, and Root Canals by Gojko Adzic. His argument is that software development is neither a sprint nor a marathon, which is the standard comparison.
Both sprints and marathons have a clearly defined end state: you run them, and then you’re done. Software development is never done. You only stop working on it because of a higher-priority project or because the company shuts down.
His point is that you don’t put basic hygiene on your calendar, like showering or brushing your teeth. It just happens. You can skip it once, but you can’t skip it a dozen times, at least not without consequence. Having to schedule it means something went horribly wrong, like going in for a root canal instead of just brushing your teeth every morning.
Translated to software development, this is sustainability work. It’s not paying down tech debt. It’s making software development sustainable, so you can keep delivering a healthy amount of software regularly.
Instead of pushing for tech debt sprints or tech debt days, sustainability work needs to become a first-tier work item. Like brushing your teeth, you can skip here and there or delay it for a bit, but the more you skip, the more painful the lesson will be.
Agree on a regular budget for sustainability work for every team or every engineer and tweak it over time. It’s a balance, and there’s no magic number. You don’t have to discuss it with people outside the engineering team. Engineers will know which things they keep stumbling over and what their pain points are. Over time, you can discuss the effects and whether they need more time, or if they can give some back.
This approach doesn’t only improve your code. It improves morale. There’s nothing worse than being told you’re not allowed to address something that’s making your life miserable.
The teams I’ve seen make progress aren’t the ones who ran a tech debt sprint. They’re the ones who stopped asking permission to brush their teeth. Call it whatever you want. Just stop scheduling it as a special occasion.
Sprints, Marathons, and Root Canals by Gojko Adzic — the article that reframes software development as hygiene, not a race
Broken windows theory — the criminological theory applied here to codebase quality and behavioral contagion
The Boy Scout Rule — from 97 Things Every Programmer Should Know; leave things better than you found them
This essay was originally published on chaos.guru in 2024. It has been slightly updated for current writing style.
