With last week’s AWS outage, the Great Cloud Debate has been reignited in full force on sites like Hacker News, but with little nuance. As someone who has managed multiple datacenters and public cloud footprints, both technically and financially, as well as the tech teams that run them, I would like to share a few points that recast this near-religious argument of good vs evil into a more useful decision-making framework.
In the tech world, we’re accustomed to the build vs buy decision-making process. The traditional wisdom is to focus on building the things you have a core competency in, and buying the things you don’t. For example, build your app, buy your ticketing system. Build your app, use third-party libraries. Build your app, put it behind a commercial CDN. It makes logical, economic sense to specialize in what you’re best at.
Like most decisions cast as binary debates, it’s important to remember that you can do both — across the layers of your tech stacks, you can and should both build and buy. There is no reason not to perform this same type of analysis with your infrastructure. For example, a large database cluster might fit better on-prem, whereas you should almost certainly not build your own email system, but rather pay a provider such as Office 365 or Gmail.
You can outsource your infrastructure to a cloud provider, but you still have to build on top of it. This is certainly faster and easier in the cloud, especially for a startup, but the work nevertheless must be done by someone, somewhere. In the context of this article, I use “build” to refer to investing in on-prem infrastructure, and “buy” to refer to paying a public cloud.
For obvious reasons, no modern startup should consider anything except a public cloud. On the opposite end of the spectrum, if you’re an established company and have exabytes of data and a high volume of traffic, you will probably need a few SANs. It’s the middle area, where most companies reside, that is full of confusion and misinformation. For the purposes of this article, let’s assume you already have colo space and are trying to figure out if you can get rid of it.
Let’s get one thing out of the way: nobody saves money by doing a direct lift-and-shift. Keeping hundreds or thousands of long-running, stateful EC2 instances constantly powered on will always cost more than a 2-node VMware vSAN cluster. To utilize a public cloud smartly, your app needs to be cloud native, which means that if it’s running on servers today, it needs to be re-architected (more on this later). If you can’t do that, you’ve already lost the cloud game. You absolutely will pay more, for both compute and bandwidth, than you currently do for your colo cage and all its associated costs, and you will still need server administrators on staff.
But before we dive into a technology cost comparison, let’s review those staffing costs and strategies, by using NoOps as a starting point. If you’re fuzzy on the history of this term, it was coined by Adrian Cockcroft at Netflix (which, incidentally, also has both significant cloud and on-prem presences, meaning they have run these analyses and landed on different answers for different parts of their infrastructure). In a blog post, Cockcroft argued that developers should be empowered to do everything end-to-end, thus eliminating the need for bureaucratic operations departments. He was rebutted by John Allspaw of Etsy, who noted that the ops work still exists at Netflix, it’s just spread around amongst the developers, a point Cockcroft conceded.

You can read about it here, but the simple logic is this: as soon as your developers are doing 40 hours of ops-style work per week in aggregate, you should hire one DevOps engineer to take that workload off of them. Most developers find ops work mundane and uninteresting, whereas your DevOps engineer will have sharp tools in her bag and likely enjoy the work she’s chosen to specialize in. She’ll implement standardized, automated processes that can benefit multiple development teams, such as consistent pipeline tooling and common deployment strategies. As the company grows, the next best non-development technical hire might be an SRE: another individual with specialized skills who will take the reliability engineering workload off the developers.

It’s been almost ten years since Cockcroft’s NoOps blog post, and everyone nowadays seems to understand that even in pure cloud environments, somebody still has to do the ops work, the SRE work, and the security work. (Do you trust your developers with security?) Meaning, of course, that even in the cloud, you need at least some non-developer, ops-ish people.
This is usually where public cloud evangelists chime in: “But to be in a datacenter, you have to do a lot more ops work and need a lot more people! You need a server guy! And a network guy! And maybe a storage guy! And actually teams of all three! And it won’t even be as good as Azure!” And this is, of course, all true. The cost required to build a private cloud is substantial, and it may not be “as good” as a public cloud, at least by some definitions. By using a public cloud, you externalize those staffing costs to your provider’s pooled resources. And when you add all those salaries to the cost of hardware, power, bandwidth, and rack space, you have a significant outlay for something that you know will be less scalable and less efficiently utilized than a public cloud. Those datacenter taxes sound pretty high, don’t they?
However, that doesn’t mean that being in a datacenter must always cost more than being in the cloud. At economies of scale, your own bare metal, spec for spec, will be much less costly than the equivalent cloud compute resources. One common complaint is that we pay for CPUs in our on-prem servers that we know will never be fully utilized. But unless your app is completely serverless, you’re also doing that in the cloud. Outside the T series, AWS doesn’t oversubscribe CPU (which VMware will happily let you do whenever and wherever you want), which means they are passing on the cost of those idle cycles to you, with a markup. In fact, two EC2 vCPUs are actually just the two hyper-threads of a single physical core, so you’re getting even less performance. If you don’t like the idea of paying for idle CPUs in the datacenter, you really shouldn’t like the idea of paying even more for less capable idle CPUs in the cloud. “Only pay for what you use” applies to auto-scaling groups, network traffic, and native services, but not lift-and-shift EC2 instances. Everything should autoscale, and almost every stateful, long-running EC2 instance should be on the chopping block. Otherwise, you’re just using the cloud as a really expensive virtualization service.
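To make “everything should autoscale” concrete, here is a minimal sketch, using boto3, of attaching a target-tracking policy to an Auto Scaling group so capacity follows load instead of idling; the group name and the 50% CPU target are hypothetical placeholders, not recommendations.

```python
import boto3

# Minimal sketch: a target-tracking policy grows and shrinks the fleet with
# demand. "web-asg" and the 50% CPU target are hypothetical placeholders.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="keep-average-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```

A stateful, long-running instance can’t be managed this way at all, which is exactly why it belongs on the chopping block.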
And so, you will spend time poring over idle resource usage reports, spend time trying to figure out what to do with the findings and how to prevent them from popping up in the future, and spend time performing the monthly “why is my AWS bill spiking this month?” dance. These are all things you just don’t have to worry about, or pay people to do, on-prem, so let’s call them cloud taxes.
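For a rough picture of what that report-wrangling involves, here is a hedged sketch using boto3 and CloudWatch that flags running instances whose CPU has barely moved; the 14-day window and 5% threshold are arbitrary illustrative choices, and a real review would also look at network, disk, and memory.

```python
import datetime
import boto3

# Sketch of an "idle instance" report: flag running instances whose daily
# average CPU never exceeded 5% over the last 14 days (arbitrary thresholds).
ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=14)

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        instance_id = instance["InstanceId"]
        datapoints = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=start,
            EndTime=end,
            Period=86400,  # one averaged datapoint per day
            Statistics=["Average"],
        )["Datapoints"]
        if datapoints and max(d["Average"] for d in datapoints) < 5.0:
            print(f"{instance_id} looks idle: daily average CPU stayed under 5%")
```

Someone has to write, run, and act on scripts like this every month; on-prem, the idle capacity is already paid for, and this recurring bill-control chore simply doesn’t exist.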
Optimizing for a cloud-native, auto-scaling architecture is its own science, and also requires time and expertise to do properly. Building a horizontally scalable distributed system is, by necessity, more complex than simply scaling up, which is less of an option in the cloud. Scaling up may be almost a dirty word these days, but the advantages of being able to do it when necessary almost can’t be overstated:
For Basecamp, we take a somewhat different path. We are on record about our feelings about sharding. We prefer to use hardware to scale our databases as long as we can, in order to defer the complexity that is involved in partitioning them as long as possible — with any luck, indefinitely.
No sharding? No eventual consistency? More capacity with no additional architectural redesign work? These could be huge advantages if you need to scale in a hurry. Now that the NoSQL fad is over, there are clearly still many use cases for traditional relational databases with ACID guarantees. With RDS still maxing out at a woeful 16 TB for Oracle and SQL Server, you may eventually have no choice but to redesign your data model and possibly your entire app to use Aurora or Redshift. Maybe you’ll even use an abstraction layer and enjoy the side effects of added complexity and slower performance.
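To put a shape on the complexity that scaling up lets you defer, here is a minimal, purely illustrative sketch of the application-level routing that sharding forces on you; the shard map and modulo scheme are hypothetical.

```python
# Purely illustrative: once a database is sharded, every query path needs
# routing logic like this, and anything not keyed by the shard key needs
# fan-out across all shards plus result merging in application code.
SHARDS = {
    0: "postgresql://db-shard-0.internal/app",
    1: "postgresql://db-shard-1.internal/app",
    2: "postgresql://db-shard-2.internal/app",
}

def shard_for(user_id: int) -> str:
    """Pick the shard holding this user's rows (hypothetical modulo scheme)."""
    return SHARDS[user_id % len(SHARDS)]

def shards_for_unkeyed_query() -> list:
    """A lookup not keyed by user_id has to visit every shard and merge results."""
    return list(SHARDS.values())

print(shard_for(42))  # -> postgresql://db-shard-0.internal/app
```

Re-keying data, rebalancing shards, and handling cross-shard queries all land in application code; a bigger box defers every bit of it to the hardware.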
With more moving parts, and a tech stack made up of diverse services (many of which are black boxes), it’s less likely that any one individual understands everything from end to end, making outages more difficult to troubleshoot. I have no love lost for monolithic Java apps, but all else being equal, designing and maintaining simpler systems will always cost less than complex ones. This is another cloud tax that is often overlooked.
Another common problem in the development world is premature optimization, which Donald Knuth famously deemed “the root of all evil.” Do you trust your developers not to fall down the rabbit hole of shiny-object resume padding? Do you really need microservices for your CRUD app when a standard 3-tier architecture will do? With no hard limits in the cloud, the temptation to over-optimize your tech stack into 87 microservices running on 87 different AWS services is all too real.
Vendor lock-in, or at least extreme friction, is also a very real problem. Kinesis is fantastic, but its API is proprietary; unless you build against something portable like Kafka instead (self-managed or via Amazon MSK), you’ve now married yourself to an AWS-only architecture. If you want to run your app on another cloud, you have no option but to re-write at least some portion of your code. And the idea of write-once, multi-cloud Terraform code is still a pipe dream, so you will have to refactor at best, and at worst, run separate IaC codebases for every cloud you want to use.
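As a small illustration of that friction, here is a hedged sketch of publishing the same event via the Kinesis API (boto3) versus the portable Kafka protocol (kafka-python); the stream and topic names are made up, and the point is only that the first code path works nowhere outside AWS.

```python
import json
import boto3
from kafka import KafkaProducer  # kafka-python: the portable alternative

event = json.dumps({"user_id": 42, "action": "click"}).encode()

# AWS-only path: the Kinesis API has no drop-in equivalent on another cloud,
# so this code (and everything downstream of it) is married to AWS.
kinesis = boto3.client("kinesis")
kinesis.put_record(StreamName="clickstream", Data=event, PartitionKey="42")

# Portable path: the Kafka protocol runs on-prem, on Amazon MSK, or anywhere.
producer = KafkaProducer(bootstrap_servers="broker.internal:9092")
producer.send("clickstream", value=event)
producer.flush()
```

Neither choice is wrong, but only one of them leaves the exit door open.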
Another joy of the cloud landscape is that it changes so fast (is there a single person who even knows the names of every AWS service?), you basically need up-to-the-minute guidance from your Solutions Architect — as if the frenetic pace of change in the tech industry in general wasn’t already enough. It is really easy to do things wrong.
All the gotchas I described above aren’t just cloud taxes, they’re hidden cloud taxes. You’re not measuring how much time your staff spends wrangling all the complexities of a cloud or two or three. This work is likely being spread amongst exactly the people you don’t want doing it: developers. And back to the NoOps example, you’re now diluting your talent. Depending on your perspective and requirements, having an infrastructure team that knows and manages your own independent, isolated stack from top to bottom, who can troubleshoot issues at every layer of the OSI model, could carry less risk than running on a third-party black box. Perhaps you could also derive value by allowing your development teams to code features and bug fixes instead of learning infrastructure technologies, designing complex systems, or hunting down idle EC2 instances.
As is now evident, even in a pure cloud environment, you still need people to do the ops work. You may need fewer people, but you still need significant domain expertise, especially when you start increasing complexity. The savings you get from a few fewer datacenter-oriented salaries are at least partially offset by the additional workloads taken on by the developers, who will do them worse. It’s not a zero-sum game, but it’s in the ballpark, and if you’re not careful and disciplined, an unpredictably variable monthly AWS bill driven by tinkering can catch you by surprise.
The best-case scenario for a cloud app is that you have a well-architected, auto-scalable application that relies as much on cloud-native services as possible, you don’t have too much live/hot data, traffic, or long-running compute, you don’t care too much about vendor lock-in, and you have discipline amongst your teams. This is Doing it Right, and this is when you outsource at least some portion of your infrastructure to a public cloud. But the more of those boxes you uncheck, the less value you’re deriving from that spend. At some point you cross a threshold where paying the datacenter taxes becomes cheaper than paying the cloud taxes. This is when you build.
A few non-technical considerations: Is AWS, or at least the Northern Virginia us-east-1 region, now Too Big to Fail, in the vein of Bank of America and JP Morgan? Is your app hosted next to GovCloud, a possible terrorist target? How do you feel about paying a company that supports a sister company whose warehouse employees have died in tornado strikes and from COVID-19, aren’t allowed to carry cell phones, and can’t take bathroom breaks? How about the idea of contributing to a budding monopoly? For all our love of distributed systems, there are a lot of eggs in a single basket.
You may not care about some of these issues, but depending on your corporate culture, some of your employees might. Some of your clients might as well. Amazon Retail has a lot of competitors, and not all of them are keen to see the money they pay their vendors fund the continued growth of a juggernaut.
Public clouds are a fantastic resource and can certainly save you money when used properly. But if you can’t fully commit to a cloud-native architecture, and if you are undisciplined and allow your developers to be swamped with work that prevents them from specializing in their core competencies, you’re gonna have a bad time. Buy when it makes sense. Build when it doesn’t.