It’s been a long time since I first started using Terraform to spin up infrastructure in AWS. Back then AWS was just launching EKS, and I was trying to wrap my head around it while also working through the Kubernetes the Hard Way setup with Flannel as the CNI, before I had figured out much about how containers work and before I knew pretty much anything about networking. Since then, Kubernetes, and DevOps in general, have improved massively, with tools that let you test, build and ship at crazy speeds and scale. We are already seeing the advent of DevSecOps, with artefact signing, provenance and SBOMs being integrated into pipelines everywhere.
* All the points I’m making also apply to OpenTofu; I’m not aware of any fix for these issues in that system either.
But then I think back on Terraform and the process of using it, and I find myself wondering: why are we still doing things the same way as back then? My impression is that we are battling the same problems and frictions as 8+ years ago, even though Terraform has matured quite a bit as a system and is a lot more flexible, stable and capable than ever before.
What were operations like back then, something like 3,000 years ago, before AI and pandemics and stuff?
(this is how I felt writing this section of the article, except the hair, which is not my strong point, but I couldn’t get the LLM to generate a better pic)
Well, let me think back for a second. I remember making a change like creating a NAT gateway and connecting it to a subnet that was supposed to be publicly available. After a lot of Google searching, I figured out that I needed to set up routes and NACLs, got all the HCL written down, ran a couple of plans until everything looked fine and raised a PR. Since I was the only person who (kinda) knew Terraform, my PRs were auto-approved by the rest of the team, so I just merged everything to master (yeah yeah, it was a long time ago). It was a small team and there was no need for us to be compliant with any strict policies, so we never got around to automating things like producing a plan in the CI pipeline when a PR was raised. I had unfettered access to the production environment to apply my changes. A wrong command from me, or from any engineer in my position with sausage fingers due to low caffeine levels in their bloodstream, could mess up the Terraform state. And messing up your Terraform state is a great way to drastically increase your heart rate and get some aerobic exercise without ever leaving your chair! (not medical advice, I’m a gym bro, don’t trust me)
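For readers who haven’t done this dance: the change itself is simple enough to sketch in HCL. Here is a hypothetical, minimal version of what a PR like that might have contained; the names are illustrative, and the VPC and subnet resources are assumed to be defined elsewhere in the module.

```hcl
# Elastic IP for the NAT gateway (illustrative; assumes a recent AWS provider)
resource "aws_eip" "nat" {
  domain = "vpc"
}

# The NAT gateway itself lives in a public subnet
resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# Route table for the private subnet
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
}

# Send all outbound traffic from the private subnet through the NAT gateway
resource "aws_route" "private_nat" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this.id
}

# Attach the route table to the private subnet
resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```

Twenty-odd lines, but it took a lot of searching to learn which twenty lines, and a few plan/fix cycles to get them right.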
In another position, a couple of years later, we introduced CI automation while I was at the company in order to achieve PCI compliance. From a cybersecurity standpoint it’s a no-brainer that not everyone should have access to the entire state of a production SaaS cluster. Not because you don’t trust your own engineers; the vast majority of platform/infrastructure/DevOps engineers don’t want to mess with clients’ data and are not malicious. But if they have direct access and their laptops get compromised, a malicious actor will have the keys to the kingdom.
Locking access to the Terraform state and the cloud provider APIs down to the runners of the CI system is the productivity equivalent of shooting yourself in the foot while running a marathon. Now every small change you make has to be pushed to your Git host, and you have to wait for the CI runner to spit out a plan in the logs of the test run.
The already long feedback loop of writing a change, running a plan, waiting for the state to sync and update, then checking the plan output and starting over until everything looks good, suddenly becomes a lot longer, because CI systems are always under some pressure. There is no CI runner sitting idle waiting just for you to push your changes: you will be added to some queue, eventually the slow state sync/update will start, the plan output artefact will be uploaded to some storage, and only then will you be able to see it.
At this point it would be good to ask yourself: what happens if something breaks in your CI system? Are you supposed to just sit on your hands and wait for it to resolve before resuming work? What if the CI infra is also defined in Terraform? How do you investigate and apply the fixes? Or what happens if something breaks in the actual cluster and some human or AI agent needs to go and do some investigation?
You have spent all this effort ensuring only your CI has permissions, and now you need to find a hack that gives your human operators access to the state and the cloud provider APIs again. But the hack should only be allowed in exceptional circumstances, and engineers should lose that access when it’s not needed.
Once you have figured out Terraform access for your infra engineers and SREs, you probably need to figure out how to give some limited access to your SWEs, who might need something small like a database or a message queue and don’t want to wait for you to finish fighting fires before they can test their new feature in the development cluster. Many teams have decided that Terraform is too much of a hassle here, so instead they use Crossplane for these limited changes. Now you have two tools, with completely different configuration languages and permission systems, modifying your cloud.
If I had a time machine, went back to 2019 and asked a platform engineering team what their issues with Terraform were, I’m pretty confident they would mention a lot of these same RBAC-related issues. They would also mention how difficult it is to refactor Terraform state, or to import resources into the state after a session of debugging and manually fixing an incident. Or how slow state updates can be when the code base contains a lot of resources. Or how difficult it is for non-experts to digest the output of a plan that has any drift.
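To be fair, Terraform has since grown declarative syntax for some of this state surgery: `moved` blocks (1.1+) replace hand-run `terraform state mv`, and `import` blocks (1.5+) replace `terraform import`. A hypothetical sketch, with illustrative addresses and IDs:

```hcl
# Refactoring: declare a rename so Terraform updates the state address
# at plan time instead of destroying and recreating the resource.
moved {
  from = aws_nat_gateway.this
  to   = aws_nat_gateway.public
}

# Importing: adopt a resource that was created by hand during an
# incident, so it shows up in the next plan instead of as drift.
import {
  to = aws_security_group.hotfix
  id = "sg-0123456789abcdef0"
}
```

These take some of the sting out, but they arrived years later, and the underlying friction of reviewing and applying state changes through the same slow pipeline remains.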
I just have this gut feeling that there is something we are missing in this picture. It doesn’t necessarily need to be a completely new system, because I do like the fact that we can store the ideal state of our cloud in a Git repository and review changes before they hit production.
