Here at loveholidays we are heavily dependent on Terraform. All of our Google Cloud infrastructure is managed using Terraform, along with a number of non-GCP services.
Terraform has become the de facto tool for infrastructure provisioning, and for good reason. It’s superbly simple to set up and get started with, easy to provision infrastructure with, and incredibly well supported by the folks at HashiCorp.
The difficulty with Terraform: everyone does it slightly differently. This isn’t limited to comparing usage across companies either; it’s not uncommon to have teams deploying and managing infrastructure completely differently within the same company. One team may have a full CI/CD pipeline with approval stages and remote-managed state, while the next may be manually running Terraform from one engineer’s laptop, because “the infrastructure is devops’ job”.
Our goal here at loveholidays is to empower developers to own their infrastructure, from creation through to deprecation, ensuring compliance with all of our company standards, requirements and best practices. All with as little input as possible from the infrastructure team.
Harnessing only open-source tools, we have built an end to end pipeline that centrally manages Terraform, and enforces our best practices using Policy as Code.
Structuring Terraform at loveholidays
We use Terraform to manage the following systems:
- Google Cloud
- Google Workspace (formerly G Suite)
- Fastly
- PagerDuty
- AWS Route53
- StatusCake
and more. We separate these into dedicated repositories, with multiple repositories for our GCP infrastructure. We also maintain a number of Terraform modules we have created internally, which enforce our best practices (required labels, regions, etc.). Each internal Terraform module has a dedicated GitHub repository, and we release them with git tags.
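Consuming one of these internal modules then looks something like the following sketch, where the module is pinned to a release tag. The repository, module name, and inputs are illustrative, not our real modules:

```hcl
# Hypothetical usage of an internal module, pinned to a git tag.
# Org name, module name, and variables are made up for illustration.
module "assets_bucket" {
  source = "git::https://github.com/example-org/terraform-gcs-bucket.git?ref=v1.4.0"

  name   = "my-app-assets"
  team   = "web-platform"  # required label, enforced inside the module
  region = "europe-west1"  # restricted to approved regions by the module
}
```

Pinning to a tag means a module release never changes underneath a consumer; upgrades are an explicit, reviewable diff to the `ref`.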
At the time of writing, we have 33 repositories for various Terraform related purposes. It’s crucial that we have a consistent (and fast) way of managing deployments for all of these repositories.
Deploying with Atlantis
After a brief experiment with GitHub Actions and Google Cloud Build, we settled on Atlantis for deploying Terraform directly from GitHub.
The runatlantis.io home page has an excellent diagram explaining the workflow for deploying with Atlantis.
We have deployed multiple instances of Atlantis on a GKE cluster, with each having a different set of credentials depending on the target system (GCP, Route53, Fastly, etc).
Each of our repositories containing Terraform is configured to send webhooks to a public URL, i.e. an Atlantis instance. This triggers a Terraform plan inside the Atlantis deployment in GKE, and auto-comments on the PR with the result, as seen above. Once the PR is approved, an atlantis apply comment on the PR will trigger Atlantis to apply the Terraform changes, close the PR and delete the branch.
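A repository opts in to this workflow with a small config file at its root. The following is a hedged sketch of a repo-level atlantis.yaml; the project name and directory layout are illustrative:

```yaml
# Hypothetical atlantis.yaml for one of our Terraform repositories.
# Directory names are made up; autoplan triggers a plan whenever
# Terraform files change in a PR.
version: 3
projects:
  - name: production
    dir: environments/production
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
```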
During a Pull Request’s (PR) lifecycle, Atlantis maintains a “lock” on the PR, so users cannot apply or merge conflicting PRs whilst there is an open PR holding the lock on that Terraform environment. Read more in Atlantis’s locking documentation.
Although we have 33 Terraform repositories, we do not need 33 Atlantis instances. We group/match repositories to Atlantis instances based on the credentials required to run Terraform for that repository.
Atlantis is well documented and has a thriving support community over on Slack. It has drastically simplified the way we approach Terraform pipelines and has been an excellent choice of tool for us.
Authenticating Atlantis
Atlantis is, by design, secure. It runs within your own infrastructure, and with credentials you provide it.
When running in GKE, the Atlantis instance will, by default, run as the service account associated with the Kubernetes node. Depending on how you have configured your cluster / node pool, this could be:
- The ‘default compute Service Account’ associated with the host project
- A custom Google Service Account (GSA)
- No Service Account (no credentials)
Another common option would be to mount a Google Service Account JSON Key into the container, and authenticate using that JSON Key.
We are anti-JSON keys here (these are long-lived credentials, not automatically rotated, and can easily be leaked from your organisation), and do not use the default compute Service Account for anything in our production environments.
Relying on the GSA associated with a node is not great practice either. We do not want to have dedicated nodes inside our cluster for different services, nor do we want non-Atlantis workloads to be able to authenticate as the highly-privileged GSA that Atlantis will run as, simply by being scheduled on the same node as Atlantis (remember: Atlantis can create and delete infrastructure as part of the Terraform lifecycle, so protecting its credentials is imperative).
So how do we do it here?
We have opted to use Google’s Workload Identity to associate Kubernetes Service Accounts with Google Service Accounts. To put this simply, any request to GCP from a GKE pod will be authenticated as the Google Service Account that is “bound” to the pod’s Kubernetes Service Account.
Each instance of Atlantis has a dedicated K8s Service Account, which is “bound” to a dedicated Google Service Account. This gives us completely sandboxed sets of credentials for each instance of Atlantis.
With this we have the ideal world of no JSON keys, and any pods running on the same node as Atlantis will not inherit a privileged Service Account.
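In practice, the Kubernetes side of the binding is a single annotation on the Service Account. A minimal sketch, assuming made-up project and account names:

```yaml
# Sketch of the Kubernetes Service Account for one Atlantis instance.
# The annotation binds it to a Google Service Account via Workload Identity.
# Namespace, names, and project are illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: atlantis-gcp
  namespace: atlantis
  annotations:
    iam.gke.io/gcp-service-account: atlantis-gcp@example-project.iam.gserviceaccount.com
```

The GCP side needs a matching IAM binding granting `roles/iam.workloadIdentityUser` on the Google Service Account to this Kubernetes Service Account, so only pods running as it can impersonate the GSA.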
Workload Identity is a complicated topic and beyond the scope of this article. Keep an eye out for a deep-dive article into Workload Identity and how we created a custom controller to provision Google Service Accounts, coming soon!
Restricting public access to Atlantis
Atlantis receives webhooks from your VCS / GitHub repository on the /events path. An easy and common approach is to publish a hostname A record pointing to the load balancer serving Atlantis, with a wildcard path.
The problem with this, as we discovered recently, is the / homepage.
Notice the “Disable apply commands” button? This is a public page without authentication, so anyone who stumbles upon your Atlantis instance can disable apply commands, i.e. stop production infrastructure changes. This isn’t obvious at all, and it would be a real head-scratcher to work out why Atlantis suddenly stopped working!
This global ‘disable apply commands’ feature can be turned off in Atlantis’s settings, but it may be a useful feature to have internally. We opted to only publicly expose the path necessary for the webhooks,
and admins can still access the global disable functionality with a kubectl port-forward to the Atlantis pod/service. Simple.
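The path restriction can be expressed in the Ingress itself. A minimal sketch, with illustrative host, service name, and port:

```yaml
# Sketch of an Ingress that only exposes the /events webhook path publicly.
# Everything else (including the / homepage) stays cluster-internal.
# Host and backend details are made up.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atlantis
  namespace: atlantis
spec:
  rules:
    - host: atlantis.example.com
      http:
        paths:
          - path: /events
            pathType: Prefix
            backend:
              service:
                name: atlantis
                port:
                  number: 80
```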
This setup with Atlantis has worked very well for us so far, and has been in production for the past year.
Policy as Code
As developers are now empowered to self-serve their infrastructure, we risk losing sight of what is actually being built. Are our custom modules being used? Are resources created in the correct zone/region? Are we exposing GCS buckets publicly?
Without assigning ourselves as code owners on all Terraform repositories and gatekeeping all Pull Requests, we have little visibility of what infrastructure is being provisioned every day. Having an infrastructure team that oversees every decision does not scale, and just serves as a bottleneck where users are more likely to ask for us to do something, rather than feeling empowered to do it themselves.
We decided that a better approach would be to build compliance directly into our Terraform pipelines. We carried out a series of proofs of concept on a number of open-source Policy as Code tools, and eventually settled on a pattern that we have rolled out to production and that is helping us keep our infrastructure secure (and compliant) today.
What is policy as code?
HashiCorp defines Policy as Code very well in their documentation:
Policy as code is the idea of writing code in a high-level language to manage and automate policies.
In the same way we define our infrastructure as code using Terraform, we want to write policies in code that enforce our best practices on Terraform. An example of this would be to ensure every resource has a valid team label, and the corresponding value of the team label is a valid team within our organisation.
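The team-label idea can be sketched as a Rego policy over a Terraform plan JSON, in the style Conftest expects. The team names, and the assumption that every resource here supports labels, are illustrative; a real policy needs to account for resource types without labels:

```rego
# Hypothetical Conftest-style policy over a Terraform plan JSON.
# Valid team names are made up for illustration.
package main

valid_teams := {"web-platform", "data", "infrastructure"}

# Flag resources whose team label is not a known team.
deny[msg] {
  rc := input.resource_changes[_]
  team := rc.change.after.labels.team
  not valid_teams[team]
  msg := sprintf("%s has invalid team label %q", [rc.address, team])
}

# Flag resources with no team label at all.
deny[msg] {
  rc := input.resource_changes[_]
  not rc.change.after.labels.team
  msg := sprintf("%s is missing a team label", [rc.address])
}
```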
What did we look for in a tool?
We evaluated the tools based on the following criteria:
- Well supported, with a strong community or a proactive development team
- Allows writing custom policies (some tools only have built-in policies for CIS benchmarks)
- Easy to integrate into our existing pipelines
- Easy to pick up and learn
- Easy to write new policies
- Ability to test our policies with unit tests
What tools did we trial?
There is a fantastic article written by Christophe Tafani-Dereeper that compares each of these tools in depth, and does an excellent job of explaining the different types of code analysis. We used Christophe’s work as the base for our testing, and would recommend reading it if you are interested in starting this journey yourself.
What did we learn?
First and foremost, there is no one-size-fits-all solution, and definitely no single correct answer.
It was apparent very quickly that the built-in checks that are packaged with these tools cater more towards AWS than GCP. We also realised quickly that the capability to build our own custom checks was far more important than having a comprehensive suite of tests built in.
We found that static code analysis (that is, analysing Terraform code without producing a plan) was not particularly reliable, often missing resources inside nested modules, or not correctly populating variables from tfvars files. Analysis on a plan produced by Terraform produced far more reliable results.
It’s also worth noting that some of these open-source tools are a complement to a paid SaaS product, so some key features would be locked away unless we opted to subscribe. We love open source here at loveholidays, and give preference to open-source-first tools.
Which tool won?
We eventually settled on using Fugue’s Regula as the framework for writing our custom checks. Regula is a library written on top of the Open Policy Agent engine, utilising the Rego language to write policies.
The output of a terraform plan command can be exported as a JSON file on your local filesystem, containing the resources in your configuration and the planned changes Terraform wants to apply. This JSON file is a very complicated, nested structure with many layers of resources. When using the Open Policy Agent engine directly, or an abstraction such as Conftest, it is left to the user to navigate and loop (“walk”) through this complex file, which is no small feat and prone to mistakes.
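Producing the file is a two-step dance: `terraform plan -out=tfplan` followed by `terraform show -json tfplan > plan.json`. A heavily abbreviated sketch of the shape of that file (real plans are far larger, and the resource shown here is made up):

```json
{
  "format_version": "1.2",
  "planned_values": {"root_module": {"resources": []}},
  "resource_changes": [
    {
      "address": "google_storage_bucket.assets",
      "type": "google_storage_bucket",
      "change": {
        "actions": ["create"],
        "after": {"name": "my-app-assets", "labels": {"team": "web-platform"}}
      }
    }
  ],
  "configuration": {"root_module": {}}
}
```

Nested modules, references between resources, and unknown-until-apply values all add further layers on top of this, which is exactly the navigation work Regula’s helpers take off your hands.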
Regula provides a series of helper functions written in Rego, which do the heavy lifting of navigating the complex Terraform plan file. This means the focus can be on using Regula’s functions to write simple policies, rather than fighting the plan file in each policy.
We found the Fugue team to be extremely responsive and helpful over on Github. A special thanks goes out to them for the excellent product they’ve built.
The product has seriously evolved since we started using it, and now includes a full CLI tool written in Go. Currently, we are only using Regula’s core Rego library, however the CLI and other recent features look promising.
Integrating into our pipeline
Once we had a number of policies written using the Regula framework, we had to work out how to actually incorporate these into our deploy pipelines.
We use Conftest to validate Kubernetes manifests elsewhere in our stack, so we had some in-house knowledge of this already. Conftest also uses the Open Policy Agent engine, so it can use the libraries provided by Regula and execute any policies written in Rego.
Crucially, Atlantis also supports Conftest policy checks as part of its “plan” workflow.
So now, we have Atlantis running a Terraform plan, passing that plan file to Conftest, Conftest then pulls the Regula framework and our custom policies written in Rego, and comments the output on the PR, all in one single action.
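Wiring this up happens in Atlantis’s server-side repo config. The following is a hedged sketch only; the repo matcher, usernames, and policy path are made up, the exact keys may differ between Atlantis versions, and the server must also be started with policy checks enabled (the `--enable-policy-checks` flag):

```yaml
# Sketch of an Atlantis server-side repos.yaml enabling Conftest
# policy checking. All names and paths are illustrative.
repos:
  - id: /.*/
    policy_check: true
policies:
  owners:
    users:
      - infra-admin
  policy_sets:
    - name: loveholidays-policies
      path: /home/atlantis/policies
      source: local
```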
Each policy is documented in a markdown file in our policy-as-code repository, and linked to as part of a violation warning as seen above. This means that, when a developer finds a violation in their code, they can go to our documentation and see exactly why their resource violates any particular rule.
We are still fairly early in our journey with Policy as Code, and our policies change frequently. Due to this, we download the policies from our policy-as-code Github repository on each plan via Atlantis, which means every Terraform Plan is evaluated against the latest set of policies we have released. This allows us to create/modify policies, and release them to production, extremely quickly.
Testing our policies
Our policies are kept in a dedicated Github repository. Each policy file has a corresponding <file>_test.rego and <file>_infra.tf file, where the .tf file defines a set of Terraform resources that pass and fail a policy, and the _test file sets the pass/fail result for each Terraform resource against the policy.
We then utilise a combination of Regula and Conftest to execute unit tests against each policy.
These unit tests run via Github Actions as part of the Pull Request process for adding or changing a Policy, so we can be confident that every policy passes or raises violations as we would expect.
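A unit test in this style uses OPA’s built-in test runner convention of rules prefixed with `test_`, evaluated against a hand-written mock plan. The sketch below assumes a hypothetical `deny` rule in package `main` that flags resources missing a team label; the resource and fields are made up:

```rego
# Sketch of a Rego unit test for a hypothetical team-label policy.
# The mock plan mimics the relevant slice of a Terraform plan JSON.
package main

test_missing_team_label_denied {
  plan := {"resource_changes": [{
    "address": "google_storage_bucket.assets",
    "change": {"actions": ["create"], "after": {"labels": {}}}
  }]}
  count(deny) > 0 with input as plan
}
```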
Example Policies
The following are some example policies we have incorporated here at loveholidays:
- Ensure resources have a ‘team’ label, with a correct value that is a real team at loveholidays
- Ensure resources are not created outside of the EU
- Ensure we are only using whitelisted external Terraform Modules
- Ensure uses of our loveholidays’ custom Terraform Modules meet a minimum version requirement (if we release a mandatory module upgrade, we can bump the minimum tag in the policy and force users to upgrade on a subsequent Terraform plan)
- Ensure resources have all required labels based on resource type
- Deny public GCS buckets
- Allow only Google Groups we manage in Terraform to appear in a project’s IAM list
- Deny any use of over-privileged roles such as project owner
Most of our rules will look at the contents of all resources in the Terraform configuration on each plan, so if we add a new policy, it will raise all existing violations in the code base.
The other style of policy only looks at the resources present in the planned changes for a particular PR. For example, if a new firewall rule is added that exposes port 22 on a particular VM, it could trigger a rule that says “fail when a VM will be exposed on port 22”, which then requires a predetermined owner to approve the changes.
Atlantis has a concept of policy owners. This group of approvers can comment atlantis approve_policies on a PR following a policy violation, which enables the requestor to then comment atlantis apply and proceed with the PR. On a subsequent PR, this firewall rule would not raise a violation, because it will not appear in the planned changes of that plan.
This enables us to catch changes to the infra as they are being applied, but not falsely report them on every plan.
For our static policies that evaluate all resources, where we do have existing violations, we utilise Regula’s waivers feature to make special exceptions / whitelists against a particular resource ID and policy ID.
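A waiver lives alongside the policies as Rego. A minimal sketch, with made-up rule and resource IDs (check Regula’s waiver documentation for the exact schema and supported fields):

```rego
# Sketch of a Regula waiver exempting one resource from one rule.
# Rule and resource names are illustrative.
package fugue.regula.config

waivers[waiver] {
  waiver := {
    "rule_name": "deny_public_gcs_bucket",
    "resource_id": "google_storage_bucket.legacy_public_assets"
  }
}
```

Keeping waivers in the same repository as the policies means every exception is itself reviewed through a PR, rather than silently configured.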
There’s a lot of information to take in here, but hopefully this serves as a good first insight into how we are operating a secure, self-serve infrastructure pipeline at scale, with only a minimal infrastructure team.
Keep an eye out for future articles on a technical deep dive into how exactly we have pieced together Regula + Conftest + Atlantis, as well as how we are using Workload Identity to drive forward a world without JSON keys.
We’re hiring for a Senior or Intermediate Infrastructure Engineer at loveholidays. If the above article resonates with you and you want to help us build great stuff on Google Cloud, reach out to us using the links above.
Alternatively, you can see all of our open roles here: https://careers.loveholidays.com/