Deep Dive: Testing a Microservice Architecture with QA Environments


Dan Hipschman


Here at Opendoor we have a microservice architecture. That is, our product is provided by multiple services running on the backend. Generally this works well, but it also introduces some challenges, such as making it harder to run integration tests or to go through customer flows during development.

It’s a problem that many tech companies face: How do you make it easy to test changes as the company and software architecture grow in complexity? Some companies will hire a QA team. This is an excellent way to prevent issues from reaching customers, but QA engineers need an environment in which to test. The company may adopt a release train so QA can test a large batch of changes before releasing them. There’s a downside though, as this can slow development. If an engineer misses the release window, they have to wait for the next one. At other companies engineers can release independently and test in a shared staging environment, but this can lead to conflicts, things breaking, and slower velocity. There are many ways to approach the problem, and various levels of investment that can be made to solve it. But every company with a growing software system will have to address it at some point.

At Opendoor, we set out to make it easy for engineers to test complicated changes independently, and early in the development lifecycle, which ultimately increases velocity. We’ve further enabled testing by our cross-functional partners like product management or design. Reducing the burden of testing allows our engineers to focus more on creating quality products for our customers. Continue reading to learn how we implemented QA environments to enable this and more.

The Problem

Testing in a microservices world can be challenging. We’re big proponents of the testing pyramid, so we use unit tests as much as possible. However, some changes are hard to test with unit tests alone: for example, changes that span multiple services, sweeping changes (like a recent major upgrade of our Rails version), or changes to a service’s configuration. There are going to be times when you just want more confidence than unit tests alone give you.

We ran into some obstacles when it came to testing these kinds of changes:

  • Some tests require multiple services running. It can be hard to know how to run all these services, and it can turn a developer’s laptop into a space heater!
  • In order to test, you need realistic data. While we have realistic data in our staging environment, tests can potentially mutate that data and “corrupt” it for other tests — meaning someone ran a test that left some data in an unexpected or broken state.
  • Production environments differ from local testing in networking, configuration, user data, etc.
  • Those outside of the development team, like QA teams, designers, or product managers, found it basically impossible to test backend changes before they were merged.

So, we created QA environments to solve these problems.

QA Environments

QA environments are essentially mini, isolated versions of staging. But instead of running every service in each QA environment, you can run just the services you need. And instead of there only being one environment, as with staging, you can have as many QA environments as you like, on demand. Each environment can run a different version of each service, and each service can have its own independent database.

More concretely, a QA environment is a collection of pods, services, ingresses, and other resources in Kubernetes that work together to form a group of services that can communicate with each other. We’ll go into more detail on this later.

From a developer’s point of view, whenever they create a pull request (PR), they can also create a QA environment associated with it to test their changes in an environment similar to production. They specify what services they want to deploy in the environment along with the PR for the changes they want (or even multiple PRs, since not all services are in the same git repository).

The QA Concierge (QAC 🦆)

Creating a QA environment has several steps, and we want to automate this. Hence, we created an internal service called QAC (pronounced “quack”) to facilitate it. QAC has a UI, and also an API that can be called from continuous integration (CI). So users can go into the UI to create an environment for themselves, or CI can create an environment automatically to run tests.

QAC also plays a part in request routing, which we’ll talk about later when we get to another supporting service, QARP. We’ll describe how QAC works in detail in a bit. But first, we need some background about our tech stack.

Tech Stack

Here’s our tech stack at a high level, as far as it’s relevant to this article.

We deploy our services with Kubernetes using Helm. We build Docker images in CI for each service. We have a staging environment that we deploy to once code has been merged into the main branch. We run some automated integration tests in staging before deploying to production.

Generally, all external dependencies for a service are configured via environment variables, roughly following Twelve-Factor principles: For example, each service with a database stores the URL in an environment variable, usually called DB_URL. Similarly, if service A calls service B, then service A will have the hostname of service B’s load balancer in an environment variable, usually called SERVICE_B_HOST. Storing configuration in environment variables is awesome because it makes it easy to deploy the service in different environments without updating the code or Docker image. This will be crucial later. Environment variables are configured as part of the Helm chart, and each environment (production, staging, and QA) will be able to set these separately.

Some of our services use Kafka. Some use Redis for background jobs (e.g., Sidekiq in Rails, or Celery in Python). These are also configured via environment variables.
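As a rough illustration, here’s what this convention might look like in a container spec rendered from a Helm chart. DB_URL and SERVICE_B_HOST follow the naming convention described above; the other variable names, and all of the values, are hypothetical placeholders rather than our actual configuration.

# Hypothetical Deployment rendered from a Helm chart. Only DB_URL and
# SERVICE_B_HOST are conventions described above; other names and all
# values are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a
spec:
  selector:
    matchLabels:
      app: service-a
  template:
    metadata:
      labels:
        app: service-a
    spec:
      containers:
        - name: service-a
          image: registry.example.com/service-a:abc123   # built in CI
          env:
            - name: DB_URL                               # this service's database
              value: postgres://service-a-db.staging.example.com:5432/service_a
            - name: SERVICE_B_HOST                       # hostname of service B's load balancer
              value: service-b.staging.svc.cluster.local
            - name: REDIS_URL                            # background jobs (Sidekiq / Celery)
              value: redis://redis.staging.svc.cluster.local:6379/0
            - name: KAFKA_BROKERS                        # Kafka bootstrap servers
              value: kafka.staging.example.com:9092

Because all of this lives in the pod spec rather than the image, the same Docker image can be deployed to production, staging, or a QA environment just by changing these values.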

We have realistic data in our staging databases. We run a weekly ETL that copies data from production to staging, and scrubs email addresses and other sensitive data. This is important because it allows realistic testing data in staging. Unfortunately, it introduces the possibility of temporarily corrupting staging data as well.

For frontend code, we build frontend assets (minified JS and CSS) and upload them to a content delivery network (CDN). Requests from the frontends get routed to an AWS load balancer via DNS. These then get routed to the correct internal service via NGINX ingresses we have running in Kubernetes. The routing at the NGINX level is generally based on the URL path or subdomain. In practice, most frontend requests end up going to our GraphQL service, which runs in Kubernetes like everything else. The GraphQL service then routes to other backend services. Here’s a diagram to make it more clear.

[Diagram: requests from the frontend are routed via DNS to an AWS load balancer, then through NGINX ingresses in Kubernetes to the GraphQL service and other backend services]

This is our tech stack in a nutshell. Now let’s dive into the details of how we implemented QA environments!

How QA Environments Work

Let’s list out all the steps involved in the lifecycle of a QA environment. We can then go into them in more detail.

User Actions

  1. First, a developer makes one or more PRs for some changes.
  2. If the developer wants a QA environment to test their changes, they go into the QAC UI. There, they can create an environment and tell QAC what services to deploy.
  3. For each service the developer wants to deploy, they provide a link to the PR with any changes, or they can say to deploy the service off the main branch instead.

QA Deployment

  1. Once the user has specified the services, if the user requested independent databases for any of them, QAC will start the process to provision them. If the user didn’t request independent databases, then the default is for the services in QA to connect to the staging database.
  2. Once databases are provisioned (if requested), QAC will load the Helm charts for each service. It has a simple plugin model for finding and loading Helm charts, to make it easy to support different types of services (e.g., all services written in Go are handled with the same plugin, as they all follow the same conventions).
  3. Since all our services are configured via environment vars, QAC will update the environment vars in the Kubernetes resources (rendered from the Helm chart) as needed to run the service in QA.
  4. QAC also does some other processing, like adding labels to all Kubernetes resources to indicate which environment they’re in (each QA environment has a numeric ID). It also adds node selectors and tolerations so QA services run on dedicated Kubernetes nodes (we don’t want them competing for production resources).
  5. If the user requested independent databases, QAC also inserts the new URLs for them into the environment. It also runs any migrations for the service. How to run migrations differs by service, which QAC again solves via a plugin model. We’ll discuss that more below. Each migration ultimately results in a Kubernetes Job that gets applied and waited upon.
  6. Once the Kubernetes resources are ready, QAC simply applies them in Kubernetes.

Running in QA

  1. Now that the QA services are running in Kubernetes, we need to get requests to them. This is done by setting two cookies in the user’s session. One is used for canary routing (described below), and the other includes the numeric environment ID. The QAC UI has a button that will generate and set these cookies when clicked. This makes it easy for anyone, not just the developer, to “enter” a QA environment at the click of a button.
  2. For each of our services running in staging, we’ve set up an NGINX canary ingress. The canary ingresses are deployed alongside the normal ingress for each service. They’re identical to the normal ingress, except they include an annotation with a cookie name. This causes NGINX to check if that cookie exists, and if so, route to a different backend. Otherwise it will route to the normal ingress backend (which would be the staging service).
  3. When the cookie is set, the canary ingress routes it to a custom reverse proxy we built, called QARP (QA Reverse Proxy). This is a Caddy server with a custom reverse proxy module, making the code we need to maintain pretty minimal.
  4. When QARP gets a request, it will extract the QA environment ID from the cookie. Additionally it extracts the backend service name from a header that gets added by the service’s ingress. It then calls QAC’s API to check if the given backend service is running in the QA environment matching the ID. If so, QARP will forward the request to the QA service by changing the host to the local cluster address. If not, it forwards the request on as-is to staging. This setup allows us to route to either staging or a particular QA environment based solely on some cookies developers can easily set.

That’s a lot to take in, so here’s a simplified diagram to show the above flow:

[Diagram: simplified QA environment flow, from PR creation and QAC deployment to request routing through canary ingresses and QARP]

Now that we’ve listed out the steps at a high level, let’s go into more detail about some of them.

The QAC UI

To make it easy for developers to create QA environments, we made a UI. The home screen lists out all the active environments, with the name of who created them.

[Screenshot: the QAC UI home screen, listing active QA environments, their creators, and auto-cleanup times]

Clicking on an environment will take you to the details page, which we’ll describe next. You’ll also note an auto-cleanup time for each environment. To avoid wasting resources, each environment is automatically deleted a week after creation. There’s a button on the details page to extend the auto-deletion if you need it for longer. The last thing you can do on the homepage is to create a new environment. It will ask you for a name, and then take you to the details page for it.

Here’s what the details page for an environment looks like.

[Screenshot: the QAC UI details page for a single QA environment]

On the details page, you can see which services are deployed in the environment (initially none), or select which services you’d like to deploy. We have a dropdown for which services are available, which makes the UI easier to use. You select the service, and then enter the PR URL for the changes you want to apply. Technically we support deploying off a branch or with a given Docker image, but these are mostly used for debugging by core developers. The one exception is deploying a service on the main branch. This is useful if you haven’t made any changes to the service, but need it in your environment. Lastly, you can also check a box to get an independent database.

Once you select all your services, there’s a button to deploy them. At this point you sit back and relax while they deploy. Once they’ve deployed, the UI will show a list of pods with their respective Kubernetes status (e.g., “Ready”), and links to Splunk logs for the pod.

You can also set cookies to “enter” your environment as described above, as well as delete the environment altogether once you’re done using it.

Note that although QAC has a UI, it also has a well-defined API. This is important because we designed it to be used by both developers and CI. CI will use the API to automate running integration tests in QA environments. This avoids race conditions and data pollution that you can get by running pre-merge integration tests in staging.

Provisioning Databases

The user can request an independent database for their service running in QA. This enables testing migrations, or running tests that mutate data without corrupting staging. Tests in CI especially benefit from this to avoid race conditions and flakiness from mutating the same objects.

To create these independent databases we use Aurora. Because we use Amazon RDS for all our databases, Aurora makes it easy to create relatively fast clones of them. We run a job every weekend that makes snapshots of our production databases, and then runs transformations to scrub sensitive data like emails. We then take a second snapshot of this and create an Aurora cluster and database instance from it. At the same time we also delete any old clusters. This is all very slow, so it needs to be done offline, before a developer directs QAC to create a database clone. When QAC needs an independent database, it finds the cluster for the given service and “restores” it, which creates a new cluster, and then creates a database instance from that. This is “relatively” fast, taking about 10 minutes, though we hope to make it faster in the future (see Alternative Designs at the end of this article).

N.B. Aurora databases might count towards RDS quotas in your account, so be careful with this approach.

Rendering Helm Charts

All our services define their Kubernetes resources via Helm. These resources define what gets deployed, along with load balancers, etc. QAC needs to modify these resources in order to inject independent database URLs, other service hostnames, labels, etc. The Helm charts are a template for Kubernetes resources, and the template parameters are different for each environment (e.g., QA, staging, production).

For staging and production, we deploy via CI. That is, CI will call Helm to deploy the service. For QA, QAC will do this (although there’s an alternative we considered that I’ll describe below). QAC doesn’t use Helm to deploy directly because it needs to modify the resources before deploying. Hence it runs helm template to generate the templated YAML resources, which it can parse and patch.

Generating the resources differs slightly depending on the service. Most of our services are written in either Go, Python or JavaScript (node), with one major service being written in Ruby. Each of these is slightly different to deploy, so QAC has a plugin system. It will detect the type of service being deployed, and has handlers for each to generate the resources. This is easy to extend for any future additional cases (we’re using the generic procedure pattern from the excellent book Software Design for Flexibility). The handler basically returns two things: 1) the Kubernetes resources for deployment, and 2) the Kubernetes resources for running database migrations. The latter is typically a Kubernetes Job that runs the migration command.
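Here’s a hedged sketch of the migration resource a plugin handler might return for our Rails service; the resource name, image, and exact command are placeholders rather than our actual chart output.

# Hypothetical migration Job a plugin handler might return for a Rails service.
# Names, image, and command are illustrative placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: rails-service-migrate
spec:
  backoffLimit: 0                    # fail fast; QAC waits on this Job to finish
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/rails-service:abc123
          command: ["bundle", "exec", "rails", "db:migrate"]
          env:
            - name: DB_URL           # patched by QAC, as described below
              value: ${qa_db:postgres://rails-service-db.staging.example.com:5432/app}

QAC applies a Job like this only when the user requested an independent database, and waits for it to complete before applying the deployment resources.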

Patching Kubernetes Resources

Once QAC has generated the Kubernetes resources from the Helm chart, it needs to make some modifications.

First, QAC makes sure every resource has a unique name in Kubernetes, so there are no conflicts. Then it adds a qa_env label to every resource, where the value is the numeric environment ID. This makes it easy to search for all the resources belonging to an environment. E.g., deleting an environment is mostly just running kubectl delete -l qa_env=N for each resource type.

QAC then adds nodeSelector and tolerations to the deployment specs. This is so pods run on nodes dedicated to QA. This is an implementation detail that depends on how the Kubernetes cluster is set up. But in general, QA services should run on dedicated nodes, and not compete for resources with staging or production.
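Taken together, the patched metadata and scheduling constraints on a Deployment might look roughly like this for environment 42. The qa_env label is the one described above; the node label and taint key are placeholders for whatever the cluster actually uses.

# Hypothetical patches QAC might apply to a rendered Deployment for QA
# environment 42. The node label (node-pool) and taint key (qa-only) are
# illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-a-qa-42              # unique name per environment
  labels:
    qa_env: "42"                     # used to find (and delete) everything in env 42
spec:
  selector:
    matchLabels:
      app: service-a-qa-42
  template:
    metadata:
      labels:
        app: service-a-qa-42
        qa_env: "42"
    spec:
      nodeSelector:
        node-pool: qa                # schedule only onto dedicated QA nodes
      tolerations:
        - key: qa-only               # tolerate the taint on those nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: service-a
          image: registry.example.com/service-a:abc123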

Finally, QAC performs the most complicated modifications: patching environment variables. We do this via variables in the Helm chart that QAC interpolates. I’ll describe the two most important ones.

First is ${qa_db:STAGING_DB}. QAC will parse the values of all environment variables and look for this string (generally it looks for anything of the form ${qa_XYZ:STRING}). In this variable, STAGING_DB is the URL of the staging database. It’s used as the default if the user has not requested an independent database for this service. In that case the variable will simply resolve to this substring. If the user has requested an independent database, however, then QAC will replace this variable with the URL to the newly provisioned database.

The other is ${qa_service:STAGING_HOST}. For example, you might have ${qa_service:user-service}. The value depends on whether the service (in the example, user-service), is running in the current QA environment or not. If it is, then the variable will be replaced with the hostname of the Kubernetes Service for the QA service. If not, then the variable will be replaced with the staging hostname. QAC can determine whether or not the given service is running in the current QA environment by looking at all the Kubernetes Service resources in that environment to see if the name exists.
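Concretely, the env entries in a chart’s QA values might look like the first block below before patching, and something like the second block after QAC resolves them. All hostnames and URLs here are made-up placeholders.

# Hypothetical env entries as rendered from the Helm chart, before QAC patches them:
env:
  - name: DB_URL
    value: ${qa_db:postgres://service-a-db.staging.example.com:5432/service_a}
  - name: USER_SERVICE_HOST
    value: ${qa_service:user-service}

# What QAC might resolve them to in environment 42, assuming the user requested an
# independent database and also deployed user-service into the environment:
env:
  - name: DB_URL
    value: postgres://qa-42-service-a.cluster-abc123.us-east-1.rds.amazonaws.com:5432/service_a
  - name: USER_SERVICE_HOST
    value: user-service-qa-42
# If user-service were not in the environment, USER_SERVICE_HOST would instead keep
# the staging hostname (user-service).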

Finally, QAC also processes Kubernetes Ingresses. It performs variable substitution in ingress hostnames, similar to the environment variable patching described above. The main variable that’s used in this case is ${qa_id}, which resolves to the numeric environment ID. This allows hostnames to be unique per environment. In fact, they must be, otherwise a QA environment would try to use the same hostname as another, or even worse, a staging or production hostname. To avoid these dangerous collisions, QAC will remove any Ingress resources in which the hostname doesn’t include the string qa-N (N here being the environment ID).
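For example, an ingress host in the chart’s QA values might be templated like this (the domain and service name are placeholders):

# Hypothetical ingress host templating; the domain is a placeholder. For
# environment 42 the host resolves to my-service-qa-42.staging.example.com,
# and QAC would drop any Ingress whose host doesn't contain "qa-42".
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
spec:
  rules:
    - host: my-service-qa-${qa_id}.staging.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service       # renamed per environment by QAC
                port:
                  number: 80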

Applying Migrations and Deployment

Once the Kubernetes resources have been patched, they can be applied. Each service actually has two sets of Kubernetes resources: one for running database migrations, and one for deploying the service. QAC applies the migrations only if the user requested an independent database. We want to be sure that QAC never accidentally runs migrations against the staging database, as that could break it for other developers. To prevent this QAC ensures that the ${qa_db} variable was used somewhere in the Kubernetes resources (both migration and deployment). This prevents errors where someone didn’t set up their Helm chart for QA correctly, and hard-coded the staging DB instead of the QA variable.

The migrations are applied first, and QAC will wait for them to finish (the migrations must be a Kubernetes Job that can be waited for). Then it applies the deployment resources. If all goes well, then the QA environment is successfully deployed!

Routing Requests to QA Environments with QARP 🐟

To “enter” a QA environment, the user sets two cookies (the QAC UI provides a button to make this easy):

  • qa=always
  • qa_env=<env_id> (this is actually a JWT to provide security against non-Opendoor employees setting these cookies)

The request gets sent to an AWS load balancer (via a CNAME record), which sends it into our Kubernetes cluster. There, an NGINX controller will pick it up based on Ingresses we set up. Each Ingress has a matching canary ingress, which is defined by the following two annotations:

nginx.ingress.kubernetes.io/canary: "true"
nginx.ingress.kubernetes.io/canary-by-cookie: "qa"

The way this works is that we have a normal ingress rule that routes to a backend service, and also a corresponding canary ingress. For example, we have a normal ingress that matches

host: athena.opendoor.com
path: /

and sends the request to our GraphQL server. We also have a canary ingress, which is identical, except that it has the above two annotations, and sends the request to a special backend service called QARP.

The normal ingress also sets a header with the name of the backend service to which it would normally send the request. This header is set even when the request is instead routed to QARP, which lets QARP know which backend the request was meant for.

These two ingresses are easy to keep in sync because they’re part of our standard Helm template.
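Putting those pieces together, the pair of ingresses for the GraphQL service might look roughly like the sketch below. The X-Backend-Service header name, and using a configuration-snippet annotation to set it, are illustrative assumptions rather than confirmed details of our setup.

# Hedged sketch of a normal ingress and its canary twin. The header name and
# the way it's set (configuration-snippet) are assumptions for illustration.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: graphql
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header X-Backend-Service "graphql";
spec:
  rules:
    - host: athena.opendoor.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: graphql              # the normal (staging) backend
                port:
                  number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: graphql-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-cookie: "qa"   # triggers when qa=always is set
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header X-Backend-Service "graphql";
spec:
  rules:
    - host: athena.opendoor.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: qarp                 # the QA reverse proxy
                port:
                  number: 80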

These canary ingresses ensure that any requests with a qa=always cookie are routed to our QA Reverse Proxy (or QARP, pronounced “carp”). QARP is a Caddy instance with a custom module. Caddy allows custom code to run dynamically on each request as a “matcher”. It will route the request to one of two destinations depending on whether the request matches what the module is looking for. The module calls QAC with the QA environment ID (which QARP gets from the qa_env cookie), and the backend service name (which the ingress set in a header). QAC can quickly look up whether that backend is running in the given environment, and return a boolean response. QARP can then route the request to the QA backend if it exists, or on to staging if not. The QA backends have predictable names (basically service-name-qa-N), so QARP can easily construct them.

[Diagram: request routing through the canary ingress and QARP to either a QA service or staging]

Note that requests for purely frontend assets (JS and CSS) don’t go through Kubernetes. We store these in S3. We also have a way to use PR-specific frontend assets that pair with, but are a different implementation from, QA environments. These are called deploy previews and you can read the deep dive about them in another blog post.

Keeping Environments Up-To-Date

Deploying some services when a QA environment is created is all well and good, but we need to keep it up to date. Since each service in the environment is associated with a PR (or the main branch), keeping them up to date means redeploying when new commits are pushed to those PRs (or main branch). I mentioned above that services could also run off an arbitrary branch, but this is mostly for debugging purposes, and automatically redeploying these services is not supported.

We leverage our existing CI system for this. As commits are pushed, the CI jobs run. We simply need to add a CI job to call QAC to tell it to redeploy.
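The added job is tiny. As a hedged sketch (using GitHub Actions-style syntax; the QAC endpoint, payload, and secret names are hypothetical):

# Hypothetical CI workflow that tells QAC to redeploy any QA environment
# associated with this PR. Endpoint, payload, and secret are placeholders.
name: redeploy-qa-env
on:
  pull_request:
    types: [synchronize]                   # runs when new commits are pushed to the PR
jobs:
  notify-qac:
    runs-on: ubuntu-latest
    steps:
      - name: Tell QAC to redeploy
        env:
          QAC_TOKEN: ${{ secrets.QAC_TOKEN }}
          PR_URL: ${{ github.event.pull_request.html_url }}
        run: |
          curl --fail -X POST "https://qac.internal.example.com/api/environments/redeploy" \
            -H "Authorization: Bearer ${QAC_TOKEN}" \
            -H "Content-Type: application/json" \
            -d "{\"pr_url\": \"${PR_URL}\"}"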

[Diagram: CI calling QAC to redeploy a QA environment when new commits are pushed]

Cleaning Up Environments

We noticed that developers often forget to delete their environments, which ends up wasting resources. Hence we have two ways to help with this. First, when all the PRs associated with an environment are closed (merged or otherwise), the environment is deleted. Second, we delete environments automatically after one week, regardless of whether the associated PRs were closed. This second approach is because developers sometimes make PRs just for testing, and forget to close them. However, we realize sometimes developers do need environments longer than one week, so we also allow developers to postpone the auto-deletion time, or opt out entirely.

Deleting environments that are over a week old is pretty straightforward. We have a job that runs on a timer (hourly), and looks for open environments that are past their auto-delete time. Deleting environments when all PRs close is a little tougher, since we don’t have CI jobs that run in this case. To deal with that we set up a GitHub webhook so we can get notifications whenever PRs are closed.
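As a minimal sketch, the hourly sweep could be something like a Kubernetes CronJob that calls QAC; the image and arguments are placeholders, and in practice the job could just as easily live inside QAC or the CI system.

# Hypothetical hourly cleanup CronJob. The image and arguments stand in for
# whatever actually calls QAC's API to delete expired environments.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: qa-env-cleanup
spec:
  schedule: "0 * * * *"                    # top of every hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: registry.example.com/qac-cli:latest
              args: ["cleanup", "--expired-only"]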

Alternative Designs

Independent Databases

There’s an alternative to using Aurora to clone databases that we considered: run Postgres on an EC2 instance with a btrfs or ZFS filesystem. The idea is that you can copy data from an RDS database to a Postgres server on EC2 using Amazon DMS, then turn that server off; we only need the data files on disk. When we want a new independent database, we can use the copy-on-write (CoW) feature of btrfs or ZFS to make a copy of the database files, and start a new instance of Postgres, on a new port, using this copied data. Because this is CoW, it’s extremely fast and storage efficient. Even for a ~10TB database, the copy only takes a few milliseconds. And the inserts and updates we do for testing typically only add a few extra kilobytes of storage.

We did a proof of concept of this, and it appears to work perfectly. However, the team decided that using Aurora was probably a better thing to start with because it’s an off-the-shelf solution and more engineers on the team were comfortable with it. Either will work, but if we were doing this again we would go with the CoW implementation. We had to completely refactor QAC to support creating Aurora instances, because it takes long enough that everything needs to be done in a background task. Plus adding 10 minutes to environment spin up is quite significant. Later we discovered postgres.ai, which basically does exactly what’s described above, so that would be worth investigating as well.

Deploying

QAC deploys environments by rendering Helm charts, making changes, and applying the resulting Kubernetes resources. There’s a bit of overlap with our CI/CD system here, which also renders Helm charts and applies them. Hence, another approach we thought about was to let CI/CD do all the Helm / deployment logic, and then set up a Kubernetes mutating webhook to change the resources just before they get applied.

This is elegant in theory, because QAC doesn’t need to know how to render Helm charts. However, QAC still needs to know what other services are running in the same QA environment as the one being deployed (it needs to know this to resolve ${qa_service} variables). The way it knows this is by rendering the Helm charts and looking at Service resources. Hence, it would need to render Helm charts anyway. The team was also less familiar with Kubernetes webhooks, so we abandoned this idea. It might be a better way to do it if QAC had some other way to know which services were in an environment (e.g., if all services were named in a particular consistent pattern so QAC could infer it).
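For reference, the abandoned approach would have registered something like the following with the Kubernetes API server, so that resources applied by CI/CD are sent to QAC for patching before they’re persisted. This is a rough sketch; the webhook name, service, path, and namespace selector are all hypothetical.

# Hypothetical MutatingWebhookConfiguration for the alternative design. The
# webhook name, service, path, and namespace selector are placeholders.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: qac-qa-patcher
webhooks:
  - name: qa-patcher.qac.svc
    admissionReviewVersions: ["v1"]
    sideEffects: None
    namespaceSelector:
      matchLabels:
        qa-environments: "enabled"         # only mutate resources destined for QA
    clientConfig:
      service:
        name: qac
        namespace: qa-tooling
        path: /webhooks/mutate
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: ["apps", "networking.k8s.io", ""]
        apiVersions: ["v1"]
        resources: ["deployments", "services", "ingresses", "configmaps"]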

Learning Experiences

We learned a few interesting things by working on this project:

  • Standardization is leverage. The more consistent your infrastructure is, the higher leverage you get by writing tooling for it. Luckily, almost all our services use Helm and are deployed to Kubernetes. If some services used a different tool, and some were deployed to, say, bare metal EC2, GCP, or Heroku, then more and more code would be needed to support those services.
  • Running tests in CI is important. We built QA environments initially for manual testing, but once complete, the main request from our developers has been to support integration and smoke testing in CI. Automated tests that run on every PR can catch edge cases that developers could miss during manual testing. You should strive to automate as much as possible for developers to speed up their workflow and release stable code on a quicker cadence.
  • Storing configuration in environment variables is important. This makes it really easy to deploy the same Docker image in different environments. E.g., we have a Rails app that stored gRPC service hostnames in a YAML file. This is much messier to deal with, because QAC would need to have knowledge about how that specific application is written to modify the hostnames if those gRPC services were running in the same QA environment. (We just ended up moving these hostnames into environment variables to solve this.)

Conclusion

QA environments are non-trivial. Three engineers (including myself) spent about six months tackling tricky Helm templating, solving ingress routing, modifying pipelines, setting up multiple new services, etc. When built thoughtfully, though, the final solution can be fairly robust and easy to extend.

The initial work to set this up pays off when it allows developers to test their pull requests before merging in a way they never could previously. Designers, product managers and QA teams can now interact with changes before they go live to customers. Automated tests can avoid flaky failures caused by race conditions where one test mutates data another test depends on. Developers now have a safe, independent environment to debug a tricky issue that doesn’t show up locally, without risking breaking staging for other devs. QA environments give developers confidence in the important things: releasing faster with fewer bugs.

We hope you found this article useful and helpful if you’d like to build this kind of infrastructure yourself. If you’d like to work with us on this kind of project here at Opendoor, we are hiring!