Deployment: What good looks like

Introduction

Deployments might be done to introduce new functionality or to address a defect, but a deployment is also one of the most critical operations a service goes through — moving from a known state to an unknown state. A good deployment pipeline reduces the fear and stress of going into the unknown by providing the right tools and processes to the operators of the service, leading to happier developers and operators and, in turn, to higher-quality services.

This document aims to describe, without going into implementation details, what features a deployment process should have to enable teams to deploy with confidence.

The main areas of improvement are believed to be adopting canary deployments and making the deployment process support faster rollbacks, coupled with increased monitoring.

Each service in Klarna is unique and at different maturity levels, and deployment pipelines require different features. This document only describes the features needed to reach a good enough state. It’s also important to point out that what’s outlined here is not an all-or-nothing process, but should rather be seen as a reference process with good practices that could be added to a service deployment pipeline over time.

By continuous improvement in each service deployment pipeline and increased usage of shared tooling, each service will eventually reach the goal — to deploy with confidence.

To track that we reach the end goal, two KPIs have been defined in Klarna to measure improvements to deployment pipelines.

  • Service Level Objective (SLO). An SLO measures the performance of a service against a target value. The SLO violation rate is compared to the service’s error budget. In Klarna, the centralized infrastructure metrics platform is used for SLO tracking.
  • Mean Time To Resolution (MTTR). The time from when a failure occurs until a corrective measure has been applied. In Klarna, the disturbance duration attribute on incident tickets is used to capture this. (A sketch of how both KPIs could be computed follows this list.)
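
As a rough illustration of how these two KPIs relate to raw data, the sketch below computes an SLO violation rate against an error budget and an MTTR from incident start and resolution times. The numbers, field layout, and target value are hypothetical examples, not Klarna’s actual metrics platform or ticket schema.

```python
from datetime import datetime, timedelta

# Hypothetical availability SLO: 99.9% of requests succeed over the window.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

total_requests = 1_000_000
failed_requests = 650

violation_rate = failed_requests / total_requests
budget_consumed = violation_rate / ERROR_BUDGET  # > 1.0 means the budget is exhausted
print(f"violation rate: {violation_rate:.4%}, error budget consumed: {budget_consumed:.0%}")

# MTTR from hypothetical (start, resolved) pairs, mirroring a
# "disturbance duration" attribute on incident tickets.
incidents = [
    (datetime(2021, 5, 3, 10, 0), datetime(2021, 5, 3, 10, 42)),
    (datetime(2021, 5, 17, 22, 15), datetime(2021, 5, 17, 23, 5)),
]
durations = [resolved - started for started, resolved in incidents]
mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")
```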

Definitions

Terms and definitions used throughout this document:

  • Deployment. All changes that mutate the service, for example, infrastructure, software, or configuration. Deployment does not have to change the behavior of the service.
  • Release. A change to the service that alters its behavior. The key distinction between deployment and release is that a release changes business logic.
  • Blue/green deployment. A blue/green deployment maintains two instances of a system: one live system that serves traffic (the green system) and another that is ready to serve traffic (the blue system). During deployment, the new, blue system is provisioned and traffic is shifted from the green system to the blue system. Having two systems live at the same time allows traffic to be shifted quickly in both directions.
  • Canary deployment. A canary deployment introduces a new version of a component for a small part of the workload to validate that it behaves correctly before taking on the full load. Canary deployments can be used together with the blue/green deployment approach (see the traffic-shifting sketch after these definitions).
  • Rolling forward. Instead of reverting or undoing a deployment, a change is applied to the service, configuration, or data schema to address a problem that occurred after or during a deployment.
  • Synthetic traffic. Artificial traffic that is generated towards a service to exercise flows. It can be done to warm up a service or to test for correctness.
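
To make the blue/green and canary terms above concrete, here is a minimal sketch of a traffic splitter that sends a configurable share of requests to the blue system and the rest to the green system. It is purely illustrative: in practice the shifting is normally done at the load balancer or service mesh layer, not in application code.

```python
import random

class TrafficSplitter:
    """Routes a weighted share of requests to the blue (new) system."""

    def __init__(self, blue_weight: float = 0.0):
        self.blue_weight = blue_weight  # 0.0 = all green, 1.0 = all blue

    def shift(self, blue_weight: float) -> None:
        """Change the blue share, gradually or instantly (for demotion)."""
        self.blue_weight = max(0.0, min(1.0, blue_weight))

    def route(self) -> str:
        return "blue" if random.random() < self.blue_weight else "green"

splitter = TrafficSplitter()
splitter.shift(0.05)   # canary: roughly 5% of traffic to the new version
splitter.shift(1.0)    # promotion: all traffic to the blue system
splitter.shift(0.0)    # demotion: divert everything back to the green system
```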

Deployment Best Practices

The best practices described here are meant for deployments of changes to an existing codebase, for example, introducing new code or configuration. Migrating between codebases or deployment stacks, or making major breaking changes, should be treated as an exception to the normal deployment flow and given extra attention.

To meet the KPI targets (SLO and MTTR), a best-practice deployment mechanism should:

  • Be automated to minimize human errors. The process can include manual gates if needed — that is, human approval is required to proceed to the next step.
  • Use the blue/green deployment pattern to allow for fast traffic shifting.
  • Use canary deployment with fine-grained traffic control (possibility to shift a percentage of traffic).
  • Minimize the changes included in each deployment to make it easier to understand what is different and reduce the number of potential failure scenarios.
  • Introduce all changes to a service, including configuration changes, using the same code deployment principles.
  • Separate deployment from release whenever possible — that is, if specific audiences or long validation times are needed, it is better to use A/B testing or feature toggles (see the toggle sketch after this list).
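
As a sketch of the last point, the example below gates new business logic behind a feature toggle so that code can be deployed dark and released later to a specific audience. The toggle store, toggle name, and audience rule are hypothetical placeholders for whatever feature-flag tooling is available.

```python
# Hypothetical in-memory toggle store; a real setup would use a shared
# feature-flag service so toggles can be flipped without a deployment.
TOGGLES = {"new_checkout_flow": {"enabled": True, "allowed_markets": {"SE", "DE"}}}

def is_enabled(toggle: str, market: str) -> bool:
    cfg = TOGGLES.get(toggle, {})
    return bool(cfg.get("enabled")) and market in cfg.get("allowed_markets", set())

def checkout(order: dict) -> str:
    # The new code path ships with the deployment but is only *released*
    # to the configured audience; everyone else keeps the old behavior.
    if is_enabled("new_checkout_flow", order["market"]):
        return new_checkout(order)
    return legacy_checkout(order)

def new_checkout(order: dict) -> str:
    return f"new flow for order {order['id']}"

def legacy_checkout(order: dict) -> str:
    return f"legacy flow for order {order['id']}"

print(checkout({"id": 1, "market": "SE"}))  # new flow
print(checkout({"id": 2, "market": "US"}))  # legacy flow
```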

Note: These should be seen as strong guiding principles, and as many of them as possible should be applied, given the technical capabilities of the infrastructure. There are often changes to a service for which it is not technically feasible to implement them all.

Phases of a deployment

The lifetime of a deployment can be described with the following phases:

All of the individual phases must complete for the deployment as a whole to be considered completed. There should be a way to observe the state of each phase throughout the deployment, for example by emitting events at state changes (a sketch of such a state machine follows the phase list below).

  • Blue system deployment — A full deployment of the service in a blue environment. The blue system takes no live traffic at this point.
    Failure scenarios: New version and/or configuration failed to start, not enough resources to deploy a new system.
  • Pre-validation — Validation that the blue system deployment succeeded through smoke tests or synthetic traffic before taking live traffic.
    Failure scenarios: Smoke test or synthetic traffic failures.
  • Canary traffic — Some live traffic is shifted from the existing, green, system over to the new, blue system.
  • Canary validation — Validation of live traffic on the blue system using business and service metrics determines that the service is operating within its targeted performance. Canary validation should only be ongoing for as long as the validation needs to complete; the recommended time frame is no more than 3–4 hours.
    Failure scenarios: Business or service metrics indicate a drop in performance. A full demotion happens at this point.
  • Promotion — The canary validation was successful and the deployment can continue. If the canary validation was not successful a demotion happens.
  • Gradual rollout — Traffic is gradually shifted over from the green system to the new blue system.
    Failure scenarios: Business or service metrics indicate a drop in performance. A full demotion happens at this point.
  • Post-validation — All traffic has been shifted over from the green system to the blue system. Validation of the blue system using business and service metrics determines that the service is operating within the targeted performance.
    Failure scenarios: Business or service metrics indicate a drop in performance. A full demotion happens at this point.
  • Termination (green system) — The previous system is terminated; once this is completed, the deployment is considered done. The blue system now becomes the new green system.
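
One possible way to make the phases and their state observable, as mentioned before the list, is to model the deployment as an explicit state machine that emits an event on every transition. The sketch below is a simplified illustration: the phase names follow this document, while the event sink is a hypothetical stand-in for a real eventing or metrics pipeline.

```python
from enum import Enum, auto

class Phase(Enum):
    BLUE_DEPLOYMENT = auto()
    PRE_VALIDATION = auto()
    CANARY_TRAFFIC = auto()
    CANARY_VALIDATION = auto()
    PROMOTION = auto()
    GRADUAL_ROLLOUT = auto()
    POST_VALIDATION = auto()
    TERMINATION = auto()
    DEMOTED = auto()    # terminal failure state
    COMPLETED = auto()  # terminal success state

ORDER = [
    Phase.BLUE_DEPLOYMENT, Phase.PRE_VALIDATION, Phase.CANARY_TRAFFIC,
    Phase.CANARY_VALIDATION, Phase.PROMOTION, Phase.GRADUAL_ROLLOUT,
    Phase.POST_VALIDATION, Phase.TERMINATION,
]

def emit(event: str, phase: Phase) -> None:
    # Stand-in for publishing to an eventing or metrics pipeline.
    print(f"deployment event: {event} {phase.name}")

def run_deployment(validate) -> Phase:
    """Walk through the phases; demote on the first failed validation."""
    for phase in ORDER:
        emit("entered", phase)
        if not validate(phase):
            emit("failed", phase)
            emit("entered", Phase.DEMOTED)
            return Phase.DEMOTED
        emit("passed", phase)
    emit("entered", Phase.COMPLETED)
    return Phase.COMPLETED

# Example: everything passes except canary validation, leading to a demotion.
result = run_deployment(lambda phase: phase is not Phase.CANARY_VALIDATION)
print(result.name)
```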

Demotion

If validation fails in any of the phases, an instant demotion should happen and the blue system should immediately be taken out of use by diverting all traffic back to the green (previous), still-running system. For demotion to be possible, backward compatibility must be maintained; see the section on backward and forward compatibility.

Deployment lifetime

A canary should be promoted or demoted as quickly as possible, while still being given enough time for the deployment to be validated, to maximize service availability. The reason to minimize the canary lifetime is that the service is in an inconsistent state during the deployment process. The deployment is not considered finished until the previous version has been terminated and only a single version (the new one after a promotion, the old one after a demotion) is running.

Validation

To validate that the service is operating within its targeted performance, it’s important to have business and service metrics. Service metrics are technical metrics, such as counts of HTTP status codes, while business metrics should track things relevant to the product (orders placed, sessions created, cards provisioned, etc.).
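
As an illustration of such validation, the sketch below compares a handful of hypothetical service and business metrics from the blue (canary) system against the green baseline and decides whether to promote or demote. The metric names and tolerances are invented for the example.

```python
# Hypothetical metric snapshots collected during the canary window.
green_metrics = {"http_5xx_rate": 0.002, "p99_latency_ms": 180, "orders_per_min": 120}
blue_metrics = {"http_5xx_rate": 0.0025, "p99_latency_ms": 200, "orders_per_min": 118}

# Tolerated relative change per metric (assumed values for the example).
TOLERANCE = {"http_5xx_rate": 0.5, "p99_latency_ms": 0.2, "orders_per_min": -0.05}

def canary_verdict(green: dict, blue: dict) -> str:
    for metric, tolerance in TOLERANCE.items():
        baseline, candidate = green[metric], blue[metric]
        if metric == "orders_per_min":
            # Business metric: a drop beyond the tolerance fails the canary.
            if candidate < baseline * (1 + tolerance):
                return f"demote: {metric} dropped too far"
        else:
            # Service metric: an increase beyond the tolerance fails the canary.
            if candidate > baseline * (1 + tolerance):
                return f"demote: {metric} regressed too far"
    return "promote"

print(canary_verdict(green_metrics, blue_metrics))  # "promote" with these numbers
```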

In some cases, the coverage of live traffic can be incomplete. Synthetic traffic can help to drive metrics so they become statistically significant. For certain services or endpoints, synthetic traffic can be hard to implement.
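
A minimal sketch of what a synthetic traffic generator could look like is shown below, assuming a hypothetical endpoint, rate, and duration; a real setup would exercise complete business flows rather than a single URL.

```python
import time
import urllib.request

# Hypothetical endpoint exercised by the synthetic traffic.
TARGET_URL = "https://service.example.internal/health"
REQUESTS_PER_SECOND = 2
DURATION_SECONDS = 30

def send_synthetic_traffic() -> None:
    deadline = time.monotonic() + DURATION_SECONDS
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=2) as response:
                status = response.status
        except Exception as exc:  # a sketch, not production error handling
            status = f"error: {exc}"
        print(f"synthetic request -> {status}")
        time.sleep(1 / REQUESTS_PER_SECOND)

if __name__ == "__main__":
    send_synthetic_traffic()
```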

Rollback

A rollback is semantically different from a demotion and occurs after the deployment lifecycle has completed. In plain words: it’s a new, full deployment of a previously successful deployment.

To achieve efficient rollbacks, immutable, versioned artifacts (binaries and configuration) should be used to ensure that an identical instance of the previous state can be reproduced.

If the deployment mechanism has a queue of deployments, re-deployment of previously known working versions should be given priority over other deployments.
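
As a sketch of that prioritization, the example below uses a simple priority queue in which rollbacks (re-deployments of a previously known working version) jump ahead of ordinary deployments. The queue structure and version strings are illustrative only.

```python
import heapq
import itertools

ROLLBACK, NORMAL = 0, 1          # lower number = higher priority
_counter = itertools.count()     # preserves FIFO order within a priority class

deployment_queue: list = []

def enqueue(version: str, is_rollback: bool = False) -> None:
    priority = ROLLBACK if is_rollback else NORMAL
    heapq.heappush(deployment_queue, (priority, next(_counter), version))

def next_deployment() -> str:
    _, _, version = heapq.heappop(deployment_queue)
    return version

enqueue("2.1.0")
enqueue("2.2.0")
enqueue("2.0.3", is_rollback=True)  # previously known working version

print(next_deployment())  # 2.0.3 -> the rollback is served first
print(next_deployment())  # 2.1.0
```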

Application-specific considerations

Backward and forward compatibility

Any distributed system with high-availability properties requires both forward and backward compatibility of data objects. Blue/green deployments make this more explicit, as two versions of the service will be running in parallel.

It’s therefore required that any shared state is both backward and forward compatible. That is, the previous version of a service needs to be able to handle state created by a newer version of the same service, and the other way around.

An example of forward compatibility is to ignore unknown properties during JSON deserialization. That way a new field can be added without breaking deserialization.

An example of backward compatibility is renaming a JSON field but still trying to access it using the old name if the new name does not exist.
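
The two JSON examples above could look like the following sketch, where deserialization ignores unknown properties (forward compatibility) and falls back to the old field name when the new one is missing (backward compatibility). The field names and the rename from total to amount are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    amount: int

# Hypothetical rename: "total" (old name) became "amount" (new name).
KNOWN_FIELDS = {"order_id", "amount", "total"}

def deserialize_order(payload: str) -> Order:
    data = json.loads(payload)
    # Forward compatibility: silently ignore properties this version does not
    # know about, so payloads from a newer producer do not break parsing.
    data = {k: v for k, v in data.items() if k in KNOWN_FIELDS}
    # Backward compatibility: accept the old field name if the new one is
    # missing, so payloads written by the previous version still work.
    amount = data.get("amount", data.get("total"))
    return Order(order_id=data["order_id"], amount=amount)

# A newer producer added "currency"; an older producer still writes "total".
print(deserialize_order('{"order_id": "o-1", "amount": 100, "currency": "EUR"}'))
print(deserialize_order('{"order_id": "o-2", "total": 80}'))
```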

Forward and backward compatibility should be documented in the interface of the service.

Database schema ownership

To make backward-compatible schema changes manageable, it is highly recommended to have each database owned and accessed by only one service at a time.

Breaking changes to data schemas

Sometimes it might be necessary to break backward compatibility in data schemas, for example by renaming a column in a relational database. Introducing a new column should not be a breaking change; rather, the code should be written in a forward-compatible way and ignore unknown properties. For this reason, a rename can be handled by adding a new column and copying or moving data from the old column.
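
A sketch of the add-and-copy approach to a column rename is shown below, using an in-memory SQLite database purely for illustration. The table and column names are made up, and in practice the expand, backfill, and final cleanup steps would be spread over separate deployments so that both running versions stay compatible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, cust_ref TEXT)")
conn.execute("INSERT INTO orders (id, cust_ref) VALUES (1, 'c-42')")

# Step 1 (expand): add the new column instead of renaming in place.
conn.execute("ALTER TABLE orders ADD COLUMN customer_reference TEXT")

# Step 2 (backfill): copy data from the old column to the new one.
conn.execute("UPDATE orders SET customer_reference = cust_ref "
             "WHERE customer_reference IS NULL")

# Step 3 (later, once no running version reads cust_ref): drop the old column.
# Kept as a comment here because it is the irreversible, breaking step.
# conn.execute("ALTER TABLE orders DROP COLUMN cust_ref")

print(conn.execute("SELECT id, customer_reference FROM orders").fetchall())
```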

Major breaking changes should be rare and are therefore better treated as an exception rather than as a normal deployment. It’s important to have a strategy for handling a data schema change that fails, or a service that starts failing after a data schema change. In these cases, it’s often easier to roll forward instead of backward. Rolling back or undoing schema changes and restoring from backups should be considered a last resort.

Example timeline of a successful container-based deployment

Pre-condition: Release 1.0 of a service is running in a green environment.
Event: Deployment of a new release, 2.0, is initiated.

Timeline:

  1. A blue system is provisioned with release 2.0 of the service. It runs with the same number of instances as the green system running release 1.0. The blue system takes no traffic at this point.
  2. Warm-up of instances in the blue system running release 2.0. The blue system will enter a ready state indicating that it’s ready to take live traffic.
  3. Pre-validation of the blue system running release 2.0 using synthetic traffic or other tests.
  4. A small portion (1–5%) of traffic is shifted from the green system to the blue system to act as a canary.
  5. Business and service level metrics from the blue system are evaluated to verify that release 2.0 operates within its targeted performance.
  6. The canary is promoted to a full deployment. Gradual rollout continues until 100% of the traffic has been shifted to the blue system. Business and service level metrics are continuously evaluated during the rollout.
  7. Validation that release 2.0 is still operating within its targeted performance continues for a time period until it’s deemed that the deployment is successful.
  8. Termination of the green system running release 1.0.
  9. The blue system running release 2.0 now takes over as the new green system.