Automating safe hands-off deployments

Automated deployments in the pipeline typically don’t have a developer who actively watches each deployment to prod, checks the metrics, and manually rolls back if they see issues. These deployments are completely hands-off. The deployment system actively monitors an alarm to determine if it needs to automatically roll back a deployment. A rollback will switch the environment back to the container image, AWS Lambda function deployment package, or internal deployment package that was previously deployed. Our internal deployment packages are similar to container images, because the packages are immutable and use a checksum to verify their integrity.

Each microservice in each Region typically has a high-severity alarm that triggers on thresholds for the metrics that impact the service’s customers (like fault rates and high latency) and on system health metrics (like CPU utilization), as illustrated in the following example. This high-severity alarm is used to page the oncall engineer and to automatically roll back the service if a deployment is in progress. Often, the rollback is already in progress by the time the oncall engineer has been paged and starts engaging.

Example high-severity microservice alarm

ALARM("FrontEndApiService_High_Fault_Rate") OR
ALARM("FrontEndApiService_High_P50_Latency") OR
ALARM("FrontEndApiService_High_P90_Latency") OR
ALARM("FrontEndApiService_High_P99_Latency") OR
ALARM("FrontEndApiService_High_Cpu_Usage") OR
ALARM("FrontEndApiService_High_Memory_Usage") OR
ALARM("FrontEndApiService_High_Disk_Usage") OR
ALARM("FrontEndApiService_High_Errors_In_Logs") OR
ALARM("FrontEndApiService_High_Failing_Health_Checks")

Changes introduced by a deployment can have an impact on upstream and downstream microservices, so the deployment system needs to monitor the high-severity alarm for the microservice under deployment and monitor the high-severity alarms for the team’s other microservices to determine when to roll back. Deployed changes can also affect the metrics of continuous canary testing, so the deployment system additionally needs to monitor for failing canary tests. To automatically roll back on all of these possible areas of impact, teams create high-severity aggregate alarms for the deployment system to monitor. High-severity aggregate alarms roll up the state of all of the team’s individual microservice high-severity alarms and the state of the canary alarms into a single aggregate state, as in the following sample. If any of the high-severity alarms for the team’s microservices go into the alarm state, all of the team’s ongoing deployments across all of their microservices in that Region automatically roll back.

Example high-severity aggregate rollback alarm

ALARM("FrontEndApiService_High_Severity") OR
ALARM("BackendApiService_High_Severity") OR
ALARM("BackendWorkflows_High_Severity") OR
ALARM("Canaries_High_Severity")

A one-box stage serves a small percentage of overall traffic, so issues introduced by a one-box deployment might not trigger the service’s high-severity rollback alarm. To catch and roll back changes that cause issues in the one-box stage before they reach the rest of the prod stages, one-box stages additionally roll back on metrics that are scoped to only the one box. For example, they roll back on the fault rate of the requests that were served specifically by the one box, which makes up a small percentage of overall number of requests.

Example one-box rollback alarm

ALARM("High_Severity_Aggregate_Rollback_Alarm") OR
ALARM("FrontEndApiService_OneBox_High_Fault_Rate") OR
ALARM("FrontEndApiService_OneBox_High_P50_Latency") OR
ALARM("FrontEndApiService_OneBox_High_P90_Latency") OR
ALARM("FrontEndApiService_OneBox_High_P99_Latency") OR
ALARM("FrontEndApiService_OneBox_High_Cpu_Usage") OR
ALARM("FrontEndApiService_OneBox_High_Memory_Usage") OR
ALARM("FrontEndApiService_OneBox_High_Disk_Usage") OR
ALARM("FrontEndApiService_OneBox_High_Errors_In_Logs") OR
ALARM("FrontEndApiService_OneBox_Failing_Health_Checks")

In addition to rolling back on alarms defined by the service team, our deployment system can also detect and automatically roll back on anomalies in common metrics emitted by our internal web service framework. Most of our microservices emit metrics such as request count, request latency, and fault count in a standard format. Using these standard metrics, the deployment system can roll back automatically if there are anomalies in the metrics during a deployment. Examples of this are if the request count suddenly drops to zero, or if the latency or number of faults becomes much higher than normal.