Managed Kubernetes Service

Incident Summary

Beginning October 8th around 06:00 UTC, a subset of Managed Kubernetes clusters had nodes go into NotReady status, disrupting the services and applications running on those nodes. Due to an unexpected auto-upgrade process on the Debian image used by DigitalOcean, customers experienced downtime on worker nodes starting between 06:00 and 07:00 UTC on October 8th, 9th, and 10th. The issue affected users of Managed Kubernetes on cluster versions 1.27.6-do.0 and 1.28.2-do.0 across multiple regions.

Incident Details

Root Cause

The root cause of this incident was a systemd timer in the DigitalOcean Debian image that automatically kicked off an apt upgrade at 06:00 UTC (with a 60-minute randomized delay). The upgrade brought in unexpected and unsupported versions of various components, including systemd and the kernel, on DOKS cluster versions 1.27.6-do.0 and 1.28.2-do.0. The kernel upgrade in particular could cause the upgrade process to take longer than 15 minutes, at which point systemd would abort it ungracefully. This cancellation could leave worker nodes in a broken state.
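
For illustration, this kind of timer can be inspected with standard systemd tooling on a Debian-based node. The unit names below (apt-daily-upgrade.timer and its companion service) are the stock Debian names and are used here as an assumption about the image, not a confirmed list of what the DOKS image ships.

  # List the apt-related timers with their last and next trigger times
  systemctl list-timers 'apt-daily*'

  # Show the timer definition, including its schedule and randomized delay
  systemctl cat apt-daily-upgrade.timer

  # Show what the backing service actually runs when the timer fires
  systemctl cat apt-daily-upgrade.service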

Impact

The canceled upgrade process disrupted both public and private network connectivity for the affected worker nodes in the majority of cases. Customers saw their worker nodes go into NotReady status and become unavailable at the time of the upgrade, remaining unavailable until the nodes were replaced or rebooted. In some cases, a single replacement or reboot was not sufficient for recovery and multiple attempts were required. This impact recurred twice more after the first event, around 06:00 UTC on October 9th and 10th.
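
For reference, this kind of impact surfaces directly in the standard node listing; the node name below is a placeholder.

  # Affected workers show up with STATUS "NotReady"
  kubectl get nodes

  # Inspect a specific node's conditions and recent events for more detail
  kubectl describe node <node-name>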

Response

The initial triaging done by the Containers Engineering team quickly scoped the problem down to the worker nodes. Unfortunately, the complete network failures seen on most affected machines impaired efforts to log into the worker nodes for in-depth troubleshooting. For security reasons, Managed Kubernetes worker nodes offer only limited access to Containers Engineering, so the team had to set up machine access via SSH and the console during windows when the impact was not occurring. This prolonged the time until in-depth investigations could commence.

Once nodes became accessible, it became apparent that processes running on a given machine failed to connect to remote destinations over both the VPC and the public network. Early investigations focused on understanding which section of the data path might have been disrupted and why. This involved running several traces across the worker nodes and the corresponding hypervisors. The results surfaced broken connectivity at the guest OS level, primarily manifesting as response packets being delivered to the machine successfully but not relayed further to the originating processes (including kubelet and workload pods). This ruled out the underlying network data path as the source of the problem.
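
As a rough sketch of the kind of guest-level check described above (interface name and addresses are placeholders), capturing traffic on the node's interface while watching socket state can show replies arriving at the machine without the owning process ever seeing them:

  # Confirm that return traffic from a remote destination reaches the node
  tcpdump -ni eth0 host <remote-ip>

  # In parallel, inspect the corresponding sockets; connections stuck in
  # SYN-SENT despite visible replies point at a problem inside the guest OS
  ss -tnp | grep <remote-ip>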

At this point, customers were actively advised to try to replace or reboot machines in order to mitigate the impact.
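
A typical mitigation sequence for customers looked roughly like the following; the node name is a placeholder, and the actual replacement or reboot is performed through the DigitalOcean control panel or API rather than kubectl:

  # Identify the NotReady node and move remaining workloads off of it
  kubectl get nodes
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

  # Then replace or reboot the node via the DigitalOcean control panel or
  # API; in some cases this step had to be repeated before the node recovered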

The investigation then pivoted to the periodic nature of the incident, which had become clearer with each subsequent occurrence, always starting around the same time of day (06:00 UTC). The fact that impact was limited to two specific Managed Kubernetes versions pointed towards a regularly executing process on the worker nodes themselves.

After ruling out cron jobs as the culprit, the team identified a systemd timer on the Debian-based worker nodes that upgrades all packages on a daily schedule of 06:00 UTC (plus a random delay of up to 60 minutes). The team then ran the timer manually on a test cluster and observed a lengthy upgrade process being kicked off, involving numerous system-critical OS packages, including systemd and the kernel. Although the effects of the expiring 15-minute timeout and the resulting harsh termination of the upgrade process could not be fully validated at this point, sufficient data points had been collected to consider this timer job the root cause of the incident.
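
A minimal sketch of such a manual test run, assuming the stock Debian unit names: triggering the service behind the timer and following its journal makes the scope and duration of the upgrade visible.

  # Kick off the same upgrade the timer would start at 06:00 UTC
  systemctl start apt-daily-upgrade.service

  # Follow the upgrade as it runs and note how long it takes
  journalctl -fu apt-daily-upgrade.service

  # Afterwards, check whether the service finished cleanly or was terminated
  systemctl status apt-daily-upgrade.service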

Timeline of Events in UTC

October 8th: 

  • 07:43 - Sporadic customer reports arrive, indicating worker node failures that can be addressed by issuing node replacements

October 9th: 

  • 06:31 - Broader impact occurs with worker nodes affected across multiple regions
  • 07:10 - Internal incident response kicks off, investigation begins
  • 08:13 - Access is gained to logs of a previously affected worker node; machine-wide request failures are confirmed
  • 09:43 - Additional Networking teams provide support to pinpoint the guest OS as the likely root cause; Engineering access to affected nodes is still inhibited
  • 15:28 - Tooling work enabling better analysis on recurrence of the issue is completed

October 10th: 

  • 06:05 - Impact recurs for the first customers
  • 06:18 - Engineering is able to access and live-debug on an affected worker node
  • 13:08 - The auto-upgrade timer is discovered
  • 14:28 - Behavior of the auto-upgrade timer is confirmed and tied to the incident
  • 14:36 - Work starts to build a DaemonSet-based mitigation fix disabling the auto-upgrade timer on each worker node

October 11th:

  • 03:08 - The fix is released to all affected clusters
  • 06:00-07:00 - No new disruptions or recurrences of the issue are observed

Remediation Actions

In order to avoid another occurrence of the incident on October 11th or later, a quick fix was put together and released. This fix consisted of a DaemonSet workload that disabled the relevant systemd timer on each worker node. This mitigated the problem on all current and future worker nodes.
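
A minimal sketch of what such a DaemonSet-based mitigation can look like, applied here via a shell heredoc. The workload name, namespace, container image, and the apt-daily-upgrade.timer unit name are illustrative assumptions, not the exact manifest that was shipped.

  kubectl apply -f - <<'EOF'
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: disable-apt-auto-upgrade
    namespace: kube-system
  spec:
    selector:
      matchLabels:
        app: disable-apt-auto-upgrade
    template:
      metadata:
        labels:
          app: disable-apt-auto-upgrade
      spec:
        hostPID: true
        tolerations:
        - operator: Exists            # run on every node, including tainted ones
        containers:
        - name: disable-timer
          image: debian:bookworm-slim
          securityContext:
            privileged: true
          command:
          - /bin/sh
          - -c
          - |
            # Enter the host's namespaces via PID 1 (the host's systemd),
            # stop and disable the auto-upgrade timer, then idle so the
            # pod keeps reporting as Running.
            nsenter -t 1 -m -u -i -n -- \
              systemctl disable --now apt-daily-upgrade.timer
            sleep infinity
  EOF

Running with hostPID and entering the host's namespaces lets the pod drive the node's own systemd, which is why the container needs privileged access.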

The next step in remediation involves applying the same fix during our node provisioning process so that worker nodes no longer have the timer enabled at all. Once that is complete, the DaemonSet can and will be removed from all clusters.
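
At provisioning time, the equivalent change can be a single bootstrap step (the unit name is again an assumption):

  # Run once during node provisioning so the timer never fires on new nodes
  systemctl mask --now apt-daily-upgrade.timer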

Additional validation is also being added to our internal conformance test suite, which runs on all new cluster versions, to ensure that the auto-upgrade process and any other undesirable timers stay disabled going forward.
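
A hedged sketch of what such a conformance check can boil down to on a node under test; the unit name and assertion style are assumptions:

  # Fail if the auto-upgrade timer is anything other than disabled or masked
  state="$(systemctl is-enabled apt-daily-upgrade.timer 2>/dev/null || true)"
  case "$state" in
    disabled|masked) echo "OK: auto-upgrade timer is $state" ;;
    *)               echo "FAIL: auto-upgrade timer state is '$state'"; exit 1 ;;
  esac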

Finally, work is planned to support logging into worker node machines even when conventional access paths are impaired.