Incident Report
Summary
On 8 August 2024, Gmail and Google Drive experienced a global service degradation lasting 4 hours and 10 minutes, from 12:06 to 16:16 US/Pacific. During the incident, affected users had problems with email attachments and delivery in Gmail, and with upload operations in Google Drive.
To our Gmail and Google Drive customers, we apologize for the impact this service disruption had on your organization. We have completed an internal investigation and are taking immediate steps to improve the quality and reliability of our services.
Root Cause
As part of a standard data operation, we restored a large volume of data into an internal Bigtable database. This operation overloaded the Bigtable servers.
During the restore operation, Bigtable identified certain tablets as ‘unloadable’ and added the corresponding entries to a Chubby [1] file, as expected. As the file grew, it exceeded the memory limit allocated for Chubby, and Bigtable could no longer write to the file.
The cumulative impact of these events led to a subset of Bigtable instances entering a degraded state, impacting read/write operations. These instances store internal data for Gmail and Google Drive, which consequently affected the performance of those services.
[1] - https://research.google/pubs/the-chubby-lock-service-for-loosely-coupled-distributed-systems
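The failure mode described in the root cause can be illustrated with a toy model. The sketch below is not Chubby or Bigtable code: QuotaLimitedFile, QuotaExceededError, and record_unloadable are hypothetical names, and the byte-counting quota is an assumption made purely for illustration. It shows only the general pattern of an append-only file that starts rejecting writes once a fixed limit is reached.

```python
class QuotaExceededError(Exception):
    """Raised when an append would push the file past its quota."""


class QuotaLimitedFile:
    """Toy stand-in for a size-limited file (hypothetical, not the Chubby API)."""

    def __init__(self, quota_bytes: int) -> None:
        self.quota_bytes = quota_bytes
        self.entries = []
        self.size_bytes = 0

    def append(self, entry: str) -> None:
        # Count the entry plus one byte for a newline separator.
        entry_size = len(entry.encode("utf-8")) + 1
        if self.size_bytes + entry_size > self.quota_bytes:
            raise QuotaExceededError(
                f"need {entry_size} bytes, "
                f"{self.quota_bytes - self.size_bytes} available"
            )
        self.entries.append(entry)
        self.size_bytes += entry_size


def record_unloadable(file: QuotaLimitedFile, tablet_ids: list) -> tuple:
    """Append tablet ids in order and stop at the first rejected write.

    Returns (recorded, rejected) counts.
    """
    recorded = 0
    for tablet_id in tablet_ids:
        try:
            file.append(tablet_id)
        except QuotaExceededError:
            return recorded, len(tablet_ids) - recorded
        recorded += 1
    return recorded, 0
```

For example, with a 100-byte quota and 11-character ids such as "tablet-0000" (12 bytes per entry in this model), a batch of 20 ids results in 8 recorded entries and 12 rejected ones. A small restore fits comfortably, while a large one silently approaches the limit until writes begin to fail.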
Remediation and Prevention
Google engineers were alerted to the issue on Thursday, 8 August 2024 at 11:58 US/Pacific via our monitoring system. Once the nature and scope of the issue were understood, our engineers devised and executed a multi-pronged approach to address the root cause and mitigate impact. The Chubby quota was increased to address the memory issue, and traffic from the affected instances was re-routed to avoid further impact. To ensure that no instances remained degraded, the Bigtable master servers for the affected instances were restarted. Impact was fully mitigated by 16:16 US/Pacific.
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.
We are committed to preventing a recurrence of this issue and are completing the following actions:
- We have completed a deep analysis of our Bigtable instances and ensured that the Chubby space quota limits and configurations are optimal across all instances.
- We are working on enhancing our monitoring for Chubby quota usage in Bigtable to enable early detection and prevention of potential issues.
- We will establish clear guidelines on the recommended volume of data that can be safely restored at any given time, minimizing the risk of service disruptions.
- We have enhanced the logic for diverting traffic from Bigtable clusters to ensure smoother transitions and minimal impact on users.
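The quota-monitoring action above is described only at a high level. As a minimal sketch of what early detection could look like, the example below classifies per-instance quota usage against warning and critical thresholds. Everything here is an assumption for illustration: QuotaSample, classify, and instances_needing_attention are hypothetical names, the 70% and 90% thresholds are arbitrary, and none of this reflects Google's actual monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaSample:
    """One usage reading for an instance (hypothetical structure)."""
    instance: str
    used_bytes: int
    quota_bytes: int


def classify(sample: QuotaSample,
             warn_ratio: float = 0.7,
             critical_ratio: float = 0.9) -> str:
    """Return 'ok', 'warning', or 'critical' based on quota usage."""
    if sample.quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    ratio = sample.used_bytes / sample.quota_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def instances_needing_attention(samples: list,
                                warn_ratio: float = 0.7,
                                critical_ratio: float = 0.9) -> dict:
    """Map each instance that is not 'ok' to its alert level."""
    alerts = {}
    for sample in samples:
        level = classify(sample, warn_ratio, critical_ratio)
        if level != "ok":
            alerts[sample.instance] = level
    return alerts
```

The point of a separate warning level is to leave time to raise a quota or pause a restore before writes start failing, rather than alerting only once the limit has already been hit.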
Detailed Description of Impact
On Thursday, 8 August 2024 from 12:06 to 16:16 US/Pacific, Gmail and Google Drive experienced service degradation for a duration of 4 hours and 10 minutes.
Gmail
Affected users were unable to send emails or save drafts containing attachments. Emails sent without attachments were not affected.
Google Drive
Affected users may have observed degraded performance while performing upload operations.