SUMMARY:
Firebase Hosting was partially to completely unavailable for serving traffic between 1:00am and 7:30am Pacific time on February 1st. The outage also affected Firebase Auth sign-in methods other than email/password for the majority of users.
DETAILED DESCRIPTION OF IMPACT:
Just before 1:00am on February 1, the CPU load of Firebase Hosting's origin servers increased to 100% across the board. At 1:19am performance had begun to degrade and the on-call engineer received an alert. By 1:55am three Firebase Hosting engineers were investigating the disruption and looking for potential mitigations.
The outage affected between 5% and 20% of all Firebase Hosting traffic, as requests served from the CDN cache were unaffected. However, sites with little traffic, sites that had deployed recently, and sites with large numbers of URLs likely experienced partial to total outages during the incident timeframe.
After several mitigation attempts, service was fully restored at approximately 7:30am. The timeline of affected traffic is as follows:
01:00-01:30 - Load increases and origin servers begin to become unhealthy
01:30-04:00 - All or nearly all origin traffic is disrupted by CPU load
04:00-04:30 - 50% of origin traffic is restored; provisioning of new VMs begins
06:30-07:00 - 60% of origin traffic is restored as new VMs are brought into rotation
07:00-07:30 - 80% of origin traffic is restored
07:30 - 100% of traffic is restored; incident ends
ROOT CAUSE:
Origin traffic and CPU load had been increasing since January 21. This went unnoticed by the Firebase Hosting team due to insufficient internal monitoring of peak load outside of business hours. On February 1st, core services began to fail under the increased load. Unhealthy servers were taken out of rotation by the load balancer, which increased the load on the remaining servers and caused cascading failures that could not be resolved without reducing load or adding capacity.
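To illustrate the cascading pattern described above, here is a minimal simulation sketch. The server counts, per-server capacity, and request rates are hypothetical and do not reflect the actual serving fleet; the point is only that once demand exceeds aggregate capacity, each removal of an overloaded server makes the remaining servers worse off.

# Illustrative sketch of the cascading failure mode described above.
# All numbers (server count, capacity, demand) are hypothetical.
def simulate_cascade(servers: int, capacity_per_server: float, demand: float) -> None:
    """Repeatedly remove servers whose share of demand exceeds their capacity,
    redistributing the remaining demand, until the pool stabilizes or empties."""
    step = 0
    while servers > 0:
        load_per_server = demand / servers
        print(f"step {step}: {servers} servers, {load_per_server:.0f} rps each "
              f"(capacity {capacity_per_server:.0f} rps)")
        if load_per_server <= capacity_per_server:
            print("pool is stable")
            return
        # The load balancer marks the overloaded server unhealthy and removes it,
        # which pushes the same demand onto the remaining servers.
        servers -= 1
        step += 1
    print("no healthy servers remain: total origin outage")

# Demand only slightly above aggregate capacity is enough to collapse the pool.
simulate_cascade(servers=10, capacity_per_server=1000.0, demand=10500.0)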
REMEDIATION AND PREVENTION:
This was the first load-based failure of the Hosting origin servers, and the symptoms were not clearly diagnosed during the early phases of the investigation.
We are taking a number of steps to prevent incidents like this from occurring in the future:
- Provision additional capacity to ensure that it comfortably exceeds peak load.
- Continue migrating our deployment and serving infrastructure to make better use of internal infrastructure available at Google with an eye toward better management and monitoring of peak load conditions.
- Improve monitoring and growth alerting for peak origin traffic, paying specific attention to peaks that occur outside business hours (a rough sketch of this kind of check appears after this list).
- Improve processes and tooling for provisioning new VMs, allowing faster response times in the future.
- Investigate performance improvements to keep CPU pressure off of our origin servers.
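As a rough illustration of the growth-alerting item above, the following sketch compares the current week's peak origin traffic (including off-hours peaks) against the prior week and against provisioned capacity. The thresholds, metric source, and function names are hypothetical and are not our production monitoring configuration.

# Hypothetical sketch of off-hours peak-load growth alerting.
# Thresholds and data source are illustrative, not production values.
def check_peak_growth(hourly_qps: list[float], capacity_qps: float,
                      growth_threshold: float = 1.25,
                      headroom_threshold: float = 0.8) -> list[str]:
    """Flag week-over-week growth in peak origin traffic and shrinking capacity headroom.
    hourly_qps is a series of hourly peak QPS samples covering all hours, not just business hours."""
    alerts: list[str] = []
    if len(hourly_qps) < 336:  # need two full weeks of hourly samples
        return alerts
    this_week, last_week = hourly_qps[-168:], hourly_qps[-336:-168]
    peak_now, peak_before = max(this_week), max(last_week)
    if peak_before > 0 and peak_now / peak_before > growth_threshold:
        alerts.append(f"peak origin load is {peak_now / peak_before:.2f}x last week's peak")
    if peak_now / capacity_qps > headroom_threshold:
        alerts.append(f"peak origin load is {peak_now / capacity_qps:.0%} of provisioned capacity")
    return alerts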