Networking issues


Summary

On May 23, 2022, at 10:30 UTC, Sentry experienced networking issues across part of our infrastructure, which resulted in delays in ingestion, notifications, and alerts. We alerted our service provider, Google Cloud Platform, who restored connectivity. Our engineering teams encountered some issues during recovery before we were able to resume ingesting new events at a normal rate and begin processing the backlog of data, which was fully completed by May 24, 2022, 02:05 UTC.

We apologize for any inconvenience this may have caused.

Timeline (in UTC)

May 23

10:25 — Responded to alerts that Snuba consumers were accumulating large backlogs of data.

10:30 — Identified that the issues originated within our metrics ClickHouse cluster.

10:37 — Experienced timeouts across our metrics services; some hosts became unreachable via SSH.

10:40 — Restarting hosts seemed to resolve the issue; however, one of our Kafka clusters became unreachable.

11:14 — Reached out to Google Cloud support.

11:15 — Continued to restart other unreachable hosts, with varying success.

13:03 — Meeting with Google Cloud confirmed infrastructure issues; we proceeded with their recommendation to migrate our hosts.

14:27 — Google Cloud completed its investigation and began mitigation.

15:39 — Restarted web hosts on different nodes to restore Sentry.io availability.

15:55 — Diverted ClickHouse traffic to healthy hosts.

16:10 — Backend systems started ingesting new events and processing the backlog.

16:35 — Google Cloud incident given the all clear following a rollback of changes.

16:37 — Ingestion had stopped; started to re-attach file storage with additional space.

16:46 — Ingestion restarted at a normal rate.

17:12 — Increased Snuba consumer memory to cope with large batches.

18:27 — ClickHouse replication completed and queries were evenly distributed again.

22:59 — Backlog-processing optimization job scheduled for 24:00.

May 24

02:05 — Backlog processing completed

Duration of Instability: 9 hours, 5 minutes

Root Cause

A networking change was rolled out to Google Cloud Networking which caused some instances in Google Compute Engine in us-central1-b to become unreachable. This negatively impacted parts of our stack that could not successfully handle partial failures.
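As an illustration of this failure mode (not Sentry's actual code), a client that waits indefinitely on an unreachable host can stall an entire pipeline, while bounding the wait and moving on to a healthy replica keeps it flowing. A minimal Python sketch, with hypothetical hosts and ports:

```python
import socket

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection succeeds within `timeout` seconds.

    Without an explicit timeout, a single unreachable host can block the
    caller for minutes -- the partial-failure mode described above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def first_healthy(hosts, port=9000, timeout=2.0):
    """Try each replica in turn and return the first reachable one,
    or None if every replica fails and the caller must degrade gracefully."""
    for host in hosts:
        if reachable(host, port, timeout):
            return host
    return None
```

The key design point is that every network call carries a deadline, so one bad zone degrades capacity rather than halting ingestion outright.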

Remediation Plan

We have short-term plans to improve our data backlog processing times by:

  • Adding connection pooling to Redis
  • Scaling up our Celery processing pipeline
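To show the idea behind the first item (this is an illustrative sketch, not Sentry's implementation — in practice a client library such as redis-py provides its own pool), a connection pool reuses a bounded set of connections instead of opening a new one per request, which cuts connection-setup overhead during backlog processing:

```python
import queue

class ConnectionPool:
    """Minimal connection pool: hand out connections from a bounded,
    reusable set rather than creating one per request."""

    def __init__(self, factory, max_connections=10):
        self._factory = factory  # callable that creates a new connection
        self._pool = queue.LifoQueue(maxsize=max_connections)
        for _ in range(max_connections):
            self._pool.put(None)  # empty slots; connections created lazily

    def get(self):
        conn = self._pool.get()  # blocks when the pool is exhausted
        return conn if conn is not None else self._factory()

    def release(self, conn):
        self._pool.put(conn)  # return the connection for reuse

# Usage with a stand-in "connection" (a plain object here)
pool = ConnectionPool(factory=object, max_connections=2)
c1 = pool.get()
c2 = pool.get()
pool.release(c1)
c3 = pool.get()  # reuses c1 rather than creating a new connection
```

Bounding the pool also protects the Redis server itself: a surge of Celery workers draining the backlog cannot open an unbounded number of connections.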

Our longer-term goal is to enable a standby ClickHouse cluster that resides in a separate zone/region. This aligns with our overall strategy, as outlined after the May 6th incident, to provide better failover capabilities.