What Happened?
On 14 September 2023 at 8:05 AM UTC (all timestamps in this post are UTC), a critical bug in the output of our geo data pipeline resulted in a number of geo locations being set as their own parents. This caused disruption to our service, and for that we’re sorry. It also gave us the opportunity to evaluate what went wrong, what we learned, and how we can prevent a situation like this from happening again.
Let’s take a deep dive and explain things further.
What is Geo Data?
Geo data is a key dataset at Skyscanner, used to provide systems, industry partners and travellers with a complete and accurate representation of the world. In simpler terms, any time you see an Airport, City, Region or Country used in Skyscanner, it originates from this dataset. The most visible example across our offering is flight search, where you specify an airport, city or country for your journey.
We use the geo data to populate origins/destinations and look for flights
What was the issue?
Skyscanner has been on a journey to upgrade our geo dataset. At this time there are two versions, two geo models, running in parallel. Flights generally only need to know about airports, cities and countries, whereas other parts of the business need to model more complex relationships such as districts, countries, islands, etc. Those relationships form a complex graph in which locations are related to each other as parents and children.
For this reason we kept our original “heritage” dataset and merged it with our canonical dataset, our source of truth. We then regenerate (or reconstruct) the heritage dataset from the canonical data every day at 8:00 AM UTC. This generation step is referred to as materialisation.
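To make this concrete, here is a highly simplified sketch of the shape of that daily step. Every name below is illustrative only; the real materialisation handles far richer entity types and relationships.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of materialisation: regenerate the "heritage" geo dataset
// from the canonical dataset each morning.
class Materialisation {
    record CanonicalEntity(String id, String name, String type, String parentId) {}
    record HeritageLocation(String id, String name, String parentId) {}

    static List<HeritageLocation> materialise(List<CanonicalEntity> canonical) {
        return canonical.stream()
                // The flight stack mostly cares about airports, cities and countries.
                .filter(e -> List.of("AIRPORT", "CITY", "COUNTRY").contains(e.type()))
                .map(e -> new HeritageLocation(e.id(), e.name(), e.parentId()))
                .collect(Collectors.toList());
    }
}
```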
On 14 September, a bug in the materialisation process updated some locations to be their own parent. For example, Scotland became the parent of Scotland. We had created a loop in the geo hierarchy.
You can immediately see how this is a problem when our libraries and systems expect a well-formed structure in the parent/child relationships.
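To illustrate why, here is a minimal, hypothetical example of the kind of ancestor walk such libraries perform. Once a location points to itself as its parent, the walk never terminates.

```java
import java.util.Map;

// Hypothetical, minimal model of a geo location and an ancestor lookup.
class AncestorWalk {
    record Location(String id, String name, String parentId) {}

    // Naive lookup: follow parent pointers until we reach a root (parentId == null).
    static Location topLevelAncestor(Map<String, Location> byId, Location start) {
        Location current = start;
        while (current.parentId() != null) {
            // If a location's parentId equals its own id, this loop never exits.
            current = byId.get(current.parentId());
        }
        return current;
    }
}
```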
How did this impact flight search?
The flight stack (our critical systems involved in flight search) loads this geo data and uses it for every search coming in from travellers. Because we serve lots of travellers, we scale the stack across multiple Kubernetes clusters in multiple regions, totalling several hundred pods around the world.
Any time a search involving an affected entity came in, a thread got stuck in an infinite loop searching for the parent entity. This happened in every pod with the corrupted dataset.
Every request with an affected entity stole one thread from the service
Every stuck thread made the CPU work harder, and because stuck threads never released their memory either, they became unavailable to serve new requests. This resulted in CPU throttling, which in turn caused our autoscaler to provision more pods to cope with incoming requests.
CPU Throttling
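With hindsight, a cycle guard in that traversal would have turned a silent hang into a fast, visible failure. The sketch below (again with hypothetical types, not our actual library code) shows one way to fail fast:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Defensive variant of the ancestor walk: detect cycles and fail fast
// instead of pinning a request thread forever.
class SafeAncestorWalk {
    record Location(String id, String name, String parentId) {}

    static Location topLevelAncestor(Map<String, Location> byId, Location start) {
        Set<String> seen = new HashSet<>();
        Location current = start;
        while (current.parentId() != null) {
            if (!seen.add(current.id())) {
                // We have revisited a location: the hierarchy contains a loop.
                throw new IllegalStateException("Cycle in geo hierarchy at " + current.id());
            }
            current = byId.get(current.parentId());
        }
        return current;
    }
}
```

An exception like this surfaces in error telemetry immediately, rather than manifesting indirectly as CPU pressure.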
To make matters worse, an upstream service was using an affected location as part of a warm-up request, giving our systems no chance to even start up and serve requests successfully.
All of this created the perfect storm, and the strain was felt across all our systems, which were either fighting for resources or depending on one of the affected services.
What was the impact?
Our flight search was degraded for travellers. The error profile looked like this. In total, we served over one million errors to travellers searching for flights.
Error profile during the incident
How did we fix it?
Once the data issue was identified, we were able to produce the correct data by re-running the pipeline. The key thing here is how long it took us to diagnose the root cause. Let’s talk about that.
It took us a long time to diagnose the root cause …
- Bias
- We have a bias towards suspecting performance issues, which led us into a diverse set of investigations away from the actual cause.
- A particular service had been the source of previous incidents, so all our efforts were concentrated on that service, adding more compute resources to alleviate the pressure.
- Telemetry
- We relied heavily on existing telemetry (metrics, traces, logs) which is the right thing to do but …
- … it can only get you so far when things don’t crash. We ended up attaching a JVM profiler to obtain thread dumps and finally identify the infinite loop!
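For what it’s worth, you don’t always need a full profiler attached to get that first clue: the JVM can produce a thread dump programmatically via the standard Thread.getAllStackTraces() API, and many threads stuck in the same frame is exactly the signature we were looking for. A minimal sketch:

```java
import java.util.Map;

// Print a thread dump from inside the JVM: thread name, state and stack frames.
class ThreadDumper {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        stacks.forEach((thread, frames) -> {
            System.out.println(thread.getName() + " (" + thread.getState() + ")");
            for (StackTraceElement frame : frames) {
                System.out.println("    at " + frame);
            }
        });
    }
}
```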
What did we learn?
- Treat data changes like code changes: We did not immediately audit all the data changes which might have affected production. This should have been part of our analysis from step one (see the validation sketch after this list).
- Beware of recency bias: Our recent incidents had originated from a single service, so our attention immediately focused on recovering this service, which was also in distress.
- Practice running large incidents frequently: As we’ve matured, our platform and services have become more resilient. That reliability has resulted in fewer, less impactful incidents. While our engineers do frequently run their own wargames, their scope is often limited by experience and imagination.
- Look at all the signals: Thinking about mitigation is correct, but don’t increase compute resources without establishing whether you’ve experienced a genuine increase in requests.
- Recap: For longer-running incidents, take regular timeouts to step back and recap what we’ve learnt.
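As a concrete example of treating data like code, the check below is the kind of validation gate that could run against the materialised dataset before publishing it. The types and structure are a sketch, not our actual pipeline code: it simply rejects any dataset in which following parent pointers ever revisits a location.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of a pre-publish validation gate for the materialised geo dataset.
class GeoDatasetValidator {
    record Location(String id, String parentId) {}

    static void validate(List<Location> dataset) {
        Map<String, Location> byId =
                dataset.stream().collect(Collectors.toMap(Location::id, Function.identity()));
        for (Location location : dataset) {
            Set<String> seen = new HashSet<>();
            Location current = location;
            // Walk up the hierarchy; revisiting any id means there is a cycle
            // (a location that is its own parent is the simplest case).
            while (current != null && current.parentId() != null) {
                if (!seen.add(current.id())) {
                    throw new IllegalStateException(
                            "Geo hierarchy cycle involving location " + current.id());
                }
                current = byId.get(current.parentId());
            }
        }
    }
}
```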