At Flipdish, reliability is a product feature. Restaurants worldwide use our platform to take orders and process payments, often during their busiest hours. That means “works most of the time” is not good enough, so we take the uptime and reliability of our systems very seriously and are committed to keeping these essential services available.
Much of the work done by the platform team (where I work) focuses on improving reliability at the infrastructure level: mitigating DDoS attacks, enhancing observability and rollback capabilities, deploying services across regions, replatforming for auto-scaling, and building our API management solution. The platform team also mucks in during incident response to help get services back up.
We take pride in our system observability, incident management process, and CI/CD automations and practices. Most incidents are within our control: we can tweak config, roll back deployments, and shift traffic, restoring service quickly with minimal customer impact.
However, over the last couple of months, we’ve also been reminded of a reality every modern SaaS business lives with: sometimes the trigger is upstream. When a major cloud or edge provider has a bad day, you can’t “fix” the outage. What you can do is limit the blast radius, restore critical paths first, and use the experience to make the next incident less painful.
Below is a timeline of recent infrastructure challenges, along with details of two that impacted Flipdish services.
Timeline
October 20th: AWS suffered a significant outage in us-east-1. Fortunately, this DNS-related DynamoDB outage had minimal impact on Flipdish’s services because they were not hosted in us-east-1; however, it underscored the importance of comprehensive resilience strategies, even when infrastructure challenges occur outside our direct systems.
October 29th: Incident 192; Azure Front Door / CDN had a global outage. This time we weren’t as lucky: heading into our dinner-time traffic peak, ordering across all consumer channels was affected.
November 18th: Incident 195; Cloudflare experienced a global outage triggered by a data-plane configuration issue. Flipdish uses Cloudflare to proxy production traffic, so the scope of impact was broad and required rapid mitigation decisions under time pressure.
Incident 192: Azure Front Door
Our automated alerting quickly detected that key menu data files used by all our frontend clients were inaccessible. An incident was called, and members of the platform team joined the Google Meet as the impact was being assessed. We determined the menu data was inaccessible due to Azure Front Door (AFD) being offline. Our audit revealed:
- Customers using mobile apps or in-store kiosks couldn’t order as the menu data was unavailable
- Client websites were loading but non-functional due to missing static files
- Portions of the Flipdish Portal weren’t loading
- A payment gateway used by some POS clients wasn’t loading
We prioritised getting the menu data back online. Websites would need this data anyway once their static files were fixed, and tackling it first would bring kiosks and apps back while we worked on the website files.
After some discussion, we decided to try to use Cloudflare’s proxying capabilities to mitigate the issues. We enabled Cloudflare proxying with Azure Blob Storage as the origin and manually configured a couple of Cloudflare rules to ensure Azure accepted the traffic. The solution worked, and the menus were back online!
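To give a flavour of the shape of that mitigation, enabling Cloudflare proxying in front of a Blob Storage origin amounts to creating a proxied CNAME record via the Cloudflare API. This is a hypothetical sketch: the hostnames and storage account names below are placeholders, not our real configuration.

```typescript
// Hypothetical sketch: a proxied CNAME record pointing a public hostname at
// an Azure Blob Storage origin. With proxied: true, traffic flows through
// Cloudflare's edge instead of the (offline) Azure Front Door endpoint.
interface DnsRecord {
  type: "CNAME";
  name: string;    // hostname clients resolve, e.g. menus.example.com
  content: string; // origin hostname Cloudflare fetches from
  proxied: boolean;
}

function blobOriginRecord(host: string, storageAccount: string): DnsRecord {
  return {
    type: "CNAME",
    name: host,
    content: `${storageAccount}.blob.core.windows.net`,
    proxied: true,
  };
}

// The record would be created with:
// POST https://api.cloudflare.com/client/v4/zones/{zoneId}/dns_records
const record = blobOriginRecord("menus.example.com", "examplemenus");
```

On top of the DNS record itself, the extra Cloudflare rules mentioned above (for example, making sure Azure accepts the incoming traffic) still need to be configured per endpoint.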
We then applied this mitigation manually to other Azure Front Door endpoints. This worked across all endpoints except the one hosting website static files, which required an additional rule to handle CORS headers.
Ordering impact was mitigated by 17:30, and full client impact was mitigated by 18:40. While the initial outage was out of our control, we manually restored service. Customer impact was limited to 1 hour for our crucial online ordering product, and all Flipdish services were back online within 2 hours. This was considerably faster than the 8.5 hours Microsoft reported for Azure Front Door restoration.
Incident 195: Cloudflare
On November 18th, we began receiving alerts that client websites were returning errors and reports that the Flipdish Portal was unavailable. We declared an incident and coordinated response while validating what was failing and what we could control.
This was a high-severity event with widespread impact because Cloudflare proxies our production traffic. Complicating matters, Cloudflare’s dashboard was intermittently unavailable during the incident, limiting control-plane access. Our Cloudflare configuration is also intentionally comprehensive, including WAF, rate limiting, and routing to multiple backend services, which made “quick reconfiguration under pressure” an unsafe option. Undeterred, we decided to move systems to a degraded state that could still serve our most critical requests.
During a brief window when requests were mostly succeeding, my colleague on the platform team, Mykhailo, thought quickly, logged in, and generated an API token. Thankfully, the Cloudflare API remained available throughout, allowing us to update DNS records. We targeted getting POS, our most business-critical product, back online first. Using the API, we disabled Cloudflare proxying and observed a rise in request success rates. Success! For other services (such as online ordering), the transformation and routing requirements were too complex to safely replicate during the incident window. Given Cloudflare’s status updates and the risk of a complicated unwind, we chose to keep changes minimal and reversible.
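For illustration, the core of that mitigation is a single small API call per record. The sketch below is hypothetical (zone and record IDs are placeholders, and this is not our actual tooling), but the Cloudflare endpoint shape is the documented one:

```typescript
// Hypothetical sketch: the Cloudflare API request that toggles proxying for a
// DNS record. proxied: false ("grey cloud") sends traffic straight to the
// origin, bypassing Cloudflare's edge entirely.
interface ProxyToggleRequest {
  method: "PATCH";
  url: string;
  body: { proxied: boolean };
}

function proxyToggle(zoneId: string, recordId: string, proxied: boolean): ProxyToggleRequest {
  return {
    method: "PATCH",
    url: `https://api.cloudflare.com/client/v4/zones/${zoneId}/dns_records/${recordId}`,
    body: { proxied },
  };
}

// Disabling the proxy for a record during the incident...
const disable = proxyToggle("zone-id", "pos-record-id", false);
// ...and re-enabling it afterwards is the same call with true.
const enable = proxyToggle("zone-id", "pos-record-id", true);
```

The same call in reverse is what made the eventual unwind cheap once Cloudflare recovered.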
Once Cloudflare confirmed the incident was resolved, we reverted the manual changes and fully restored our systems. Despite earlier fears of a complex rollback, we found that we only needed to apply our Terraform to return the config to baseline.
What went well
Despite the downtime affecting many clients during these incidents, there are positives worth noting:
- Our automated monitoring triggered. This allowed us to declare an incident and begin assessing impact and solutions quickly.
- Our incident management process works well. The engineering team knows our services well, and the platform team knows our infrastructure well. This makes incident response calls effective: we quickly generate and vet potential solutions to outages and implement them promptly.
- IaC allowed us to very quickly revert any manual changes made during incident response once Cloudflare restored services.
What we improved
These incidents also exposed areas where we could reduce manual work and speed up recovery:
During Incident 192, we had multiple separate Azure Front Door instances that each required manual reconfiguration in Cloudflare, so recovery was incremental rather than a single restore step. We recognised that consolidating these into a single, centrally managed system would enable faster recovery in future incidents, which directly informed the cdn.flipdish.com project described below.
During Incident 195, limited dashboard access and the complexity of our Cloudflare configuration made it impractical to safely fail over to another provider within the available timeframe. Even under time pressure, we successfully failed over our most critical systems. Still, we knew we could streamline the process further, leading to our redundant API reverse-proxying solution.
What we learned
Everyone can have a bad day. All the major providers have had outages recently, so there’s no point trying to pick the most reliable one. Instead, we need redundant failover systems that mirror our primary systems as closely as possible. Some degradation is acceptable; backups only need to keep core operations running during short outages. Reading about Monzo’s Stand-in platform really opened my mind about what’s possible here.
Centralisation isn’t all bad. Flipdish’s transition from a monolith to a distributed services architecture has significantly improved developer productivity, with release times plummeting and features shipping faster than ever. But this distribution makes it harder to make widespread, rapid changes during incidents. We should maintain some centralised systems where the platform team can implement cross-cutting concerns such as security and disaster recovery, while reaping the benefits of distributed services and retaining the ability to make fast, global changes when needed.
Systems should fail over automatically; otherwise, we need a big red button. Centralised systems should either fail over automatically or include a “big red button” that anyone in Product Engineering can “press” to reroute traffic to the backup.
What we shipped as a result
In light of Incidents 192 and 195 and lessons learnt, the platform team has initiated two projects to enhance the reliability of our services during underlying provider outages.
cdn.flipdish.com
In response to Incident 192’s slow recovery (many AFD endpoints each requiring manual reconfiguration), the platform team has created a single shared CDN for all Flipdish static files. cdn.flipdish.com stores files in both S3 and Azure Blob Storage, with Cloudflare as the primary CDN (Azure Blob origin) and a fully redundant AWS CloudFront CDN (S3 origin).
Developers configure containers in a TypeScript file, which creates containers across both storage providers and adds required configuration to both CDNs (e.g., SPA routing, custom domains). Container-scoped credentials prevent apps from writing to the wrong container, and we provide Azure DevOps tasks and GitHub Actions for CI/CD integration. Below is an example of how we can configure containers on the CDN:
```typescript
// Container configurations
// Each container represents an application that needs its own storage space
containers: [
  { name: "fonts" },
  { name: "icons", contents: "./files/icons" },
  {
    name: "serverless-app-template",
    hostingOptions: spa("serverless-app-template.flipdish.com")
  },
]
```

Finally, we need a big red button! To switch between the primary and failover CDN infra, we update the DNS records for cdn.flipdish.com in Cloudflare. All that’s required is an update to the same TypeScript configuration file where containers are defined: a user simply swaps the active declaration (shown below) to the AWS provider. On commit, the pipeline runs and updates the required configuration in under a minute. After a short DNS propagation delay, clients begin connecting to the backup CDN.
```typescript
// Simply set the active backend to the one you want to use; only one can be active at once.
backends: [
  { provider: "cf-azure", active: true },
  { provider: "aws" },
]
```

Redundant API reverse proxying
With static files handled, we turned to implementing redundant API routing. Thanks to our previous investment in a high-level configuration repo (transpiled to Cloudflare config and applied via Terraform), we were in great shape. The routing config lets developers route path segments to their apps using various proxying options. A small section of this high-level configuration is shown below:
```typescript
export const routes: { [name: string]: RouteConfig } = {
  "/*": {
    load_balancer_pool: "<zeus consumer backend>",
  },
  "/api/*": {
    backend: "<zeus generic api backend>.azurewebsites.net",
    host_header: "api.flipdish.co",
  },
  // other routes...
  "/subscriptions/*": {
    backend: "<subscriptions backend>",
    strip_initial_path: "/subscriptions"
  },
  "/communication/*": {
    backend: "<communication backend>",
    strip_initial_path: "/communication"
  },
};
```

With our complete API configuration already defined as code, it took Mykhailo just 3 days to set up a backup AWS API Gateway with the same configuration! And, of course, there is a big red button in the same config: a single flag to toggle our API DNS records to point to AWS instead of Cloudflare.
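To give a flavour of that flag, the sketch below shows one way it could look. The names and placeholder targets are illustrative, not our actual schema; the real config lives in the same high-level routing repo and is applied via Terraform.

```typescript
// Illustrative sketch of the API failover "big red button": one value in the
// high-level config decides which provider the API DNS records point to.
type ApiProvider = "cloudflare" | "aws-api-gateway";

interface ApiFailoverConfig {
  activeProvider: ApiProvider;
}

// Flip this one value and commit to retarget the API DNS records.
const apiFailover: ApiFailoverConfig = {
  activeProvider: "cloudflare",
};

function dnsTarget(cfg: ApiFailoverConfig): string {
  // Placeholder targets; the pipeline resolves the real hostnames.
  return cfg.activeProvider === "aws-api-gateway"
    ? "<api-gateway-id>.execute-api.<region>.amazonaws.com"
    : "<cloudflare-proxied-target>";
}
```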
What’s next?
We’re in a much better position than last month, but we still rely on manual failover actions; we may look to automate them fully.
There’s also the question of what happens if Cloudflare’s DNS and API don’t stay up next time. If we can’t switch our DNS to reroute traffic, the redundant systems will remain…redundant. AWS announced an “accelerated recovery” feature for Route53 in late November, and Cloudflare’s CTO promised a detailed breakdown of resiliency projects following their November (and December!) outages, including “streamlined break glass capabilities” for control-plane access.
DNS-based failover is also imperfect in practice because clients can ignore TTLs. We may explore additional strategies, such as backup domains (for example, backup.api.flipdish.co) and controlled client switching mechanisms, while being mindful that dependency chains can fail in correlated ways.
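One shape a controlled client-side switching mechanism could take is a simple host-fallback loop in the client. This is purely a sketch: the backup domain, the injectable fetcher, and the function itself are assumptions for illustration, not shipped Flipdish client code.

```typescript
// Hypothetical sketch: client-side fallback between a primary and a backup
// API domain, sidestepping DNS TTL behaviour entirely. The fetcher is
// injected so the logic is testable without a network.
type Fetcher = (url: string) => Promise<{ ok: boolean; status: number }>;

const API_HOSTS = [
  "https://api.flipdish.co",        // primary domain
  "https://backup.api.flipdish.co", // hypothetical backup domain
];

async function fetchWithFallback(
  path: string,
  doFetch: Fetcher,
  hosts: string[] = API_HOSTS,
): Promise<{ ok: boolean; status: number }> {
  let lastError: unknown = new Error("no hosts configured");
  for (const host of hosts) {
    try {
      const res = await doFetch(host + path);
      if (res.ok) return res; // first healthy host wins
      lastError = new Error(`HTTP ${res.status} from ${host}`);
    } catch (err) {
      lastError = err; // network-level failure (e.g. provider outage): try next host
    }
  }
  throw lastError;
}
```

The trade-off is that every client ships knowledge of the backup domain, and the backup's own dependency chain must not fail in a correlated way with the primary's.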
As it turns out, building resilience against cloud outages is turtles all the way down. Resilience is an ongoing practice. The work above materially improves our ability to keep critical restaurant workflows running during provider incidents, and we will keep investing in defence-in-depth and faster recovery. For now, we’re better prepared to weather whatever infrastructure storms come our way. 😎