I love thrillers, like everyone else.
After watching ‘Psycho’, ‘Rear Window’ and ‘Vertigo’ over the weekend, I was browsing the Alfred Hitchcock movies section on IMDb to decide what to watch next. ‘Dial M for Murder’ or ‘Strangers on a Train’, I was wondering, as I went through the plot, cast and so on.
Suddenly, IMDb was down!
An AWS outage was the culprit, I figured. I hate outages, like everyone else.
Unlike everyone else, for a change, I started trembling. Holy mother of God! I work for an email app that uses AWS!
An email service cannot go down! It just CANNOT!
I checked my email immediately (on Gmail web). There was an email from Sabya, our infra lead, the one who maintains the cloud @CloudMagic. “CM is down,” the subject line said.
We sometimes use the acronym ‘CM’ while communicating internally. But, when we talk to our users or someone outside, it is always CloudMagic (Observe the capital C and M?).
Getting back to the email, “Looking into it” it said. He must have got an alert from our monitoring system. He immediately followed it up with “AWS — DynamoDB is down.” pointing to the Amazon Web Services’ Service Health Dashboard. (You see, it’s all green now, at least when I wrote this line it was, promise!)
We logged on to Twitter and sent this tweet:
The worst hit in this AWS outage included Netflix, IMDb, Tinder, Airbnb and most importantly, our support team.
We were flooded with tickets, tweets, Facebook posts and queries from our beta users. Oh yes, we have an active beta community on Google+, which you are most welcome to join :)
We sent a personal reply to everyone who had reached out to us. There were hundreds of tickets within a few hours. It was scary (Not as scary as the shower scene in ‘Psycho’ though).
You know what? This made me happy! Not the part where our support team was ‘under the hammer’, but the part where so many users reached out to us. It means they really give a damn about CloudMagic. Simple logic, right?
The outage lasted for 6 hours. Some of our users were happy to hear from us and some ran out of patience. Amid all this rather serious stuff, something funny was happening!
Bing Translate never fails to amuse:
Google Translate and a Japanese friend helped us out.
Our backend team had lots to do after the outage. Increase in load! They had to scale our infra to handle the extra load generated by the backlog. They launched some extra instances and monitored the services for a couple of hours.
Finally, things were back to normal. Phew!
And then we sent out another tweet:
We replied to everyone on Twitter and Facebook, to every single email we received, and to Play Store reviews as well. By now it was not just the support team: a few others who were aware of the issue got involved and contributed a few replies.
Some users came back and what they had to say made us forget all the pain we had been through:
By the way, did I mention that it was a Sunday?