Google Cloud outage brings down Layer
status.layer.com

I am so turned off when I click a "Pricing" link and get a contact form. Even more so when I read that "our pricing team" [will get back to you].
So, you have an entire team of people who will try and maximize how much I pay? Sounds like a great experience doing business with you. /heavysarcasm
I automatically skip any product where I have to speak to a human at any time
Is it me, or are a lot of web-based service providers very chatty lately?
I won't name and shame any particular ones, but I will say I've found myself regretting signing up for trials of certain services because of the almost sycophantic attention I'd receive from the oh-so-personable and friendly CEOs who make it a point to personally message all customers. I usually respond, initially, but then it quickly becomes pushy and intrusive, e.g. "Hi, I've noticed you haven't used [x] feature yet." "Hello? Are you getting my emails?" "Hello?"
I don't mean to be rude, but I didn't sign up for the "omg you're so friendly and amazingly helpful" show. I just wanted to try the service out. Kindly stop breathing down my neck! :/
This happens because it works, though not necessarily so much for the HN crowd.
Does it though? In my experience a good product or service doesn't require constant spamming. It has nothing to do with whether someone reads HN or not.
Does your request for confirmation incorporate the ancient HN discussion I linked?
I'd wager the initial greeting works, but I suspect the person's lack of self-awareness (when it's clear that the customer doesn't have time for small talk and has been evading you for three weeks straight) is grating to most people who are trying to evaluate several products and get some work done.
I've started reporting the worst offenders as spam. One service was e-mailing me once a week. After the 3rd e-mail I vowed never to use their product out of principle. After the 5th I'd had enough of the harassment.
The one I have in mind sent 4 emails in the same week, with a linear increase in bewilderment that I wasn't answering promptly beyond the first two (where I made it clear that I wanted to take my time).
Agreed, that's where I close the browser tab. I also wouldn't shop at a store where merchandise wasn't priced.
I wonder if companies trying to make money should invest their time in talking to you.
Don't like the attitude. Pointing fingers doesn't help paying customers trapped by Layer's poor design choices.
Especially when they seem to be referencing only a single region. A multi-region deployment is the most basic protection against outages when using IaaS.
They should at least be in multiple availability zones. Multiple regions often come with a lot of challenges, but there isn't much reason not to be redundant in multiple AZs.
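A zonal outage like this is exactly what GCE's regional (multi-zone) managed instance groups are meant to absorb. As a rough sketch (the group and template names here are hypothetical, and the flags are as I understand the gcloud docs, so verify against current documentation), spreading a stateless fleet across zones looks roughly like:

```shell
# Create a regional managed instance group spanning three zones in
# us-central1, so a single-zone outage only degrades part of the fleet.
# "layer-backend" and "backend-template" are placeholder names.
gcloud compute instance-groups managed create layer-backend \
    --region us-central1 \
    --zones us-central1-a,us-central1-b,us-central1-c \
    --template backend-template \
    --size 3
```

Note that this only covers stateless compute; the persistent disks that failed in this incident are zonal resources and need their own replication strategy.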
> Multiple regions often come with a lot of challenges
As an almost-customer of Layer (before their massive price increase), they led me to believe that this was one of the problems they would be solving for me. Nowhere on their website does it say, "We save money by not following best practices, so plan accordingly for occasional outages!"
Were they or were they not in multiple AZs? Developing for multiple availability zones is trivial when creating cloud-first software (and it's irresponsible not to use AZs!); multi-region comes with its own set of problems.
In the report they say they're looking at moving to a new region, but Google apparently told them that us-central1-a was down. The "-a" makes it an AZ. It sounds like they're only on one AZ and may not fully understand the difference.
[correction: they accidentally called usc1-a a region, but everything mentioned in their outage was a zone. They specifically called it a "deployment zone" not an availability zone, so it sounds like an issue of inexperience with best practices.]
[obligatory disclaimer: I'm a Google employee. I don't have a relationship with Layer]
Yup. Just as too few companies realize that cloud doesn't mean one remote box replacing one local box, all eggs in one regional basket (or even single cloud provider) is unwise.
Blake from Layer here: I've reviewed the updates from last night and I don't feel like the tone was out of line. We were simply trying to provide our customers with complete transparency about where the issue was and where we were in restoring service.
With that said, we do feel that Google came up short in their responses to us over the course of the issue. We pay handsomely on a support contract to get off-hours responses and issue escalations. The responses we received were hand-wavy and vague, leaving us without sufficient data to make decisions. We have raised these concerns with our Google representative and will be working with them to tighten our partnership going forward.
We take full responsibility for this event and are working to cover the exposure. Building a system and business with resource constraints and complex distributed technologies is a long game of managing risk and trade-offs. We're human and we make bad calls along the way. We are very sorry that we violated our commitments to our customers and their users. The entire Layer engineering team is heads down right now working to make it right.
That was my first thought as well. Why did they need to start migrating customers to another AZ? I hope their customers started asking that as well. The title should be "Poor Design Choices Bring Down Layer"
> As we are now several hours into this outage and do not have satisfactory timeline for resolution, we have begun the process of migrating our hosts into another deployment zone within GCE
Wait, what? Isn't running in multiple zones something like rule #1 or #3 in "how to run in the cloud"?
So why did they not already do this?
> As we are now several hours into this outage and do not have satisfactory timeline for resolution, we have begun the process of migrating our hosts into another deployment zone within GCE. We will have a baseline set of services migrated within the hour and evaluate our ability to operate in a split deployment. Should we need to pursue a complete migration of hosts across zones then we would expect another 4-5 hours to return to full operational capacity.
Wait, their service isn't set up to operate in a split environment out of the box? I think it's time SaaS companies start documenting their IaaS setup so purchasers can do a high-level audit before they decide to use it for potentially a core part of their own product/service.
I imagine if one were a customer of this SaaS, it's on the customer to ask what availability to expect.
Clearly this vendor thought that their savings on their IaaS bill outweighed any operational or reputational risk they'd suffer from an outage at a lower layer (pun unintended).
Blake from Layer here: We are forthright with all our customers about our current deployment configuration and the roadmap timelines for evolving into a deployment with higher availability characteristics. There is real complexity in operating a system such as ours in a widely distributed configuration and like any other company at our stage we regularly assess risks and make trade-offs. Sometimes we get things wrong.
We are very sorry to all our customers for the downstream impacts to their businesses. We came up short and are doing everything we can to make it right.
I agree. If you're going to use abstracted infrastructure but you don't understand basic distributed architecture you shouldn't really be blaming your cloud provider.
Agreed, and to reply in the context of a critical comment someone further up noted (about never wanting to deal with any SaaS provider that required talking to a human) ... these are the kinds of reasons -- asking for architecture details or regulatory/security audits -- you NEED to be able to talk to humans, especially if you're operating in a regulated industry yourself or you're trying to sign an enterprise agreement with far-reaching consequences.
Hey, Blake from Layer here: we regularly undergo architecture and deployment reviews with our customers. We are fully transparent with the current deployment configuration and timelines for revisions.
Last night we lost a race to evolve our architecture and deployment ahead of a zone level issue that affected our total operations. We are working on it in earnest but there is very real complexity in operating a widely distributed real-time system.
I like how Algolia does that, https://www.algolia.com/infra (their blogposts and presentations go into much more detail)
Currently thinking of creating a similar page for getstream.io, at the moment we always explain it during sales/onboarding calls. (we replicate our data to 3 different instances across multiple AZs)
Thanks for the link. We are taking a look at what Algolia has done here and will likely put together a public infrastructure overview page for Layer as well.
More inspiration: https://www.mapbox.com/platform/
It doesn't do any good to point the finger at your vendors when your service goes down; that data isn't useful for your customers. Never forget the lesson of http://www.whoownsmyavailability.com/
Hey, Blake from Layer here: We were not at all trying to finger-point our issues at Google, only provide our customers with up to the minute transparency on where we were with the availability issue. We have received direct feedback from our customers that they do value detailed responses even when the news isn't great.
I take full responsibility for the issues here and the team is working to remediate the exposure as quickly as possible.
I'm not sure I agree. Customers like to know why it doesn't work. If it was a physical machine, they would have said something like "the disks are broken and we are replacing them". But it is cloud and they said "Google persistent disks are currently unavailable and they are fixing it".
But the real reason is "we didn't set up our system properly".
This is like saying "Hitachi Storage hard drives broke" when you actually mean "we didn't run RAID".
You can't compare persistent disks failing in a whole zone with a RAID array failing in a single machine.
There is a reason why Amazon and Google take EBS/Persistent Disk failures very seriously: they are not supposed to be unavailable for several hours unless the whole datacenter is unable to operate (flood, fire, etc.), and that's not the case here.
If your RAID fails, and you have a support contract which guarantees restoration within 1 hour, and it's not restored within 1 hour, then I think you can legitimately say something was wrong at your provider. It's not pointing fingers. Everyone makes mistakes. It's taking responsibility.
That said, I agree they should have run in multiple zones, as recommended by Google, if they need/want to avoid that kind of downtime.
But I maintain Google Compute Engine Persistent Disk are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future, instead of saying "don't point finger at us, it's supposed to happen".
Two clarifications: the disks were not "unavailable", they had high latency (slow I/O) in one zone only (us-central1-a); and this affected only SSD PDs, not "regular" PDs. Per the SLA [1], it's "downtime" when PDs are completely unavailable for >5 minutes in at least two zones, and neither condition was met here.
[1] https://cloud.google.com/compute/sla
All that said, people choose SSD because it's faster and has higher throughput, so SSDs not being fast is obviously a real problem for applications relying on this, and rest assured we are indeed doing whatever we can to avoid this in the future.
Disclaimer: I work in Google Cloud Support.
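For context on what an uptime percentage actually buys you, here is a quick back-of-the-envelope calculation (a generic illustration of SLA arithmetic, not the specific terms of the GCE SLA linked above):

```python
def monthly_downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed per month at a given uptime SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

# A 99.95% monthly uptime SLA allows roughly 21.6 minutes of downtime,
# far less than the "several hours" outage described in this thread.
print(round(monthly_downtime_budget_minutes(99.95), 1))
```

Of course, as the comment above notes, what counts as "downtime" for SLA purposes (complete unavailability in two zones vs. high latency in one) matters as much as the percentage itself.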
This is a typical Google Cloud Support response (I used to host on GCloud): stretching the definitions to somehow get out of responsibility. If the SSDs have super high latency, then for most purposes they are indeed 'unavailable'. There is a reason why the user provisioned SSDs and not a regular disk.
> That said, I agree they should have run in multiple zones, as recommended by Google
If you don't follow your vendor's recommendations for how to use their product, how can you blame them when that exact recommendation would have saved you?
> Google Compute Engine Persistent Disk are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future
Sure. And the power to my office is not supposed to go out (and I've certainly worked in places where there has never been an unplanned power outage in decades), but if my business relies on it I need a UPS.
> instead of saying "don't point finger at us, it's supposed to happen".
It's not, and they shouldn't. Also unless you know something I don't, they didn't.
> If your RAID fails, and you have a support contract which guarantees restoration within 1 hour,
But as another commenter pointed out: Google did not violate the SLA during this, apparently. So…
> > instead of saying "don't point finger at us, it's supposed to happen".
> It's not, and they shouldn't. Also unless you know something I don't, they didn't.
Sorry, my comment was confusing. Google of course never said or wrote such a thing.
> Google did not violate the SLA during this, apparently.
I agree.
Most concise summary of Layer I could find on the internet quickly.
> Layer is an amazingly elegant and light-weight solution for video communication. Layer is currently in a private beta primarily focused on Video, Voice and Chat on Android and iPhone. [1]
Comment was in 2014.
[1]: https://www.quora.com/What-is-the-difference-between-PubNub-...
Blake from Layer here: I'll definitely share your feedback with the product and marketing teams. Here's my version of a summary for a developer-centric audience:
Layer provides a comprehensive platform for adding rich messaging experiences inside other products. You can think of our offering as similar to iMessage or Facebook Messenger as a library / platform. We provide native SDKs on iOS and Android that provide a high-level development experience for implementing messaging. The SDK abstracts away all the low level details of implementing a great messaging system on mobile such as content synchronization and managing a persistent connection while still providing the developer with full control over the user experience. We also offer an open source UI toolkit called Atlas that provides a reference UI implementation on iOS and Android.
In addition to our mobile offering, we also provide a Javascript SDK for browser clients as well as raw REST and WebSocket APIs for other platforms. There is also a rich set of integration APIs in the form of backend to backend REST APIs and Webhooks for tracking events within the system.
The platform is fully managed and offered as a service. Historically our availability has been very strong, but last night exposed an Achilles heel and we are working quickly to remediate the issues.
If you just go to layer.com, the first text you see on the page does a pretty good job of spelling out what it is. At least, it did for me. It's also more up-to-date than that comment, it would seem.
Layer is just a building block for adding chat to your app. Similar to how you would use Elastic for search or Sendgrid for email.
That reminds me, I wonder what ever came of the Adria Richards v. Sendgrid issue.
> Everything you need, from UI to infrastructure, to boost retention, engagement or drive transactions with the power of rich messaging.
Wasn't enough for me. And if you click "Learn more" it's more marketing drivel. Granted my quora quote isn't much better.
Blake from Layer here. Have you taken a look at our developer documentation on developer.layer.com? I felt like we did a pretty good job of presenting the product capabilities. Our homepage and the developer documentation speak to different audiences. Let us know how the developer side matches up to your expectations.
Slightly OOT: Anyone know good alternative to Layer?
CometChat works seamlessly on web, mobile & desktop! Your users can be on any platform and communicate with each other.
Check out the demo here: https://www.cometchat.com/demo
Use https://matrix.org if you don't want to depend on the infrastructure decisions of a third party.
PubNub. We have fantastic uptime and do over a trillion transactions a month flawlessly. We also do some cool blogging too. (my job :)
Firebase is a great platform for building chat functionality
Twilio's chat SDKs https://www.twilio.com/ip-messaging
You can try cometchat.com, which is self-hosted, so no such issue.