Google Cloud outage brings down Layer

status.layer.com

60 points by wikyd 10 years ago · 55 comments

johnm1019 10 years ago

I am so turned off when I click a "Pricing" link and get a contact form. Even more so when I read that, "our pricing team" [will get back to you].

So, you have an entire team of people who will try and maximize how much I pay? Sounds like a great experience doing business with you. /heavysarcasm

  • homero 10 years ago

    I automatically skip any product where I have to speak to a human at any time

    • theprotocol 10 years ago

      Is it me, or are a lot of web-based service providers very chatty lately?

      I won't name and shame any particular ones, but I will say I've found myself regretting signing up for trials of certain services because of the almost sycophantic attention I'd receive from the oh-so-personable and friendly CEOs who make it a point to personally message all customers. I usually respond, initially, but then it quickly becomes pushy and intrusive, e.g. "Hi, I've noticed you haven't used [x] feature yet." "Hello? Are you getting my emails?" "Hello?"

      I don't mean to be rude, but I didn't sign up for the "omg you're so friendly and amazingly helpful" show. I just wanted to try the service out. Kindly stop breathing down my neck! :/

      • j_s 10 years ago

        This happens because it works, though not necessarily so much for the HN crowd.

        • bogomipz 10 years ago

          Does it though? In my experience a good product or service doesn't require constant spamming. It has nothing to do with whether someone reads HN or not.

          • j_s 10 years ago

            Does your request for confirmation incorporate the ancient HN discussion I linked?

        • theprotocol 10 years ago

          I'd wager the initial greeting works, but I question whether the person's lack of self awareness (when it's clear that the customer doesn't have time to do small-talk and has been evading you for 3 weeks straight) wouldn't be grating to most people who are trying to evaluate several products and get some work done.

      • laumars 10 years ago

        I've started reporting the worst offenders as spam. One service was e-mailing me once a week. After the 3rd e-mail I vowed never to use their product out of principle. After the 5th I'd had enough of the harassment.

        • theprotocol 10 years ago

          The one I have in mind sent 4 emails in the same week, with a linear increase in bewilderment that I wasn't answering promptly beyond the first two (where I made it clear that I wanted to take my time).

  • bogomipz 10 years ago

    Agreed, that's where I close the browser tab. I also wouldn't shop at a store where merchandise wasn't priced.

  • redwood 10 years ago

    I wonder if companies trying to make money should invest their time in talking to you.

nwrk 10 years ago

Don't like the attitude. Pointing fingers doesn't help paying customers trapped by Layer's poor design choices.

  • smt88 10 years ago

    Especially when they seem to be referencing only a single region. Multi-region deployment is the most basic protection against outages when using IaaS.

    • andyfleming 10 years ago

      They should at least be in multiple availability zones. Multiple regions often comes with a lot of challenges, but there isn't much reason not to be redundant in multiple AZs.

      • smt88 10 years ago

        > Multiple regions often comes with a lot of challenges

        As an almost-customer of Layer (before their massive price increase), they led me to believe that this was one of the problems they would be solving for me. Nowhere on their website does it say, "We save money by not following best practices, so plan accordingly for occasional outages!"

      • Artemis2 10 years ago

        Were they or were they not in multiple AZs? Developing for multiple availability zones is trivial when creating cloud-first software (and it's irresponsible not to use AZs!), multi-region comes with its own set of problems.

        • inlined 10 years ago

          In the report they say they're looking at moving to a new region, but Google apparently told them that us-central1-a was down. The "-a" makes it an AZ. It sounds like they're only on one AZ and may not fully understand the difference.

          [correction: they accidentally called usc1-a a region, but everything mentioned in their outage was a zone. They specifically called it a "deployment zone" not an availability zone, so it sounds like an issue of inexperience with best practices.]

          [obligatory disclaimer: I'm a Google employee. I don't have a relationship with Layer]
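The zone-versus-region distinction made above can be read straight off the name. A minimal sketch, assuming the usual Compute Engine convention that a zone name is the region name plus a one-letter suffix; the helper `split_zone` is hypothetical and not part of any Google library:

```python
from typing import Tuple


def split_zone(zone: str) -> Tuple[str, str]:
    """Split a zone name such as 'us-central1-a' into (region, zone_letter).

    Hypothetical helper: it only inspects the string and does not check
    that the zone actually exists.
    """
    region, sep, letter = zone.rpartition("-")
    # A zone name ends in "-<single letter>"; a bare region name does not.
    if not sep or not region or len(letter) != 1 or not letter.isalpha():
        raise ValueError(f"not a zone name: {zone!r}")
    return region, letter


print(split_zone("us-central1-a"))  # ('us-central1', 'a')
```

Passing the bare region `us-central1` raises `ValueError`, which is the same check the comment applies by eye: without the trailing letter the name refers to a region, not a zone.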

    • theDoug 10 years ago

      Yup. Just as too few companies realize that cloud doesn't mean one remote box replacing one local box, all eggs in one regional basket (or even single cloud provider) is unwise.

  • blakewatters 10 years ago

    Blake from Layer here: I've reviewed the updates from last night and I don't feel like the tone was out of line. We were simply trying to provide our customers with complete transparency about where the issue was and where we were in restoring service.

    With that said, we do feel that Google came up short in their responses to us over the course of the issue. We pay handsomely on a support contract to get off-hours responses and issue escalations. The responses we received were hand-wavy and vague, leaving us without sufficient data to make decisions. We have raised these concerns with our Google representative and will be working with them to tighten our partnership going forward.

    We take full responsibility for this event and are working to cover the exposure. Building a system and business with resource constraints and complex distributed technologies is a long game of managing risk and trade-offs. We're human and we make bad calls along the way. We are very sorry that we violated our commitments to our customers and their users. The entire Layer engineering team is head down right now working to make it right.

  • bogomipz 10 years ago

    That was my first thought as well. Why did they need to start migrating customers to another AZ? I hope their customers started asking that as well. The title should be "Poor Design Choices Bring Down Layer"

knorker 10 years ago

> As we are now several hours into this outage and do not have satisfactory timeline for resolution, we have begun the process of migrating our hosts into another deployment zone within GCE

Wait, what? Isn't running in multiple zones something like rule #1 or #3 in "how to run in the cloud"?

So why did they not already do this?

mbesto 10 years ago

> As we are now several hours into this outage and do not have satisfactory timeline for resolution, we have begun the process of migrating our hosts into another deployment zone within GCE. We will have a baseline set of services migrated within the hour and evaluate our ability to operate in a split deployment. Should we need to pursue a complete migration of hosts across zones then we would expect another 4-5 hours to return to full operational capacity.

Wait, their service isn't set up to operate in a split environment out of the box? I think it's time SaaS companies start documenting their IaaS setup so purchasers can do a high level audit before they decide to use it for potentially a core part of their own product/service.

  • niftich 10 years ago

    I imagine if one were a customer of this SaaS, it's on the customer to ask what availability to expect.

    Clearly this vendor thought that their savings on their IaaS bill outweighed any operational or reputational risk they'd suffer from an outage at a lower layer (pun unintended).

    • blakewatters 10 years ago

      Blake from Layer here: We are forthright with all our customers about our current deployment configuration and the roadmap timelines for evolving into a deployment with higher availability characteristics. There is real complexity in operating a system such as ours in a widely distributed configuration and like any other company at our stage we regularly assess risks and make trade-offs. Sometimes we get things wrong.

      We are very sorry to all our customers for the downstream impacts on their businesses. We came up short and are doing everything we can to make it right.

  • neom 10 years ago

    I agree. If you're going to use abstracted infrastructure but you don't understand basic distributed architecture you shouldn't really be blaming your cloud provider.

  • eitally 10 years ago

    Agreed, and to reply in the context of a critical comment someone further up noted (about never wanting to deal with any SaaS provider that required talking to a human) ... these are the kinds of reasons -- asking for architecture details or regulatory/security audits -- you NEED to be able to talk to humans, especially if you're operating in a regulated industry yourself or you're trying to sign an enterprise agreement with far-reaching consequences.

  • blakewatters 10 years ago

    Hey, Blake from Layer here: we regularly undergo architecture and deployment reviews with our customers. We are fully transparent with the current deployment configuration and timelines for revisions.

    Last night we lost a race to evolve our architecture and deployment ahead of a zone level issue that affected our total operations. We are working on it in earnest but there is very real complexity in operating a widely distributed real-time system.

  • tschellenbach 10 years ago

    I like how Algolia does that, https://www.algolia.com/infra (their blogposts and presentations go into much more detail)

    Currently thinking of creating a similar page for getstream.io, at the moment we always explain it during sales/onboarding calls. (we replicate our data to 3 different instances across multiple AZs)

flyt 10 years ago

It doesn't do any good to point the finger at your vendors when your service goes down; that data isn't useful for your customers. Never forget the lesson of http://www.whoownsmyavailability.com/

  • blakewatters 10 years ago

    Hey, Blake from Layer here: We were not at all trying to finger-point our issues at Google, only provide our customers with up to the minute transparency on where we were with the availability issue. We have received direct feedback from our customers that they do value detailed responses even when the news isn't great.

    I take full responsibility for the issues here and the team is working to remediate the exposure as quickly as possible.

  • ngrilly 10 years ago

    I'm not sure I agree. Customers like to know why it doesn't work. If it was a physical machine, they would have said something like "the disks are broken and we are replacing them". But it is cloud and they said "Google persistent disks are currently unavailable and they are fixing it".

    • knorker 10 years ago

      But the real reason is "we didn't set up our system properly".

      This is like saying "Hitachi Storage hard drives broke" when you actually mean "we didn't run RAID".

      • ngrilly 10 years ago

        You can't compare persistent disks failing in a whole zone, with a RAID array failing in a single machine.

        There is a reason why Amazon and Google take EBS/Persistent Disk failures very seriously: they are not supposed to be unavailable for several hours, except if the whole datacenter is unable to operate (flood, fire, etc.), but it's not the case here.

        If your RAID fails, and you have a support contract which guarantees restoration within 1 hour, and it's not restored within 1 hour, then I think you can legitimately say something was wrong at your provider. It's not pointing fingers. Everyone makes mistakes. It's taking responsibility.

        That said, I agree they should have run in multiple zones, as recommended by Google, if they need/want to avoid that kind of downtime.

        But I maintain Google Compute Engine Persistent Disk are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future, instead of saying "don't point finger at us, it's supposed to happen".

        • jpatokal 10 years ago

          Two clarifications: the disks were not "unavailable", they had high latency (slow I/O) in one zone only (us-central1-a); and this affected only SSD PDs, not "regular" PDs. Per the SLA [1], it's "downtime" when PDs are completely unavailable for >5 minutes in at least two zones, and neither condition was met here.

          [1] https://cloud.google.com/compute/sla

          All that said, people choose SSD because it's faster and has higher throughput, so SSDs not being fast is obviously a real problem for applications relying on this, and rest assured we are indeed doing whatever we can to avoid this in the future.

          Disclaimer: I work in Google Cloud Support.
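The SLA condition as paraphrased in the comment above can be written as a small predicate. This is a sketch of that paraphrase only, not of the SLA text itself; the `ZoneOutage` type, the function name, and the 300-minute duration are assumptions made up for illustration:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ZoneOutage:
    zone: str
    fully_unavailable: bool  # False means degraded (e.g. slow I/O) but reachable
    minutes: float


def counts_as_downtime(outages: Iterable[ZoneOutage],
                       min_minutes: float = 5,
                       min_zones: int = 2) -> bool:
    """Apply the rule as paraphrased in the comment: disks completely
    unavailable for more than 5 minutes in at least two zones."""
    qualifying = {
        o.zone for o in outages
        if o.fully_unavailable and o.minutes > min_minutes
    }
    return len(qualifying) >= min_zones


# The incident as described: high latency, one zone, several hours
# (300 minutes is an illustrative value, not a reported figure).
incident = [ZoneOutage("us-central1-a", fully_unavailable=False, minutes=300)]
print(counts_as_downtime(incident))  # False
```

Under this reading the incident fails both conditions at once, since the disks were degraded rather than unreachable and only one zone was affected. That is why an hours-long slowdown can hurt applications badly and still fall outside the "downtime" definition.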

          • pdeva1 10 years ago

            this is a typical Google Cloud Support response (I used to host on GCloud). Stretching the definitions to somehow get out of responsibility. If the SSDs have super high latency, then for most purposes they are indeed 'unavailable'. There is a reason why the user provisioned SSDs and not a regular disk.

        • knorker 10 years ago

          > That said, I agree they should have run in multiple zones, as recommended by Google

          If you don't follow your vendor's recommendations for how to use their product, how can you blame them when that exact recommendation would have saved you?

          > Google Compute Engine Persistent Disk are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future

          Sure. And the power to my office is not supposed to go out (and I've certainly worked in places where there has never been an unplanned power outage in decades), but if my business relies on it I need a UPS.

          > instead of saying "don't point finger at us, it's supposed to happen".

          It's not, and they shouldn't. Also unless you know something I don't, they didn't.

          > If your RAID fails, and you have a support contract which guarantees restoration within 1 hour,

          But as another commenter pointed out: Google did not violate the SLA during this, apparently. So…

          • ngrilly 10 years ago

            > > instead of saying "don't point finger at us, it's supposed to happen".

            > It's not, and they shouldn't. Also unless you know something I don't, they didn't.

            Sorry, my comment was confusing. Google of course never said or wrote such a thing.

            > Google did not violate the SLA during this, apparently.

            I agree.

ben_jones 10 years ago

Most concise summary of Layer I could find on the internet quickly.

> Layer is an amazingly elegant and light-weight solution for video communication. Layer is currently in a private beta primarily focused on Video, Voice and Chat on Android and iPhone. [1]

Comment was in 2014.

[1]: https://www.quora.com/What-is-the-difference-between-PubNub-...

  • blakewatters 10 years ago

    Blake from Layer here: I'll definitely share your feedback with the product and marketing teams. Here's my version of a summary for a developer-centric audience:

    Layer provides a comprehensive platform for adding rich messaging experiences inside other products. You can think of our offering as similar to iMessage or Facebook Messenger as a library / platform. We provide native SDKs on iOS and Android that provide a high-level development experience for implementing messaging. The SDK abstracts away all the low level details of implementing a great messaging system on mobile such as content synchronization and managing a persistent connection while still providing the developer with full control over the user experience. We also offer an open source UI toolkit called Atlas that provides a reference UI implementation on iOS and Android.

    In addition to our mobile offering, we also provide a Javascript SDK for browser clients as well as raw REST and WebSocket APIs for other platforms. There is also a rich set of integration APIs in the form of backend to backend REST APIs and Webhooks for tracking events within the system.

    The platform is fully managed and offered as a service. Historically our availability has been very strong, but last night exposed an Achilles heel and we are working quickly to remediate the issues exposed.

  • JasonSage 10 years ago

    If you just go to layer.com, the first text you see on the page does a pretty good job of spelling out what it is. At least, it did for me. It's also more up-to-date than that comment, it would seem.

    • tschellenbach 10 years ago

      Layer is just a building block for adding chat to your app. Similar to how you would use Elastic for search or Sendgrid for email.

    • ben_jones 10 years ago

      > Everything you need, from UI to infrastructure, to boost retention, engagement or drive transactions with the power of rich messaging.

      Wasn't enough for me. And if you click "Learn more" it's more marketing drivel. Granted my quora quote isn't much better.

      • blakewatters 10 years ago

        Blake from Layer here. Have you taken a look at our developer documentation on developer.layer.com? I felt like we did a pretty good job of presenting the product capabilities. Our homepage and the developer documentation speak to different audiences. Let us know how the developer side matches up to your expectations.

wiradikusuma 10 years ago

Slightly OOT: Anyone know a good alternative to Layer?
