Avoiding Fallback in Distributed Systems (2020)

aws.amazon.com

54 points by omaras 4 years ago · 11 comments

eternalban 4 years ago

This is the main valuable insight imho: "Distributed fallback strategies [can] ... in our experience ... increase the scope of impact of failures as well as increasing recovery times." (The ~strawman malloc analogy is not entirely convincing.)

But then again, consider physical systems, say a spaceship, which require critical capabilities and operational regimes, and ask if fallback fault management is really a 'bad idea'.

  • sitkack 4 years ago

    I am trying to be more positive in general, so take everything with a grain of salt; I also work for a Big Cloud provider.

    I read that as: "we work really hard to engineer crystalline fault lines vertically through our stack so the system has a nice clean single plane of fracture."

    Given their track record of reliability and the unsubstantiated claims in the article, I can't even. In the real world, all the actions that have absolutely saved a system were occurrences of fallback.

    Having branch-free code with one way to fail is nice from a reasoning perspective, and reasoning was more than one of the points brought up in the article. But reasoning is a goal that is different from reliability. I can use a reliable automatic transmission without reasoning about it.

    Fallback fixes issues that failover doesn't. Rather than putting out a piece that encourages someone not to do something (granted, sometimes this is important), encouraging folks to use immutability would be a larger global positive.

    Immutability really does change everything.

    https://cacm.acm.org/magazines/2016/1/195722-immutability-ch...
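
    A toy sketch of why (Python; everything here is made up for illustration): with immutable, content-addressed records, retrying a write is always safe, because writing the same bytes twice is a no-op.

        import hashlib, json

        store = {}

        def put(record):
            """Content-addressed, immutable put: a retried write changes nothing."""
            blob = json.dumps(record, sort_keys=True).encode()
            key = hashlib.sha256(blob).hexdigest()
            store.setdefault(key, blob)  # duplicate writes are no-ops
            return key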

    • EGreg 4 years ago

      I mean, I can definitely see their point. I've worked in distributed systems for a decade and I can tell you, when you kick the can downstream, it just gets worse later when it’s spread out and systemic.

      You should nip overloads in the bud, not propagate them. Have backpressure at the protocol level, so every node only deals with its neighbors.
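
      A minimal sketch of the kind of protocol-level backpressure I mean (Python; Node and offer are names I'm inventing): each node exposes a bounded inbox to its neighbors and refuses work it can't absorb, so overload stops at the boundary instead of propagating downstream.

          import queue

          class Node:
              def __init__(self, capacity=100):
                  # Bounded inbox: this is the backpressure mechanism.
                  self.inbox = queue.Queue(maxsize=capacity)

              def offer(self, msg):
                  """Called by a neighbor. Returning False tells the neighbor
                  to slow down instead of pushing its overload downstream."""
                  try:
                      self.inbox.put_nowait(msg)
                      return True
                  except queue.Full:
                      return False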

      In fact, I would go so far as to say that the main reason for these failures is that we have monolithic, global addressing systems like DNS or IP routing tables, which let me send spam email to anyone, or DDoS a site from many machines at once. It’s totally discontinuous.

      What a good distributed system should do is distribute capabilities continuously. Each node can grant capabilities only to trusted neighbors, and revoke any that have been misused. Neighbors can then delegate some capabilities to others, or — if the node wants — forward an invitation to them to become a neighbor.
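
      Something like this, roughly (Python; a deliberately naive sketch, not a real protocol): a node grants a capability to a trusted neighbor, the neighbor can delegate a narrower copy onward, and the original grantor can revoke the whole chain if it's misused.

          class Capability:
              def __init__(self, holder, rights, parent=None):
                  self.holder = holder
                  self.rights = set(rights)
                  self.parent = parent
                  self.revoked = False
                  self.children = []

              def delegate(self, neighbor, rights):
                  # A delegated capability can never carry more rights than its parent.
                  child = Capability(neighbor, self.rights & set(rights), parent=self)
                  self.children.append(child)
                  return child

              def revoke(self):
                  # Revocation cascades down the whole delegation chain.
                  self.revoked = True
                  for child in self.children:
                      child.revoke()

              def allows(self, right):
                  return not self.revoked and right in self.rights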

      That would also solve all the issues about “real names policy”, and other crap like that. It shouldn’t matter whether you are “the real” Bill Gates or not. Your email shouldn’t be accessible to the whole world.

      And websites would also be stored using a Filecoin-type market, which recruits more machines as more readers SPEND MONEY using micropayments to access the files.

      Right now micropayments aren’t feasible, so instead we essentially have the publishers pay for hosting and collect micropayments via subscriptions and bundles.

    • yuliyp 4 years ago

      Immutability doesn't really solve everything. It provides a cleaner path for retrying writes, but it still doesn't handle situations where reads fail.

      I think the conclusion in the article ("don't do fallback") is misguided. Fallback code is sketchy, but sometimes it is worth taking the time to write well-audited, well-tested fallback code to ensure that a system with high availability requirements can survive dependencies that are less reliable.
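
      For example (a rough sketch; primary and stale_cache are hypothetical clients): a read path with a deliberately tiny fallback to stale cached data when the primary dependency times out. Keeping the fallback path this small is what makes it auditable and testable.

          def read_profile(user_id, primary, stale_cache):
              """Primary read with a small, well-exercised fallback path."""
              try:
                  value = primary.get(user_id)
                  stale_cache.put(user_id, value)  # keep the fallback data warm
                  return value
              except TimeoutError:
                  # Fallback: serve possibly-stale data rather than fail,
                  # trading freshness for availability.
                  return stale_cache.get(user_id)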

  • gumby 4 years ago

    > But then again, consider physical systems, say a spaceship, which require critical capabilities and operational regimes, and ask if fallback fault management is really a 'bad idea'.

    Their very example -- airport notice boards -- is exactly the sort of place where fallback is needed. The thesis of the piece is that management of fallbacks is complicated and painful and thus increases the scope of failure, as you observed.

    In other words: fallback is often but not always required, and if you can plan to avoid it, it may be better for you, depending on your application.

    • PaulHoule 4 years ago

      I think of how the Space Shuttle had 4 computers running the same software and a backup computer running a simpler implementation of the control program.

      The flight control systems of civil aircraft like the A320 have fallback modes to handle hardware failures such as a failed angle-of-attack sensor.

      https://a320podcast.libsyn.com/flight-control-laws

      The 737 MAX crashed because it didn't have fallback modes.

      Engine Control Units in automobiles also have fallback modes. You shouldn't get stuck just because an oxygen sensor failed, even though that means the car will have trouble balancing clean emissions, performance and fuel efficiency.
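
      Roughly this pattern (a sketch in Python, not real ECU code; the numbers are invented): if the sensor reading is missing or out of range, drop back to a conservative open-loop default instead of refusing to run.

          def fuel_trim(o2_voltage):
              """Closed-loop trim when the O2 sensor reads sane,
              otherwise a conservative open-loop fallback."""
              if o2_voltage is not None and 0.1 <= o2_voltage <= 0.9:
                  # Closed loop: nudge the mixture toward stoichiometric.
                  return 1.0 + (0.45 - o2_voltage) * 0.2
              # Fallback mode: run slightly rich so the engine keeps going,
              # trading emissions and fuel economy for availability.
              return 1.05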

      • gumby 4 years ago

        Years ago we had a customer working on the automated control system for the Vienna main train station. They only used two computers, but one was a SPARC and the other x86. One ran using a procedural language (CHILL) from the telecom world; the other implementation was written in a production language, perhaps Prolog. They were very concerned that an identical bug could be implemented in both implementations, hence the RISC and CISC architectures and the extremely different programming paradigms.

        WRONG: I believe the space shuttles started out with all the computers being LSI-11s. Presumably that was upgraded as the STS program continued!

        Hmm, I looked it up and actually they were older: standard IBM avionics computers designed in the mid 1960s. They were all the same design and as far as I can tell from a little DDG searching, they were never upgraded.

        I was so wrong I decided not to delete my mistaken observation.

        • PaulHoule 4 years ago

          The shuttle started with an AP-101C that used core memory; it was replaced midstream with an AP-101S based on semiconductor memory that was 3x faster. (Reference based on a link to a paper on a NASA web site with a busted SSL certificate.)

          System/4π derivatives were used for the target discriminating radar on the F-15 and quite a few other military applications.

      • jbn 4 years ago

        Having 4 computers running software (the same or different software, it doesn't really matter) is known not to give you fault tolerance. See http://sunnyday.mit.edu/papers/nver-tse.pdf

  • letitbeirie 4 years ago

    > physical systems, say a spaceship, which require critical capabilities and operational regimes, and ask if fallback fault management is really a 'bad idea'.

    Depends on context, obviously, but IME as a controls engineer, what you want is a failsafe, not a fallback.

    AWS calls it a fallback when you "use a different mechanism to achieve the same result." Failsafes are all about returning the system to a stable and controllable state: if you can salvage the result, that's great, but if it takes flaring off $10,000,000 worth of distillate to stabilize the system, that's fine too.
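
    To make the distinction concrete (a sketch with made-up names, not real controls code): a fallback tries another mechanism to still produce the result; a failsafe gives up on the result and drives the process to a known-safe state.

        def on_level_sensor_fault(unit):
            # Failsafe, not fallback: stop chasing the setpoint entirely and
            # hold a state that is stable without the failed measurement.
            unit.close_feed_valve()
            unit.open_relief_to_flare()
            unit.raise_alarm("level sensor fault: holding safe state")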
