Notes on Distributed Systems for Young Bloods (2013)
somethingsimilar.com
> Find ways to be partially available. Partial availability is being able to return some results even when parts of your system are failing.
Question for HN: how do you define "system is working" in microservices (think usual containers, some load balancers, some message broker, some external integrations, some web frontends)? I've found that this question verbatim cannot be answered. There are so many failure modes of a distributed system, and some of them not even easily noticeable, that we can have a system degraded 90% of the time and users will never notice.
I agree with you that this question is unanswerable verbatim. It sorta reminds me of how Agile came up with the "Definition of Done".
System maintainers / stakeholders themselves need to come up with a "Definition of Working".
Since distributed systems are basically in some state of failure/degradation almost all of the time, it is useless to try to say that "the system is working when there are no errors anywhere".
Some sort of threshold needs to be arrived at where we can say "it's working".
What that threshold looks like is going to vary from project to project.
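To make that concrete, here is a rough sketch of what one such "Definition of Working" could look like in code. Everything in it is an assumption for illustration: the `Window` type, the choice of "user-visible failures" as the signal, and the 99% default are all made up, and a real project would pick its own signal and number.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts over some observation window (hypothetical type)."""
    total: int   # requests seen in the window
    failed: int  # requests that failed in a way users could see


def is_working(window: Window, min_success_ratio: float = 0.99) -> bool:
    """Return True if the success ratio meets the agreed threshold.

    The 0.99 default is an arbitrary placeholder, not a recommendation.
    """
    if window.total == 0:
        # Assumption: with no traffic there is nothing to judge,
        # so the system is treated as working.
        return True
    success_ratio = (window.total - window.failed) / window.total
    return success_ratio >= min_success_ratio
```

With this definition, `is_working(Window(total=1000, failed=5))` counts as working even though five requests failed, while `is_working(Window(total=1000, failed=50))` does not. The point is only that "working" becomes a number the stakeholders agreed on, instead of "no errors anywhere".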