Fictional Plumbing Problems As A Tortured Analogy For Software Engineering
This may be a tortured analogy, but it boils down to a basic problem:
1. You know there's a bug
2. You can't reproduce it
Several next steps come to mind:
1. Hire an outside expert who's dealt with this sort of thing before. They may be able to theorize what's going on and come up with a solution.
2. Install measures that don't prevent the problem but prevent the damage. For example, an emergency failsafe that shuts down the system or relieves the pressure when the incident occurs, thereby limiting the damage. This is why electrical systems have fuseboxes! Error management is sometimes the only option, because 100% error prevention is impossible. (A rough sketch of this idea follows below the list.)
3. Install monitoring that tracks a lot more detail than you are currently getting. When the next error occurs, you will know a lot more and may have the information you need.
Edit: What's the name of the theory in networking that says 100% error prevention is not possible, so error handling is the only option? There was a great article on HN about it a few years back.
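To make option 2 concrete, here's a minimal sketch of a software "fuse" that stops calling a misbehaving component once failures pile up, so the damage is contained even though the underlying bug is still a mystery. All names and thresholds are made up for illustration:

```python
import time


class Fuse:
    """Trips after too many failures in a rolling window, then refuses to run."""

    def __init__(self, max_failures=5, window_seconds=60):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.failure_times = []   # timestamps of recent failures
        self.tripped = False

    def call(self, operation):
        # Once tripped, fail fast instead of risking more "water damage".
        if self.tripped:
            raise RuntimeError("fuse tripped: refusing to run operation")
        try:
            return operation()
        except Exception:
            self._record_failure()
            raise

    def _record_failure(self):
        now = time.time()
        # Keep only failures inside the rolling window, then count this one.
        self.failure_times = [t for t in self.failure_times
                              if now - t < self.window_seconds]
        self.failure_times.append(now)
        if len(self.failure_times) >= self.max_failures:
            self.tripped = True
```

Wrap the risky call in fuse.call(...); something upstream then gets to decide whether a shut-down component beats a flooded apartment.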
My experience has almost always led to #3 being the most workable solution, but not a perfect one.* #2 should be incorporated into any project, but it presumes that you know all possible ramifications of incorrect operation. An electrical breaker works because complete non-operation is generally better than death. For many software companies, complete non-operation is a precursor to death.
#1 is almost never a good solution: the amount of time it would take an outsider to become familiar enough with the codebase not to aggravate your existing engineers would exceed several iterations of #3, and I've rarely met an outside expert whose solutions didn't involve rewriting everything to meet their expectations of "correct implementation." That could be a sample-selection problem on my part, however.
* - How do you know that you are monitoring the correct component? This path usually leads to multiple rounds of monitoring work as you find that what you thought was the source of the problem was in fact a symptom, and you keep adding monitoring as you get closer to the source. This is why I almost always add an insane level of logging to any application and control the verbosity through runtime controls.
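For what it's worth, here's roughly what I mean by runtime-controlled verbosity. This is a Python sketch; the signal-based toggle is just one way to do it, and the logger name is made up:

```python
import logging
import signal

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pumproom")   # hypothetical logger name


def toggle_debug(signum, frame):
    # Flip the root logger between INFO and DEBUG while the process runs.
    root = logging.getLogger()
    new_level = (logging.INFO
                 if root.getEffectiveLevel() == logging.DEBUG
                 else logging.DEBUG)
    root.setLevel(new_level)
    log.warning("log level is now %s", logging.getLevelName(new_level))


# SIGUSR1 is Unix-only; a config watcher or admin endpoint works just as well.
signal.signal(signal.SIGUSR1, toggle_debug)

# Log obsessively; lazy %-formatting keeps the cost low while DEBUG is off.
log.debug("pump pressure=%s psi temp=%s F", 115, 180)
```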
Brief non-operation (a reboot or service restart) is often better than a prolonged outage, particularly where SLAs create an expectation and acceptance of it, and where redundancy exists.
I'm thinking too that there's a feedback process at work here, and some sort of damping mechanism would help with that.
Agreed, and many architectures are designed to have components "transparently fail" without impact to overall operation. When you have forced failures, feedback/damping is absolutely required. However, my experience dictates that most such failures are unplanned and unknowable at the outset, and you can only dampen conditions that are predictable.
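As one example of damping a predictable condition: callers retrying a failed dependency can amplify the failure, so backoff with jitter is the usual damper. A minimal sketch, numbers arbitrary:

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Retry with exponential backoff and jitter so callers don't pile on."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # give up; let the caller decide
            # Exponential backoff plus jitter damps the retry feedback loop.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```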
Maybe programming is tortured analogies all the way down? (Not really, but there are some over-engineered code bases that feel like it.)
No, it's turtles.
Tortured turtles.
Actually, it's tortured turtle analogies all the way down.
Well played, sir.
In the apartment building, we have a known problem, a high severity attached to it, an unacceptably high incident rate, and no idea of the exact conditions necessary to replicate it.
At this point I would do two things:
1. log all the things.
2. find me my top QA person, the one who can find bugs that nobody has yet reported. Put her on it.
OK, everybody knows that logging is good. And everyone knows that QA is good.
What I have found, though, is a number of companies who think that QA is best done by the developer who wrote the feature... and I think they are absolutely wrong in every sense, except possibly short-term economics. Having someone do QA who has none of their ego invested in the code is essential.
The problem here is that there is very little, if anything, as complicated as software. Preventing leaks like the one in the example is not that difficult -- you put in pipes that can handle a lot more than the required load, because it is unacceptably expensive to have them burst and the better pipes are not that much more expensive (putting them in is).
Nope. I actually live in a luxury high-rise, and while the pipes don't leak, the pressure and temperature are about as bad as the OP describes.
No bidets, though.
I was hoping for an analogy to explain how difficult it is to estimate long-term programming work due to unexpected "black swan" details popping up as you get into the work that add considerable effort to the project. It's a situation I find I need to explain often, and in layman terms, so a perfect analogy would be great...
I think that's been beaten to death with http://www.quora.com/Engineering-Management/Why-are-software...
Redo the bathrooms using different layout and components.
"If at first you don't succeed, refactor."
Or, you can pivot... Turn the bathrooms into fishtanks!