Type in the exact number of machines to proceed

rachelbythebay.com

554 points by vii 5 years ago · 339 comments

csmattryder 5 years ago

I've seen this called "pointing and calling" [1]. Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.

I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

[1] https://en.wikipedia.org/wiki/Pointing_and_calling

  • brundolf 5 years ago

    I've only been in the job field for six years, and yet:

    My first boss accidentally deleted our QA database, meaning to delete a local copy

    A later boss accidentally deleted our production database, thinking it was the clone that he had just made (which luckily we still had)

    Both of them were very experienced developers in their 40s. Nobody is beyond this kind of mistake.

    • jlmorton 5 years ago

      War story time. Long ago, I worked for an interesting company that insisted on running its entire business on Linux desktops, all the way back between 1999 and 2002. Imagine running StarOffice/OpenOffice, Thunderbird, Netscape Navigator, etc., for your entire business back in 2000, including your executive team, marketing teams, everyone, most of whom had never even heard of Linux before.

      Anyway, this being Linux, everyone's home directory was mounted on NFS. All our builds were standardized with a tool called SystemImager, which we could use to push out updates to everyone's desktop whenever we wanted. If there was a new version of KDE, we could pretty easily push that change out.

      Sometimes it was convenient for me to work on updates to these images by chrooting into a directory containing the "image," which was really just an rsync tree. And sometimes, when updating these images, it was convenient to mount our NFS home directories in this chroot environment, so I could access things like an archive I had just downloaded on my own desktop.

      And eventually we had lots of different images, and the old ones were using up a lot of disk space, so I decided to free up some space by removing the old images. These were fairly large images, with lots of small files, and this was before SSDs were a thing, so it made sense that deleting them was taking a while, and I stepped out to grab something to eat.

      As I was eating lunch, I started getting the tech support escalations. But this wasn't that unusual, our users routinely had problems with the environment we had provided. They hated it, because it was in many ways terrible, and they made sure we knew it. So I wasn't terribly alarmed. I didn't think any major changes had been made, so I didn't hurry back.

      By the time I leisurely returned from lunch, half the NFS home directories for our users were gone, along with all their documents, emails, bookmarks, or whatever else. Suddenly it hit me what had happened: at some point, perhaps months earlier, I had left our NFS home directories mounted within one of these image chroots. And now I had sudo rm -rf'd it.

      We had backups, but they were on tape, and it took several days to restore, with about a day of data loss.

      • abrookewood 5 years ago

        That sinking feeling and cold panic when you realise what you've done. God that is horrible.

        • gomox 5 years ago

          My favorite version is when that UPDATE or DELETE SQL query that you expected to finish instantly takes a few seconds before giving you your cursor back.

      • dasyatidprime 5 years ago

        You probably knew this already, and there's probably better solutions if you're not in the manual sysadmin world, but after I did that on a personal machine a few decades ago (I think it was?), I got in the habit of using `--one-file-system` when doing major recursive rm operations that weren't meant to cross filesystems. Or `find -xdev … -delete` for anything more selective.
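
        To illustrate the idea behind `--one-file-system`, here is a minimal Python sketch of a recursive delete that refuses to descend into a directory on a different filesystem. The function name `remove_tree_one_fs` and the `device_of` hook are hypothetical, introduced only for this example; it is not how `rm` itself is implemented.

```python
import os
from typing import Callable, List, Optional


def remove_tree_one_fs(root: str,
                       device_of: Optional[Callable[[str], int]] = None) -> List[str]:
    """Delete everything under root without crossing filesystem boundaries.

    Returns the directories that were skipped because they live on another
    device. device_of is a hypothetical hook (defaults to lstat().st_dev)
    so the behaviour can be exercised without real mount points.
    """
    if device_of is None:
        device_of = lambda path: os.lstat(path).st_dev
    root_device = device_of(root)
    skipped: List[str] = []

    def remove_dir(directory: str) -> bool:
        """Return True if the directory was emptied and removed."""
        with os.scandir(directory) as it:
            entries = list(it)  # snapshot first, then mutate the directory
        emptied = True
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                if device_of(entry.path) != root_device:
                    skipped.append(entry.path)  # a foreign mount: leave it alone
                    emptied = False
                elif not remove_dir(entry.path):
                    emptied = False
            else:
                os.unlink(entry.path)  # regular files and symlinks
        if emptied:
            os.rmdir(directory)
        return emptied

    remove_dir(root)
    return skipped
```

        Comparing `st_dev` would probably not catch a bind mount of the same filesystem, since both sides report the same device number, so this is a seatbelt and not a guarantee.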

        • dmurray 5 years ago

          It seems better to alias rm to "rm --one-file-system", assuming major cross-filesystem deletes aren't something you do all the time that should be made as ergonomic as possible.

      • coredog64 5 years ago

        Similar story, except we were using an NFS appliance that took hourly snapshots. As soon as we figured out what was happening, we had the storage team save off the latest snapshot. It was 1TB of data (a lot for the time) and took a week for us to restore.

      • aprdm 5 years ago

        A lot of companies still work in a similar fashion to what you described, maybe with root squashed, but still, it's very possible to have something like that happen nowadays!

        I remember someone hit a bug with docker exec --rm years ago where it started deleting some NFS files that it shouldn't...

      • Huggernaut 5 years ago

        This reminds me of a time when a colleague and I were investigating some persistent D-State processes that were occurring when container processes were being exec-ed.

        Once on the box, we wanted to create a container with utilities in the fs but didn't want to download an image tarball or look through the rootfs layer directories for one to use, so we just bind mounted host root onto another directory, beside the config file we were using.

        This worked like a charm. Until we rm -rf'd the config directory and deleted host root in the process.

        In our case, fortunately the consequences were minimal as all workloads were stateless. The container scheduler moved all the workloads to other hosts and the host scheduler noticed this VM wasn't responding any more and rolled a new one. The whole thing resolved itself in about 5 minutes with no interaction from us - so that was pretty neat.

      • stmw 5 years ago

        That's a very sad war story, hope it turned out OK. Sorry you and the users had to go through that.

      • eljimmy 5 years ago

        Oh man - this one is anxiety inducing. I feel like this would haunt me for years.

    • brlewis 5 years ago

      >very experienced developers in their 40s

      I'd say they were experienced developers. Only after accidentally deleting databases were they very experienced developers.

      • PopeDotNinja 5 years ago

        I once cloned a directory for standing up an environment via Terraform. I modified all of the environment variables and config and ran it. It worked perfectly. Except I’d forgotten to wipe out the Terraform state, which meant that in the process of creating a new environment, it completely deleted the environment I had cloned. That was my initiation into very experienced :)

      • BalinKing 5 years ago

        This may not have been their first time, though :-P

        • __d 5 years ago

          Some time ago, it was common in Unix sites to have an NFS filesystem mounted on all machines that contained locally-built binaries to augment those provided by the operating system. At this site, we used a bunch of different platforms: OSF/1, Solaris, Linux, HP/UX, etc. So we had a large filesystem containing the source code, and built binaries for all the different platforms, and this included heaps of things, from Bash upwards.

          A colleague of mine accidentally ran rm -rf on this filesystem.

          It was taking a loooong time, so he realised and killed it, but not before it had removed a heap of stuff. Because this was something that could be rebuilt, it wasn't backed up, so we had to go through the process of downloading the tarballs, and recompiling everything for all the different platforms. It took a few days to recover most of it, and weeks to completely restore things.

          The day after the incident, when he arrived at work, he found his keyboard was missing a few keycaps. It took him a while to realise that there were four gone: 'R', 'M', '-', and 'F' ...

          Good times.

    • mehrdadn 5 years ago

      Reminds me of when I accidentally deleted a virtual hard disk I had a few years ago, because I'd copied it earlier and I thought I still had the other copy left. Only afterward did I remember I'd done the exact same thing to the other copy earlier... thankfully the information on it wasn't critical, but it was kind of terrifying to realize it very well could have been.

    • sgustard 5 years ago

      I have been that boss. Is that you, Wendel? In any case: the deletion even had a "type your app name to confirm" prompt, but I knew I wanted to act on production; the issue was deleting the wrong one of multiple production databases. The takeaway was to grab a second pair of eyes to review any dangerous operations.

    • aidenn0 5 years ago

      I deleted our production CRM database meaning to delete the test database. While my boss was running queries on the database for setting my quarterly bonus.

      Good news is that I was deleting the test database to ensure that the recovery from backups was properly automated, so it wasn't down too long.

    • davedx 5 years ago

      Yup. Senior dev here, my own devops config screw up wiped out all production sales order data earlier this year. Had to restore from multiple backups, took a while. Stressful experience.

      Consider network partitioning so dev/test/accept just has 0 contact with prod.

    • contravariant 5 years ago

      Ironically there seems to be no time more prone to these kinds of mistakes than when you're trying to prevent or fix them.

      • jrott 5 years ago

        Most of the worst production issues I've been involved with have come from trying to fix a minor issue and then somebody making a mistake. The way our brains are wired to handle stress isn't really useful for debugging complicated problems.

        • Aeolun 5 years ago

          One of the best pieces of advice I’ve ever gotten from a manager (in regards to production issues):

          First, calm down.

          I’m still amazed that he could be so calm when I’d just deleted a bunch of stuff on a client's production environment.

          May not have been the most lucrative company I’ve ever worked for, but it was definitely the best one.

  • myself248 5 years ago

    Ever since hearing about point-and-call, I've started using it in the kitchen when turning on the stove. I used to destroy one or two pans a year by turning on the wrong burner, but it's now been about a year and a half and I haven't screwed it up yet.

    The knobs are labeled with a terrible little glyph meant to indicate which is which, and I've supplemented this with plain-english Brady labels "front left", "front right", etc. Now I speak the words above the knob, and point to the burner. It felt goofy at first, but now it feels normal, and like I'm tempting fate if I skip it.

    • giantDinosaur 5 years ago

      I'm curious how exactly you managed to destroy pans. I've never destroyed a pan in my life, and take no particular precautions - is this a common thing? Is this more common with non-stick stuff or something?

      • ajb 5 years ago

        Not the op, but non stick pans will burn if the pan is heated while empty.

        • pantalaimon 5 years ago

          I think non-stick pans are a fad. A well greased iron or steel pan works much better and is impossible to destroy.

          • rpeden 5 years ago

            Can non-stick pans even be a fad when Teflon coated cookware has been popular for 60+ years?

      • myself248 5 years ago

        The non-stick ones especially, but even plain metal pans will warp if they get hot enough. And then they don't sit flat on the burner, which might not matter on a gas stove, but contact with an electric burner is pretty important.

    • dirkt 5 years ago

      Not sure how it is in other countries, but don't the knobs when going left-to-right always correspond clockwise to the burners, starting at the lower left? And the oven knob is to the right?

      I've never seen a different arrangement.

  • MaxBarraclough 5 years ago

    Worth mentioning that, assuming the single study on the matter can be believed, the pointing and calling method is extremely effective in reducing the incidence of silly mistakes (that is, mistakes made in simple routine tasks, by competent individuals).

    Unfortunately, it strikes many as looking rather silly, so it hasn't been widely adopted.

    • js2 5 years ago

      I learned a technique from a gray beard[0] when I worked as a student sys admin for the CS dept over two decades ago. Whenever typing a destructive command, he'd take his hands off the keyboard and drop them to his side, re-read the command, then put his hands back to press enter.

      I do this whenever I'm on a production server (which is rare anyway). I use different colored prompts for local and remote shells.

      [0] Technically he had no beard and if he had, it wouldn't have been gray.

      • encom 5 years ago

        Re: Beards, color of:

        Mine started turning grey in my mid 20s.

        Could be related to me doing the electrician's equivalent of deleting production DBs. I've drilled through the comms cable to payment terminals during opening hours. I've run over a copper gas line with a scissor lift. And yes, I've cut live 230V cables with hand tools.

        That sinking feeling in your stomach you get immediately after doing something bad - it's universal across professions.

        Thankfully, I've never fucked anything major up, and I've had my hands in hospitals, power plants, ISP fiber backbones, police stations and whatnot.

        • MaxBarraclough 5 years ago

          This reminds me of a quote from, I think, Discworld.

          > You're a survivor.

          > But I've nearly died, dozens of times.

          > Exactly.

          • andreareina 5 years ago

            Sounds like Rincewind. See also the octogenarian barbarians who are so deadly precisely because they've had a lifetime of experience of not dying.

            • mangamadaiyan 5 years ago

              That would be Cohen the Barbarian (aka Genghis Cohen) and his cohorts, collectively known as the Silver Horde.

        • Xylakant 5 years ago

          > I've drilled through the comms cable to payment terminals during opening hours.

          A friend of mine who does fire alarm systems was tasked to install one at a bank branch. He found out the hard way that one of the cables for the safe's safety system wasn’t in the place where it should have been according to the plans. Safe’s safety system hosed, bank branch closed for repair.

      • brundolf 5 years ago

        Different-colored prompts for different machines is a great thing to do (I've been doing it for years), and very easy to implement
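
        As a rough sketch of how little is involved, the Python snippet below builds a Bash `PS1` string whose background colour depends on the environment. The environment names, the colour mapping, and the function name `build_ps1` are assumptions made for illustration; the escape sequences are standard ANSI colour codes and Bash prompt escapes.

```python
# Hypothetical mapping of environment name to ANSI background colour code.
COLOURS = {
    "production": 41,  # red background
    "staging": 43,     # yellow background
    "local": 42,       # green background
}


def build_ps1(environment: str) -> str:
    """Return a Bash PS1 value that labels and colours the prompt."""
    # Unknown environments fall back to red: assume the worst.
    code = COLOURS.get(environment, 41)
    # \[ and \] mark non-printing bytes for Bash; \u, \h and \w expand to
    # the user name, host name and working directory.
    return rf"\[\e[{code}m\][{environment}] \u@\h:\w\[\e[0m\]\$ "


if __name__ == "__main__":
    print(build_ps1("production"))
```

        The printed string could be assigned to `PS1` in each machine's `.bash_profile`, with the environment name set once per host.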

        • MaxBarraclough 5 years ago

          Solid tip. For GUI-enabled servers, use distinctively coloured wallpapers. I recommend bright red for production machines. The image itself can be just about anything, provided the colour is clear.

          Doesn't hurt to use an image that's related to the server's purpose, and to put the name of the server right there in the wallpaper somewhere.

          • brundolf 5 years ago

            That's nifty, but sounds like more effort than changing a single color in one's .bash_profile

            • gknoy 5 years ago

              Using iterm2, you can set a "badge" (large text overlay) on a terminal tab. I have a short shell function (`ib foo`) that sets the badge to arbitrary text. It's NOT as good as setting the terminal theme, but it's still very helpful to use it like this:

                  ib production && ssh production-machine
                  ib demo && ssh demo-machine
              
              It's definitely helped me when testing a fix on a demo or staging instance, and has helped me avoid doing it on production accidentally.

            • MaxBarraclough 5 years ago

              That's true, but depending on the configuration it may benefit all future users of the server.

        • TylerE 5 years ago

          Abel prompt that shows host name, username, and git branch

      • morelisp 5 years ago

        A similar tip I picked up long ago: If you're typing a dangerous command, first type a `#` (or `--` if it's SQL, etc.), then the command. Then read it. Then go back to the start of the line and remove the comment and run it.

        • grossvogel 5 years ago

          I always do destructive SQL commands in two steps: first run a select using the WHERE clause you intend to use and verify which records will be affected, then hit the up arrow and edit the beginning of the query leaving the WHERE intact.

          I also like adding redundant conditions to the WHERE so a typo in any single one of them won't sink me.
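
          The two-step habit above can be sketched in Python with the standard `sqlite3` module. The `orders` table, its columns, and the helper `delete_with_preview` are hypothetical, and SQLite stands in for whatever database is actually in use. The point is only that the SELECT and the DELETE share one WHERE clause, and that redundant conditions must all agree.

```python
import sqlite3


def delete_with_preview(conn, where, params, expected_ids):
    """Preview the rows with SELECT, then DELETE with the identical WHERE clause.

    `where` is trusted, operator-typed SQL; values still go through `params`.
    """
    preview = [row[0] for row in conn.execute(
        f"SELECT id FROM orders WHERE {where} ORDER BY id", params)]
    if preview != sorted(expected_ids):
        raise ValueError(
            f"WHERE matches ids {preview}, expected {sorted(expected_ids)}")
    deleted = conn.execute(f"DELETE FROM orders WHERE {where}", params).rowcount
    conn.commit()
    return deleted


# Throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", "test"), (2, "acme", "paid"), (3, "globex", "test")])
conn.commit()

# Redundant conditions: id, customer and status must all point at the same row.
deleted = delete_with_preview(
    conn, "id = ? AND customer = ? AND status = ?", (1, "acme", "test"),
    expected_ids=[1])
remaining = [row[0] for row in conn.execute("SELECT id FROM orders ORDER BY id")]
print(deleted, remaining)
```

          If a typo in any one condition changes which rows match, the preview no longer equals the expected ids and nothing is deleted.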

          • tatersolid 5 years ago

            For the rare but critical manual SQL mod our common safety measure is to wrap every DELETE or UPDATE in BEGIN TRAN...ROLLBACK TRAN first. Run on test systems or snapshots multiple times, checking the result inside the transaction.

            Finally, change ROLLBACK to COMMIT only when you are positive all is well.
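
            A minimal sketch of that rehearsal, using Python's `sqlite3` module in place of SQL Server (so `BEGIN`/`ROLLBACK` here, not `BEGIN TRAN`/`ROLLBACK TRAN`). The `accounts` table and the helper `run_in_transaction` are hypothetical names for this example only.

```python
import sqlite3


def run_in_transaction(conn, sql, params, commit=False):
    """Run one statement inside an explicit transaction.

    Returns (rows changed, table contents as seen inside the transaction).
    The change is rolled back unless commit=True is passed.
    """
    conn.execute("BEGIN")
    try:
        changed = conn.execute(sql, params).rowcount
        rows_inside = conn.execute(
            "SELECT id, status FROM accounts ORDER BY id").fetchall()
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT" if commit else "ROLLBACK")
    return changed, rows_inside


# isolation_level=None leaves transaction control to the explicit BEGIN above.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [(1, "open"), (2, "open"), (3, "open")])

update = "UPDATE accounts SET status = 'closed' WHERE id = ?"

# Rehearsal: inspect the effect, then roll back.
changed, rows_inside = run_in_transaction(conn, update, (2,))
after_rehearsal = conn.execute(
    "SELECT status FROM accounts WHERE id = 2").fetchone()[0]

# Only once the rehearsal looks right is the same statement committed.
run_in_transaction(conn, update, (2,), commit=True)
after_commit = conn.execute(
    "SELECT status FROM accounts WHERE id = 2").fetchone()[0]
```

            Defaulting to rollback means that forgetting the flag costs nothing, while committing takes a deliberate extra step.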

            • mjevans 5 years ago

              IIRC (without checking the manuals) data-definition commands might not be covered by such transactions: such as altering, dropping tables and possibly truncates.

              • andreareina 5 years ago

                PostgreSQL is quite good about DDL being transactional. So I was surprised (tbf, I shouldn't have been) when Redshift autocommitted after a TRUNCATE. But DROP TABLE is transactional, go figure.

              • tatersolid 5 years ago

                DDL is transactional in Microsoft SQL Server as well.

          • maynman 5 years ago

            I do the same thing. I also keep auto commit off and make sure the rows updated looks correct before committing the change.

        • MaxBarraclough 5 years ago

          A related Bash command: alt+# to prepend a hash symbol and submit the line, so you can return to it later (through Bash's history) to run it.

        • Fradow 5 years ago

          I use an alternate version on SQL: when running any modification on any kind of sensitive database (which is a bad practice in itself, obviously, but sometimes you don't have a choice), always type in the WHERE clause before the table name (added bonus: do a SELECT first with that clause to see what you are modifying).

          That way, if you accidentally send it, the command fails and nothing happens.

        • haimez 5 years ago

          For SQL, always BEGIN first. If you’re unsure, run it as an EXPLAIN first. Then fire. Then commit it or roll back.

        • rkagerer 5 years ago

          Especially useful if you're remoted in over a laggy connection.

    • morelisp 5 years ago

      I've done this for several years (also after seeing a video about Japanese railway operations). It doesn't seem to catch on.

      It's also not perfect; it does not catch mistakes concerning "non-local" state, e.g. configuration files in /etc merging with one in . merging with some command line options. (Personally I try to avoid writing tools with defaults of this sort, but especially Java developers seem to have different opinions.)

      Unfortunately if you do P&C and still make the mistake due to the aforementioned tooling, you look even stupider.

      • myself248 5 years ago

        Around industrial machines, I've long held and promoted the view that the machine is _trying_ to kill you, _trying_ to damage itself, _trying_ to ruin the workpiece. Only by outsmarting it at every turn, and having safeguards against every mishap, can you go home at the end of the day.

        When something happens despite all that, just step back and realize how much worse it could've been, and how successful your safeguards have been up 'til that point.

        Then look carefully at the procedure. Is there something about the naming or structure that could be more clear? Can you think of near-misses that resemble the failure you just experienced? Are you using boobytraps in production? Symlinks and overlay filesystems seem clever in the moment but they're bound to subvert our intuition someday. Perhaps you should get in the habit of always using full absolute paths, for instance.

        There's always another gotcha, but if your workflow doesn't look as over-the-top safety-silly as aerospace, you're not doing as much as you could be. (Hint: It's not silly.)

    • blantonl 5 years ago

      Watch and listen to pilots as they complete checklists. They point and call out each item, switch setting, etc.

      • waterhouse 5 years ago

        I searched Youtube for examples of this. This is a little bit staged, but it seems to be a real checklist they're going through: https://www.youtube.com/watch?v=JG7SkOQDDt0

        Though they're not perfect. They said that one pilot is supposed to read the item, the other pilot say the answer, and the first pilot visually confirm it; but at 1:42, I noticed the first pilot say "emergency exit lights", hear the confirmation, and move to the next item without her eyes moving away from the list.

        I'm not sure which of several possible conclusions to draw from that. ("Humans suck", "it is indeed staged", "the procedure has enough redundancy that the chance they're both careless on a given step is small", "the pilots feel that the emergency exit lights aren't particularly important", ...)

        • andi999 5 years ago

          Routine is the killer. Have a look at the fatal maglev train accident in Germany: a service car was on the track. If I remember correctly, the operator (driver) in the control centre could see whether the service car was in its service bay (and not on the track) just by turning their head.

      • morty_s 5 years ago

        Came here for this.

        A: “Passing control”

        B: “Taking control”

        A: “You have control”

        B: “I have control”

        This is how I remember it (6174, UH-1Y).

        • rurp 5 years ago

          Rock climbing is remarkably similar. When a climber begins up a route the standard exchange with their belayer (the person managing the rope and keeping them alive in a fall) goes something like

          A: "Belay on?"

          B: "Belay on"

          A: "Climbing"

          B: "Climb on"

          Then the climber begins.

          It's interesting to me that highly regulated and totally unregulated activities have evolved extremely similar processes. I suppose having your life on the line is a good motivator to follow best practices.

          • Talanes 5 years ago

            For the ultimate low stakes version of this, when I played World of Warcraft in my younger days, tank swaps would communicate the same way.

          • chipsa 5 years ago

            "On belay?"

            "Belay on"

            Swapping the order of the words helps further.

        • BalinKing 5 years ago

          As a fun fact, I was taught a shorter version (during private pilot instruction):

          "You have the control."

          "I have the control."

          IDK if it changes between aircraft types, commercial/private/military cultures, or if it's just coincidence.

          • pacaro 5 years ago

            Or as we learned from Sully, when you're the captain, and you take charge and responsibility quickly, a simple "My plane" suffices

          • regularfry 5 years ago

            Yep, I had this version in a military trainer.

        • cortesoft 5 years ago

          The TCP handshake IRL

        • euler_angles 5 years ago

          This kind of positive assertion handover of control is still standard for very good reasons.

      • staunch 5 years ago

        And pilots will even call out that their action had the desired effect:

        "Flaps up selected"

        "Flaps are indicating up"

        There's a lot to learn from the way airplanes are engineered and operated.

        • nemosaltat 5 years ago

          Prior Navy Nuke here. We called it PRO (Point, Read, Operate): we’d point at the thing we were going to manipulate, state what we were manipulating, and announce the completed action.

          For certain procedures we had a second party (“reader”) observing and acknowledging each part of each step.

          Operator (gesturing anti-clockwise while pointing at valve XYZ): Opening valve XYZ.

          Reader: Opening valve XYZ, aye.

          Operator: Valve XYZ is open.

          Reader: Valve XYZ is open, aye.

          Operator: Indications of flow.

          Reader: Indications of flow, aye.

          People can still get complacent, and things can still get missed but the deliberate mentality goes a long way. Now when GitHub makes me type out the repository name before I can delete it, I sometimes copy/paste... YOLO.

          • AtlasBarfed 5 years ago

            I've noticed from pair programming that the person navigating with a mouse is far less able to read and interpret their surroundings or pick up typos while typing, than an observer that simply has to watch what the other person is doing.

            Like when clicking on a file in a directory you just entered and looking for the file, the observer can literally locate and point to the file for the mouse user 5-10x faster than the mouse operator.

            The observer seems to interpret the information that results from the directory listing faster than the person who just did the double-click to enter the directory because they don't have the muscle coordination context switch and can immediately move to interpreting the results.

            It's probably because mouse manipulation uses brain infrastructure that is more recently evolved, but observe-react is a lot earlier in the brain processing pipeline evolutionarily, and a lot more refined/involved.

            • Fr0styMatt88 5 years ago

              Since I have a vision impairment, I'm sure the effect is amplified very much for me, but using the mouse is such a massive break in flow:

              - First you have to lift one hand up off the keyboard and put it down on the mouse. This may or may not mean taking your eyes off the screen.

              - Then you need to find the mouse pointer on the screen

              - Then you need to aim for what is usually a relatively small target and move the pointer there.

              - If you're right-clicking, the right-click menu usually presents more small targets you need to aim for.

              - If you need to use the keyboard, again you have to move your hand over to the keyboard from the mouse.

              For finding the pointer, I developed this unconscious habit of slamming the mouse pointer to the very top-left of the screen. It's difficult though when on someone else's machine, where your brain isn't used to the pointer velocity or where multi-monitor means that slamming the mouse to the top-left actually puts the pointer on another monitor.

              People look at me in awe when I'm using a two-pane file manager but honestly not having to take your hands off the keyboard and not having to move your eyes off the screen gives so much better flow. It's also why I like the UI of Blender - one hand on the keyboard and one hand on the mouse at most times.

            • yobert 5 years ago

              I think this is because writing software is so much more than operating switches and controls. I really hate pair programming for this reason, but I love industrial-style controls and protocols involving multiple people.

              • robaato 5 years ago

                Back in the '80s I worked on a financial system (SWIFT interface) for an Italian bank. It went operational and we observed 2 operators effectively doing "pair operating". We just thought it was weird Italian style socialising - one had the keyboard and the other was chattering away with a commentary. But they were surprisingly effective!

                I accidentally learned when teaching a course at a site with too many people for the available machines, that pair exercises was very effective - I got lots more questions and overall learning went way up. If the pair discussed it and couldn't find an answer they would have the confidence to ask. On their own, neither would probably bother and just wait for me to go through things.

        • Sharlin 5 years ago

          And it should be kept in mind that almost none of those procedures were intuitively obvious things to do. As the saying goes, safety standards are written in blood.

        • quercusa 5 years ago

          And then there's the interactions with the Air Traffic Control system: flight plans, Standard Approaches, charts, etc. It's very impressive.

  • acdha 5 years ago

    Back when I shelled into servers more, I really liked having my deployment put the environment in the prompt and set a red background on production for similar reasons. It only takes a small change to jar you out of habit.

  • YeGoblynQueenne 5 years ago

    >> I personally took it to heart, it's a good system for forcing a cache miss in the brain - make sure you're on "database production" or "database localhost" etc.

    Yeah, ouch. More ouch if it's the other way around: you delete the test database and it's not the test database.

    (long story)

    • kbenson 5 years ago

      > you delete the test database and it's not the test database.

      > (long story)

      I think you can skip the long story, as most of us can tell a story similar in theme if not specifics (and sometimes, probably some similar specifics too). ;)

      With great power comes great responsibility (to not completely screw stuff up because you were on autopilot for a second...)

    • throwaway894345 5 years ago

      I worked at a company where someone deleted the production database by accident and the snapshot mechanism hadn't been working AND the alerting for the snapshot mechanism was also broken. Fortunately someone had taken a snapshot manually some weeks prior and they were able to restore from that and lose relatively little data (it was a startup, so one database was a big deal, but weeks worth of data was not such a big deal).

      • txutxu 5 years ago

        I worked at a company where someone deleted the production RDS and all the snapshots.

        That included typing the confirmation and requesting that the snapshots be deleted.

        He had two browsers open, one for development (of CloudFormation, etc.)... but someone asked him to change a thing in prod.

        Both browsers looked identical. Only the account in the top right corner differed.

        Both CloudFormation stacks were identical (instance names, etc.).

        He had spent the whole morning launching and deleting the dev environment.

        Teammates were joking loudly around his desk just before it happened.

        Sadly, he got fired (the company was proud of its cost-savvy choices and had no backups other than a few days of snapshots, probably the CTO's choice).

        • Gene_Parmesan 5 years ago

          Firing the person who happened to be at the wheel when a mistake like this occurs never seems like the right choice to me, especially if their performance to-date had otherwise been good.

          Everybody has off days, or just instances where circumstances misalign in just the wrong way. To pretend otherwise is silly; instead, it's the leader's/team's responsibility to ensure that those sort of off days don't lead to massive losses via redundancy & the sort of measures we're talking about here & in the OP. Firing somebody in these circumstances just acts to severely reduce morale, since we all secretly know in our hearts that it very easily could have been us.

          Firing in this case just seems retributive. It's not going to bring the lost data back, and you've just eliminated the very person who could have told you most about the chain of events leading to the incident in question to help you guard against it in the future. These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues. A lack of team focus on reliability/quality, a lack of communication or trust about decisions made (or not made) by higher ups, or so on.

          And they are probably the single least likely person to cause a similar incident again -- that person will now likely be double and triple checking their commands for eternity.

          • jacobsenscott 5 years ago

            Agree. There is never a single cause to this kind of error. It takes a village. Someone didn't name things properly, someone else didn't store backups properly, someone else gave everyone root access to production, etc. It was inevitable the database would be deleted - doesn't matter who actually did it.

            If your CTO scattered those landmines all over then "not stepping right" is not an error. It just sucks.

          • greedo 5 years ago

            Sometimes. And sometimes they make the same mistake over and over.

            We had an admin in charge of our storage. He had worked with our old vendor's SAN for years, then we got a new SAN. Trained him/certified him etc. He "accidentally" shut down the entire SAN. That brought down the entire company for over 9 hours.

            Fast forward two years later, he screwed up again and caused a storage outage affecting about 1100 VMs. Luckily not much data loss, but a painful outage.

            Then a month ago, he offlines part of the SAN.

            Some people never learn, and recognizing this early is usually better than letting someone continue to risk things.

            • mehrdadn 5 years ago

              3 mistakes in... >2 years? I feel like it's really hard to tell if the problem is really the person at that point. Have you had others perform the same job for a similar duration to see if they avoid the same mistakes?

              • nitrogen 5 years ago

                If you made a list of every mistake each person makes in 2-3 years, and omitted all other detail, pretty much everybody would look like a terrible person. Context, frequency, etc. all matter.

                If particular systems or people are seeing a high frequency of mistakes, maybe the system design is at fault, not just the person. Obviously it's hard to do in practice, but the ideal is to design systems that are mistake proof.

              • greedo 5 years ago

                This is just the mistakes made in the SAN/Storage part of his responsibilities. As we used to say in World of Warcraft, "Can't heal stupid."

            • jodrellblank 5 years ago

              > "He had worked with our old vendor's SAN for years, then we got a new SAN."

              Great way to invalidate years of experience. Presumably from your telling of the story, he didn't cause problems with the old vendor's SAN?

              > "He "accidentally" shut down the entire SAN."

              So, was it an accident, or was it an "accident"? You can't have it being a mistake if you're also hinting it was deliberate and malicious.

              • greedo 5 years ago

                He was trained and certified on the new SAN, and surely some of his prior experience on the legacy SAN would translate. Just as moving from AIX to RHEL/CentOS wouldn't invalidate all your skills and experience.

                It was a real accident when he shut down the SAN the first time. I don't know why I put it in scare quotes.

          • Lex-2008 5 years ago

            > These incidents usually sound simple at the surface level ("I clicked the button in the wrong window") but often hint at deeper, perhaps even organizational, issues.

            These words reminded me of a story about similar-looking "flaps" and "landing gear" controls on a plane, where crashes were also blamed on pilots first, before a trivial engineering/UI solution was implemented: https://www.endsight.net/blog/what-the-wwii-b17-bomber-can-t...

          • Huggernaut 5 years ago

            Nickolas Means has an absolutely wonderful set of talks on themes like this. Particularly relevant here I think, is his talk: "Who Destroyed Three Mile Island?" - which goes through the events that occurred at the nuclear power plant, the systemic problems, and how to find the "second stories" of why failures occurred.

            https://www.youtube.com/watch?v=1xQeXOz0Ncs

          • _asummers 5 years ago

            There's a really good book describing this phenomenon called Behind Human Error. It speaks of "first stories" and "second stories" and how in analysis of incidents, it is all too common to stop at the first story and chalk it up to human error, when the system itself allowed it to take place.

        • ahoka 5 years ago

          "Both cloudformation stacks were identical (instance names, etc)."

          This is why it's a good practice to include the environment name in the resource names when it makes sense. Even better, don't append the env name, but use it as a prefix, like ProdCustomerDb instead of CustomerDbProd. I also like to change the theme to dark mode in the production environments as most management UIs support this. One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.
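
          A rough sketch of the prompt-coloring trick, written in Python only to generate the string. The `ps1_for` helper and the environment names are hypothetical; the bash prompt escapes (`\[`, `\]`, `\e`, `\u`, `\h`, `\w`, `\$`) and the ANSI color codes are standard:

```python
# ANSI foreground colors: 31 = red, 32 = green.
COLORS = {"prod": "31", "dev": "32"}


def ps1_for(env):
    """Build a bash PS1 value that shows the environment name in color."""
    # Unknown environments get the prod color, so a typo errs on the loud side.
    code = COLORS.get(env, COLORS["prod"])
    # \[ and \] tell bash the escape sequence takes up no width on screen.
    return rf"\[\e[1;{code}m\][{env}]\[\e[0m\] \u@\h:\w\$ "
```

          The returned string would be assigned to PS1 in each instance's shell profile, so a prod shell opens with a red `[prod]` tag before the usual user@host prompt.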

          • greedo 5 years ago

            I have my background colors configured for each environment so when I'm shelled into a server, I know exactly what I'm working with.

          • nitrogen 5 years ago

            > One other neat trick is to color code PS1 in your Linux instances, like red for prod, green for dev.

            This is definitely a nice one to add. Though I did work with someone once who believed that all servers should be 100% vanilla and reverted my environment colors.

            In container-only shops with no ssh, this is less of an issue, and instead you rely on having different permissions and automations for different environments.

        • YeGoblynQueenne 5 years ago

          That's very similar to what happened to me - except I didn't delete any backups, thank the Great Old Ones. And I didn't get fired.

          Basically, I had a habit of starting a new SQL Server Management Studio instance in its own window for each database I was working on. At some point this struck me as wasteful, for some reason, so I closed all my windows and opened all the databases in one window. Then sometime after that I went to delete the test database as a routine maintenance task, but of course I was used to clicking the database at the top of the left pane in SSMS, which was the test database when it was the only database in a window... but now happened to be the production database. Then five minutes later I got a call from the client company that used our system, to ask me if there was any maintenance going on because everyone's client had just crashed.

          The horror when I realised.

          It was educational, though. I don't think I'll make that particular mistake ever again. And my bosses were ace to be fair, probably because I worked my ass off to correct the mess that ensued.

        • shezi 5 years ago

          When I worked in production environments, I used to set up little Firefox userscripts that would add a banner or anything visual to the production site. It's entirely client side and easy to customize.

  • dheera 5 years ago

    > I've seen this called "pointing and calling" [1], Japan's train drivers use the technique to force themselves to perform actions and take notice of the current environment.

    The concept makes sense, though I don't quite fully get how to translate it to other contexts besides train driving where unexpected and unpredictable events come up all the time. Let's say you're driving a car and the traffic light turns red. Do you point at the traffic light, say "red", point at your brake pedal, say "brakes", and then hit the brakes?

    • apozem 5 years ago

      In high school, I drove a 1993 Toyota Tercel. It was a functional, reliable car, but it had no keyfob to lock the doors remotely.

      Getting out of your car, pressing the lock button on the inside of the driver's side door, and shutting the door are all routine, boring actions that make it easy to forget your keys inside the car. The keys can go in all kinds of places as you climb out of the car - jacket pocket, pants pocket, center console. It is very easy to lock your keys in your car.

      I quickly learned to hold my keys in one hand, say out loud, "Keys in hand," and then lock the door with the other hand.

      This technique is perfect for any repetitive action that could go wrong with non-trivial consequences, and there's lots of that in everyday life.

      • wruza 5 years ago

        I'm always using the "phone keys cigarettes money" mantra together with patting on my pockets before opening any outside door.

        • GauntletWizard 5 years ago

          I wake up in the mornings with "Shit Shower Shave" and leave the house with "Wallet Watch Testicles Spectacles". Simple mnemonics work, doubly so if you actually say them out loud and check them each off.

        • encom 5 years ago

          I do that exact same thing, and I haven't smoked in 3 years. The downside is that if I'm supposed to remember to bring something, in addition to those 3 things, I'm extremely likely to forget it. If it's super duper important, I tie it to the door handle.

          • blackboxlogic 5 years ago

            To remember to bring a physical object, I leave my keys on it. Downside, sometimes people will bring my keys to me when they find them in strange places, like the fridge.

      • scott_s 5 years ago

        That’s me approaching a blue mailbox with my letter to send in one hand, and my keys in the other.

      • edgyquant 5 years ago

        I just put a spare behind the license plate

        • Gene_Parmesan 5 years ago

          Definitely a good idea. In the subject of the analogy (software incidents) I think both should be done -- a regular and habitual focus on important/high risk commands via procedure, and preparations for the time when the inevitable still happens because people are people and it's impossible to fully predict all potential sources of unintended consequences. A lack of habitual focus when important consequences are at stake could lead to an over-reliance on the safety nets, and you really don't want your safety nets becoming routine. Otherwise you'll need safety nets for the safety nets.

    • kube-system 5 years ago

      Repetitive tasks are exactly what pointing and calling helps with. The intent is to prevent the brain from going on autopilot for a task that happens exactly the same way 99.9% of the time, in order to prevent disasters in that last 0.1% of cases.

      Traffic lights are a lot more random (and therefore mentally engaging) than the types of things train conductors are pointing and calling.

      An automotive equivalent of a situation that would benefit from pointing and calling is something like this: https://www.consumerreports.org/car-safety/guide-to-rear-sea...

      eg.: "Car parked, ignition off, get child"

    • Timpy 5 years ago

      Whenever I have something in my hand that I'm about to put down for a second in the exact absent minded kind of way that would leave me searching all over the house for it 5 minutes later, I say it out loud. "Headphones on the table by front door."

      • roland35 5 years ago

        Embarrassingly I once lost a hamburger while still holding it.. I had my arm propped up on the back of the chair and it was just out of my peripheral vision. Not my smartest moment.

        • adiM 5 years ago

          I lost my sunglasses when I was wearing them! We were going to a state park for a hike. It was a 2 hr ride for which I was wearing my sunglasses but forgot. As we came out of the car to start the hike, I spent 5 minutes searching for my sunglasses in my backpack until my friend asked what I was searching for .... Maybe I should be saying "sunglasses on" from now on

    • uranusjr 5 years ago

      I believe the trick is to anticipate failure, and call out the normal thing instead. So you’d always slow down at every light, and only speed back up after calling out green. This is what all drivers are actually supposed to do, although I fully realise nobody practically does that, which is why we get so many automobile accidents all the time.

      • toast0 5 years ago

        Only speed back up after calling out green and intersection clear.

        I don't necessarily always do that, and don't make audible calls, but when driving at night or in inclement weather, I try to make extra effort to check for unexpected cross traffic.

    • nemetroid 5 years ago

      The pointing and calling performed by Japanese train drivers is very much about expected events. "Green signal" would be one of the most common call-outs. For example:

      https://www.youtube.com/watch?v=afjPmN0GT04

      Green signals are pointed at at 2:58 and 3:29.

    • bo1024 5 years ago

      Your example is a reactive event. Something happened in your environment.

      This idea is more useful for situations that you are initiating, and where feedback is not immediately obvious.

      An example could be turning your car’s lights on at night. Before starting the car, you force yourself to point to the switch, say “lights on”, and do it.

      I use this with keys. When leaving my office, house, or car, I hold up the key in my hand and establish sight (I don’t say anything out loud). Then I lock the door.

    • notJim 5 years ago

      I'm a photographer, and I used to get annoyed that I'd have little distractions on the edges and corners of the frame, because I was focussed on the subject and overall composition. I trained myself to sort of bounce my eyes around the sides of the viewfinder when pressing the shutter (think like the DVD player menu). Now I almost never forget to check.

    • leetcrew 5 years ago

      I don't think it really applies to stuff like driving, which almost has to be muscle memory to work at all. even with something routine and non-urgent like switching gears in a manual, the steps have to happen faster than you can say what you're doing.

      a good example from normal life is (physical) key management. I used to always forget my keys when walking out the front door, which was a big problem since it locks automatically. to solve the problem, I made my back right pocket be the designated "key pocket". I now slap my right butt cheek whenever I leave a building. it might look weird to observers, but I have not once forgotten my keys since I implemented this system.

      • cecilpl2 5 years ago

        After losing my wallet several times and not having a clue when the last time I had it on me was, I implemented a similar system. I now habitually triple tap my three designated pockets for phone, wallet, keys, every time I walk through a doorway.

        That way, if any of them are missing, I know they must be in the room I just left.

      • cortesoft 5 years ago

        I do a "wallet keys phone" mantra when I leave a building.... has a bit of a melody to it that I always repeat

      • tsomctl 5 years ago

        I do that too. The important thing is to pat your pocket before closing the door. Twice now I've done it 2 seconds too late.

    • SkyBelow 5 years ago

      Invert it and I think it works. Always prepare to stop at an intersection. Then point out it is green and call out you do not need to engage in stopping.

      It may seem silly, but if we asked people who drive 30+ minutes every day if they have ever accidentally run a stop sign or red light, I suspect the numbers would be quite high (though it likely happens at times/places where the chance of an accident is smallest, such as empty roads late at night).

    • shezi 5 years ago

      I teach my children to point in the direction of where cars can come from before crossing the road. He used to just swing his head around before, now he has to search directions and point there to direct his attention and it works excellently.

      As others have pointed out, this is for repetitive tasks that your brain wants to automate away, but you really want to keep in attention.

    • hrktb 5 years ago

      It can be used for exactly the same purpose: checking the environment before doing the action.

      E.g. force yourself to read the “production” part of your prompt before running the command. Point at the user name before deleting its record. Read aloud the version name before sending it to deploy.

      It really makes a difference between just glancing at the info, and having to parse it as part of an action.

    • jrumbut 5 years ago

      Let's say you get a request to delete users #s 1, 17, 152, and 43.

      Now you can have the request and database administration tool open and point and call at the numbers and any queries and make sure you are deleting the right users.

  • saberdancer 5 years ago

    OpenShift does this by forcing you to write the name of the project you are about to delete. It was something that used to annoy me but reading this I understand it is a good call from their side.

  • rachelbythebay 5 years ago

    I do that when I drive around. Car on the side street. Kid over there... with a ball. Hidden left turner in 3...2...1... yep.

    I love finding out that this stuff works.

  • nailer 5 years ago

    I do things like

      const HARD_CODE_TEST_DATABASE_FOR_SAFETY = 'unit-testing'
    
      destroyDatabase(HARD_CODE_TEST_DATABASE_FOR_SAFETY)
    
    1. Avoid silly terms our industry should have ditched years ago, like 'drop'

    2. Making sure that nobody will ever change HARD_CODE_TEST_DATABASE_FOR_SAFETY because they thought it should 'always be the active database' or whatever.

  • justinlloyd 5 years ago

    I have had many disasters in my software career because I just wantonly hit "Y" without thinking about it.

    I have noticed, since learning to cook at a professional level in the kitchen, that I point and call out a lot more in my other activities too. "From hot behind" and "knife" and "oven is over temp" to "Saw blade is live" and "circuit is live" in the workshop to "production server" and "erasing records" in database maintenance. Some days I feel like Sigourney "I have one job damnit" Weaver in Galaxyquest. It's a useful stop-think-go sanity check.

  • uyt 5 years ago

    This is true for NYC subways too! https://www.youtube.com/watch?v=i9jIsxQNz0M

    • greenyoda 5 years ago

      The video doesn't really explain why conductors point at the signs - it just says "to prove they're paying attention". Paying attention to what? The answer is that they are verifying that the train is correctly positioned in the station so that all of the doors will open on the platform.

      Explained here: https://www.nydailynews.com/new-york/mta-conductors-point-st...

      • tialaramex 5 years ago

        This comes up every few weeks on HN but nobody has ever offered any statistics that would suggest this is as good let alone better than just having the trains handle alignment automatically. It's a task humans are bad at and machines are good at, so just giving it to machines makes more sense, modulo unions.

        London Underground hasn't had guards for decades at this point, and the Docklands Light Railway hasn't even had drivers (there is a member of staff who is trained to be able to drive it on every train, but they are usually doing other things) since its creation. If they're misaligning often enough for it to be possible for New York to be statistically better I haven't seen anything about it after repeatedly asking.

        • jpcooper 5 years ago

          Actually what exactly is the member of staff doing on the DLR that is necessary, other than answering tourists' questions and putting a triangular key into a receptacle at every stop and then turning it? I have not been able to figure this out.

          In the Netherlands, the NS has two types of trains that go between towns. Intercity and Sprinter. Sprinters have someone who will walk onto the platform at every stop, or failing that, lean out of the carriage, verify that no one is getting in, and then step into the train again to put the key into the receptacle and then turn it. Following that, the doors close. In contrast, there is no such person on Intercity trains; they do fine without. There may be a conductor who checks tickets. In comparison to the DLR, both Sprinter and Intercity trains have drivers.

          Is there some requirement or function that I am missing that requires a dedicated member of staff to perform this key-turning ritual at every stop on the DLR and Sprinter, or is this simply to appease the unions?

          It could be that Sprinters are meant to be more lenient towards people running to get on than Intercities, which might have a stricter schedule.

          • tialaramex 5 years ago

            It's a GoA 3 system, so it isn't designed to be safe without a human staff member on every train. There are GoA 4 systems which do not need a human but the DLR isn't one, so while it would seem to operate normally if you just let passengers operate the doors - when anything goes wrong those passengers are in trouble because the system design assumes a trained member of staff is there to fix it and now there isn't.

            That triangular key opens a panel by the front left seats of the train, which reveals a complete set of controls for manually driving the train which that member of staff is trained to use. If the GoA 3 system has given up when the train is just out somewhere random then "just get out" while technically possible since there's a walking route along the side at all times - is clearly not ideal even for able-bodied passengers, so in fact the member of staff will drive the train manually to a station unless obviously that's impossible somehow (e.g. terrorists blew up sections of track either side like a Hollywood movie).

            Because humans are bad at driving trains, they aren't allowed to move at full speed, they can either let the GoA 3 automation oversee everything (e.g. it won't let them go anywhere it wouldn't be willing to go) at a reduced speed or when that's not useful they can switch off all automation and move at a crawl with no oversight.

            Every morning the first train of the day on each route is driven in the first of those two modes, because overnight human maintenance teams sometimes manage to leave tools and equipment on the line and the automation doesn't know not to drive the train into a welding kit left on the track by some idiot who just discovered his wife is leaving him or whatever. So the human staff member's job is to drive the train (with the AI preventing them smashing it into other trains) while looking out the front window for problems.

  • viraptor 5 years ago

    I try to do that during incidents. I'm not 100% there since it's not a company rule, but it helps me at the time and later when writing up details: "I see <behaviour X>", "<Y> should fix it because <Z>", "I'm starting to do <Z> now and seeing ...", etc.

    It also helps when Z results in a total meltdown and you need to pull in more people to help out, so they have context of what happened.

    • Qu3tzal 5 years ago

      French firefighters do this when arriving at a scene. The first messages sent over the radio will say:

      - I am... (who you are and where you are)

      - I see... (describe what you see in simple non-ambiguous terms)

      - I do... (what action you are taking now)

      - I ask... (ask for reinforcements if necessary, you may be asked to justify yourself more)

  • xvf22 5 years ago

    Killed just under 1k access points when they all upgraded in one go. They had no problem erasing the firmware but when they all tried to download the new one at once it killed the service and we ended up with a lot of blank APs. The confirmation message for 1 or 1000 APs is unhelpfully "This will overwrite all existing system images. Are you sure Y/N"

  • m463 5 years ago

    > forcing a cache miss in the brain

    That is an interesting way of looking at it.

    I think a router analogy might be more precise - more like fast path / slow path - where when most packets come in they hit the fast path in hardware, and slow path exception packets get handled by the cpu.

    :)

  • ekanes 5 years ago

    I do this with my kids, gesturing (not pointing) as it helps my mind remain focused on truly listening to them amid everything else going on. I probably look ridiculous, but I'm a better father for it so ¯\_(ツ)_/¯

  • stjohnswarts 5 years ago

    I always called it a "that can't be right" interrogative.

xamuel 5 years ago

I wish it were possible for similar prompts to appear before all sorts of policy-makers and bureaucrats. "It appears you are about to institute a policy which will require 400 million patients to sign an additional waiver every time they visit a clinic, this will waste a total of 354,921 human hours within the next year alone. Please type 354,921 to proceed."

  • gumby 5 years ago

    The motivations are different: the cost to the rule maker of the effort by all those people is nil. While the cost of not adding the paper is the risk of something happening in the future which could cost them their job. This is why the shoe removal theatre was added to flying: the risk of something happening is essentially nil, but if it did, heads would roll.

    This is not a criticism of bureaucracy or regulation BTW (I'm a fan of both, in general). It's simply a recognition that there's a misalignment of objectives.

    Not sure how to analyze the calculus in the case of rachelbythebay's observation. Certainly there is one misalignment which is if the tool has sharp unprotected edges (e.g. can take the company's whole site down) the person who ran the program will be blamed, not the person who wrote it. Unless they are the same person, it's hard to get a proper feedback loop in place. The only tools we have are coding standards and code reviews: bureaucracy!

    • cortesoft 5 years ago

      In my experience, the protections are added after a Learning Review from an incident.

  • Joker_vD 5 years ago

    Yeah, it's quite surreal. "Hey, privacy is important, so let's make it so that to handle people's private data, you need permission from them". All right, now whenever you try to e.g. send a (paper) mail, you have to sign the waiver that yes, you do allow the post office to see and handle your name and your mail address. Not only that, all such waivers seem to be written as "I hereby allow <insert the legal entity> to handle my private data in whatever way they want to", so we're back to square one, just with more perfunctory paperwork required.

  • jackhack 5 years ago

    closely related: the Paperwork Reduction Act of 1995

    https://digital.gov/resources/paperwork-reduction-act-44-u-s...

    it requires the Office of Management and Budget to calculate the impact of record-keeping requirements on time and privacy, among other things.

    I do not believe it has resulted in a reduced record-keeping burden. For the most part I simply see an estimate of how long it will take to complete my tax forms and permits, on the form itself. Perhaps others have different views.

    • mulmen 5 years ago

      Hard to say, knowing the cost of a new process could have informed a new design or requirements. We don’t know what the other path held. But I believe in general having more information allows us to make better decisions so this is a good act.

  • mulmen 5 years ago

    How do you know it was a waste? Maybe that was time well spent.

harikb 5 years ago

I have a habit of creating cli tools, which potentially do dangerous things, to default to dry-run mode. For example, instead of the typical `--dry-run` or `-n` option, my scripts instead had a cheesy `--do-it` to be non-dry-run. It is annoying as hell to my colleagues, but saved the day many times.
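
A minimal sketch of that default, in Python rather than whatever the original scripts were written in. Only the `--do-it` flag comes from the description above; the tool, its `paths` argument, and the `run` helper are made up for illustration:

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(description="Delete files (hypothetical tool).")
    parser.add_argument("paths", nargs="*", help="files to delete")
    # Dry-run is the default: nothing is changed unless --do-it is given.
    parser.add_argument("--do-it", action="store_true",
                        help="actually perform the deletions")
    return parser


def run(argv, delete=os.remove):
    """Return a log of what was done, or what would have been done."""
    args = build_parser().parse_args(argv)
    log = []
    for path in args.paths:
        if args.do_it:
            delete(path)
            log.append(f"deleted {path}")
        else:
            log.append(f"would delete {path}")
    return log
```

Forgetting the flag costs one extra run; in the more common design, forgetting `--dry-run` costs the data.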

  • PureParadigm 5 years ago

    A coworker of mine would write all his bash scripts to echo out the commands it would run, and then to actually run it he would pipe it to bash. This way he could inspect the commands to make sure they were correct before running them.

    Something like: ./dangerous-script.sh $args | bash
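
    The same idea can be sketched in Python (an assumption; the coworker's scripts were bash). The hypothetical `plan_cleanup` only builds command text, and nothing runs until that text is piped to a shell. `shlex.join` does the quoting so paths with spaces survive the trip through bash:

```python
import shlex


def plan_cleanup(paths):
    """Return shell commands as text instead of executing them."""
    # "--" stops rm from treating a path that starts with "-" as an option.
    return [shlex.join(["rm", "-rf", "--", path]) for path in paths]
```

    Printing the result of `plan_cleanup(["build", "my dir"])` one command per line gives `rm -rf -- build` and `rm -rf -- 'my dir'`, which can be read first and piped to bash afterwards.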

    • GauntletWizard 5 years ago

      I would love a shell that allows you to “run” a script in manual mode - Where at the end of every command, every statement, it prints what the next command will be with all variables expanded or otherwise called out, and then requires you to hit “enter” to cause it to proceed. I write a decent amount of something between README and Shell Script. I’ve already got an awk one-liner that parses the shell out of Markdown. I typically copy+paste, line-by-line, from my README and add a bunch of echo statements to verify what I’m doing.

    • tomjakubowski 5 years ago

      Is your coworker Willard Van Orman Quine?

    • dredmorbius 5 years ago

      Same, or save to a file, temporarily, check that, then run the resulting script.

    • meesterdude 5 years ago

      wow that's so clever and simple! Love it.

    • jacobwilliamroy 5 years ago

      I also do this.

  • jiggawatts 5 years ago

    In PowerShell, this is a native feature of the entire shell and hence scripts and commands.

    The following prefix in a ps1 script enables the -WhatIf and -Confirm parameters:

        [CmdletBinding(SupportsShouldProcess=$true)]
    
    To enable -Confirm by default for scary scripts, just use:

        [CmdletBinding(SupportsShouldProcess=$true,ConfirmImpact='High')]
     
    The nice thing is that in PowerShell, unlike bash, this flows through to the vast majority of other commands. If the script has the snippet above, then you don't have to litter it with "if ( $userSaidYes ) { ... }" blocks all over the place.

    Similarly, PowerShell automatically wires up logic to produce all of the useful modes you might want:

        [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend
    
    This is very fiddly to implement manually, and "Suspend" is likely impossible for most shells.

    See: https://docs.microsoft.com/en-us/powershell/scripting/learn/...

  • yobert 5 years ago

    I did this with "--im-not-scared" for production mode :D

  • dmuth 5 years ago

    I do something similar with my scripts, but have `--go` action, even on a script that requires no other options, just so that if it's run without any options, the person running it gets a message saying what the script WOULD do, if `--go` were passed in.

    • hotsauceror 5 years ago

      I do the same thing. All of my scripts have a -defang parameter which walks through the entire process, including placeholder log messages, but not actually performing the operation. My run books always say to run your exact command with this switch first, to proofread it. For some dangerous scripts, defang is enabled and has to be manually turned off. Defang is also nice because it will tell you e.g. here’s the size of the backup you’ll be restoring, or the filepath you’ve composed based on your parameters, or confirming that you’ll be replacing an existing thing instead of creating a new one. It has saved me many, many times.

  • robaato 5 years ago

    Bash tip I picked up from observation - always start a potential command with #

    # rm -rf some_dir

    Then if you accidentally press return before completing it hasn't happened.

    When you have reviewed and are sure it is correct, you recall and delete the hash to execute - simples!

    • arendtio 5 years ago

      In my opinion, the option -r should only be allowed as the last parameter. Maybe with the exception of -f. Everything else is just f*ing dangerous.

      I mean, I use the # hack sometimes too, but when I don't, I find myself often being afraid of accidentally hitting the Enter key.

    • stjohnswarts 5 years ago

      I generally throw up a status report type of thing: "you are applying $this_operation to $this_many_machines on $this_farm. Continue (yes/no)?" and enforce typing yes/no in full. Anything other than yes is a no.
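      A rough Python sketch of that kind of prompt. The function name `confirm` and the injectable `read_line` parameter are assumptions made for illustration and testing; treating only the exact lowercase word as approval is one possible reading of "anything other than yes is a no".

```python
def confirm(operation, machine_count, farm, read_line=input):
    """Show what is about to happen; proceed only on a fully typed 'yes'."""
    prompt = (
        f"You are applying {operation} to {machine_count} machines "
        f"on {farm}. Continue (yes/no)? "
    )
    answer = read_line(prompt)
    # Anything other than the exact word "yes" counts as a refusal,
    # including "y", "YES" and an empty line.
    return answer.strip() == "yes"
```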

    • matart 5 years ago

      Does this work with autocomplete?

      • greenyoda 5 years ago

        Just tried it with bash on Linux, and apparently autocomplete works in a comment.

  • jrumbut 5 years ago

    Even having a dry run mode is exciting. Doesn't even have to give complete results just "I was planning to delete 3 files and create 7 files", gives a hint whether the command will blow up the system or not.

    • dingaling 5 years ago

      I wish SQL had a dry-run mode in updates and deletes for that reason.

      "Run it as a query first" gets 90% of the way until you drop a constraint by accident whilst rewriting it as an update :o

      • harikb 5 years ago

        For interactive queries / surgery, you do have an option with a transaction (begin/commit/abort).

        If it is Postgres (don't know about other dbs), you can go a long way using "savepoints" and "rollbacks" to have truly safe trial-and-error surgery on the db. Still dangerous, but quite helpful. I hate working on any other db without those features. Postgres also allows schema changes to be within a txn envelope.

      • vlunkr 5 years ago

        I've thought the same thing. I also wish SET came after where. I've done "UPDATE table_x SET something = true"; and then forgot the WHERE clause.

      • krab 5 years ago

        Transactions and rollback are the dry run. The problem is that if you keep the transaction open for too long, you will block other updates to the same data.

        • cableshaft 5 years ago

          Yep, I always write update queries as a transaction that rolls back, with some selects inside it to verify what the data looks like afterwards, before I switch it to commit. I primarily use Microsoft SQL Server right now, so I also use WITH (NOLOCK) to avoid conflicts between my query and other updates.

      • skymt 5 years ago

        Enough folks have replied that transactions are the way to go, but I just wanted to add that whatever interface tool you use for your database may have an option to force you to commit your transactions manually. For example PostgreSQL's default 'psql' shell has the "autocommit" option which, when disabled, requires you to manually 'commit;' before any changes take effect.

      • SkyBelow 5 years ago

        I think an improvement to SQL would be for insert/update/delete statements to require a where clause and allow for something like 1=1 if you really intend to hit all rows. A safer but even more invasive change would be requiring an explicit end to the where clause as well (to prevent selecting a few but not all constraints).

      • verve_rat 5 years ago

        Wrap it in a transaction and roll back the transaction at the end. Then remove the transaction when you are ready to do it for real.

        You can jam a select in the end of the transaction to check what happens.
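        A small Python sketch of the rollback-as-dry-run approach several replies describe. SQLite (via the standard `sqlite3` module) stands in for Postgres or SQL Server here, and the `accounts` table and the `run_update` helper are made up for illustration.

```python
import sqlite3


def run_update(conn, update_sql, params, check_sql, commit=False):
    """Run an update inside a transaction and return (rows changed, check rows).

    The transaction is rolled back unless commit=True, so the default
    behaves like a dry run.
    """
    conn.execute("BEGIN")
    try:
        changed = conn.execute(update_sql, params).rowcount
        rows = conn.execute(check_sql).fetchall()
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT" if commit else "ROLLBACK")
    return changed, rows


# isolation_level=None stops the sqlite3 module from opening transactions
# implicitly, so the explicit BEGIN above is the only one in play.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO accounts (active) VALUES (?)", [(1,), (1,), (1,)])

# The WHERE clause was forgotten: the dry run reports every row as changed,
# and the rollback leaves the table as it was.
changed, rows = run_update(
    conn,
    "UPDATE accounts SET active = ?",
    (0,),
    "SELECT id, active FROM accounts ORDER BY id",
)
```

        As noted elsewhere in the thread, the transaction holds locks while it is open, so on a busy database the check queries should be quick.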

      • cbm-vic-20 5 years ago

        MySQL has a command line option "--i-am-a-dummy" (aka "--safe-updates") for exactly this purpose.

        https://dev.mysql.com/doc/refman/8.0/en/mysql-command-option...

  • austinl 5 years ago

    I like this format in general, since it communicates that the command is severe/irreversible. Heroku implements a similar confirmation when performing destructive actions. Commands require you to pass a `--confirm ${APP NAME}` flag, so the original command itself does nothing. Of course, this doesn't prevent you including those flags in makefiles, etc. I once dropped a table in a side project by accident because I took the wrong tab autocomplete suggestion in a makefile.

  • leetcrew 5 years ago

    works great until some asshole puts

      alias harikb_script='harikb_script --do-it'
    
    in their .bashrc to eliminate this annoying step.

    • actuallyalys 5 years ago

      I suspect someone who'd do that isn't going to take that or other precautions seriously regardless of it being aliased. It's still a problem that they're circumventing it, but I think you have a larger problem if someone with that mindset has access to production.

    • xaedes 5 years ago

      This would help a bit: Don't accept the "--do-it" as first parameter, make it obligatory to be the last.

      • X6S1x6Okd1st 5 years ago

        If someone is a programmer and is trying to disable safety features, making it slightly harder to do so doesn't really seem like the solution.

      • _ikke_ 5 years ago

          my_command() {
              command my_command "$@" --do-it
          }

        • xaedes 5 years ago

          Good point. Stuff like this is why I wrote "a bit". Thank you for providing an example of why it won't be enough.

  • Xophmeister 5 years ago

    We've been known to use something like --yes-i-really-mean-it-this-time for really dangerous options. It's like a built-in solemnisation step.

  • vehementi 5 years ago

    I once came across one like this

    $ run-script.sh --dry-run

    `--dry-run` parameter not recognized

    Executing ...

roydivision 5 years ago

Reminds me of the proposal to keep the nuclear launch codes inside the body of an innocent volunteer, so the President would have to kill the person to get the codes.

https://boingboing.net/2015/12/11/proposal-keep-the-nuclear-...

  • chrisseaton 5 years ago

    I've never understood this idea.

    If you believe we should never use nuclear weapons, then don't have them at all.

    If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use? You could have a situation where everyone was agreed to use them but the president was physically unable to harm the aide to use them.

    You can know that something is the right thing to do but not have the courage to physically harm someone to do it.

    An interlock that you may not be able to unlock for reasons unrelated to the task at hand is a bad interlock.

    • shuntress 5 years ago

      >You can know that something is the right thing to do but not have the courage to physically harm someone to do it.

      In this specific case the "thing to do" is literally to harm hundreds of thousands of people.

      The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual. Otherwise, it is likely that dropping the bomb would be a mistake.

      • chrisseaton 5 years ago

        > The reasoning behind this proposed interlock is that any logic which concludes that it is moral and rational to harm hundreds of thousands of people must also conclude that it is moral and rational to harm the "interlock" individual.

        Yes, but you can know it's the right thing to do, but not be able to physically do it.

        The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not. Him being unable to do it tells you nothing about whether they should be launching the weapons.

        If the president fails the test that tells you nothing about whether the launch is the right thing to do. Doesn't that fundamentally make the test bad?

        • UncleMeat 5 years ago

          It isn't about testing if the president can do surgery. It is about forcing the president to look somebody in the eye before they kill them.

          • chrisseaton 5 years ago

            > It is about forcing the president to look somebody in the eye before they kill them.

            Right, but can you understand that 'the President being able to look somebody in the eye before killing them' is not a requisite for 'the employment of nuclear weapons being justified'?

            We require the president to be able to do B before they can do A. But what if A is the right thing to do but the President is not able to do B? Being not able to do B does not mean A is wrong.

            See the logical disconnect?

            • jai_ 5 years ago

              I think that is the exact point though. The point of the interlock is to force this logical dependency when there may not have been one before.

            • Aeolun 5 years ago

              Doing A cannot be the right thing to do if you think doing B is still impossible.

              If you cannot kill your friend to kill a few hundred thousand more, how can it possibly be justified? I just struggle to come up with a scenario where that is the case.

              Of course I’m of the school that thinks firing nuclear weapons is never a good idea.

              • chrisseaton 5 years ago

                But ‘know you should’ is not the same as ‘can physically bring yourself to do it’.

                • jai_ 5 years ago

                  But that is the exact point. Having a human interlock explicitly shifts the dependency. Knowing that you should launch nukes is no longer enough and being able to bring yourself to physically kill someone is the additional requirement that we are _deliberately_ adding to this process despite there not being an obvious logical link between the two actions before.

                • Aeolun 5 years ago

                  If you cannot bring yourself to cut one person open (for the good of all) you have zero business launching nuclear weapons (for the good of all).

                  I kind of feel like we’re going in circles though, so maybe better to just stop here :)

            • UncleMeat 5 years ago

              I believe it is a requirement. I believe that the natural bias would be towards using nuclear weapons when we shouldn't. I believe there is no possible world where the use of nuclear weapons is justified and the president couldn't also kill one additional person. I do believe there are cases where a president may use nuclear weapons when it isn't truly justified and that having additional checks will help prevent that.

        • lmm 5 years ago

          > The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not. Him being unable to do it tells you nothing about whether they should be launching the weapons.

          Our emotional systems are the product of millions of years of evolution and often (not always, but often) show better judgement than our "higher" faculties. Bringing that part of our capabilities into the decision-making loop is a very good idea.

        • shuntress 5 years ago

          >The president's ability to physically cut someone open is not relevant to whether it's a good idea to use nuclear weapons or not

          I'm sure if the president was physically incapable of wielding a knife, she would have someone on hand to do that for her.

          • chrisseaton 5 years ago

            I think the whole point of the 'rule' is that they have to do it themselves.

            • the_af 5 years ago

              I think it would work equally well if the president had two aides and had to order one to butcher the other, in front of her eyes, in order to launch a nuclear strike.

              Regardless of the exact details, I think the point of this thought experiment is that for a head of state, the decision to launch a massive attack that will cause hundreds of thousands of casualties can feel a little abstract. "Bombing a city" can seem abstract, even if the president understands this means killing children. Understanding is quite different from feeling. However, if the act of ordering a bombing raid on a city involved physically murdering a child, it would definitely feel more immediate and less abstract.

              Your point stands, of course. But the part about removing the abstractness of the act seems relevant when ordering people killed.

              • Aeolun 5 years ago

                I guess that’s exactly why we order a bombing raid instead of an invasion. We don’t have to deal with the consequences of that action so directly.

        • bee_rider 5 years ago

          Do you mean, like, the physical strength required to cut a person open? That seems like a very specific implementation detail.

          • chrisseaton 5 years ago

            Well exactly - lots of specific details of literally cutting someone open that aren't relevant for a considered application of nuclear weapons.

            But really I meant being able to 'bring yourself' to cut someone.

    • motoboi 5 years ago

      You put too much confidence in human reason.

      Everybody agrees that this is a nuke-them-all situation, but the president, having been given part of the task of ripping apart human bodies himself, thinks more about the subject and decides another diplomatic round is a better option.

      • FactCore 5 years ago

        I think that's the point. I'm personally not an advocate of this because it seems to be a little too "beat you over the head" with its moral metaphor, but the whole point is that the President should have to personally kill someone to understand the gravity of what they are about to do.

        From the perspective of an advocate I'd say: If they can't come to terms with killing one, who are they to execute hundreds of thousands?

    • jodrellblank 5 years ago

      > "If you believe there is a case where it may be moral and rational to use nuclear weapons, why would you want to put a potential barrier in the way of their use?"

      Because you think the point where they become moral and rational to use is way way way further than commonly discussed, and you want to put many barriers of many kinds (physical, emotional, logistical) to delay their point of use without completely blocking them.

      You could also say that if a person is incapable of doing the hard parts of the job, don't vote them into the position. (Downside of that is that you'll end up voting in someone who doesn't mind killing someone in cold blood while expecting that to be a filter that brings more empathy to the position).

    • greggman3 5 years ago

      > If you believe we should never use nuclear weapons, then don't have them at all.

      Tell that to Russia. In the short amount of time only the USA had the bomb the USA bossed them all over with threats of using it.

    • gumby 5 years ago

      > I've never understood this idea.

      It's an attempt to make an abstraction concrete. Think of it as the trolley problem in real life.

      Stalin is famously supposed to have said, "one death is a tragedy, 100,000 is a statistic". Cynical or not it is how humans think.

      > If you believe we should never use nuclear weapons, then don't have them at all.

      Strategic game theory and Mutual Assured Destruction depend on the possibility that the other guy will use them if you do, and may be the only way to prevent their use. Interestingly this is one reason why you want the other guy to know your procedures, capabilities, deployments etc. Secret weapons have no deterrent value.

      • chrisseaton 5 years ago

        > Think of it as the trolley problem in real life.

        Well exactly... doesn't that show you that it's a bad idea? People don't know if they could bring themselves to throw the switch even if everyone thinks it makes rational sense.

        You're taking a rational, well-considered, strategic decision... and making the interlock a messy personal emotional one unrelated to the actual issue at hand. That sounds like the wrong way around to be doing things?

        • gumby 5 years ago

          > Well exactly... doesn't that show you that it's a bad idea?

          I don't think so, no. Sometimes we think too abstractly and make what turn out to be poor decisions. Emotions are really valuable heuristics and should be harnessed at a time like this.

          • dodobirdlord 5 years ago

            Absolutely not, mutually assured destruction only works if both sides know that the other is committed to carrying out a retaliatory strike in the minutes before their death. It’s essential that the person in the position to order a retaliatory strike be someone ready to kill hundreds of millions of people for no reason other than the fact that they said they would. Putting emotional barriers between that person and the codes they need to carry out that enormous responsibility just makes it less likely that they will be able to follow through. If there’s sufficient uncertainty about whether there will be a follow-through then the nuclear arsenal loses its deterrence factor and we’re back to having to live with the fear that our rational enemies may carry out a first strike on us.

            • Aeolun 5 years ago

              > Absolutely not, mutually assured destruction only works if both sides know that the other is committed to carrying out a retaliatory strike in the minutes before their death.

              Not really. You would need to be absolutely certain that the other party won’t carry out a retaliatory strike before they’re destroyed.

              The only thing that matters is that the other party is capable of indiscriminate destruction, not the certainty they’ll actually do it.

              It’s like punching someone holding a gun in the face.

      • mikewarot 5 years ago

        Trolley Problems are themselves a bad idea... the Kobayashi Maru is a similar exercise. I, like Kirk, don't believe that there are situations that can't be worked around if there is time to think, and resources to act.

        • the_af 5 years ago

          Isn't the Trolley problem a situation that is, by definition, time sensitive? If you had more time to think and resources to act, it wouldn't be a Trolley Problem.

          If the answer to launch-nukes-by-cutting-a-human-aide is "well, I need more time to think" then maybe that's a good outcome?

  • networked 5 years ago

    It's the 1980s, and the United States implements this policy. What happens on the Soviet side? After the United States' announcement the Soviet press and Soviet sympathizers worldwide gasp loudly in horror. "How cruel are Americans, really? Is the barbaric act of murdering and butchering an innocent young man the only thing still able to keep their president from destroying our Earth?"

    The Soviet General Secretary soon receives a report about what the new policy means tactically. Americans will take several extra minutes, possibly more, to authorize retaliation. (The exact delay is subject to disagreement. Secret experiments are conducted to get the timing down. They are inconclusive.) Amid the decade's mounting tensions, a preemptive nuclear strike looks more tempting than before.

  • benlivengood 5 years ago

    Too bad sociopaths and narcissists are more common in positions of power. All it would do is uselessly kill a volunteer.

    Time is also of the essence for MAD; known delay only makes MAD less effective if e.g. sub-launched cruise missiles are faster than dissection. And do all the fallback commanders need their own willing victim to mount a response?

    • Aeolun 5 years ago

      I dunno, Putin? Yes. Trump, shmaybe? Obama, not really.

      I guess that’s why they consider the idea here and not there.

dgritsko 5 years ago

Similar idea as GitHub's "type the exact name of this repository if you want to delete it" confirmation dialog. Maybe that's really what you want to do, but in case that's not actually what you meant to do, having a few extra hoops to jump through seems like a good idea.
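A minimal Python sketch of this "type the exact name" pattern. It is not GitHub's implementation; `require_typed`, `delete_repository` and the in-memory `repositories` set are hypothetical names chosen for the example.

```python
def require_typed(expected, what, read_line=input):
    """Proceed only if the operator types `expected` exactly."""
    typed = read_line(f"Type {what} to proceed: ")
    return typed.strip() == str(expected)


def delete_repository(name, repositories, read_line=input):
    """Remove `name` from the in-memory `repositories` set if confirmed."""
    if name not in repositories:
        raise KeyError(name)
    if not require_typed(name, "the exact name of this repository", read_line):
        return False
    repositories.remove(name)
    return True
```

The same helper covers the article's variant: `require_typed(len(machines), "the exact number of machines")` forces the operator to notice how many machines the command is about to touch.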

  • Hokusai 5 years ago

    > having a few extra hoops to jump through seems like a good idea.

    I think that there is more to that. You need to consciously type the name of the repo that you want to remove. Windows used to add a lot of hoops to get something done, and the result was mindlessly clicking the "yes" button and realizing 1 second later that you deleted important information.

    Those extra hoops need to be cognitively meaningful.

    • Cthulhu_ 5 years ago

      Yes, and infrequent; the main issue with Windows (Vista mainly) was that it appeared far too often. Even with 7, when you're setting it up for the first time for example, I think it shows up too often.

      Same with Terms & Conditions. If you want your customers to truly have read and understood them, you have to show them a short quiz at the end of it. You're required to do a quiz in Europe nowadays if you want to engage in stock trading.

  • segfaultbuserr 5 years ago

    Some disk management software also has "type the exact label of this partition to reformat it" to prevent accidental data loss.

  • wjdp 5 years ago

    Do you type the repo name, or just copy/paste or select/middle click it?

    Half of me would want them to put `user-select: none` on that text. The other half has to archive 10+ repos and would hate that!

  • edanm 5 years ago

    That's what I thought of immediately as well! I've seen that pattern in a few other places too, and I always think it's a really good UX choice.

luhn 5 years ago

One of the largest AWS outages to date was caused by a scenario like this. [1] A mistyped command removed too many servers from an S3 subsystem, overloading the remaining servers and crashing the subsystem. The failure snowballed until the entire S3 region was down, which then caused issues with dependent services like EBS, ALB, and Lambda. They couldn't even update the status page because that also depended on S3.

[1] https://aws.amazon.com/message/41926/

  • HenryKissinger 5 years ago

    I remember that. The AWS dashboard was all green checkmarks... because the red icons the dashboard was supposed to display were stored on the crashed servers.

  • jodrellblank 5 years ago

    >"overloading the remaining servers and crashing the subsystem. The failure snowballed until"

    the entire Eastern Seaboard was without power?

    https://youtu.be/XetplHcM7aQ?t=693 (James Burke's Connections, ref. cascading power cut 1965)

jasonpeacock 5 years ago

Raskin talks about the futility of this in his book The Humane Interface.

Basically, what happens is the brain switches operating context from "I want to do something" to "resolve this interruption (confirmation box)" and you don't relate the one to the other - you're so focused on getting rid of the interruption that the original task is forgotten until after the interruption is gone.

Then you switch back to the original task that had been interrupted by the confirmation box and then you realize you made a mistake.

It's much better to engineer "undo" ability into systems - like delaying commands (GMail's "Undo Send" does this), or caching previous state, etc.
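A rough Python sketch of the "delay the command so it can be undone" idea. This is not how GMail implements Undo Send; the `DelayedAction` class and its method names are assumptions made for illustration, built on the standard `threading.Timer`.

```python
import threading


class DelayedAction:
    """Run `action` after `delay_seconds` unless undo() is called first."""

    def __init__(self, action, delay_seconds):
        self._action = action
        self._lock = threading.Lock()
        self._state = "pending"  # pending -> fired | cancelled
        self._timer = threading.Timer(delay_seconds, self._fire)
        self._timer.daemon = True
        self._timer.start()

    def _fire(self):
        # The lock makes "fire" and "undo" mutually exclusive decisions.
        with self._lock:
            if self._state != "pending":
                return
            self._state = "fired"
        self._action()

    def undo(self):
        """Cancel the pending action; returns False if it already fired."""
        with self._lock:
            if self._state != "pending":
                return False
            self._state = "cancelled"
        self._timer.cancel()
        return True

    def wait(self, timeout=None):
        """Block until the timer thread has finished (fired or cancelled)."""
        self._timer.join(timeout)
```

The window only helps for actions whose effects can be held back for a few seconds; once `_fire` has run, the action is as irreversible as it would have been anyway.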

  • andrewflnr 5 years ago

    That's exactly why it's not a "confirmation box", but requires you to slow down and think for half a second. She even talked about mitigating copy-paste, which is the next obvious way people could habituate.

    Also, while undo is great, it's not always technically feasible. The tools in question are basically for modifying the layer that implements undo for your end users, and are often themselves fundamentally irreversible. Undo for raw hard disks involves forensic analysis at best.

    • jasonpeacock 5 years ago

      The problem (I probably didn't paraphrase Raskin well) is when you slow down & think for a half a second, you context switch from "I need to do operation" to "I need to make this dialog box go away".

      No matter what tasks are required to make the dialog box go away - doing math, retyping a message, clicking a randomly ordered box - that becomes the top task in your head and you "forget" about the original task until you finish this task.

      Once you resolve the interruption, you switch context back to the original task and then you still have that "oh crap" moment.

      Yes, sometimes undo is very difficult, and can require a system designed to support that ability as a first-class feature from the start. Many systems you can perform rollbacks, but there are definitely destructive actions - in which case you should have test stacks to validate your actions in advance, and peer review. (e.g. dual keys to launch the missiles)

    • robaato 5 years ago

      Or you have commands which randomly reverse the meaning of the confirmation prompt:

      Continue: yes or no?

      Don't continue: yes or no?

      As long as operators know to expect this, they also know to wait and actually read the prompt before answering (as in turn off auto reaction)...

bronco21016 5 years ago

It amazes me that something like this can be done by a single person.

In aviation any time input is given to the machine, it's entered by one human (typically pilot flying) and then verified by the other human (typically pilot monitoring) before being committed to or executed. For example... when a new altitude is assigned by ATC, say FL300, the pilot flying will spin it in the selector window and keep his hand or finger there until the second pilot agrees with and confirms the selection by reading FL300 out of the selector window.

I know there are meat bags in these giant tubes so that changes attitudes towards safety etc. However, it seems to me that when organizations start putting the power to halt nearly the entire business in the hands of one person, there should be some slightly different attitudes. A breaking change in a million servers could easily cost hundreds of thousands or maybe even millions in lost revenue or employee productivity.

I'm just an outsider though. Perhaps this level of attention is practiced at some shops. It's just interesting to me how in some fields we settle on pretty uniform standard practices whereas others are seen as non-human-life threatening so it's just shoot first, ask questions later.
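A small Python sketch of what a tool-enforced version of that read-back could look like. `PendingCommand` and its methods are hypothetical; the point is that a second person confirms the command and the tool then runs the approved text itself, not a retyped copy.

```python
class PendingCommand:
    """A command that must be confirmed by someone other than its author."""

    def __init__(self, author, command):
        self.author = author
        self.command = command
        self.approver = None

    def approve(self, reviewer, command_as_read):
        """Record approval if a different person read back the same command."""
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own command")
        if command_as_read != self.command:
            raise ValueError("read-back does not match the proposed command")
        self.approver = reviewer

    def execute(self, runner):
        """Run the exact command that was approved."""
        if self.approver is None:
            raise PermissionError("command has not been approved")
        return runner(self.command)
```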

  • rachelbythebay 5 years ago

    Best practice for using the "weaponized" version of the tool when you had powers to actually hit all of them at once was to paste the command into IRC and get some of your fellow peeps to eyeball it and make sure it was sane.

    <me> team: hey, sanity check this please: hsh -A "dumb_thing && other_thing --foo --bar"

    <teammate> shipit

    [ I type the command ]

    <me> ok, running as job 1234

    The last part was a courtesy done so that they could watch the progress of it too without having to dig to find my request. It also meant they could kill it easily if something went wrong and they couldn't raise me for some reason.

    Tools like this are best used outside the solo realm.

    • im3w1l 5 years ago

      I think an automated tool would be preferable since there is no 100% foolproof guarantee that what you type in irc is the same as what you type in the terminal.

  • crispyambulance 5 years ago

    > It amazes me that something like this can be done by a single person.

    In many dysfunctional orgs, having someone to blame is desirable. They will use all kinds of words for it like "accountability".

    But at the end of the day, heroes who take stupid risks that succeed get rewarded, cautious people who ask questions and try to understand before acting are smugly dismissed, and would-be heroes who burn the house down because of recklessness get blamed and make everyone else look good. It's all too common.

  • cle 5 years ago

    In shops where stakes are high, it’s not uncommon to do just like you said—have mechanisms that force someone else to verify what you’re about to do, before you do it. If someone else can’t verify, the tool will block you. It’s similar in spirit to requiring code reviews on all shipped code.

illumin8 5 years ago

This is a great idea, and I'd like to point out that having such a system in place would have prevented one of the largest Internet outages in recent memory - the Amazon S3 outage in 2017: https://aws.amazon.com/message/41926/

> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

  • zedpm 5 years ago

    It's kind of funny, since various operations performed in the AWS web console use this model (e.g. type the name of the resource you're trying to delete). As an organization, they're aware of this approach and think it's useful, but (presumably) didn't use it in their own internal tooling.

educationcto 5 years ago

Terraform prints out the number of resources changed and at least requires a "yes" to proceed. Not quite as onerous as described, but it at least prevents some types of fat-fingering. Basically all changes with Terraform are risky as they usually involve bringing up and down infrastructure.

    Terraform will perform the following actions:

      # google_compute_instance.vm_instance will be created
      + resource "google_compute_instance" "vm_instance" {
      + ... <more>

    Plan: 2 to add, 0 to change, 0 to destroy.

    Do you want to perform these actions?
      Terraform will perform the actions described above.
      Only 'yes' will be accepted to approve.

      Enter a value: yes

  • caymanjim 5 years ago

    This is exactly the problem the author is referring to. With Terraform, you always type "yes" to proceed, so it turns into muscle memory. You stop reading the output, and you're already typing "yes" before you even see the prompt. Terraform's output is also verbose, and many changes show up as "1 to add, 0 to change, 1 to destroy" because it doesn't separately list a "replace" category. It's pretty bad; you've got cognitive overload, a confusing output summary, and a predetermined continue answer. And this is often an action you're performing under duress. I've been bitten by it plenty of times.

  • brodouevencode 5 years ago

    IaC is a real time saver, but inherently dangerous.

remram 5 years ago

A similar system is molly-guard [1], which replaces the reboot/halt/poweroff/... commands with scripts that make you type in the name of the machine before proceeding. Avoids shutting down the wrong machine because you forgot where you SSH'd.

[1]: https://manpages.debian.org/buster/molly-guard/molly-guard.8...

  • b6z 5 years ago

    Many years ago, I made that mistake two or three times, rebooting the wrong machine. Since then, I use molly-guard on all my remote machines. Never happened again.

Darkphibre 5 years ago

Reminds me of when the Fortune 50 company (150k employees) I worked for rolled out new firewall restrictions that blocked the DNS port.

To all machines. Employee and servers alike.

Yes. Including the DNS servers.

Took them a day or two to work out how to roll that one back.

  • zamadatix 5 years ago

    The first use of a new security product my manager insisted we roll out (as a duplicate to an existing tool from another group) was to quarantine a change in a system file that seemed to be spreading through all of the PCs.

    Except the change was to quarantine explorer.exe which was being changed with a patch that just got pushed out. The net result was about 6 hours of the desktop group wondering "why the hell are all of the PCs not logging in right after this patch" followed by about a month of rolling tickets from seldom used computers that had just been powered off.

    His excuse was it only showed a file hash in the main screen and you had to view details to see the name plus he had a 3 day change open to roll out the system. Never understood how he got away with that one but such things did catch up to him about 2 years later.

tialaramex 5 years ago

So, related obviously correct designs:

1. Git's Force-with-lease. Git push's "force" is too powerful, you will likely regret this much power, but it's tempting. So force-with-lease is the same power but conditional on you telling git what exactly the state was that you're overriding.

This has two benefits, one is like Rachel's, it is an opportunity for a human to stop for a moment and consider, wait, why are we overriding this state? To find out what it is we might as well read... oh the state says it's an "emergency fix. Call Jerry". Maybe, just maybe, I ought to call Jerry before I force overwrite it?

But the other is about race conditions which Rachel doesn't specifically address. If you are very careful to check that the state you want to overwrite with force is indeed a state that should be overridden, nothing prevents it meanwhile changing and then you overwrote state you didn't even know existed. But force-with-lease fixes that because your lease won't match.

I believe Force-with-lease is a pattern that ought to be far more widespread. I've used several configuration management tools that let somebody say "Temporarily don't mess with config on these machines" and some of them let you write a reason like "James is rebuilding the RAID arrays" but none of them have that force-with-lease pattern that would let me say "I know James is rebuilding the RAID arrays, this change must happen anyway but if anything else is blocking the change then reject it and let me know".

2. Prefer Undo to Confirmation. If the computer can undo the action, even if that's a bunch of work and you'd rather not bother, put that work in and enable undo. Humans always know they "really" wanted to do the thing you're asking them to confirm so it's somewhat futile to ask, but they often realise they didn't want to afterwards and will undo it if you make that possible.

Not everything can be undone. Undo factory reset isn't a thing. But for lots of things you can't undo, it was just laziness; try to do better in your own software. Your users (which might include you) will be grateful.
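
The force-with-lease idea in point 1 is essentially compare-and-swap: the override only goes through if you name the exact state you believe you are overriding. A minimal in-memory sketch follows; `GuardedState` and `LeaseMismatch` are made-up names, and this is not how git implements it.

```python
class LeaseMismatch(Exception):
    """Raised when the current state is not the state the caller expected."""


class GuardedState:
    """Holds a value that can only be overwritten by naming its current value."""

    def __init__(self, value: str):
        self.value = value

    def force_with_lease(self, expected: str, new_value: str) -> None:
        # Overwrite only if the caller correctly states what is being replaced.
        if self.value != expected:
            raise LeaseMismatch(
                f"expected {expected!r} but found {self.value!r}; not overwriting"
            )
        self.value = new_value


state = GuardedState("hold: James is rebuilding the RAID arrays")
# Succeeds because the caller names the hold that is actually in place.
state.force_with_lease("hold: James is rebuilding the RAID arrays", "config v2 applied")
```

If somebody had replaced the hold with a different one in the meantime, the call would raise `LeaseMismatch` instead of silently clobbering it. A real tool would also need the compare and the write to be atomic on the server side, which this single-process sketch does not attempt.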

  • coder543 5 years ago

    Related but semi-random: it slightly annoys me that force-with-lease goes through the entire effort of force pushing if it thinks the remote is identical to the local. It’s not going to change anything either way, and it could save me the second or two of waiting on it to do nothing. If local is already identical to the last known state of the remote, and I’m trying to force push, the actual error is that I didn’t edit the local branch in the way I thought I had when I decided it was time to force push.

    (I realize there is a possible error message case if the remote has changed... but I don’t feel like this command is the best one to use to discover whether the remote has changed, if you have no changes you actually intend to force push.)

vondur 5 years ago

That may have helped when Emory University's IT dept. accidentally sent a wipe and reformat command using Microsoft's SCCM to all of the Windows computers and servers on campus back in 2014. https://it.slashdot.org/story/14/05/17/051214/emory-universi...

kbenson 5 years ago

This is a topic near and dear to my heart, as I'm often that person arguing to make something slightly less automated because the small trade-off in time is insurance against some of the worst mistakes you can have. Automation to the point of removing humans leads to stupid problems that a human wouldn't make if they looked at what was going on. So we automate to the point where we minimize human contact, presenting a summary of actions that as humans we can apply our wonderful brains to and prevent those problems. Except some percentage of the time we don't actually pay attention, and depending on how the human interaction was introduced instead of complete automation, some percentage (or multiple!) of errors still sneak through.

Automation to the point of minimal human contact where you assume the human will read the presented information and make an informed decision doesn't work. The point is that we want a human to understand what is being asked, so taking some step to ensure they do understand is warranted. It will never be perfect, but adding steps like the ones she proposes is definitely a step in the right direction, IMO.

rossjudson 5 years ago

This resonates with me. Years ago I took down a service in a cell accidentally (Googlers might empathize: never 'borg' when you meant to 'borgcfg'). If I had been asked to enter the exact number of tasks I was about to nuke, I might have thought twice ;)

  • scottlamb 5 years ago

    I've certainly deliberately downed an enormous number of tasks, though, as part of a cluster turn-down. I love the technique of requiring the operator to echo a key fact, but in the case you're describing I think the key fact is not how many tasks but that they're serving live traffic. So:

    * You could ask the operator to echo the qps figure...but really any number other than zero is likely to be an error, so it can just error out in that case without needing the confirmation.

    * Even if it is serving zero qps now, if it's not explicitly drained at the load balancer, downing it is likely to be a mistake. So even better to check that.

    Only once in my career have I taken down jobs serving live traffic. (They were serving 100% errors.) It was deliberate, but even so I wouldn't have minded having to supply a --yes-i-know-im-downing-live-jobs.

    edit: and if for some reason my assumption is wrong and downing undrained things becomes routine...well, you'd want to fix that, but as a short term measure going back to confirming a number rather than the force option would be appropriate. It's certainly not good to have an override that's routinely used.
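
The two checks in the bullets above can be sketched as a precondition that errors out rather than prompting. The function name and its inputs are hypothetical; a real tool would fetch qps and drain status from its monitoring and load balancer.

```python
from typing import Tuple


def takedown_allowed(qps: float, drained: bool) -> Tuple[bool, str]:
    """Decide whether a job may be taken down, with a reason for the operator."""
    # Any live traffic at all is treated as an error, not something to confirm.
    if qps > 0:
        return False, f"refusing: job is serving {qps:g} qps"
    # Zero qps right now is not enough; it must be explicitly drained.
    if not drained:
        return False, "refusing: job is idle but not drained at the load balancer"
    return True, "ok: job is drained and idle"
```

The deliberate "yes, I know it's live" case would be a separate, explicit override on top of this, not the default path.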

    • jeffbee 5 years ago

      The way we approached this on my SRE team was semi-manual with improved ergonomics. We embedded the live traffic graph in the turndown tool, so it would be right in your face before you took the destructive action. Of course it was always possible to go one level down on the tooling and do everything manually, but it wasn't the usual way.

      • scottlamb 5 years ago

        Seems reasonable, but as you might have seen, rossjudson did accidentally-ish go to a lower layer: he wrote "never 'borg' when you meant to 'borgcfg'". And you're still relying on someone actually looking at the graph in their face which isn't as sure a thing as it'd be if they had to echo something back as Rachel is advocating for.

        (For the benefit of non-Googlers/Xooglers: borg is a lower-level tool mostly used when everything else has gone wrong and borgcfg is a higher-level, more routine tool. These days people often layer things on top of that as well, because we love piling up abstraction layers. This approach is completely successful because abstraction layers never leak and solve every problem without making anything hard to debug at all. /s)

        In my ideal world, even the lowest layer a human ever uses would do safety checks by default. Eg, imagine if the job specification included "query this safety check service on change" and the borg tool (as part of querying the existing job on a cancel/rm command) discovered that and honored it. Most people/jobs would use a safety check that fails taking down a job unless the load balancer reports all relevant services have that job drained. The safety check service could also specify a confirmation prompt (similar to what Rachel is advocating) that could be customizable (like qps or percent of global capacity rather than just number of tasks). The safety check would be effective no matter what layer you use, and there'd be no good reason to use one that would cause prompt fatigue. The outage rossjudson described (and I know he's not the only one who has done exactly this!) would have been avoided.

        • jeffbee 5 years ago

          I really agree with your philosophy here but I've never been able to perfect it in practice. The imperfection comes from the way there is inevitably some mapping of things to other things by name. I can ask a load balancer whether clients of a service are being sent to a named capacity or not (i.e. is the thing I want to remove "drained") but that doesn't rule out the possibility that another service maps a different name to the same backend and I forgot to integrate that name with my automation. Also impossible to rule out that a client exists which bypasses or ignores the advice of the load balancer. Having visibility into caller identity helps a lot with this kind of problem but outside of Google there is a scary word called "cardinality" which prevents people from monitoring the whole caller×server space.

          • scottlamb 5 years ago

            I agree you can never reach perfection. I expect there'd still be postmortems with "Our safety check was missing/bad" in the "what went wrong" section for various project-specific technical reasons. But I'd expect there to be (a) fewer such postmortems, and (b) an action item to fix the job's safety check service specification and audit the team's other ones, rather than the rather inexcusable IMHO "this tool doesn't support those, /shruggie, maybe schedule more training about which tool to use".

gabeio 5 years ago

I do like this idea, this is I assume why github makes you type the repo name out in full. I wish AWS followed suit, when deleting any RDS (database) instance on AWS all you have to type is "delete me"... very easy to copy and paste as well as just know what you need to type and be on autopilot. I have even poked support about it and their response was underwhelming.

jaclaz 5 years ago

Side question.

How many/which companies have more than one million Linux machines?

  • notacoward 5 years ago

    At least Facebook (where OP worked), Amazon, Google, and Microsoft. Probably Netflix, maybe Apple. There might be a couple more, but no more than that because we've already accounted for a pretty high percentage of worldwide shipments for servers, disks, etc. Fun fact: when you're that big, your demand creates its own inflation and you have to consider that in projections.

    • kube-system 5 years ago

      If by "machine" we also mean things outside of a 19" rack, I would wager that large telecoms probably have way more devices running Linux than FAANG. Imagine the network of cable modems that Comcast alone must operate. What percentage of their 28+ million broadband customers rent Comcast owned/managed modems? Almost all of them except the tech-savvy crowd? And that's just one device type.

      • InitialLastName 5 years ago

        Not to mention the networks of cellular base stations worldwide that run extremely sophisticated systems (if not Linux itself).

    • jaclaz 5 years ago

      Thanks, so a handful at most, and the "usual" ones, I always thought that those companies keep their machines connected in (redundant) "sets" and that a command affecting all of them was more a case for "never" rather than "once in a while".

      • jeffbee 5 years ago

        Google, at least, has a thing that is supposed to prevent widespread disruption at the machine level, called the "Safe Removal Service"[1]. This is a good idea that in practice isn't perfect. If you write a tool that does not consult SRS, or your service doesn't declare a SRS policy, there can be surprises.

        A particular outage that I will never forget took out Gmail delivery worldwide in an instant, because the change was not expected to be disruptive and therefore did not integrate with SRS. As it turned out the change disabled the machines where it was applied, and the process of selecting a subset of machines to canary the change was not independent of the way in which Gmail assigns services to machines, so in the space of a few seconds they created a global outage.

        https://twitter.com/bgrant0607/status/1134536670504554496

  • abnry 5 years ago

    The number blew me away. But does she mean in one location or VMs?

    One million is a lot no matter how you slice it.

    • rachelbythebay 5 years ago

      How do you define one location? If it's like, a contiguous plat of land with a bunch of buildings, each containing suites, and each of those containing clusters... then these days, yeah, that's probably not too much of a stretch.

      And yeah, physical machines, not VMs. Sometimes they're blades, sometimes they're sleds, but I mean real hardware made out of metal that you can pick up and use to defend the datacenter if you have to.

      (Although, honestly, I was talking about global counts in the million+ range when I wrote it since it was referencing the past, but by now, a region with a million+ is not far-fetched.)

Ayesh 5 years ago

I have an old laptop with a dead battery, and for a BIOS upgrade, it prevents me from updating without 50% battery.

I have to type "danger" to bypass this restriction, and I thought it was pretty cool.

Another good UI pattern is in Firefox, that it disables the Run button on downloads for a few seconds.

  • duskwuff 5 years ago

    Disabling the "run" button for a few seconds was actually done to mitigate another risk -- sites cueing the user to click in a particular location, then triggering the confirmation dialog with the "run" button right where the user was about to click.

ineedasername 5 years ago

Oh god this would have saved me so much stress once. It was early in my career, and part of my duties was to run a merge/purge process on dupe records.

I'd select the dupes for merge using a checkbox, but the vendor's interface for this just had a "confirm" button. So, I confirmed. However I'd selected the "select all" box and.... confirmed. Merging every. single. record. into one (1) record.

I was fortunate, the vendor was able to roll back the changes, and nothing was lost. I also had a very good mentor-like boss who avoided reaming me out before we knew if there was a solution or not, and when there was he simply told me "I'm sure you've learned your lesson, but don't do that again."

aqme28 5 years ago

Nitpicking

> "This might be as simple as printing the number with your locale's version of numerical separators, like "123,456" or "123.456" or "123 456" or whatever else you might use where you are. The trick is then to NOT accept that as input, but instead demand that they remove the separator and jam it in as just digits. "

It's easier to just strip non-digit characters than to parse the input for them and respond accordingly. This is a confirmation step with basically a checksum, so you're not going to get many false positives.

  • Kerrick 5 years ago

    Stripping the non-digit characters would allow "123,456" to validate instead of only accepting "123456" -- which defeats the whole purpose of printing the number with numerical separators (to prevent copy/paste).
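
A minimal sketch of that distinction: show the count with separators, but accept only the bare digits. The function names are made up for illustration, and the comma is just one locale's separator.

```python
def display_count(n: int) -> str:
    """Format the count for the prompt, with thousands separators."""
    return f"{n:,}"


def count_confirmed(typed: str, expected: int) -> bool:
    """Accept only the exact digits; pasting the displayed form must fail."""
    return typed.strip() == str(expected)
```

With 123456 machines the prompt shows "123,456", and typing "123456" passes while pasting "123,456" does not.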

nemo1618 5 years ago

Notably, Discord does something like this when you @everyone in a large channel: "You're about to push a notification to 12,000 people, are you sure you want to do that...?"

  • pwinnski 5 years ago

    Sounds like a yes/no answer is expected? If so, that is exactly what Rachel is suggesting is not enough.

    • jerf 5 years ago

      In this case, usually the very fact that a popup unexpectedly popped up is enough. I use Konsole as my main shell, and like several other shells now it has a "You're about to paste 100KB, yes/no?", and I don't mindlessly click "yes" because it is already a "cache miss" to see that dialog at all.

  • raverbashing 5 years ago

    Slack should take a note of this. Especially for rogue @here notifications

tigger0jk 5 years ago

I've typically used pdsh https://github.com/chaos/pdsh for these types of commands, and I don't think they have any such safety options. The only protection is to be wracked with fear whenever you type pdsh. Obviously this fear wanes with use, and eventually you don't think about a command for long enough before you do it and hit enter on a regrettable one.

cle 5 years ago

Even better than you confirming your own action, is someone else confirming it. If the stakes are high, require two people to turn the keys, instead of just one.

rcarmo 5 years ago

This reminded me that a few years back I worked at a place where (notoriously) Puppet would occasionally go over some random box and remove access to people, just because.

Or to all the machines, on one occasion.

(It was actually some sort of race condition when we massively updated per-project access permissions and asked for SSH keys to be redeployed, but it was annoying as heck, and sure to happen whenever you really needed to access that particular machine.)

lqet 5 years ago

Github has been doing this for quite a while now when you try to delete a repository - you have to type in the exact repository name to confirm.

  • bmaupin 5 years ago

    Which I always mindlessly copy and paste...

    • jraph 5 years ago

      But maybe this is enough? I do this too, but this gives me time to actually read the repo name twice. It's way better than a confirm button for me.

      I'm sure it would also wake me up from autopilot. But I don't do this often so I can't really know. It seems like this is good enough for many people, who don't perform this action too often.

    • coder543 5 years ago

      If you really think that’s an issue, pasting could be disabled for that input field. Would that make you happier?

      It hasn’t been an issue for me, since repo names aren’t usually super long and onerous to type.

      • gruez 5 years ago

        > If you really think that’s an issue, pasting could be disabled for that input field. Would that make you happier?

        many (most?) HN users probably have that disabled, because too many sites abuse it to block password managers, for "security reasons" .

        • coder543 5 years ago

          I disagree with the many/most. Many/most are probably using uBlock Origin, which doesn’t try to prevent things like blocking pasting (to my knowledge). I’m sure some are using NoScript-like features... but that’s not the same as specifically preventing websites from preventing paste. It’s just a sledgehammer. I can’t name an extension to do that one task (and/or similar tasks) off the top of my head, and I’m reasonably familiar with discussions in these parts. uBlock Origin is known to be very popular, unlike an obscure “allow paste” extension. But, that’s just like, my opinion... as they say.

          The point I was making is that copying and pasting seems like more effort than just typing the repo name. Do you commonly encounter long, inscrutable repo names? Do you delete repos frequently enough to have built up the habit of copying and pasting the repo name into the delete box?

          If it is common enough, disabling paste would actually benefit the user based on the premise of the article.

temporallobe 5 years ago

This is similar to a UI solution a colleague and I came up with. The action the user could kick off was unstoppable and irreversible (a large batch job), and it seemed like even a confirmation prompt was too easy to simply click through. So we had the UI present a modal dialog asking the user to type in a specific word in all caps to confirm the action. Worked like a charm.

  • D-Coder 5 years ago

    I did a similar thing with a Star Trek program many years ago. One of the commands (22? 23?) was to detonate the warp engines in the hope of taking the enemy with you.

    After hitting the wrong number once, I added a confirmation that presented a random six-digit number that you had to enter before it accepted the command.
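
A rough sketch of that kind of random-number challenge, with hypothetical function names (the original program's code isn't shown here):

```python
import secrets


def new_challenge() -> str:
    """Return a random six-digit string, zero-padded (e.g. '004217')."""
    return f"{secrets.randbelow(1_000_000):06d}"


def challenge_passed(challenge: str, typed: str) -> bool:
    """The command proceeds only if the operator retypes the challenge exactly."""
    return typed.strip() == challenge
```

Because the number changes every time, the answer can't become muscle memory the way a fixed "yes" does.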

TravHatesMe 5 years ago

Reminds me of a study done where a test was given with questions that weren't difficult but likely to make a silly error. Around 85% of participants got at least one question wrong, but when they repeated the same test with a difficult-to-read font, that number dropped to ~25% or so. That's another way to make your brain work, use a terrible font.

  • apricot 5 years ago

    > That's another way to make your brain work, use a terrible font.

    And suddenly my complex analysis prof who wrote his exams in Comic Sans is vindicated!

willvarfar 5 years ago

I am so adding this to a query api I have, where it's all too easy to leave off constraints and end up asking for massive data sets by mistake.

Thinking I can probably enhance it by forcing the user to type in the number as text rather than numeric, so they can't cut-n-paste. Kind of force them to type in "I am sure I want all data ever" or something.

  • recursive 5 years ago

    I don't think this is useful for an api. This is only useful when humans are the direct user of the component. Automated users, like those of an API will dutifully provide the required safety value.

mcintyre1994 5 years ago

AWS sometimes does something similar to this like “enter the name of the thing you’re trying to delete to confirm”. I think it makes sense because you can have such a huge difference between how much you care about certain s3 buckets or CloudFormation deploys etc. In true AWS fashion it’s inconsistent between services though.

  • nucleardog 5 years ago

    To their credit, even if it’s unintentional, every time one of those screens pop up I have to stop and think about what I’m doing because every screen wants something different from me!

heelix 5 years ago

Back in the Spiderman 2 days, I worked for a content management company that was supporting a really, really big website. I believe they were playing host file games for Stage/Prod. Was in the room when they demo'ed something, did a restart of the system - and every pager in the room went off. Yah...

Cthulhu_ 5 years ago

I for one can't fathom any organization managing a million devices / servers / VMs / whatnot. I'm having enough trouble with one, and my biggest employers had maybe a few dozen at best, and they already had a dedicated ops team that worked mainly with infrastructure-as-code.

woliveirajr 5 years ago

Once I had to deal with some software-RAID in Linux (mdadm it is), around 2007. There was some -force option that would just print information explaining what it would do and, to perform the real action, you needed to type another flag (that should never be revealed).

Edit: added name of software

andrewfromx 5 years ago

i've done this before by displaying unix epoch and asking the user to copy/paste that value WITHIN a 3 second window as an env var. i.e. if you up arrow and run same TIMESTAMP=1603827448 ./foo it won't work because 1603827448 is now way too old.
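
Roughly, the freshness check could look like the sketch below. The function name and the `now` parameter are assumptions added so it can be tested; the real script would read the value from the TIMESTAMP environment variable.

```python
import time
from typing import Optional


def timestamp_is_fresh(supplied: str, now: Optional[float] = None, window: int = 3) -> bool:
    """Return True if `supplied` is an epoch value at most `window` seconds old."""
    if now is None:
        now = time.time()
    try:
        ts = int(supplied)
    except ValueError:
        return False
    # Reject stale values (shell history replays) and values from the future.
    return 0 <= now - ts <= window
```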

  • myroon5 5 years ago

    One of the main benefits is explicitly acknowledging relevant context. Timestamps don't provide additional relevant context

sidpatil 5 years ago

Hmm, it's conceptually like a combination of a CAPTCHA and a launch code.

vsnf 5 years ago

I do this with a git pre-push hook to the main branch of my repositories. It displays a prompt in red and forces me to type in the name of the branch.

The result of one too many mindlessly accidental pushes.
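
A sketch of the part of such a hook that decides whether to prompt, assuming the hook is written in Python and that "refs/heads/main" is the protected branch. Git feeds a pre-push hook one line per ref on stdin, in the form "local-ref local-sha remote-ref remote-sha"; the function name is made up.

```python
from typing import Iterable, List, Tuple


def protected_targets(
    stdin_lines: Iterable[str],
    protected: Tuple[str, ...] = ("refs/heads/main",),
) -> List[str]:
    """Return the protected remote refs that this push would update."""
    targets = []
    for line in stdin_lines:
        parts = line.split()
        # Each line is: <local ref> <local sha> <remote ref> <remote sha>
        if len(parts) == 4 and parts[2] in protected:
            targets.append(parts[2])
    return targets
```

If the list is non-empty, the hook would ask for the branch name and exit non-zero on a mismatch. Since stdin is already used for the ref list, the answer would have to be read from the terminal (for example /dev/tty) instead.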

regularfry 5 years ago

I've seen this implemented as "Please type: My username is $USERNAME and I will not cry over spilt milk" but that was more to guard against support tickets.

diebeforei485 5 years ago

I'm thinking this could also be useful for cases where colleges mistakenly email all applicants saying they'd been accepted, when they in fact had not been.

gitgud 5 years ago

> "I've worked at a few places that had a large number of Linux boxes. I'm talking about well over a million."

A few places!? What is an example of this?

ComodoHacker 5 years ago

In role-playing games, it's a common practice to confirm deletion of your character by typing in some word, like 'delete' or character name.

bnastic 5 years ago

Promise Pegasus (thunderbolt storage) comes with a GUI that does the same thing - to shut it down you have to type “CONFIRM” before clicking the button

Animats 5 years ago

Yes. Github does that when you delete a repository. You have to confirm by typing in the name of the repository you are deleting.

larrik 5 years ago

I've seen this sort of thing in a few places, and I really do think it's a great idea.

RobRivera 5 years ago

Having babysat my fair share of critical clusters, i support this advice

wotton 5 years ago

Marketo, the marketing automation platform, does this when you try to do things to large data sets, very useful.

konjin 5 years ago

Finally the Roman numeral converter I programmed in university will be useful.

eznzt 5 years ago

Debian already does this, it asks you to type something like "yes do as I asked" if you want to remove a package that is considered to be part of the core.

jerf 5 years ago

https://news.ycombinator.com/item?id=24907002

Looks like https vs http link.

jancsika 5 years ago

It would be neat to print out an esoteric error that gets a single result in Google, where the "forum" in the result has a rando answer about using a certain esoteric flag.

Then you search the logs to see who is trying the command with the esoteric flag and "fix the glitch with payroll" for those employees.

JoeAltmaier 5 years ago

Makes it harder to nest that command inside a script - you have to parse out the number and paste it back? Or do I misunderstand - should it still prompt the user in the middle of the process when that step arrives? That would be problematical if it were included in a web page or whatever.

  • ccakes 5 years ago

    The very point of this is to make it difficult to do what you’re describing.

    If the tool could potentially touch a large number of machines, even if you’re super sure you got it right you should still prompt the user

    • JoeAltmaier 5 years ago

      Or write a script that carefully calculates the number of machines and gets it right. I guess you wouldn't use this prompting script then?

  • larrik 5 years ago

    I believe this would be as part of the script you are writing, not the scripts you are calling.

  • rad_gruchalski 5 years ago

    Hopefully there’s an API to fetch that count :)

outworlder 5 years ago

> 1221425541 machines will be affected

"Do you care? (Y/N)"

Cattle, people. Not pets. Just make sure you don't hit all machines simultaneously and are rolling, instead.

Since the post is talking about automation anyway, assume that any machine that can go down will go down. Ensure that any such disruption will be minimal. Oops, you just killed the production database? Whatever, who cares, it has just failed over anyway (or, for a distributed one, a new node was elected, data started replicating, etc).

If one considers having to SSH to a machine to be an anti-pattern, it's amazing how much crap goes away.

In the more generalized case, where it's not about machines, then it makes more sense. Maybe you are running a query that's going to perform updates across multiple clusters. It still should not be done by hand with direct production access - unless you are in the middle of a declared (and urgent!) incident and everything is on fire. In which case there's a bunch of people watching over your shoulder (or more likely, screen sharing in a conference call).

The same job you have (hopefully) run in QA you should be able to re-target to production. Make the question just be a way to "unlock" your automation - for instance, by not copying credentials or environment information until the proper confirmation has been received. One should still have an escape hatch for when (not IF) things go wrong.
