It's about what broke, not who broke it
I work in the nuclear industry, where most places are pretty good about maintaining a "blame-free" culture. You focus on what processes and procedures failed, what controls were missing, etc., that allowed somebody to make a mistake.
As this attitude was adopted, things shifted too far (at least in the opinion of industry groups, and in my observation) - so far that people underperforming to the point of negligence weren't blamed, and the corrective actions to prevent recurrences of the problems they caused ended up being cumbersome and expensive without really improving safety. (And in this industry, everything relates back to safety.)
In recent years, things have shifted back towards a more pragmatic middle ground. There are tools to assess whether a problem was organizational (and it still almost always is) or whether there was some element of personal negligence involved. This follows an industry-wide trend of trying to fix the real problems that affect safety and operations rather than over-engineering cumbersome corrective actions.
Well, that's just it: the fact that underperformers aren't recognized is also a symptom of the process, and the fix is to fix the process, not to throw the process away.
Every problem is organizational, even those caused by individuals, because it's the organization's job to recognize and remove those individuals where appropriate.
In a bad organization, many people who would otherwise perform at satisfactory levels begin to underperform. One signal of this is when individual responsibility is removed from the equation.
I enjoy organizations where individuals' expectations are clearly defined, and I prefer it if there are consequences for missing those expectations, because I feel like it increases the reliability of the team.
> where people underperforming to the point of negligence
Well (to play devil's advocate just a bit) - isn't the ultimate goal of a robust end-state process one in which people can not just underperform, but be replaced completely?
Only if the technology exists to do so.
Given the above discussion about a nuke plant, think of all the complexity inside it:
- monitors
- alerts & triggers
- valves
- compressors and other rotating equipment
- fire safety
- electrical systems
All of those have to be checked, tested, maintained, and fixed on a periodic basis over their lifetimes.
An incompetent person (or group) will eventually be the cause of something.
Maybe 100 years in the future we'll have self-operating nuke plants, but doubtful in my lifetime because of the incredible scale of complexity.
What are the tools you use to assess whether a problem is organizational or personal?
My manager at work, in particular, has the reverse attitude, where the person who broke it is more significant than what broke, how we fixed it, or how to avoid it in the future. I have seen people get taunted for a bug they caused two years ago, a bug which didn't affect any revenue and was pretty easy to fix. And of course it still gets pointed out during appraisals.
It's a nightmare, because there's no room left for experimentation anymore. Everyone just sticks to the template, afraid to do more than required, never deleting unused code, etc. An attitude like this never, ever helps!
We don't touch production. We don't upgrade. We are a X million company, we can't afford the risks.
These are some of the excuses they put up.
And then they sit 10 years or more with that bad stuff in there, building even uglier ways around it.
But eventually the time comes to actually do something about it. And what was once a one-day job becomes "we will hire a consultancy firm to guide us".
Are you my coworker? Because you sound like him /s
No, really, I appreciate risk management, but when it cripples your ability to make decisions, innovate, or otherwise ACT on information that could help you be a more efficient team, and the development team becomes a room full of people doing nothing but maintenance for years on end, people leave and companies fail.
I just watched that very thing happen this year to my company for exactly that reason. Someone with the word "senior" in their job title was so risk averse that the market caught up, passed us and started eating our lunch.
God help them because I can't do it anymore, and writing on the wall says they'll be closing up shop this fall. I'm out the door for good at the end of the week.
Ha. "Outsourcing of blame"!
Outsourcing of blame - as a Service. Where's my VC???
Accenture. You've invented Accenture.
Damn.
(Second thought - did they patent it? "A system and method for reallocating blame and responsibility for business related negative outcomes. The negative outcomes include a plurality of career limiting moves and hastily made decisions." ... )
code rots, whether it is being used in production or not, and should be consistently refactored and updated to accommodate the current status quo
So much so that I don't think "refactoring" should even be a concept - that's what coding is. Every feature, bug fix or change should take into account the new relationships and structure of the model. Leaving refactoring for later is ignoring the fundamental required task of coding.
Eh, refactoring is distinct from feature development, prototyping and testing.
They are different modes of thought.
This must be how data leaks/vulnerabilities happen.
I'm going to say this -
When it is all said and done, if you fucked up, you should get some shit for it. However this should be good natured, YOU should be laughing at it and everyone else laughing WITH you.
No one has the right to demand what I laugh about or not. There is enough shit I have to take regularly that I have zero desire to have to pretend to laugh at it.
The discussions about everybody's mistakes should be open, with emphasis on everybody, but leave mockery out of it. Inform everyone when they make mistakes without mocking them or attacking their egos, and keep it factual. Don't assume everyone is friends with everybody, nor that everybody is happy; it is not true. The line between laughing at it and laughing with me is thin and oftentimes muddled.
Relaxed laughing at mistakes is a result of good teamwork, but you don't get to good work by demanding that people accept being laughed at or mocked.
Fully agreed. No one should "get shit" for mistakes on their job. Mockery has no part in productive organizations.
In the quiet isolation of a one2one, perhaps.
Publicly, i.e. in front of the team? No. It serves no purpose other than to stroke some egos and reduce the "Overton window" of development discussion and experimentation.
To me the one2one situation would feel like I’m surreptitiously getting bollocked for making the mistake, under the guise of humor. In public, on the other hand, I don’t care if someone takes the piss out of my code - I’ll laugh right along with them as long as it’s not overly malicious. IMO it sets a good example not to be precious about your code.
> if you fucked up, you should get some shit for it
If you're intending that you (and the rest of your team) should learn from your mistakes, then I fully agree.
Yes, and it should be good natured... not at someone's expense.
Hence the reason YOU should be leading the laughter.
There is a case that laughing indicates that the stressful situation has passed:
http://mentalfloss.com/article/69830/why-do-we-laugh-when-we...
The importance of this person is the part where I agree.
But defining this person as important is independent from the act of loading this person with guilt and with financial, career-relevant, or social sanctions. That loading makes access to the important knowledge actually difficult, because no one will want to admit errors and share how they happened and what could have been done to prevent them...
Sounds like it's past time to go job-shopping.
Unless you’re on a visa, I don’t understand why anyone would tolerate this. Why not leave this company?
I took down an assembly plant by clicking on a Network status icon from a particular hardware supplier.
Over the weekend, firmware patches were applied, and the server rebooted. After reboot, everything worked fine, so the tech marked the change successful and went home.
Well, apparently the NICs would work just fine, but not all settings were applied until you opened the UI provided by the vendor. When you opened the UI, the final settings would be applied, and the NICs would reboot, just long enough to kill TCP connections.
That loss of TCP connection killed the parent system, and then all the child systems also died when the parent died.
So who would you even blame there? The guy who set the tripwire? The guy who tripped on the tripwire? The guy who designed a system that could be brought down by a momentary loss of connection?
I'm lucky that my boss wasn't the type to point fingers, because I was the guy who was there when it happened, and it sure got a lot of attention.
> [...] not all settings were applied until you opened the UI provided by the vendor. [...] the NICs would reboot, just long enough to kill TCP connections.
The UI part suggests that it was Windows, and if it was, it's not quite a matter of the NICs being down "just long enough" to kill TCP connections, since you would normally need quite a lot of downtime to terminate a typical TCP session.
In Windows, if a NIC goes down, all the TCP connections that use the NIC get closed immediately. (Or at least this was the case a few years ago. I had a similar system with similar drawbacks deployed back then, though it was an automated warehouse, not an assembly plant.)
> So who would you even blame there?
The idiots who designed the system to run on a non-industrial-grade operating system. Windows was never a good choice to control industrial installations.
Windows is often the only vendor-supported choice for interfacing your computer applications to PLCs and such things. Also most of the proprietary protocols run over industrial ethernet are some kind of legacy serial (232, 485..) bytestream format wrapped in TCP and the software usually does not handle loss of the TCP connection particularly gracefully. (on multiple occasions I've seen rules like "reboot the whole installation on every shift change" to "handle" the obvious reliability issues of such systems)
It is not about some small and well defined set of "idiots", it is essentially industry-wide design mistake.
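(To make "handle the loss of the TCP connection gracefully" concrete, here is a minimal sketch in Python of the kind of reconnect-and-resync loop that bridge software often lacks. The address, port, and poll command are hypothetical placeholders, not any real vendor protocol.)

    # reconnecting_client.py -- sketch of tolerating a dropped TCP session
    # instead of needing an installation-wide reboot. The host, port, and the
    # poll command are made-up placeholders, not any real PLC or vendor protocol.

    import socket
    import time


    def handle(reply):
        # Stand-in for whatever the bridge does with the data.
        print("got", reply)


    def poll_forever(host="192.0.2.10", port=9600):  # TEST-NET address, not real
        while True:
            try:
                with socket.create_connection((host, port), timeout=5) as sock:
                    sock.settimeout(5)
                    while True:
                        sock.sendall(b"READ DM100\r")  # made-up legacy-style command
                        reply = sock.recv(4096)
                        if not reply:  # peer closed: NIC bounce, reboot, failover...
                            raise ConnectionError("connection closed by peer")
                        handle(reply)
                        time.sleep(1)
            except OSError as exc:
                # Log, back off, reconnect, and resynchronise instead of falling over.
                print(f"link lost ({exc}); reconnecting in 5 seconds")
                time.sleep(5)


    if __name__ == "__main__":
        poll_forever()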
> Windows is often the only vendor-supported choice for interfacing your computer applications to PLCs and such things.
Which is not a problem by itself, since a PLC, being industrial equipment, should operate independently from non-industrial equipment. The problem is idiots who think a desktop PC can reliably control a PLC in real time.
The problem is when you have some kind of process that is inherently controlled not by the logic in the PLC but by some external system (either because the required data will not fit into the PLC's data memory or because the data constantly change based on some external business processes).
A reasonable architecture for this kind of problem would be attaching some server to the PLC as a peripheral, but it tends to be done the other way around. As for the reasons for that, I speculate that it is simply the inertia of the typical PLC programmer, compounded by reasoning along the lines of "nobody does that, so it is not tested and we will hit unknown bugs in the PLC firmware itself".
Is that a reference to Beckhoff?
> In Windows, if a NIC goes down, all the TCP connections that use the NIC get closed immediately.
Yes, that seems more likely.
I think Windows can be a decent platform for light industrial applications - which this system in particular was. The problem is all of the partners and suppliers were either stuck in the past, or had weird ideas.
The parent system was *nix based, but there was a flaw in a communications protocol that led to the channel bouncing between two boxes and eventually bringing down the parent system.
My lesson from that was that you can have flaws on any system, no matter how solid the OS.
One view that a lot of your colleagues may have had is that you just made clear to the company how relevant their jobs are (I am assuming most of the systems were built in-house), and that decisions made in the interest of expediency can now be revisited in order to scope out additional work.
Shhhh... stop telling secrets.
We put in place a rule that has made our system very strong over the years: we don't care if you broke the site, just fix it quickly and, more importantly, write a test that will catch the same problem if it happens again.
Every time someone breaks something, we get harder to break.
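(For concreteness, a minimal sketch of what one of those tests looks like, in Python/pytest style; the incident date, the function, and the bug are invented for illustration.)

    # regression_tests.py -- a sketch of the "every breakage buys us a test" rule.
    # The incident date, the function, and its bug are hypothetical illustrations.

    def order_total(items):
        """Sum price * qty for a cart. Imagine an earlier version indexed
        items[0] unconditionally and blew up on empty carts in production."""
        return sum(item["price"] * item["qty"] for item in items)


    # One test per incident, named so the history stays searchable.
    def test_regression_2018_03_07_empty_cart_totals_zero():
        assert order_total([]) == 0


    def test_regression_2018_03_07_normal_cart_still_works():
        assert order_total([{"price": 250, "qty": 2}]) == 500


    if __name__ == "__main__":
        # pytest will collect these; running the file directly also works.
        test_regression_2018_03_07_empty_cart_totals_zero()
        test_regression_2018_03_07_normal_cart_still_works()
        print("regression tests pass")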
Sounds like your system is antifragile.
Robust. The word is "robust". We don't need to promote buzzwords.
I think there's a worthwhile distinction between 'robust', meaning 'able to resist stresses', and 'antifragile', meaning 'able to react to stresses and become stronger.'
I used to think this way until I started working with someone who was nearly always the one who broke it. At some point we just had to face the fact that his work was unreliable even after significant mentoring.
If the tasks were difficult that would be one thing, but I'm talking about stuff like committing code to prod that was clearly never even executed once.
Sure and those people exist. However what you're really looking for is remorse + understanding of the magnitude of the issue.
If you have those two things, then someone is already motivated to learn from what happened and will probably never make that mistake again (which is true of the large majority of engineers, in my experience).
Most mistakes aren't problematic. But while we blame the code, not the writer, it serves well to quietly keep a counter of "problematic errors" and to keep an eye on the people who increment it the most. After a while, and after a pattern has been established...
You could also have a counter of problematic areas: some parts are easier to break than others, and could be improved/made more robust...
> You could also have a counter of problematic areas: some parts are easier to break than others, and could be improved/made more robust...
Oh yes. Nothing I said should be read as precluding that. High-risk components cause more fallout when they break, that's how it works. Or fragile components break easily. Sure.
But it's wise to keep an eye out for people who are seriously problematic and to establish a pattern of problem-causing, for the purposes of remanding to HR.
Sounds like you have a code review and automated testing problem and not a bad coworker problem.
Sounds like both.
We have both of those set up; the guy is just an asshole.
In that case, the "what broke" was the hiring process and the fix is them leaving that role.
I do a lot of open source work and unfortunately a very common poison is focusing on “who broke it,” which is especially disparaging when done in public. A particularly nasty habit is when outsider Alice opens a GitHub issue saying “xxxx is broken” and developer Bob replies with “yup, @Charlie’s commit fubar’d everything.”
Unfortunately both very demoralizing and very common.
Demoralizing - why? That seems like an attitude problem on the part of “Charlie”, not “Bob”. If “Charlie” is going to slink off with his tail between his legs every time he makes a mistake, he’ll have a tough time of it - it’s not like everyone can’t SEE that he broke it through version control anyway!
I just don’t really get it. Even when I was a junior, if I overheard “this thing is broken,” I was the first to pop up and say “oh, I bet that was me, let me have a look.”
I’m with you 100% except you’re not taking into account what I said about this being in public and Alice not being a part of the project. Internally assigning blame isn’t the issue, it’s about the “team” facade being shattered when dealing with the outside. If you’ve accepted Charlie into the organization then from without it isn’t about Charlie or Bob; the answer should be “yes, we’re aware; a recent commit broke that functionality and we’re working on fixing it.” I’m not even talking about a dev mailing list or GitHub PR discussion, I’m talking about the specific case of badmouthing a developer to an end user.
Imagine if Apple came out and said “yeah, that blank root password bug, it was all because of John Smith and his crap patch that caused this.”
Outsiders don’t have the same perspective as insiders. If Charlie’s commit message read “implementing the really difficult thing we talked about,” the team might be aware of mitigating factors that Alice won’t. But even without those mitigating factors, all you’ve done is badmouth your own devs to the public. Additionally, you are not considering whether Charlie is an otherwise stellar developer that has never had a bad patch before. Alice may incorrectly presume that the only reason he’s being called out is because this is a habit of his, perhaps.
I often compare open source projects based on what's visible on the github page.
Drama around a volunteer team in the open is a bad smell.
Edit: I'd like to explain why.
* Open source projects with lots of drama often don't attract new talented developers, and if talent happens to depend on that codebase they are more likely to fork and start a new community, or fork and not submit pull requests.
* If I need to interact with the team for pull requests or bug/support tickets I'd like to feel assured we can do so respectfully and professionally.
* If a community has drama in it I am less likely to recommend the software to a friend or blog about the software because I won't want to be associated with it. I'm more likely to stop using it and switch to a different solution.
Two things:
First, it's generally best to praise publicly, and criticize in private.
Second, saying "@User's commit screwed the pooch" assigns blame but, frankly, may not be the whole picture. It's entirely possible that the commit caused the issue, but everything was done by the book, in which case it's really an organizational failure.
Personally, I sympathize with your argument. I have no personal problem with Torvalds-style correction. I used to work under an asshole who would threaten to have me fired routinely. Personally, I prefer the blowhards because you can always tell where you stand. Still, not everyone is wired this way, and part of leadership is recognizing that and playing to various folks' strengths and weaknesses.
Right. Is the codebase spaghetti? Did anybody take the time to help Charlie understand the system? Were there unit tests that should've been broken by Charlie's commit? Were there unit tests but no continuous integration, so he didn't know to run those particular tests? Was there no code review where somebody could've caught the bug? Was there a code review but nobody else caught the bug? Was there QA testing performed where the bug could've been caught?
Etc. etc.
I think it's demoralizing because it's active blaming. Charlie doesn't have an attitude problem just because he doesn't like being publicly blamed for something; Bob has an attitude problem because piling it on somebody else is his first reaction.
The proper thing here is to acknowledge that there is an issue, but not assign blame. Go to the person you think is responsible in private, and let them admit the mistake in public if they want. Assigning them blame publicly shows a huge lack of respect, even if it was their fault, while admitting blame freely shows modesty.
Plus, what if it's not Charlie's fault, and his commit simply revealed the problem? Perhaps the actual issue is in a little used function deep down in the codebase, and his commit is just the first one to actually exercise that area the right way? Maybe this whole thing comes around to being Jim's fault instead.
I wonder, though, how much the culture of "talent" and "rockstar developers" contributes to this. We programmer types often perpetuate this narrative that programming ability is something that you're "born with" and you have it or you don't - unlearnable, unteachable, and ephemeral; such is the mystique of the programmer. So, how do you figure out who the capable ones are and who the incapable ones are? Well, of course - the ones who f'ed something up are the incapable ones, who just didn't "have it" after all.
I had to then tell them that this person still worked there.
The old IBM story is worth mentioning in relation to this: http://www.mbiconcepts.com/watson-sr-and-thoughtful-mistakes...
When someone makes a mistake, that's an incredible investment in them. I'm always surprised* when people try to throw it away by firing them or making them want to quit. Help them learn from it and apply that knowledge moving forward. Otherwise they're just taking that knowledge and using it to help another company.
*Obviously with the caveat that some people are repeat offenders who are careless or just not good employees
In other professions some mistakes cost the professional real money (doctor malpractice) or cause them to lose their license (drinking and driving with a commercial vehicle license).
As an industry we don't have a response to a truly neglectful mistake yet.
Reminds me of when someone ran "rm -rf /" at Pixar and deleted all of Toy Story 2.
The backups were crap and the only reason it survived was because someone took a server to work from home.
When all was said and done, they never really found who did it, they just made organisational changes to ensure it didn't happen again. No blame game.
When I worked with my first non-remote team in Phoenix, I basically did this to our mobile app codebase with an in-house git repository due to some faulty rsync changes to a grunt task.
To the old NPL team, sorry about that. Culture is important.
If in soccer the opposing team scores, who is to blame? The goalkeeper? The defenders? The coach? The whole team? The referee? Nobody?
Preventing goals means that the strategy needs to ensure good ball possession, and staying on the offense, to reduce the burden on the defense, to reduce the burden on the goalkeeper, who is the last line of defense.
If the last line of defense fails that's not an individual failure but a team failure, coach included, since the coach selects who gets to play, when and their roles.
Same in software: bad management passes the burden to developers, bad development passes the burden to testers, bad testing passes the burden to release management.
Now, there are cases when everyone knows what to do, steps are taken so everyone is informed of it, but someone still decides to go against it. In that case the individual is at fault.
It's not about what's broken, it's about what you DO when it is broken.
This is my favorite interview question to ask candidates:
"What is your all-time biggest screw-up, and how did you come back from it?" - I then tell them the story of me losing several hundred thousand dollars and the funny things that happened around it, to set the tone. If you have been in tech for any length of time you have one of these stories (if not a few). I have heard some great ones simply by asking, and it gives great insight into a candidate (humor, stress response, the things you have seen).
I think this is an important piece of organizational culture. If the first reaction to problems is blame and punishment, issues get covered up. But if finding bugs and fixing them is considered valuable, there will be fewer issues in the long run.
Of course I write enough stupid bugs myself that I'm bound to think this way.
This is so true. Providing the incentive to squash bugs rather than punishing people for making them is the driving force for innovation in a team. Take that away, and you get a toxic culture where everybody starts finger-pointing when an issue arises.
I found this to be the touchstone of spotting a dysfunctional enterprise. There it is all about the 'who', never about the fix. In those environments every new project is CYA from day 1. The disconnect between daily activities and the success of the company is so large, that all actions and projects are just about personal politics. A failure that can be blamed on the right target is often even a preferred outcome as eliminating a competitor for a promotion is even better than not having failed. If you find yourself in such an environment, try to leave asap.
Sure, if you have a huge company and a revolving door, the solution is a bunch of processes and idiot-proof safety nets, and no one person is to blame for most bugs. If you’re in a small company, the solution is to teach the devs by showing them what mistakes they made. I don’t think that’s a bad thing; if you write code, that code is your responsibility, and you shouldn’t be sensitive about people telling you your code is broken.
Also, focusing on the code itself, for me at least, easily leads to thoughts like “this function is crap! What idiot wrote this!?”. Finding out who broke it leads to thoughts like “I see John introduced this buggy function. I should go check with him, maybe he had a good reason.”
Mishaps occur on a spectrum, and may be categorised from mistakes, carelessness, recklessness, through to malicious intent, and any combination of the above all along said spectrum.
Though these categories may seem like they are orientated on individuals' actions, they may be used to determine where the risk lies in systems (and people's use thereof) and how measures can be taken to avoid the same problems being repeated.
Much of the time, the complexity of systems (using the term in the widest possible sense) is under-estimated, and automated integrity checks are not used as religiously as they may be.
I'm 90% in agreement. Her workplace definitely sounds like somewhere I'd consider working myself (if I were looking for a job).
There are some things that I consider basic competence standards, like not storing passwords in plain text in any system you're building. I wouldn't fire an intern for getting that wrong but I also wouldn't let an intern near a production authentication system without some oversight.
If someone is a security engineer with a responsibility to know these kinds of things as part of their job role and certification, then if they'd implemented passwords-in-clear to cut corners somewhere, even if it's to meet a really important deadline, I'd be extremely unhappy. Of course I'd establish the general pattern of what had gone wrong first, and if it was a superior being abusive to the security engineer to get the product launched on time I'd still be really unhappy but not at the engineer.
Occasionally one does follow the chain of causes back though and finds not the organisation's culture but an individual who really should have known better.
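(To make the "basic competence" example above concrete, here is a minimal sketch of storing a salted, slow hash instead of the password itself, using Python's standard-library hashlib.scrypt, which is available when Python is built against OpenSSL 1.1+. The parameters are illustrative, not a vetted policy.)

    # password_storage.py -- sketch of the baseline: never store the password itself.
    # The scrypt parameters below are illustrative, not a vetted policy.

    import hashlib
    import hmac
    import os

    SCRYPT_PARAMS = dict(n=2**14, r=8, p=1, maxmem=64 * 1024 * 1024, dklen=32)


    def hash_password(password):
        """Return (salt, digest); store both, never the plain-text password."""
        salt = os.urandom(16)
        digest = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
        return salt, digest


    def verify_password(password, salt, expected):
        digest = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
        return hmac.compare_digest(digest, expected)


    if __name__ == "__main__":
        salt, stored = hash_password("hunter2")
        assert verify_password("hunter2", salt, stored)
        assert not verify_password("wrong", salt, stored)
        print("ok: only the salt and the scrypt digest are ever stored")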
If you can go back in time, join me in 2013 and you can enjoy the ride for a few years, too. I'm sorry to say that I don't think you'll get the same experience in 2018.
The answer requires context, at least for FLOSS projects.
If unlucky dev #13 broke something because humans can no longer reason about the relevant part of the system, then it doesn't matter that #13 was the one who broke something. What really matters is that people get busy removing the sandtraps from their software.
However, many FLOSS projects run on the sheer joy and freedom that comes with maintaining a particular subsystem or area of the code. Most devs have a quick understanding of the responsibilities associated with that. But in cases where that responsibility doesn't come naturally, who broke it becomes the focus. Addressing that issue will determine whether or not future breakages occur.
It isn't about who broke it. But if there is a person on the team who continually breaks things, does not learn from their mistakes and repeats them, or is not truthful when they break things, the team should react appropriately.
It’s also about how it broke. And who broke it is sometimes the person who can say a lot, if not the most, about that. Therefore I don’t recommend teaching people never to talk about the person who took an action that led to a disaster, but rather encouraging a culture where admitting having taken a wrong step doesn’t lead to punishment, either financial or social. Who broke it is an important part of the analysis, helping the organization learn from each other’s errors. Making it taboo to talk about it is missing a chance for development...
Ooh, this is good. Part of it's covered under the name of "blameless post-mortems," but I don't remember searching for similar breakage, which is a great idea.
This seems like a classic case of applying the Five Whys [https://en.m.wikipedia.org/wiki/5_Whys] methodology for root cause analysis.
I don't see how this is not "better mousetrap, better mouse". Phrases from "they build a better fool" to "they build a better US Navy crewman" are a hundred a penny, and yes I've experienced the other side of this.
The best programmer vs the worst user, and every mix in between, shall produce situations needing attention this article addresses.
I've been in this situation on both sides. "Of course it should be clear what this phrase means, how could they fuck this up?" ... and ... "I have no idea what this means; both choices could mean what I want, but either choice ends me up on the wrong page of this bullshit 'choose my own adventure' that I'll have to repeat if I'm wrong".
I'm interested in finding out if I'm understanding this wrong, and/or in other thoughts.
The SRE Book teaches a lot of the lessons that this blog teaches. https://landing.google.com/sre/book.html
I totally agree. It's usually the hallmark of a good team if they have the "we are in this together" attitude.
There is the risk of conflating two separate types of problem. There are problems that arise from the complexity of the code, and problems that arise from particular people.
If a programmer has a habit of sloppy code, or violates the team's standards in some ways, then a good leader will keep track of the fact that one person is responsible for a recurring pattern of mistakes.
I absolutely agree with Rachel By The Bay, that many bugs arise from the complexity of the situation, and it would be wrong to blame the person who just happens to trip over that bug. But a good leader should take action against anyone who repeatedly screws up, and who seems unwilling to improve.
I've written about this before. This is from "How To Destroy A Tech Startup In Three Easy Steps":
----------------------
Wednesday, July 15th, 2015
I got to work at 11:00 a.m. John announced that our demo had stopped working. Sipping my coffee, I logged into the server to find out what the problem was. I looked at the error log for the API app, but it seemed okay. Then I checked the error log for the NLP app.
    java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1955)
        at Celolot.nlp.Extractor.fuckBitchesGetMoney(Extractor.java:87)
What the hell was this?
“FuckBitchesGetMoney”?
What kind of name is that for a function?
A computer programmer can name their functions anything, but there are some “best practices” regarding names, and this particular function name violated all of them.
I asked Sital why he had given this name to his function. He looked at me straight, shrugged, and stated that the name was from the 1995 song by The Notorious B.I.G., “Get Money.” I replied that rap lyrics were not part of our naming conventions. He promised that he would change it.
Coming from anyone else, I might have interpreted the function name as an act of angry rebellion, but Sital was too forthright for that. Apparently, he thought the name was funny and went with it because he wanted to add some humor to his code. Never did he stop to think it might be unprofessional.
I looked through his code and found several other functions that had inappropriate names. I sent him a list and asked him to change their names to something standard.
A week later the function was still there. FuckBitchesGetMoney. Yet I don’t think that any of this was a deliberate act of rebellion. He was just oddly forgetful and disorganized.
https://www.amazon.com/Destroy-Tech-Startup-Easy-Steps/dp/09...
If the function was still there, I think it is also likely that the old jar or class file (with the function) was still lurking in the classpath, or that your version control and build system weren't using his revision.
The point is, he failed to make any revisions. He was oddly disorganized. Even with quite a bit of coaching, he was unable to do what we needed.
lol, I wish there was a book with just those kinds of stories.
What’s that old saying; “Fix the problem, not the blame”?
I like this site and hadn't really read much from it - it's interesting how much it's been front paged over the last couple of weeks: https://news.ycombinator.com/from?site=rachelbythebay.com
Rachel is an excellent writer who was on a long break from writing. Seems like HN is happy to read her posts again.
Thanks! I was working a "real job" from about mid 2013 and am no longer, so my cycles are now all mine again. I was too tired to write most of the time before.
Also, there are many more stories to be told now!
I am really looking forward to reading the new stories - I bought your collection of stories too. :)
Nice work! And appreciate the "SuperOP" response :) Keep up the good posts, they're great to read!
I don't know. Where I work no discernible pattern can be found with the "what" that broke.
It's always the same f*ing people that break it though!
> It's always the same f*ing people that break it though!
Sometimes that's just the people who change things the most and work the hardest. It's harder to break anything when you don't actually change anything.
If I had a developer who was breaking stuff often enough to worry:
* Do they have too much access to systems?
* Is there something really wrong with the deployment system?
* What training can be provided?
All of that is more constructive than your comment, as cathartic as it may be.
Is there a reason why it's breaking, though? Is it really because the person breaking it is incompetent, or is it because there wasn't enough documentation or education or safeguards in place to prevent this from happening?
It amuses me that the sibling comments appear unable to imagine the possibility that someone is incompetent.
Of course there are other possibilities - the people breaking things are doing the hard bits that no one else dares to.
But wouldn't that imply that the "daring, thing-breaking" people are actually incompetent to some degree? Otherwise they would mitigate the risk before performing any dangerous operations on a live system.
"Bravado is no excuse for lack of preparation." - Leeroy Jenkins
Even in that case the overall system is still at fault for not recognizing their incompetence and either training them to be competent or getting rid of them.
It makes sense from a logical perspective, but in practice that's not how it works.
In reality, if something breaks and you are stupid enough to mention it, then (a) you are considered an a-hole for blaming <responsible-person-for-topic> even if you didn't, and (b) you are now responsible for fixing it.
So your main job is to somehow make your stuff work despite all the other stuff that doesn't work and all the other people that try to stop you, silently. The less you criticize the better. What you get in return is that if you fuck up, people will try to avoid blaming you as well. Also, if you don't succeed at making anything happen, you get a little arrogant smile from your manager and a mediocre feedback round. But otherwise nothing happens.
The only change to that pattern happens when you piss off your manager or your manager's manager. Then suddenly each and every activity you do will be scrutinized, and if there's a problem it will be used against you. The best hope they have is that you go away by yourself.
"The best hope they have is that you go away by yourself."
I'd recommend you satisfy their hope maximally by running the hell away from that dumpster fire of bullshit office politics.