Many years ago, I worked at Amazon, and it was at the time quite fond of the "five whys" approach to root cause analysis: say what happened, ask why that happened, ask why that in turn happened, and keep going until you get to some very fundamental problem.
I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was mostly about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."
I recall my manager handing the doc back to me and saying that I needed to completely redo it because it was unacceptable for us to blame another team for our team's bug, which is how I learned that you can make a five whys process blame any team you find convenient by choosing the question. I quit not too long after that.
My litmus test for these types of processes: If root causes like "Inflexible with timelines", or "Incentives are misaligned (e.g. prioritizing career over quality)" are not permitted, the whole process is a waste of time.
Edit: You can see others commenting on precisely this. Examples:
https://news.ycombinator.com/item?id=45573027
https://news.ycombinator.com/item?id=45573101
https://news.ycombinator.com/item?id=45572561
Usually things like that have to end up in retrospectives, and the first thing I hated about Scrum (or maybe just the first Scrum team, though they tried really hard to follow the letter of the process) was that you basically had to know about a problem for 5-7 weeks before you could get anyone to act on it. Because the uncomfortable items had to repeat at least 3 times before people wanted to look at them.
This was and is torture to me. I'm not going to fuck something up on purpose just to make the paperwork look good if I can tell ten minutes in that this is a stupid way to do it and I should be doing something else first.
Many developers have strong opinions that certain parts of a process are valuable and others aren't, and will try quite hard to align your process with their opinions as quickly as possible. For an organisation that doesn't know which developers' strongly held views are right and which are not, requiring everyone to try something for 5-7 weeks is probably more productive than any other approach they could take.
Having done quite a bit of politicking at a centuries-old, mid-sized company, I can tell you that what management wants from you is the assurance that this particular problem won’t happen again. Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting. Though what I’ve found is that if you have enough clout you can add an addendum to the root cause analysis, and you can start getting into things like misaligned incentives. But always keep in mind: at best you can only point out that this means this class of problem will keep happening.
If you do this, know that there be dragons. You have to be very careful here, because for any sufficiently large company, misaligned incentives are largely defined by the org chart and its boundaries. You will be adding fuel to politics that is likely above your pay grade, and the fallout can be career changing. I was lucky to have a neutral reputation, as someone who cared more about the product than personal gain. So I got a lot of leeway when I said tone-deaf things. Even still, I ended up in the crosshairs once or twice in the 10 years I was at the company for having opinions about systemic problems.
> Having done quite a bit of politicking at a centuries-old, mid-sized company, I can tell you that what management wants from you is the assurance that this particular problem won’t happen again.
I'm not disagreeing. I'm saying they should phrase it this way (and some do), instead of masking it with an insincere request for root causing.
> Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting
Occasionally this is the right thing to do. And often this results in a very long checklist that slows the whole development down, because they don't want to do a cost-benefit analysis of whether not having an occasional escape is worth the decrease in productivity. And this is because the incentives for the manager are such that the occasional escape is not OK.
In reality, though, he will insist on an ever growing checklist without a compromise in velocity. And that's a great recipe for more escapes.
That's the problem with root cause analyses. Sometimes the occasional escape is totally OK if you actually analyze the costs. But they want us to "analyze" while turning a blind eye to certain things.
I've worked at places that understood this and didn't have this attitude. And places that didn't. Don't work for the latter.
BTW, I should add that I'm not trying to give a cynical take. When I first learned the five whys, I applied it to my own problems (work and otherwise). And I found it to be wholly unsatisfying. For one thing, there usually isn't a root cause. There are multiple causes at play, and you need a branching algorithm to explore the space.
More importantly, 5 (or 3) is an arbitrary number. If you keep at it, you'll almost always end up with "human nature" or "capitalism". Deciding when to stop the search is relatively arbitrary, and most people will pick a convenient endpoint.
Much simpler is:
1. What can we do to prevent this from happening again?
2. Should we solve this problem?
Expanding on the latter, I once worked at a place where my manager and his manager were militant about not polluting the codebase to solve problems caused by other tools. We'd sternly tell the customers that they needed to take the problems to the problematic tool's owner and get them fixed there, and we were ready to have that battle at senior levels.
This was in a factory environment, so there were real $$ associated with bugs. And our team had a reputation for quality, and this was one of the reasons we had that reputation. All too often people use software as a workaround, and over the years too many workarounds accumulate. We were, in a sense, a weapon upper management wielded to ensure other tools maintained their quality.
An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.
So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.
Similarly, with "incentives are misaligned", that's fuzzy. A real root cause might be around managing time spent on bugfixes vs new features, and the root cause is not dedicating enough time to bugfixes, and if that's because people aren't being promoted for those, it's about fixing the promotion process in a concrete way.
You can't usually just stop at fuzzy cultural/management things because you want to blame others.
> An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.
That's not an inflexible timeline. That's just a timeline.
> So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.
Because management didn't want to drop features. Hence "inflexible".
I'm not saying this is always the reason, or even most of the time. But if we can't invoke it when it is the problem, then the exercise is pointless.
> Similarly, with "incentives are misaligned", that's fuzzy.
Any generic statement is inherently fuzzy.
> You can't usually just stop at fuzzy cultural/management things because you want to blame others.
Did you really think I was advocating responding to a whole process with one liners?
The examples you gave are often ones they will not accept as root causes.
In my experience, you can weaponize processes like the Five Whys or the Amazon Leadership Principles. I don’t think that means they don’t have any value.
That being said, in this case, I agree with your manager. Both the QA team and your team had fundamental problems.
Your team (I assume) verified the functionality which included X set of changes, and then your team made Y more changes (the flag) which were not verified. Ideally, once the functionality is verified, no more changes would be permitted in order to prevent this type of problem.
The fundamental problems on the QA team were…extensive.
Usually another team's failure is covered by their own independent report. That simplifies creating the report since you don't need to collaborate closely, but also prevents shifting the blame on to anyone else (because really, both teams had failures they should have caught independently). E.g. as the last why:
Why did the testing team not catch that the feature was not functional?
This is covered by LINK
If a root cause analysis is not cross team, how deep can the analysis possibly be? "Whoops, that question leads to this other process that our team doesn't directly control, guess we stop thinking about that!"
If your team doesn't control it, should you be thinking about it? Or should the team that owns it also own fixing it?
I should also have stated that, based on the context, I assumed this was talking about an incident report meant to be consumed internally, which I believe should be one per team. Incident reports published externally should be one single document, combining all the RCAs from the individual reports.
Shouldn't you be thinking about it? Whether you can control it or not, if you need to rely on it, you should be thinking about it.
Otherwise when an airplane crashes because of a defect in the aluminum, the design team's RCA will have to conclude that the root cause is "a lack of a redundant set of wings in the plane design", because they don't want to pin the blame on the materials quality inspection team mixing up some batch numbers.
If you were relying on something not to fail and it failed, your RCA should state as much. At best in the GP's case they could say "it's our fault for trusting the testing team to test the feature".
This sort of thinking is why the Japanese obliterated Detroit in the late 70's and early 80's.
The idea that if it was obvious another team messed up, you'd just ignore the problem until it got audited by a cross-functional team. All of the time and effort and materials spent between the two being wasted because nobody spoke up.
I think that's a valid question, and has plenty of ways you could go.
Is the process set up so that it's literally "throw it over the wall and you're done unless the test team contacts you"? Then arguably not. You did your job e2e and there was nothing you could've done. Doing more would've disrupted the process that's in place and taken time from other things you were assigned to. The test team should've contacted you.
BUT, well now the director has egg on his face and makes it your problem, so "should" is irrelevant; you will be thinking about it. And you ask yourself, was there something I could've done? And you know the answer is "probably".
Then, the more you think about it, you wonder, why on earth is the process set up to be so "throw it over the wall"? Isn't that stupid? All my hard work, and I don't even get to follow its progress through to production? Is this maybe also why my morale is so low? And the morale of everyone else on the team? And why testing always takes so long and misses so many bugs?
And then as you start putting things together, you realize that your director didn't assign this to you out of spite. He assigned it to you to make things better. That this isn't a form of punishment, but an opportunity to make a difference. It's something that is ultimately a director-level question: why is the process set up like it is? The director could put this together and solve it himself with adequate time, but at that level time is in short supply, and he's putting his trust in you to analyze and flesh out what really is the root cause for this incredibly asinine (and frightening) failure, and how we can improve as a result.
That said, in an org so broken that something like this could happen, I'm guessing the director is wanting you to do the RCA and the ten other firedrills that you're currently fighting as well, in which case, eh, fuck it. Blame the other team and move on.
If your root cause is cross team, then you wind up having to make some implicit assumptions about what the other team could have done. It's akin to ending with "because the gods got angry." Not really actionable.
This is a classic "limit the scope of the feature." You want the document to be written and constrained to someone that is in a position to impact everything they talk about. If you think there was something more holistic, push for that, as well.
Note you can discuss what other teams are doing. But do that in a way that is strictly factual. And then ask why that led your team to the failure that your team owns.
I was just commenting further on why you constrain it to your team/area of control. Should have been more clear that I meant my comment as a plus one to some of the other replies.
I oversaw an RCA once where this is exactly where they stopped. “We didn’t write this code originally so it isn’t our fault that our new feature broke it”. Repeat for 30 minutes whenever anyone says anything. We gave up.
It sounds like there’s another failure here, which you could have documented. If the test team didn’t understand what they were meant to test, that’s a failure of communication. Simply saying “they were wrong” is not sufficient exploration of the failure so, if that’s the point your manager was making, I agree with them. Blaming a third party for misunderstanding is less useful than seeking to improve the clarity of your own communication.
I love 5+ why's. I find it to be a fantastic tool in many situations. Unfortunately, when leadership does not reward a culture of learning, Five Why's can become conflated with root cause analysis and just become a directed inquiry for reaching a politically expedient cause. The bigger the fuck up, the more it needs an impartial NTSB-like focus on learning and sharing to avoid them in the future.
Fwiw, if I were your manager performing a root cause analysis, I'd mostly expect my team to be identifying contributing factors within their domain, and then we'd collate and analyze the factors with other respective teams to drill down to the root causes. I'd also have kicked back a doc that was mostly about blaming the other team.
The excellent thing I learned about 5 whys is that not only is it not really just 5, as you allude to with “5+”, but it’s also a *tree* instead of a linear list. Often a why will lead to more than one answer, and you really have to follow all branches to completion. The leaf nodes are where the changes are necessary. Rather than identifying one single thing that could have prevented the incident, you often identify many things that make the system more robust, any one of which would have prevented the incident in question.
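To make that concrete, here is a minimal sketch of a why-tree in Python. The incident, node wording, and structure are invented for illustration, not taken from any real analysis; the corrective actions attach at the leaves.

    from dataclasses import dataclass, field

    @dataclass
    class Why:
        question: str
        answers: list["Why"] = field(default_factory=list)

    def leaves(node: Why) -> list[str]:
        # Leaf nodes are where the changes are necessary.
        if not node.answers:
            return [node.question]
        found = []
        for child in node.answers:
            found.extend(leaves(child))
        return found

    incident = Why("Feature did nothing at launch", [
        Why("Flag name was mistyped the day before handoff", [
            Why("No freeze or review step between final edits and handoff")]),
        Why("Test pass did not exercise the feature end to end", [
            Why("Acceptance criteria never specified an observable output")]),
    ])

    print(leaves(incident))
    # ['No freeze or review step between final edits and handoff',
    #  'Acceptance criteria never specified an observable output']

Any one of the leaves, fixed, would have prevented this particular incident; together they make the system more robust.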
> I'd also have kicked back a doc that was mostly about blaming the other team.
Agreed. If the test team messed up, then you need to answer the "why" your team didn't verify that the testing had actually been done. (And also why the team hadn't verified that the tool they'd sent to testing was even minimally functional to begin with.)
Five whys are necessarily scoped to a set of people responsible. For things that happen outside that scope, the whys become about selection and verification.
Validating that the build you produced works at all should be done by you, but there's also a whole team whose job it was to validate it; would you advocate for another team to test the testing team's tests?
And more to the point, how do you write a 5 why's that explains how you'd typo'd a flag to turn a feature on, and another team validated that the feature worked?
> how do you write a 5 why's that explains how you'd typo'd a flag
Seriously? Even without knowing any context, there’s a handful of universal best practices that had to Swiss cheese fail for this to even get handed off to devtest…
- Why are you adding/changing feature flags the day before handoff? Is there a process for a development freeze before handoff, e.g. only showstopper changes are made after the freeze? Yes, but sales asked for it so they could demo at a conference. Why don’t we have a special build/deployment pipeline for experimental features that our sales / marketing engineers are asking for?
- Was it tested by the developer before pushing? Yes - why did it succeed at that point and fail in prod? The environment was different. Why do we not have a dev environment that matches prod? Money? Time? Politics?
- Was it code reviewed? Did it get an actual review, or rubber stamped? Reviewed, but the important parts were only skimmed — why was it not reviewed more carefully? Not enough time — why is there not enough time to do code reviews? Oh, the feature flag name used an underscore instead of a hyphen — why did this not get flagged by a style checker? Oh, so there are no clear style conventions for feature flags and each team does their own thing…? Interesting…
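For the style-checker branch, the fix can be mechanical. A rough sketch of the kind of check a linter or CI step could run; the registry, naming convention, and flag names here are all hypothetical:

    import re
    import sys

    # Hypothetical single source of truth for flag names, e.g. loaded from a registry file.
    DECLARED_FLAGS = {"enable-targeting", "enable-new-checkout"}
    # One agreed convention: lowercase words separated by hyphens.
    NAME_PATTERN = re.compile(r"^[a-z]+(-[a-z]+)*$")

    def check(flags_used):
        errors = 0
        for flag in sorted(flags_used):
            if not NAME_PATTERN.match(flag):
                print(f"style: flag '{flag}' does not match the naming convention")
                errors += 1
            if flag not in DECLARED_FLAGS:
                print(f"error: flag '{flag}' is not declared in the registry (typo?)")
                errors += 1
        return errors

    if __name__ == "__main__":
        # 'enable_targeting' simulates the underscore-for-hyphen typo discussed above.
        sys.exit(1 if check({"enable_targeting", "enable-new-checkout"}) else 0)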
Curious, do your 5 why's actually look like this, kind of stream-of-consciousness? Because I love this! Our org's 5 why's are always a linear 5 steps back that end at THE ROOT CAUSE. And those are the good ones. Others are just a list of five things that happened before or during the incident.
I've always pushed to get rid of this section of the postmortem template, or rename it, or something, because framing everything into five linear steps is never accurate, and having it be part of the template robs us of any deeper analysis or discussion because that would be redundant. But, it's hard to win against tradition and "best practices".
Just saying, once you find out the testing team is unreliable, you make sure there's a form of evidence it actually got tested, that someone on your team reviews before signing off.
It doesn't take a whole team. There are lots of mechanisms to produce that evidence. This is just how it works. If two checks aren't sufficient, it becomes three. Or four. Until problems stop making it through.
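One way to make "a form of evidence, reviewed before sign-off" mechanical is a small release gate. This is only a sketch under assumed conventions: the evidence file name, its fields, and the reviewer list are all made up.

    import json
    import sys

    REQUIRED_FIELDS = {"feature", "test_run_url", "result", "reviewed_by"}
    TEAM_REVIEWERS = {"alice", "bob"}  # hypothetical members of the owning team

    def gate(evidence_path):
        try:
            with open(evidence_path) as f:
                evidence = json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            print(f"no usable test evidence: {exc}")
            return False
        if not isinstance(evidence, dict):
            print("evidence file is not a JSON object")
            return False
        missing = REQUIRED_FIELDS - set(evidence)
        if missing:
            print(f"evidence is missing fields: {sorted(missing)}")
            return False
        if evidence["result"] != "pass":
            print("evidence does not record a passing run")
            return False
        if evidence["reviewed_by"] not in TEAM_REVIEWERS:
            print("no sign-off from the owning team yet")
            return False
        return True

    if __name__ == "__main__":
        # Block the release unless the evidence exists, passed, and was reviewed.
        sys.exit(0 if gate("test-evidence.json") else 1)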
I think that's the point. If you have an incompetent team or team member the number of checks around them can grow astronomically and still you will have problems. At a certain point the systemic problem can become "the system is unwilling to replace this person/team with a competent one".
(That said, this is only in the case of persistent problems. Everyone can be inattentive some of the time, and a sensible quality system can be very helpful here. It's when the system tries to be a replacement for actually knowing what you're doing that things go off the rails)
While it's clear that your testing team should have their own 5 Ys as well, I think it's reasonable for the manager of your team to ask you the question: how do we prevent this in the future? The unfortunate reality of large companies is that sometimes the quality and behaviour of other teams is (to some extent/on some time horizon) out of your control, and so the question for any given team lead is often "what can I do differently". It does seem that in this case there probably was some mechanism by which your team could have had its own internal testing process that needn't replicate the full responsibilities of the testing team but could at least have caught this issue.
My first thought is: why is rolling out a new system to prod that is not used yet an incident? I don't think "being in prod" is sufficient. There should be tiers of service, and a brand new service should not be on a tier where it having teething issues is an incident.
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
I'd be interested to see the doc, but I imagine you'd branch off the causes; one branch of the tree is: UAT didn't pick up the bug. Why didn't UAT pick up the bug? .... (you'd need that team's help).
I think that team would have something that is a contributing cause. You shouldn't rely on UAT to pick up a bug in a released product. However, just because it is not a root cause doesn't mean it shouldn't be addressed. Today's contributing cause can be tomorrow's root cause!
So yeah, you don't blame another team, but you also don't shield another team from one of their systems needing attention! The wording matters a lot though.
The way you worded the question seems a little loaded. But you may be paraphrasing? 5 whys are usually more like "Why did the papaya team not detect the bug before deployment?"
Whereas
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
Is more emotive. Sounds like a cross examiners question which isn't the vibe you'd want to go for. 5 whys should be 5 drys. Nothing spicy!
It was an incident because it was important to leadership. It was a marketing targeting feature that was advertised to the local executive with some excitement by the management, so they were excited to share the results of it, and when there weren't results on the anticipated launch date, they wanted answers, which meant the manager treated it as an incident.
That's how we do it - there are "branches" to most of our RCAs, and in fact, we have separate sections for root cause analysis (things which directly or indirectly contributed to the incident, which form a branched / fractal 5 whys) and lessons learned (things which did not necessarily contribute to the incident but which, upon reflection, we can do better - frequently incident management or communication or reporting or escalation etc).
It took a while for all the teams to embrace the RCA process without fear and finger pointing, but now that it's trusted and accepted, the problem management stream / RCA process is probably the healthiest / best viewed of our streams and processes :-)
The way I handle this with my teams: any bugs caught by the QA team go against the developers; any bugs caught after QA green-lights the go-live go against the QA team. (Of course, discounting any bugs that are deemed acceptable for go-live by the PM.)
General trick in any project management is try to arrange for the work to be done by the group that has the most influence over it.
Just as you should never take responsibility over something you are given no power over, you should move responsibility to where the power is (and if they won't take responsibility, you start carving out the edges of their power and hand it over to adjacent groups who will).
I learned pretty early how to hack the 5 Why's in order to make sure something actionable but neither trivial nor overwhelming gets chosen as the response. And I often do it early enough in the meeting that I'm difficult to catch doing it.
If I don't get invited I will sometimes crash the party, especially if the last analysis resulted only in performative work and no real progress. You get one, maybe two, and then I'm up in your business because mistakes don't mean you're stupid, but learning nothing from them does.
5 why's can be very political. You can make it take whatever direction you want to tell whatever story you want. I don't get why it's cargo culted the way it is.
While that might be true, the five whys is notorious for slipping into a destructive "you/I suck and firing you/I solves the problem for good and I believe it makes everyone absolutely happy" style of false conclusions.
Reportedly Toyota has organizational mitigations for that problem or reportedly the working culture there isn't so great after all. The bottom line is, it's a double edged sword to say the very least.
At a large cloud provider I held a role for a bit in the “safety” organization that was tasked with developing better understanding of our incidents, working on tooling to protect systems, and so on.
A few problems I faced:
- culturally, a lack of deeper understanding or care around “safety” topics. The powers that be are inherently motivated by launching features and increasing sales, so more often than not you could write an awesome incident retro doc and just get people who are laser focused on the bare minimum of action items.
- security folks co-opting the safety things, because removing access to things can be misconstrued to mean making things safer. While somewhat true, it also makes doing jobs more difficult if not replaced with adequate tooling. What this meant was taking away access and replacing everything with “break glass” mechanisms. If your team is breaking glass every day and ticketing security, you’re probably failing to address both security and safety..
- related to the last point, but a lack of introspection as to the means of making changes which led to the incident. For example: a user used ssh to run a command and ran the wrong command -> we should eliminate ssh. Rather than asking: why was ssh the best / only way the user could effect change in the system? Could we build an api for this with tooling and safeguards before cutting off ssh?
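As a sketch of the "api with tooling and safeguards" alternative, the shape might be an allowlisted command runner rather than raw ssh. Everything here is invented: the actions, commands, and log path are placeholders, not a real tool.

    import datetime
    import shlex
    import subprocess

    # Hypothetical allowlist of operator actions and the exact commands they map to.
    ALLOWED = {
        "restart-web": ["systemctl", "restart", "web.service"],
        "flush-cache": ["redis-cli", "FLUSHDB"],
    }
    NEEDS_CONFIRMATION = {"flush-cache"}   # destructive actions get an extra prompt
    AUDIT_LOG = "opstool-audit.log"        # assumed location for the audit trail

    def run(action, operator):
        if action not in ALLOWED:
            raise SystemExit(f"unknown action '{action}'; allowed: {sorted(ALLOWED)}")
        if action in NEEDS_CONFIRMATION:
            if input(f"'{action}' is destructive. Type the action name to confirm: ") != action:
                raise SystemExit("aborted")
        cmd = ALLOWED[action]
        stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
        with open(AUDIT_LOG, "a") as log:  # who ran what, when
            log.write(f"{stamp} {operator} {shlex.join(cmd)}\n")
        return subprocess.run(cmd, check=False).returncode

    if __name__ == "__main__":
        run("restart-web", operator="jdoe")

The point is not this particular wrapper, but that the operator keeps a way to act while typos and wrong-command mistakes get squeezed out, which is a different remedy than deleting ssh.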
I've applied for a couple jobs like this and was somewhat relieved they didn't call me back.
When you move thinking about reliability or safety outside of the teams that generate the problems, you replace self reflection with scolding, and you have to either cajole people to make changes for you or jump into code you're not spending enough time with to truly understand. And then if you make a mistake this is evidence that you shouldn't be touching it at all. See we told you this would end badly.
Yeah I think that’s accurate. The org had good intentions and owned some reasonable programs, but more and more became basically a security organization focused on cutting off hands to prevent people from touching keyboards, rather than addressing real systemic risks and patterns of operator behaviour leading to incidents.
I did a very long RCA on a problem. My management at the time was really BIG into looking at ALL THE CAUSES. They wanted HUGE fishbone diagrams to show that we had looked at everything. This was in the days of having huge drum plotters, so the diagrams could be 36" and many feet long.
So I did what they wanted and the root cause was:
On December 11 1963 Mr and Mrs Stanley Smith had sexual intercourse.
I got asked what that had to do with anything and I said, "If you look up a few lines you'll see that the issue was a human error caused by Bob Smith, if he hadn't been born we wouldn't have had this problem and I just went back to the actual conception date."
I got asked how I was able to pin it to that date and said "I asked Bob what his father's birthday was and extrapolated that info"
I model RCAs after my understanding of NTSB and incident response after my understanding of NASA command centers.
They're both flawed but often replace something that works 3x worse than my caricature of both.
The findings should always result in a material change that is worth at least the effort of having done it. Not just a checkbox that proves we did something. The investment in the mitigation should honor the consequences of the failure, and the uniqueness of the failure. Or rather, the lack of uniqueness. As a failure repeats in kind (eg, a bunch of 737 Maxes crashing), trust in the system is put in jeopardy. By the time a problem has happened three times, the response should begin to resemble penance.
So how do we get the problem not to hit production again, or how do we at least keep it from happening due to the exact same error?
And for some failure modes, we need to project the consequences going forward. Let's say you find your app is occasionally crashing over a weekend because of memory leaks, plus the lack of Continuous Deployment forcibly restarting the services. We can predict this problem will happen reliably on Memorial Day, and Labor Day. So we need to do something relatively serious now.
But it'll also get much worse on Thanksgiving weekend, and just stupid around Christmas, when we have code freezes. So we do something to get us through Memorial Day but we also need a second story near the top of the backlog that needs to be done by Labor Day, Thanksgiving at the latest. But we don't necessarily have to do that story next sprint.
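A back-of-the-envelope version of that projection, with invented numbers (the leak rate, headroom, and gap lengths are placeholders):

    LEAK_MB_PER_HOUR = 80                 # assumed steady leak under normal traffic
    HEADROOM_MB = 4096                    # assumed free memory right after a restart
    HOURS_TO_CRASH = HEADROOM_MB / LEAK_MB_PER_HOUR   # ~51 hours with these numbers

    GAPS_DAYS = {
        "ordinary weekend": 2,
        "Memorial Day / Labor Day weekend": 3,
        "Thanksgiving weekend": 4,
        "Christmas code freeze": 10,
    }

    for name, days in GAPS_DAYS.items():
        survives = days * 24 < HOURS_TO_CRASH
        print(f"{name:33} {days:2d} days -> {'survives' if survives else 'crashes'}")

With these made-up numbers an ordinary weekend is already marginal, every long weekend fails, and the code freeze is hopeless, which is the argument for a small fix now plus a bigger backlog item before the freeze.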
Hah, #7 really hits home. Every RCA I’ve been a part of always ends up pointing to systemic failures in the org at the top level, because walking the tree always leads there. You can’t blame any one person or system for a failure in isolation. It’s usually some form of, “this is ultimately a consequence of miscalibrating the risk associated with business/financial decisions.”
I forget where I heard this, but: “you manage risk, but risk cannot be managed.” I.e. there is no terminal state where it’s been “solved.” It’s much like “culture of safety.”
Ultimately this is what management is at least nominally supposed to do: decide values, set standards, and ensure that people down the chain are strategically aligned and have an environment where everyone is working together. That includes Andy Grove's "getting people to resolve their own differences" and other meta tasks. But fundamentally the person in charge is responsible for everything, even though they can't know everything.
Any RCA that doesn't provide useful feedback to management up and down the chain is missing pieces, but there's lots of discussion elsewhere in this thread about that by people better at elaborating than I am.
If you get a bunch of analyses that point to underinvestment in x in order to achieve y, and you can measure that this is losing money, then the top level recalibrates.
It's not about blame, it's about course-correcting including at the top level. They can't do that course-correcting without these analyses.
As one of my early mentorish people used to say: sometimes you have to let the baby fall down.
Before that I hated it when people confessed to me that they knew a problem was going to happen and they did nothing to stop it. But the problem with doing things right the first time is that nobody appreciates how hard it is. You spend all your political capital trying to stop the company from making a mistake it desperately wants to make. You get no credit when it doesn't happen because they already figured it wouldn't. And you don't get any buy-in for keeping it from repeating in the future. So now you have this plate you have to spin all by yourself, and nobody who can hand you more control gives a shit, other than the fact that you seem like an asshole so we aren't going to promote you.
You can be clever though and put systems in place where effectively the people squished in the middle now have new places where they can just say 'no' because the process requires it.
That is sort of how Agile got out there in the first place. Open feedback loops let management be stupid for so long before the consequences became apparent that they could duck their vast contributions to the problem.
Saying "I was wrong" can be a very powerful debate tool, especially if you can follow it up with concrete actions to fix the problem.
There were tools I wrote at my last job because basically an incident could IMO be tracked back to, "I was on step 5 and George interrupted me to ask a question of some other high priority effort, and when I got back I forgot to finish step 5". The more familiar you are with a runbook, the more your brain will confuse memories from an hour or a day ago with memories of the same routine at a different time.
It's literally "did I turn off the stove". You remember turning off the stove. Many, many times. But did you turn off both burners you turned on for lunch? You're certain about one. But what about the other? That's a blur.
But we're software developers. The more mundane a task, the more likely that we can replace it with a program to do the same thing. And as you get more familiar with the task you keep adding more to the automatic parts until one day it's just a couple buttons that can be reliably pushed during peak traffic on your services. Just make sure there isn't an active outage going on when you push the buttons.
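A minimal sketch of that progression, assuming made-up runbook steps: the script checkpoints each completed step to a state file, so an interruption at step 5 cannot silently turn into a skipped step 5.

    import json
    import os

    STATE_FILE = "runbook-state.json"

    def step_1(): print("drain traffic from the old instance")
    def step_2(): print("snapshot the database")
    def step_3(): print("roll out the new build")
    def step_4(): print("run smoke checks")
    def step_5(): print("re-enable traffic")   # the kind of step that gets forgotten mid-interruption

    STEPS = [step_1, step_2, step_3, step_4, step_5]

    def load_done():
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return set(json.load(f))
        return set()

    def run():
        done = load_done()
        for step in STEPS:
            if step.__name__ in done:
                continue                       # already completed on an earlier run
            step()
            done.add(step.__name__)
            with open(STATE_FILE, "w") as f:   # checkpoint after every step
                json.dump(sorted(done), f)
        os.remove(STATE_FILE)                  # clean finish: nothing left to resume

    if __name__ == "__main__":
        run()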
Author here. Please note this is an early draft/stream-of-consciousness. Feel free to read and share anyway, but my actual published articles are held to a higher standard!
I caught your related comments and eventual link to this post in another HN thread earlier this week and really liked them / it. I'm glad you posted it by itself!
A lot of this is based on heavy assumptions about systems and risk/safety analysis. The biggest assumption this post is making is that humans should be involved at all.
Systems do not have to facilitate operators in building accurate mental models. In fact, safe systems disregard mental models, because a mental model is a human thing, and humans are fallible. Remove the human and you have a safer system.
Safety is not a dynamic control problem, it's a quality-state management problem. You need to maintain a state of quality assurance/quality control to ensure safety. When it falters, so does safety. Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Also fwiw, there's often not a root cause, but instead multiple causes (or coincidental states). For getting better at tracking down causes of failures (and preventing them), I recommend learning the Toyota Production System, then reading Out Of The Crisis. That'll kickstart your brain enough to be more effective than 99.99% of people.
It should be more clear in the article that this term is used more broadly. The "mental model" is the function that converts feedback into control actions. In that sense, even simple automated controllers have "mental models".
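A toy example of that broader sense, with an arbitrary setpoint and readings: the "mental model" here is nothing more than the rule that maps feedback to a control action.

    def control_action(measured_temp, setpoint=20.0, hysteresis=0.5):
        # The controller's "mental model": compare feedback to the setpoint.
        if measured_temp < setpoint - hysteresis:
            return "heater_on"
        if measured_temp > setpoint + hysteresis:
            return "heater_off"
        return "hold"

    for reading in (18.0, 19.8, 20.2, 21.0):
        print(reading, "->", control_action(reading))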
> Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Sometimes it isn't but often it is. Yes, that makes the system more complex, along with other properties such as organicism, non-linearity, parallelism, interactivity, long-term recurrences, etc. These properties are inconvenient but we cannot just wish them away. We have to design for the system we have.
> Toyota, Deming, SPC
I have read more books in this area than most people, and I don't really see how you came away without an appreciation for the importance of humans in the loop, and the dangers of overautomation. Could you illustrate more clearly what you mean?
I agree with a lot of the statements at the top of the article, but some of them are just nonsense. This one, in particular:
> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.
Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.
> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.
I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broke and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best-case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline, there's no out there, but the analysis flow looks a bit different for things that should-have worked versus things that could-never-have worked. And, yes, it will roll up on management, but... still.
Sure, more is always better. Practically, though, we are trading depth for breadth. In my experience, many problems that look dissimilar after a shallow analysis turn out to be caused by the same thing when analysed in depth. In that case, it is more economical to analyse fewer incidents in greater depth and actually find their common factors, rather than make a shallow pass over many incidents and continue to paper over symptoms of the undiscovered deeper problem.
Good RCA: Produce some useful documentation to prevent issue from recurring.
Fantastic RCA: Remove requirement that caused the action that resulted in the problem occurring.
Bad RCA: Lets get 12 non technical people on a call to ask the on call engineer who is tired from 6 hours managing the fault, a bunch of technical questions they don't understand the answers to anyway.
(Worst possible fault practice is to bring in a bunch of stakeholders and force the engineer to be on a call with them while they try and investigate the fault)
Worst RCA: A half paragraph describing the problem in the most general terms to meet a contractual RCA requirement.
Not all problems (and systems) are alike. And probably simple approaches like Occam's Razor will work well enough with most. But the remaining 10% will need deeper digging into more data and correlations.
Root cause works better if you can come back next time the same thing happens and find a different root cause to fix. Keep repeating until the problem doesn't happen enough to care anymore.
If the result/accident is too bad, though, you need to find all the different faults and mitigate as many as possible the first time.
> Root cause works better if you can come back next time the same thing happens and find a different root cause to fix. Keep repeating until the problem doesn't happen enough to care anymore.
This sounds like continuously firefighting to paper over symptoms rather than address the problems at a deeper level.
I feel like half the time issues are caused by adding some stupid feature that nobody really wants, but makes it in anyways because the incentive is to add features, not make good software.
People rarely react well if you tell them "Hey this feature ticket you made is poorly conceived and will cause problems, can we just not do it?" It is easier just to implement whatever it is and deal with the fallout later.
It's hard to prove the cost to a feature or bug fix or library upgrade we desperately need has been doubled by all of the features we didn't need.
My 'favorite' is when we implement stupid, self-limiting, corner-painting features for a customer who leaves us anyway. Or who we never manage to make money from.
Many years ago, I worked at Amazon, and it was at the time quite fond of the "five whys" approach to root cause analysis: say what happened, ask why that happened, ask why that in turn happened, and keep going until you get to some very fundamental problem.
I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was most about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."
I recall my manager handing the doc back to me and saying that I needed to completely redo it because it was unacceptable for us to blame another team for our team's bug, which is how I learned that you can make a five why process blame any team you find convenient by choosing the question. I quit not too long after that.
My litmus test for these types of processes: If root causes like "Inflexible with timelines", or "Incentives are misaligned (e.g. prioritizing career over quality)" are not permitted, the whole process is a waste of time.
Edit: You can see others commenting on precisely this. Examples:
https://news.ycombinator.com/item?id=45573027
https://news.ycombinator.com/item?id=45573101
https://news.ycombinator.com/item?id=45572561
https://news.ycombinator.com/item?id=45572561
Usually things like that have to end up in retrospectives, and the first thing I hated about Scrum (or maybe just the first Scrum team, though they tried really hard to follow the letter of the process) was that you basically had to know about a problem for 5-7 weeks before you could get anyone to act on it. Because the uncomfortable items had to repeat at least 3 times before people wanted to look at them.
This was and is torture to me. I'm not going to fuck something up on purpose just to make the paperwork look good if I can tell ten minutes in that this is a stupid way to do it and I should be doing something else first.
Many developers have strong opinions that certain parts of a process are valuable and others aren't, and will try quite hard to align your process with their opinions as quickly as possible. For an organisation that doesn't know which developers' strongly held views are right and which are not, requiring everyone to try something for 5-7 weeks is probably more productive than any other approach they could take.
Having done quite a bit of politicing at a centuries old, med sized company, I can tell you that what management wants you is the assurance that this particular problem won’t happen again. Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting Though what I’ve found is that if you have enough clout you can add an addendum to the root cause analysis, and you can start getting into things like misaligned incentives. But always keep in mind, at best you can only point out this will mean this class of problem will keep happening.
If you do this, know that there be dragons. You have to be very careful here, because for any sufficiently large company, misaligned incentives are largely defined by the org chart and it’s boudaries. You will be adding fuel to politics that is likely above your pay grade, and the fallout can be career changing. I was lucky to have a neutral reputation, as someine who cared more about the product than personal gain. So I got a lot of leeway when I said tonedeaf things. Even still I ended up in crosshairs once or twice in the 10 years I was at the company for having opinions about systemic problems.
> Having done quite a bit of politicing at a centuries old, med sized company, I can tell you that what management wants you is the assurance that this particular problem won’t happen again.
I'm not disagreeing. I'm saying they should phrase it this way (and some do), instead of masking it with an insincere request for root causing.
> Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting
Occasionally this is the right thing to do. And often this results in a very long checklist that slows the whole development down because they don't want to do a cost-benefit analysis of whether not having an occasional escape is worth the decrease in productivity. And this is because the incentives for the manager is such that the occasional escape is not OK.
In reality, though, he will insist on an ever growing checklist without a compromise in velocity. And that's a great recipe for more escapes.
That's the problem with root cause analyses. Sometimes the occasional escape is totally OK if you actually analyze the costs. But they want us to "analyze" while turning a blind eye to certain things.
I've worked at places that understood this and didn't have this attitude. And places that didn't. Don't work for the latter.
BTW, I should add that I'm not trying to give a cynical take. When I first learned the five whys, I applied it to my own problems (work and otherwise). And I found it to be wholly unsatisfying. For one thing, there usually isn't a root cause. There are multiple causes at play, and you need a branching algorithm to explore the space.
More importantly, 5 (or 3) is an arbitrary number. If you keep at it, you'll almost always end up with "human nature" or "capitalism". Deciding when to stop the search is relatively arbitrary, and most people will pick a convenient endpoint.
Much simpler is:
1. What can we do to prevent this from happening again?
2. Should we solve this problem?
Expanding on the latter, I once worked at a place where my manager and his manager were militant about not polluting the codebase to solve problems caused by other tools. We'd sternly tell the customers that they need to go to the problematic tool's owner and fix them, and were ready to have that battle at senior levels.
This was in a factory environment, so there were real $$ associated with bugs. And our team had a reputation for quality, and this was one of the reasons we had that reputation. All too often people use software as a workaround, and over the years there accumulate too many workarounds. We were, in a sense, a weapon upper management wielded to ensure other tools maintained their quality.
Engineering is the art of making do with what you've got and sometimes you have to treat such unreasonable i positions just like any other constraint
But are those really the root causes?
An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.
So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.
Similarly, with "incentives are misaligned", that's fuzzy. A real root cause might be around managing time spent on bugfixes vs new feature, and the root cause is not dedicating enough time to bugfixes, and if that's because people aren't being promoted for those, it's about fixing the promotion process in a concrete way.
You can't usually just stop at fuzzy cultural/management things because you want to blame others.
> An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.
That's not an inflexible timeline. That's just a timeline.
> So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.
Because management didn't want to drop features. Hence "inflexible".
I'm not saying this is always the reason, or even most of the times. But if we can't invoke it when it is the problem, then the exercise is pointless.
> Similarly, with "incentives are misaligned", that's fuzzy.
Any generic statement is inherently fuzzy.
> You can't usually just stop at fuzzy cultural/management things because you want to blame others.
Did you really think I was advocating responding to a whole process with one liners?
The examples you gave are often ones they will not accept as root causes.
In my experience, you can weaponize processes like the Five Whys or the Amazon Leadership Principles. I don’t think that means they don’t have any value.
That being said, in this case, I agree with your manager. Both the QA team and your team had fundamental problems.
Your team (I assume) verified the functionality which included X set of changes, and then your team made Y more changes (the flag) which were not verified. Ideally, once the functionality is verified, no more changes would be permitted in order to prevent this type of problem.
The fundamental problems on the QA team were…extensive.
Usually another team's failure is covered by their own independent report. That simplifies creating the report since you don't need to collaborate closely, but also prevents shifting the blame on to anyone else (because really, both teams had failures they should have caught independently). E.g. as the last why:
Why did the testing team not catch that the feature was not functional?
This is covered by LINK
If a root cause analysis is not cross team, how deep can the analysis possibly be? "Whoops, that question leads to this other process that our team doesn't directly control, guess we stop thinking about that!"
If your team doesn't control it, should you be thinking about it? Or should the team that owns it also own fixing it?
I should also have stated that based on the context I assumed this was talking about in incident report meant to be consumed internally, which I believe should be one per team. Incident reports published externally should be one single document, combining all the RCA from each individual report.
Shouldn't you be thinking about it? Whether you can control it or not, if you need to rely on it, you should be thinking about it.
Otherwise when an airplane crashes because of a defect in the aluminum, the design team's RCA will have to conclude that the root cause is "a lack of a redundant set of wings in the plane design", because they don't want to pin the blame on the materials quality inspection team mixing up some batch numbers.
If you were relying on something not to fail and it failed, your RCA should state as much. At best in the GP's case they could say "it's our fault for trusting the testing team to test the feature".
This sort of thinking is why the Japanese obliterated Detroit in the late 70's and early 80's.
The idea that if it was obvious another team messed up, you'd just ignore the problem until it got audited by a cross-functional team. All of the time and effort and materials spent between the two being wasted because nobody spoke up.
I think that's a valid question, and has plenty of ways you could go.
Is the process set up so that it's literally "throw it over the wall and you're done unless the test team contacts you"? Then arguably not. You did your job e2e and there was nothing you could've done. Doing more would've disrupted the process that's in place and taken time from other things you were assigned to. The test team should've contacted you.
BUT, well now the director has egg on his face and makes it your problem, so "should" is irrelevant; you will be thinking about it. And you ask yourself, was there something I could've done? And you know the answer is "probably".
Then, the more you think about it, you wonder, why on earth is the process set up to be so "throw it over the wall"? Isn't that stupid? All my hard work, and I don't even get to follow its progress through to production? Is this maybe also why my morale is so low? And the morale of everyone else on the team? And why testing always takes so long and misses so many bugs?
And then as you start putting things together, you realize that your director didn't assign this to you out of spite. He assigned it to you to make things better. That this isn't a form of punishment, but an opportunity to make a difference. It's something that is ultimately a director-level question. Why is the process set up like it is? The director could put together and solve with adequate time, but at that level time is on short supply, and he's putting his trust in you to analyze and flesh out, what really is the root cause for this incredibly asinine (and frightening) failure, and how can we improve as a result?
That said, in an org so broken that something like this could happen, I'm guessing the director is wanting you to do the RCA and the ten other firedrills that you're currently fighting as well, in which case, eh, fuck it. Blame the other team and move on.
So what do heartbleed and log4shell RCAs look like for your internal teams? “A necessary source library screwed up, not our problem”?
If your root cause is cross team, then you wind up having to make some implicit assumptions on what the other team could have done. Is akin to ending with "because the gods got angry." Not really actionable.
This is a classic "limit the scope of the feature." You want the document to be written and constrained to someone that is in a position to impact everything they talk about. If you think there was something more holistic, push for that, as well.
Note you can discuss what other teams are doing. But do that in a way that is strictly factual. And then ask why that led your team to the failure that your team owns.
If the dynamic is cross functional you need to reschedule the post mortem and invite the other team to the meeting.
This is literally a "the right people aren't in the room" issue.
> on what the other team could have done
If you're wondering what anyone "could have done", you've already missed the point of the article completely.
I was just commenting further on why you constrain it to your team/area of control. Should have been more clear that I meant my comment as a plus one to some of the other replies.
Ah, thank you and sorry for assuming.
I oversaw an RCA once where this is exactly where they stopped. “We didn’t write this code originally so it isn’t our fault that our new feature broke it”. Repeat for 30 minutes whenever anyone says anything. We gave up.
Pretty deep. It forces you to account for failures in other domains
> root
You keep using that word. I do not think it means what you think it means.
It sounds like there’s another failure here, which you could have documented. If the test team didn’t understand what they were meant to test, that’s a failure of communication. Simply saying “they were wrong” is not sufficient exploration of the failure so, if that’s the point your manager was making, I agree with them. Blaming a third party for misunderstanding is less useful than seeking to improve the clarity of your own communication.
I love 5+ why's. I find it to be a fantastic tool in many situations. Unfortunately, when leadership does not reward a culture of learning, Five Why's can become conflated with root cause analysis and just become a directed inquiry for reaching a politically expedient cause. The bigger the fuck up, the more it needs an impartial NTSB-like focus on learning and sharing to avoid them in the future.
Fwiw, if I were your manager performing a root cause analysis, I'd mostly expect my team to be identifying contributing factors within their domain, and then we'd collate and analyze the factors with other respective teams to drill down to the root causes. I'd also have kicked back a doc that was mostly about blaming the other team.
The excellent thing I learned about 5 whys is that not only is it not really just 5, as you allude to with “5+”, but it’s also a *tree* instead of a linear list. Often a why will lead to more than one answer, and you really have to follow all branches to completion. The leaf nodes are where the changes are necessary. Rather than identifying one single thing that could have prevented the incident, you often identify many things that make the system more robust, any one of which would have prevented the incident in question.
> I'd also have kicked back a doc that was mostly about blaming the other team.
Agreed. If the test team messed up, then you need to answer the "why" your team didn't verify that the testing had actually been done. (And also why the team hadn't verified that the tool they'd sent to testing was even minimally functional to begin with.)
Five whys are necessarily scoped to a set of people responsible. For things that happen outside that scope, the whys become about selection and verification.
Quis turmas probationum examinat?
Validating that the build you produced works at all should be done by you, but there's also a whole team whose job it was to validate it; would you advocate for another team to test the testing teams tests?
And more to the point, how do you write a 5 why's that explains how you'd typo'd a flag to turn a feature on, and another team validated that the feature worked?
> how do you write a 5 why's that explains how you'd typo'd a flag
Seriously? Even without knowing any context, there’s a handful of universal best practices that had to Swiss cheese fail for this to even get handed off to devtest…
- Why are you adding/changing feature flag changes the day before handoff? Is there process for development freeze before handoff, e.g. only showstopper changes are made after freeze? Yes but aales asked for it so they could demo at a conference. Why don’t we have special build/deployment pipeline for experimental features that our sales / marketing engineers are asking for?
- Was it tested be developer before pushing? Yes - why did succeed at that point and fail in prod? Environment was different. Why do we not have dev environment that matches prod? Money? Time? Politics?
- Was it code reviewed? Did it get an actual review, or rubber stamped? Reviewed, but skimmed important parts only — Why was it not reviewed more carefully? Not enough time — why is there not enough time to do code reviews? Oh, the feature flag name used underscore instead of hyphen — why did this not get flagged by style checker? Oh, so there’s no clear style conventions for feature flags and each team does their own thing…? Interesting…
Etc etc.
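As a hedged, concrete example of that last style-checker branch: a tiny pre-commit check that enforces a single flag-naming convention would have caught an underscore-vs-hyphen mismatch before handoff. The convention, flag names, and hook wiring below are invented for illustration:

    # Hypothetical pre-commit check: enforce one naming convention for feature
    # flags so that underscore-vs-hyphen typos get caught before handoff.
    # The kebab-case rule and the example flag names are made up.
    import re
    import sys

    FLAG_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")  # kebab-case only

    def check_flags(flag_names):
        bad = [name for name in flag_names if not FLAG_PATTERN.match(name)]
        for name in bad:
            print(f"flag '{name}' does not match the kebab-case convention")
        return not bad

    if __name__ == "__main__":
        # In a real hook these would be extracted from the config files in the diff.
        flags = sys.argv[1:] or ["enable-new-targeting", "enable_new_targeting"]
        sys.exit(0 if check_flags(flags) else 1)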
Curious, do your 5 why's actually look like this, kind of stream-of-consciousness? Because I love this! Our org's 5 why's are always a linear 5 steps back that end at THE ROOT CAUSE. And those are the good ones. Others are just a list of five things that happened before or during the incident.
I've always pushed to get rid of this section of the postmortem template, or rename it, or something, because framing everything into five linear steps is never accurate, and having it be part of the template robs us of any deeper analysis or discussion because that would be redundant. But, it's hard to win against tradition and "best practices".
Just saying, once you find out the testing team is unreliable, you make sure there's a form of evidence it actually got tested, that someone on your team reviews before signing off.
It doesn't take a whole team. There are lots of mechanisms to produce that evidence. This is just how it works. If two checks aren't sufficient, it becomes three. Or four. Until problems stop making it through.
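One hedged example of such a mechanism: a release gate that refuses to promote a build unless a signed-off test report exists for that exact build. The report format, field names, and directory below are invented for illustration:

    # Hypothetical release gate: refuse to promote a build unless a test report
    # exists for this exact build id and records a sign-off. The JSON fields and
    # the report location are made up.
    import json
    import pathlib
    import sys

    def gate(build_id: str, reports_dir: str = "test-reports") -> bool:
        report = pathlib.Path(reports_dir) / f"{build_id}.json"
        if not report.exists():
            print(f"no test report found for build {build_id}")
            return False
        data = json.loads(report.read_text())
        if data.get("build_id") != build_id or not data.get("signed_off_by"):
            print("test report is for a different build or has no sign-off")
            return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if gate(sys.argv[1]) else 1)

The point isn't the specific format; it's that "it was tested" becomes an artifact someone on the releasing team can check before signing off.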
> Just saying, once you find out the testing team is unreliable, you make sure there's a form of evidence it actually got tested
Once you find out the heart surgeon shows up drunk to the operating room, you make sure there is an additional nurse there to hold his arm steady.
:P I mean, obviously assuming you don't have the choice of changing your testing team. But even if you do, what if they're worse?
I... with the evocative scenario... would choose another remedy, rather than have a nurse steady the drunken surgeon's arm.
I think that's the point. If you have an incompetent team or team member the number of checks around them can grow astronomically and still you will have problems. At a certain point the systemic problem can become "the system is unwilling to replace this person/team with a competent one".
(That said, this is only in the case of persistent problems. Everyone can be inattentive some of the time, and a sensible quality system can be very helpful here. It's when the system tries to be a replacement for actually knowing what you're doing that things go off the rails)
While it's clear that your testing team should have their own 5 whys as well, I think it's reasonable for the manager of your team to ask you the question: how do we prevent this in the future? The unfortunate reality of large companies is that sometimes the quality and behaviour of other teams is (to some extent / on some time horizon) out of your control, and so the question for any given team lead is often "what can I do differently". It does seem that in this case there probably was some mechanism by which your team could have had its own internal testing process that needn't replicate the full responsibilities of the testing team but could at least have caught this issue.
A very relatable experience; there's a lot of pressure to stop the Whys at the dev team and not question larger leadership or organizational moves.
Interesting one.
My first thought is: why is rolling out a new system to prod that isn't used yet an incident? I don't think "being in prod" is sufficient. There should be tiers of service, and a brand new service should not be on a tier where its teething issues count as an incident.
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
I would be interested to see the doc, but I imagine you'd branch off the causes; one branch of the tree is: UAT didn't pick up the bug. Why didn't UAT pick up the bug? .... (you'd need that team's help).
I think that team would have something that is a contributing cause. You shouldn't rely on UAT to pick up a bug in a released product. However, just because it is not a root cause doesn't mean it shouldn't be addressed. Today's contributing cause can be tomorrow's root cause!
So yeah, you don't blame another team, but you also don't shield another team from one of their systems needing attention! The wording matters a lot though.
The way you worded the question seems a little loaded. But you may be paraphrasing? 5 whys are usually more like "Why did the papaya team not detect the bug before deployment?"
Whereas
> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
Is more emotive. It sounds like a cross-examiner's question, which isn't the vibe you'd want to go for. 5 whys should be 5 drys. Nothing spicy!
It was an incident because it was important to leadership. It was a marketing targeting feature that was advertised to the local executive with some excitement by the management, so they were excited to share the results of it, and when there weren't results on the anticipated launch date, they wanted answers, which meant the manager treated it as an incident.
Oh geez! That is very bad (almost pointy haired) use of an incident process.
That's how we do it - there are "branches" to most of our RCAs, and in fact, we have separate sections for root cause analysis (things which directly or indirectly contribute to incident, which are a branched / fractal 5 whys) and lessons learned (things which did not necessarily contribute to incident but which upon reflection we can do better - frequently incident management or communication or reporting or escalation etc).
It took a while for all the teams to embrace the RCA process without fear and finger-pointing, but now that it's trusted and accepted, the problem management stream / RCA process is probably the healthiest / best-viewed of our streams and processes :-)
The next org you went to, did they also use the Five Whys or did they get by with Four True Colors instead?
The way I handle this with my teams: any bugs caught by the QA team go against the developers; any bugs caught after QA green-lights the go-live go against the QA team. (Of course, discounting any bugs that are deemed acceptable for go-live by the PM.)
General trick in any project management is try to arrange for the work to be done by the group that has the most influence over it.
Just as you should never take responsibility over something you are given no power over, you should move responsibility to where the power is (and if they won't take responsibility, you start carving out the edges of their power and hand it over to adjacent groups who will).
I learned pretty early how to hack the 5 Why's in order to make sure something actionable but neither trivial nor overwhelming gets chosen as the response. And I often do it early enough in the meeting that I'm difficult to catch doing it.
If I don't get invited I will sometimes crash the party, especially if the last analysis resulted only in performative work and no real progress. You get one, maybe two, and then I'm up in your business because mistakes don't mean you're stupid, but learning nothing from them does.
5 why's can be very political. You can make it take whatever direction you want to tell whatever story you want. I don't get why it's cargo-culted the way it is.
No, people can be very political. It doesn't matter what the process is.
Hell, people even legislated the value of PI that one time.
While that might be true, the five whys is notorious for slipping into a destructive "you/I suck and firing you/I solves the problem for good and I believe it makes everyone absolutely happy" style of false conclusions.
Reportedly Toyota has organizational mitigations for that problem, or reportedly the working culture there isn't so great after all. The bottom line is, it's a double-edged sword, to say the very least.
> very fundamental problem.
...5 billion years ago, the Earth coalesced from the dust cloud around the Sun...
At a large cloud provider I held a role for a bit in the “safety” organization that was tasked with developing better understanding of our incidents, working on tooling to protect systems, and so on.
A few problems I faced:
- culturally, a lack of deeper understanding of or care about “safety” topics. The powers that be are inherently motivated by launching features and increasing sales, so more often than not you could write an awesome incident retro doc and just get people who are laser-focused on the bare minimum of action items.
- security folks co-opting the safety things, because removing access to things can be misconstrued to mean making things safer. While somewhat true, it also makes doing jobs more difficult if not replaced with adequate tooling. What this meant was taking away access and replacing everything with “break glass” mechanisms. If your team is breaking glass every day and ticketing security, you’re probably failing to address both security and safety..
- related to the last point, but a lack of introspection as to the means of making the changes which led to the incident. For example: a user used ssh to run a command and ran the wrong command -> we should eliminate ssh. Rather than asking: why was ssh the best / only way the user could effect change to the system? Could we build an API for this with tooling and safeguards before cutting off ssh?
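To sketch what "an API with tooling and safeguards" could look like instead of raw ssh (the operation names, allowlist, and audit log path below are assumptions, not anything the provider actually built):

    # Hypothetical "safer than raw ssh" wrapper: an allowlist of known operations
    # plus an explicit confirmation for anything that mutates state, with every
    # call written to an audit log. Names and paths are invented.
    import datetime
    import subprocess

    ALLOWED_COMMANDS = {
        "restart-service": ["systemctl", "restart", "myservice"],
        "tail-logs": ["journalctl", "-u", "myservice", "-n", "100"],
    }
    MUTATING = {"restart-service"}

    def run(operation: str, confirm: bool = False) -> int:
        if operation not in ALLOWED_COMMANDS:
            raise ValueError(f"unknown operation: {operation}")
        if operation in MUTATING and not confirm:
            raise PermissionError("mutating operation requires confirm=True")
        with open("/var/log/ops-audit.log", "a") as log:
            log.write(f"{datetime.datetime.now().isoformat()} {operation}\n")
        return subprocess.run(ALLOWED_COMMANDS[operation]).returncode

Something like this keeps the operator in the loop but removes the "ran the wrong command in the wrong shell" failure mode, without resorting to break-glass tickets for routine work.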
I've applied for a couple jobs like this and was somewhat relieved they didn't call me back.
When you move thinking about reliability or safety outside of the teams that generate the problems, you replace self reflection with scolding, and you have to either cajole people to make changes for you or jump into code you're not spending enough time with to truly understand. And then if you make a mistake this is evidence that you shouldn't be touching it at all. See we told you this would end badly.
Yeah I think that’s accurate. The org had good intentions and owned some reasonable programs, but more and more became basically a security organization focused on cutting off hands to prevent people from touching keyboards, rather than addressing real systemic risks and patterns of operator behaviour leading to incidents.
I did a very long RCA on a problem. My management at the time was really BIG into looking at ALL THE CAUSES. They wanted HUGE fishbone diagrams to show that we had looked at everything. This was in the days of having huge drum plotters, so the diagrams could be 36" and many feet long.
So I did what they wanted and the root cause was:
On December 11 1963 Mr and Mrs Stanley Smith had sexual intercourse.
I got asked what that had to do with anything and I said, "If you look up a few lines you'll see that the issue was a human error caused by Bob Smith, if he hadn't been born we wouldn't have had this problem and I just went back to the actual conception date."
I got asked how I was able to pin it to that date and said "I asked Bob what his father's birthday was and extrapolated that info"
I was never asked to do a RCA again.
I model RCAs after my understanding of NTSB and incident response after my understanding of NASA command centers.
They're both flawed but often replace something that works 3x worse than my caricature of both.
The findings should always result in a material change that is worth at least the effort of having done it. Not just a checkbox that proves we did something. The investment in the mitigation should honor the consequences of the failure, and the uniqueness of the failure. Or rather, the lack of uniqueness. As a failure repeats in kind (e.g., a bunch of 737 Maxes crashing), trust in the system is put in jeopardy. By the time a problem has happened three times, the response should begin to resemble penance.
So how do we get the problem not to hit production again, or how do we at least keep it from happening due to the exact same error?
And for some failure modes, we need to project the consequences going forward. Let's say you find your app is occasionally crashing over a weekend because of memory leaks, plus the lack of Continuous Deployment forcibly restarting the services. We can predict this problem will happen reliably on Memorial Day, and Labor Day. So we need to do something relatively serious now.
But it'll also get much worse on Thanksgiving weekend, and just stupid around Christmas, when we have code freezes. So we do something to get us through Memorial Day but we also need a second story near the top of the backlog that needs to be done by Labor Day, Thanksgiving at the latest. But we don't necessarily have to do that story next sprint.
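To make the stopgap concrete: for that leak-over-a-long-weekend scenario, the "get us through Memorial Day" fix might be as dumb as a watchdog that restarts the service before it falls over, while the real fix sits near the top of the backlog. The service name and threshold here are invented:

    # Hypothetical stopgap for the leak-over-a-holiday-weekend scenario: restart
    # the service when resident memory crosses a threshold. The service name and
    # the 2 GiB limit are placeholders.
    import subprocess
    import time

    SERVICE = "myservice"
    RSS_LIMIT_KB = 2 * 1024 * 1024  # restart above roughly 2 GiB resident memory

    def resident_kb(service: str) -> int:
        pid = subprocess.check_output(
            ["systemctl", "show", "-p", "MainPID", "--value", service], text=True
        ).strip()
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return 0

    if __name__ == "__main__":
        while True:
            if resident_kb(SERVICE) > RSS_LIMIT_KB:
                subprocess.run(["systemctl", "restart", SERVICE])
            time.sleep(300)  # check every five minutes

It's not a fix for the leak, which is exactly why the second, deeper story still needs a deadline.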
Some of the same thoughts in Richard Cook, which was a brain-altering read for me:
https://how.complexsystems.fail/
Hah #7 really hits home. Every RCA I’ve been a part of always ends up pointing to systemic failures in the org at the top level, because walking the tree always leads there. You can’t blame any one person or system for a failure in isolation. It’s usually some form of, “this is ultimately a consequence of miscalibrating the risk associated with business/financial decisions.”
I forget where I heard this but, “you manage risk, but risk cannot be managed.” Ie. there is no terminal state where it’s been “solved.” It’s much like “culture of safety.”
I was pondering this a bit recently while going through The Wire
Though unsatisfying it feels like a lot boils down to "shit rolls downhill" or "fish rot from the head down"
Ultimately this is what management is at least nominally supposed to do: decide values, set standards, and ensure that people down the chain are strategically aligned and have an environment where everyone is working together. That includes Andy Grove's "getting people to resolve their own differences" and other meta tasks. But fundamentally the person in charge is responsible for everything, even though they can't know everything.
Any RCA that doesn't provide useful feedback to management up and down the chain is missing pieces, but there's lots of discussion elsewhere in this thread about that by people better at elaborating than I am.
Of course, and that's very much the point.
If you get a bunch of analyses that point to underinvestment in x in order to achieve y, and you can measure that this is losing money, then the top level recalibrates.
It's not about blame, it's about course-correcting including at the top level. They can't do that course-correcting without these analyses.
As one of my early mentorish people used to say: sometimes you have to let the baby fall down.
Before that I hated it when people confessed to me that they knew a problem was going to happen and they did nothing to stop it. But the problem with doing things right the first time is that nobody appreciates how hard it is. You spend all your political capital trying to stop the company from making a mistake it desperately wants to make. You get no credit when it doesn't happen because they already figured it wouldn't. And you don't get any buy-in for keeping it from repeating in the future. So now you have this plate you have to spin all by yourself, and nobody who can hand you more control gives a shit other than the fact that you seem like an asshole so we aren't going to promote you.
This is the hidden cost of heroics[1]: it severs an important information channel.
[1]: https://entropicthoughts.com/hidden-cost-of-heroics
> “you manage risk, but risk cannot be managed.” Ie. there is no terminal state where it’s been “solved.”
I often hear people misinterpret "risk management" as if it means "risk minimisation", but this is the first time I hear of "risk elimination"!
Risk management is about finding an appropriate level of risk. Could be lower, could be higher.
You can be clever though and put systems in place where effectively the people squished in the middle now have new places where they can just say 'no' because the process requires it.
That is sort of how Agile got out there in the first place: open feedback loops let management be stupid for so long before the consequences were apparent that they could duck their vast contributions to it.
For a deeper dive there’s a somewhat old but excellent book on most of these points called Normal Accidents. [0]
[0] https://en.wikipedia.org/wiki/Normal_Accidents
This paper so affected me that I scrapped a talk I was writing three days ahead of an (internal) conference and wrote a talk about this paper instead!
I think #7 strikes at (but barely misses) the point - root cause analysis is not root blame analysis - but we often combine them in our minds.
Thanks for the article and shoutout - CAST is great and I use it extensively with tech teams.
Causal Analysis based on Systems Theory - my notes - https://github.com/joelparkerhenderson/causal-analysis-based...
The full handbook by Nancy G. Leveson at MIT is free here: http://sunnyday.mit.edu/CAST-Handbook.pdf
Please keep working on that piece I think it will be very useful for incident reviewers.
Someone said the quiet part loud! :
"""
Common circumstances missing from accident reports are:
Pressures to cut costs or work quicker,
Competing requests for colleagues,
Unnecessarily complicated systems,
Broken tools,
Biological needs (e.g. sleep or hunger),
Cumbersome enforced processes,
Fear of the consequences of doing something out of the ordinary, and
Shame of feeling in over one’s head.
"""
Saying "I was wrong" can be a very powerful debate tool, especially if you can follow it up with concrete actions to fix the problem.
There were tools I wrote at my last job because basically an incident could IMO be traced back to, "I was on step 5 and George interrupted me to ask a question about some other high-priority effort, and when I got back I forgot to finish step 5". The more familiar you are with a runbook, the more your brain will confuse memories from an hour or a day ago with memories of the same routine at a different time.
It's literally "did I turn off the stove". You remember turning off the stove. Many, many times. But did you turn off both burners you turned on for lunch? You're certain about one. But what about the other? That's a blur.
But we're software developers. The more mundane a task, the more likely that we can replace it with a program to do the same thing. And as you get more familiar with the task you keep adding more to the automatic parts until one day it's just a couple buttons that can be reliably pushed during peak traffic on your services. Just make sure there isn't an active outage going on when you push the buttons.
Tip. Every runbook should have this first step:
1. Create a document from template (xyz), put it in this share location (abc) and fill it in as you perform each step.
I am blessed with a bad memory so I do this anyway! But not everyone has that "advantage".
I think the tools should have this builtin.
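A minimal sketch of what "built in" could mean: a runner that creates its own dated execution log and records each step as it's completed, so "did I finish step 5?" always has a written answer. The step names and log location are placeholders, not a real tool:

    # Hypothetical runbook runner: every execution starts by creating its own
    # log, and each step is recorded as it is completed. Steps and the log
    # directory are placeholders.
    import datetime
    import pathlib

    STEPS = [
        "Announce the maintenance window",
        "Take a backup of the config",
        "Apply the change",
        "Verify the service responds",
        "Close out the maintenance window",
    ]

    def run_runbook(log_dir: str = "runbook-logs") -> None:
        path = pathlib.Path(log_dir)
        path.mkdir(exist_ok=True)
        log = path / f"run-{datetime.datetime.now():%Y%m%d-%H%M%S}.log"
        with log.open("w") as f:
            for i, step in enumerate(STEPS, start=1):
                input(f"Step {i}: {step} -- press Enter when done ")
                f.write(f"{datetime.datetime.now().isoformat()} done: {step}\n")
                f.flush()

    if __name__ == "__main__":
        run_runbook()

Even this much turns "I think I did step 5" into something you (or the next person) can actually check.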
Author here. Please note this is an early draft/stream-of-consciousness. Feel free to read and share anyway but my actual published articles hold a higher standard!
I caught your related comments and eventual link to this post in another HN thread earlier this week and really liked them / it. I'm glad you posted it by itself!
A lot of this is based on heavy assumptions about systems and risk/safety analysis. The biggest assumption this post is making is that humans should be involved at all.
Systems do not have to facilitate operators in building accurate mental models. In fact, safe systems disregard mental models, because a mental model is a human thing, and humans are fallible. Remove the human and you have a safer system.
Safety is not a dynamic control problem, it's a quality-state management problem. You need to maintain a state of quality assurance/quality control to ensure safety. When it falters, so does safety. Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Also fwiw, there's often not a root cause, but instead multiple causes (or coincidental states). For getting better at tracking down causes of failures (and preventing them), I recommend learning the Toyota Production System, then reading Out Of The Crisis. That'll kickstart your brain enough to be more effective than 99.99% of people.
> mental model
It should be more clear in the article that this term is used more broadly. The "mental model" is the function that converts feedback into control actions. In that sense, even simple automated controllers have "mental models".
> Dynamism is sometimes not a factor (although when it is, it's typically a harmful factor).
Sometimes it isn't but often it is. Yes, that makes the system more complex, along with other properties such as organicism, non-linearity, parallelism, interactivity, long-term recurrences, etc. These properties are inconvenient but we cannot just wish them away. We have to design for the system we have.
> Toyota, Deming, SPC
I have read more books in this area than most people, and I don't really see how you came away without an appreciation for the importance of humans in the loop, and the dangers of overautomation. Could you illustrate more clearly what you mean?
I agree with a lot of the statements at the top of the article, but some of them are just nonsense. This one, in particular:
> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.
Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.
> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.
I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broke and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best-case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline, there's no out there, but the analysis flow looks a bit different for things that should-have worked versus things that could-never-have worked. And, yes, it will roll up on management, but... still.
> you really need as much breadth as you can get
Sure, more is always better. Practically, though, we are trading depth for breadth. In my experience, many problems that look dissimilar after a shallow analysis turn out to be caused by the same thing when analysed in depth. In that case, it is more economical to analyse fewer incidents in greater depth and actually find their common factors, rather than make a shallow pass over many incidents and continue to paper over symptoms of the undiscovered deeper problem.
Good RCA: Produce some useful documentation to prevent issue from recurring.
Fantastic RCA: Remove requirement that caused the action that resulted in the problem occurring.
Bad RCA: Let's get 12 non-technical people on a call to ask the on-call engineer, who is tired from 6 hours of managing the fault, a bunch of technical questions they don't understand the answers to anyway.
(Worst possible fault practice is to bring in a bunch of stakeholders and force the engineer to be on a call with them while they try and investigate the fault)
Worst RCA: A half paragraph describing the problem in the most general terms to meet a contractual RCA requirement.
I recommend attending the next STAMP Workshop offered by MIT if you have a chance: https://psas.scripts.mit.edu/home/stamp-workshops
Not all problems (and systems) are alike. And probably simple approaches like Occam's Razor will work well enough for most. But the remaining 10% will need deeper digging into more data and correlations.
Root cause works better if you can come back next time the same thing happens and find a different root cause to fix. Keep repeating until the problem doesn't happen enough to care anymore.
If the result/accident is too bad, though, you need to find all the different faults and mitigate as many as possible the first time.
> Root cause works better if you can come back next time the same thing happens and find a different root cause to fix. keep repeating until the problem doesn't happen enough to care anymore.
This sounds like continuously firefighting to paper over symptoms rather than address the problems at a deeper level.
Sometimes, but sometimes you are allowed to discover and fix the real problems, which then never return.
I feel like half the time issues are caused by adding some stupid feature that nobody really wants, but makes it in anyways because the incentive is to add features, not make good software.
People rarely react well if you tell them "Hey this feature ticket you made is poorly conceived and will cause problems, can we just not do it?" It is easier just to implement whatever it is and deal with the fallout later.
It's hard to prove that the cost of a feature or bug fix or library upgrade we desperately need has been doubled by all of the features we didn't need.
My 'favorite' is when we implement stupid, self-limiting, corner-painting features for a customer who leaves us anyway. Or who we never manage to make money from.