Incident Investigations and Root Cause Analyses: Tips from a Pro


In this article, we've got an interview with Joe Estey talking about performing incident investigations at work and looking for a root cause (or root causes).

You may already know this is the second of two articles drawn from an interview about incident investigations with Joe Estey. The earlier article is more of an introduction to incident investigations, including what an incident investigation is, when and how to do one, pros and cons of doing your own as opposed to bringing in a third party, and some common mistakes people make while performing incident investigations.

In addition, we've also recently spoken with Joe about performing pre-task pre-mortems, discussions at the job scene intended to help avoid having incidents and having to perform incident investigations.

And with those three points made, let's see what Joe had to tell us about incident investigations and performing a root-cause analysis.

Incident Investigations and Performing a Root-Cause Analysis

[For those interested in a short introduction to incident investigations, check the short sample below from our online incident investigation training course.]

And now let's jump right into the questions with Joe.

Root Cause--Definition and Relation to Incident Investigations

Question: Can you start by telling us what is a root-cause analysis and how is it related to an incident investigation?

OK, great. Normally, instead of root-cause analyses, we call them causal evaluations or causal analyses because a root-cause analysis is a very specific type of analysis. So, a causal evaluation is a level of effort used to understand why the actual outcome you got was different than the expectation.

So, all problems can be summarized as "what you wanted" vs. "what you got," and there can be a very big gap or it can be a very small gap. And so the level of effort has to be right to understand why that gap happened.

So, in regulatory communities, like NERC (for power) and the Department of Energy, they have defined two types of analyses. There's an apparent-cause analysis, and a root-cause analysis. And they did that because they're using taxpayers' money, or ratepayers' money, in order to study them, and they need a rationale for why you spent so much time studying the problem. And it usually has to do with consequence: you know, did you kill someone, did you blow up a substation, that kind of thing.

Whereas, in the private sector, where they want to be getting, let's say, 8.5 million square feet of cardboard out in a day, and yet they can only get 5 million out, they are not driven by the same regulatory framework. They just want to know why they aren't getting what they want to be getting, or getting the results that they want.

So in that case, they do a causal analysis, and it starts with, number one, the problem statement, what's the REAL issue rather than the one that people may think is the issue? So number one is always, start with a draft problem statement--tell me what you think is the problem. And that's usually defined as what you wanted vs. what you're getting.

Next in order is give me the evidence that it's a problem. What are the indicators? Specifically, are you slowing your process, are you damaging more product, are you hurting people, are you nearly hurting people, do subcontractors come on site and not know what to do when they start work--whatever the indicators are.

Now, even when someone says something subjective, like "Well, our people aren't working as safely as they can be," after you define the problem statement, you still have to ask, "Can you show me that? Where specifically is that evidence that suggests that what you're telling me is correct?"

And then, the next one is, impact. OK, if all that is true, then what's the consequence most likely for the organization?

So it's the issue, the indicator, and the impact.

And then for the analysis, there are a few tools.

The most common tool used for the analysis is The 5 Whys. However, most causal analysis experts don't rely on The 5 Whys unless it is the lowest level of incident possible in terms of consequence, because it starts on one landing, like "Why did the machine do the thing that you didn't want it to do?" and then you answer that, and then you answer that, and just keep asking why, but you're only going down one stairway. The problem with that is you're only using a single why to answer the one above, and so you only have one plausible conclusion in the end. And most incidents aren't a single point of failure. They're multi-factorial--we say there's a confluence of causes, like three rivers flowing together, rather than a single stream. So if you only use The 5 Whys process, which is what some of the more "quote-unquote" energy-efficient investigators use, you only get to one answer. "The person was wearing inappropriate PPE." Why? "Because they didn't know how to determine the correct from the incorrect." Why? "Because their training didn't cover it." "Oh, OK, so we have to train everyone."

But it could be that maybe the best idea is to sort that PPE before it comes on site, so that you can't--even without training--pick the wrong one. And there could be a variety of answers, besides just training everyone. And you won't know that if all you do is use The 5 Whys.

5 Whys and Alternatives

Question: What's an alternative to using The 5 Whys, then?

Well, a lot of analysts use The Why Tree rather than the Why Staircase (from The 5 Whys), and it goes by names such as Cause Mapping and Latent Causal Analysis; there are many different names for it. A lot of people have put proprietary pegs onto some of them, but really they boil down to this: your problem statement is placed at the top of the tree, and that's the "fruit" that you are studying. And then you break down some legs, like a fault line, or fault tree, where you ask "What action was going on at the time?" and "What condition was in place at the time?" And then you ask "And what caused that?" or "And why was that?" And as you go down those legs, asking "And why was that?" and "What caused that?", the blocks you get at the very bottom are the most plausible reasons for why this event happened.
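To make the contrast between the staircase and the tree concrete, here is a minimal Python sketch. The `Cause` class, the `leaf_causes` function, and the wording of each block are illustrative assumptions built on the PPE example from earlier in the interview; they are not taken from Cause Mapping, Latent Causal Analysis, or any other named method.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One block in a why tree: a statement plus the answers to 'And why was that?'"""
    statement: str
    kind: str = "problem"  # "problem", "action", or "condition"
    whys: list[Cause] = field(default_factory=list)


def leaf_causes(node: Cause) -> list[str]:
    """Collect the bottom blocks of the tree, i.e. the most plausible reasons."""
    if not node.whys:
        return [node.statement]
    leaves: list[str] = []
    for child in node.whys:
        leaves.extend(leaf_causes(child))
    return leaves


# Problem statement at the top, then one action leg and one condition leg.
tree = Cause(
    "Worker was wearing inappropriate PPE",
    whys=[
        Cause("Worker picked the wrong PPE", kind="action", whys=[
            Cause("Worker could not tell correct PPE from incorrect", kind="condition", whys=[
                Cause("Training did not cover PPE selection", kind="condition"),
            ]),
        ]),
        Cause("Incorrect PPE was available to pick", kind="condition", whys=[
            Cause("PPE was not sorted before it came on site", kind="condition"),
        ]),
    ],
)

# A 5 Whys walk follows a single leg, so it ends at a single answer.
staircase = tree.whys[0]
print(leaf_causes(staircase))  # one plausible cause
print(leaf_causes(tree))       # every plausible cause at the bottom of the tree
```

Walking only the first leg reproduces the "train everyone" conclusion, while walking the whole tree also surfaces the option of sorting the PPE before it arrives.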

Question: Any more?

The only other one is that a barrier analysis is often used.

A barrier analysis is commonly used by people who are trained in it, and what it does is this. Every job you ever do, and every system you put in place, has some kind of controls written into it to prevent the worst thing from happening, be it operator training, quality control methodologies, or indicators. There are always some barriers put in place.

A barrier analysis is what you use to better understand which barriers in the process failed, allowing the event to happen. So, in The Why Tree, you get actions and conditions, and in the Barrier Analysis, you get where the barriers failed.

Question: In a barrier analysis, when you are looking for barriers that were set up as controls but failed, I'm assuming you also look for barriers that should have been present but were absent. Is that true as well?

That's a good point, but here's the rule behind that. The barriers that you can analyze after an incident are only those that were required to be used during the activity. They should never be barriers recommended after the fact, because now you THINK you know why it happened.

So, let's say you didn't make a policy about weight limit, and a person strained their back lifting 55 pounds awkwardly when they're allowed to lift 75 pounds on their own, if "quote-unquote" they're careful.

So your policy is, "as an individual, you can lift 75 pounds on your own if you're careful."

Well, someone wrenches their back, and then an investigator says "Well, you know what I would have done, is I would have asked for help, because..."

Well, you can't ask that. What was the policy as written at the time? Well, it was 75 pounds, so that's the only thing you can look at. You can't look at what somebody suggests because you can never assess the quality of a barrier that was never required to be in place in the first place.
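The rule Joe describes can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the `Barrier` class, the `split_barriers` function, and the two entries are assumptions based on the lifting example above, not part of a formal barrier analysis tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    name: str
    required_at_time: bool  # was it a written requirement when the work was done?


def split_barriers(barriers):
    """Separate barriers that can be analyzed from after-the-fact suggestions."""
    assessable = [b for b in barriers if b.required_at_time]
    out_of_scope = [b for b in barriers if not b.required_at_time]
    return assessable, out_of_scope


barriers = [
    # The policy as written at the time of the incident.
    Barrier("Solo lift limit of 75 pounds, if careful", required_at_time=True),
    # What an investigator says they would have done, in hindsight.
    Barrier("Ask for help with awkward lifts", required_at_time=False),
]

assessable, out_of_scope = split_barriers(barriers)
print([b.name for b in assessable])
print([b.name for b in out_of_scope])
```

Only the first list feeds the analysis. The second list holds ideas from hindsight, which cannot be assessed as barriers because they were never required to be in place.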

Compliance & Company Procedures/Policy

Question: Understood. So, by "required to be in place," are you talking specifically about employer policy, or could that also include regulatory compliance, national standards, and best practices, or is that getting too far afield?

No, no, great point.

And it would be the flow-down of those standards into the company's own policies, but normally what we ask is "When you tell me that was a required barrier, where would I find that in your policy, your procedure, or in a practice taught to your people in your training?"

So if it's not in a policy, if it's not in a procedure, they had to get it from somewhere, and the only thing left is you taught them that, because it's not in the other two. And that doesn't mean you have to teach them yourself. If they were an IBEW-qualified electrician, you don't have to train them on everything they had to know about electrical safety when you hired them; there's an expectation that they should have gotten that back at the hall. But where? Where specifically would they have been taught that?

Tips for Cutting Down Incidents

Question: Last question about the cause of workplace incidents. Based on your own experiences, what are some of the most common things that companies can do to cut down on incidents and the need for incident investigations?

Oh, that's a great one.

One of them is to, number one, do a pre-mortem, so you don't have to do a postmortem. Postmortem is after the event. A pre-mortem is a gathering of the people going off to do a job, and talking through the job in a dialogue rather than in a monologue fashion.

So, the question would be "Hey, we're going out to change some bearings in this generator or this housing, so what are some things that we especially have to pay attention to?" And if they say "Well, you know, we just need to be careful," that's not enough.

They need to probe deeply with investigative questions like the four we usually teach: number one, identify the critical steps; number two, identify the mistakes that are most likely made during the most critical steps; number three, talk about the consequences of those mistakes being made; and the biggest one is, number four, what defenses do you currently plan to put in place in order to keep those errors from happening?

And rather than just the general "Hey, are you guys all good with the job, is everyone comfortable?", the pre-mortems are really an effective way of getting people to think on the job rather than just mentally mailing their performance in. So that's one thing they can do.

Question: And again in your own experience, what are some of the most common causes of incidents that you see out there? Presumably that information can be used proactively to help avoid future incidents as well.

Great question. We know from the roll-up of reports at OSHA, NERC, and DOE that when you get the reports in about why incidents happen, 80% of incidents are caused "quote-unquote" by human error; and 20% are caused by genuine equipment failure.

BUT, if you look at the subset of human error, 70-80% of human errors are organizationally induced. Only 20-30% are caused directly by the individual.
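Combining the two sets of percentages shows what share of all incidents each category represents. The short Python calculation below is only arithmetic on the figures Joe quotes; the function and variable names are assumptions made for illustration.

```python
def share_of_all_incidents(parent_pct: float, subset_pct: float) -> float:
    """Convert 'subset_pct of parent_pct' into a percentage of all incidents."""
    return parent_pct * subset_pct / 100


HUMAN_ERROR_PCT = 80               # incidents attributed to human error
ORG_INDUCED_RANGE = (70, 80)       # share of human error that is organizationally induced
INDIVIDUAL_RANGE = (20, 30)        # share of human error caused directly by the individual

org_low, org_high = (share_of_all_incidents(HUMAN_ERROR_PCT, p) for p in ORG_INDUCED_RANGE)
ind_low, ind_high = (share_of_all_incidents(HUMAN_ERROR_PCT, p) for p in INDIVIDUAL_RANGE)

print(f"Organizationally induced: {org_low:.0f}-{org_high:.0f}% of all incidents")
print(f"Caused directly by the individual: {ind_low:.0f}-{ind_high:.0f}% of all incidents")
# Organizationally induced: 56-64% of all incidents
# Caused directly by the individual: 16-24% of all incidents
```

On these figures, organizationally induced error accounts for more than half of all incidents, while error caused directly by the individual accounts for roughly a sixth to a quarter.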

And some of the biggest causes of what we call organizationally induced errors are, a big one I see a lot, is "Yellow Light Bias," which is, we tell them what "Green" means, and we tell them what "Red" means, but we don't tell them what "Yellow" means and we expect them to figure that out for themselves. So, when they come to an intersection at the road, sometimes they stop, sometimes they go. That yellow light really just tells me I have to do something, but it doesn't tell me what I have to do.

So in most businesses--I studied 11 different incidents at a site in Colorado, all relating as a common cause to the Yellow Light Bias, which was in their job hazard analysis, they identified the scope, they identified the specific hazard, and then in the control it said "as needed," "as required," "as appropriate," or "as applicable." Those mean nothing. And so what that means is we hope you know what you're doing--there's a lot of fuzz, and fuzz is the number one cause of accidents.

So what some organizations do is they do what we call a Yellow Lights Scrub. They take their procedures, their policies, and everywhere it says "as needed" or "as required," they actually define the specific standard--because how can you get it wrong if you know specifically what to do?
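The first pass of a scrub like this, finding the fuzzy phrases, is easy to automate. The Python sketch below is an assumption about how that might look, not a description of any tool Joe uses: the function name `yellow_light_scrub` and the sample job hazard analysis text are invented for illustration, and the phrase list contains only the four phrases quoted above.

```python
import re

# The four "yellow light" phrases named in the interview.
VAGUE_PHRASES = ("as needed", "as required", "as appropriate", "as applicable")
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in VAGUE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def yellow_light_scrub(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, phrase, line) for every vague phrase in the text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((number, match.group(1).lower(), line.strip()))
    return hits


# Hypothetical job hazard analysis excerpt, invented for this example.
jha = """\
Scope: Replace bearings in generator housing.
Hazard: Stored energy in rotating equipment.
Control: Apply lockout/tagout as required.
Control: Wear hearing protection as needed.
Control: Verify zero energy with a meter before opening the housing.
"""

for number, phrase, line in yellow_light_scrub(jha):
    print(f"line {number}: '{phrase}' in: {line}")
```

A scan like this only finds the yellow lights. Replacing each one with a specific standard still takes someone who knows the work.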

Question: OK, good, so yellow light for you, and in my language fuzz. What's another common cause of the organizationally induced human error?

Another would be the assignment of people, not rotating them in and out of different positions to cross-train them, because it leads to--I hate to use the word "complacency," because complacency has many different faces, but what happens is, it leads to people getting conditioned to do the work, and we get so good at that job after the first couple of days we've learned it, that we no longer consider it possible to be hurt. And the organization doesn't put controls in place that require me to really exercise my caution.

Let me give you a good example. We had a meter reader at an event yesterday, who was brand new to the job. He had been a meter reader for a week, and had been an electrician for ten years. He was with an experienced meter reader who has been one for 20 years, so the senior meter technician is having the junior meter technician lift the terminal leads on a relay component that has to remain energized while they're doing the work. It's a low-energy component but on a high-energy system, so bad things can still happen obviously. And the experienced meter reader who has the print says "Lift F5," and the junior meter reader says "I'm lifting F5," and then he took off F6. He had to crouch down when he's doing this to look at the undercarriage of the relay, but he's in an awkward position without much visibility. When he went to test the relay, it tripped the entire system offline, and 100 different buildings lost power for about an hour. It was kind of a big deal.

Now the reason I bring that up is, here's what's interesting. The organization said "Well, if we would have known we had a new guy on it, we would have written the work package differently, requiring a secondary peer-check of all the leads that were lifted and taped before we did the test." And I asked them "Why would you do that, knowing the guy was new?" And they answered, "Well, because it was written for two senior meter technicians, who wouldn't usually make that mistake." Well, I don't know if you can see an error in that logic, but the two senior technicians can just as easily do the wrong thing as somebody who's new, but for an entirely different reason.

Question: Right, it may even be more likely, since automaticity for the senior techs has already kicked in.

There, you got it. And so the answer isn't to write work packages to address individual workers. You can't do that. IF you have the luxury of knowing who exactly is going to show up to do the job that day, MAYBE you can write packages that way.

So they were kind of going down the wrong corrective action path, because the truth is, and what made sense to them was my explanation, I said "If you have two surgeons, one who is new and one experienced, or two pilots, one experienced and one not experienced, you don't develop different checklists for them based on experience level, because you expect peak performance out of everyone in your organization, so you don't modify it because well, you two are senior pilots, you don't have to use your checklist on takeoff because you know what you're doing."

You know, the record shows that people who are very experienced most often get in trouble, because of automaticity, we get very good at what we're doing until we don't do it well.

Question: Well said. Given that, that's the last of my prepared questions for you, but it does raise one unplanned question. If I recall, 80% of the incidents are attributable to human error, but roughly 80% of those are attributable to organizationally induced human error. What is your feeling about behavior-based safety training, which SOME people feel is a way to shift blame to the employees?

That's a great question.

I actually taught behavior-based safety (BBS) for about six years. And, at the time, it seemed like the appropriate thing to do, because before that, it was all about the worker being wrong, and we just have to produce perfect workers.

And then we realized that people aren't perfect, and good luck on doing that. And no matter how much we try to put "perfection" into our training, into our systems, we are going to have to live and compensate for error. And then I was introduced to the concept of human performance improvement (HPI), which was being used by a lot of companies who are high-reliability organizations--they cannot afford to get it wrong, because if they do, really bad things are going to happen. And so they kind of evolved through the BBS process to HPI, and HPI was basically understanding that humans will be humans, and it's far easier to change the conditions humans will operate under, rather than changing the human.

Behavior-based safety sometimes wants us to change the human. The other alternative is HPI. I always use the example of the ATM machine.

If I walk up to an ATM and I don't want to leave my debit card behind, some banks have not made that easy, because they give me my cash first and then my card later. Well, once I get my cash, I may forget my card. So a behavioral approach would be putting a sign up that says "Hey, remember your card." Well, that's not going to help. I don't read signs as a human being. Another would be to have someone behind me, and do a peer check, telling me to remember my card, but how expensive is that going to be for an organization? OR, you can simply give me my card first, because I'm going to stick around for my cash.

Question: How can people contact you if they want help with incident investigations, or to speak at a conference? And can you tell us more about the conference you recently spoke at on Human Performance Improvement (HPI), if I have that correct?

Thank you very much, Jeff, I appreciate that. That's right, it was in Ontario, we had a recent conference in Toronto that I spoke at four times. It's easiest to get a hold of me via email, and you can also find me on LinkedIn at Joe Estey. I work for Lucas Engineering and Management Solutions, and I am one of six members on the national board for the Human Performance, Root Cause, and Trending Association, which is an international association dedicated to incident investigation, analysis, corrective actions, and trending of those corrective actions for improvement. And any opportunity I have to help an organization out, I welcome, because it's always a learning opportunity.

Conclusion: Incident Investigations & Root-Cause Analyses--Tips from Joe

We'd like to thank Joe Estey for his time, knowledge, and insights on incident investigations, including what an incident investigation is, how and when to perform one, who to involve in performing an incident investigation, some common errors that occur during incident investigations, and more.

If you're looking for help managing your incident investigation process, you may want to learn about our new incident management software, which can be used as a stand-alone or integrated with our Convergence learning management system. We've got a short video overview of the IMS for you below.

To help train your employees about incident investigations, you may find our online incident investigation training course helpful. We've included a short sample video below.

Here's a little more about Joe:

As Principal Performance Improvement Specialist for Lucas Engineering and Management Solutions, Joe Estey mentors and trains executives, managers and front line workers from a variety of industries on Human Performance Improvement and Leadership. Clients include national research and development laboratories, manufacturing plants, construction and demolition sites and one of a kind/first of their kind production facilities. As the recipient of three National Awards from the White House Executive Leadership council for his work in public outreach and education, he frequently speaks to public agencies, corporate and small businesses across North America.

Serving as one of six members on the National Board of Directors for the Human Performance Root Cause Trending Organization (HPRCT.ORG), Joe works with principal investigators, managers and analysts from fields as varied as aviation, pharmaceutical, medical, manufacturing and power generation to implement best management practices for reducing the frequency and severity of human error.

His book, The Tomorrow Tapestry, Life Woven on the Fabric of Change, was one of Publish America’s top ten business books in 2008 and has been used in leadership and organizational development courses throughout North America.

Remember that we published an earlier article from this same interview, and that it serves as an introduction to incident investigations. Plus, we've got yet another article that explains how to perform an incident investigation.
