Operate Track · Tool 01 Guide

Root Cause Analysis: When to use 5-Whys, Fishbone, or Fault Tree

Three methods, one decision: which one fits the failure in front of you? This guide is the difference between an RCA that genuinely changes the maintenance program and one that fills out a form. We cover what each method actually is, when to reach for it, when to walk past it, and how to combine them when a single method isn't enough.

IEC 62740 · IEC 61025 · ISO 14224 §C.3 · CCPS Guidelines
⚡ TL;DR

5-Whys for an event with one obvious symptom and a chain of cause-and-effect: a pump trip, a missed shutdown, a maintenance error. Fast, low-friction, lives or dies on whether you stop at the right level.

Fishbone (Ishikawa 6-M) for an event where the cause could plausibly live in several places and the team needs to brainstorm across categories: a recurring valve failure, a quality drift, an unexplained vibration. Wider net than 5-Whys, less depth per branch.

Fault Tree Analysis (FTA) for safety-critical events where you need to know combinations of events that could cause the top event, with logic gates and (optionally) quantified probability. Demand mode trips, loss-of-containment scenarios, anything regulators will read.

Why method selection matters

Most plants pick an RCA method by habit. The team has a 5-Whys template, so every event gets 5-Whys. Or the safety department mandated TapRooT ten years ago, so every event gets a TapRooT booklet whether it deserves one or not.

The cost of using the wrong method is rarely visible. You still produce a report, the actions still go on the register, the event still closes. What you don't see is the failure mode you missed because the method couldn't see it: 5-Whys won't surface a combination of causes; Fishbone won't help you stop at the right depth; FTA quantified with made-up failure rates is theatre.

The good news is that the choice is mostly mechanical once you know the failure type. The next section gives you a decision tree.

The 90-second decision framework

Before opening the Toolbox, answer three questions about the failure event:

  1. How serious was it (or could it have been)? Safety-critical or loss of containment → the analysis will end up in front of regulators or insurers. Operational only → the audience is the maintenance team.
  2. How well do you understand the failure mechanism? If you can already write a chain of "because A, then B" sentences confidently, the cause structure is linear. If you find yourself listing candidates ("could be the sensor, could be the actuator, could be the procedure...") the structure is parallel.
  3. Has this happened before? First occurrence → you're investigating one event. Third occurrence → you're investigating a systemic problem, and the cause categories matter more than any single chain.

[Figure: RCA method selection decision tree. Starting from the failure event, the first branch asks whether the event is safety-critical or has loss-of-containment potential (loss of containment, fatality potential, regulatory event). If yes, the path leads to Fault Tree Analysis. If no, the next question asks whether the cause structure feels linear or parallel. Linear paths lead to 5-Whys; parallel paths lead to Fishbone. A side branch suggests starting with Fishbone for recurring failures and following up with 5-Whys on each candidate branch. Linear and parallel are not exclusive; combining methods is normal (see § Combining methods).]

Three questions, one method. The decision usually clears itself in under two minutes. If the answers don't lead clearly to one method, that itself is a signal: start with Fishbone to map the candidate causes, then drill any promising branch with 5-Whys.
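
If it helps to see the three questions as logic, here is a minimal Python sketch of the decision tree. The function name, the `Structure` enum, and the `recurring_threshold` parameter are illustrative assumptions, not part of the Toolbox. The default of 3 follows the "third occurrence" wording above.

```python
from enum import Enum
from typing import List


class Structure(Enum):
    LINEAR = "linear"      # you can write "because A, then B" with confidence
    PARALLEL = "parallel"  # you find yourself listing candidate causes
    UNCLEAR = "unclear"    # the answers do not point clearly either way


def select_methods(
    safety_critical: bool,
    structure: Structure,
    occurrences: int,
    recurring_threshold: int = 3,  # assumed cut-off for "recurring"
) -> List[str]:
    """Return the suggested methods in the order they would be run."""
    if safety_critical:
        # Safety-critical or loss-of-containment potential goes to FTA.
        return ["FTA"]
    if occurrences >= recurring_threshold or structure is Structure.UNCLEAR:
        # Recurring or unclear: map the cause space first, then drill down.
        return ["Fishbone", "5-Whys"]
    if structure is Structure.LINEAR:
        return ["5-Whys"]
    return ["Fishbone"]


# A first-time pump trip with a confident cause chain:
print(select_methods(False, Structure.LINEAR, occurrences=1))
```

Treat the output as a starting suggestion. As the figure notes, linear and parallel are not exclusive, and combining methods is normal.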

The three methods, side by side

Method 1

5-Whys

Ask "why did that happen?" five times (give or take). Each answer becomes the question for the next level. Stop when you reach a management-system or design cause that, if eliminated, would prevent the failure.

Best for: Linear cause
Time per RCA: 15–45 min
Standard: IEC 62740 Annex B

Method 2

Fishbone (Ishikawa 6-M)

List candidate causes under six standard categories: Man, Machine, Method, Material, Measurement, Environment. Each category becomes one rib of a fish skeleton aimed at the failure event.

Best for: Parallel cause
Time per RCA: 1–3 hours
Standard: IEC 62740 Annex C

Method 3

Fault Tree Analysis (FTA)

Top event at the head, basic events at the leaves, AND/OR gates between. Logic shows which combinations of basic events produce the top event. Quantifiable with failure-rate data per cut set.

Best for: Combinations
Time per RCA: 4–40 hours
Standard: IEC 61025

What each method is good at

Capability5-WhysFishboneFTA
Speed (single analyst, no facilitation)★★★★★
Depth on one cause chain★★★★★★
Breadth across cause types★★★★★
Surfacing combinations of causes★★★★★
Quantified probability of root cause★★★
Defensible to a regulator or insurer★★★★★
Workshop-friendly with mixed audience★★★★★
Captures non-technical (human / process) causes★★★★★

The Bluestream Toolbox ships 5-Whys and Fishbone in the Operate track today. FTA is on the near-term roadmap as a structured tree (top event, intermediate events, basic events, AND/OR markers). The data model accepts FTA imports already, so analyses you do today in another tool can land in the same report format later.

5-Whys: what it actually is

5-Whys started at Toyota in the 1930s. It survived because it's hard to do badly — or rather, the way you do it badly is obvious to anyone reading it. The point is not to ask exactly five questions; it's to drive the analysis past symptoms down to causes you can act on.

The discipline is in two places. First, stop when stopping makes sense, not when you hit five. If the second "why" already lands on a management-system root cause (no procedure existed for this scenario), don't pad to five. If the seventh "why" still hasn't reached anything controllable (the chemistry of metal fatigue), stop, step back to the deepest cause you can actually control, and treat that as the actionable root.

Second, each "why" should follow from the previous answer factually, not by interpretation. "Why did the pump trip?" "Because the bearing temperature alarm activated." That's a factual chain. "Why did the pump trip?" "Because maintenance has been weak this year." That's a jump, not a chain — and it short-circuits the analysis.

When 5-Whys is the right call

When 5-Whys is the wrong call

Worked example: Feed pump PA-1003-A tripped on high vibration

5-Whys

One event, one symptom, factual chain available. Classic 5-Whys candidate.

Why 1: Pump tripped → Vibration sensor exceeded 7.1 mm/s for 10 s
Why 2: Vibration exceeded threshold → Coupling alignment had drifted to 0.4 mm parallel offset
Why 3: Coupling drifted → Foundation bolts on the motor side had loosened (torque check showed 60% of spec)
Why 4: Foundation bolts loosened → No PM existed to re-torque bolts on this skid
Why 5: No PM existed → The GMC for end-suction pumps assumed grouted skids; this one was bolt-mounted — concept variant was never created

Roots:

Immediate: Loose foundation bolts caused coupling misalignment, which produced vibration that exceeded the trip threshold.
Contributing: No periodic bolt-torque check was in the PM program for this asset.
Systemic: The GMC selection process does not check mounting type (grouted vs bolt-mounted) before assigning a maintenance concept.

Why this worked: The chain was factual at every step. We could have stopped at Why 4 ("no PM existed") and acted, but Why 5 surfaced a system-level fix (update the GMC selection) that prevents the same gap on every bolt-mounted pump in the fleet.
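
The stopping rule can be sketched in a few lines of Python using the chain above. The `Why` record and its `system_level` flag are hypothetical. Which steps count as management-system or design causes is a judgment the analyst makes, and the flags here are one reading of this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Why:
    effect: str         # what is being explained
    cause: str          # the factual answer
    system_level: bool  # assumed flag: management-system or design cause


chain = [
    Why("Pump tripped", "Vibration exceeded 7.1 mm/s for 10 s", False),
    Why("Vibration exceeded threshold", "Coupling drifted to 0.4 mm parallel offset", False),
    Why("Coupling drifted", "Motor-side foundation bolts loosened (60% of spec)", False),
    Why("Foundation bolts loosened", "No PM existed to re-torque bolts on this skid", True),
    Why("No PM existed", "GMC assumed grouted skids; bolt-mounted variant never created", True),
]


def first_actionable_depth(steps: List[Why]) -> Optional[int]:
    """1-based depth of the first system-level cause, or None if never reached."""
    for depth, step in enumerate(steps, start=1):
        if step.system_level:
            return depth
    return None


print(first_actionable_depth(chain))      # 4: the earliest point where stopping is defensible
print(first_actionable_depth(chain[:3]))  # None: still at physical causes, keep asking
```

The first system-level cause is the earliest sensible stopping point, not necessarily the best one. In this example the analysis went one level further because the deeper cause covered the whole fleet.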

Fishbone: the 6-M categories

Kaoru Ishikawa designed the fishbone diagram for Kawasaki Steel in the 1960s. The original was for quality engineering in manufacturing; the 6-M categories are the version that travelled into reliability and now sits in IEC 62740 Annex C as one of the recommended structured methods.

The six categories are deliberately broad. They're meant to force the team to consider cause types they wouldn't have thought of. When the team is wired to blame the equipment, "Method" makes them look at the procedure. When the team is wired to blame the operator, "Measurement" makes them look at the instrument that misled the operator.

Category | What lives here | Typical industrial examples
Man | Human factors | Operator action, training gap, fatigue, shift handover, decision under time pressure, supervision
Machine | Equipment / tooling | Component wear, design limit, missing protection, calibration drift, undersized component, mismatched spare
Method | Procedure / process | Wrong sequence, missing step, ambiguous instruction, unsuitable for the actual context, deferred or skipped PM
Material | Inputs / spares | Wrong specification, contamination, defective batch, expired chemical, incompatible substitution, supplier QA gap
Measurement | Instrument / data | Sensor drift, missing alarm setpoint, wrong engineering units, blocked impulse line, DCS software bug, lost trend history
Environment | Operating context | Vibration source nearby, temperature swing, humidity, dust, salt spray, lightning, third-party damage

When Fishbone is the right call

When Fishbone is the wrong call

Worked example: PSV-201 failed pop test three times in 12 months

Fishbone

Same valve, same failure mode (set pressure drift), three occurrences. Each individual event has its own 5-Whys story; what we need now is whether something connects them. Fishbone first.

Man: Test technician different each time; no formal calibration of test rig pressure ref
Machine: Spring is 17 years old, original spec; spring fatigue likely; bonnet seal looks ok
Method: Pop test procedure references 2009 OEM bulletin; new bulletin issued 2022 with revised test fluid; current procedure still uses 2009 fluid
Material: Test fluid stored on a heated rack — specification calls for ≤ 25 °C, rack runs at 38 °C
Measurement: Test rig pressure transducer last calibrated 2024-08; due 2025-02; overdue
Environment: Valve location exposed to direct sun in summer; thermal cycling 30 °C amplitude

What Fishbone surfaced that 5-Whys would have missed: three of the six branches (Method, Material, Measurement) all point at the test rig, not the valve. The convergent finding is that the test itself is unreliable — we don't actually know if PSV-201 is failing, or if the test that says it's failing is failing. The corrective actions split: replace the spring (Machine), update procedure to current OEM bulletin (Method), recalibrate the rig (Measurement), and fix the test fluid storage (Material).
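
One way to make that convergence visible is to tag each branch with what it points at and count the tags. The sketch below is a minimal Python illustration. The `points_at` tags are an assumed reading of the PSV-201 findings (one tag per category, following the interpretation above), not output from any tool.

```python
from collections import Counter

# Category -> (short finding, what the finding points at).
# The second element of each pair is an assumed, analyst-assigned tag.
fishbone = {
    "Man": ("Different technician each test", "people"),
    "Machine": ("17-year-old spring, fatigue likely", "valve"),
    "Method": ("Procedure still uses the 2009 test fluid", "test rig"),
    "Material": ("Test fluid stored at 38 °C, spec is <= 25 °C", "test rig"),
    "Measurement": ("Rig pressure transducer calibration overdue", "test rig"),
    "Environment": ("Direct sun, 30 °C thermal cycling", "valve"),
}


def convergence(diagram):
    """Count how many categories point at each target, most common first."""
    counts = Counter(target for _finding, target in diagram.values())
    return counts.most_common()


for target, branches in convergence(fishbone):
    print(f"{target}: {branches} of {len(fishbone)} branches")
```

When one target collects branches from several unrelated categories, as the test rig does here, that is a cue to question the thing doing the measuring before condemning the equipment.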

Fault Tree Analysis: when only logic gates will do

FTA is the heaviest of the three. It's also the only one that can tell you which combinations of basic events produce the top event — and, with failure-rate data, how likely each combination is.

A fault tree starts with the top event (the failure you're investigating, or a hypothetical accident scenario). It's broken into intermediate events connected by AND/OR logic gates. An OR gate means any of the children can cause the parent; an AND gate means all children must occur together. Decomposition continues until you reach basic events — failures whose probability you can either measure (from OREDA, vendor MTBF, plant history) or assert from first principles.

The output is a list of minimal cut sets: the smallest combinations of basic events that produce the top event. A basic event that reaches the top event through OR gates alone is a one-element cut set, which means it's a single point of failure. Cut sets with multiple basic events indicate that several things must go wrong together, which is usually less likely, but not always (common-cause failures defeat that assumption).
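
For small trees, minimal cut sets can be derived mechanically. The Python sketch below uses an assumed nested-tuple representation (a node is either a basic-event name or a `(gate, children)` pair) and the cooling-water tree from this section's figure. It is a teaching sketch, not how any particular FTA package stores its trees.

```python
from itertools import product

# A node is a basic-event name (str) or a gate: ("AND" | "OR", [children]).
tree = ("OR", [
    ("AND", ["Pump A fails", "Pump B fails"]),
    ("OR", ["Mains fault", "UPS fault"]),
])


def cut_sets(node):
    """Return all cut sets of the subtree as a set of frozensets (not yet minimal)."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any one child's cut set is enough on its own.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop any cut set that strictly contains another cut set."""
    sets = cut_sets(node)
    return {s for s in sets if not any(other < s for other in sets)}


for cs in sorted(minimal_cut_sets(tree), key=lambda s: (len(s), sorted(s))):
    label = "single point of failure" if len(cs) == 1 else "combination"
    print(sorted(cs), "->", label)
```

The brute-force expansion grows quickly with tree size, so this approach only suits trees small enough to draw on one page.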

[Figure: Simple fault tree example. The top event "Loss of cooling water supply" is connected via an OR gate to two intermediate events: "Both pumps fail" (an AND gate over the basic events "Pump A fails", P = 1.2e-3 /h, and "Pump B fails", P = 1.2e-3 /h) and "Power loss to skid" (an OR gate over the basic events "Mains fault", P = 5e-5 /h, and "UPS fault", P = 8e-5 /h). Cut sets: {Pump A · Pump B}, {Mains}, {UPS}. The two power-side cut sets are single-point failures.]

Anatomy of a fault tree. Top event at the head. OR gates mean any child can cause the parent. AND gates mean all children must occur together. Basic events at the leaves carry probabilities. The smallest combinations producing the top event are the minimal cut sets; single-element cut sets are single points of failure.
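
To show what quantification looks like, here is a rough Python sketch using the figure's values. It rests on several assumptions that do not come from the guide:

- The per-hour figures are read as constant failure rates.
- Components are non-repairable over the window.
- All basic events are independent, so there are no common-cause failures.
- The exposure window is a hypothetical 100 hours.

```python
import math

MISSION_HOURS = 100.0  # hypothetical exposure window, chosen only for illustration

# Values from the example figure, read here as constant failure rates per hour.
rates_per_hour = {
    "Pump A fails": 1.2e-3,
    "Pump B fails": 1.2e-3,
    "Mains fault": 5e-5,
    "UPS fault": 8e-5,
}
minimal_sets = [
    ("Pump A fails", "Pump B fails"),
    ("Mains fault",),
    ("UPS fault",),
]


def event_probability(rate, hours):
    """P(at least one failure within the window) for a constant failure rate."""
    return 1.0 - math.exp(-rate * hours)


def cut_set_probability(events, hours=MISSION_HOURS):
    """All events in the cut set have occurred within the window (independence assumed)."""
    return math.prod(event_probability(rates_per_hour[e], hours) for e in events)


def top_event_probability(sets, hours=MISSION_HOURS):
    """Combine cut sets as independent OR branches (valid here: no shared events)."""
    none_occur = math.prod(1.0 - cut_set_probability(s, hours) for s in sets)
    return 1.0 - none_occur


for s in minimal_sets:
    print(f"{' · '.join(s)}: {cut_set_probability(s):.4f}")
print(f"Top event: {top_event_probability(minimal_sets):.4f}")
```

Under these assumptions the two-pump cut set comes out more likely than either single-point power failure. That is a reminder that "more events in the cut set" does not automatically mean "less likely". Change the window or the rates and the ranking can flip, which is why the inputs need to be defensible before the output is.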

When FTA is the right call

When FTA is the wrong call

FTA's weakness: the tree is only as good as its scope. Anything you didn't put in the tree, it can't surface. Combine FTA (to map known failure paths) with Fishbone (to make sure you haven't missed an entire category of cause) on anything important.

Combining methods in one investigation

A common pattern for non-trivial events is to chain methods rather than pick one. The sequence that works in practice:

  1. Fishbone first to map the candidate cause space without committing. Cheap insurance against missing a whole category.
  2. 5-Whys per promising branch to drive the strongest candidates to actionable depth.
  3. FTA on the critical sub-path if any branch involves a safety system, redundancy, or a combination that warrants quantification.

The Bluestream Toolbox supports this workflow directly. Run a Fishbone, identify the strong category, switch to 5-Whys for that branch, and both outputs land in the same RCA report. Push the immediate cause into FMECA from either method — the Develop/Operate loop closes the same way regardless of which method produced the finding.

Common mistakes (in all three methods)

  1. Stopping at "operator error". The most common 5-Whys failure mode. "Operator error" is not a root cause; it's the place where the investigation should start. Why did the operator do that? What did the procedure say? What did the HMI show? Push past human action to the system that produced the action.
  2. Padding to look thorough. Six bullets per Fishbone category when only two are real. Five whys when three would have ended at a system cause. Length is not depth.
  3. Confusing chronology with causation. "The alarm sounded, then the pump tripped, therefore the alarm caused the trip" — classic post hoc. Both could have a common upstream cause.
  4. Calling the immediate cause the root cause. A bearing failed. Why was that bearing in service longer than its rated life? Why was no condition monitoring detecting the wear? The bearing is the immediate cause. The system gap is the root cause.
  5. Quantifying a fault tree without data. Made-up failure rates produce decision-grade numbers from non-decision-grade inputs. If you don't have OREDA, vendor data, or plant history, document the qualitative cut sets and stop there.
  6. Letting the loudest voice drive the analysis. The whole point of a structured method is to slow down the team's first hypothesis. If the room has settled on the cause before the analysis starts, the method is decoration.
  7. Findings without actions or owners. An RCA without specific, owned, dated corrective and preventive actions is documentation. It does not change anything in the maintenance program.
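
The last item on that list is the easiest to check mechanically. The Python sketch below flags actions that lack a description, an owner, or a due date. The `Action` record, the owner role, and the date are hypothetical and only illustrate the check.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Action:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_actions(actions: List[Action]) -> List[Action]:
    """Return actions that are missing a description, an owner, or a due date."""
    return [
        a for a in actions
        if not a.description.strip() or not a.owner or a.due is None
    ]


actions = [
    # Hypothetical owner and date for the pump example's corrective action.
    Action("Add bolt re-torque PM for the PA-1003-A skid", "Maintenance planner", date(2025, 6, 30)),
    Action("Review GMC selection for mounting type"),  # no owner, no date
]

for a in incomplete_actions(actions):
    print("Not actionable yet:", a.description)
```

A check like this only confirms the fields are filled in. Whether the action is specific enough to change the program still takes a reviewer.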

Closing the loop: from RCA back to the program

The point of every RCA is to change something in the program so the same failure doesn't happen again and lead to the same investigation. The Bluestream Operate track makes this explicit. Every RCA result has two write-back buttons:

An RCA that doesn't close the loop is half done. Pick the method that fits the failure, run it properly, then push the finding back into the FMECA and RCM that produced the original program. That is what makes the next instance of the failure rare instead of inevitable.
