Root Cause Analysis: When to Use 5-Whys vs Fishbone vs FTA

⚡ TL;DR

5-Whys for an event with one obvious symptom and a chain of cause-and-effect: a pump trip, a missed shutdown, a maintenance error. Fast, low-friction, lives or dies on whether you stop at the right level.

Fishbone (Ishikawa 6-M) for an event where the cause could plausibly live in several places and the team needs to brainstorm across categories: a recurring valve failure, a quality drift, an unexplained vibration. Wider net than 5-Whys, less depth per branch.

Fault Tree Analysis (FTA) for safety-critical events where you need to know which combinations of events could cause the top event, with logic gates and (optionally) quantified probability. Demand-mode trips, loss-of-containment scenarios, anything regulators will read.

Why method selection matters

Most plants pick an RCA method by habit. The team has a 5-Whys template, so every event gets 5-Whys. Or the safety department mandated TapRooT ten years ago, so every event gets a TapRooT booklet whether it deserves one or not.

The cost of using the wrong method is rarely visible. You still produce a report, the actions still go on the register, the event still closes. What you don't see is the failure mode you missed because the method couldn't see it: 5-Whys won't surface a combination of causes; Fishbone won't help you stop at the right depth; FTA without quantification is theatre.

The good news is that the choice is mostly mechanical once you know the failure type. The next section gives you a decision tree.

The 90-second decision framework

Before opening the Toolbox, answer three questions about the failure event:

How serious was it (or could it have been)? Safety-critical or loss of containment → the analysis will end up in front of regulators or insurers. Operational only → the audience is the maintenance team.
How well do you understand the failure mechanism? If you can already write a chain of "because A, then B" sentences confidently, the cause structure is linear. If you find yourself listing candidates ("could be the sensor, could be the actuator, could be the procedure...") the structure is parallel.
Has this happened before? First occurrence → you're investigating one event. Third occurrence → you're investigating a systemic problem, and the cause categories matter more than any single chain.

Three questions, one method. The decision usually clears itself in under two minutes. If the answers don't lead clearly to one method, that itself is a signal: start with Fishbone to map the candidate causes, then drill any promising branch with 5-Whys.
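The three questions can be sketched as a small selection function. This is a minimal illustration, not the Toolbox's logic: the `Event` fields, the `pick_method` name, the order in which the rules are checked, and the "three or more occurrences" threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    safety_critical: bool  # Q1: safety-critical or loss of containment?
    linear_chain: bool     # Q2: can you already write "because A, then B"?
    occurrences: int       # Q3: how many times has this happened?


def pick_method(event: Event) -> list[str]:
    """Return a suggested method sequence for one failure event.

    Rule order is an assumption: seriousness first, then recurrence
    and cause structure.
    """
    if event.safety_critical:
        return ["FTA"]
    if event.occurrences >= 3 or not event.linear_chain:
        # Parallel or systemic: map candidates first, then drill a branch.
        return ["Fishbone", "5-Whys"]
    return ["5-Whys"]


print(pick_method(Event(safety_critical=False, linear_chain=True, occurrences=1)))
```

A first-time pump trip with a clear chain comes back as 5-Whys alone; the same event on its third occurrence comes back as Fishbone followed by 5-Whys.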

The three methods, side by side

Method 1

5-Whys

Ask "why did that happen?" five times (give or take). Each answer becomes the question for the next level. Stop when you reach a management-system or design cause that, if eliminated, would prevent the failure.

Best for: Linear cause

Time per RCA: 15–45 min

Standard: IEC 62740 Annex B

Method 2

Fishbone (Ishikawa 6-M)

List candidate causes under six standard categories: Man, Machine, Method, Material, Measurement, Environment. Each category becomes one rib of a fish skeleton aimed at the failure event.

Best for: Parallel cause

Time per RCA: 1–3 hours

Standard: IEC 62740 Annex C

Method 3

Fault Tree Analysis (FTA)

Top event at the head, basic events at the leaves, AND/OR gates between. Logic shows which combinations of basic events produce the top event. Quantifiable when failure-rate data is available for the basic events.

Best for: Combinations

Time per RCA: 4–40 hours

Standard: IEC 61025

What each method is good at

Capability	5-Whys	Fishbone	FTA
Speed (single analyst, no facilitation)	★★★	★★	★
Depth on one cause chain	★★★	★	★★★
Breadth across cause types	★	★★★	★★
Surfacing combinations of causes	★	★★	★★★
Quantified probability of the top event	—	—	★★★
Defensible to a regulator or insurer	★	★★	★★★
Workshop-friendly with mixed audience	★★	★★★	★
Captures non-technical (human / process) causes	★★	★★★	★

The Bluestream Toolbox ships 5-Whys and Fishbone in the Operate track today. FTA is on the near-term roadmap as a structured tree (top event, intermediate events, basic events, AND/OR markers). The data model accepts FTA imports already, so analyses you do today in another tool can land in the same report format later.

5-Whys: what it actually is

5-Whys started at Toyota in the 1930s. It survived because it's hard to do badly — or rather, the way you do it badly is obvious to anyone reading it. The point is not to ask exactly five questions; it's to drive the analysis past symptoms down to causes you can act on.

The discipline is in two places. First, stop when stopping makes sense, not when you hit five. If the second "why" already lands on a management-system root cause (no procedure existed for this scenario), don't pad to five. If the seventh "why" has drifted into something nobody can control (the physics of metal fatigue), stop and take the last controllable cause as the actionable root.

Second, each "why" should follow from the previous answer factually, not by interpretation. "Why did the pump trip?" "Because the bearing temperature alarm activated." That's a factual chain. "Why did the pump trip?" "Because maintenance has been weak this year." That's a jump, not a chain — and it short-circuits the analysis.

When 5-Whys is the right call

You have one event, not a recurring pattern
The symptoms point clearly at one mechanism
The audience is the maintenance team, not regulators
You can defend the analysis in 1–2 pages
Time to closure matters — production wants the asset back

When 5-Whys is the wrong call

You're listing candidates rather than tracking a chain (use Fishbone)
Loss of containment, fatality potential, or regulatory exposure (use FTA)
Recurring failure — the chain is different each time (Fishbone first, then 5-Whys per branch)
The team strongly disagrees on the immediate cause — 5-Whys can't arbitrate, it can only follow consensus

Worked example: Feed pump PA-1003-A tripped on high vibration

5-Whys

One event, one symptom, factual chain available. Classic 5-Whys candidate.

Why 1: Pump tripped → Vibration reading exceeded 7.1 mm/s for 10 s
Why 2: Vibration exceeded threshold → Coupling alignment had drifted to 0.4 mm parallel offset
Why 3: Coupling drifted → Foundation bolts on the motor side had loosened (torque check showed 60% of spec)
Why 4: Foundation bolts loosened → No PM existed to re-torque bolts on this skid
Why 5: No PM existed → The generic maintenance concept (GMC) for end-suction pumps assumed grouted skids; this one was bolt-mounted, and a concept variant for it was never created

Roots:

Immediate: Loose foundation bolts caused coupling misalignment, which produced vibration that exceeded the trip threshold.
Contributing: No periodic bolt-torque check was in the PM program for this asset.
Systemic: The GMC selection process does not check mounting type (grouted vs bolt-mounted) before assigning a maintenance concept.

Why this worked: The chain was factual at every step. We could have stopped at Why 4 ("no PM existed") and acted, but Why 5 surfaced a system-level fix (update the GMC selection) that prevents the same gap on every bolt-mounted pump in the fleet.
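The "factual chain" rule can be sketched as a simple check on the chain's structure. The `Link` type and the `unsupported` function are hypothetical names for illustration. The sketch assumes a link counts as factual when it cites some piece of evidence and starts from the previous link's cause; real investigations need judgment that a string comparison cannot supply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    effect: str
    cause: str
    evidence: Optional[str] = None  # measurement, record, or inspection backing the link


def unsupported(chain: list[Link]) -> list[int]:
    """1-based positions of links that cite no evidence, or that do not
    start from the previous link's cause (a jump rather than a chain)."""
    flagged = []
    for i, link in enumerate(chain):
        jumps = i > 0 and link.effect != chain[i - 1].cause
        if link.evidence is None or jumps:
            flagged.append(i + 1)
    return flagged


# First three links of the PA-1003-A example, with the figures quoted above.
chain = [
    Link("Pump tripped", "Vibration exceeded trip threshold", "7.1 mm/s for 10 s"),
    Link("Vibration exceeded trip threshold", "Coupling alignment drifted",
         "0.4 mm parallel offset"),
    Link("Coupling alignment drifted", "Foundation bolts loosened",
         "torque check showed 60% of spec"),
]
# The interpretive jump used as the counter-example earlier in this section.
jump = [Link("Pump tripped", "Maintenance has been weak this year")]

print(unsupported(chain))  # []
print(unsupported(jump))   # [1]
```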

Fishbone: the 6-M categories

Kaoru Ishikawa designed the fishbone diagram for Kawasaki Steel in the 1960s. The original was for quality engineering in manufacturing; the 6-M categories are the version that travelled into reliability and now sits in IEC 62740 Annex C as one of the recommended structured methods.

The six categories are deliberately broad. They're meant to force the team to consider cause types they wouldn't have thought of. When the team is wired to blame the equipment, "Method" makes them look at the procedure. When the team is wired to blame the operator, "Measurement" makes them look at the instrument that misled the operator.

Anatomy of a Fishbone (Ishikawa 6-M). Six standard cause categories branch off the spine, each carrying candidate causes for the team to consider. Done well, the value isn't a single cause; it's the convergence: causes from multiple categories pointing at the same upstream factor. In the PSV-201 worked example below, the test rig sits behind three of the six branches, even though no single category names it directly.

Category	What lives here	Typical industrial examples
Man	Human factors	Operator action, training gap, fatigue, shift handover, decision under time pressure, supervision
Machine	Equipment / tooling	Component wear, design limit, missing protection, calibration drift, undersized component, mismatched spare
Method	Procedure / process	Wrong sequence, missing step, ambiguous instruction, unsuitable for the actual context, deferred or skipped PM
Material	Inputs / spares	Wrong specification, contamination, defective batch, expired chemical, incompatible substitution, supplier QA gap
Measurement	Instrument / data	Sensor drift, missing alarm setpoint, wrong engineering units, blocked impulse line, DCS software bug, lost trend history
Environment	Operating context	Vibration source nearby, temperature swing, humidity, dust, salt spray, lightning, third-party damage

When Fishbone is the right call

Recurring failure — the chain is different each time, but the categories repeat
Multiple plausible causes, none obviously dominant
Workshop format with a mixed audience (engineering, operations, maintenance)
You need to make sure the team didn't overlook a whole category (a common failure mode of 5-Whys)

When Fishbone is the wrong call

The cause is genuinely linear and you'd be padding categories to look thorough
Safety-critical event where you need quantified combinations (use FTA)
Time pressure — a proper fishbone needs at least an hour of structured discussion
Single-analyst investigation — the whole value is in cross-functional brainstorm

Worked example: PSV-201 failed pop test three times in 12 months

Fishbone

Same valve, same failure mode (set pressure drift), three occurrences. Each individual event has its own 5-Whys story; what we need now is whether something connects them. Fishbone first.

Man: Test technician different each time; no formal calibration of test rig pressure ref
Machine: Spring is 17 years old, original spec; spring fatigue likely; bonnet seal looks ok
Method: Pop test procedure references 2009 OEM bulletin; new bulletin issued 2022 with revised test fluid; current procedure still uses 2009 fluid
Material: Test fluid stored on a heated rack — specification calls for ≤ 25 °C, rack runs at 38 °C
Measurement: Test rig pressure transducer last calibrated 2024-08; due 2025-02; overdue
Environment: Valve location exposed to direct sun in summer; thermal cycling 30 °C amplitude

What Fishbone surfaced that 5-Whys would have missed: three of the six branches (Method, Material, Measurement) all point at the test rig, not the valve. The convergent finding is that the test itself is unreliable — we don't actually know if PSV-201 is failing, or if the test that says it's failing is failing. The corrective actions split: replace the spring (Machine), update procedure to current OEM bulletin (Method), recalibrate the rig (Measurement), and fix the test fluid storage (Material).
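The convergence reading can be sketched as a grouping step. In this illustration each candidate cause carries a third field naming the upstream factor it points at. That tag is the analyst's judgment (here following the reading above, with one cause per category for brevity), not something the diagram produces; `convergence` is a hypothetical helper name.

```python
from collections import defaultdict

# (category, candidate cause, upstream factor it points at)
# The third field is an analyst-assigned tag, assumed for this sketch.
causes = [
    ("Man", "Test technician different each time", "technician"),
    ("Machine", "Spring is 17 years old; fatigue likely", "valve"),
    ("Method", "Procedure still uses 2009 test fluid", "test rig"),
    ("Material", "Test fluid stored at 38 °C, spec ≤ 25 °C", "test rig"),
    ("Measurement", "Rig pressure transducer calibration overdue", "test rig"),
    ("Environment", "Direct sun, 30 °C thermal cycling", "valve"),
]


def convergence(rows):
    """Map each upstream factor to the set of 6-M categories pointing at it,
    strongest convergence first."""
    by_factor = defaultdict(set)
    for category, _cause, factor in rows:
        by_factor[factor].add(category)
    return sorted(by_factor.items(), key=lambda kv: len(kv[1]), reverse=True)


for factor, categories in convergence(causes):
    print(factor, len(categories), sorted(categories))
```

The first row printed is the test rig with three categories behind it, which is the signal to question the test before replacing the valve.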

Fault Tree Analysis: when only logic gates will do

FTA is the heaviest of the three. It's also the only one that can tell you which combinations of basic events produce the top event — and, with failure-rate data, how likely each combination is.

A fault tree starts with the top event (the failure you're investigating, or a hypothetical accident scenario). It's broken into intermediate events connected by AND/OR logic gates. An OR gate means any of the children can cause the parent; an AND gate means all children must occur together. Decomposition continues until you reach basic events — failures whose probability you can either measure (from OREDA, vendor MTBF, plant history) or assert from first principles.

The output is a list of minimal cut sets: the smallest combinations of basic events that produce the top event. A basic event that reaches the top event through OR gates alone is a one-element cut set, which means it's a single point of failure. Cut sets with multiple basic events indicate that several things must go wrong together, which is usually less likely, but not always (common-cause failures defeat that assumption).

Anatomy of a fault tree. Top event at the head. OR gates mean any child can cause the parent. AND gates mean all children must occur together. Basic events at the leaves carry probabilities. The smallest combinations producing the top event are the minimal cut sets; single-element cut sets are single points of failure.
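Cut-set extraction can be sketched in a few lines for small trees. The tree encoding (a basic event is a string, a gate is a `("AND" | "OR", [children])` tuple) and the example tree are assumptions made for this illustration, not a Toolbox format. The approach expands every gate top-down and then discards non-minimal sets, which is fine for teaching-size trees but grows quickly on large ones.

```python
from itertools import product


def cut_sets(node):
    """Return the cut sets of a fault-tree node as a set of frozensets."""
    if isinstance(node, str):  # basic event
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":   # any child's cut set is enough on its own
        return set().union(*child_sets)
    if gate == "AND":  # take one cut set from every child and merge them
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal(sets):
    """Drop any cut set that strictly contains another one."""
    return {s for s in sets if not any(other < s for other in sets)}


# Hypothetical tree: V = shared discharge valve stuck, A / B = pump A / B fails.
tree = ("OR", [
    "V",
    ("AND", ["A", "B"]),
    ("AND", ["A", ("OR", ["B", "V"])]),
])

mcs = minimal(cut_sets(tree))
single_points = [set(s) for s in mcs if len(s) == 1]
print(sorted(sorted(s) for s in mcs))  # [['A', 'B'], ['V']]
print(single_points)                   # [{'V'}]
```

The set {A, V} is generated by the third branch but dropped because {V} alone already produces the top event, which is exactly what makes V a single point of failure.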

When FTA is the right call

Loss of containment, fatality potential, or major environmental release
Demand-mode safety system — ESD, F&G, blowdown — failed or partially failed on demand
You need to demonstrate to a regulator that the system is acceptably safe (a one-page 5-Whys won't do)
You need quantified probability of the top event (with caveats — see below)
You suspect common-cause or common-mode failure across redundant trains
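For the quantified-probability and common-cause cases in the list above, the gate arithmetic can be sketched as follows. The sketch assumes independent basic events with no event repeated across branches, uses the same hypothetical tuple encoding for gates as any small tree, and the probabilities are placeholders chosen only to show the arithmetic, not failure data. The common-cause term is modelled here as one extra basic event, which is a simplification.

```python
from math import prod


def probability(node, p):
    """Probability of a fault-tree node, assuming independent basic events
    and no basic event repeated across branches."""
    if isinstance(node, str):  # basic event: look up its probability
        return p[node]
    gate, children = node
    child_p = [probability(c, p) for c in children]
    if gate == "AND":  # all children must occur
        return prod(child_p)
    if gate == "OR":   # at least one child occurs
        return 1 - prod(1 - q for q in child_p)
    raise ValueError(f"unknown gate: {gate}")


# Placeholder numbers for illustration only.
p = {"A": 0.01, "B": 0.01, "CCF": 0.001}

redundant = ("AND", ["A", "B"])            # two trains, treated as independent
with_ccf = ("OR", [redundant, "CCF"])      # plus one shared cause hitting both

print(f"{probability(redundant, p):.6f}")  # 0.000100
print(f"{probability(with_ccf, p):.6f}")   # 0.001100
```

With these placeholder inputs the shared cause dominates the result by roughly a factor of ten, which is why a multi-event cut set is not automatically a low-probability one.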

When FTA is the wrong call

You don't have failure-rate data for the basic events — an unquantified tree is documentation, not analysis
The failure mechanism is non-technical — human factors, procedural drift, and management-system causes resist boolean modelling
Time pressure: even a small fault tree is at least a half-day exercise
The audience is operational, not regulatory — you'll spend more time explaining the tree than they'll spend reading the actions

FTA's weakness: the tree is only as good as its scope. Anything you didn't put in the tree, it can't surface. Combine FTA (to map known failure paths) with Fishbone (to make sure you haven't missed an entire category of cause) on anything important.

Combining methods in one investigation

A common pattern for non-trivial events is to chain methods rather than pick one. The hierarchy that works in practice:

Fishbone first to map the candidate cause space without committing. Cheap insurance against missing a whole category.
5-Whys per promising branch to drive the strongest candidates to actionable depth.
FTA on the critical sub-path if any branch involves a safety system, redundancy, or a combination that warrants quantification.

The Bluestream Toolbox supports this workflow directly. Run a Fishbone, identify the strong category, switch to 5-Whys for that branch, and both outputs land in the same RCA report. Push the immediate cause into FMECA from either method — the Develop/Operate loop closes the same way regardless of which method produced the finding.

Common mistakes (in all three methods)

Stopping at "operator error". The most common 5-Whys failure mode. "Operator error" is not a root cause; it's the place where the investigation should start. Why did the operator do that? What did the procedure say? What did the HMI show? Push past human action to the system that produced the action.
Padding to look thorough. Six bullets per Fishbone category when only two are real. Five whys when three would have ended at a system cause. Length is not depth.
Confusing chronology with causation. "The alarm sounded, then the pump tripped, therefore the alarm caused the trip" — classic post hoc. Both could have a common upstream cause.
Calling the immediate cause the root cause. A bearing failed. Why was that bearing in service longer than its rated life? Why was no condition monitoring detecting the wear? The bearing is the immediate cause. The system gap is the root cause.
Quantifying a fault tree without data. Made-up failure rates produce numbers that look decision-grade but rest on non-decision-grade inputs. If you don't have OREDA, vendor data, or plant history, document the qualitative cut sets and stop there.
Letting the loudest voice drive the analysis. The whole point of a structured method is to slow down the team's first hypothesis. If the room has settled on the cause before the analysis starts, the method is decoration.
Findings without actions or owners. An RCA without specific, owned, dated corrective and preventive actions is documentation. It does not change anything in the maintenance program.
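The last item in the list above lends itself to a mechanical check before an RCA is closed. This is a minimal sketch; the `Action` fields and the `gaps` function are hypothetical names, and the owner and date in the usage lines are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def gaps(finding: str, actions: list[Action]) -> list[str]:
    """List what stops a finding from being actionable:
    no action at all, or actions without an owner or a due date."""
    if not actions:
        return [f"{finding}: no action"]
    problems = []
    for action in actions:
        if not action.owner:
            problems.append(f"{action.description}: no owner")
        if action.due is None:
            problems.append(f"{action.description}: no due date")
    return problems


# Hypothetical owner and date, for illustration only.
ok = [Action("Recalibrate test rig", owner="instrument lead", due=date(2025, 3, 1))]
print(gaps("Test rig overdue for calibration", ok))                      # []
print(gaps("Spring fatigue", [Action("Replace spring")]))
print(gaps("Procedure references superseded bulletin", []))
```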

Closing the loop: from RCA back to the program

The point of every RCA is to change something in the program so the same failure doesn't recur. The Bluestream Operate track makes this explicit. Every RCA result has two write-back buttons:

Add as failure mode to FMECA. The immediate cause becomes a new row in the FMECA for that equipment class. Future analyses inherit it. See the FMECA guide's RCA loopback section for the full mechanism.
Revise RCM strategy. If the systemic cause says "the PM was wrong" or "the strategy should have been on-condition, not calendar", you go straight to the RCM decision tree for that failure mode and re-derive the task. See the RCM guide's RCA loopback section.

An RCA that doesn't close the loop is half done. Pick the method that fits the failure, run it properly, then push the finding back into the FMECA and RCM that produced the original program. That is what makes the next instance of the failure rare instead of inevitable.

Upstream of RCA: the Bad Actor / Pareto tool in the Operate track decides which RCA to run first. Drop in a D365 Asset fault cost-control export, pick an axis (Asset / Model / Cause), and the tool ranks the top offenders by cost. One click on a row pre-populates the RCA tool with that asset's context and switches you to it. So the full loop is: failure history → Bad Actor ranking → RCA on the top offender → FMECA / RCM write-back.
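The ranking step can be sketched as a group-and-sort over fault cost records. This is an illustration of the Pareto idea only: the row layout, the field names, the costs, and the `rank_bad_actors` function are assumptions, and the sketch does not reproduce the D365 export format or the tool's implementation.

```python
from collections import defaultdict


def rank_bad_actors(rows, axis):
    """Sum cost per value of `axis` and return (value, cost, cumulative share)
    tuples, highest cost first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[axis]] += row["cost"]
    grand_total = sum(totals.values())
    ranked, running = [], 0.0
    for value, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        running += cost
        share = running / grand_total if grand_total else 0.0
        ranked.append((value, cost, share))
    return ranked


# Hypothetical records; costs are invented for the example.
rows = [
    {"asset": "PA-1003-A", "cause": "misalignment", "cost": 4000.0},
    {"asset": "PSV-201", "cause": "set pressure drift", "cost": 1500.0},
    {"asset": "PA-1003-A", "cause": "bearing", "cost": 2500.0},
    {"asset": "PSV-201", "cause": "set pressure drift", "cost": 2000.0},
]

for value, cost, share in rank_bad_actors(rows, "asset"):
    print(f"{value}: {cost:.0f} ({share:.0%} cumulative)")
```

Switching `axis` from `"asset"` to `"cause"` re-ranks the same records by failure cause, which mirrors the choice of axis described above.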

References

IEC 62740:2015, Root cause analysis (RCA). The methodology and reporting standard. Annex B covers chain-style methods (5-Whys family); Annex C covers categorical methods (Fishbone family).
IEC 61025:2006, Fault tree analysis (FTA). The international standard for tree construction, gate symbology, cut-set analysis, and quantification.
ISO 14224:2016, particularly Annex C.3 (failure mechanism and cause taxonomy). The Toolbox uses this taxonomy as a vocabulary prior in every RCA prompt so candidate causes use industry language.
SAE JA1011 / JA1012, the RCM standards. Define what makes a failure mode worth controlling and how to choose a maintenance task — the destination for RCA findings via the write-back loop.
CCPS, Guidelines for Investigating Chemical Process Incidents (2003, 2nd ed.). The chemical-process industry's deep reference on incident investigation, complementing IEC 62740 for major-accident events.
Andersen, B. & Fagerhaug, T. (2006), Root Cause Analysis: Simplified Tools and Techniques, ASQ Quality Press. The classic working-engineer reference on method selection and facilitation.