The RCM/RBM tool takes the failure modes you identified in FMECA (Step 2) and assigns each one a maintenance strategy — on-condition monitoring, scheduled restoration, scheduled discard, failure-finding, run-to-failure, or redesign — following the decision logic set out in SAE JA1011/JA1012 and referenced by NORSOK Z-008:2024.
You don't pick the strategy. You answer questions about each failure mode — is it evident or hidden, is a task technically feasible, is it worth doing — and the tree derives the strategy. If no task works and the failure is safety-critical, the output is compulsory redesign. That's the standard, and the tool enforces it.
What is Reliability-Centred Maintenance?
Reliability-Centred Maintenance is a structured way of answering a single question for every way a piece of equipment can fail: what is the most cost-effective thing we can do to manage this failure?
Before RCM, maintenance programs were built on two assumptions. First, that every piece of equipment has an identifiable "right" age at which it should be overhauled or replaced. Second, that more preventive maintenance always reduces failure risk. Both turned out to be wrong for most equipment. Studies in the airline industry during the 1960s — the origin of RCM — found that for the majority of failure modes, scheduled overhauls did nothing to improve reliability, and in some cases made it worse by introducing errors during reassembly.
RCM reframed the problem. Instead of asking "how often should we service this?", it asks:
- What can go wrong? (failure modes — handled in FMECA, Step 2)
- What happens if it does? (consequences)
- Is there a task that prevents or predicts the failure? (task feasibility)
- Is the task worth doing? (cost-benefit)
- If no task is feasible — what then? (run-to-failure or redesign)
The answers produce a maintenance program that is targeted, justified, and defensible. That's what this tool does.
RCM vs RBM: Reliability-Centred Maintenance focuses on task selection for each failure mode. Risk-Based Maintenance is a prioritisation framework that ranks assets or failure modes by risk (probability × consequence) to allocate inspection and maintenance effort. The Bluestream tool implements RCM task-selection logic; the consequence inputs from Step 1 provide the risk dimension for prioritisation. In practice, the two approaches are complementary and this tool supports both workflows.
The standards behind the tool
Four standards shape the RCM/RBM tool. Each contributes something specific; none covers everything.
| Standard | What it contributes |
|---|---|
| NORSOK Z-008:2024 | The Norwegian petroleum industry standard for risk-based maintenance and consequence classification. Defines the framework: consequence classes drive task-selection rigour, Generic Maintenance Concepts are the preferred starting point, and tasks must be technically feasible and cost-effective. Z-008 does not prescribe a specific decision tree — it points outward to RCM methodology. |
| SAE JA1011 & JA1012 | The authoritative RCM standards. JA1011 defines the seven questions a valid RCM process must answer. JA1012 is the implementation guide with the full decision logic — the actual tree this tool walks. If you want to claim "standards-compliant RCM", you cite JA1011. |
| IEC 60300-3-11 | The international (IEC) parallel to SAE JA1012. Covers RCM application in industrial contexts beyond aviation and oil & gas. Z-008:2024 §9.2.1 explicitly points here for task-selection logic. |
| ISO 14224:2016 | Not RCM itself, but the taxonomy. Provides the equipment-class definitions and failure-mode codes (ELP, FTS, BRD, etc.) that feed in from Step 2. |
Z-008 tells you to select tasks based on consequence. JA1011/JA1012 tell you how. The Bluestream tool implements the how with Z-008 consequence categories as the input.
Z-008 edition change: Z-008:2024 superseded Z-008:2017 on 20 December 2024. The clause numbers shifted (consequence classification moved from §7 to §8; maintenance programme from §8 to §9). The tool supports both editions — if you're maintaining a legacy program under 2017, the references in the output will match. For new work, use 2024. See glossary for the full changelog.
Core concept: evident vs hidden failure
Every failure mode falls into one of two categories. This is the first question the tool asks, and it determines which branch of the decision tree you walk.
Evident failure
An evident failure is one the operating crew will become aware of under normal conditions, without needing to perform a test. The failure is self-announcing.
Examples:
- A process pump stops delivering flow — the downstream pressure gauge drops and the control system alarms.
- A motor bearing seizes — noise, vibration, temperature alarm, possibly smoke.
- A seal starts leaking — visible puddle, process containment alarm.
- A compressor trips on high vibration — the machine stops, the control room knows instantly.
For evident failures, the tool evaluates three task types in strict order: on-condition monitoring (CBM) first, scheduled restoration (SR) second, scheduled discard (SD) third.
Hidden failure
A hidden failure is one that will not become apparent to the crew during normal operation. You only find out it has occurred when the function is demanded — and by then, it's too late.
Hidden failures almost always involve protective functions: equipment that sits idle until something goes wrong elsewhere, at which point it's supposed to spring into action. If it has failed in the meantime, no one knows.
Examples:
- Pressure relief valve stuck closed. The process runs normally for months. It only matters if an overpressure event occurs — and then the valve doesn't open and the vessel ruptures.
- Fire and gas detector with failed sensor. Normal plant operation doesn't test it. Only a real fire reveals that detection is dead.
- Emergency shutdown (ESD) valve with a seized actuator. Works fine as long as nobody pushes the button.
- Standby generator that won't start. Grid power is up, so no demand. The generator sits there failed until a blackout demands it.
- Smoke alarm with dead battery. Same principle, household scale.
The two-failure risk
Hidden failures don't cause accidents by themselves. They cause accidents in combination with the protected failure — the thing they were supposed to catch. The risk calculation is therefore about the probability of both failures coinciding: the protective device failed and the demand for protection arose. This is why hidden failures get their own dedicated task type (failure-finding) and their own interval calculation — the test interval is set to keep the combined probability below a tolerable level.
For hidden failures, the tool evaluates only one task type: failure-finding (FF) — a scheduled functional test to reveal whether the hidden function is still working. CBM, SR, and SD are skipped because they aim to predict or prevent an impending failure, while the hidden branch's job is to reveal a failure that has already silently occurred.
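The interval logic behind failure-finding can be sketched numerically. A widely used approximation (Moubray's failure-finding formula, equivalent to the IEC 61511 low-demand form) puts the mean unavailability of a periodically tested device with constant failure rate at roughly T / (2 × MTBF); solving for T gives the test interval for a target unavailability. The function below is an illustrative sketch under that constant-failure-rate assumption — it is not the tool's actual calculation, and the PSV numbers are made up for the example.

```python
def failure_finding_interval(mtbf_years: float, target_unavailability: float) -> float:
    """Classic availability-based FF interval sketch.

    For a periodically tested protective device with constant failure
    rate 1/MTBF, mean unavailability is approximately T / (2 * MTBF),
    so the test interval for a target unavailability U is T = 2 * U * MTBF.
    The approximation is only valid when T is much smaller than the MTBF.
    """
    interval = 2.0 * target_unavailability * mtbf_years
    if interval >= mtbf_years:
        raise ValueError("approximation invalid: interval not << MTBF")
    return interval

# Illustrative PSV: 100-year MTBF, 1% allowed mean unavailability
print(failure_finding_interval(100, 0.01))  # ~2 years between tests
```

The same arithmetic explains the two-failure risk in reverse: with a demand rate d on the protected function, the hazard frequency is roughly d × T / (2 × MTBF), which is why halving the test interval halves the combined risk.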
Rule of thumb: if you'd need to specifically test or inspect the item to know it had failed, it's hidden. If the failure announces itself through process effects, alarms, or observable symptoms, it's evident.
Core concept: consequence categories
After evident/hidden, the tree needs to know the consequence category of the failure. This drives how strict the task-selection logic is. A failure that can kill people gets evaluated very differently from a failure that just costs money.
RCM-II defines three consequence categories for evident failures, plus a fourth branch for hidden:
| Category | Meaning | Examples |
|---|---|---|
| Safety / Environment (S/E) | Failure directly harms people or the environment | Loss of containment on a toxic-gas line; failure of a fire pump; structural failure |
| Operational | Failure affects production output, product quality, or operational cost | A feed pump on a process train with no installed spare; a critical valve on the main product line |
| Non-operational | Failure affects only direct repair cost | A redundant pump where the spare can carry full duty; a utility that has no operational impact |
| Hidden | Protective function that only matters when demanded | PSV, ESD valve, F&G detector, standby generator |
How the category is determined
The Bluestream tool doesn't re-ask you for consequence information in Step 3 — you already provided it in Step 1 (Criticality Classification). The RCM category is inherited automatically from your Step 1 output, and an override link on each failure mode lets you change it where the inherited category doesn't fit.
The override link is important. Inheritance is a sensible default, but it's not always right. A pump bearing might sit under a C3 HSE criticality because of fluid properties — but the specific bearing failure mode is not a safety issue, it's a production one. Override the category for that failure mode, and the tool re-derives the strategy. The justification string captures that the category was overridden, so an auditor reading the output sees the reasoning.
Why inheritance (not re-asking): NORSOK Z-008:2024 §9.2.1 explicitly points to the consequence classification as the input to task selection. Re-asking in Step 3 creates the risk of inconsistency between the two analyses. Inherit by default, override only where the analyst has a specific reason.
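The inherit-by-default, override-with-justification rule can be sketched as follows. The field names, the precedence order, and the audit-note wording are all illustrative assumptions for this sketch — they are not the tool's real data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Step1Classification:
    """Hypothetical shape of the Step 1 output used by this sketch."""
    barrier_element: bool        # realises a safety barrier function
    hse_critical: bool           # high HSE consequence class
    production_critical: bool    # high production consequence class

def inherit_category(step1: Step1Classification,
                     override: Optional[str] = None,
                     justification: str = "") -> Tuple[str, str]:
    """Return (category, audit_note).

    Inherits the RCM category from Step 1 by default; an explicit
    override must carry a justification so the audit trail records
    why the analyst departed from the inherited value.
    """
    if override is not None:
        if not justification:
            raise ValueError("an override must carry a justification")
        return override, f"Category overridden by analyst: {justification}"
    if step1.barrier_element:
        return "Hidden", "Inherited: barrier element -> hidden/protective branch"
    if step1.hse_critical:
        return "Safety/Environment", "Inherited from Step 1 HSE criticality"
    if step1.production_critical:
        return "Operational", "Inherited from Step 1 production criticality"
    return "Non-operational", "Inherited: no HSE or production criticality"

# The pump-bearing case from above: asset is HSE-critical, but this
# specific failure mode is a production issue, so the analyst overrides.
pump = Step1Classification(barrier_element=False,
                           hse_critical=True, production_critical=True)
print(inherit_category(pump))  # inherited: Safety/Environment
print(inherit_category(pump, override="Operational",
                       justification="bearing FM cannot release fluid"))
```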
Core concept: the P-F interval
The P-F interval is the single most important concept in condition-based maintenance. It's the reason on-condition tasks work at all — and when the interval doesn't exist or isn't usable, CBM is off the table and the tool moves on to time-based tasks.
Definition
The P-F interval is the time between the point at which a failure first becomes detectable (P — Potential failure) and the point at which it progresses to actual functional failure (F — Functional failure).
Example: a pump bearing
A feed pump runs continuously. For 40 weeks the vibration signature sits flat — that's normal running wear. In week 41, vibration starts climbing. That's P — a detectable indicator of impending failure. The bearing doesn't fail yet; it just starts warning. Vibration keeps rising over the next 8 weeks. In week 49, vibration reaches a level at which the bearing is about to seize. Week 50, it seizes — that's F. The pump stops.
The P-F interval is 9 weeks. If the maintenance team checks vibration every 4 weeks, they'll catch the failure in progress and have time to plan an intervention. If they check every 12 weeks, they'll miss the window entirely — the bearing will go from fine-at-last-check to seized-and-stopped with no warning.
The three requirements for a valid CBM task
For an on-condition task to be feasible, all three of these must be true:
- A clear P point exists. There must be a measurable indicator — vibration, temperature, oil debris, acoustic emission, flow, pressure trend — that changes before functional failure. Random failures with no warning signature are not candidates for CBM.
- The P-F interval is long enough to act on. You need time to detect the indicator, diagnose the problem, plan the intervention, mobilise resources, get parts, obtain permits, and execute the repair before F. If the P-F interval is 3 days and your procurement cycle is 6 weeks, the task is not worth doing — you can't act on the information in time.
- The P-F interval is reasonably consistent. If the interval varies wildly — 2 days for some failures, 2 years for others — you cannot set a reliable inspection frequency. You either inspect too often (wasteful) or too rarely (miss the short-interval failures).
The tool's CBM feasibility question is really asking all three of these at once. If any one fails, CBM is not feasible and the tree moves to scheduled restoration.
Setting the monitoring interval
The standard rule: set the monitoring interval to half the P-F interval, or less. This guarantees the failure cannot progress from P to F between two consecutive inspections without being caught. The underlying principle is the same as the Nyquist sampling rule in signal processing — you have to sample at least twice as often as the event you're trying to observe.
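The half-interval rule and the three feasibility requirements can be combined into one sketch: compute the inspection interval as half the P-F interval, then check that the time left between worst-case detection and functional failure still covers the reaction time. The function and numbers below are illustrative, not the tool's implementation.

```python
from typing import Optional

def cbm_inspection_interval(pf_interval_weeks: float,
                            reaction_time_weeks: float) -> Optional[float]:
    """Half-P-F rule sketch.

    Inspecting at PF/2 guarantees at least one inspection falls inside
    the P-to-F window.  Worst case, the failure is detected one full
    inspection interval after P, leaving PF/2 to act.  If that remaining
    time cannot cover the reaction time (diagnose, plan, mobilise parts
    and permits, repair), CBM is not feasible and we return None.
    """
    interval = pf_interval_weeks / 2.0
    if pf_interval_weeks - interval < reaction_time_weeks:
        return None  # cannot act between detection and failure
    return interval

# Pump bearing from the example: 9-week P-F, 2 weeks needed to react
print(cbm_inspection_interval(9, 2))    # 4.5 -> checking every 4 weeks works
# 3-day P-F interval against a 6-week procurement cycle
print(cbm_inspection_interval(3 / 7, 6))  # None -> CBM not worth doing
```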
Typical P-F intervals by monitoring technique
| Technique | Typical P-F interval | Typical inspection interval |
|---|---|---|
| Vibration analysis (rolling bearings) | 1 to 9 months | Monthly to quarterly |
| Oil analysis (gearboxes) | 1 to 6 months | Monthly to bi-monthly |
| Thermography (electrical connections) | Weeks to months | Quarterly to annually |
| Acoustic emission (structural cracks) | Days to weeks | Continuous monitoring preferred |
| Process parameter trending (flow, ΔP) | Hours to weeks | Continuous, automated |
| Ultrasonic thickness (corrosion) | Months to years | Annually to once per 5 years |
Common mistake: assuming a CBM task is feasible just because the monitoring technology exists. A vibration sensor can be fitted to anything, but if the P-F interval is shorter than your reaction time, the task is useless. The question isn't "can we monitor?" — it's "can we monitor, detect, and act before the failure occurs?"
Core concept: the six task types
RCM recognises four proactive task types and two default actions; the tool adds a REVIEW flag for unresolved Operational cases. The proactive tasks are evaluated in a strict order — if the first doesn't work, the tree tries the next.
CBM On-condition maintenance
Monitor an indicator that changes before failure. Act when the indicator crosses a threshold. Requires a valid P-F interval (see above). Examples: vibration trending, oil analysis, thermography, process parameter monitoring, acoustic emission.
When it's the right answer: there's a measurable warning sign, the P-F interval is actionable, and monitoring is cheaper than letting the failure happen.
SR Scheduled restoration
At a fixed interval, restore the item to a known-good condition — typically by overhauling, refurbishing, or recoating it. The item continues in service after restoration. Requires age-related wear-out (failure rate rises sharply at an identifiable age) AND a restoration action that actually returns the item to its original resistance to failure.
When it's the right answer: you can't detect a clear P-F signal, but the item wears out predictably with age and can be refurbished cost-effectively. Example: a gearbox overhauled every 10 years; a pump rebuild cycle.
When it isn't: for items that don't show age-related wear-out — most electronics, many hydraulic components. Running clocks on them does nothing. For items where "restoration" doesn't actually restore — you can't meaningfully overhaul a seized bearing back to factory specification.
SD Scheduled discard
At a fixed interval, discard the item and install a new one. No restoration is attempted. Requires age-related wear-out AND a cost-effective replacement strategy. Applies primarily to items that cannot be meaningfully restored.
When it's the right answer: items with a known wear-out life that are cheaper to replace than to overhaul. Examples: filter cartridges, o-rings and seals as part of a major service, batteries with a documented end-of-life, lamps in safety-critical lighting.
FF Failure-finding
Periodically test whether a hidden function still works. Does not prevent failure — it finds failures that have already occurred so they can be corrected before a demand arises. Only applies to hidden failures.
When it's the right answer: the function is hidden (protective devices, standby equipment) and a test is feasible without excessive disturbance. Examples: PSV function test, ESD valve stroke test, fire pump weekly run-test, fire and gas detector response test.
The FF interval is calculated to keep the combined probability of hidden failure and demand for the protected function below a tolerable level. For safety-critical barriers this interval is typically prescribed by IEC 61511 (SIL) or operator-specific risk acceptance criteria, not chosen by the analyst.
RTF Run to failure
Don't do any proactive task. Allow the failure to occur, then respond with corrective maintenance. This is a deliberate choice, not a default — it's the right answer when consequences are low and no task can prevent or predict the failure cost-effectively.
When it's the right answer: non-operational consequences, no feasible proactive task, corrective repair is cheap and quick. Examples: a redundant pump where the spare can carry duty; a non-critical instrument; a lamp that can be changed in five minutes.
RED Redesign
If no proactive task is feasible AND the consequences are Safety/Environment, redesign is compulsory under RCM-II. The equipment itself has to change — a different design, a different material, an added safety layer, or a different operating envelope. Redesign is not "something the analyst might want to consider" — the standard mandates it when the alternative is accepting unacceptable risk.
In the Bluestream tool, RED is derived (never selected by the analyst). If you walk the tree and end up with no feasible task, and the consequence category is Safety/Environment, the output is RED. The analyst's job is then to flag the item to the design team, not to continue the maintenance analysis as if a solution existed.
REVIEW Cost-benefit review required
If no proactive task is feasible AND the consequences are Operational (not S/E), RCM-II calls for a cost-benefit comparison between redesigning and running to failure. Neither is automatically right — it depends on the specific economics of the asset.
The Bluestream tool flags these cases as REVIEW rather than auto-defaulting to RTF. The analyst records the cost-benefit decision in the concept rationale. This keeps the methodology honest — an unresolved Operational case doesn't silently disappear into RTF.
Core concept: feasible AND worth doing
For each proactive task type, the tool asks two questions, not one:
- Is the task technically feasible? Can you actually do it? (Does the P-F interval exist? Does the item show age-related wear-out?)
- Is the task worth doing? Does it make economic or safety sense? (Is the monitoring cost less than the failure cost? Does restoration actually reduce risk to a tolerable level?)
Both must be yes for the task to be selected. This is explicit in SAE JA1012 and it's the most commonly skipped step in informal RCM analyses — people confirm feasibility and move on, without asking whether the task is justified.
Example of the distinction
Feasible but not worth doing: You can fit vibration monitoring to every small pump in a utility system. Technically feasible — the P point exists, the interval is reasonable, the indicator is measurable. But the failure consequence is trivial (run the spare, repair at leisure) and the cost of monitoring, data analysis, and alarm response is substantial. Not worth doing. Strategy: RTF, not CBM.
Worth doing but not feasible: A fire detector in a hazardous area would be very worth monitoring continuously — but the detector design doesn't expose a monitorable signal between tests. Not technically feasible. Strategy: FF (scheduled functional test), not CBM.
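The "worth doing" half of the two-question drill can be sketched as an expected-cost comparison: the annual cost of the monitoring programme against the annual failure cost it avoids. The function and all figures below are illustrative assumptions, not the tool's costing model.

```python
def cbm_worth_doing(annual_monitoring_cost: float,
                    failures_per_year_prevented: float,
                    cost_per_failure: float) -> bool:
    """Sketch of the 'worth doing' test for a CBM task.

    Compares the yearly cost of the monitoring stack (sensors,
    data analysis, alarm response) with the expected yearly failure
    cost it avoids.  Feasibility is a separate, prior question.
    """
    expected_avoided_cost = failures_per_year_prevented * cost_per_failure
    return annual_monitoring_cost < expected_avoided_cost

# Small utility pump: monitoring stack ~8k/yr, avoids ~0.2 failures/yr
# at ~5k each.  Feasible, but not worth doing -> strategy falls to RTF.
print(cbm_worth_doing(8000, 0.2, 5000))    # False
# Feed pump: 2k/yr of trending against 0.1 avoided leaks/yr at 500k each.
print(cbm_worth_doing(2000, 0.1, 500000))  # True
```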
The full decision tree
This is the tree the tool walks for every failure mode. Read left to right. Green outcomes are proactive tasks selected by the tree; orange/red outcomes are derived defaults when no task is feasible.
How the tool walks you through it
For each failure mode from your FMECA, the tool asks at most nine questions — usually fewer, because the tree terminates as soon as a task is selected or a default is derived.
Q1. Evident or hidden?
The first question, and the one that determines which branch of the tree you walk. The sub-text in the tool gives you the crisp test: under normal conditions, will the operating crew become aware that this failure has occurred? Self-announcing failures are evident. Failures that only reveal themselves on demand or during test are hidden.
Q2/Q3. CBM evaluation
Asked for evident failures. Q2: is a CBM task technically feasible? This is really asking about the P-F interval — does a measurable warning exist, is it long enough to act on, is it consistent? If Yes, the tool asks Q3: is the CBM task worth doing? Compare monitoring cost (sensor + analysis + alarm response) to failure cost. If both Yes, strategy is CBM and the tree stops.
Q4/Q5. Scheduled restoration evaluation
Asked only if CBM failed (either feasibility or worth). Q4: does the item show age-related wear-out AND can restoration return it to its original resistance to failure? If Yes, Q5: is restoration cost-justified against failure cost? If both Yes, strategy is SR.
Q6/Q7. Scheduled discard evaluation
Asked only if SR failed. Q6: does discarding at a fixed age reduce failure risk? Q7: is periodic replacement cost-justified? If both Yes, strategy is SD.
Q8/Q9. Failure-finding evaluation (hidden only)
Asked for hidden failures, replacing the CBM/SR/SD evaluations. Q8: is a functional test feasible without excessive disturbance and is the test reliable? Q9: does the test interval keep the combined probability of hidden failure plus demand below the tolerable risk threshold? If both Yes, strategy is FF.
No task feasible — derivation
If the tree exits without selecting a task (all evaluations failed), the tool derives the strategy from the consequence category:
- Safety/Environment → RED redesign is compulsory. Flag to design team; the maintenance program cannot solve this.
- Operational → REVIEW cost-benefit comparison of redesign vs run-to-failure is required. Analyst records the decision in the concept rationale.
- Non-operational → RTF run-to-failure is the standards-compliant default.
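The nine-question walk and the consequence-based derivation above can be summarised in a few dozen lines. This is an illustrative sketch of the logic described in this section — the field names and outcome strings are hypothetical, not the Bluestream tool's API.

```python
from dataclasses import dataclass

@dataclass
class Answers:
    """One analyst's answers for a single failure mode (sketch only)."""
    evident: bool                 # Q1: evident or hidden?
    category: str                 # "Safety/Environment" | "Operational" | "Non-operational"
    cbm_feasible: bool = False    # Q2: valid, actionable, consistent P-F interval?
    cbm_worth: bool = False       # Q3: monitoring cost < failure cost?
    sr_feasible: bool = False     # Q4: age wear-out AND restorable?
    sr_worth: bool = False        # Q5: restoration cost-justified?
    sd_feasible: bool = False     # Q6: fixed-age discard reduces risk?
    sd_worth: bool = False        # Q7: replacement cost-justified?
    ff_feasible: bool = False     # Q8: reliable test, tolerable disturbance?
    ff_worth: bool = False        # Q9: interval keeps multi-failure risk tolerable?

def walk_tree(a: Answers) -> str:
    if not a.evident:                      # Q1: hidden branch evaluates FF only
        if a.ff_feasible and a.ff_worth:
            return "FF"
    else:                                  # evident branch: CBM -> SR -> SD, in order
        for feasible, worth, strategy in [
            (a.cbm_feasible, a.cbm_worth, "CBM"),
            (a.sr_feasible, a.sr_worth, "SR"),
            (a.sd_feasible, a.sd_worth, "SD"),
        ]:
            if feasible and worth:         # JA1012's two-gate test
                return strategy
    # No task selected: derive the default from the consequence category
    return {"Safety/Environment": "RED",   # redesign compulsory
            "Operational": "REVIEW",       # cost-benefit: redesign vs RTF
            }.get(a.category, "RTF")       # non-operational default

# Example 1: evident S/E pump seal, CBM feasible and worth doing
print(walk_tree(Answers(evident=True, category="Safety/Environment",
                        cbm_feasible=True, cbm_worth=True)))             # CBM
# Example 3: hidden S/E detector with no feasible test -> derived, not chosen
print(walk_tree(Answers(evident=False, category="Safety/Environment")))  # RED
```

Note that RED and REVIEW appear only in the fall-through at the bottom — mirroring the rule that the analyst answers feasibility questions and the defaults are derived, never selected.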
Video walkthrough
Full screen-recording of the RCM/RBM tool in use — from FMECA hand-off through to summary output, with commentary on each decision point.
Three worked examples
The same tool, three very different failure modes. Each example shows the path through the tree and the resulting strategy with justification.
Example 1 — Feed pump, seal face wear
Asset: Centrifugal feed pump, single-stage, end-suction, continuous duty. Pumping a clean hydrocarbon at 15 bar, 80 °C. Redundant spare available but changeover takes 4 hours.
Failure mode (from FMECA): ELP — external leakage, process medium, from mechanical seal face wear.
Step 1 criticality: HSE C2 (hydrocarbon above flashpoint, moderate toxicity), Production C3 (4-hour changeover impacts throughput), barrier element = No. Inherited RCM category: Safety/Environment.
Q2 CBM feasible? Yes — vibration monitoring + process pressure trending give 2–4 weeks P-F interval.
Q3 CBM worth doing? Yes — online vibration sensor is already installed for the motor; marginal cost of trending is negligible against cost of an unplanned leak.
→ Strategy: CBM.
Justification: Category: Safety/Environment (inherited from Criticality). Evident failure branch. On-condition task selected: P-F interval measurable AND monitoring cost-justified. Ref: Z-008:2024 §9.2.1; SAE JA1012 §3.4.
Example 2 — Pressure Relief Valve, fails to open
Asset: Spring-operated PSV on a gas processing vessel. Set pressure 25 bar. Last-resort overpressure protection.
Failure mode (from FMECA): FTO — failure to open on demand. Can be caused by corrosion, fouling, spring set, or seat adhesion.
Step 1 criticality: Barrier element = Yes (realises the overpressure protection function per ISO 17776). Inherited RCM category: Safety/Environment.
Q8 FF feasible? Yes — PSV pop-test at bench or online test rig is a standard procedure.
Q9 FF worth doing? Yes — interval set per operator SIL/barrier requirements (typically 2–5 years). Cost is small relative to vessel rupture consequence.
→ Strategy: FF.
Justification: Category: Safety/Environment (inherited). Hidden failure branch. Failure-finding task selected: functional test feasible AND test interval keeps multi-failure probability tolerable. Ref: Z-008:2024 §9.3; SAE JA1012 §3.7.
Example 3 — Obsolete fire detector, no test access
Asset: Legacy point-type fire detector installed in a space that has been reconfigured since commissioning. Detector is now behind permanent cladding — accessing it for a functional test requires a 4-hour scaffolding + permit-to-work cycle.
Failure mode (from FMECA): FTF — failure to function on demand.
Step 1 criticality: Barrier element = Yes. Inherited RCM category: Safety/Environment.
Q8 FF feasible? No — access cost is disproportionate, scheduled testing is effectively impractical at any meaningful frequency.
→ No feasible task. Category = S/E. Strategy derived: RED.
Justification: Category: Safety/Environment (inherited). Hidden failure branch. No feasible FF task identified. Consequences are unacceptable (Safety/Environment). Redesign is compulsory per SAE JA1012 §3.2 — this must be referred to design authority to either relocate the detector, replace with a testable model, or add a second detector in an accessible location. Ref: Z-008:2024 §9.3; SAE JA1012 §3.2.
Note: This is a deliberately realistic example. Facilities accumulate inaccessible safety-critical hardware over their lifetime through modifications, debottlenecking, and insulation changes. RCM flags the problem honestly — continuing to "schedule" a test that nobody can practically execute is worse than no task at all, because it creates false paper compliance.
Common pitfalls
Treating CBM as a default
It's the first task the tree evaluates, and the trendiest in the industry. Analysts who are in a hurry assume CBM is the "modern" answer and stop interrogating. But CBM is only valid when a P-F interval exists, is actionable, and is consistent. Random failures, infant-mortality failures, and failures with no pre-failure signal are not CBM candidates — and that's a large fraction of real-world failure modes. Don't select CBM because it sounds sophisticated; select it because the P-F interval supports it.
Accepting inherited category without thinking
The tool inherits consequence category from Step 1 — that's the sensible default. But Step 1 classified the asset at the equipment level, not the failure-mode level. A pump might be classified S/E because of fluid hazards, but a specific bearing failure mode on that pump might have only operational consequence (the bearing can't release the hazardous fluid). The override link exists precisely for this. Use it when the inherited category doesn't fit the specific failure mode.
Skipping the worth-doing test
Feasible and worth-doing are different questions. Plenty of tasks are feasible but not cost-justified — especially CBM on low-consequence equipment where the sensor + analytics + response stack costs more than the occasional failure. Answering Q2 and moving on, without interrogating Q3, produces an over-engineered program. The two-question drill exists to catch this.
Missing hidden failures
Protective devices, standby equipment, interlocks, and failsafes are easy to forget about precisely because they work silently. If Step 2 (FMECA) didn't capture them, Step 3 can't assess them. Review the FMECA specifically for: pressure relief valves, ESD valves, F&G detectors, fire pumps, standby generators, battery-backed instruments, interlock circuits. Every protective function is a candidate hidden failure.
Trying to "choose" redesign
This tool does not let the analyst select RED as an answer to a question. That's deliberate. SAE JA1011 is explicit: redesign is the consequence of no feasible task being available against unacceptable consequences, not a substitute for task analysis. If you think you need redesign, walk the tree honestly — answer the feasibility questions truthfully. If RED falls out of the tree, it's compulsory and you escalate it. If it doesn't, the task it did select is the right answer, whether you like it or not.
Paper compliance for untestable barriers
As Example 3 shows, there's a strong institutional temptation to schedule a test-every-N-months task on a safety barrier even when nobody can practically execute it. This creates the illusion of compliance while leaving the actual risk uncontrolled. RCM-II and NORSOK Z-008 both require that tasks be actually executable. If FF isn't feasible, escalate to redesign. Don't schedule fiction.
References
- NORSOK Z-008:2024 — Risk based maintenance and consequence classification. Published by Standards Norway, December 2024. standard.no
- NORSOK Z-008:2017 — Predecessor edition, superseded December 2024. Still referenced for legacy programs maintained under 2017 framework.
- SAE JA1011 — Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes. SAE International.
- SAE JA1012 — A Guide to the Reliability-Centered Maintenance (RCM) Standard. Implementation guidance for JA1011.
- IEC 60300-3-11 — Dependability management — Part 3-11: Application guide — Reliability centred maintenance. International Electrotechnical Commission.
- ISO 14224:2016 — Petroleum, petrochemical and natural gas industries — Collection and exchange of reliability and maintenance data for equipment. Annex A taxonomy and Annex B failure mode codes used by the Bluestream FMECA tool (Step 2).
- Nowlan, F. Stanley and Heap, Howard F. — "Reliability-Centered Maintenance." United Airlines / US Department of Defense, 1978. The origin of RCM as a methodology; still the clearest exposition of the task-selection logic.
- Moubray, John — "RCM II: Reliability-Centred Maintenance." Second edition, Industrial Press, 1997. The standard industrial-engineering reference for RCM in practice.