We have all read about and acknowledged the merits of
root-cause failure-analysis methods. But not everyone agrees on
when and how these methods should be used. In times past, the
author worked for a while within an organization so enamored
with failure analyses that formal reports had to be
written by a machinery engineer for each and every plant
machinery failure. This seemed unusual and even a bit
Yet, the issue begs an important question: When is formal
failure analysis justified? Before answering and explaining, we
need to briefly agree on the customary definition of a
root-cause failure analysis. A root-cause failure analysis
(RCFA) is a formal analysis of an equipment or system failure
with the intent of uncovering all latent causal
and contributing factors, so that similar future
failures may be prevented.
Latent causal factors are hidden causes lying
just below the surface of the wreckage. Whenever
a repeat failure occurs, we must realize that the hidden causes
are still at play and that future failures are likely. This
makes it imperative that, during the failure-analysis
process, the analyst (or analysts) look beyond the
physical evidence and uncover the true latent causes of the
failure. Here are a few quick examples of latent causal factors
in process pumps:
Parallel operation of pumps drives a pump off its
allowable flow range, which then causes internal recirculation
and high bearing loads
Low-flow pump operation due to process changes
results in frequent seal failures
Oil supply is inconsistent because of inherent
issues with inexpensive oil delivery choices (oil ring
Design weaknesses in certain bearing protector
seals can result in oil contamination
Oil contamination is caused by poor lubricant
Oil consolidation efforts often lead to unwise oil
replacement choices and ultimately cause bearing
Bearings selected in accordance with the
generalized recommendations of prominent industry standards are
not always the best choice.
Root-cause failure analysis: A case history.
A petrochemical plant and oil refinery was experiencing recurring
failures of a cylindrical roller bearing in a critical
centrifugal pump (Fig. 1). Transition from belt-drive to
direct-drive configuration of the pump led to unexplained
bearing failures and costly downtime.
Fig. 1. Recurring failure of cylindrical roller
bearing in critical centrifugal pump.
Photo credit: NSK Bearings Europe.
The bearing manufacturer investigated the application
conditions and the failed cylindrical roller bearings. Changing
from a belt-drive to a direct-drive arrangement had drastically
reduced the radial load on the existing bearing. In turn, this
had led to rollers slipping on the inner raceway. The solution
was simple: the original bearing was readily replaced by a
deep-groove ball bearing.
In essence, the bearing manufacturer determined that a light
load on the cylindrical roller bearing was the cause of bearing
distress. The latent root cause was the transition from
a belt-drive system to a direct-drive arrangement. The solution
to this issue was to select the proper bearing for the new load
Time is money.
The RCFA method is one of the most powerful tools of a reliability program, but it is also
one of the most time-consuming and, thus, costly endeavors.
Since time is money, RCFAs must be administered
Like all decisions made in a competitive environment, the decision of when to
conduct an RCFA must be based on economics. Since not all
failures are created equal, not all failures warrant the same
level of analysis. The decision to conduct an RCFA should be
based on the answers to two questions:
1. What is the consequence of the failure? Obviously, as
the failure consequence rises, the need for analysis becomes
2. What is the failure frequency? The higher the failure
rate, the more attention the piece of equipment should get.
If we denote the consequence as C and the rate of
failure, F, we can define the risk, R, of
failure as the product of consequence and the failure rate:
Risk (R) = C × F
So, if a pump that costs $20,000 to repair fails four times
a year, this represents an annualized risk of $80,000/yr, i.e.,
$20,000 × 4. Next, we will construct a risk matrix based on a
broad range of consequences and failure frequencies (see Table
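The risk definition above can be written out in a few lines of Python. This is only a minimal sketch: the function name `annualized_risk` and its parameter names are illustrative and do not come from the article.

```python
def annualized_risk(consequence_usd: float, failures_per_year: float) -> float:
    """Return the annualized risk R = C x F in dollars per year."""
    if consequence_usd < 0 or failures_per_year < 0:
        raise ValueError("consequence and failure rate must be non-negative")
    return consequence_usd * failures_per_year


# Worked example from the text: a $20,000 repair occurring four times a year.
pump_risk = annualized_risk(20_000, 4)
print(f"Annualized risk: ${pump_risk:,.0f}/yr")
```

The same function can be applied to every combination of consequence and failure frequency to fill in the cells of a risk matrix.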
Notice that risk levels (i.e., consequence times frequency)
exceeding $1 million/yr are highlighted in a rose color.
Risk levels of $100,000/yr are highlighted in yellow, and risk
levels of $10,000/yr are highlighted in green. Risk levels of
$1,000 and below are shown in gray. They are assumed to warrant
little or no concern. In general, we can say that high-risk
issues are in the upper right-hand corner of risk maps and
low-risk issues are in the lower left-hand corner of risk maps
One can consider these annualized risk levels to be
potential revenue that can be recovered if the root cause of an
issue falling in one of these cells is found, and the required
remedial steps are implemented. With that being said, we should
view all RCFA pursuits as potential projects and, as with any project, we must assess their
Suppose we have an event that is occurring 10 times a year
with a consequence of $10,000. This represents an annualized
risk of $100,000. If the total cost of remedial action and RCFA
is $20,000, the project will have a payback of about
73 days. On the other hand, if there is a $100,000 consequence
that occurs once every 100 years, this only represents a risk
of $1,000/yr. It would not make much sense to assign a
multi-disciplinary analysis team to investigate this
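The payback arithmetic in the two cases above can be checked with a short Python sketch. It assumes that the remedial action eliminates the full annualized risk, and the helper name `payback_days` is hypothetical. The article gives no project cost for the rare event, so the $20,000 figure is reused there purely for comparison.

```python
def payback_days(project_cost_usd: float, annual_risk_usd: float) -> float:
    """Simple payback in days, assuming the fix removes the whole annualized risk."""
    if annual_risk_usd <= 0:
        raise ValueError("annualized risk must be positive")
    return 365 * project_cost_usd / annual_risk_usd


# Case 1 (from the text): $10,000 consequence, 10 events per year.
frequent_risk = 10_000 * 10          # $100,000/yr
frequent_payback = payback_days(20_000, frequent_risk)
print(f"Frequent event: ${frequent_risk:,.0f}/yr, payback {frequent_payback:.0f} days")

# Case 2 (from the text): $100,000 consequence, once every 100 years.
rare_risk = 100_000 / 100            # $1,000/yr
# Assumption: the same $20,000 project cost, used only for comparison.
rare_payback = payback_days(20_000, rare_risk)
print(f"Rare event: ${rare_risk:,.0f}/yr, payback {rare_payback / 365:.0f} years")
```

Under that assumed cost, the rare event would take about 20 years to pay back, which illustrates why it does not justify a team investigation.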
The take-away here is that annualized risk is a better means
of determining the level of analysis justified than the
consequence level alone. However, it is also possible to set a
maximum allowable loss trigger for a failure investigation. For
example, a site may choose to investigate all events resulting
in losses of $1 million or more, regardless of their frequency.
Together, the risk map and the maximum allowable loss methods
can provide clear RCFA guidance. Most production companies take
this basic risk-based approach and develop their own risk
matrix, as shown in Table 2.
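One way to combine the two methods is sketched below in Python. The $1 million loss trigger is the example given in the text; the annualized-risk threshold is a site-specific choice that the article does not fix, so the $10,000/yr value used in the examples is purely an assumption, as is the function name `warrants_rcfa`.

```python
MAX_ALLOWABLE_LOSS_USD = 1_000_000  # example trigger quoted in the text


def warrants_rcfa(consequence_usd: float,
                  failures_per_year: float,
                  risk_threshold_usd_per_year: float,
                  max_allowable_loss_usd: float = MAX_ALLOWABLE_LOSS_USD) -> bool:
    """Return True if either the loss trigger or the risk threshold is met."""
    annual_risk = consequence_usd * failures_per_year
    exceeds_loss_trigger = consequence_usd >= max_allowable_loss_usd
    exceeds_risk_threshold = annual_risk >= risk_threshold_usd_per_year
    return exceeds_loss_trigger or exceeds_risk_threshold


ASSUMED_RISK_THRESHOLD = 10_000  # hypothetical site threshold, $/yr

print(warrants_rcfa(10_000, 10, ASSUMED_RISK_THRESHOLD))       # frequent, moderate loss
print(warrants_rcfa(100_000, 0.01, ASSUMED_RISK_THRESHOLD))    # rare, below both limits
print(warrants_rcfa(2_000_000, 0.01, ASSUMED_RISK_THRESHOLD))  # rare, but above loss trigger
```

The third case shows the purpose of the loss trigger: its annualized risk is modest, yet the size of the single loss still calls for an investigation.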
Each organization must define what high-, medium- and
low-consequence events are, along with low-, medium- and
high-failure frequencies. For example, a company might choose
to define high-, medium- and low-consequence events as those
resulting in losses of more than $250,000, $100,000 to
$249,999, and $25,000 to $99,999, respectively. High-, medium- and
low-failure frequencies could be defined as two or more
failures/yr, more than one failure every two years, or less
than one failure every two years, respectively. So, a $150,000
failure occurring every year would fall in the B
risk level category, but a $150,000 failure occurring twice or
more a year would fall in the A risk level
Non-monetary events, such as environmental releases and safety
events, may also be included in a risk matrix. An example of a
high-attention environmental event could be a
release of 10 barrels or more of any hazardous fluid from the
process. A high-risk safety event could be any lost time
Notice in the risk map (Table 2) that even high-consequence
events occurring at lower frequencies are considered
A risk level events. This is because this risk map
incorporates a maximum allowable loss trigger regardless of
event frequency for high-consequence events. It serves as
management's way of conveying that all major events are
unacceptable regardless of frequency.
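The classification logic described above can be sketched as a small lookup in Python. Only three things come from the text: the example dollar and frequency boundaries, the two worked $150,000 cases, and the rule that high-consequence events are always level A. Table 2 is not reproduced here, so the remaining matrix cells are assumptions and are marked as such. The sketch also assumes that two failures per year counts as high frequency (as the worked example implies), that a loss of exactly $250,000 is high consequence, and that losses below $25,000 fall outside the matrix.

```python
from typing import Optional


def consequence_class(loss_usd: float) -> Optional[str]:
    """Classify a loss using the example dollar boundaries from the text."""
    if loss_usd >= 250_000:   # assumption: exactly $250,000 treated as high
        return "high"
    if loss_usd >= 100_000:
        return "medium"
    if loss_usd >= 25_000:
        return "low"
    return None               # assumption: below $25,000 is outside the matrix


def frequency_class(failures_per_year: float) -> str:
    """Classify a failure rate using the example boundaries from the text."""
    if failures_per_year >= 2:     # assumption: two per year counts as high
        return "high"
    if failures_per_year > 0.5:    # more than one failure every two years
        return "medium"
    return "low"


# Keys are (consequence class, frequency class).
RISK_MATRIX = {
    ("high", "low"): "A",       # from the text: maximum allowable loss trigger
    ("high", "medium"): "A",    # from the text
    ("high", "high"): "A",      # from the text
    ("medium", "low"): "C",     # assumed
    ("medium", "medium"): "B",  # from the text: $150,000 once a year
    ("medium", "high"): "A",    # from the text: $150,000 twice or more a year
    ("low", "low"): "C",        # assumed
    ("low", "medium"): "C",     # assumed
    ("low", "high"): "B",       # assumed
}


def risk_level(loss_usd: float, failures_per_year: float) -> Optional[str]:
    """Return 'A', 'B' or 'C', or None if the loss is below the matrix."""
    c_class = consequence_class(loss_usd)
    if c_class is None:
        return None
    return RISK_MATRIX[(c_class, frequency_class(failures_per_year))]


print(risk_level(150_000, 1))     # medium consequence, medium frequency
print(risk_level(150_000, 2))     # medium consequence, high frequency
print(risk_level(300_000, 0.01))  # high consequence, low frequency
```

A site adopting this approach would replace the assumed cells and boundaries with the values from its own risk matrix.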
Having defined A, B and C risk level events, we need to
determine the extent to which analysis is warranted at each of
these levels. Here are some general guidelines as to when and
whom to assign to the analysis of A, B and C level events:
Level A issues.
Failure events falling in the A region of the
risk matrix justify the highest level of analysis. This usually
means that a multidisciplinary team should be used to mine the
data and procure the physical evidence in order to determine
the most probable cause or causes of the failure. Typically,
a team of three to five technical specialists (often a
machinery engineer or technician, a vibration technician, a
process engineer, operators, etc.) should be formed to
leave no stone unturned during
the investigation. The team may take a few days or weeks to
finalize their findings and recommendations.
Level B issues.
Failure events falling in the B region of the
risk matrix are considered costly but do not justify a
full-blown team approach. Level B analyses are
typically conducted by an engineer or highly skilled machinery
technician working with process support. This type of analysis
rarely takes more than one week for an individual to
Level C issues.
Failure events falling in the C region of the
risk matrix can be considered "bread and butter"
investigations. Even though they are termed low-consequence
events, they typically represent the largest number of RCFAs.
Based on their sheer numbers, they can represent the greatest
cost-saving potential for a plant. Level C RCFAs are often
conducted by mechanics using the "5 Why" or similar
methods. A systematic use of Level C RCFAs can lead to a
dramatic and rapid reduction of repeat failures.
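The staffing guidelines for the three levels can be captured in a simple lookup table, which a site might embed in a work-order or reliability tracking tool. The Python sketch below only restates what the text says. The names `AnalysisPlan`, `ANALYSIS_PLANS` and `plan_for` are hypothetical, and no duration is recorded for Level C because the text does not give one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisPlan:
    who: str
    typical_duration: str


ANALYSIS_PLANS = {
    "A": AnalysisPlan(
        who="multidisciplinary team of three to five technical specialists",
        typical_duration="a few days or weeks",
    ),
    "B": AnalysisPlan(
        who="engineer or highly skilled machinery technician with process support",
        typical_duration="rarely more than one week",
    ),
    "C": AnalysisPlan(
        who="mechanics using the 5 Why or similar methods",
        typical_duration="not stated in the text",
    ),
}


def plan_for(level: str) -> AnalysisPlan:
    """Look up the analysis guideline for risk level 'A', 'B' or 'C'."""
    try:
        return ANALYSIS_PLANS[level.upper()]
    except KeyError:
        raise ValueError(f"unknown risk level: {level!r}") from None


for level in ("A", "B", "C"):
    plan = plan_for(level)
    print(f"Level {level}: {plan.who} ({plan.typical_duration})")
```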
Proceeding to analyze.
The root-cause failure analysis effort should be considered
one of the main pillars of machinery reliability. We have all seen formal
failure investigations solve the most perplexing and costly
problems. But to ensure success, participants must be given the
proper training and the time to uncover latent root causes.
The payoff is well worth the price. Consistent
administration of these methods can empower an organization to
whittle down early and frequent machinery failures until only
predictable end-of-life failures are experienced.
Robert Perez is the author of
Operator's Guide to Centrifugal Pumps and
co-creator and editor of the PumpCalcs.com website. He
has more than 30 years of rotating equipment experience
in the petrochemical industry and
has numerous machinery reliability articles to his
credit. Mr. Perez holds a BS degree in mechanical
engineering from Texas A&M University at College
Station and an MS degree in mechanical engineering from
the University of Texas at Austin. Mr. Perez holds a
Texas PE license. He can be reached at firstname.lastname@example.org.