We have all read about and acknowledged the merits of root-cause failure-analysis methods. But not everyone agrees on when and how these methods should be used. The author once worked within an organization so enamored with failure analyses that a machinery engineer had to write a formal report for each and every plant machinery failure. This seemed unusual and even a bit extreme.
Yet, the issue raises an important question: When is formal failure analysis justified? Before answering, we need to briefly agree on the customary definition of a root-cause failure analysis. A root-cause failure analysis (RCFA) is a formal analysis of an equipment or system failure with the intent of uncovering all latent causal factors and contributing factors, so that similar future failures may be prevented.
Latent causal factors are hidden causes lying just below the surface of the wreckage. Whenever a repeat failure occurs, we must realize that the hidden causes are still at play and that future failures are likely. It is therefore imperative that, during the failure-analysis process, the analyst (or analysts) look beyond the physical evidence and uncover the true latent causes of the failure. Here are a few quick examples of latent causal factors in process pumps:
Parallel operation of pumps drives a pump off its allowable flow range, which then causes internal recirculation and high bearing loads.
Low-flow pump operation due to process changes results in frequent seal failures.
Oil supply is inconsistent because of inherent issues with an inexpensive oil delivery choice (oil ring degradation).
Design weaknesses in certain bearing protector seals can result in oil contamination.
Oil contamination is caused by poor lubricant storage practices.
Oil consolidation efforts often lead to unwise oil replacement choices and ultimately cause bearing failures.
Bearings selected in accordance with the generalized recommendations of prominent industry standards are not always the best choice.
Root-cause failure analysis: A case history.
A petrochemical plant and oil refinery was experiencing recurring failures of a cylindrical roller bearing in a critical centrifugal pump (Fig. 1). The pump's transition from a belt-drive to a direct-drive configuration had led to unexplained bearing failures and costly downtime.
Fig. 1. Recurring failure of cylindrical roller bearing in critical centrifugal pump. Photo credit: NSK Bearings Europe.
The bearing manufacturer investigated the application conditions and the failed cylindrical roller bearings. Changing from a belt-drive to a direct-drive arrangement had drastically reduced the radial load on the existing bearing. In turn, this had led to rollers slipping on the inner raceway. The solution was simple: the original bearing was readily replaced by a deep-groove ball bearing.
In essence, the bearing manufacturer determined that a light load on the cylindrical roller bearing was the cause of bearing distress. The latent root cause was the transition from a belt-drive system to a direct-drive arrangement. The solution to this issue was to select the proper bearing for the new load conditions.
Time is money.
The RCFA method is one of the most powerful tools of a reliability program, but it is also one of the most time-consuming and, thus, costly endeavors. Since time is money, RCFAs must be administered judiciously.
Like all decisions made in a competitive environment, the decision of when to conduct an RCFA must be based on economics. Since not all failures are created equal, not all failures warrant the same level of analysis. The decision to conduct an RCFA should be based on the answers to two questions:
1. What is the consequence of the failure? As the failure consequence rises, the need for analysis becomes more critical.
2. What is the failure frequency? The higher the failure rate, the more attention the piece of equipment should get.
If we denote the consequence as C and the rate of failure as F, we can define the risk, R, of failure as the product of consequence and the failure rate:
Risk (R) = C × F
So, if a pump that costs $20,000 to repair fails four times a year, this represents an annualized risk of $80,000/yr, i.e., $20,000 × 4. Next, we will construct a risk matrix based on a broad range of consequences and failure frequencies (see Table 1).
Notice that risk levels (i.e., consequence times frequency) exceeding $1 million/yr are highlighted in a rose color. Risk levels of $100,000/yr are highlighted in yellow, and risk levels of $10,000/yr are highlighted in green. Risk levels of $1,000/yr and below are shown in gray and are assumed to warrant little or no concern. In general, high-risk issues sit in the upper right-hand corner of a risk map and low-risk issues sit in the lower left-hand corner (Table 1).
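The risk product and the color banding described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the decade-spaced consequence and frequency axes, and the treatment of each quoted dollar figure as the inclusive lower bound of its band are assumptions, since the actual cells are defined by Table 1.

```python
def annualized_risk(consequence_usd, failures_per_year):
    """Risk R = C x F, in dollars per year."""
    return consequence_usd * failures_per_year


def risk_band(risk_usd_per_year):
    """Map an annualized risk to a highlight color.

    Assumption: each dollar figure quoted in the text is treated as the
    inclusive lower bound of its band.
    """
    if risk_usd_per_year >= 1_000_000:
        return "rose"
    if risk_usd_per_year >= 100_000:
        return "yellow"
    if risk_usd_per_year >= 10_000:
        return "green"
    return "gray"


# Hypothetical decade-spaced axes; a real site would use its own ranges.
CONSEQUENCES_USD = [1_000, 10_000, 100_000, 1_000_000]
FAILURES_PER_YEAR = [0.01, 0.1, 1, 10]

if __name__ == "__main__":
    # The pump example from the text: $20,000 per repair, four failures a year.
    print(annualized_risk(20_000, 4))  # 80000

    # One row of band colors per consequence level.
    for c in CONSEQUENCES_USD:
        row = [risk_band(annualized_risk(c, f)) for f in FAILURES_PER_YEAR]
        print(f"${c:>9,}: {row}")
```

Under these assumed band edges, the $80,000/yr pump example would land in the green band.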
One can consider these annualized risk levels to be potential revenue that can be recovered if the root cause of an issue falling in one of these cells is found and the required remedial steps are implemented. That said, we should view all RCFA pursuits as potential projects and, as with any project, we must assess their cost-to-benefit ratio.
Suppose we have an event that is occurring 10 times a year with a consequence of $10,000. This represents an annualized risk of $100,000. If the total cost of remedial action and RCFA is $20,000, the project will have a payback of about 73 days. On the other hand, if there is a $100,000 consequence that occurs once every 100 years, this only represents a risk of $1,000/yr. It would not make much sense to assign a multi-disciplinary analysis team to investigate this failure.
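The payback arithmetic above can be checked with a short sketch. The function name and the simple-payback formula (project cost divided by the annualized risk, assuming the fix removes the entire risk) are illustrative assumptions rather than a prescribed method.

```python
def payback_days(project_cost_usd, annualized_risk_usd, days_per_year=365):
    """Simple payback period in days, assuming the fix removes the whole risk."""
    if annualized_risk_usd <= 0:
        raise ValueError("annualized risk must be positive")
    return project_cost_usd * days_per_year / annualized_risk_usd


if __name__ == "__main__":
    # Ten $10,000 events per year, fixed by a $20,000 RCFA plus remedial action.
    risk = 10_000 * 10
    print(payback_days(20_000, risk))  # 73.0

    # A $100,000 event once every 100 years carries only $1,000/yr of risk,
    # so the same $20,000 effort would take 20 years to pay back.
    rare_risk = 100_000 / 100
    print(payback_days(20_000, rare_risk) / 365)  # 20.0
```

Seen this way, the rare $100,000 event is a poor candidate for a team investigation because of its long payback, even though its consequence is large.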
The take-away here is that annualized risk, not the consequence level, is the better means of determining the level of analysis justified. However, it is also possible to set a maximum allowable loss trigger for a failure investigation. For example, a site may choose to investigate all events resulting in losses of $1 million or more, regardless of their frequency. Together, the risk map and the maximum allowable loss methods can provide clear RCFA guidance. Most production companies take this basic risk-based approach and develop their own risk matrix, as shown in Table 2.
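Combining the two triggers might look like the following sketch. The $1 million single-event limit comes from the example above; the function name and the $100,000/yr annualized-risk threshold are hypothetical placeholders that each site would replace with its own values.

```python
MAX_ALLOWABLE_LOSS_USD = 1_000_000      # single-event trigger from the example
RISK_THRESHOLD_USD_PER_YEAR = 100_000   # hypothetical site-specific threshold


def rcfa_warranted(consequence_usd, failures_per_year):
    """Return True if either trigger calls for a formal RCFA."""
    # Trigger 1: any single event at or above the maximum allowable loss.
    exceeds_max_loss = consequence_usd >= MAX_ALLOWABLE_LOSS_USD
    # Trigger 2: annualized risk (C x F) at or above the site's threshold.
    annual_risk = consequence_usd * failures_per_year
    exceeds_risk = annual_risk >= RISK_THRESHOLD_USD_PER_YEAR
    return exceeds_max_loss or exceeds_risk


if __name__ == "__main__":
    print(rcfa_warranted(2_000_000, 0.01))  # True: large single-event loss
    print(rcfa_warranted(10_000, 10))       # True: $100,000/yr of risk
    print(rcfa_warranted(100_000, 0.01))    # False: about $1,000/yr of risk
```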
Each organization must define what high-, medium- and low-consequence events are, along with low-, medium- and high-failure frequencies. For example, a company might choose to define high-, medium- and low-consequence events as those resulting in losses of more than $250,000, $100,000 to $249,999, and $25,000 to $99,999, respectively. High-, medium- and low-failure frequencies could be defined as more than two failures/yr, more than one failure every two years, or less than one failure every two years, respectively. So, a $150,000 failure occurring every year would fall in the B risk level category, but a $150,000 failure occurring twice or more a year would fall in the A risk level category.
Non-monetary events, such as environmental releases and safety events, may also be included in a risk matrix. An example of a high-attention environmental event could be a release of 10 barrels or more of any hazardous fluid from the process. A high-risk safety event could be any lost time accident.
Notice in the risk map (Table 2) that even high-consequence events occurring at lower frequencies are considered A risk levels. This is because this risk map incorporates a maximum allowable loss trigger regardless of event frequency for high-consequence events. It serves as management's way of conveying that all major events are unacceptable regardless of frequency.
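The example category definitions and the high-consequence rule can be sketched as a small lookup. This is a hedged illustration, not a reproduction of Table 2: the function names are hypothetical, only the cells stated explicitly in the text are encoded, and every other cell returns None because it has to come from the site's own matrix. Boundary handling is also an assumption: a loss of exactly $250,000 is treated as medium, a rate of exactly one failure every two years as low, and exactly two failures a year as medium, following the "more than two failures/yr" definition.

```python
def consequence_category(loss_usd):
    """Example consequence bands; None if below the lowest defined band."""
    if loss_usd > 250_000:
        return "high"
    if loss_usd >= 100_000:
        return "medium"
    if loss_usd >= 25_000:
        return "low"
    return None


def frequency_category(failures_per_year):
    """Example frequency bands (boundary handling is an assumption)."""
    if failures_per_year > 2:
        return "high"
    if failures_per_year > 0.5:
        return "medium"
    return "low"


# Only the cells the text states explicitly; all others are site-specific.
KNOWN_CELLS = {
    ("medium", "high"): "A",
    ("medium", "medium"): "B",
}


def risk_level(loss_usd, failures_per_year):
    """Return 'A' or 'B' where the text defines it, else None."""
    consequence = consequence_category(loss_usd)
    if consequence == "high":
        # Maximum allowable loss trigger: level A at any frequency.
        return "A"
    frequency = frequency_category(failures_per_year)
    return KNOWN_CELLS.get((consequence, frequency))


if __name__ == "__main__":
    print(risk_level(150_000, 1))     # B
    print(risk_level(150_000, 3))     # A
    print(risk_level(300_000, 0.01))  # A
```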
Having defined A, B and C risk level events, we need to determine the extent to which analysis is warranted at each of these levels. Here are some general guidelines on the depth of analysis and the personnel to assign for A-, B- and C-level events:
Level A issues.
Failure events falling in the A region of the risk matrix justify the highest level of analysis. This usually means that a multidisciplinary team should be used to mine the data and procure the physical evidence in order to determine the most probable cause or causes of the failure. Typically, a team of three to five technical specialists (often a machinery engineer or technician, a vibration technician, a process engineer, operators, etc.) should be formed to leave no stone unturned during the investigation. The team may take a few days or weeks to finalize its findings and recommendations.
Level B issues.
Failure events falling in the B region of the risk matrix are considered costly but do not justify a full-blown team approach. Level B analyses are typically conducted by an engineer or highly skilled machinery technician working with process support. This type of analysis rarely takes more than one week for an individual to finalize.
Level C issues.
Failure events falling in the C region of the risk matrix can be considered bread-and-butter investigations. Even though they are termed low-consequence events, they typically represent the largest number of RCFAs. Based on their sheer numbers, they can represent the greatest cost-saving potential for a plant. Level C RCFAs are often conducted by mechanics using the 5 Why or similar methods. A systematic use of Level C RCFAs can lead to a dramatic and rapid reduction of repeat failures.
Proceeding to analyze.
The root-cause failure-analysis effort should be considered one of the main pillars of machinery reliability. We have all seen formal failure investigations solve the most perplexing and costly problems. But to ensure success, participants must be given the proper training and the time to uncover latent root causes.
The payoff is well worth the price. Consistent administration of these methods can empower an organization to whittle down early and frequent machinery failures until only predictable end-of-life failures are experienced.
The author
Robert Perez is the author of Operator's Guide to Centrifugal Pumps and co-creator and editor of the PumpCalcs.com website. He has more than 30 years of rotating equipment experience in the petrochemical industry and has numerous machinery reliability articles to his credit. Mr. Perez holds a BS degree in mechanical engineering from Texas A&M University at College Station and an MS degree in mechanical engineering from the University of Texas at Austin. Mr. Perez holds a Texas PE license. He can be reached at firstname.lastname@example.org.