Risk Management for Lubricated Machines
A central component of plant reliability is identifying and managing the risk associated with the equipment within the plant. As one becomes immersed in the vast number of technologies and methodologies for increasing plant reliability, each possible effort must be judged by weighing its benefit against the cost of achieving that benefit. This cannot easily be done without an understanding of the equipment’s propensity to fail. A failure might be a complete loss of productive function, or it could simply be reduced productivity or a hazard exposed to the surroundings. Among the essential equipment in a plant, lubricated machines are of particular concern because they tend to be the assets with moving parts, which introduce more dynamic failure points. In terms of plant reliability, lubricated rotating equipment represents a subset of plant assets that is by default at higher risk. For this reason, reliability engineers often emphasize risk analysis and risk mitigation for lubricated rotating equipment. Over the decades, these efforts have produced many proven methods to narrow down which machines are most at risk, meticulously analyze the degree of risk and identify solutions to mitigate it.
One way to better understand the concept of risk for lubricated machines is to see it as the product of two variables: (1) the probability of failure and (2) the consequence of failure. This is analogous to a person standing on the edge of a cliff.
Risk = Probability of Failure × Consequence of Failure
First, the probability of a person falling off that cliff is characterized by questions such as: How close is he to the cliff? How steep is the slope at the cliff? Are there guardrails? Is this a young reckless person or is this a rock climber experienced with cliffs?
Second, the consequence of failure can be characterized by questions such as: How high is the cliff? Are there rocks or deep water at the bottom of the cliff? Is there a safety net at the bottom? Is this person an experienced base jumper with a wingsuit? With this combination of questions, one can assess whether that cliff daredevil is at risk or not.
The same types of questions can be asked of lubricated machines to assess their level of risk. Answering these questions is fundamentally what drives reliability-centered maintenance (RCM). When RCM is done effectively, there is a deep understanding of every relevant variable that can influence the plant’s overall reliability, with the goal of determining what can be done to maintain an optimized level of reliability.
Considering these two variables of risk, the probability that a lubricated machine will fail is irrelevant if there is no consequence. Similarly, the consequence of failure is irrelevant if the machine cannot fail, theoretically speaking. The two must be understood to manage the risk. When putting this into practice, here are some questions that might be asked to identify the risk associated with lubricated machines.
The Probability of Failure
- How well are the lubricated machines equipped with sensors, sight glasses, oil analysis and other feedback indicators to alert lubrication technicians of a potential failure mode?
- How well-equipped are lubrication technicians with the right tools and software?
- How well-trained are the lubrication technicians in lubrication best practices?
- How experienced are the lubrication technicians in observing and mitigating lubrication failure modes with proactive maintenance and predictive maintenance strategies?
- How is the culture of the lubrication team in promoting good lubrication practices and aiming towards continuous improvement?
The Consequence of Failure: If the lubricated machine were to experience a failure...
- How much does the replacement part cost? What is the availability of this replacement part?
- How much labour would be required? Can this work be easily scheduled?
- How much time would pass before this impacts production? What is the cost of this lost production? Are there backups to this machine?
- What other machines could be damaged or also experience a failure (as a chain reaction of failures)?
- Is the safety of anyone compromised? If so, in what way?
- Is the environment compromised? If so, in what way?
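As a minimal sketch of how the two variables combine, the snippet below multiplies an assumed probability of failure by an assumed consequence cost and ranks machines by the result. The machine names, probabilities and costs are hypothetical values chosen only for illustration; in practice they would come from answering the questions above.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    probability_of_failure: float  # assumed yearly probability, 0.0 to 1.0
    consequence_cost: float        # assumed total cost of one failure event

    @property
    def risk(self) -> float:
        # Risk = probability of failure x consequence of failure
        return self.probability_of_failure * self.consequence_cost


def rank_by_risk(machines):
    """Return the machines ordered from highest to lowest risk."""
    return sorted(machines, key=lambda m: m.risk, reverse=True)


# Hypothetical plant assets
machines = [
    Machine("pump B", 0.40, 10_000),
    Machine("fan C", 0.05, 0),          # can fail, but with no consequence
    Machine("gearbox A", 0.10, 200_000),
]

for m in rank_by_risk(machines):
    print(f"{m.name}: risk = {m.risk:,.0f}")
```

Note that the machine most likely to fail is not the one at highest risk, and the machine with no consequence carries no risk regardless of its probability of failure.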
Pareto Principle
Have you ever taken a multiple-choice test where more than one possible answer was correct, but to get credit for the answer you had to choose the most correct answer? These types of questions are often missed; not because the test taker didn’t know the answer but because the test taker was quick to select a “right answer” before reading all other options that may have included an even better answer.
It is not always easy to know what the right actions should be to improve the reliability of your lubricated equipment. Oftentimes, daily work tasks are decided based on a quick justification that the action will provide a benefit. While this is good, is it the best, most beneficial action to take right now? Consider the Pareto Principle, often called the 80:20 rule. This principle suggests that about 80% of the effects are due to about 20% of the causes. For example, if we analyze all the possible causes of a machine failure, a select 20% of them are the root cause of roughly 80% of the occurrences of failure. That 20% is called the “critical few”. If each reliability effort, including your time today, is focused on addressing the critical few, then in theory those actions are the most important things one can work on today. But first, all the possible actions must be analyzed before knowing which one to select as the best action, just like the multiple-choice question.
But there is one problem... what exactly are these “answers”? What are those “critical few”? For lubricated equipment, this might simply mean focusing on a specific set of machines deemed the most critical in the plant. Or, when looking at root causes, it might mean adjusting the plant-wide reliability focus to target the type of contaminant known to cause the largest number of contamination-related failures. Fortunately, many of these justifications are well documented for the general case. For example, the critical few reasons why bearings fail are improper lubrication and contamination. More specifically, a 2009 Machinery Lubrication magazine article titled “Lubricant Failure = Bearing Failure” identified improper lubrication as responsible for 40-50% of bearing failures. This includes lubrication failure modes such as:
- Improper lubricant selection, such as incorrect viscosity selection or grease thickener type
- Improper lubricant quantity, such as over or under-greasing a bearing
- Contamination, such as dirt ingression from bearing seals
- Excessive temperatures, as a root cause, or part of a symptom from a different lubricant failure mode
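One way to find the “critical few” from a plant’s own failure records is a simple Pareto analysis: rank the causes by how often they occur and keep the top causes until their cumulative share reaches about 80%. The sketch below assumes a hypothetical tally of failure causes; the counts are invented for illustration and are not taken from the article cited above.

```python
def critical_few(cause_counts, threshold=0.80):
    """Return the top causes whose cumulative share of failures reaches the threshold."""
    total = sum(cause_counts.values())
    if total == 0:
        return []
    # Rank causes from most to least frequent
    ranked = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0
    for cause, count in ranked:
        selected.append(cause)
        cumulative += count
        if cumulative / total >= threshold:
            break
    return selected


# Hypothetical failure tally for one plant (number of recorded failures per cause)
failure_counts = {
    "contamination": 46,
    "improper lubricant quantity": 36,
    "improper lubricant selection": 7,
    "excessive temperature": 5,
    "misalignment": 3,
    "other": 3,
}

print(critical_few(failure_counts))
```

With this made-up tally, two of the six causes account for more than 80% of the recorded failures, so those two would be the first candidates for attention.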
Reliability-centered maintenance uses risk management as a reliability tool. The lubrication-related risk identified for machines helps justify a cost-effective strategy to keep critical equipment from experiencing costly failure modes. Once the lubrication failure modes are identified, the following questions need to be assessed to ultimately rank their associated risk to the plant:
- Which assets exhibit the greatest potential consequence if a failure event occurs?
- What is the relative severity of the consequences?
- How likely are these failure modes to occur?
- What can be done to prevent these events?
When following RCM, the three phases (Decide, Analysis, Act) provide a structure to follow during implementation. In a recent Reliable Plant article, “How to Implement Reliability-Centered Maintenance”, Jonathan Trout describes some important aspects of implementing reliability-centered maintenance.
Failure Patterns
Machine failures can often seem random. This can seem even more true when analyzing a single failure event or looking at only fragments of information on failure events. Being effectively proactive about avoiding machine failures requires a view into the future. Maybe not literally, but predicting a future failure on your equipment can be greatly enhanced by understanding the patterns of the past and envisioning those patterns extending into the future. Some patterns are easy to spot, but in many cases the many known and unknown variables that go into a potential failure mode make this difficult.
Recognizing failure patterns for lubricated machines has been done for decades and has proved instrumental in improving reliability through design improvements, better identification of operating limitations and the inclusion of maintenance recommendations. Some patterns are straightforward, such as a failure rate that rises steadily as time progresses. Others predict a known time at which the likelihood of failure will spike dramatically, known as time-based wear-out.
The bathtub curve is an often-discussed failure curve that combines a few different failure patterns. On the front end is the possibility of infant mortality, which suggests machines have a higher propensity to fail upon startup due to manufacturing defects, installation errors, initial lack of internal cleanliness or other built-in problems. Once a certain time has passed, the rate of failure from these causes drops dramatically. This middle period can last a long time, and if a failure does occur, it may be due to operational factors or uncontrolled contaminant exposure. On the back end of this curve, towards the end of life, the machine may once again experience an increase in failure rate. This can be due to an expected time-based wear-out or to a steadily rising failure rate that eventually becomes significant.
The credibility of a failure pattern is greatly enhanced when it is sourced from a large set of data from similar machines under similar conditions. The failure rate is defined as the number of failures divided by the total running time. This is the inverse of a similar commonly used term, mean time between failures, or MTBF. The better the quality and the greater the quantity of the data, the more effective these failure patterns are as a tool. When determining the probability of failure for a specific machine, the best data comes from analyzing past failures of similar machines at the same plant under similar conditions. One effective technique for analyzing failure data is called Weibull analysis or Weibull distribution. This technique starts by plotting past failure and/or current operation data on a Weibull graph (special log-based graph paper) to determine a shape parameter, β. This shape parameter can then be used to visualize different failure rate curves that represent an increasing failure rate (β > 1), a decreasing failure rate (β < 1) or, when β = 1, a constant failure rate. Some examples of failure rate curves used to graphically visualize the failure or life characteristics of the machine are:
- Weibull probability density curve – a curve, generally shaped like a bell curve, representing the probability of failure (y-axis) occurring at a specific point in time (x-axis).
- Cumulative distribution function – a curve, generally increasing over time, that shows the probability of failure (y-axis) occurring before a given point in time (x-axis).
- Weibull reliability – a curve, generally decreasing over time, that shows the change in reliability (y-axis) over a period of time (x-axis).
- Weibull hazard rate – a curve, generally increasing for wear-out failure modes, that represents the failure hazard (y-axis) over a period of time (x-axis); it is often considered the failure rate.
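The four curves listed above can be written out for the standard two-parameter Weibull distribution. The sketch below assumes a scale parameter, here called eta (the characteristic life), alongside the shape parameter β; the scale parameter is not discussed in this article, and the numeric values used are hypothetical.

```python
import math


def _check(t, beta, eta):
    # The formulas below are only evaluated for positive time and parameters
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must all be positive")


def weibull_reliability(t, beta, eta):
    """Probability that the machine survives beyond time t."""
    _check(t, beta, eta)
    return math.exp(-((t / eta) ** beta))


def weibull_cdf(t, beta, eta):
    """Probability that failure occurs before time t."""
    return 1.0 - weibull_reliability(t, beta, eta)


def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate at time t, given survival up to t."""
    _check(t, beta, eta)
    return (beta / eta) * (t / eta) ** (beta - 1)


def weibull_pdf(t, beta, eta):
    """Probability density of failure at time t (hazard times reliability)."""
    return weibull_hazard(t, beta, eta) * weibull_reliability(t, beta, eta)


# Hypothetical characteristic life of 1,000 operating hours
eta = 1000.0
for beta in (0.5, 1.0, 2.0):
    early = weibull_hazard(100.0, beta, eta)
    late = weibull_hazard(900.0, beta, eta)
    trend = "increasing" if late > early else "decreasing" if late < early else "constant"
    print(f"beta = {beta}: hazard rate is {trend}")
```

The loop at the end shows how the shape parameter alone determines whether the hazard rate falls, stays flat or rises over time.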
One of the greatest benefits of analyzing the Weibull distribution of a failure mode is determining the risk level associated with a lubricated machine, even when the failure rate is constant for a period of time. The failure curves are often determined using empirical data, but analyzing the actual cause of failure requires looking at influencing variables such as those related to operating conditions, environmental conditions, maintenance history or machine design defects. An analysis technique for these possible contributing factors is therefore necessary.
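Determining the shape parameter from empirical data, the step traditionally done by plotting on Weibull graph paper, can be approximated numerically. The sketch below uses median rank regression with Bernard's approximation, which is one common way to do this and is an assumption here rather than the method the article prescribes. It handles only complete failure data (no machines still in operation), and the failure times are hypothetical.

```python
import math


def fit_weibull(failure_times):
    """Estimate the Weibull shape (beta) and scale (eta) by median rank regression."""
    times = sorted(failure_times)
    n = len(times)
    if n < 2 or times[0] <= 0 or times[0] == times[-1]:
        raise ValueError("need at least two distinct, positive failure times")

    # x = ln(t); y = ln(-ln(1 - F)), with F from Bernard's median rank approximation.
    # On these axes a Weibull distribution plots as a straight line with slope beta.
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]

    # Ordinary least-squares line through the points
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean

    # The intercept equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


# Hypothetical operating hours at failure for similar bearings in one plant
hours_to_failure = [410, 620, 750, 890, 1010, 1180, 1400]
beta, eta = fit_weibull(hours_to_failure)
print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} hours")
```

A fitted β above 1 would point to a wear-out pattern, while a value below 1 would point to infant mortality. With only a handful of data points, the estimate should be treated as a rough indication.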
FMEA
Failure mode and effects analysis, or FMEA, is a methodology that NASA engineers adopted and refined during the space race more than 50 years ago. Their design practices had the goal of making every proactive effort to investigate all the possible ways the vessel taking man to the moon could fail and what could be done to avoid each failure. This methodology has since been used across industries to improve mechanical designs by using a thorough review process to identify every possible failure mode. When it is applied at the design stage, lubricated equipment becomes more reliable before it is even in service. The failure modes considered are a function of more than just operating characteristics such as load and speed; they also include the possible effects of environmental factors such as solid contaminants, moisture and external temperature extremes, and even the possibility of misusing the equipment intentionally or unintentionally. In short, this becomes a review in the spirit of Murphy’s Law: determining what can go wrong, because ultimately it may very well be the reason the equipment fails.
The result of performing an FMEA is about more than just identifying failure modes. The process covers the whole line of thinking required to review the characteristics of each failure event, including:
1. Defining the asset functions.
- Primary functions related to operational productivity.
- Secondary functions related to safety and environmental impacts.
2. Defining the state of functional failure for the asset.
- What constitutes a failure state?
- What are the events that lead up to that failed state (such as a chain reaction)?
3. Identifying the specific effects from the failure mode.
- What is the evidence of a failure mode?
4. Identifying the root causes of the failure mode.
- What was the initial event that started the chain reaction leading up to the failure?
5. Identifying the nature of each consequence as a result of a failure.
- What is the evidence of a failure mode?
- Production, safety, environmental, hidden, secondary chain reactions, etc.
6. Identifying the risk related to each consequence.
- What is the severity?
- What is the likelihood?
- What is the current state of detectability?
Each possible failure mode is analyzed under the scrutiny of these questions. Consider a lubrication-related failure mode, such as abrasive wear on a ball bearing. We must ask questions such as: Why did the lubricant fail to protect the bearing? Was it related to the lubricant properties, or was it contaminant-induced? Does this bearing failure result in a significant set of consequences, or is this bearing on a non-critical asset?
Once each failure mode is well characterized and well understood, a risk assessment can be developed and used to prioritize the actions (if any) that need to be taken. Ultimately, this prioritization justifies which lubrication concerns are most important to address and in what order.
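One common way to turn severity, likelihood and detectability into a single ranking is a risk priority number (RPN), the product of the three ratings. The sketch below assumes each factor is rated on a 1 to 10 scale, with a higher detectability rating meaning the failure mode is harder to detect. Neither the scale nor the RPN itself is specified in this article, and the failure modes and ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    description: str
    severity: int       # assumed scale: 1 (negligible) to 10 (catastrophic)
    likelihood: int     # assumed scale: 1 (remote) to 10 (almost certain)
    detectability: int  # assumed scale: 1 (easily detected) to 10 (undetectable)

    def __post_init__(self):
        # Reject ratings outside the assumed 1 to 10 scale
        for name in ("severity", "likelihood", "detectability"):
            if not 1 <= getattr(self, name) <= 10:
                raise ValueError(f"{name} must be between 1 and 10")

    @property
    def rpn(self) -> int:
        # Risk priority number: severity x likelihood x detectability
        return self.severity * self.likelihood * self.detectability


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical ratings for lubrication-related failure modes on one asset
modes = [
    FailureMode("over-greasing of motor bearing", 4, 5, 3),
    FailureMode("abrasive wear on ball bearing from dirt ingression", 7, 6, 4),
    FailureMode("incompatible grease thickener applied", 5, 3, 6),
]

for mode in prioritize(modes):
    print(f"RPN {mode.rpn:>3}: {mode.description}")
```

The ordering gives a starting point for deciding which concerns to address first; the ratings themselves still depend on the judgment of the people performing the FMEA.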
Hazard Analysis Critical Control Points
Another consideration when managing risk is the set of steps used to routinely inspect machines for potential hazards. Hazard Analysis Critical Control Points, or HACCP, is a technique used in the food industry to do just this. HACCP is a program that focuses on examining critical control points in the production cycle of a food processing facility and determining whether there are any known or reasonably foreseeable hazards, such as health and safety risks, that could occur with the current configuration. Observations are made to consider any food contamination points throughout the process, as well as any allergens that could cross-contaminate foods and any sanitation risks. The HACCP program is monitored routinely and documented in detail to provide confidence in its thoroughness. Programs like these have been successful in food processing facilities.
Risk: A Call for Action
Risk is all about analyzing the probability of failure and the consequences that result from a failure. A failure does not have to be a full seizure of a machine; it can simply be a reduction in intended function due to an expected cause, or it can be the development of a health or safety concern. Lubricated machines are often of critical concern and potentially high risk due to the nature of their mechanical design and their vast use in industrial facilities. Even the use of the lubricant itself can pose a risk to the surroundings, whether to the health of those nearby or through contamination of the environment. Ultimately, identifying what can be controlled and should be controlled may be crucially valuable in improving the reliability of machines and overall wellbeing. Managing this risk by taking key proactive and preventative actions to minimize or eliminate it is our call for action.