Site Reliability Engineering – Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis (FMEA) is a structured tool for identifying the ways a service can fail and the impact each failure would have.

Here are some of the benefits:

– An organized way to identify and qualify all potential problems with your services, which helps you drive conversations with service owners during service workshops
– A “Risk Priority Number” that helps you prioritize the things you need to monitor first
– A full list of potential problems that helps not only the SRE team but also the Development team think about how to avoid such issues during development. Finding problems at this point in the cycle can significantly reduce costs and avoid schedule delays
– Increased product quality and reliability

Creating an FMEA

When completing an FMEA, it’s important to remember Murphy’s Law: “Anything that can go wrong, will go wrong.” Participants need to identify all the components, systems, processes and functions that could potentially fail to meet the required level of quality or reliability. The team should not only be able to describe the effects of the failure, but also the possible causes.

Here are the fields to be filled in for each potential issue:

Function or Process Step: Briefly outline function, step or item being analyzed
Failure Type: Describe what can go wrong
Potential Impact: What is the impact on the key output variables or internal requirements?
SEV: How severe is the effect on the customer? (more details in the following sections)
Potential Causes: What causes the key input to go wrong?
OCC: How frequently is this likely to occur? (more details in the following sections)
Detection Mode: What are the existing controls that either prevent the failure from occurring or detect it should it occur?
DET: How easy is it to detect? (more details in the following sections)
RPN: Risk priority number (more details in the following sections)
Recommended Actions: What are the actions for reducing the occurrence of the cause or improving the detection?
Responsibility: Who is responsible for the recommended action?
Target Date: What is the target date for the recommended action?
Action Taken: What were the actions implemented? Now recalculate the RPN to see if the action has reduced the risk.
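The fields above map naturally onto a simple record. The following is a minimal sketch, not a prescribed schema: the class name `FmeaEntry`, the attribute names and every example value are assumptions made for illustration. The RPN is computed as the product of the three rankings, as defined later in this post, so it updates automatically when a ranking changes after an action is taken.

```python
from dataclasses import dataclass


@dataclass
class FmeaEntry:
    """One row of an FMEA worksheet (hypothetical field names)."""
    function: str
    failure_type: str
    potential_impact: str
    sev: int              # severity ranking, 1-10
    potential_causes: str
    occ: int              # occurrence ranking, 1-10
    detection_mode: str
    det: int              # detection ranking, 1-10 (10 = hard to detect)
    recommended_actions: str = ""
    responsibility: str = ""
    target_date: str = ""
    action_taken: str = ""

    @property
    def rpn(self) -> int:
        # RPN = severity x occurrence x detection
        return self.sev * self.occ * self.det


# Invented example values, for illustration only.
entry = FmeaEntry(
    function="Checkout API",
    failure_type="Requests time out",
    potential_impact="Customers cannot complete orders",
    sev=8,
    potential_causes="Connection pool exhaustion",
    occ=4,
    detection_mode="Customer reports only",
    det=6,
    recommended_actions="Alert on connection pool saturation",
)
before = entry.rpn
print(before)  # 192

# After the action is implemented, detection improves; recalculate the RPN.
entry.action_taken = "Added alert on connection pool saturation"
entry.det = 2
after = entry.rpn
print(after)  # 64
```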

Severity (SEV), Occurrence (OCC) and Detection (DET)
Participants must set and agree on a ranking between 1 and 10 (1 = low, 10 = high) for the severity (SEV), occurrence (OCC) and detection level (DET) for each of the failure modes:

Severity: Severity ranking encompasses what is important to the industry, company or customers (e.g., safety standards, environment, legal, production continuity, scrap, loss of business, damaged reputation). Low number = low impact; high number = high impact.
Occurrence: Rank the probability of a failure occurring during the expected lifetime of the product or service. Low number = not likely to occur; high number = inevitable.
Detection: Rank the probability of the problem being detected and acted upon before it happens. Low number = very likely to be detected; high number = not likely to be detected.

Risk Priority Number (RPN)
After ranking the severity, occurrence and detection levels for each failure mode, the team will be able to calculate a risk priority number (RPN). The formula for the RPN is:

RPN = severity x occurrence x detection
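As a quick illustration of the formula, here is a small sketch in Python. The function name `rpn` and the range check are assumptions added for this example; the check simply enforces the 1 to 10 ranking scale described above, which means an RPN can range from 1 to 1000.

```python
def rpn(sev: int, occ: int, det: int) -> int:
    """Return the risk priority number for one failure mode."""
    for name, value in (("SEV", sev), ("OCC", occ), ("DET", det)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return sev * occ * det


print(rpn(8, 4, 6))     # 192
print(rpn(1, 1, 1))     # 1, the lowest possible score
print(rpn(10, 10, 10))  # 1000, the highest possible score
```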

Setting Priorities
Once all the failure modes have been assessed, your team should sort the FMEA so that failures are listed in descending RPN order. This highlights the areas where corrective actions should be focused. If resources are limited, practitioners must tackle the biggest problems first.
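The ordering step can be sketched in a few lines of Python. The failure modes and rankings below are invented for illustration, and the dictionary keys are hypothetical names rather than part of any standard FMEA template.

```python
# Invented failure modes with their SEV, OCC and DET rankings.
failure_modes = [
    {"failure": "Database primary unavailable", "sev": 9, "occ": 3, "det": 2},
    {"failure": "TLS certificate expires", "sev": 7, "occ": 4, "det": 8},
    {"failure": "Disk fills on log volume", "sev": 5, "occ": 6, "det": 3},
]

# Compute the RPN for each failure mode.
for fm in failure_modes:
    fm["rpn"] = fm["sev"] * fm["occ"] * fm["det"]

# List failures in descending RPN order.
ranked = sorted(failure_modes, key=lambda fm: fm["rpn"], reverse=True)

for fm in ranked:
    print(f'{fm["rpn"]:>4}  {fm["failure"]}')
```

In this made-up data set the expiring certificate ranks first even though it is not the most severe failure, because it is hard to detect with the controls assumed here.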

A starting point for prioritization is to apply the Pareto rule: typically, 80 percent of issues are caused by 20 percent of the potential problems. As a rule of thumb, the team can initially focus its attention on the failures whose RPN scores fall in the top 20 percent.
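One possible way to apply this rule of thumb is sketched below. The helper name `top_fifth` and the sample scores are assumptions for illustration; the cutoff rounds up so that at least one failure mode is always selected from a non-empty list.

```python
def top_fifth(rpn_by_failure: dict[str, int]) -> list[tuple[str, int]]:
    """Return the failure modes in the top 20 percent by RPN, highest first."""
    ranked = sorted(rpn_by_failure.items(), key=lambda item: item[1], reverse=True)
    # Integer ceiling of len / 5, with a minimum of one entry.
    cutoff = max(1, (len(ranked) + 4) // 5)
    return ranked[:cutoff]


# Ten invented failure modes with already calculated RPN scores.
scores = {
    "F1": 54, "F2": 224, "F3": 90, "F4": 12, "F5": 400,
    "F6": 36, "F7": 150, "F8": 8, "F9": 72, "F10": 280,
}

focus = top_fifth(scores)
print(focus)  # [('F5', 400), ('F10', 280)]
```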

