Tag: DevOps

Site Reliability Engineering – Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis (FMEA) is a tool for identifying potential problems and their impact.

Here are some of the benefits:

– It gives you an organized way to identify and qualify the potential problems with your services, which helps you drive conversations with service owners during service workshops
– The “Risk Priority Number” helps you prioritize the things you need to monitor first
– Having all potential problems listed helps not only the SRE team but also the Development team think about how to avoid such issues during development. Finding problems at this point in the cycle can significantly reduce costs and avoid schedule delays
– It increases product quality and reliability

Creating an FMEA

When completing an FMEA, it’s important to remember Murphy’s Law: “Anything that can go wrong, will go wrong.” Participants need to identify all the components, systems, processes and functions that could potentially fail to meet the required level of quality or reliability. The team should not only be able to describe the effects of the failure, but also the possible causes.

Here are the fields to be filled in for each potential issue:

Function or Process Step: Briefly outline function, step or item being analyzed
Failure Type: Describe what can go wrong
Potential Impact: What is the impact on the key output variables or internal requirements?
SEV: How severe is the effect on the customer? (more details in the following sections)
Potential Causes: What causes the key input to go wrong?
OCC: How frequently is this likely to occur? (more details in the following sections)
Detection Mode: What are the existing controls that either prevent the failure from occurring or detect it should it occur?
DET: How easy is it to detect? (more details in the following sections)
RPN: Risk priority number (more details in the following sections)
Recommended Actions: What are the actions for reducing the occurrence of the cause or improving the detection?
Responsibility: Who is responsible for the recommended action?
Target Date: What is the target date for the recommended action?
Action Taken: What were the actions implemented? Now recalculate the RPN to see if the action has reduced the risk.

Severity (SEV), Occurrence (OCC) and Detection (DET)
Participants must set and agree on a ranking between 1 and 10 (1 = low, 10 = high) for the severity (SEV), occurrence (OCC) and detection level (DET) for each of the failure modes:

| Ranking | Description | Low Number | High Number |
|---|---|---|---|
| Severity | Severity ranking encompasses what is important to the industry, company or customers (e.g., safety standards, environment, legal, production continuity, scrap, loss of business, damaged reputation) | Low impact | High impact |
| Occurrence | Rank the probability of a failure occurring during the expected lifetime of the product or service | Not likely to occur | Inevitable |
| Detection | Rank the probability of the problem being detected and acted upon before it happens | Very likely to be detected | Not likely to be detected |

Risk Priority Number (RPN)
After ranking the severity, occurrence and detection levels for each failure mode, the team will be able to calculate a risk priority number (RPN). The formula for the RPN is:

RPN = severity x occurrence x detection
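As a minimal sketch of this calculation, the formula can be written in Python with a range check on each 1 to 10 ranking. The function name `rpn` and the example rankings are hypothetical and only illustrate how the number changes once an action improves detection:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the risk priority number for one failure mode."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Hypothetical failure mode: severe (8), fairly rare (3), hard to detect (7).
before = rpn(8, 3, 7)  # 8 * 3 * 7 = 168
# After adding an alert, the detection ranking improves from 7 to 2.
after = rpn(8, 3, 2)   # 8 * 3 * 2 = 48
print(before, after)
```

Because the three rankings are multiplied, the RPN ranges from 1 to 1000, and lowering any single ranking lowers the overall score proportionally.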

Setting Priorities
Once all the failure modes have been assessed, your team should sort the FMEA so that failures are listed in descending RPN order. This highlights the areas where corrective actions should be focused. If resources are limited, practitioners must address the biggest problems first.

A starting point for prioritization is to apply the Pareto rule: typically, 80 percent of issues are caused by 20 percent of the potential problems. As a rule of thumb, the team can focus its attention initially on the failures whose RPN scores fall in the top 20 percent.
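The sorting and the 20 percent rule of thumb could be sketched as follows. The failure modes, their scores and the helper name `prioritize` are made up for illustration, and rounding the cutoff up is an assumption so that at least one failure mode is always selected:

```python
# Hypothetical failure modes with their RPN scores (illustrative data only).
failure_modes = [
    ("Database disk full", 168),
    ("Expired TLS certificate", 240),
    ("Slow DNS resolution", 36),
    ("Message queue backlog", 90),
    ("Failed nightly backup", 120),
]


def prioritize(modes, percent=20):
    """Sort by descending RPN and return (ranked list, top slice)."""
    ranked = sorted(modes, key=lambda mode: mode[1], reverse=True)
    # Integer ceiling of len * percent / 100, with a minimum of one item.
    cutoff = max(1, (len(ranked) * percent + 99) // 100)
    return ranked, ranked[:cutoff]


ranked, focus = prioritize(failure_modes)
for name, score in ranked:
    print(f"{score:4d}  {name}")
print("Focus first on:", [name for name, _ in focus])
```

With five failure modes, the top 20 percent is a single entry, so the team would start with the highest-scoring one and work down the list as resources allow.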


DevOps for offline products

It is much easier to talk about DevOps in companies whose products run online. In companies where most applications, if not all, run on the customer side and we don’t have access to them, it is very challenging.

One of the most important values of DevOps is to have an “Ops Ready” application. An “Ops Ready” application is one in which the Operations team can easily detect problems.

An “Ops Ready” application can be monitored well after deployment, so you can anticipate problems instead of hearing about them from customers. A very good way to do this is to have some kind of heartbeat in every system component that sends information about the system and the machine it is running on. In the end, you should find a problem yourself instead of being told about it.
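As a rough sketch of what such a heartbeat might contain, the snippet below builds one message using only the Python standard library. The function name `build_heartbeat`, the field names and the component name are assumptions for illustration; how often the message is sent and where it is sent to are left out because they depend on your system:

```python
import json
import platform
import socket
import time


def build_heartbeat(component: str, version: str) -> dict:
    """Collect basic facts about the component and its host machine."""
    return {
        "component": component,
        "version": version,
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "python": platform.python_version(),
        "timestamp": time.time(),  # seconds since the epoch
    }


# Hypothetical component name and version.
beat = build_heartbeat("billing-worker", "1.4.2")
print(json.dumps(beat, indent=2))
```

On the receiving side, a component whose latest timestamp is older than the expected interval can be flagged before a customer reports anything.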

Although this is a very good idea, it is very hard (or impossible) to have this kind of heartbeat system when the application normally runs offline. Even so, we still need some way to track problems down.

The solution depends on the type of application you are working on, but you should create some kind of offline monitoring mechanism that stores its data locally and, where possible, sends it to you once a connection is available. Supporting an application with this kind of mechanism is much easier.
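One possible shape for such a mechanism is a local event log that is uploaded when a connection becomes available. The sketch below is only an illustration under assumptions: the class name `OfflineEventLog`, the JSON-lines file format and the `send` callable are hypothetical, and a real application would choose its own storage and upload method:

```python
import json
import os
import tempfile
import time


class OfflineEventLog:
    """Append events to a local file; upload them when a connection exists."""

    def __init__(self, path):
        self.path = path

    def record(self, kind, detail):
        event = {"time": time.time(), "kind": kind, "detail": detail}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def pending(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def flush(self, send):
        """Try to upload all stored events; keep them if sending fails."""
        events = self.pending()
        if not events:
            return 0
        try:
            send(events)  # hypothetical uploader supplied by the caller
        except OSError:
            return 0      # still offline: keep the file for the next attempt
        os.remove(self.path)
        return len(events)


log = OfflineEventLog(os.path.join(tempfile.mkdtemp(), "events.jsonl"))
log.record("error", "license check failed")
log.record("crash", "out of memory while exporting")


def offline_send(events):
    raise ConnectionError("no network")  # simulates a machine with no connection


print(log.flush(offline_send))     # prints 0: nothing uploaded, events kept
uploaded = []
print(log.flush(uploaded.extend))  # prints 2: connection is back, events uploaded
```

The design choice here is to delete the local file only after the upload succeeds, so a failed attempt never loses events.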

You should take privacy into account, but that is a subject for a different post.