Mean Time Between Cyber Failures

05.08.2024

Cody Tews, Principal Engineer

Delivering reliability is a core tenet of the electric power industry. For more than 40 years, SEL has developed technology and products that further this goal. The opening section of the technical paper “Reliability Analysis of Transmission Protection Using Fault Tree Methods” [1] provides an introduction to reliability:

“Since reliability is the reciprocal of failure, and failure is a random event, probabilistic measures are most appropriate, and we apply the laws of probability theory.

“For example, suppose the reliability of a device is expressed with a mean-time-between-failures (MTBF) of 100 years. The failure rate is 1/100 failures per year. And, if a system has 300 of these devices, then we would expect 300 ∙ (1/100) = 3 device failures per year.

“We use the method of combining component failure rates called ‘fault tree analysis,’ a concept first proposed by H. A. Watson of Bell Telephone Laboratories to analyze the Minuteman Launch Control System....

“If a device consists of several components, then a fault tree helps us combine component failure rates to calculate the device failure rate. Refer again to our device that has a failure rate of 1/100 failures per year. It might consist of two components, each with a failure rate of 1/200 failures per year. Both components must operate properly for the device to be sound. The individual failure rates of the two components add up to the total failure rate of 1/100. We add the component failure rates to obtain the device failure rate if either component can cause the device to fail.

“On the other hand, our device with the 1/100 failure rate might consist of two redundant components each with a failure rate of 1/10 failures per year. Either component can give satisfactory performance to the device. The product of the individual component failure rates is the device failure rate. We multiply component failure rates to obtain the device failure rate if both components must fail to cause a device failure.”
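As a rough sketch of the two combination rules quoted above, the following Python snippet reproduces the arithmetic from the excerpt. The function names `or_gate` and `and_gate` are illustrative and do not come from the paper, and the multiplication rule follows the paper's simplification exactly as quoted.

```python
from fractions import Fraction
from math import prod


def or_gate(rates):
    """Any one component failing fails the device: add the failure rates."""
    return sum(rates, Fraction(0))


def and_gate(rates):
    """All redundant components must fail: multiply the failure rates
    (the simplification used in the quoted excerpt)."""
    return prod(rates, start=Fraction(1))


# Two components that must both work, each at 1/200 failures per year.
series_device = or_gate([Fraction(1, 200), Fraction(1, 200)])

# Two redundant components, each at 1/10 failures per year.
redundant_device = and_gate([Fraction(1, 10), Fraction(1, 10)])

# Expected failures per year across 300 devices with an MTBF of 100 years.
expected_failures = 300 * Fraction(1, 100)

print(series_device, redundant_device, expected_failures)
```

Exact fractions are used here only so the results match the figures in the excerpt without floating-point rounding.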

This failure and repair cycle is expressed graphically in Figure 1 [2]. In this state diagram, a device is commissioned at Time t₀, and then after some time experiences a failure at Time t₁. The failure is repaired at Time t₂, and the device continues to operate until another failure occurs at Time t₃. In reliability engineering, the two key metrics are the time to repair (t₂ − t₁) and the time between failures (t₃ − t₂). The fraction of time the system is not available is approximately q = (t₂ − t₁) / (t₃ − t₂), which holds when the repair time is short compared with the time between failures. Over a large sample set, the metrics can be averaged to derive a mean time to repair (MTTR) and MTBF.

Figure 1
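To make the averaging step concrete, here is a minimal sketch that derives MTTR, MTBF, and q from a short event history, using the interval definitions and ratio given above. The event times are hypothetical values chosen for illustration, not data from the cited references.

```python
from statistics import mean

# Hypothetical (failure, repair) times in hours since commissioning at t0 = 0.
events = [(1000.0, 1010.0), (3010.0, 3030.0), (6030.0, 6045.0)]

# Time to repair for each failure: repair time minus failure time (t2 - t1).
repair_times = [repaired - failed for failed, repaired in events]

# Time between failures as defined in the text: next failure minus the
# previous repair (t3 - t2). The initial run from commissioning to the
# first failure is not counted in this sketch.
times_between_failures = [
    nxt[0] - cur[1] for cur, nxt in zip(events, events[1:])
]

mttr = mean(repair_times)
mtbf = mean(times_between_failures)
q = mttr / mtbf  # fraction of time unavailable, per the ratio in the text

print(f"MTTR = {mttr} h, MTBF = {mtbf} h, q = {q:.4f}")
```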

Security and Reliability

Building on the functional reliability discussion, we now consider cybersecurity in this same framework. Instead of component failure and repair, we will consider security failure (incident) and security repair (mitigation).

Figure 2 illustrates this revised state behavior for cybersecurity. A system is securely commissioned at Time t₀. Then a security incident happens somewhere in the system at Time t₁, resulting in the system becoming insecure. Security mitigations are applied, and the system is once again secure at Time t₂. The system remains secure until another security incident occurs at Time t₃. The mean time between security failures (MTBSF) is the population average of (t₃ − t₂). The mean time to security repair (MTTSR) is the population average of (t₂ − t₁).

Figure 2

In practice, the MTBSF and MTTSR metrics may provide a technology-agnostic method of characterizing device and system health. For example:

If a vendor constantly releases security updates, this will drive down the effective MTBSF. Tracking this metric may encourage a vendor to develop a more secure product from the start.
If a vendor is slow to mitigate a vulnerability, this will drive up the effective MTTSR. Tracking this metric may encourage a vendor to minimize mitigation time.
These metrics can also be extended to the system level and may provide insight into aggregate security.
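
The vendor comparison described above could be sketched as follows. Everything here is assumed for illustration: the vendor names, the incident dates, and the `security_metrics` helper are hypothetical, and a real incident log would need an agreed definition of when a system counts as insecure and when it counts as mitigated.

```python
from datetime import date
from statistics import mean

# Hypothetical incident logs: (date incident occurred, date mitigation applied).
vendor_logs = {
    "vendor_a": [
        (date(2022, 3, 1), date(2022, 3, 11)),
        (date(2023, 3, 6), date(2023, 3, 16)),
    ],
    "vendor_b": [
        (date(2022, 3, 1), date(2022, 5, 30)),
        (date(2022, 8, 28), date(2022, 11, 26)),
    ],
}


def security_metrics(log):
    """Return (MTBSF, MTTSR) in days for a chronologically ordered log.

    Needs at least two incidents; with fewer, there is no secure interval
    to average and statistics.mean raises StatisticsError.
    """
    # Time to security repair: mitigation date minus incident date (t2 - t1).
    repair_days = [(fixed - found).days for found, fixed in log]
    # Time between security failures: next incident minus last mitigation (t3 - t2).
    secure_days = [(nxt[0] - cur[1]).days for cur, nxt in zip(log, log[1:])]
    return mean(secure_days), mean(repair_days)


for vendor, log in vendor_logs.items():
    mtbsf, mttsr = security_metrics(log)
    print(f"{vendor}: MTBSF = {mtbsf} days, MTTSR = {mttsr} days")
```

In this made-up data, the second vendor has both a shorter MTBSF and a longer MTTSR, which is the pattern the two examples above describe.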

These metrics may drive desirable behavior. As the author Peter Drucker once wrote, “What’s measured improves.”

Contributor

Cody Tews

Principal Engineer, Government Services

cody_tews@selinc.com

[1] E. O. Schweitzer, III, B. Fleming, T. Lee, and P. Anderson, “Reliability Analysis of Transmission Protection Using Fault Tree Methods,” proceedings of the 24th Annual Western Protective Relay Conference, Spokane, WA, October 1997.

[2] M. Rausand and A. Hoyland, System Reliability Theory: Models, Statistical Methods, and Applications, 2nd Ed., Wiley-Interscience, Hoboken, NJ, 2003, p. 368.
