Delivering reliability is a core tenet of the electric power industry. For more than 40 years, SEL has developed technology and products that further this goal. The opening section of the technical paper “Reliability Analysis of Transmission Protection Using Fault Tree Methods” [1] provides an introduction to reliability:
“Since reliability is the reciprocal of failure, and failure is a random event, probabilistic measures are most appropriate, and we apply the laws of probability theory.
“For example, suppose the reliability of a device is expressed with a mean-time-between-failures (MTBF) of 100 years. The failure rate is 1/100 failures per year. And, if a system has 300 of these devices, then we would expect 300 ∙ (1/100) = 3 device failures per year.
“We use the method of combining component failure rates called ‘fault tree analysis,’ a concept first proposed by H. A. Watson of Bell Telephone Laboratories to analyze the Minuteman Launch Control System....
“If a device consists of several components, then a fault tree helps us combine component failure rates to calculate the device failure rate. Refer again to our device that has a failure rate of 1/100 failures per year. It might consist of two components, each with a failure rate of 1/200 failures per year. Both components must operate properly for the device to be sound. The individual failure rates of the two components add up to the total failure rate of 1/100. We add the component failure rates to obtain the device failure rate if either component can cause the device to fail.
“On the other hand, our device with the 1/100 failure rate might consist of two redundant components each with a failure rate of 1/10 failures per year. Either component can give satisfactory performance to the device. The product of the individual component failure rates is the device failure rate. We multiply component failure rates to obtain the device failure rate if both components must fail to cause a device failure.”
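The arithmetic in the quoted passage can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function names `expected_failures`, `or_gate`, and `and_gate` are hypothetical, and the AND-gate product assumes, as the passage does, that the per-year figures can be treated as small failure probabilities.

```python
from fractions import Fraction


def expected_failures(device_count, mtbf_years):
    """Expected failures per year for a population of identical devices."""
    return device_count * Fraction(1, mtbf_years)


def or_gate(*rates):
    """Either component failing fails the device: add the failure rates."""
    return sum(rates, Fraction(0))


def and_gate(*rates):
    """All redundant components must fail: multiply the values.

    Assumption: each value is treated as a small per-year failure probability.
    """
    result = Fraction(1)
    for rate in rates:
        result *= rate
    return result


fleet = expected_failures(300, 100)                   # 300 devices, MTBF of 100 years
series = or_gate(Fraction(1, 200), Fraction(1, 200))  # both components required
redundant = and_gate(Fraction(1, 10), Fraction(1, 10))  # either component suffices

print(fleet, series, redundant)
```

Exact fractions are used so the results match the 3 failures per year and the 1/100 device failure rates given in the passage without rounding.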
A device's failure and repair cycle is expressed graphically in Figure 1 [2]. In this state diagram, a device is commissioned at Time t0, and then after some time experiences a failure at Time t1. The failure is repaired at Time t2, and the device continues to operate until another failure occurs at Time t3. In reliability engineering, the two key metrics are the time to repair (t2 – t1) and time between failures (t3 – t2). Over one failure-and-repair cycle, the fraction of time the system is not available is q = (t2 – t1) / (t3 – t1), which is approximately (t2 – t1) / (t3 – t2) when repairs are short relative to the time between failures. Over a large sample set, the metrics can be averaged to derive a mean time to repair (MTTR) and MTBF.
Figure 1
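As a rough sketch of how these averages might be computed, the Python below takes a list of observed cycles, each recorded as (t1, t2, t3) in hours. The function name `repair_metrics` and the sample values are hypothetical. It reports both the simple ratio MTTR / MTBF and the exact fraction of each cycle spent under repair, MTTR / (MTTR + MTBF); the two are close whenever repairs are short compared with the time between failures.

```python
from statistics import mean


def repair_metrics(cycles):
    """Return (MTTR, MTBF, ratio, unavailability) from (t1, t2, t3) tuples.

    t1 = failure time, t2 = repair complete, t3 = next failure (same units).
    MTBF follows the article's definition: the average of (t3 - t2).
    """
    mttr = mean(t2 - t1 for t1, t2, _ in cycles)
    mtbf = mean(t3 - t2 for _, t2, t3 in cycles)
    ratio = mttr / mtbf                    # simple ratio of repair time to uptime
    unavailability = mttr / (mttr + mtbf)  # fraction of the whole cycle spent down
    return mttr, mtbf, ratio, unavailability


# Hypothetical observations, in hours since commissioning.
observed = [
    (1000.0, 1010.0, 3010.0),
    (3010.0, 3040.0, 5040.0),
]

mttr, mtbf, ratio, unavailability = repair_metrics(observed)
print(mttr, mtbf, ratio, unavailability)
```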
Security and Reliability
Building on the functional reliability discussion, we now consider cybersecurity in this same framework. Instead of component failure and repair, we will consider security failure (incident) and security repair (mitigation).
Figure 2 illustrates this revised state behavior for cybersecurity. A system is securely commissioned at Time t0. Then a security incident happens somewhere in the system at Time t1, resulting in the system becoming insecure. Security mitigations are applied, and the system is once again secure at Time t2. The system is secure until another security incident occurs at Time t3. The mean time between security failures (MTBSF) is the population average of (t3 – t2). The mean time to security repair (MTTSR) is the population average of (t2 – t1).
Figure 2
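One way these two averages could be derived from an incident log is sketched below in Python. The log format, the function name `security_metrics`, and the dates are all hypothetical. The sketch measures each secure interval from a mitigation to the next incident and each repair interval from an incident to its mitigation, and it assumes the log is in chronological order with at least two incidents.

```python
from datetime import datetime, timedelta
from statistics import mean

ONE_DAY = timedelta(days=1)


def security_metrics(incidents):
    """Return (MTBSF, MTTSR) in days from (incident_time, mitigated_time) pairs."""
    # Repair interval: from each incident to its mitigation.
    repair_days = [(fixed - found) / ONE_DAY for found, fixed in incidents]
    # Secure interval: from each mitigation to the next incident.
    secure_days = [
        (next_found - fixed) / ONE_DAY
        for (_, fixed), (next_found, _) in zip(incidents, incidents[1:])
    ]
    return mean(secure_days), mean(repair_days)


# Hypothetical incident log for one system.
log = [
    (datetime(2023, 1, 10), datetime(2023, 1, 20)),
    (datetime(2023, 7, 19), datetime(2023, 7, 24)),
    (datetime(2024, 1, 20), datetime(2024, 2, 4)),
]

mtbsf_days, mttsr_days = security_metrics(log)
print(f"MTBSF: {mtbsf_days:.1f} days, MTTSR: {mttsr_days:.1f} days")
```

The interval from commissioning to the first incident is left out here to match the definitions in the text; including it would be a reasonable alternative.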
In practice, the MTBSF and MTTSR metrics may provide a technology-agnostic method of characterizing device and system health. For example:
- If a vendor constantly releases security updates, this will drive down the effective MTBSF. Tracking this metric may encourage a vendor to develop a more secure product from the start.
- If a vendor is slow to mitigate a vulnerability, this will drive up the effective MTTSR. Tracking this metric may encourage a vendor to minimize mitigation time.
- These metrics can also be extended to the system level and may provide insight into aggregate security.
These metrics may drive desirable behavior. As the saying often attributed to the author Peter Drucker goes, “What’s measured improves.”
[1] E. O. Schweitzer, III, B. Fleming, T. Lee, and P. Anderson, “Reliability Analysis of Transmission Protection Using Fault Tree Methods,” proceedings of the 24th Annual Western Protective Relay Conference, Spokane, WA, October 1997.
[2] M. Rausand and A. Hoyland, System Reliability Theory: Models, Statistical Methods, and Applications, 2nd Ed., Wiley-Interscience, Hoboken, NJ, 2003, p. 368.