Center for Reliable Computing
TECHNICAL REPORT
A Design Diversity Metric and Analysis of Redundant Systems
Subhasish Mitra, Nirmal R. Saxena and Edward J. McCluskey
Preliminary Version
99-4
(CSL TR # 799)
September, 1999
Center for Reliable Computing
Gates Building 2A, Room 236
Computer Systems Laboratory
Dept. of Electrical Engineering and Computer Science
Stanford University
Stanford, California 94305-9020
Abstract: Design diversity has long been used to protect redundant systems against common-mode failures. The conventional notion of diversity relies on “independent” generation of “different” implementations. This concept is qualitative and does not provide a basis to compare the reliabilities of two diverse systems. In this paper, for the first time, we present a metric to quantify diversity among several designs. Based on this metric, we derive analytical reliability models that show a simple relationship among design diversity, system failure rate, and mission time. We also perform availability analysis of redundant systems using our metric. In addition, we present simulation results to demonstrate the effectiveness of design diversity in duplex systems. For common-mode failures and design faults, there is a significant gain in using different implementations; however, as our analysis shows, the gain diminishes as the mission time increases. For independent multiple-module failures, we show that mere use of different implementations does not always guarantee higher reliability compared to redundant systems with identical implementations; it is important to analyze the reliability of redundant systems using our metric. Our simulation results also demonstrate the usefulness of diversity in enhancing the self-testing properties of redundant systems.
Funding: This work was supported by the Advanced Research Projects Agency under prime contract No. DABT63-97-C-0024.
Imprimatur: Philip Shirvani and Santiago Fernandez-Gomez
Copyright © 1999 by the Center for Reliable Computing, Stanford University.
All rights reserved, including the right to reproduce this report, or portions thereof, in any form.
PRELIMINARY VERSION
A Design Diversity Metric and Analysis of Redundant Systems
Subhasish Mitra, Nirmal R. Saxena and Edward J. McCluskey
CRC Technical Report No. 99-4
(CSL TR No. ??)
May 1999
Center for Reliable Computing
Computer Systems Laboratory
Departments of Electrical Engineering and Computer Science
Stanford University, Stanford, California 94305
Abstract
Design diversity has long been used to protect redundant systems against common-mode failures. The conventional notion of diversity relies on “independent” generation of “different” implementations. This concept is qualitative and does not provide a basis to compare the reliabilities of two diverse systems. In this paper, for the first time, we present a metric to quantify diversity among several designs. Based on this metric, we derive analytical reliability models that show a simple relationship among design diversity, system failure rate, and mission time. We also perform availability analysis of redundant systems using our metric. In addition, we present simulation results to demonstrate the effectiveness of design diversity in duplex systems. For common-mode failures and design faults, there is a significant gain in using different implementations; however, as our analysis shows, the gain diminishes as the mission time increases. For independent multiple-module failures, we show that mere use of different implementations does not always guarantee higher reliability compared to redundant systems with identical implementations; it is important to analyze the reliability of redundant systems using our metric. Our simulation results also demonstrate the usefulness of diversity in enhancing the self-testing properties of redundant systems.
TABLE OF CONTENTS
1. Introduction
2. Design Diversity Metric and Reliability Analysis
3. Example
4. A Simulation-Based Approach
5. Self-Testing Property
6. Diversity Advantages in Configurable Systems
7. Conclusions
8. Acknowledgments
9. References
10. Appendix 1: Reliability Analysis
LIST OF FIGURES
Figure 1.1. A Duplex redundant system
Figure 2.1. A discrete time model of the system
Figure 2.2. Faults affecting modules simultaneously
Figure 2.3. Fault-secure probability of duplex systems
Figure 2.4. Fault-secure probability of duplex systems (common-mode failures)
Figure 2.5. Effect of diversity with mission time
Figure 2.6. Markov chain for availability analysis
Figure 2.7. Comparison of availability of duplex systems
Figure 3.1. An example logic circuit
Figure 6.1. Reconfigurable computing test-bed
Figure 6.2. Results from experiments on a configurable computing test-bed
Figure 6.3. Results from experiments on a configurable computing test-bed
LIST OF TABLES
Table 2.1. Behavior of faulty multiple-output circuits
Table 4.1. Characteristics of simulated designs
Table 4.2. Simulation 1 results
Table 4.3. Simulation 2 results
Table 5.1. Self-testing properties of diverse and non-diverse duplex systems
1. INTRODUCTION
The use of redundancy techniques for designing systems with high data integrity and availability has been studied extensively [Siewiorek 92][Pradhan 96]. A duplex system in the form of a self-checking pair is an example of a classical redundancy scheme (Fig. 1.1). As long as only one module fails, the system either produces correct results or indicates an error.
[Figure 1.1. A Duplex Redundant System: Module 1 and Module 2 feed a comparator, which produces the Error signal.]
In a redundant system, common-mode failures (CMFs) result from failures that affect more than one module at the same time, generally due to a common cause. These include operational failures that may be due to external causes (such as EMI, power-supply disturbances and radiation) or internal causes. In addition to these common-mode failures, with the increasing complexity of designs, design mistakes are becoming very significant. As pointed out in [Avizienis 84], although the use of redundant copies of hardware has proven to be quite effective in the detection of physical faults and subsequent system recovery, design faults are reproduced when redundant copies are made. Simple replication fails to enhance the system reliability against design faults.
Design diversity has been proposed in the past to protect redundant systems against common-mode failures. In [Avizienis 84], design diversity was defined as the independent generation of two or more software or hardware elements (e.g., program modules, VLSI circuit masks, etc.) to satisfy a given requirement. Design diversity was also proposed in [Lala 94] as an avoidance technique against common-mode failures. Design diversity has been applied to both software and hardware systems. N-version programming [Avizienis 77, Lyu 91] is used to achieve diversity in software systems. Hardware design diversity has been used in the Primary Flight Computer (PFC) system of the Boeing 777 [Riter 95] and many other commercial systems [Briere 93]. For the Boeing 777, three different processors (from AMD, Intel and Motorola) are used in the PFC. Tohma proposed using implementations of logic functions in true and complemented forms during duplication [Tohma 71]. The use of a particular circuit and its dual was proposed in [Tamir 84] to achieve diversity in order to handle common-mode failures. The basic idea is that, with different implementations, common failure modes will probably cause different error effects.
Design diversity can prove to be useful in the context of dependable Adaptive Computing Systems (ACS). The field programmability of Field Programmable Gate Arrays (FPGAs) can be utilized to achieve diversity among the different modules. In an ACS environment, we can create diversity by synthesizing and downloading different implementations into FPGAs at any time. Thus, there is no need to manufacture multiple diverse ASICs. In order to quantify the effect of diversity on the reliability of a redundant system, a metric is needed to quantify diversity among designs with the same specification [Tamir 84].
In addition to common-mode failures, with the high density of logic gates in a VLSI chip, multiple failures may become more frequent. For example, current research shows that multiple-event upsets (possibly due to a single radiation source) are common in VLSI chips [Liu 97][Reed 97]. The classical reliability models of redundant systems are pessimistic because, in the presence of multiple module failures, they do not consider compensating effects of different faults [Siewiorek 75]. It is interesting to find out whether design diversity also helps in achieving better compensating effects of different faults, compared to simple replication.
In this paper, we address problems related to design diversity and examine their effects on the reliability of a redundant system. Some preliminary ideas related to this work were reported in [Saxena 98] and [Mitra 99]. Our main contributions are: (1) developing a metric to quantify diversity among several designs; and (2) using this metric to perform reliability and availability analysis of redundant systems. In Sec. 2, we introduce a design diversity metric and perform reliability and availability analysis of redundant systems using this metric. Section 3 presents some preliminaries related to the stuck-at fault model and illustrates our analysis with the help of an example. We present simulation results in Sec. 4. Section 5 examines the effect of design diversity on the self-testing properties of a duplex system. We present experimental results demonstrating the advantages of using design diversity in configurable systems in Sec. 6. Finally, we conclude in Sec. 7.
2. Design Diversity Metric and Reliability Analysis
2.1. D: A Design Diversity Metric
In this section, we introduce a metric to quantify diversity among several designs. We define the metric for a system with two designs implementing the same function. The metric has application in estimating the reliability of NMR systems with masking redundancy. Before defining the diversity metric, we first define the notion of diversity between two implementations with respect to a fault pair.
For two designs implementing the same function, the diversity with respect to a fault pair $(f_i, f_j)$, $d_{i,j}$, is the probability that the designs do not produce identical error patterns, in response to a given input sequence, when $f_i$ and $f_j$ affect the first and the second implementations, respectively.
For a given fault model, the design diversity metric, $D$, between two designs is the expected value of the diversity with respect to different fault pairs. Mathematically, we have
$$D = \sum_{(f_i, f_j)} P(f_i, f_j)\, d_{i,j},$$
where $P(f_i, f_j)$ is the probability of the fault pair $(f_i, f_j)$.
$D$ is the probability that, in response to a given input sequence, the two implementations either produce error-free outputs or produce different error patterns on their outputs.
Example: Consider any combinational logic function with $n$ inputs and a single output. The fault model considered is such that a combinational circuit remains combinational in the presence of the fault. Let us consider two implementations ($N_1$ and $N_2$) of the given combinational logic function.
The joint detectability, $k_{i,j}$, of a fault pair $(f_i, f_j)$ is the number of input patterns that detect both $f_i$ and $f_j$. This definition follows from the idea of detectability developed in [McCluskey 88].
If we assume that all the input patterns are equally likely, then we can write
$$d_{i,j} = 1 - \frac{k_{i,j}}{2^n}.$$
The $d_{i,j}$'s generate a diversity profile for the two implementations with respect to a fault model. Consider a duplex system consisting of the two implementations under consideration. In response to any input combination, the implementations can produce one of the following cases at their outputs. (1) Both of them produce correct outputs. (2) One of them produces correct output and the other produces incorrect output. (3) Both of them produce the same incorrect value.
For the first case, the duplex system will produce correct outputs. For the second case, the system will report a mismatch so that appropriate recovery actions can be taken. However, for the third case, the system will produce an incorrect output without reporting a mismatch; thus, for the third case, the integrity of the system is lost due to the presence of faults in the two implementations. In the literature on fault-tolerance [Siewiorek 92][Pradhan 96], this system integrity has been referred to as the fault-secure property.
The quantity $d_{i,j}$ is the probability that a duplex system, having two implementations of the logic function under consideration, is fault-secure when faults $f_i$ and $f_j$ affect the first ($N_1$) and the second ($N_2$) implementations, respectively.
If we assume that all fault pairs are equally probable and there are $m$ fault pairs $(f_i, f_j)$, then the $D$ metric for the two implementations is:
$$D = \frac{1}{m} \sum_{i,j} d_{i,j}.$$
We extend the above example to consider multiple-output combinational logic circuits. For a fault pair $(f_i, f_j)$ affecting the two implementations, we define $k_{i,j}$ as the number of input patterns in response to each of which both implementations produce the same erroneous output pattern. We can use the same formulas as in the single-output case.
Table 2.1. Behavior of faulty multiple-output circuits
Inputs   Fault-free outputs   Faulty outputs (Implementation 1)   Faulty outputs (Implementation 2)
00       0 1                  0 0                                 1 0
01       1 0                  1 0                                 1 0
10       0 0                  1 0                                 1 0
11       1 1                  1 0                                 1 0
For example, consider a combinational logic function with two inputs and two outputs (Table 2.1). Suppose that faults $f_i$ and $f_j$ affect the first and the second implementations, respectively. The responses of the two implementations in the presence of the faults are shown in Table 2.1. The faulty outputs are the entries in the third and fourth columns of Table 2.1 that differ from the fault-free outputs in the second column. For the calculation of $k_{i,j}$, we have to consider only the input patterns 10 and 11, since these are the only patterns for which both implementations produce the same erroneous output.
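The calculation for Table 2.1 can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary encodes the table rows as read here, the function names are hypothetical, and all four input patterns are assumed to be equally likely.

```python
# Rows of Table 2.1: input -> (fault-free, implementation 1, implementation 2).
TABLE_2_1 = {
    "00": ("01", "00", "10"),
    "01": ("10", "10", "10"),
    "10": ("00", "10", "10"),
    "11": ("11", "10", "10"),
}


def joint_detectability(table):
    """Count inputs for which both implementations give the same wrong output."""
    return sum(
        1
        for good, out1, out2 in table.values()
        if out1 == out2 and out1 != good
    )


def pair_diversity(table):
    """d = 1 - k / (number of input patterns), assuming equally likely inputs."""
    return 1 - joint_detectability(table) / len(table)


if __name__ == "__main__":
    print(joint_detectability(TABLE_2_1), pair_diversity(TABLE_2_1))
```

For this table the sketch counts the two patterns 10 and 11, so the diversity of the fault pair is 1 - 2/4.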
The above illustration of the design diversity metric can also be extended to sequential circuits and software programs. For small or medium-sized systems, the exact value of the diversity metric can be calculated manually or using computer programs. For large systems, the value can be estimated by using simulation techniques.
For two identical implementations of the same function, a common-mode failure (e.g., a design mistake) can be modeled as the same fault $f_i$ affecting the two implementations. Let $m$ be the number of input sequences for which these two implementations produce identical error patterns at the outputs. If the second implementation is different from the first, for any fault $f_j$ affecting the second implementation (and $f_i$ affecting the first implementation), we cannot have more than $m$ input sequences that produce identical error patterns at the outputs of the two implementations. Hence, $d_{i,i} \le d_{i,j}$. This property is useful for enhancing the reliability of a redundant system against common-mode failures by using diversity.
2.2. Reliability Analysis
In this section, we calculate the reliability of duplex systems using the diversity metric described in Sec. 2.1. We define the reliability of a duplex system as the probability that the system is fault-secure. The reliability calculation is independent of whether the redundant components are exact replicas or different implementations. We assume a discrete time model for the system. In such a model, the time axis is broken up into discrete time cycles and we apply inputs and observe outputs only at cycle boundaries.
As shown in Fig. 2.1, input combination (vector) $v_i$ is applied at the beginning of the $i$th cycle. Also, in Fig. 2.1, the first system becomes faulty ($f_1$) during cycle $i$ and the second system becomes faulty ($f_2$) during cycle $j$. Let $p$ be the probability that a particular module is affected by a fault at any cycle. For simplicity, we assume that this probability $p$ is the same for all the modules in the system at all times. The probability $p$ can be looked upon as the failure rate per cycle.
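[Figure 2.1. A discrete time model of the system. Time axis from 0 to $t$; input vectors $v_1$, $v_i$, $v_j$ are applied at cycle boundaries; fault $f_1$ appears during cycle $i$ and fault $f_2$ during cycle $j$.]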
For a given fault pair $(f_1, f_2)$, there are two possible cases. In the first case, both the faults appear in the same cycle. The situation is shown in Fig. 2.2.
[Figure 2.2. Faults affect the modules simultaneously. Time axis from 0 to $t$; faults $f_1$ and $f_2$ both appear at cycle $i$, and each cycle from $i$ onward is labeled $d_{1,2}$.]
In Fig. 2.2, faults $f_1$ and $f_2$ affect modules 1 and 2 simultaneously at cycle $i$. It may be argued that if a random fault appears in a particular module, then chances are high that the second fault will also appear in that same module. However, we do not assume any such correlation in this paper. At time 0, everything is fault-free. So, before cycle $i$ the system will produce correct results. However, starting from time $i$, in each cycle, the system will produce correct results with the probability equal to $d_{1,2}$. The probability $s_1(f_1, f_2, t)$ that the system is fault-secure up to time $t$, even in the presence of the two faults $f_1$ and $f_2$, is given by:
$$s_1(f_1, f_2, t) = \frac{p^2\, d_{1,2} \left[ d_{1,2}^{\,t} - (1-p)^{2t} \right]}{d_{1,2} - (1-p)^2}$$
The derivation of the above expression is shown in the appendix. Next, we consider the case where $f_1$ and $f_2$ appear at different cycles.
As discussed earlier, in Fig. 2.1, Module 1 becomes faulty during cycle $i$ and Module 2 becomes faulty during cycle $j$. It is clear that up to time $j$, a duplex system will be fault-secure. Hence, starting from time $j$, the system will be fault-secure with probability $d_{1,2}$. Thus, the probability $s_2(f_1, f_2, t)$ that the system is fault-secure up to time $t$, in the presence of the two faults $f_1$ and $f_2$, is given by the following equation.
$$s_2(f_1, f_2, t) = \frac{2(1-p)\, p^2\, d_{1,2}^{\,2} \left[ d_{1,2}^{\,t-1} - (1-p)^{2t-2} \right]}{(d_{1,2} - 1 + p)\left[ d_{1,2} - (1-p)^2 \right]} - \frac{2(1-p)^t\, p\, d_{1,2} \left[ 1 - (1-p)^{t-1} \right]}{d_{1,2} - 1 + p}$$
The derivation of the second case is also shown in the appendix. This case is more complicated than the first case and is useful when we consider random independent faults in multiple modules. We have:
$$s(f_1, f_2, t) = s_1(f_1, f_2, t) + s_2(f_1, f_2, t)$$
Here $s(f_1, f_2, t)$ is the probability that a duplex system is fault-secure up to time $t$, when Module 1 is affected by fault $f_1$ and Module 2 by fault $f_2$.
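The closed forms for $s_1$ and $s_2$ can be cross-checked numerically. The Python sketch below is not taken from the appendix. It assumes that $s_1$ sums over the cycle in which both faults appear together, that $s_2$ sums over both orderings of two distinct arrival cycles, and that every cycle from the later fault onward is fault-secure with probability $d_{1,2}$. The function names and parameter values are illustrative.

```python
import math


def s1_closed(p, d, t):
    """Closed form for faults appearing in the same cycle (needs d != (1-p)^2)."""
    q2 = (1.0 - p) ** 2
    return p**2 * d * (d**t - q2**t) / (d - q2)


def s2_closed(p, d, t):
    """Closed form for faults appearing in different cycles (needs d != 1-p)."""
    q = 1.0 - p
    first = (2 * q * p**2 * d**2 * (d ** (t - 1) - q ** (2 * t - 2))
             / ((d - q) * (d - q**2)))
    second = 2 * q**t * p * d * (1 - q ** (t - 1)) / (d - q)
    return first - second


def s1_sum(p, d, t):
    """Brute force: both faults arrive in cycle i, then cycles i..t are secure."""
    q = 1.0 - p
    return sum(q ** (2 * (i - 1)) * p**2 * d ** (t - i + 1)
               for i in range(1, t + 1))


def s2_sum(p, d, t):
    """Brute force: faults arrive in cycles i < j, counted for both orderings."""
    q = 1.0 - p
    total = 0.0
    for j in range(2, t + 1):      # cycle of the later fault
        for i in range(1, j):      # cycle of the earlier fault
            total += 2 * (q ** (i - 1) * p) * (q ** (j - 1) * p) * d ** (t - j + 1)
    return total


if __name__ == "__main__":
    p, d, t = 0.05, 0.8, 12
    print(math.isclose(s1_closed(p, d, t), s1_sum(p, d, t), rel_tol=1e-9))
    print(math.isclose(s2_closed(p, d, t), s2_sum(p, d, t), rel_tol=1e-9))
```

As a further sanity check under the same assumptions, setting $d_{1,2} = 1$ makes $s_1 + s_2$ equal to $[1-(1-p)^t]^2$, the probability that both modules have become faulty by time $t$.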
We can characterize a duplex system using our diversity metric. In the following calculations, we assume that once a module becomes faulty, no other fault appears in that module. This assumption is simplistic and allows us to obtain closed-form reliability expressions. We calculate the probability that, up to time $t$, a duplex system is fault-secure. It is given by the following expression:
$$(1-p)^{2t} + 2(1-p)^t \left[ 1 - (1-p)^t \right] + \sum_{f_1, f_2} P(f_1, f_2)\, s(f_1, f_2, t)$$
The above expression follows from the fact that, in a duplex system, when none of the modules fails, the system produces correct outputs. When only one of the modules fails (due to single or multiple faults), the system is fault-secure. When both modules are faulty, then we have to consider the $d_{1,2}$ value for the fault pair $(f_1, f_2)$ in the two modules. $P(f_1, f_2)$ is the probability that faults $f_1$ and $f_2$ appear in modules 1 and 2, respectively.
[Figure 2.3. Fault-secure probability of a duplex system with multiple independent failures. X-axis: mission time (in units of simplex MTTF); Y-axis: probability that the system is fault-secure; curves: classical, $d_{1,2} = 1-10^{-11}$ and $d_{1,2} = 1-10^{-10}$.]
In Fig. 2.3, for a given pair of faults $(f_1, f_2)$, we show the plots of the above expression for different values of $d_{1,2}$. The mission time is shown along the X-axis; the MTTF (Mean Time To Failure in cycles) of a simplex system corresponds to 1 time unit. The probability that a fault appears in one cycle is $10^{-12}$. Along the Y-axis, we show the probability that the duplex system is fault-secure. The classical analysis of duplex systems is pessimistic since it assumes that the system ceases to be fault-secure when two modules are faulty.
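The effect of $d_{1,2}$ on the fault-secure probability can be illustrated with a small state-probability iteration. The following Python sketch is an illustration rather than the method used to produce Fig. 2.3. It assumes a single fault pair, treats the classical analysis as the special case $d_{1,2} = 0$, and uses parameter values far larger than those in the figure so that the numbers are easy to inspect. The function name is hypothetical.

```python
def duplex_fault_secure(p, d, t):
    """Probability that a duplex system is fault-secure up to cycle t.

    p: per-cycle fault probability of a module, d: diversity of the fault pair.
    A faulty module is assumed to collect no further faults.
    """
    q = 1.0 - p
    none_faulty = 1.0   # neither module is faulty
    one_faulty = 0.0    # exactly one module is faulty (always fault-secure)
    both_faulty = 0.0   # both faulty, no identical error pattern so far
    for _ in range(t):
        # Each cycle spent with both modules faulty is survived with probability d.
        both_faulty = d * (both_faulty + p * one_faulty + p * p * none_faulty)
        one_faulty = q * one_faulty + 2 * p * q * none_faulty
        none_faulty = q * q * none_faulty
    return none_faulty + one_faulty + both_faulty


if __name__ == "__main__":
    for d in (0.0, 0.9, 0.99, 1.0):
        print(d, duplex_fault_secure(0.01, d, 50))
```

With $d_{1,2} = 0$ the iteration reduces to $(1-p)^{2t} + 2(1-p)^t[1-(1-p)^t]$, and the result grows toward 1 as $d_{1,2}$ approaches 1.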
The above expressions can be modified for common-mode failures (CMFs). The probability that a duplex system is fault-secure against common-mode failures up to time $t$ is given by the following expression:
$$(1-p)^t + \sum_{f_1, f_2} P(f_1, f_2)\, z(f_1, f_2, t)$$
Here, $p$ is the probability that a CMF affects the two modules. In the above expression, $z(f_1, f_2, t)$ is given by the following formula:
$$z(f_1, f_2, t) = \frac{p\, d_{1,2} \left[ d_{1,2}^{\,t} - (1-p)^t \right]}{d_{1,2} - (1-p)}$$
The above expression is maximized when $d_{1,2}$ is of the order of $(1-p)$. This suggests that, for a common-mode failure that can be modeled as fault pair $(f_1, f_2)$, we can obtain appreciable reliability improvement over classical systems when the value of $d_{1,2}$ is of the order of $(1-p)$. The following observations can be derived from this relationship.
1. When the failure rate is high, even a small diversity can help enhance the system reliability over traditional replication.
2. If the failure rate is low, then $d_{1,2}$ must be extremely high for appreciable reliability improvement over classical systems. As a limiting case, consider the situation when the CMF failure rate is 0. In that case, diversity will not buy us any extra reliability against CMFs.
[Figure 2.4. Fault-secure probability of a duplex system against common-mode failures. X-axis: mission time (in units of simplex MTTF); Y-axis: probability that the system is fault-secure; curves: classical, $d_{1,2} = 1-10^{-12}$ and $d_{1,2} = 1-10^{-11}$.]
In Fig. 2.4, for a given pair of faults $(f_1, f_2)$, we show the plots of the above fault-secure probability expression for the different values of $d_{1,2}$. The failure rate per cycle is $10^{-13}$. It is clear that we get appreciable improvement in reliability (over classical systems) when the value of $d_{1,2}$ is very high ($1-10^{-12}$ or more). When the value of $d_{1,2}$ is less than $1-10^{-12}$, we do not see high reliability improvement.
[Figure 2.5. Effect of diversity with mission time (for common-mode failures). X-axis: mission time (in units of simplex MTTF); Y-axis: gain.]
In Fig. 2.5, we show how the reliability improvement obtained from diversity depends on mission time. On the Y-axis of the graph in Fig. 2.5, we plot the ratio of the following two quantities.
1. The probability that a duplex system is not fault-secure at time $i$, for a fault pair $(f_1, f_2)$ with $d_{1,2} = 1-10^{-11}$.
2. The probability that a duplex system is not fault-secure at time $i$, for fault pair $(f_1, f_2)$ with $d_{1,2} = 1-10^{-12}$. The failure rate per cycle is $10^{-13}$.
We call this ratio the gain. On the X-axis, we plot the mission time. As Fig. 2.5 shows, the gain diminishes with longer mission times. This analysis allows us to derive relationships between the reliability of a redundant system, the diversity incorporated to protect the system against common-mode failures and the mission time. The relationship between diversity and mission time can also be used to determine checkpoint intervals in a redundant system. For example, referring to Fig. 2.5, we can checkpoint the state of the system when the gain is close to 1. Thus, our design diversity metric is a very fundamental property and can be used to understand different trade-offs associated with the design of dependable systems using redundancy.
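The way the gain shrinks with mission time can be reproduced with a short Python sketch. It is illustrative only: it assumes a single fault pair, uses the closed-form common-mode expression $(1-p)^t + z$, and substitutes much larger values of $p$ and $1 - d_{1,2}$ than those in Fig. 2.5 so that ordinary floating-point arithmetic is adequate. The function names are hypothetical.

```python
def cmf_fault_secure(p, d, t):
    """Fault-secure probability up to cycle t under common-mode failures.

    Uses (1-p)^t + p*d*(d^t - (1-p)^t) / (d - (1-p)); requires d != 1 - p.
    """
    q = 1.0 - p
    z = p * d * (d**t - q**t) / (d - q)
    return q**t + z


def gain(p, d_low, d_high, t):
    """Ratio of not-fault-secure probabilities: lower diversity over higher."""
    return (1.0 - cmf_fault_secure(p, d_low, t)) / (1.0 - cmf_fault_secure(p, d_high, t))


if __name__ == "__main__":
    for t in (1, 1000, 10**6):
        print(t, gain(1e-4, 0.99, 0.999, t))
```

With these illustrative values the ratio starts near $(1 - 0.99)/(1 - 0.999) = 10$ for a single cycle and falls toward 1 for very long missions, which is the qualitative behavior described for Fig. 2.5.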
Next, we estimate the error latency using our design diversity metric. Consider a duplex system with two implementations $N_1$ and $N_2$ of the same logic function. Let us suppose that the faults $f_1$ and $f_2$ affect the two implementations at cycle $c$. The error latency is defined to be the number of cycles from $c$ after which both the implementations produce the same error pattern at the output. For more discussions on error latency, the reader is referred to [Shedletsky 76]. The probability that the error latency is $t$ ($t > 0$) is given by $d_{1,2}^{\,t-1}(1-d_{1,2})$. Here, the assumption is that the $d_{1,2}$ value is strictly less than 1. If the $d_{1,2}$ value is equal to 1, then the error latency is always equal to $T$, the mission time. The expected error latency is given by the following formula:
$$\sum_{t,\ (f_1,f_2):\ d_{1,2} \neq 1} P(f_1,f_2)\, t\, d_{1,2}^{\,t-1} (1-d_{1,2}) \;+ \sum_{(f_1,f_2):\ d_{1,2} = 1} P(f_1,f_2)\, T$$
From this expression, it is clear that for long mission times (i.e., large values of $t$), the probability value approaches 0 when the $d_{1,2}$ value for the fault pair is less than 1. Thus, the fault pairs which have their $d_{i,j}$ values equal to 1 (i.e., the compensating fault pairs) play a dominant role in determining the error latency for long mission times. Hence, the value of the expected error latency is determined by the percentage of compensating fault pairs. Simplification of the above expression produces the following expression for the expected latency of a duplex system in terms of the diversity metrics with respect to the different fault pairs.
$$\text{Expected error latency} = \sum_{(f_1,f_2):\ d_{1,2} \neq 1} \frac{P(f_1,f_2)}{1-d_{1,2}} \;+ \sum_{(f_1,f_2):\ d_{1,2} = 1} P(f_1,f_2)\, T$$
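The simplification rests on the geometric-series identity $\sum_{t \ge 1} t\, d^{\,t-1}(1-d) = 1/(1-d)$ for $d < 1$. A minimal Python sketch, with hypothetical function names and made-up fault-pair probabilities, checks the identity numerically and evaluates the simplified expression:

```python
def truncated_latency_sum(d, terms):
    """Partial sum of t * d^(t-1) * (1-d) for t = 1..terms."""
    return sum(t * d ** (t - 1) * (1.0 - d) for t in range(1, terms + 1))


def expected_error_latency(fault_pairs, mission_time):
    """fault_pairs: iterable of (probability, d) tuples for the fault pairs.

    Pairs with d == 1 (compensating pairs) contribute the full mission time.
    """
    total = 0.0
    for probability, d in fault_pairs:
        if d == 1.0:
            total += probability * mission_time
        else:
            total += probability / (1.0 - d)
    return total


if __name__ == "__main__":
    print(truncated_latency_sum(0.9, 2000))          # close to 1 / (1 - 0.9)
    print(expected_error_latency([(0.5, 0.9), (0.5, 1.0)], 1000))
```

In the second call, half of the probability mass sits on a compensating fault pair, so the mission-time term dominates the result, in line with the observation above.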
Consider the case of design mistakes that are special cases of common-mode failures. For these cases, the fault is always present. Simple analysis reveals that the probability that a duplex system is fault-secure up to time $t$, in the presence of design mistakes, is:
$$\sum_{f_1, f_2} P(f_1, f_2)\, d_{1,2}^{\,t}$$
Thus, for design mistakes, for a given fault pair $(f_1, f_2)$, the higher the value of $d_{1,2}$, the higher the system reliability. This implies that, for design mistakes, diversity between the two implementations in a duplex system helps to increase the probability that the system is fault-secure.
While diversity in hardware designs is the main focus of this paper, the above ideas can be extended to analyze diversity in software modules. For estimating the diversity metric for software modules, we need to have a fault model for the software under consideration. Considering the range of values the input variables to the software module can possibly take, it may be difficult to compute the exact value of the metric. However, the value of the metric can be estimated using simulation techniques. Note that our observations about the relationships between diversity, mission time and failure rate still hold for software systems. One key feature of our analysis technique is that it is powerful but, at the same time, simpler than the models in [Eckhardt 85], [Tomek 93] and [Lyu 95].
2.3. Availability Analysis
In this section, we perform availability analysis of duplex systems with repair capabilities using our diversity metric. For the purpose of our analysis, we assume that $p$ is the probability that a (common-mode) failure affects the system during a particular cycle. The failure can manifest as faults $f_1$ and $f_2$ affecting Module 1 and Module 2, respectively. In our analysis, we use the following quantities, as described below.
The metric $d_{1,2}$ is the probability that the two modules do not produce the same error pattern (at their outputs) in response to a given input sequence, when they are affected by the faults $f_1$ and $f_2$.
We define another quantity, $t_{1,2}$, which is the probability that the two modules do not produce any error at their outputs in response to a given input sequence, when they are affected by the faults $f_1$ and $f_2$.
The quantity $d_{1,2} - t_{1,2}$ is the probability that the two modules will produce non-identical error patterns (at their outputs) in response to a given input sequence, when they are affected by the faults $f_1$ and $f_2$.
The Markov chain used for our analysis is shown in Fig. 2.6. In the Markov chain, the system starts in the Good state. As long as a fault does not appear, the system remains in the Good state. However, as soon as a fault appears, the system goes to the Faulty Correct state. The probability that both the modules produce correct outputs, in spite of the presence of the fault, is $t_{1,2}$. The probability that the modules produce identical errors at their outputs is $1 - d_{1,2}$. Thus, with probability $d_{1,2} - t_{1,2}$, the modules produce non-identical erroneous outputs, which means that the presence of the fault is detected. Once the fault is detected, the system enters the Repair state. We have assumed that the expected number of cycles required to repair the system is $1/m$. For modeling the repair operation, we could as well use a repair rate. However, in the context of reconfigurable systems, we can have bounds on repair time, which we can use during the above Markov analysis. The availability is given by the probability that the system is in the Good or the Faulty Correct state. In the following graph (Fig. 2.7), we show the dependence of availability on the values of $d_{1,2}$ and $t_{1,2}$. This analysis has implications for the usefulness of diversity for enhancing the self-testing property and, hence, the availability of duplex systems. The analysis can be extended to other redundant systems (e.g., NMR systems).
[Figure 2.6. Markov chain for availability analysis. States: Good, Faulty Correct, Repair and Fail. Transition labels: Good to Good, $1-p$; Good to Faulty Correct, $p$; Faulty Correct to Faulty Correct, $t_{1,2}$; Faulty Correct to Repair, $d_{1,2} - t_{1,2}$; Faulty Correct to Fail, $1 - d_{1,2}$; Repair to Repair, $1-m$; Repair to Good, $m$.]
In Fig. 2.7, we plot the availability values for two duplex systems. The probability ($p$) that a fault pair appears in a particular cycle is $10^{-8}$. The number of repair cycles ($1/m$) is 100. Both the systems have the value of $d_{1,2}$ equal to $1-10^{-5}$ for a fault pair. However, one of the systems (shown in Fig. 2.7) has the value of $t_{1,2}$ equal to $d_{1,2}$ and the other one has the $t_{1,2}$ value equal to half of $d_{1,2}$. As can be seen in Fig. 2.7, initially the system having $t_{1,2} = d_{1,2}$ has a higher availability (since the probability that it stays in the Faulty Correct state is high). However, as time increases, the availability of the system with $t_{1,2} = 0.5\,d_{1,2}$ decreases at a much smaller rate compared to the system with $t_{1,2} = d_{1,2}$. This is because the system with $d_{1,2}$ equal to $t_{1,2}$ never detects the fault and therefore has no repair capability, in contrast to the other system.
[Figure 2.7. Comparison of the availability of duplex systems. X-axis: mission time (in units of simplex MTTF), 0.002 to 0.01; Y-axis: availability, 0.996 to 0.999; curves: $t_{1,2} = 0.5\,d_{1,2}$ and $t_{1,2} = d_{1,2}$.]
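The availability comparison can be explored by iterating the state probabilities of the Markov chain cycle by cycle. The Python sketch below is an illustration, not the procedure used to generate Fig. 2.7. It assumes the transition probabilities described for Fig. 2.6, treats Fail as an absorbing state (an assumption), and uses much larger failure probabilities than the figure so that short runs already show the trend. The function name is hypothetical.

```python
def availability(p, d, t_correct, m, cycles):
    """Probability of being in the Good or Faulty Correct state after `cycles`.

    p: per-cycle fault probability, d: diversity d_{1,2},
    t_correct: probability t_{1,2} of error-free outputs despite the faults,
    m: per-cycle repair completion probability (expected repair time 1/m).
    """
    good, faulty_correct, repair, fail = 1.0, 0.0, 0.0, 0.0
    for _ in range(cycles):
        good, faulty_correct, repair, fail = (
            good * (1.0 - p) + repair * m,
            good * p + faulty_correct * t_correct,
            faulty_correct * (d - t_correct) + repair * (1.0 - m),
            fail + faulty_correct * (1.0 - d),   # Fail assumed absorbing
        )
    return good + faulty_correct


if __name__ == "__main__":
    for cycles in (50, 500, 5000):
        no_detection = availability(1e-3, 0.99, 0.99, 0.01, cycles)
        with_detection = availability(1e-3, 0.99, 0.495, 0.01, cycles)
        print(cycles, no_detection, with_detection)
```

With these illustrative values, the system with $t_{1,2} = d_{1,2}$ is more available over short runs because it never spends time in repair, while the system with $t_{1,2} = 0.5\,d_{1,2}$ is more available over long runs because detected faults are repaired before they can cause a failure.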
We validate our observations using simulation data in Sec. 4. For simulation purposes, we used the stuck-at fault model. In the next section, we introduce the preliminaries related to the stuck-at fault model and illustrate the calculation of our diversity metric using an example.
3. Example
Research in the area of digital testing and diagnosis of combinational and sequential logic circuits has demonstrated the effectiveness of the logical stuck-at fault model. In this model, the failures in a logic circuit behave as if some lines in the circuit assume constant logical values, either 1 or 0, independent of the logic values on other lines of the circuit.
For the rest of this paper, we assume that all failures manifest as stuck-at faults in the circuit. We also assume that the failures are permanent; i.e., if a stuck-at fault shows up at some time instant $t$, then the fault remains at all time instants greater than $t$. For circuits made from SRAM-based FPGAs, unless we re-initialize the SRAMs (reload a given configuration), a transient fault in the configuration SRAM persists. Thus, the assumption of the permanent fault behavior is reasonable.
For example, consider the network shown in Fig. 3.1. The function implemented by the network is $wx + y$. Consider a stuck-at-0 (s-a-0) fault on the line $y$, denoted by $y/0$. The function implemented by the network, in the presence of the fault, is $wx$. Thus, $w = 1$, $x = 0$ and $y = 1$, when applied to the input of the logic circuit, causes the faulty network to produce a 0 and the fault-free network to produce a 1. Therefore, the fault $y/0$ is detected by the pattern $w = 1$, $x = 0$, $y = 1$.
Figure 3.1. An example logic circuit (inputs w, x, y; AND gate with output z; OR gate with output p)
A fault is said to be functionally equivalent to another fault if and only if the output function realized by the network with only the first fault present is equal to the function realized when only the second fault is present. For example, in the network of Fig. 3.1, in the presence of the fault x/0, the function implemented is y. In the presence of the fault z/0, the function implemented by the network is also y. Hence, the faults x/0 and z/0 are functionally equivalent. The set of functionally equivalent faults forms an equivalence class. A fault f1 dominates fault f2 if and only if all input combinations that detect f2 also detect f1. In our example, the fault p/0 dominates fault z/0. Techniques for obtaining equivalence and dominance relationships among different fault pairs have been described in [McCluskey 71] and [To 73].
We illustrate the calculation of our design diversity metric with respect to single stuck-at faults in the circuit of Fig. 3.1. There are 10 single stuck-at faults associated with this network. The faults are: w/0, w/1, x/0, x/1, y/0, y/1, z/0, z/1, p/0 and p/1. The corresponding fault equivalence classes are: F1 = {w/0, x/0, z/0}, F2 = {y/1, z/1, p/1}, F3 = {w/1}, F4 = {x/1}, F5 = {y/0} and F6 = {p/0}. The set of vectors that detect the faults in F1 is V1 = {w = 1, x = 1, y = 0}. We write V1 = {110}. Similarly, V2 = {000, 010, 100}, V3 = {010}, V4 = {100}, V5 = {001, 011, 101} and V6 = {001, 011, 101, 111, 110}. Here, the number of inputs (n) is 3. Consider the fault pair (f1, f2) = (w/0, p/0). The set of vectors that detect w/0 is V1. The set of vectors that detect p/0 is V6, and V1∩V6 = {110}. The two faulty modules therefore produce the same erroneous output for only one of the 2^n = 8 input combinations, so the value of d1,2 is 1 - 1/8 = 7/8. In this way all the di,j's and the D metric can be calculated.
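The calculation above can be reproduced by exhaustive simulation. The following is a minimal Python sketch, assuming the network of Fig. 3.1 is an AND gate (z = w AND x) feeding an OR gate (p = z OR y) and that all input combinations are equally likely. The function names (simulate, detecting_vectors, equivalence_classes, d_metric) are illustrative and are not taken from the tools used for the experiments in this report.

```python
from itertools import product

LINES = ("w", "x", "y", "z", "p")          # z = w AND x, p = z OR y (network output)
VECTORS = list(product((0, 1), repeat=3))  # all (w, x, y) input combinations
FAULTS = [(line, value) for line in LINES for value in (0, 1)]


def simulate(w, x, y, fault=None):
    """Output p of the example network; fault is (line, stuck_value) or None."""
    def line_value(name, value):
        # A stuck-at fault forces its line to a constant, whatever drives it.
        return fault[1] if fault is not None and fault[0] == name else value

    w, x, y = line_value("w", w), line_value("x", x), line_value("y", y)
    z = line_value("z", w & x)
    return line_value("p", z | y)


def detecting_vectors(fault):
    """Input vectors on which the faulty network differs from the fault-free one."""
    return {v for v in VECTORS if simulate(*v, fault=fault) != simulate(*v)}


def equivalence_classes():
    """Group faults whose faulty networks realize the same output function."""
    classes = {}
    for fault in FAULTS:
        table = tuple(simulate(*v, fault=fault) for v in VECTORS)
        classes.setdefault(table, []).append(fault)
    return list(classes.values())


def d_metric(f1, f2):
    """1 minus the fraction of vectors where both faulty copies give the same wrong output."""
    same_error = sum(
        1 for v in VECTORS
        if simulate(*v, fault=f1) == simulate(*v, fault=f2)
        and simulate(*v, fault=f1) != simulate(*v)
    )
    return 1 - same_error / len(VECTORS)


if __name__ == "__main__":
    for cls in equivalence_classes():
        names = ", ".join(f"{line}/{value}" for line, value in cls)
        vectors = sorted("".join(map(str, v)) for v in detecting_vectors(cls[0]))
        print(f"{{{names}}} detected by {vectors}")
    print("d(w/0, p/0) =", d_metric(("w", 0), ("p", 0)))
```

Exhaustive enumeration is only practical for small n; for larger circuits the cost grows as 2^n per fault pair.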
4. A Simulation-Based Approach
As we noted earlier, it is difficult to model the entire complex system mathematically. Even with the stuck-at fault model, it is difficult to derive the exact reliability equation for the following reasons:
1. For a given pair of faults (f1, f2), the calculation of d1,2 is an NP-complete problem [Gary 79]. The problem is related to the NP-complete test generation problem.
2. If multiple stuck-at faults appear in the modules at different cycles, then the reliability expressions will become complicated. In fact, it may not be possible to obtain a closed form.
Hence, we developed a simulation environment to examine the reliability of a redundant system in the presence of multiple faulty modules.
Table 4.1. Characteristics of simulated designs
Circuit   # Inputs   # Outputs   # SSF (T)   # SSF (C)
Z5xp1     7          10          550         610
apex4     9          19          9636        8578
clip      9          5           698         6
inc       7          9           486         506
rd84      8          4           568         398
For generating different designs, we minimized the truth tables corresponding to some MCNC benchmark circuits (clip, inc, Z5xp1, apex4 and rd84) using espresso. Then, we synthesized logic circuits after applying multi-level optimizations using the rugged script available in sis [Sentovich 92]. We subsequently mapped the multi-level logic circuits to the LSI Logic G-10p technology library [LSI 96]. Next, we complemented the outputs in the truth tables of the benchmark circuits to generate new truth tables. We used the same synthesis procedure for these new truth tables. Finally, we added inverters at the outputs of the new designs obtained. Table 4.1 summarizes the characteristics of the different simulated designs.
In the fourth column of Table 4.1, we report the number of candidate single stuck-at faults (SSF) for the implementations obtained by synthesizing the given specifications (labeled T). The fifth column shows the number of candidate single stuck-at faults for the implementations obtained by synthesizing the given specifications with complemented outputs (labeled C).
Simulation 1
For Simulation 1, for each benchmark circuit, we built duplex systems with identical and different implementations. For each of these systems, we performed 100,000 experiments. In each experiment, we randomly picked a pair of single stuck-at faults (f1, f2) such that the fault f1 affects Module 1 and f2 affects Module 2. We injected these faults into the modules, applied input patterns from a counter (with random seed) and calculated the error latency (the number of cycles after which the system ceases to be fault-secure). The expected error latency for the injected fault pairs is shown in Table 4.2. We also calculated the percentage of fault pairs for which the two modules never produced the same erroneous outputs at the same time (compensating fault pairs). These are the fault pairs (f1, f2) that have d1,2 equal to 1.
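The experiment loop described above can be sketched on the small network of Fig. 3.1 instead of the benchmark circuits. This is a toy Python illustration under several assumptions: both modules are identical copies of the Fig. 3.1 network, the faulty output functions in FAULTY are written out by hand from the fault classes of Sec. 3, cycles are counted from 1, and compensating pairs are assigned the 10,000-cycle cap. The names FAULTY, good and error_latency are hypothetical.

```python
import random

CAP = 10_000  # latency assigned when the two copies never agree on a wrong output

# Output of the Fig. 3.1 network (fault-free function: w*x + y) under each
# single stuck-at fault, written out by hand from the fault classes in Sec. 3.
FAULTY = {
    "w/0": lambda w, x, y: y,      "x/0": lambda w, x, y: y,
    "z/0": lambda w, x, y: y,      "w/1": lambda w, x, y: x | y,
    "x/1": lambda w, x, y: w | y,  "y/0": lambda w, x, y: w & x,
    "y/1": lambda w, x, y: 1,      "z/1": lambda w, x, y: 1,
    "p/0": lambda w, x, y: 0,      "p/1": lambda w, x, y: 1,
}


def good(w, x, y):
    """Fault-free output of the network."""
    return (w & x) | y


def error_latency(f1, f2, seed, n_inputs=3, cap=CAP):
    """First cycle at which both faulty copies produce the same wrong output.

    Inputs come from an n-bit counter started at `seed`; returns `cap` if the
    event never occurs within `cap` cycles (a compensating fault pair).
    """
    for cycle in range(1, cap + 1):
        count = (seed + cycle - 1) % (1 << n_inputs)
        v = tuple((count >> k) & 1 for k in reversed(range(n_inputs)))
        out1, out2 = FAULTY[f1](*v), FAULTY[f2](*v)
        if out1 == out2 and out1 != good(*v):
            return cycle
    return cap


if __name__ == "__main__":
    rng = random.Random(1)
    names = list(FAULTY)
    latencies = [
        error_latency(rng.choice(names), rng.choice(names), rng.randrange(8))
        for _ in range(1000)
    ]
    print("expected error latency:", sum(latencies) / len(latencies))
    print("% compensating pairs:", 100 * latencies.count(CAP) / len(latencies))
```

Because the counter visits every input pattern within 2^n cycles, any pair that reaches the cap in this toy setting has d1,2 = 1.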
Table 4.2. Simulation 1 results
Circuit Name   Copies   Error Latency (cycles)   % compensating fault pairs (di,j = 1)
Z5xp1          T, T     6733                     66.96
Z5xp1          T, C     6869                     68.76
apex4          T, T     8594                     85.71
apex4          T, C     8094                     80.51
clip           T, T     7951                     79.24
clip           T, C     7869                     78.44
inc            T, T     7666                     76.54
inc            T, C     7516                     75.08
inc            C, C     7512                     74.90
rd84           T, T     7638                     76.23
rd84           T, C     6797                     67.73
rd84           C, C     7051                     70.40
As shown in Table 4.2, a duplex system consisting of different implementations of the Z5xp1 circuit has a higher percentage of compensating fault pairs compared to the non-diverse version. However, that is not generally true. For example, for the clip benchmark, the non-diverse duplex system has a higher percentage of compensating fault pairs. For compensating fault pairs, the error latency is strictly infinite; we assumed the value to be 10,000 cycles for our experiments. This is because the number of inputs of the benchmark circuits under consideration lies between 7 and 9. Thus, the total number of input patterns is between 128 and 512, and a counter applies every input pattern well within 10,000 cycles. Note that the expected error latency is dependent on the number of compensating fault pairs. This dependence of error latency on the number of compensating fault pairs has been explained earlier in Sec. 2.1.
In [Sakov 87], for a given combinational logic function, the fault detectability profiles for different implementations have been reported. Further studies are needed to synthesize circuit structures with high values of di,j for different fault pairs. It has been proved in [To 73] that, for fanout-free combinational logic networks, all internal single stuck-at faults are either equivalent to or dominate single stuck-at faults on the primary inputs of the network. Thus, if we want to implement two diverse fanout-free networks implementing the same function, the di,j values of the different fault pairs will be strongly dependent on the input combinations detecting the single stuck-at faults on network inputs and outputs. For both the networks, the set of patterns that detect the input or output stuck-at faults is independent of the network structure and is directly determined by the function the networks are implementing. Thus, for fanout-free networks and stuck-at faults, chances are low that the diversity metric will achieve appreciably higher values for networks synthesized in different ways than for simple replication. Hence, it appears to be important to focus on achieving diverse fanout structures of different networks to obtain high values of the diversity metric for fault pairs.
Simulation 2
Our previous simulation results mainly focused on independent faults in multiple modules of a duplex system. However, it has been observed in the literature [Avizienis 84], [Lala 94] that design diversity is useful for handling correlated failures and common-mode failures. Since we did not find any data on common-mode failure mechanisms, we performed the following sets of experiments to estimate the effect of diversity in the presence of common-mode failures.
Table 4.3. Simulation 2 results
Circuit Name   Duplex #   Copies   Worst Error Latency (cycles)
Z5xp1          1          T, T     10
Z5xp1          2          T, C     1711
Z5xp1          3          C, C     14
clip           5          T, T     35
clip           6          T, C     372
clip           7          C, C     48
inc            8          T, T     16
inc            9          T, C     15
inc            10         C, C     17
rd84           11         T, T     35
rd84           12         T, C     301
rd84           13         C, C     21
In a duplicated system with identical implementations, we can find a one-to-one correspondence between the leads of the two copies. Hence, for these duplicated systems, we injected fault pairs (f1, f2) such that f1 and f2 affect lead i of Module 1 and Module 2, respectively. Note that, in the presence of f1 and f2, the two modules behave in exactly the same way. Hence, they can be called common-mode faults. With this setup, we found the error latency for these common-mode faults. For duplex systems with different implementations, we cannot establish such a one-to-one correspondence between the leads of the two copies. Hence, for each fault f1 in Module 1, we found the fault f2 in Module 2 with the minimum value of d1,2 using exhaustive simulation. Thus, for f1 affecting Module 1, we have the least error latency when f2 affects Module 2. Hence, the fault pair (f1, f2) is called the worst-case fault pair with the worst-case latency. Then we averaged the worst-case latencies over all the worst-case fault pairs; this number is reported in the fourth column of Table 4.3.
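The worst-case pairing step for diverse implementations can be sketched as follows. This Python example is an assumption-laden illustration, not the simulation environment used for Table 4.3: IMPL_A is the network of Fig. 3.1, IMPL_B is a hypothetical second implementation of the same function built by realizing the complemented function and inverting its output (mirroring the procedure of Sec. 4), and all helper names are invented for the example.

```python
from itertools import product

INPUTS = ("w", "x", "y")
VECTORS = list(product((0, 1), repeat=len(INPUTS)))

# Netlists as ordered (line, gate, input lines); the last line is the output.
IMPL_A = [("z", "AND", ("w", "x")), ("p", "OR", ("z", "y"))]        # Fig. 3.1
IMPL_B = [("g", "NAND", ("w", "x")), ("h", "NOT", ("y",)),           # hypothetical:
          ("q", "AND", ("g", "h")), ("out", "NOT", ("q",))]          # NOT(NOT(wx) AND NOT y)

GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOT": lambda a: 1 - a,
}


def evaluate(netlist, vector, fault=None):
    """Simulate a netlist; fault is (line, stuck_value) or None."""
    values = {}

    def put(line, value):
        values[line] = fault[1] if fault and fault[0] == line else value

    for name, bit in zip(INPUTS, vector):
        put(name, bit)
    for line, gate, ins in netlist:
        put(line, GATES[gate](*(values[i] for i in ins)))
    return values[netlist[-1][0]]


def faults(netlist):
    """All single stuck-at faults on primary inputs and gate outputs."""
    lines = list(INPUTS) + [line for line, _, _ in netlist]
    return [(line, v) for line in lines for v in (0, 1)]


def d_metric(net1, f1, net2, f2):
    """1 minus the fraction of vectors where both faulty copies give the same wrong output."""
    same = 0
    for v in VECTORS:
        out1, out2 = evaluate(net1, v, f1), evaluate(net2, v, f2)
        if out1 == out2 and out1 != evaluate(net1, v):
            same += 1
    return 1 - same / len(VECTORS)


def worst_case_pairs(net1, net2):
    """For each fault f1 in net1, a fault f2 in net2 with minimum d, and that d."""
    result = {}
    for f1 in faults(net1):
        f2 = min(faults(net2), key=lambda f: d_metric(net1, f1, net2, f))
        result[f1] = (f2, d_metric(net1, f1, net2, f2))
    return result


if __name__ == "__main__":
    for f1, (f2, d) in worst_case_pairs(IMPL_A, IMPL_B).items():
        print(f"{f1[0]}/{f1[1]} -> worst partner {f2[0]}/{f2[1]}, d = {d}")
```

Several faults in the second module may tie for the minimum d; min() simply returns the first one found, which is sufficient when only the worst-case value is of interest.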
The results in Table 4.3 show a distinct advantage of using different implementations over non-diverse designs for common-mode faults. This is because the worst-case error latency of a common-mode fault in a duplex system with different implementations is at least an order of magnitude larger than the error latency of a common-mode fault in a duplex system with identical implementations.
In order to bring into perspective the significance of this increased error latency, we consider the execution of an application that uses the Z5xp1 circuit of Table 4.3. If the mission time of the application is of the order of hundreds of cycles, then the system with two identical implementations will fail in the presence of CMFs. However, a system with two different implementations of Z5xp1 will be able to finish the task, on average, in the presence of CMFs. Finally, if the mission time is of the order of thousands of cycles, then in the presence of CMFs, none of these systems will be able to finish the task successfully. This result can also be explained from the properties of the diversity metric discussed in Sec. 2.1. The relationship of this result with the CMF rate is explained in Sec. 2.2.
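The interaction between error latency and mission time can be illustrated with a simple geometric model. The sketch below assumes that, once a fault pair is present, each cycle independently preserves data integrity with probability d (as in the d1,2 products of the Appendix), so that a mean error latency of L cycles corresponds to d = 1 - 1/L. The latencies of 10 and 1,000 cycles are illustrative values chosen two orders of magnitude apart, not entries of Table 4.3, and the function name survival is hypothetical.

```python
def survival(mission_cycles, mean_latency):
    """Probability that no identical erroneous output occurs during the mission.

    Assumes a geometric model: each cycle survives independently with
    probability d = 1 - 1/mean_latency.
    """
    d = 1.0 - 1.0 / mean_latency
    return d ** mission_cycles


if __name__ == "__main__":
    # Illustrative mean latencies: 10 cycles (non-diverse), 1,000 cycles (diverse).
    for mission in (100, 1000, 5000):
        same = survival(mission, 10)
        diverse = survival(mission, 1000)
        print(f"mission = {mission:5d} cycles: identical {same:.2e}, diverse {diverse:.3f}")
```

Under these assumptions the diverse system is very likely to finish a 100-cycle mission while the non-diverse one is not, and both become unlikely to finish once the mission is several times longer than the larger latency.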
Suppose that we have a system for which the common-mode failures affect only the inputs. In such a scenario, the systems with different implementations that we considered are not diverse so far as the inputs are concerned. Thus, such systems do not provide any extra protection against the common-mode failures of interest (affecting only the inputs) compared to systems with identical implementations. This argument motivates research in developing common-mode fault models and designing redundant systems with sufficient diversity against the modeled common-mode faults.
5. Self-testing Property
In this section, we discuss the possible effects of having design diversity on the self-testing property of a duplicated system. A duplicated system is called self-testing with respect to a fault pair (f1, f2) (f1 affecting Module 1 and f2 affecting Module 2) if and only if there exists an input combination for which the two modules produce different outputs in the presence of the faults.
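The definition above translates directly into an exhaustive check. The following Python sketch models each faulty module as a function of its inputs; the two examples use the network of Fig. 3.1 (fault-free function wx + y), and the name is_self_testing is illustrative.

```python
from itertools import product


def is_self_testing(faulty_module_1, faulty_module_2, n_inputs):
    """True if some input combination makes the two faulty modules disagree."""
    return any(
        faulty_module_1(*v) != faulty_module_2(*v)
        for v in product((0, 1), repeat=n_inputs)
    )


if __name__ == "__main__":
    # Both copies of Fig. 3.1 with fault p/0: each outputs constant 0, so the
    # comparator never sees a mismatch (a non-self-testable pair, i.e. an escape).
    print(is_self_testing(lambda w, x, y: 0, lambda w, x, y: 0, 3))
    # Module 1 with w/0 (output y) and Module 2 with p/0 (output 0): the outputs
    # differ whenever y = 1, so the pair is self-testing.
    print(is_self_testing(lambda w, x, y: y, lambda w, x, y: 0, 3))
```

The check enumerates 2^n input combinations, so it is only a conceptual aid for circuits with a small number of inputs.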
For the purpose of the experiment, we assume that the failures show up as single stuck-at faults in each of the two modules under consideration. The self-testing property ensures that, in the presence of failures that affect the two modules under consideration, we can detect the presence of the failures. This detection is important for the system to take corrective action and directly affects the system availability as shown in Sec. 2.3. The fourth column of Table 5.1 shows the percentage of non-self-testable fault pairs (escapes) in duplex systems with identical and different implementations. It is clear from Table 5.1 that with different implementations it is possible to achieve high self-testing properties of the designs under consideration. In fact, an interesting synthesis problem is to synthesize two implementations of a given logic function such that the number of self-testable fault pairs is maximum.
Table 5.1. Self-testing properties of diverse and non-diverse duplex systems
Circuit Name   Copies   # SSF pairs   % escapes
Z5xp1          T, T     302,500       0.73
Z5xp1          C, C     372,100       0.65
Z5xp1          T, C     335,500       0.02
clip           T, T     487,204       0.58
clip           T, C     463,472       0.02
inc            T, T     236,196       0.73
inc            C, C     256,036       0.84
inc            T, C     245,916       0.03
rd84           T, T     322,624       0.74
rd84           C, C     158,404       1.1
rd84           T, C     226,0         0.04
6. Diversity Advantages In Configurable Systems
For evaluating the advantages of diversity in configurable systems, we implemented duplex systems on a configurable computing test-bed [Wildforce 99]. The test-bed contains FPGAs that can be used for mapping designs into the test-bed. Figure 6.1 shows the test-bed.
Figure 6.1. Reconfigurable computing test-bed
Duplex systems containing identical and different implementations of the same logic function were designed. The synthesis tool from Synplicity was used for synthesizing the different implementations. Xilinx placement and routing tools were used to map the designs on the test-bed. For each duplex system, we injected stuck-at faults in the lookup tables of the implementations and, for each fault pair, we calculated the error latency of that fault pair. We picked the worst-case fault pairs (just like Simulation 2) and plotted the cumulative distribution showing the percentage of worst-case fault pairs having the error latency less than or equal to a particular value.
In Fig. 6.2(a) we show the cumulative distribution of the worst-case error latencies for a duplex system with two identical implementations of the MCNC benchmark circuit cps.pla with 23 inputs. Note that the X-axis is in the logarithmic scale. Figure 6.2(b) shows a similar cumulative distribution for a duplex system with different implementations of the same logic function (cps.pla). The faults were injected by modifying the contents in the FPGA lookup tables. We also calculated the mean error latency and it can be seen that the mean error latency is at least an order of magnitude greater for diverse duplex systems.
The significance of the curves in Fig. 6.2 can be explained with the help of the following example. Consider an application with a mission time of 10^6 cycles. For a system with identical implementations (Fig. 6.2(a)), the data integrity of the system will be compromised before the mission time is reached for around 85% of the cases in the presence of CMFs. In other words, for only around 15% of the CMFs, the system is expected to successfully complete the task before data corruption occurs. In contrast, if we use a duplex system with different implementations, then for around 65% of the cases, the system is expected to successfully finish the task before impacting data integrity (Fig. 6.2(b)). This clearly demonstrates the advantage of diversity against CMFs and puts into perspective its relationship with the mission time of the application, as explained in Sec. 2.2.
Figure 6.2. Results from experiments on a configurable computing test-bed (x-axis: error latency, log base-10; y-axis: cumulative proportion of fault pairs). (a) Mean error latency = 319,904 cycles. (b) Mean error latency = 5,255,211 cycles.
Some of the faults injected in the previous experiment are redundant faults. Hence, the output data will never become corrupt in the presence of these faults. The plots in Fig. 6.3(a) and 6.3(b) show the statistics for the same experiments as above after excluding the redundant faults. It can be seen that the mean error latency of the diverse duplex system is around two orders of magnitude better than that of the duplex system with identical implementations. We have performed many such experiments and obtained similar data that demonstrate the advantages of using diverse duplex systems in configurable systems. While it is true that the actual nature of these curves depends on the sequence of input combinations applied, we performed multiple experiments with different input sequences and the curves show the same trend as those shown in Fig. 6.2 and Fig. 6.3.
Figure 6.3. Results from experiments on a configurable computing test-bed (excluding redundant faults; x-axis: error latency, log base-10; y-axis: cumulative proportion of fault pairs). (a) Mean error latency = ,565 cycles. (b) Mean error latency = 5,156,050 cycles.
7. Conclusions
In this paper, we addressed the problem of design diversity in redundant (software or hardware) systems in order to handle common-mode failures and failures in multiple modules. In order to protect fault-tolerant systems against common-mode failures, design diversity has been used commercially. In the past, design diversity was defined to be “independent” generation of “different” designs. This notion of diversity is qualitative and has limitations because it does not provide any quantitative basis to compare reliabilities of different diverse systems. Hence, the need for a metric to quantify diversity between different systems has been expressed in the past.
In this paper, for the first time, we have introduced a metric to quantify diversity among different designs under a particular fault model, and explained how to calculate the overall system reliability in terms of this metric. In our example of the calculation of diversity for combinational logic circuits (Sec. 2.1), we have assumed that all the input combinations are equally likely. In the absence of any information about the relative frequency of the different input combinations, this is a reasonable assumption. However, for a particular application, if we have information about the relative frequencies (in the form of input traces, for example), then we can appropriately modify the above expression to incorporate this extra information (by changing the weights associated with different input combinations).
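One way to read the weighting described above is to replace the uniform 1/2^n weight of each input combination by its observed relative frequency. The Python sketch below illustrates this for the fault pair (w/0, p/0) of Fig. 3.1, whose only common erroneous input is 110. The trace is made up for illustration, and the helper names weighted_d and probabilities_from_trace are hypothetical.

```python
from collections import Counter


def weighted_d(same_error_vectors, input_prob):
    """d = 1 - total probability of the inputs on which both copies err identically."""
    return 1.0 - sum(input_prob.get(v, 0.0) for v in same_error_vectors)


def probabilities_from_trace(trace):
    """Relative frequency of each input combination in an observed trace."""
    counts = Counter(trace)
    return {vector: count / len(trace) for vector, count in counts.items()}


# Inputs (as "wxy" strings) on which faults w/0 and p/0 of Fig. 3.1 both err.
SAME_ERROR = {"110"}

if __name__ == "__main__":
    uniform = {format(i, "03b"): 1 / 8 for i in range(8)}
    print("uniform inputs:", weighted_d(SAME_ERROR, uniform))

    # Hypothetical application trace in which the pattern 110 is very frequent.
    trace = ["110", "110", "001", "110", "010", "110", "111", "000"]
    print("trace-weighted:", weighted_d(SAME_ERROR, probabilities_from_trace(trace)))
```

With the uniform weights the value matches the 7/8 obtained in Sec. 3; with the skewed trace the same fault pair looks considerably less diverse.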
We have also presented simulation results for duplex systems. Our theoretical and simulation results indicate that, in the presence of independent multiple-module failures in redundant systems, mere use of different implementations does not guarantee higher reliability compared to redundant systems with identical implementations. It is more important to evaluate the reliability of the systems using our metric. On the other hand, for common-mode failures and design faults, there is a significant gain with different implementations. However, the gain decreases with increasing mission time. Our analysis technique can be used to derive relationships between system reliability, diversity, mission time and system failure rate and compare reliabilities of multiple diverse systems. These relationships can help understand the cost and reliability tradeoffs while designing redundant systems with diversity.
For common-mode failures, diverse systems have no worse reliability compared to replicated systems. However, there is a further need to characterize common-mode failure mechanisms at the circuit level. With a good CMF fault model, (logical or layout-level) synthesis techniques can be used to incorporate sufficient diversity to protect systems against the modeled faults.
Our simulation results demonstrate that diversity plays an important role in enhancing the self-testing property of duplex systems. This can prove to be useful if we can apply specific patterns to the system during idle cycles.
8. Acknowledgments
This work was supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. DABT63-97-C-0024 (ROAR project). The authors would like to thank Dr. Santiago Fernandez Gomez of the Stanford Center for Reliable Computing for his help with the experiments on the configurable computing test-bed. Thanks are due to Mr. Robert Wei-Je Huang and Mr. Philip Shirvani of the Stanford Center for Reliable Computing.
9. References
[Avizienis 77] Avizienis, A. and L. Chen, “On the implementation of N-version programming for software fault-tolerance during program execution,” Proc. Intl. Computer Software and Appl. Conf., pp. 149-155, 1977.
[Avizienis 84] Avizienis, A. and J. P. J. Kelly, “Fault Tolerance by Design Diversity: Concepts and Experiments,” IEEE Computer, pp. 67-80, August 1984.
[Briere 93] Briere, D. and P. Traverse, “Airbus A320/A330/A340 Electrical Flight Controls: A family of fault-tolerant systems,” Proc. FTCS, pp. 616-623, 1993.
[Eckhardt 85] Eckhardt, D. E. and L. D. Lee, “A theoretical basis for the analysis of multi-version software subject to coincident failures,” IEEE Trans. Software Engg., Vol. SE-11, pp. 1511-1517, Dec. 1985.
[Gary 79] Garey, M. R. and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.
[Lala 94] Lala, J. H. and R. E. Harper, “Architectural principles for safety-critical real-time applications,” Proc. of the IEEE, Vol. 82, No. 1, pp. 25-40, January 1994.
[Liu 97] Liu, J., et al., “Heavy ion induced single event effects in semiconductor device,” Proc. Intl. Conference on Atomic Collisions in Solids, 1997.
[LSI 96] G10-p Cell-Based ASIC Products Databook, LSI Logic, May 1996.
[Lyu 91] Lyu, M. R. and A. Avizienis, “Assuring design diversity in N-version software: a design paradigm for N-version programming,” Proc. DCCA, pp. 197-218, 1991.
[Lyu 95] Lyu, M., Handbook of Software Reliability Engineering, Computer Society Press, 1995.
[McCluskey 71] McCluskey, E. J. and F. W. Clegg, “Fault Equivalence in combinational logic networks,” IEEE Trans. on Computers, Vol. C-20, No. 11, pp. 1286-1293, Nov. 1971.
[McCluskey 88] McCluskey, E. J., S. Makar, S. Mourad and K. D. Wagner, “Probability Models for Pseudo-random Test Sequences,” IEEE Trans. Computers, Vol. 37, No. 2, pp. 160-174, Feb. 1988.
[Mitra 99] Mitra, S., N. R. Saxena and E. J. McCluskey, “A Design Diversity Metric and Reliability Analysis for Redundant Systems,” Proc. Intl. Test Conf., pp. 662-671, 1999.
[Pradhan 96] Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice Hall, 1996.
[Reed 97] Reed, R., et al., “Heavy ion and proton-induced single event multiple upset,” IEEE Trans. on Nuclear Science, Vol. 44, No. 6, pp. 2224-2229, July 1997.
[Riter 95] Riter, R., “Modeling and Testing a Critical Fault-Tolerant Multi-Process System,” Proc. FTCS, pp. 516-521, 1995.
[Sakov 87] Sakov, J. and E. J. McCluskey, “Functional Test Pattern Generation for Random Logic,” CRC TR 87-1, Center for Reliable Computing, Stanford Univ., 1987.
[Saxena 98] Saxena, N. R. and E. J. McCluskey, “Dependable Adaptive Computing Systems,” Proc. IEEE Systems, Man and Cybernetics Conf., San Diego, pp. 2172-2177, 1998.
[Shedletsky 76] Shedletsky, J. J. and E. J. McCluskey, “The Error Latency of a Fault in a Sequential Digital Circuit,” IEEE Trans. Computers, Vol. C-25, No. 6, pp. 655-659, June 1976.
[Sentovich 92] Sentovich, E. M., et al., “SIS: A System for Sequential Circuit Synthesis,” ERL Memo. No. UCB/ERL M92/41, EECS, UC Berkeley, CA 94720.
[Siewiorek 75] Siewiorek, D. P., “Reliability modeling of compensating module failures in majority voted redundancy,” IEEE Trans. Comp., Vol. 24, No. 5, pp. 525-533, 1975.
[Siewiorek 92] Siewiorek, D. P. and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, Digital Press, 1992.
[Stroud 94] Stroud, C. E., “Reliability of Majority Voting Based VLSI Fault-Tolerant Circuits,” IEEE Trans. on VLSI, Vol. 2, No. 4, pp. 516-521, December 1994.
[Tamir 84] Tamir, Y. and C. H. Sequin, “Reducing common mode failures in duplicate modules,” Proc. ICCD, pp. 302-307, 1984.
[To 73] To, K., “Fault Folding for Irredundant and Redundant Combinational Circuits,” IEEE Trans. Comp., Vol. C-22, No. 11, pp. 1008-1015, Nov. 1973.
[Tohma 71] Tohma, Y. and S. Aoyagi, “Failure-tolerant sequential machines with past information,” IEEE Trans. Computers, Vol. C-20, No. 4, pp. 392-396, April 1971.
[Tomek 93] Tomek, L. A., J. K. Muppala and K. S. Trivedi, “Modeling Correlation in Software Recovery Blocks,” IEEE Trans. Software Engg., Vol. 19, No. 11, pp. 1071-1086, Nov. 1993.
[Von Neumann 56] Von Neumann, J., “Probabilistic Logics and the synthesis of reliable organisms from unreliable components,” Automata Studies, Ann. of Math. Studies, No. 34, pp. 43-98, 1956.
[Wildforce 99] The Wildforce Board Manual, Annapolis Microsystems, Inc., 1999.
APPENDIX
1. Reliability Calculation
In this section, we derive an expression for the probability that a duplex system produces correct outputs up to time t when two modules are affected by faults f1 and f2, respectively. As explained in Sec. 3, there are two cases that must be considered. In the first case, the faults f1 and f2 appear simultaneously. Let p be the probability that a module is affected by a fault at any time instant. Also, we assume that only a single stuck-at fault can affect a particular module. With these assumptions, the probability that the fault pair appears at time instant i is given by the following expression:
$$(1-p)^{2(i-1)}\,p^{2}$$
The above expression follows from the fact that the two modules are fault-free up to time (i-1) and become faulty at time instant i. Once the fault pair (f1, f2) arrives at time instant i, the probability that the system will produce correct outputs from time instant i up to time t is obtained by multiplying by d1,2 (t - i + 1) times to obtain the following expression:
$$(1-p)^{2(i-1)}\,p^{2}\,d_{1,2}^{\,t-i+1}$$
In the above expression, i can vary from 1 to t. Thus, we have the following summation:
$$\sum_{i=1}^{t} (1-p)^{2(i-1)}\,p^{2}\,d_{1,2}^{\,t-i+1}$$
The above summation evaluates to the following expression (by summation of a geometric progression):
$$s_1(f_1, f_2, t) = \frac{p^{2}\,d_{1,2}\,\left[d_{1,2}^{\,t} - (1-p)^{2t}\right]}{d_{1,2} - (1-p)^{2}}$$
Next, we consider the case where the faults f1 and f2 do not appear simultaneously. First, we consider the case where fault f1 appears earlier. The probability that faults f1 and f2 appear at times i and j, respectively (j > i), is given by:
$$(1-p)^{i-1}\,(1-p)^{j-1}\,p^{2}$$
In a duplex system, as long as the two modules are working correctly, correct outputs are produced by the system. However, once fault f2 affects Module 2, the probability that the system produces correct outputs starting from time j to time t can be obtained by multiplying the above expression by d1,2 (t - j + 1) times as shown below:
$$(1-p)^{i+j-2}\,p^{2}\,d_{1,2}^{\,t-j+1}$$
Here, i can vary from 1 to t-1 and j can vary from i+1 to t. Thus, we have the following expression:
$$\sum_{i=1}^{t-1}\sum_{j=i+1}^{t} (1-p)^{i+j-2}\,p^{2}\,d_{1,2}^{\,t-j+1}$$
In the above discussion, we assumed that f1 appears before f2. In order to consider the other case, we can multiply the above expression by 2 to obtain the following expression:
$$s_2(f_1, f_2, t) = \frac{2\,(1-p)\,p^{2}\,d_{1,2}^{2}\,\left[d_{1,2}^{\,t-1} - (1-p)^{2t-2}\right]}{(d_{1,2}-1+p)\,\left[d_{1,2} - (1-p)^{2}\right]} - \frac{2\,(1-p)^{t}\,p\,d_{1,2}\,\left[1 - (1-p)^{t-1}\right]}{d_{1,2}-1+p}$$
s(f1, f2, t) = s1(f1, f2, t) + s2(f1, f2, t) is the probability that the system produces correct outputs when Module 1 is affected by fault f1 and Module 2 is affected by fault f2.
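The closed forms for s1 and s2 can be checked numerically against the summations they are derived from. The Python sketch below is such a check; p, d and t are the symbols used above (with d standing for d1,2), the function names are illustrative, and the closed forms are undefined when d equals (1 - p) or (1 - p)^2.

```python
import math


def s1_sum(p, d, t):
    """Direct summation for the case where f1 and f2 appear simultaneously."""
    return sum((1 - p) ** (2 * (i - 1)) * p**2 * d ** (t - i + 1) for i in range(1, t + 1))


def s1_closed(p, d, t):
    """Closed form of s1 (geometric progression)."""
    return p**2 * d * (d**t - (1 - p) ** (2 * t)) / (d - (1 - p) ** 2)


def s2_sum(p, d, t):
    """Direct double summation for faults arriving at different times (both orders)."""
    return 2 * sum(
        (1 - p) ** (i + j - 2) * p**2 * d ** (t - j + 1)
        for i in range(1, t)
        for j in range(i + 1, t + 1)
    )


def s2_closed(p, d, t):
    """Closed form of s2; r abbreviates (1 - p)."""
    r = 1 - p
    first = r * p**2 * d**2 * (d ** (t - 1) - r ** (2 * t - 2)) / ((d - r) * (d - r**2))
    second = r**t * p * d * (1 - r ** (t - 1)) / (d - r)
    return 2 * (first - second)


if __name__ == "__main__":
    for p, d, t in [(0.01, 0.9, 50), (0.2, 0.5, 10), (0.1, 1.0, 20)]:
        print(p, d, t, s1_sum(p, d, t), s1_closed(p, d, t),
              s2_sum(p, d, t), s2_closed(p, d, t))
```

Agreement for a handful of parameter values does not prove the identities, but it is a quick guard against sign or exponent slips when the expressions are transcribed.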
We can extend the above derivations for common-mode failures. For that purpose, we assume that p is the probability that a given pair of modules gets affected. Let f be a common-mode failure that affects both the modules such that faults f1 and f2 appear in Module 1 and Module 2, respectively. The probability that the failure shows up at time instant i is given by the following expression:
$$(1-p)^{i-1}\,p$$
Once f arrives at time instant i, the probability that the system will produce correct outputs from time instant i up to time t is obtained by multiplying by d1,2 (t - i + 1) times to obtain the following expression:
$$(1-p)^{i-1}\,p\,d_{1,2}^{\,t-i+1}$$
Here, i can vary from 1 to t. Thus, we obtain the following expression:
$$z(f_1, f_2, t) = \frac{p\,d_{1,2}\,\left[d_{1,2}^{\,t} - (1-p)^{t}\right]}{d_{1,2} - (1-p)}$$
As mentioned in Sec. 2.2, since we are considering CMFs, we do not consider the case where f1 and f2 arrive at different times.
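As a sanity check on the expression for z, it helps to look at two limiting cases. The short derivation below uses only the symbols defined above; the interpretation given for each limit is our reading, not a statement from the analysis in Sec. 2.2.

```latex
% Fully diverse pair, d_{1,2} = 1: every cycle after the CMF arrives is survived,
% so z reduces to the probability that the CMF has arrived by time t.
z(f_1, f_2, t)\big|_{d_{1,2}=1}
  = \frac{p\,\left[1 - (1-p)^{t}\right]}{1 - (1-p)}
  = 1 - (1-p)^{t}

% No diversity for this pair, d_{1,2} = 0: the first cycle after the CMF arrives
% already corrupts the outputs, so the numerator vanishes.
z(f_1, f_2, t)\big|_{d_{1,2}=0} = 0
```

In the first case, adding the probability (1-p)^t that no CMF has occurred by time t gives 1, as expected for a pair of faults that never produce identical erroneous outputs.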