System Reliability and Availability

We have already discussed reliability and availability basics in a previous article. This article focuses on techniques for calculating system availability from the availability information for its components. The following topics are discussed in detail: availability in series, availability in parallel, partial operation availability, and an availability computation example.

System Availability

System availability is calculated by modeling the system as an interconnection of parts in series and parallel. The following rules are used to decide whether components should be placed in series or parallel:

- If failure of a part leads to the combination becoming inoperable, the two parts are considered to be operating in series.
- If failure of a part leads to the other part taking over the operations of the failed part, the two parts are considered to be operating in parallel.
Availability in Series

As stated above, two parts X and Y are considered to be operating in series if failure of either part results in failure of the combination. The combined system is operational only if both Part X and Part Y are available. From this it follows that the combined availability is the product of the availabilities of the two parts. The combined availability is shown by the equation below:

A = Ax * Ay

An implication of the above equation is that the combined availability of two components in series is always lower than the availability of either individual component. Consider the system in the figure above. Part X and Part Y are connected in series. The table below shows the availability and downtime for the individual components and the series combination.
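As a quick check of the series formula, here is a minimal Python sketch. The 99% and 99.99% availability figures are illustrative assumptions, not prescribed values:

```python
# Combined availability of parts in series: A = Ax * Ay
def series_availability(*availabilities):
    combined = 1.0
    for a in availabilities:
        combined *= a
    return combined

HOURS_PER_YEAR = 24 * 365

ax = 0.99    # assumed availability of Part X (low)
ay = 0.9999  # assumed availability of Part Y (high)

combined = series_availability(ax, ay)
print(f"Combined availability: {combined:.6f}")   # 0.989901
print(f"Downtime per year: {(1 - combined) * HOURS_PER_YEAR:.1f} hours")
```

Note that the combined availability (98.99%) is below even the weaker part's 99%, which is exactly the "weakest link" effect described in the text.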
From the above table it is clear that even though a very high availability Part Y was used, the overall availability of the system was pulled down by the low availability of Part X. This illustrates the saying that a chain is only as strong as its weakest link; more precisely, the chain is weaker than its weakest link.

Availability in Parallel

As stated above, two parts are considered to be operating in parallel if the combination fails only when both parts fail. The combined system is operational if either part is available. From this it follows that the combined availability is 1 minus the probability that both parts are unavailable. The combined availability is shown by the equation below:

A = 1 - (1 - Ax) * (1 - Ay)

An implication of the above equation is that the combined availability of two components in parallel is always higher than the availability of either individual component. Consider the system in the figure above. Two instances of Part X are connected in parallel. The table below shows the availability and downtime for the individual components and the parallel combination.
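The parallel formula can be sketched the same way; the 99% per-instance availability is again an assumed figure:

```python
# Combined availability of parts in parallel: A = 1 - (1 - Ax) * (1 - Ay)
def parallel_availability(*availabilities):
    all_down = 1.0  # probability that every instance is unavailable
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down

ax = 0.99  # assumed availability of one Part X instance

# Two low-availability instances in parallel yield a much higher availability.
print(f"One instance : {ax:.4f}")
print(f"Two parallel : {parallel_availability(ax, ax):.4f}")   # 0.9999
```

Redundancy turns a 1% outage probability into 0.01%, since both instances must be down simultaneously.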
From the above table it is clear that even though a very low availability Part X was used, the overall availability of the parallel combination is much higher. Parallel operation thus provides a powerful mechanism for building a highly reliable system from low-reliability components. For this reason, all mission-critical systems are designed with redundant components. (Different redundancy techniques are discussed in the Hardware Fault Tolerance article.)

Partial Operation Availability

Consider a system like the Xenon switching system. In Xenon, XEN cards handle the call processing for the digital trunks connected to them. The system has been designed so that XEN cards can be added incrementally to handle subscriber load. Now consider a Xenon switch configured with 10 XEN cards. Should we consider the system to be unavailable when one XEN card fails? This doesn't seem right, as 90% of subscribers are still being served. In systems where failure of a component leads to some users losing service, system availability has to be defined in terms of the percentage of users affected by the failure. For example, in Xenon the system might be considered unavailable if 30% of the subscribers are affected, which translates to 3 XEN cards out of 10 failing. The availability for this system can be computed from A(p,q), the probability that exactly p out of q identical units have failed:

A(p,q) = C(q,p) * A^(q-p) * (1-A)^p

Here p is the number of failed units, q is the total number of units, and A is the availability of a single unit. Since the system is considered available as long as two or fewer XEN cards have failed, the system availability is the sum of A(p,10) for p = 0, 1 and 2.

Availability Computation Example

In this section we will compute the availability of a simple signal processing system.

Understanding the System

As a first step, we prepare a detailed block diagram of the system. The system consists of an input transducer which receives the signal and converts it to a data stream suitable for the signal processor. This output is fed to a redundant pair of signal processors. The active signal processor acts on the input, while the standby signal processor ignores the data from the input transducer.
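Before working through the block diagram, the partial-operation formula above can be sketched in Python. The 99.9% per-card availability is an assumed figure, and `math.comb` supplies C(q,p):

```python
from math import comb

def prob_failed(p, q, a):
    """A(p,q): probability that exactly p of q identical units,
    each with availability a, have failed."""
    return comb(q, p) * a ** (q - p) * (1 - a) ** p

card_availability = 0.999  # assumed availability of a single XEN card

# The switch is considered available while 2 or fewer of its 10 XEN
# cards are down, i.e. fewer than 30% of subscribers are affected.
system_availability = sum(prob_failed(p, 10, card_availability)
                          for p in range(3))
print(f"System availability: {system_availability:.9f}")
```

Summing the exact-failure probabilities for p = 0, 1 and 2 gives the probability that the system is in any acceptable state.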
The standby just monitors the sanity of the active signal processor. The output from the two signal processor boards is combined and fed into the output transducer. Again, the active signal processor drives the data lines; the standby keeps its data lines tristated. The output transducer outputs the signal to the external world. The input and output transducers are passive devices with no microprocessor control. The signal processor cards run a real-time operating system and the signal processing applications. Also note that the system stays completely operational as long as at least one signal processor is in operation. Failure of an input or output transducer leads to complete system failure.

Reliability Modeling of the System

The second step is to prepare a reliability model of the system. At this stage we decide the parallel and serial connectivity of the system. The complete reliability model of our example system is shown below. A few important points to note here are:

- The hardware and software of each signal processor are modeled in series, since failure of either makes that signal processor unavailable.
- The two signal processors are modeled in parallel, since the system remains operational as long as at least one of them is running.
- The input and output transducers are modeled in series with the signal processor pair, since failure of either transducer brings down the entire system.
Calculating Availability of Individual Components

The third step involves computing the availability of the individual components. MTBF (mean time between failures) and MTTR (mean time to repair) values are estimated for each component (see the Reliability and Availability Basics article for details). For hardware components, MTBF information can be obtained from the hardware manufacturer's data sheets. If the hardware has been developed in house, the hardware group would provide MTBF information for the board. MTTR estimates for hardware depend on the degree to which the system will be monitored by operators. Here we estimate the hardware MTTR to be around 2 hours. Once MTBF and MTTR are known, the availability of the component can be calculated using the following formula:

A = MTBF / (MTBF + MTTR)

Estimating software MTBF is a tricky task. Software MTBF is really the time between subsequent reboots of the software. This interval may be estimated from the defect rate of the system, or from previous experience with similar systems. Here we estimate the software MTBF to be around 4000 hours. The software MTTR is the time taken to reboot the failed processor. Our processor supports automatic reboot, so we estimate the software MTTR to be around 5 minutes. Five minutes might seem to be on the higher side, but the MTTR should include the time to detect the failure, the time to reboot the processor, and the time for the software to reinitialize.
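The formula translates directly into code. The 2-hour hardware MTTR, 4000-hour software MTBF and 5-minute software MTTR come from the estimates above; the 30,000-hour hardware MTBF is an assumed figure for illustration:

```python
def availability(mtbf_hours, mttr_hours):
    """Component availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

hw_avail = availability(30000, 2)       # assumed hardware MTBF, 2 h MTTR
sw_avail = availability(4000, 5 / 60)   # 4000 h MTBF, 5 min MTTR

print(f"Hardware availability: {hw_avail:.6f}")
print(f"Software availability: {sw_avail:.6f}")
```

Note how strongly the short automatic-reboot MTTR compensates for the comparatively low software MTBF.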
Things to note from the above table are:
Calculating System Availability

The last step involves computing the availability of the entire system. These calculations are based on the serial and parallel availability calculation formulas.
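Putting the steps together, the availability of the example system can be sketched by composing the series and parallel formulas over the reliability model. All MTBF/MTTR figures here are illustrative assumptions in the spirit of the estimates above:

```python
def availability(mtbf, mttr):
    """Component availability: A = MTBF / (MTBF + MTTR), both in hours."""
    return mtbf / (mtbf + mttr)

def series(*parts):
    """Combination fails if any part fails."""
    combined = 1.0
    for a in parts:
        combined *= a
    return combined

def parallel(*parts):
    """Combination fails only if every part fails."""
    all_down = 1.0
    for a in parts:
        all_down *= (1.0 - a)
    return 1.0 - all_down

# Illustrative component figures (hours).
input_xducer  = availability(100000, 2)   # assumed transducer MTBF
output_xducer = availability(100000, 2)
proc_hw       = availability(30000, 2)    # assumed processor hardware MTBF
proc_sw       = availability(4000, 5 / 60)

# A signal processor needs both its hardware and software (series);
# the redundant pair needs at least one working processor (parallel);
# the transducers are single points of failure (series with the pair).
processor = series(proc_hw, proc_sw)
system = series(input_xducer, parallel(processor, processor), output_xducer)
print(f"System availability: {system:.6f}")
```

With these figures the redundant processor pair contributes almost no downtime; the non-redundant transducers dominate the system's unavailability, which is why the reliability model flags them as single points of failure.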