Reliability and Availability Basics

Realtime and embedded systems are now a central part of our lives, and reliable functioning of these systems is of paramount concern to the millions of users who depend on them every day. Unfortunately, most embedded systems still fall short of users' expectations of reliability. In this article we discuss basic techniques for measuring and improving the reliability of computer systems: failure characteristics, the reliability parameters MTBF, FITS and MTTR, and availability and downtime.

Failure Characteristics

Hardware Failures

Hardware failures are typically characterized by a bathtub curve: the chance of a hardware failure is high during the initial life of the module, drops to a fairly low level during the rated useful life of the product, and rises again once the end of life is reached. Hardware failures during a product's life can thus be attributed to the following causes:

- Infant mortality: design and manufacturing defects that surface early in the module's life.
- Random failures: the low, roughly constant failure rate seen during the rated useful life.
- Wear out: component aging as the module reaches the end of its useful life.
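To make the bathtub shape concrete, here is a minimal Python sketch of a piecewise failure-rate model. The phase boundaries and rates are assumed example values chosen for illustration, not figures from any real product or from this article.

```python
# Toy bathtub-curve model: all boundaries and rates are assumed example values,
# expressed in failures per hour.

def failure_rate(age_hours,
                 infant_mortality_end=1_000,   # assumed end of early-life phase
                 useful_life_end=80_000,       # assumed end of rated useful life
                 early_rate=5e-5,              # elevated early-life failure rate
                 useful_rate=2e-6,             # low, fairly constant in-service rate
                 wear_out_slope=1e-9):         # rate of increase once wear out begins
    """Return the failure rate (failures/hour) of a module at the given age."""
    if age_hours < infant_mortality_end:
        return early_rate                      # infant mortality: high initial failures
    if age_hours < useful_life_end:
        return useful_rate                     # useful life: low failure rate
    # Wear out: the failure rate climbs again past the rated useful life.
    return useful_rate + wear_out_slope * (age_hours - useful_life_end)

for age in (100, 10_000, 100_000):
    print(f"age = {age:>7} h  ->  failure rate = {failure_rate(age):.2e} /h")
```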
Software Failures

Software failures can be characterized by the defect density of the system, a figure obtained by tracking historical defect data. Defect density depends on several factors, including the maturity of the software development process, the complexity of the code, the experience of the development team, the amount of code reused from earlier stable releases, and the rigor of testing before the product ships.
Defect density is typically measured in defects per thousand lines of code (defects/KLOC).

Reliability Parameters

MTBF

Mean Time Between Failures (MTBF) is, as the name suggests, the average time between failures of a hardware module: the average time a manufacturer estimates will elapse before a failure occurs. MTBF for off-the-shelf hardware modules can be obtained from the vendor; MTBF for in-house hardware modules is calculated by the hardware team developing the board. For software, a failure rate can be estimated by multiplying the defect density by the number of KLOC executed per second; the software MTBF is the reciprocal of this failure rate.

FITS

FITS is a more intuitive way of representing MTBF. It is simply the expected number of failures of the module in one billion (1,000,000,000) hours of operation, i.e. 10^9 divided by the MTBF expressed in hours.

MTTR

Mean Time To Repair (MTTR) is the time taken to repair a failed hardware module. In an operational system, repair generally means replacing the module, so hardware MTTR can be viewed as the mean time to replace a failed hardware module. It should be a goal of system designers to meet the system reliability goals while allowing for a high MTTR, because a low MTTR requirement means high operational cost: spares and trained personnel must be kept close to the deployed systems.
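The sketch below works through the MTBF and FITS arithmetic described above. The MTBF, defect density, and execution-rate figures are assumed example values, not data from this article.

```python
BILLION_HOURS = 1e9   # FITS is defined over one billion hours of operation

def mtbf_to_fits(mtbf_hours):
    """Expected number of failures in a billion hours, given MTBF in hours."""
    return BILLION_HOURS / mtbf_hours

def software_mtbf_seconds(defects_per_kloc, kloc_per_second):
    """Failure rate = defect density x KLOC executed per second; MTBF is its reciprocal."""
    failure_rate_per_second = defects_per_kloc * kloc_per_second
    return 1.0 / failure_rate_per_second

# Assumed example values (not from the article):
hw_mtbf_hours = 250_000
print(f"Hardware FITS: {mtbf_to_fits(hw_mtbf_hours):.0f}")            # 4000 FITS

sw_mtbf = software_mtbf_seconds(defects_per_kloc=0.001, kloc_per_second=5)
print(f"Software MTBF: {sw_mtbf:.0f} seconds between failures")       # 200 seconds
```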
MTTR for a software module is the time taken to recover after a software fault is detected, generally by rebooting; software MTTR can thus be viewed as the mean time to reboot after a software fault has been detected. Here the goal of system designers should be to keep the software MTTR as low as possible. Software MTTR therefore depends chiefly on how quickly the fault is detected and how long the system takes to reboot and reinitialize.
Availability

Availability of a module is the percentage of time the module is operational. For a hardware or software module it is obtained from its MTBF and MTTR:

Availability = MTBF / (MTBF + MTTR)

Availability is typically specified in "nines" notation: for example, 3-nines availability corresponds to 99.9% availability and 5-nines availability corresponds to 99.999% availability.

Downtime

Downtime per year is a more intuitive way of understanding availability. The table below compares availability levels with the corresponding downtime per year.

Availability | Downtime per year
99% (2-nines) | 3.65 days
99.9% (3-nines) | 8.76 hours
99.99% (4-nines) | 52.6 minutes
99.999% (5-nines) | 5.26 minutes
99.9999% (6-nines) | 31.5 seconds
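As a worked example, the sketch below computes availability from MTBF and MTTR and converts availability into downtime per year; the MTBF and MTTR figures are assumed example values.

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours

def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60

# Assumed example values: MTBF of 10,000 hours, MTTR of 1 hour.
a = availability(mtbf_hours=10_000, mttr_hours=1)
print(f"Availability: {a:.5%}")                                   # ~99.990%
print(f"Downtime:     {downtime_minutes_per_year(a):.1f} minutes/year")

# Reproduce the downtime column of the table above from the number of nines.
for nines in (2, 3, 4, 5, 6):
    avail = 1 - 10 ** (-nines)
    print(f"{nines}-nines ({avail:.4%}): "
          f"{downtime_minutes_per_year(avail):.2f} minutes/year")
```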