Originally published in Microelectron. Reliab., Vol. 35. Nos 9-10,
pp. 1347-1356, 1995
Copyright Elsevier Science Ltd.
QUANTIFYING UNCERTAINTY IN RELIABILITY AND SAFETY STUDIES
Patrick D.T. O’Connor
Abstract
The paper discusses the extent to which the quantification of reliability and safety of engineering systems can be considered to be credible and useful, and the reasons why the standard methods normally used can be misleading. The problem is discussed in relation to prediction and to measurement, for hardware, software and human aspects of engineering systems. Suggestions are made for improving the credibility and value of reliability and safety predictions and measurements.
Introduction
It is often necessary to try to quantify reliability. For new system designs we need to know whether requirements for reliability and safety will be achieved. Other factors, such as warranty costs and requirements for spares and test equipment, depend on reliability. Contracts often stipulate that reliability must be predicted and demonstrated quantitatively, and the methods to be used may be specified.
There is an increasing tendency towards quantification of safety hazards, the objectives being to show that a proposed system meets criteria for safety, or to determine a “safety benefit” expressed as hazard reduction per unit of expenditure.
All of these efforts are of course well-intentioned. They are seen to be justified by Lord Kelvin’s aphorism that if we cannot express a concept numerically, our knowledge of it is of a “poor kind”. They also reflect the fact that engineers have had scientific educations, and are therefore intuitively unhappy with concepts that are not subject to rules and amenable to quantification.
Failures and Physics
To quantify the reliability or safety of an engineering system, we must obviously predict or count the number of times it is likely to fail or has failed in ways that affect reliability or safety. However, reliability and safety are not concepts like mass or power that can be predicted using known laws, or measured with instruments. When we say that a product weighs X kilograms and delivers Y watts there is no ambiguity or uncertainty, beyond the normally expected and small variations inherent in engineering products and applications. However the mass and power have been predicted or measured, we know that all nominally identical products will have the same properties.
We also understand cause and effect relationships in relation to such physical parameters. We can use this knowledge to change the design in order to reduce weight or to increase power. A fundamental part of this knowledge is the universal agreement on definitions of the units concerned: all kilograms are the same as other kilograms. The effect of one is the same as the effect of any other. Therefore the rules of mathematics can be applied to them, and the results make sense and are believable.
In marked contrast, system failures are extremely diverse in kind and in their effects. It is usually difficult to reach agreement on how to classify the kinds of events that relate to reliability and safety. For example, is a leaking hydraulic hose a failure or not? If it is, when did the leak become bad enough to be classed as a failure? We all know when a light bulb fails or a car fails to start. However, if a microprocessor chip behaves erratically at -40°C this would be classified as a failure if it is tested at this temperature, but would not be apparent if the chip were used in a PC in the office. In other words, failures, whether predicted or reported, are to a large extent the results of human perceptions and interpretations.
Causes of Failure
Failures of engineering systems have many causes. At the level of the individual components, such as electronic devices, springs, bearings, etc., failure can be caused by overstress (electrical, mechanical, thermal, etc.), by wearout mechanisms such as material fatigue, wear and corrosion, and by changes in properties due to age or use, such as drift in electronic component parameter values or loss of lubrication between moving parts. Failures can also occur as a result of the inherent variation that exists in parameter values and dimensions due to variation of production processes and tolerances. Variation also exists in the conditions in which systems operate. Parametric and environmental variations can have individual and combined effects that lead to failure.
Failures often result from effects at the system level, not involving actual component failure. Failures due to variation and tolerance effects are often in this category. Other causes are electromagnetic interference and timing problems in digital electronic systems, operator or maintenance errors, and incorrect diagnosis by test software.
A large proportion of the failures of components and systems is caused by shortfalls in the quality of manufacture, assembly and test. Such failures may or may not be related to the design. Again there is a great range of possibilities of such failures in modern systems, for example a defective solder joint that breaks open in service, a defect in a casting leading to early fatigue failure, and an electrical cable damaged during assembly.
Problems of Reliability Prediction
Predicting reliability is clearly a much more uncertain exercise than predicting physical aspects such as mass and power. We must take account of the great range of possible causes of failure, at the levels of parts, subsystems and system. We must take account of human aspects such as interpretation of events, human and process errors in production, and operation and maintenance influences. The rate of occurrence of failures due to such disparate and uncertain causes is inevitably subject to wide uncertainty, as experience shows.
Despite this, standard methods have been developed for reliability prediction. These are all based on variations of the “parts count” approach, which assumes that the reliabilities of all of the components of a system are known, and that the system reliability is the product of the component reliabilities (or the failure rate is the sum of the component failure rates). The most widely applied method is US Military Handbook 217 (Reliability Prediction of Electronic Equipment), though other methods and sources of “failure data” are also available. Methods such as MIL-HDBK-217 are grossly misleading and inadequate for several reasons. They claim to enable us to predict, for example, the failure rate per million hours of a diode while being fired from a cannon, or the failure rate per million hours of a mirror. Reference 1 deals specifically with the shortcomings of MIL-HDBK-217, but the main failings of all such methods are:
- They assume that all system failures are caused by component failures, and that all component failures cause system failures.
- They assume that failure data from the past will represent the future reliability, that is, they ignore the often very significant improvements in component reliability brought about by modern design and processes.
- They assume that all components have failure rates that are constant in time. This is particularly inappropriate for most mechanical components.
- They ignore the effects of quality of design and production, in relation to aspects such as protection, tolerancing, and quality control.
- They assume that failure is an inherent property of every component.
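The arithmetic that the “parts count” methods prescribe is simple enough to sketch in a few lines. The component names, failure rates and quantities below are illustrative placeholders, not handbook data:

```python
# Parts-count reliability prediction: the system failure rate is taken as
# the sum of (assumed constant) component failure rates, each multiplied
# by the quantity used. All figures are invented for illustration.

components = {
    # name: (failure rate per 10^6 hours, quantity used)
    "diode":     (0.05, 120),
    "resistor":  (0.01, 300),
    "capacitor": (0.02, 150),
    "connector": (0.10, 12),
}

# Summing assumes independence and constant failure rates -- the very
# assumptions criticised in the list above.
system_rate = sum(rate * qty for rate, qty in components.values())
mtbf_hours = 1e6 / system_rate  # predicted mean time between failures

print(f"Predicted system failure rate: {system_rate:.2f} per 10^6 h")
print(f"Predicted MTBF: {mtbf_hours:.0f} h")
```

Note that the output carries several significant figures of apparent precision, while every input is subject to the uncertainties described in the text.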
Of course every practical engineer knows that such assumptions are grossly incorrect for most components and systems. It is therefore no surprise that the methods that have been developed, and which are still being perpetrated (most notably the IEC Reliability Committee TC56 is preparing an international standard on electronics reliability prediction, based on these methods), generate “predictions” that turn out to be orders of magnitude in error in either direction, and in relation to the causes of failure experienced.
The methods are often defended on the ground that they “provide a basis for comparison of designs”. This is analogous to using a random collection of rocks as a standard for mass, or sensation as a standard for electric potential.
There are some situations in which the “parts count” or “block diagram” method can be applied with reasonable credibility. The criteria for such applications must be that the failure data on the components is accurate, and that there is good reason to believe that the rate and pattern of failures in future will be the same as in the past, or that the effect on reliability of any changes of design or application can be forecast with confidence. Again, practical engineers know very well how uncertain forecasts can be even under such circumstances. A recent example of such forecasting is the assessment of multiple engine shutdown on new commercial aircraft, in order to justify certification of twin engine aircraft for extended overwater flights. The engines and their application were the same as on earlier three and four engine aircraft, and there was considerable data on past operations.
Software
Software forms part of most modern engineering systems. Therefore predicting and measuring its contribution to reliability and safety has exercised the minds of system and software engineers and reliability academics. Reference 2 is a well known text on the subject.
Software is not prone to the causes of failure that affect hardware. It cannot degrade or break, and there is no variation from copy to copy. System failures occur when the software contains errors, which cause the system to malfunction under particular operating conditions. The rate at which such events occur depends upon the nature of the errors and the ways in which the system is used. Errors are created by the people involved in specifying, designing and coding. If a failure occurs its recurrence can be prevented by changing the software. Therefore reliability predictions or measurements based upon quantifying the numbers of errors or of failures during particular tests or use cannot be credible indicators of future reliability. Despite this, there is considerable literature on the topic, with various “experts” propounding and comparing different “models” for software reliability. The wide disparity of approaches and the fact that the methods proposed have not been taught or applied in mainstream software engineering indicate their poverty and lack of logical basis.
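The kind of “model” the literature proposes can be illustrated by the Jelinski-Moranda formulation, one of the simplest; Reference 2 describes more elaborate variants. All numbers here are invented, and the two parameters on which everything rests are, in practice, unknowable:

```python
# Jelinski-Moranda software reliability model: assumes the program starts
# with N faults, each contributing an equal hazard phi, and that each fix
# removes exactly one fault without introducing new ones.
# After (i) faults are fixed, the failure rate is phi * (N - i).
# N and phi are invented for illustration; in practice they must be
# estimated from the very failure data whose predictive worth is in doubt.

N = 30        # assumed initial number of faults
phi = 0.002   # assumed per-fault hazard, failures per hour

def failure_rate(faults_fixed):
    """Predicted failure rate after a given number of faults are removed."""
    return phi * (N - faults_fixed)

for fixed in (0, 10, 20, 29):
    print(f"after {fixed:2d} fixes: {failure_rate(fixed):.4f} failures/h")
```

The model’s core difficulty is visible in its first two lines: no one can count the faults remaining in a program, and nothing guarantees that every fault contributes the same hazard or that fixes never introduce new errors.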
Human Error
Attempts have been made to generate databases of human error probabilities. Tables have been published giving figures such as “probability of misreading a dial” and “probability of misinterpreting an indicator”. Apart from the obvious question of what is the difference, such numbers imply that there is one kind of dial or indicator, one type of person, and one set of operating and use conditions. Such gross oversimplifications betray the lack of logic and scientific basis for such figures, and lead to justified scepticism and mistrust.
Predicting Safety
Systems are designed, built and operated to ensure that attendant hazards are kept as low as possible. Whilst users can tolerate failures every few hours or days for an aircraft or a nuclear power station, society expects that crashes and radioactive discharges will occur at far lower rates, bordering on never. In order to predict the likelihood of such rare events, “quantitative risk assessment” (QRA) methods have been developed. These are based on the techniques used for reliability prediction, with added refinements to take account of multiple failures, failures due to external events, and common cause failures. QRA deals with probabilities of the order of 10^-6 to 10^-10, and even lower, per year or other unit of exposure. Predicting the rate of occurrence of such rare events is obviously a highly uncertain exercise, particularly as they are really supposed never to occur. It is not surprising that when such events do occur, the causes are nearly always due to failures or combinations of failures that were not predicted, thus calling into question the value and credibility of the prediction.
Safety predictions and QRA include methods such as fault tree analysis and cause-consequence analysis to attempt to take account of multiple failures, human error, and external factors. Whilst these methods can be useful in helping to identify possible causes of hazards, it is important that undue attention to the quantitative aspects does not mislead. An interesting example of this is the recent certification of twin-engine passenger aircraft, described earlier. In analysing this type of situation it is essential that single engine failures that can cause an accident, such as the Amsterdam crash of the Boeing 747, are taken into account: the failure caused an accident because it led to the loss of two engines, but the same failure on a twin-engine aircraft would probably not have resulted in an accident.
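The quantitative side of fault tree analysis reduces to simple probability arithmetic, sketched below with invented event probabilities. The independence assumed by the AND gate is exactly what common-cause failures, such as the engine loss described above, violate:

```python
# Fault tree probability arithmetic as used in QRA. An AND gate multiplies
# event probabilities (assuming independence); an OR gate combines them as
# 1 - product(1 - p). All event probabilities are invented placeholders.

def and_gate(*probs):
    """All input events must occur (independence assumed)."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(*probs):
    """At least one input event occurs (independence assumed)."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

# Top event: loss of both engines, either through two independent engine
# failures or a single common-cause event affecting both.
p_engine = 1e-4   # per-flight probability of one engine failing (invented)
p_common = 1e-7   # common-cause failure probability (invented)

p_top = or_gate(and_gate(p_engine, p_engine), p_common)
print(f"predicted top-event probability: {p_top:.2e} per flight")
```

Note that the invented common-cause term dominates the result by an order of magnitude: the credibility of the whole prediction rests on having identified, and correctly quantified, precisely the events that experience shows are usually missed.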
Measuring Reliability and Safety
We can measure historic reliability and safety, and express them quantitatively. Methods have been developed and standardised for demonstrating reliability, the best known being MIL-STD-781. Such methods are subject to much the same criticisms as predictive techniques based on parts-level “data”. Further criticisms are:
- How should failures be counted if corrective action is taken to prevent them or to reduce their occurrence?
- How should we count failures that have not yet occurred, but which are very likely or certain to in future?
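The statistical machinery behind such demonstration plans can be sketched under the constant-failure-rate assumption on which MIL-STD-781-style plans rest; the test durations and acceptance numbers below are invented for illustration:

```python
import math

# Fixed-duration reliability demonstration under the constant-failure-rate
# assumption: run for T hours and accept if no more than c failures occur.
# The number of failures is then Poisson with mean T / MTBF.
# All figures are illustrative, not taken from any standard.

def accept_probability(true_mtbf, test_hours, max_failures):
    """P(accept) = Poisson CDF at max_failures, with mean T / MTBF."""
    mean = test_hours / true_mtbf
    return sum(math.exp(-mean) * mean**k / math.factorial(k)
               for k in range(max_failures + 1))

# A design exactly meeting a 1000 h MTBF requirement still risks rejection...
print(f"{accept_probability(1000, 2000, 2):.3f}")
# ...while a design at only half the required MTBF may well be accepted.
print(f"{accept_probability(500, 2000, 2):.3f}")
```

Even before the counting questions above are answered, the statistics give a marginal design a substantial chance of passing, and a compliant one a substantial chance of failing.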
Of course any measurement of safety will need to be made over a long period of time, and it is never practicable to consider a quantified safety demonstration as part of a development programme. The best that can be done is to demonstrate that individual safety-critical failures cannot occur during the operating life of the system. This is the normal procedure, for example, in design and test of structural components subject to fatigue, such as pressure vessels, aircraft structures, and aircraft engine components. If safety-critical failures occur in service, corrective action is always taken, making any “measurement” based on data up to that time meaningless for forecasts of future behaviour.
Experts
One result of the emphasis on quantification of reliability and safety has been the growth in the number of “experts”, in industry, consulting and academia. These people inevitably find it difficult to let go of the geese that lay their golden eggs, and therefore defend the development and use of quantitative techniques and “models”. There has been for several years an angry controversy over MIL-HDBK-217, with some people and organisations condemning its use, while its sponsors and others denounce the critics and their arguments. Such controversies do not surround other aspects of engineering, where science and logic provide the criteria of acceptability of predictions and measurements.
The Right Way
Since it is necessary to attempt to predict the reliability and safety of new engineering systems, we must use methods that take account of the practical engineering and human realities involved. Foremost amongst these is to recognise and indicate the very wide uncertainty in such predictions. Any prediction that presents results to several significant figures is clearly unrealistic in its claims to accuracy and precision, and betrays lack of understanding of the wide uncertainties involved.
Predictions must be based upon intent. Since failures are ultimately caused by people, the ways in which they are managed, as designers, test engineers, producers, users and maintainers, will be the major determinant of reliability and safety. Coupled to this are other management aspects such as the contractual or other motivations involved and the investment in reliability and safety analysis and test.
Safety and reliability predictions must take account of technical risks, such as the use of new components or materials and operation at higher stresses or in different environments. Technical risks always widen the extent of uncertainty of the predictions. Reliability and safety predictions should be updated as experience is developed and as improvements are made. Reference 3, the recently updated UK Ministry of Defence guidelines on reliability prediction, provides a rational and practical framework for reliability prediction, and its principles can be extended to safety aspects.
In generating reliability improvement, a reliability demonstration test is never as effective as a test planned to accelerate the onset of as many failures as possible. Measuring reliability in use can be a more useful exercise than doing so during development, since the true operating conditions are applied, and the design will have been completed. The measured reliability can be used to identify the need for improvement and to refine logistics. However, it is still important that attention is focused on causes of failure, rather than on measurement, since the measurement provides no information on which to base improvements.
References
- P.D.T. O’Connor, Reliability Prediction: Help or Hoax? Solid State Technology, August 1990.
- J.D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application. McGraw-Hill, 1987.
- UK Defence Standard 00-41, Issue 3, 1993.
