Patrick D.T. O'Connor

Engineering Management, Reliability, Test, Safety

Welcome to my Homepage.


Please note that I am no longer providing consulting or training. I am very sorry that my homepage died some years ago. However, I am now recovered from the problems that I faced back then and hope that this reduced version will enable us to be in contact again.

The main drivers of my teaching on engineering management, including quality, reliability and safety, are:

– Dilbert is far wiser than most management “gurus”. He is also experienced.
– Quality, reliability and safety are driven by good management and good engineering. Maths and stats play a very minor part.
– Standards like ISO9000 and others for reliability and safety are misleading and dangerous.

I hope that you will find the information and inspiration you seek. Please contact me if you think that I can help in any way.

  e-mail: pat@pdtoconnor.uk

See my CV (will appear at end of page)


LATEST

A Grain of Sand. An introduction to the wonders of science. Published on Amazon 2024. See below.
The God Perception. What do people believe? Why do people believe? Amazon 2025. See below.

Old reliability engineer goes crazy! Sent away for incredible treatment! Read about it!
Walter Minion’s Therapy. See below.

The 6th edition of Practical Reliability Engineering was published in 2025.

Test Engineering was published in 2001: the only book that covers the whole engineering and management spectrum of testing.

The New Management of Engineering was published in 2004, and is now re-published by Amazon/Kindle. It is the only book that maps the teaching of Peter Drucker on to the practice of engineering management.

  
 BOOKS, JOURNALS AND PAPERS

This section describes the books I have written, the journal Quality and Reliability Engineering International, and other articles.

A Grain of Sand. Amazon/Kindle 2024
An introduction to the wonders of science, from the Big Bang to human life, with emphasis on the connectedness of nature and the mysteries of creation.

The God Perception. Amazon/Kindle 2025
What do humans believe about gods? How does science affect belief? Why do people believe? The book provides answers.

The New Management of Engineering.  Amazon/Kindle 2022
Managing engineering is more difficult, more demanding and more important than any other activity in modern society. The book explains how the principles of Peter Drucker’s “New Management” should be applied to the human and technological aspects of engineering. It provides fresh insights into the management of design, test, manufacture and use. It explains the poverty of some of the ideas that dominate much of modern management.
 It is the only book on the subject that truly reflects the realities of engineering and explains how world-class engineering companies operate.

Practical Reliability Engineering. Patrick D.T. O’Connor and Andre Kleyner (John Wiley, sixth edition 2025). 

The only book that treats reliability as essentially an engineering and management subject. Probably the world’s most popular book on the subject, and now further updated and expanded. Andre Kleyner provided descriptions of reliability analysis software and other updates. The book covers all of the requirements of the ASQ examination for Certified Reliability Engineer. An answers manual for the student questions is available from the publisher.

Test Engineering. John Wiley 2001 

Testing is an essential, expensive and difficult part of engineering design, development, manufacture and support. Yet it is rarely taught as part of engineering training, it is ignored in books on engineering management, and until now there have been no books that cover testing philosophy, methods, technology aspects, economics, and management. This new book is the first to do so. It emphasises an integrated, multidisciplinary approach, and the use of highly accelerated stress testing.

In My Humble Opinion

A collection of my writings, including editorials, book reviews, papers and other pearls of wisdom.

Walter Minion’s Therapy

A combination of black comedy, allegory, high adventure, and exploration of human relationships in a crazy world.

Walter Minion is an archetypical retired engineer. His life is gentle and domesticated. But gradually little stresses mount, culminating in his bizarre attempt at suicide. He is incarcerated in a mental institution where he is treated in inexplicable ways by strange characters. His enforced therapy is to undertake a voyage around the world, alone, on a small yacht. In mid-Atlantic a wondrous girl swims alongside and climbs aboard. Spray is feminine perfection, but Walter discovers that she is also weirdly unnatural.
The two are caught up in a series of wild adventures as they cross oceans, make exotic landfalls and survive terrifying dangers. They encounter more people: good, evil and mad. Throughout their odyssey Spray holds out tantalising prospects of submission and erotic rapture. Walter wrestles with the emotional conflicts that ensue, and with humanity’s stupidity and hatred, beauty and love. Walter’s therapy sails to a surprising finale, a counterpoint of tragedy and joyful triumph.
But what is reality, and what is delusion?

I have also written a prequel, Walter Minion’s Secret Life, which is his story leading up to his going mad. He works on nuclear weapons, is involved in international skulduggery, and other life-changing episodes.

ALL MY BOOKS ARE AVAILABLE ON AMAZON.COM. Click on the titles.

 
Quality and Reliability Engineering International. Editors Aarnout C. Brombacher, Douglas Montgomery and Loon Ching Tang. John Wiley and Sons Ltd.
The bimonthly journal that links quality and reliability engineering, with the emphasis on practical application and modern technology. Includes reviews, special issues, events calendar, news digest, and more.
I was UK Editor until September 1999. Past issues contain many of my editorials and reviews.

Encyclopaedia Chapter:

Quality and Reliability Engineering. Encyclopaedia of Physical Science and Technology (Academic Press).

Papers: (will appear at end of page when clicked)

IEC/ISO61508: Letter on the new standard on electronics/software safety, published in IEEE Spectrum Aug 2000. Read it? IEC61508

Reliability Past, Present and Future. Paper published in IEEE Trans Reliability, 2001. Read it? Reliability 2000

Is scientific management dead? An article on the adverse effects of reliance on “scientific” methods, typified by ideas such as business process re-engineering (BPR), ISO9000, MBAs, etc. It has not been published, because journals like Harvard Business Review and Management Today would not accept it. (Academic journals are reluctant to publish opinions that clash with the accepted wisdom). Read it? smdead

Standards in reliability and safety engineering. Reliability Engineering and System Safety (1998). Read it? standards

ISO9000: help or hoax? Quality World (1991). Read it? iso9000

Quantifying uncertainty in reliability and safety studies. Society of Reliability Engineers Symposium, Arnhem, 1993 (keynote paper). Read it? quantifyinguncertainty

Achieving World Class Quality and Reliability: Science or Art? Quality World (1993). Read it? Q&R Science or Art?

Quality and reliability: illusions and realities. Quality and Reliability Engineering International, vol. 9 163-168 (1993).

Statistics in quality and reliability: lessons from the past and future opportunities. Reliability Engineering and System Safety, vol. 34 23-33 (1991).

Reliability prediction: help or hoax? Solid State Technology (August 1990).

Reliability prediction: state of the art review. IEEE Proc. vol. 133 Part A no. 4 (1986). (With L.N. Harris).

Effectiveness of formal reliability programmes. Quality and Reliability Engineering International, vol. 1 19-22 (1985).

Microelectronic systems reliability prediction. IEEE Trans Reliab. (USA) (April 1983).

Royal Air Force aero-engine logistics model. NATO conference on organisation of logistics systems, Luxembourg, 1972. (With J. Hough).

Quantifying Uncertainty

Return to books and articles

Originally published in Microelectron. Reliab., Vol. 35, Nos 9-10, pp. 1347-1356, 1995.
Copyright Elsevier Science Ltd.

QUANTIFYING UNCERTAINTY IN RELIABILITY AND SAFETY STUDIES

Patrick D.T. O’Connor

Abstract

The paper discusses the extent to which the quantification of reliability and safety of engineering systems can be considered to be credible and useful, and the reasons why the standard methods normally used can be misleading. The problem is discussed in relation to prediction and to measurement, for hardware, software and human aspects of engineering systems. Suggestions are made for improving the credibility and value of reliability and safety predictions and measurements.

Introduction

It is often necessary to try to quantify reliability. For new system designs we need to know whether requirements for reliability and safety will be achieved. Other factors, such as warranty costs and requirements for spares and test equipment, depend on reliability. Contracts often stipulate that reliability must be predicted and demonstrated quantitatively, and the methods to be used may be specified.

There is an increasing tendency towards quantification of safety hazards, the objectives being to show that a proposed system meets criteria for safety, or to determine a “safety benefit” expressed as hazard reduction per unit of expenditure.

All of these efforts are of course well-intentioned. They are seen to be justified by Lord Kelvin’s aphorism that, if we cannot express a concept numerically, our knowledge of it is of a “poor kind”. They also reflect the fact that engineers have had scientific educations, and are therefore intuitively unhappy with concepts that are not subject to rules and amenable to quantification.

Failures and Physics

To quantify the reliability or safety of an engineering system, we must obviously predict or count the number of times it is likely to fail or has failed in ways that affect reliability or safety. However, reliability and safety are not concepts like mass or power that can be predicted using known laws, or measured with instruments. When we say that a product weighs X kilograms and delivers Y watts there is no ambiguity or uncertainty, beyond the normally expected and small variations inherent in engineering products and applications. However the mass and power have been predicted or measured, we know that all nominally identical products will have the same properties.

We also understand cause and effect relationships in relation to such physical parameters. We can use this knowledge to change the design in order to reduce weight or to increase power. A fundamental part of this knowledge is the universal agreement on definitions of the units concerned: all kilograms are the same as other kilograms. The effect of one is the same as the effect of any other. Therefore the rules of mathematics can be applied to them, and the results make sense and are believable.

In marked contrast, system failures are extremely diverse in kind and in their effects. It is usually difficult to have agreement on how to classify the kinds of events that relate to reliability and safety. For example, is a leaking hydraulic hose a failure or not? If it is, when did the leak become bad enough to be classed as a failure? We all know when a light bulb fails or a car fails to start. However, if a microprocessor chip behaves erratically at -40° C this would be classified as a failure if it is tested at this temperature, but would not be apparent if the chip were used in a PC in the office. In other words, failures, whether predicted or reported, are to a large extent the results of human perceptions and interpretations.

Causes of Failure

Failures of engineering systems have many causes. At the level of the individual components, such as electronic devices, springs, bearings, etc., failure can be caused by overstress (electrical, mechanical, thermal, etc.), by wearout mechanisms such as material fatigue, wear and corrosion, and by changes in properties due to age or use, such as drift in electronic component parameter values or loss of lubrication between moving parts. Failures can also occur as a result of the inherent variation that exists in parameter values and dimensions due to variation of production processes and tolerances. Variation also exists in the conditions in which systems operate. Parametric and environmental variations can have individual and combined effects that lead to failure.

Failures often result from effects at the system level, not involving actual component failure. Failures due to variation and tolerance effects are often in this category. Other causes are electromagnetic interference and timing problems in digital electronic systems, operator or maintenance errors, and incorrect diagnosis by test software.

A large proportion of the failures of components and systems is caused by shortfalls in the quality of manufacture, assembly and test. Such failures may or may not be related to the design. Again there is a great range of possibilities of such failures in modern systems, for example a defective solder joint that breaks open in service, a defect in a casting leading to early fatigue failure, and an electrical cable damaged during assembly.

Problems of Reliability Prediction

Predicting reliability is clearly a much more uncertain exercise than predicting physical aspects such as mass and power. We must take account of the great range of possible causes of failure, at the levels of parts, subsystems and system. We must take account of human aspects such as interpretation of events, human and process errors in production, and operation and maintenance influences. The rate of occurrence of failures due to such disparate and uncertain causes is inevitably subject to wide uncertainty, as experience shows.

Despite this, standard methods have been developed for reliability prediction. These are all based on variations of the “parts count” approach, which assumes that the reliabilities of all of the components of a system are known, and that the system reliability is the product of the component reliabilities (or the failure rate is the sum of the component failure rates). The most widely applied method is US Military Handbook 217 (Reliability Prediction for Electronic Equipment), though other methods and sources of “failure data” are also available. Methods such as MIL-HDBK-217 are grossly misleading and inadequate for several reasons. They claim to enable us to predict, for example, the failure rate per million hours of a diode while being fired from a cannon, or the failure rate per million hours of a mirror. Reference 1 deals specifically with the shortcomings of MIL-HDBK-217, but the main failings of all such methods are:

  1. They assume that all system failures are caused by component failures, and that all component failures cause system failures.
  2. They assume that failure data from the past will represent the future reliability, that is, they ignore the often very significant improvements in component reliability brought about by modern design and processes.
  3. They assume that all components have failure rates that are constant in time. This is particularly inappropriate for most mechanical components.
  4. They ignore the effects of quality of design and production, in relation to aspects such as protection, tolerancing, and quality control.
  5. They assume that failure is an inherent property of every component.

Of course every practical engineer knows that such assumptions are grossly incorrect for most components and systems. It is therefore no surprise that the methods that have been developed, and which are still being perpetrated (most notably the IEC Reliability Committee TC56 is preparing an international standard on electronics reliability prediction, based on these methods), generate “predictions” that turn out to be orders of magnitude in error in either direction, and in relation to the causes of failure experienced.

The methods are often defended on the ground that they “provide a basis for comparison of designs”. This is analogous to using a random collection of rocks as a standard for mass, or sensation as a standard for electric potential.

There are some situations in which the “parts count” or “block diagram” method can be applied with reasonable credibility. The criteria for such applications must be that the failure data on the components is accurate, and that there is good reason to believe that the rate and pattern of failures in future will be the same as in the past, or that the effect on reliability of any changes of design or application can be forecast with confidence. Again, practical engineers know very well how uncertain forecasts can be even under such circumstances. A recent example of such forecasting is the assessment of multiple engine shutdown on new commercial aircraft, in order to justify certification of twin engine aircraft for extended overwater flights. The engines and their application were the same as on earlier three and four engine aircraft, and there was considerable data on past operations.
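The parts-count arithmetic criticised above reduces to a few lines of calculation: sum the assumed constant component failure rates, or equivalently multiply the component reliabilities. The sketch below illustrates the mechanics only; the component names and failure rates are invented for illustration, and the constant-rate and series-system assumptions are precisely the ones the text criticises:

```python
# Parts-count reliability prediction (the method criticised above).
# Assumes every component has a constant failure rate and every
# component failure fails the system; rates below are hypothetical.
import math

# Assumed failure rates per million hours (illustrative values only)
failure_rates = {
    "microprocessor": 0.8,
    "capacitor": 0.05,
    "connector": 0.3,
    "solder_joints": 1.2,
}

# System failure rate = sum of component failure rates (series system)
system_rate = sum(failure_rates.values())  # ~2.35 per million hours

# Reliability over a mission time t (hours), under the constant-rate
# (exponential) assumption: R(t) = exp(-lambda * t)
t = 1000.0
reliability = math.exp(-system_rate * 1e-6 * t)

# Equivalently, the product of the individual component reliabilities
product = 1.0
for lam in failure_rates.values():
    product *= math.exp(-lam * 1e-6 * t)

print(system_rate, reliability)
```

The precision of the arithmetic is exactly what makes such figures seductive: the sum and the product agree to many decimal places, while the input "data" may be orders of magnitude in error.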

Software

Software forms part of most modern engineering systems. Therefore predicting and measuring its contribution to reliability and safety has exercised the minds of system and software engineers and reliability academics. Reference 2 is a well known text on the subject.

Software is not prone to the causes of failure that affect hardware. It cannot degrade or break, and there is no variation from copy to copy. System failures occur when the software contains errors, which cause the system to malfunction under particular operating conditions. The rate at which such events occur depends upon the nature of the errors and the ways in which the system is used. Errors are created by the people involved in specifying, designing and coding. If a failure occurs its recurrence can be prevented by changing the software. Therefore reliability predictions or measurements based upon quantifying the numbers of errors or of failures during particular tests or use cannot be credible indicators of future reliability. Despite this, there is considerable literature on the topic, with various “experts” propounding and comparing different “models” for software reliability. The wide disparity of approaches and the fact that the methods proposed have not been taught or applied in mainstream software engineering indicate their poverty and lack of logical basis.

Human Error

Attempts have been made to generate databases of human error probabilities. Tables have been published giving figures such as “probability of misreading a dial” and “probability of misinterpreting an indicator”. Apart from the obvious question of what is the difference, such numbers imply that there is one kind of dial or indicator, one type of person, and one set of operating and use conditions. Such gross oversimplifications betray the lack of logic and scientific basis for such figures, and lead to justified scepticism and mistrust.

Predicting Safety

Systems are designed, built and operated to ensure that attendant hazards are kept as low as possible. Whilst users can tolerate failures every few hours or days for an aircraft or a nuclear power station, society expects that crashes and radioactive discharges will occur at far lower rates, bordering on never. In order to predict the likelihood of such rare events, “quantitative risk assessment” (QRA) methods have been developed. These are based on the techniques used for reliability prediction, with added refinements to take account of multiple failures, failures due to external events, and common cause failures. QRA deals with probabilities of the order of 10^-6 to 10^-10, and even lower, per year or other unit of exposure. Predicting the rate of occurrence of such rare events is obviously a highly uncertain exercise, particularly as they are really supposed never to occur. It is not surprising that when such events do occur, the causes are nearly always due to failures or combinations of failures that were not predicted, thus calling into question the value and credibility of the prediction.

Safety predictions and QRA include methods such as fault tree analysis and cause-consequence analysis to attempt to take account of multiple failures, human error, and external factors. Whilst these methods can be useful in helping to identify possible causes of hazards, it is important that undue attention to the quantitative aspects does not mislead. An interesting example of this is the recent certification of twin-engine passenger aircraft, described earlier. In analysing this type of situation it is essential that single engine failures that can cause an accident, such as the Amsterdam crash of the Boeing 747, are taken into account: the failure caused an accident because it led to the loss of two engines, but the same failure on a twin-engine aircraft would probably not have resulted in an accident.
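The quantitative side of a fault tree reduces to simple probability arithmetic over AND and OR gates, as the sketch below shows. The event probabilities are invented, and the independence assumption built into the gate formulae is itself one of the oversimplifications discussed above: a single common cause, like the engine failure in the Amsterdam crash, invalidates the multiplication at the AND gate:

```python
# Minimal fault-tree gate arithmetic as used in QRA.
# AND gate: all inputs must occur -> product of probabilities.
# OR gate: any input may occur -> complement of "none occur".
# Both formulae assume independent events; common-cause failures
# (one fault disabling several "independent" branches) violate this.

def and_gate(probs):
    p = 1.0
    for q in probs:
        p *= q
    return p

def or_gate(probs):
    p_none = 1.0
    for q in probs:
        p_none *= (1.0 - q)
    return 1.0 - p_none

# Hypothetical per-hour probability of an engine failure, and of an
# unrelated external hazard (values invented for illustration)
p_engine = 1e-5
p_both_engines = and_gate([p_engine, p_engine])  # ~1e-10 if independent

p_top = or_gate([p_both_engines, 2e-9])  # external event ORed in
print(p_both_engines, p_top)
```

Note how quickly the AND gate manufactures probabilities of the order 10^-10 from inputs that are themselves uncertain, which is why such "never" figures deserve the scepticism expressed above.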

Measuring Reliability and Safety

We can measure historic reliability and safety, and express them quantitatively. Methods have been developed and standardised for demonstrating reliability, the best known being MIL-STD-781. Such methods are subject to much the same criticism as the predictive techniques based on parts-level “data”. Further criticisms are:

  1. How should failures be counted if corrective action is taken to prevent them or to reduce their occurrence?
  2. How should we count failures that have not yet occurred, but which are very likely or certain to occur in future?

Of course any measurement of safety will need to be made over a long period of time, and it is never practicable to consider a quantified safety demonstration as part of a development programme. The best that can be done is to demonstrate that individual safety-critical failures cannot occur during the operating life of the system. This is the normal procedure, for example, in design and test of structural components subject to fatigue, such as pressure vessels, aircraft structures, and aircraft engine components. If safety-critical failures occur in service, corrective action is always taken, making any “measurement” based on data up to that time meaningless for forecasts of future behaviour.

Experts

One result of the emphasis on quantification of reliability and safety has been the growth in the number of “experts”, in industry, consulting and academia. These people inevitably find it difficult to let go of the geese that lay their golden eggs, and therefore defend the development and use of quantitative techniques and “models”. There has been for several years an angry controversy over MIL-HDBK-217, with some people and organisations condemning its use, while its sponsors and others denounce the critics and their arguments. Such controversies do not surround other aspects of engineering, where science and logic provide the criteria of acceptability of predictions and measurements.

The Right Way

Since it is necessary to attempt to predict the reliability and safety of new engineering systems, we must use methods that take account of the practical engineering and human realities involved. Foremost amongst these is to recognise and indicate the very wide uncertainty in such predictions. Any prediction that presents results to several significant figures is clearly unrealistic in its claims to accuracy and precision, and betrays lack of understanding of the wide uncertainties involved.

Predictions must be based upon intent. Since failures are ultimately caused by people, the ways in which they are managed, as designers, test engineers, producers, users and maintainers, will be the major determinant of reliability and safety. Coupled to this are other management aspects such as the contractual or other motivations involved and the investment in reliability and safety analysis and test.

Safety and reliability predictions must take account of technical risks, such as the use of new components or materials and operation at higher stresses or in different environments. Technical risks always widen the extent of uncertainty of the predictions. Reliability and safety predictions should be updated as experience is developed and as improvements are made. Reference 3, the recently updated UK Ministry of Defence guidelines on reliability prediction, provides a rational and practical framework for reliability prediction, and its principles can be extended to safety aspects.

In generating reliability improvement, a reliability demonstration test is never as effective as a test planned to accelerate the onset of as many failures as possible. Measuring reliability in use can be a more useful exercise than doing so during development, since the true operating conditions are applied and the design will have been completed. The measured reliability can be used to identify the need for improvement and to refine logistics. However, it is still important that attention is focused on causes of failure, rather than on measurement, since the measurement alone provides no information on which to base improvements.

References

  1. P.D.T. O’Connor, Reliability Prediction: Help or Hoax? Solid State Technology, August 1990.
  2. J.D. Musa, A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application. McGraw-Hill, 1987.
  3. UK Defence Standard 00-41 (Issue 3 1993).

Return to books and articles