Patrick D.T. O'Connor

Engineering Management, Reliability, Test, Safety

Welcome to my Homepage.


Please note that I am no longer providing consulting or training. I am very sorry that my homepage died some years ago. However, I have now recovered from the problems that I faced back then, and I hope that this reduced version will enable us to be in contact again.

The main drivers of my teaching on engineering management, including quality, reliability and safety, are:

– Dilbert is far wiser than most management “gurus”. He is also experienced.
– Quality, reliability and safety are driven by good management and good engineering. Maths and stats play a very minor part.
– Standards like ISO9000 and others for reliability and safety are misleading and dangerous.

I hope that you will find the information and inspiration you seek. Please contact me if you think that I can help in any way.

  e-mail: pat@pdtoconnor.uk

See my CV (will appear at end of page)


LATEST

A Grain of Sand, an introduction to the wonders of science. Published on Amazon 2024. See below.

Old reliability engineer goes crazy! Sent away for incredible treatment! Read about it!
Walter Minion’s Therapy. See below.

The 6th edition of Practical Reliability Engineering will be published in early 2024.

Test Engineering was published in 2001: the only book that covers the whole engineering and management spectrum of testing.

The New Management of Engineering was published in 2004, and is now re-published by Amazon/Kindle. It is the only book that maps the teaching of Peter Drucker on to the practice of engineering management.

  
 BOOKS, JOURNALS AND PAPERS

This section describes the books I have written, the journal Quality and Reliability Engineering International, and other articles.

A Grain of Sand. Amazon/Kindle 2024

An introduction to the wonders of science, from the Big Bang to human life, with emphasis on the connectedness of nature, and the mysteries of creation.

The New Management of Engineering. Patrick D.T. O’Connor. Amazon/Kindle 2022.

Managing engineering is more difficult, more demanding and more important than any other activity in modern society. The book explains how the principles of Peter Drucker’s “New Management” should be applied to the human and technological aspects of engineering. It provides fresh insights into the management of design, test, manufacture and use. It explains the poverty of some of the ideas that dominate much of modern management.
 It is the only book on the subject that truly reflects the realities of engineering and explains how world-class engineering companies operate.

Practical Reliability Engineering. Patrick D.T. O’Connor and Andre Kleyner (John Wiley, Fifth edition 2012). 

The only book that treats reliability as essentially an engineering and management subject. Probably the world’s most popular book on the subject, and now further updated and expanded. Andre Kleyner provided descriptions of reliability analysis software and other updates. The book covers all of the requirements of the ASQ examination for Certified Reliability Engineer. An answers manual for the student questions is available from the publisher.

Test Engineering.  Patrick D.T. O’Connor. John Wiley 2001. 

Testing is an essential, expensive and difficult part of engineering design, development, manufacture and support. Yet it is rarely taught as part of engineering training, it is ignored in books on engineering management, and until now there have been no books that cover testing philosophy, methods, technology aspects, economics, and management. This new book is the first to do so. It emphasises an integrated, multidisciplinary approach, and the use of highly accelerated stress testing.

In My Humble Opinion

A collection of my writings, including editorials, book reviews, papers and other pearls of wisdom.

Walter Minion’s Therapy

A combination of black comedy, allegory, high adventure, and exploration of human relationships in a crazy world.

Walter Minion is an archetypical retired engineer. His life is gentle and domesticated. But gradually little stresses mount, culminating in his bizarre attempt at suicide. He is incarcerated in a mental institution where he is treated in inexplicable ways by strange characters. His enforced therapy is to undertake a voyage around the world, alone, on a small yacht. In mid-Atlantic a wondrous girl swims alongside and climbs aboard. Spray is feminine perfection, but Walter discovers that she is also weirdly unnatural.
The two are caught up in a series of wild adventures as they cross oceans, make exotic landfalls and survive terrifying dangers. They encounter more people: good, evil and mad. Throughout their odyssey Spray holds out tantalising prospects of submission and erotic rapture. Walter wrestles with the emotional conflicts that ensue, and with humanity’s stupidity and hatred, beauty and love. Walter’s therapy sails to a surprising finale, a counterpoint of tragedy and joyful triumph.
But what is reality, and what is delusion?

I have also written a prequel, Walter Minion’s Secret Life, which is his story leading up to his going mad. He works on nuclear weapons, is involved in international skulduggery, and other life-changing episodes.

ALL OF MY BOOKS ARE AVAILABLE ON AMAZON.COM. Click on the titles.

 
Quality and Reliability Engineering International. Editors Aarnout C. Brombacher, Douglas Montgomery and Loon Ching Tang. John Wiley and Sons Ltd.
The bimonthly journal that links quality and reliability engineering, with the emphasis on practical application and modern technology. Includes reviews, special issues, events calendar, news digest, and more.
I was UK Editor until September 1999. Past issues contain many of my editorials and reviews.



Encyclopaedia Chapter:

Quality and Reliability Engineering. Encyclopaedia of Physical Science and Technology (Academic Press).

Papers: (will appear at end of page when clicked)

IEC/ISO61508: Letter on the new standard on electronics/software safety, published in IEEE Spectrum Aug 2000. Read it? IEC61508

Reliability Past, Present and Future. Paper published in IEEE Trans Reliability, 2001. Read it? Reliability 2000

Is scientific management dead? An article on the adverse effects of reliance on “scientific” methods, typified by ideas such as business process re-engineering (BPR), ISO9000, MBAs, etc. It has not been published, because journals like Harvard Business Review and Management Today would not accept it. (Academic journals are reluctant to publish opinions that clash with the accepted wisdom). Read it? smdead

Standards in reliability and safety engineering. Reliability Engineering and System Safety (1998). Read it? standards

ISO9000: help or hoax? Quality World (1991)
 Read it? iso9000

Quantifying uncertainty in reliability and safety studies. Society of Reliability Engineers Symposium, Arnhem, 1993 (keynote paper). Read it? quantifyinguncertainty

Achieving World Class Quality and Reliability: Science or Art? Quality World (1993)
 Q&R Science or Art?

Quality and reliability: illusions and realities. Quality and Reliability Engineering International, vol. 9 163-168 (1993).

Statistics in quality and reliability: lessons from the past and future opportunities. Reliability Engineering and System Safety, vol. 34 23-33 (1991).

Reliability prediction: help or hoax? Solid State Technology (August 1990).

Reliability prediction: state of the art review. IEEE Proc. vol. 133 Part A no. 4 (1986). (With L.N. Harris).

Effectiveness of formal reliability programmes. Quality and Reliability Engineering International, vol. 1 19-22 (1985).

Microelectronic systems reliability prediction. IEEE Trans Reliab. (USA) (April 1983).

Royal Air Force aero-engine logistics model. NATO conference on organisation of logistics systems, Luxembourg, 1972. (With J. Hough).
 
 
 
 
 
 
    

Reliability 2000


RELIABILITY PAST, PRESENT AND FUTURE*

(*This paper was published in the IEEE Transactions on Reliability, 2001)

Patrick D.T. O’Connor

Abstract

The paper reviews the nature of reliability in relation to the causes of failures of engineering products, explains how most of the methods that have been developed and applied by reliability and quality specialists have been misleading and ineffective, and makes suggestions for the way ahead.

Keywords: failures, reliability, quality, variation, ISO9000, engineering management

INTRODUCTION

In the final paragraph of my book “Practical Reliability Engineering”, originally published in 1981 (Reference 1), I wrote:

“It is notable that no undemocratic state has been able to make any significant contribution to the reliability and quality revolution. Quality represents the essence of freedom – freedom to make decisions at work and as a customer. Centralised bureaucratic state systems do not allow this freedom. The new techniques for controlling the reliability of design and quality of production enable us to produce complex but reliable products, but the techniques are dependent for their success on the motivation that comes only with personal freedom”.

Not long afterwards the democracies eventually won the Cold War, proving once and for all that attempts to regulate human behaviour by dogma and coercion fail by comparison with the liberation of human talent and motivation that is inherent in free market economies. Peter Drucker, in 1955 (Reference 2), had explained the poverty of “scientific management”, the doctrine taught by Frederick Taylor, which had provided the basis for demarcation at work, the alienation of workers from managers, and the growth of trades unionism. Scientific management relegated humans at work to the status of automatons (albeit not necessarily badly treated in a physical sense) and attributed the capability to plan and decide only to “managers”. Marxism germinated and grew on the ground prepared by Taylorism. It has always puzzled me that Karl Marx is included in writings on Western philosophy. Marxism was a socio-political argument that could be imposed and maintained only by force, not a philosophy. Now we all appreciate how counterproductive and dangerous these insidious ideas were: they appealed to those who sought power and who did not realise or acknowledge the inventive and productive spirit that exists in all human intellect. Drucker taught that all workers are managers, and that the role of higher management is to develop and mobilise these talents for the benefit of the firm. Drucker’s “new management” liberated workers at all levels. Later Deming (Reference 3) based his teaching on quality and productivity on Drucker’s ideas.

So now we all know that regulation and control of people at work, whether imposed by the state or by management, is discredited as a failed philosophy. We know this, but do we live accordingly? Or do regulation and scientific management still influence our thinking? Taylor’s legacy is still manifest in a number of ways that are counterproductive in relation to the performance of people and businesses (Reference 4). This is particularly so in the field of quality and reliability, as I will explain later in this article.

WHAT IS RELIABILITY?

The commonsense perception of reliability is the absence of failures. We know that failures have many different causes and effects, and there are also different perceptions of what kinds of events might be classified as failures. The burning O-ring seals on the Space Shuttle booster rockets were not classed as failures, until the ill-fated launch of Challenger. We also know that all failures, in principle and almost always in practice, can be prevented.

There are three kinds of engineering product, from the perspective of failure prevention:

1. Intrinsically reliable components, which are those that have high margins between their strength and the stresses that could cause failure, and which do not wear out within their practicable lifetimes. Such items include nearly all electronic components (if properly applied), nearly all mechanical non-moving components, and all correct software.

2. Intrinsically unreliable components, which are those with low design margins or which wear out, such as badly applied components, light bulbs, turbine blades, parts that move in contact with others, like gears, bearings and power drive belts, etc.

3. Systems which include many components and interfaces, like cars, dishwashers, aircraft, etc., so that there are many possibilities for failures to occur, particularly across interfaces (e.g. inadequate electrical overstress protection, vibration nodes at weak points, electromagnetic interference, software that contains errors, etc.).

It is the task of design engineers to ensure that all components are correctly applied, that margins are adequate (particularly in relation to the possible extreme values of strength and stress, which are often variable), that wearout failure modes are prevented during the expected life (by safe life design, maintenance, etc.), and that system interfaces cannot lead to failure (due to interactions, tolerance mismatches, etc.). Because achieving all this on any modern engineering product is a task that challenges the capabilities of the very best engineering teams, it is almost certain that aspects of the initial design will fall short of the “intrinsically reliable” criterion. Therefore we must submit the design to analyses and tests in order to show not only that it works, but also to show up the features that might lead to failures. When we find out what these are we must redesign and re-test, until the final design is considered to meet the criterion.

Then the product has to be manufactured. In principle, every one should be identical and correctly made. Of course this is not achievable, because of the inherent variability of all manufacturing processes, whether performed by humans or by machines. It is the task of the manufacturing people to understand and control variation, and to implement inspections and tests that will identify non-conforming product.

For many engineering products the quality of operation and maintenance also influence reliability.

The essential points that arise from this brief and obvious discussion of failures are that:

  1. Failures are caused primarily by people (designers, suppliers, assemblers, users, maintainers). Therefore the achievement of reliability is essentially a management task, to ensure that the right people, skills, teams and other resources are applied to prevent the creation of failures.
  2. Reliability (and quality) are not separate specialist functions that can effectively ensure the prevention of failures. They are the results of effective working by all involved.
  3. There is no fundamental limit to the extent to which failures can be prevented. We can design and build for ever-increasing reliability.

Deming explained how, in the context of manufacturing quality, there is no point at which further improvement leads to higher costs. This is, of course, even more powerfully true when considered over the whole product life cycle, so that efforts to ensure that designs are intrinsically reliable, by good design and effective development testing, generate even higher payoffs than improvements in production quality. The “kaizen” (continuous improvement) principle is even more effective when applied to up-front engineering.

We observe the results of this practical philosophy every day. Modern complex products such as microprocessors, car and aircraft engines, spacecraft, electronic systems, etc. are extremely and increasingly reliable and economic. Their development and production are based upon recognition and application of the essential points listed above.

Reliability and Time

Reliability is sometimes referred to as “quality in the time dimension”, since it is determined by the failures that do or do not occur during the life of the product. The most important task in reliability engineering is to look forward in order to make design, development and manufacture reliable: it is necessary to anticipate and prevent future failures. However, the problem with failures that might happen in the future is that we usually do not know what they might be, and they are the results of oversights and mistakes, which we expect (or hope) will not be made. It is usually clear what we must do in order to create the future in terms of parameters like performance and price of the next product, but thinking ahead into the unknowns and uncertainties of future failures is difficult. It can also be perceived as a negative activity: project engineering is an activity based on optimism, but failure prevention work must be based on pessimism and scepticism. (The “parts count” approach to predicting reliability implies that we do know what failures will occur, and how often. This myth will be dealt with later). Therefore reliability engineering is often perceived as unproductive and esoteric.

Of course it is also important to look at the past and the present, in order to reduce the problems of today and to provide lessons for the future. Failure analysis is part of that. However, it must not be allowed to dominate the reliability effort, otherwise we will go on reaping the same fields of weeds on new products. In fact things could get even worse, as we stretch technology, compress timescales, and compete for markets.

This point, that reliability engineering is concerned with the uncertain future whilst most other engineering management is concerned with the present or with the more predictable future, is of great philosophical and practical importance. It is not easy for managers to think long-term about reliability, especially when they are not engineers, and when their motivations are geared to short-term objectives. This is why reliability engineering is nearly always inadequately and inappropriately managed and resourced. The management dimension is crucial, and without it reliability engineering can degenerate into ineffective design analysis followed later by panic failure analysis, with minimal impact on future business. Training in reliability engineering, when given just to staff, is not very effective. Reliability philosophy and methods should always be taught first to top management, and top management must drive the reliability effort.

Reliability and Variation

All engineering parameters (strengths, electrical parameters, dimensions, etc.) are variable. So are all environmental and other operating conditions (temperatures, vibration-induced stresses, electrical load, etc.). These variations can affect reliability whenever a single parameter or condition is exceeded. Failures can also be caused by the interactions of two or more variables, such as stress and strength, or of electronic component parameters that allow a circuit to become unstable above a certain temperature.

Variation in quality and reliability engineering is usually more complex and difficult to deal with than most “natural” variation, because it seldom follows the conventionally-taught mathematical form of the s-normal distribution. The s-normal pdf has values between +∞ and −∞. Of course a machined component dimension cannot vary like this. The machine cannot add material to the component, so the dimension of the stock (which of course will vary, but not by much) will set an upper limit. The nature of the machining process, using gauges or other practical limiting features, will set a lower limit. Therefore the variation of the machined dimension will be curtailed. Only the central part might be approximately s-normal. In fact all variables, whether naturally-occurring or resulting from engineering or other processes, are curtailed in some way, so the s-normal distribution, while being mathematically convenient, is actually misleading when used to make inferences well beyond the range of actual measurements, such as the probability of meeting an adult who is one foot tall.
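A minimal sketch of this point, using hypothetical numbers for a machined dimension: an s-normal model fitted to the process assigns a small but non-zero probability to dimensions that the curtailed process can never physically produce.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative s-normal probability, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical machined shaft: nominal 10.00 mm, standard deviation 0.01 mm,
# but the stock dimension physically caps the result at 10.03 mm.
mu, sigma, upper_cut = 10.00, 0.01, 10.03

# The fitted s-normal model extrapolates a tail probability beyond 10.05 mm
p_model = 1.0 - normal_cdf(10.05, mu, sigma)   # about 2.9e-7: small, but non-zero

# In reality the curtailed process can never exceed the 10.03 mm cut-off
p_real = 0.0

print(f"s-normal model tail estimate beyond 10.05 mm: {p_model:.2e}")
print(f"actual probability beyond the physical cut-off: {p_real}")
```

The further the inference is pushed beyond the curtailment point, the larger the model's error becomes relative to reality.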

There might be other kinds of selection process. For example, when electronic components such as resistors, microprocessors, etc. are manufactured, they are all tested at the end of the production process and are then categorised and sold according to the measured values. Typically, resistors that fall within ±2% of the nominal resistance value are classified as precision resistors, and those that fall outside these limits, but within ±10%, become non-precision resistors, and are sold at a lower price. Those that fall outside ±10% are scrapped. Microprocessors are sold as, say, 166MHz, 200MHz, 400MHz, etc. devices, depending on the maximum speed at which they function correctly on test, having all been produced on the same process. The different maximum operating speeds are the result of the variations inherent in the process of manufacturing millions of transistors and capacitors and their interconnections, on each chip on each wafer. The technology sets the upper limit for the design and the process, and the selection criteria the lower limits. Of course, the process will also produce a proportion that will not meet other aspects of the specification, or that will not work at all.
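The resistor selection process above can be simulated in a few lines (the nominal value and process spread here are hypothetical). Note the consequence for anyone modelling the sold parts: the non-precision population has its centre removed, so its distribution is bimodal, not s-normal.

```python
import random

random.seed(1)
NOMINAL = 1000.0  # ohms; illustrative process centred on nominal, 3% standard deviation

def classify(r):
    """Bin a measured resistance by its deviation from nominal."""
    dev = abs(r - NOMINAL) / NOMINAL
    if dev <= 0.02:
        return "precision"      # within +/-2%: sold at the higher price
    if dev <= 0.10:
        return "non-precision"  # within +/-10% but outside +/-2%: bimodal population
    return "scrap"              # outside +/-10%: rejected

lots = [random.gauss(NOMINAL, 0.03 * NOMINAL) for _ in range(100_000)]
counts = {"precision": 0, "non-precision": 0, "scrap": 0}
for r in lots:
    counts[classify(r)] += 1

for grade, n in counts.items():
    print(f"{grade:14s}: {100 * n / len(lots):5.1f}%")
```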

The variation might be unsymmetrical, or skewed. There are mathematical pdfs that represent such distributions, such as the lognormal and the Weibull distributions. However, it is important to remember that these mathematical models still represent only approximations to the true variations, and the further into the tails that we apply them the greater will be the scope for uncertainty and error.

The variation might be multi-modal rather than unimodal as represented by distribution functions like the s-normal, lognormal and Weibull functions. For example, a process might be centred on one value, then an adjustment moves this nominal value. Backlash or hysteresis can also generate bimodal outputs. A component might be subjected to a pattern of stress cycles that varies over a range in typical applications, and to a further stress under particular conditions, for example resonance, lightning strike, etc.

The parts of the distributions of most concern to quality and reliability engineers are the extreme values in the “tails”. We are concerned by high stresses, high and low temperatures, slow processors, weak components, etc. However, this is where the data is always less frequent and more uncertain, and where conventional statistical methods are most misleading. People like life insurance actuaries, clothes manufacturers and pure scientists are interested in averages and standard deviations, as represented by the behaviour of the bulk of the data. Since most of the sample data, in any situation, will represent this behaviour, they can make credible assertions about population parameters. However, the further we try to extend the assertions into the tails, the less credible they become, particularly when the assertions are taken beyond any of the data. Engineers often have only small samples to measure or test, so that the data available on extreme values is very limited or non-existent.
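A simulation can illustrate how unreliable tail estimates from small engineering samples are. In this sketch (distribution parameters are hypothetical) the true population is skewed lognormal, but an engineer with only ten test results fits an s-normal model and extrapolates the 99.9th percentile; across repeated trials the estimates scatter widely and tend to fall well short of the true tail value.

```python
import math
import random
import statistics

random.seed(42)

def p999_normal_fit(sample):
    """Estimate the 99.9th percentile by fitting an s-normal model to the sample."""
    m = statistics.mean(sample)
    s = statistics.stdev(sample)
    return m + 3.09 * s   # z-value for the 0.999 quantile of the s-normal

# True population: lognormal (skewed, heavy upper tail), as stress data often are
mu, sigma = 0.0, 0.5
true_p999 = math.exp(mu + 3.09 * sigma)  # exact 99.9th percentile, about 4.69

estimates = []
for _ in range(1000):
    sample = [random.lognormvariate(mu, sigma) for _ in range(10)]  # only 10 tests
    estimates.append(p999_normal_fit(sample))

print(f"true 99.9th percentile:             {true_p999:.2f}")
print(f"normal-fit estimates from n=10:     "
      f"min {min(estimates):.2f}, max {max(estimates):.2f}")
```

The spread between the smallest and largest estimate, from identically drawn samples, is exactly the uncertainty "in the tails" that the paragraph above describes.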

Interaction effects can be difficult to predict, to detect and to understand. Interactions involve the tails of distributions, so the uncertainties of the effects are further increased.

Many variables can change over time. Mechanical strength can vary over time as a result of fatigue, wear or corrosion, dielectric strength can change over time and applied stress, etc. The relevant measure of “time” in any application might be hours, load cycles, distance, etc., or combinations of these. When variation changes over time the uncertainty of the distribution tails increases disproportionately. Parameter distributions can also vary batch to batch, supplier to supplier, etc.

Variation of engineering parameters is, to a large extent, the result of human performance. Factors such as measurements, calibrations, accept/reject criteria, control of processes, etc. are subject to human capabilities, judgements, and errors. People do not behave s-normally.

These are the realities of variation that matter in engineering, and they transcend the kind of basic statistical theory that is generally taught and applied. Most engineering teaching covers no more than conventional statistics, and engineers therefore tend to be uncertain about how to deal with the realities of variation and sceptical about the application of statistical methods. The use of conventional mathematical statistics to attempt to understand the nature, causes and effects of variation in engineering can be misleading.

Despite all of these reasons why conventional statistical methods can be misleading if used to describe and deal with variation in engineering, they are widely taught and used, and their limitations are hardly considered. Examples are:

  • Most textbooks and teaching on SPC emphasise the use of the s-normal distribution as the basis for charting and decision-making. They emphasise the mathematical aspects, such as probabilities of producing parts outside arbitrary 2σ or 3σ limits, and pay little attention to the practical aspects discussed above.
  • Many contributions to the literature on statistical process control contain “exact” calculations of values such as proportions outside limits, based upon the unrealistic assumption of s-normality for the processes.
  • Methods of “probabilistic design” are taught and applied, that involve precise determinations of failure probabilities for items of variable strength subjected to variable stress. These assume that the relevant distributions are known far into the tails, and they ignore the practical limitations discussed above.
  • Typical design rules for mechanical components in critical stress application conditions, such as aircraft and civil engineering structural components, require that there must be a specified factor of safety between the maximum expected stress and the lower 3σ value of the expected strength. This approach is really quite arbitrary, and oversimplifies the true nature of variations such as strength and loads, as described above. Why, for example, select 3σ? If the strength of the component were truly normally distributed, about 0.1% of components would be weaker than the 3σ value. If few components are made and used, the probability of one failing would be very low. However, if many are made and used, the probability of a failure among the larger population would increase proportionately. If the component is used in a very critical application, such as an aircraft engine suspension bolt, this probability might be considered too high to be tolerable.
  • The so-called “six sigma” approach to achieving high quality is based on the idea that, if any process is controlled in such a way that only operations that exceed plus or minus 6σ of the underlying distribution will be unacceptable, then only about one per million operations will fail. The exact quantity is based on arbitrary and generally unrealistic assumptions about the distribution functions, as described above. (“Six sigma” entails other features, such as the use of a wide range of statistical and other methods to identify and reduce variations of all kinds, and the training and deployment of specialists called “six sigma black belts”. It is not altogether a bad approach, but it is not the best, it is based to a large extent on “scientific” management thinking, and it is heavily hyped by consultants).
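The population-size argument in the factor-of-safety bullet can be made concrete. Assuming a truly normal strength distribution (the very assumption the text questions), the fraction below the 3-sigma value is about 0.00135; the chance that at least one such weak component appears grows rapidly with fleet size.

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Fraction weaker than the 3-sigma strength value, if strength really were normal
p_weak = phi(-3.0)   # about 0.00135, i.e. roughly 0.1%

# Probability that at least one such component appears in a fleet of N
for n in (10, 1_000, 100_000):
    p_any = 1.0 - (1.0 - p_weak) ** n
    print(f"N = {n:>7}: P(at least one below the 3-sigma strength) = {p_any:.4f}")
```

With ten components the risk is around 1%; with a hundred thousand it is a near certainty, which is why a fixed 3-sigma rule cannot by itself express what is "tolerable" for a critical part.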

Reliability and Quality of Production

The management and achievement of reliability cannot sensibly be divorced from production quality. It is common experience that a large proportion of the failures that we experience are caused by inadequate manufacture. For example, a missile system had a well-documented reliability in military use of 90%. 90% was also the “predicted” reliability, using the approved “models” and “data”, so no one complained. When a new production operation was started, the reliability instantly rose to over 95%. As Deming would have pointed out, 10%, then 5%, failed because they were built differently to those that worked. The failures almost certainly cost more to build than the successes. By improving build quality, reliability was increased and costs were reduced. There was no fundamental reason why quality and reliability could not have been improved even further. What was the use of the reliability prediction?

Total quality management (TQM) is the philosophy of design for production and control of production operations, based upon the ideas taught by leaders such as Shewhart (Reference 5), Deming, Ishikawa, Juran, Hutchins, Imai, and Crosby, and initially applied in Japan in the late 1950’s. In this approach, every person in the business becomes committed to a never-ending drive to improve quality and productivity. The drive must be led by top management, and it must be vigorously supported by intensive training, the appropriate application of statistical methods, and motivation for all to contribute. The total quality concept links quality to productivity. It has been the prime mover of the Japanese industrial revolution, and it is fundamental to the survival of any modern manufacturing business competing in world markets. TQM is based firmly on the “new management” of Peter Drucker.

We have seen how TQM has generated enormous gains in reliability of complex products like cars, machines and electronic devices and systems. When production quality is managed effectively complexity is not the enemy of reliability as it used to be perceived.

WRONG THINKING ABOUT RELIABILITY

Reliability Prediction

It follows from the discussion of what generates reliability that it cannot be predicted, as though it were a “parameter” of a design, in ways that are helpful or meaningful. To state, for example, that a design has an MTBF of X hours ignores the causes, consequences and costs of failures, and what can be done to reduce them. By contrast, a statement that products built to the design will weigh Y kilograms is fully meaningful and credible. The trap of attempting to quantify reliability was created when Kelvin wrote “when you can measure what you are speaking about and express it in numbers, you know something about it. When you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind”. Kelvin was right, but only because he was speaking as a scientist. Engineers are applied scientists. However, Kelvin’s logic does not apply to quality and reliability, which are the results of human behaviour and perceptions. The uncertainties inherent in human behaviour and perceptions overwhelm any mathematical “models” of reliability.

We can predict the future only if we know that the underlying conditions that created the past and present conditions will be unchanged, and that we fully understand the relevant cause-and-effect relationships. This is the case in pure and applied science, such as the voltage drop across a resistor. However, neither of these criteria holds for reliability. There are no forces of nature that constrain designers and others to repeat the mistakes of the past, or prevent them from making new ones. New products entail new technologies. The methods and “models” that have been developed and used for “predicting” reliability, such as US MIL-HDBK-217, Bellcore, etc., are fraudulent and highly misleading. What was the value of the reliability prediction of the missile mentioned above? Most industry sectors have either never used or have stopped using such methods. Despite this, some organisations and reliability “specialists” continue to apply them, and when the US DOD decided to stop relying on Military Standards for nearly all procurement, they decided to retain MIL-HDBK-217 “for guidance, until a suitable commercial equivalent is available”.

The only logically correct way to predict the likely reliability of a new product is to base the prediction on the management objectives and commitment, in relation to risks and uncertainties (Reference 6).

Further examples of the futility and error of inappropriate quantification of reliability and quality are:

  • The creation of “models” for the reliability of software, expressed as the probability of failure over time (time has no meaning in the context of software operation) or as an “error count” (errors are created by people, and different errors can have widely different consequences: the Ariane 5 launch control software contained one error, but that was enough to destroy the entire vehicle and its payload).
  • Extremely complex Markov models for the reliability and availability of systems and networks, when simple empirical formulae, or even qualitative statements, would usually be more helpful, and could actually be understood by managers and engineers.
  • An overwhelming proportion of contributions to reliability symposia and literature consists of esoteric papers that provide “exact” mathematical formulations that have little or no practical value.
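On the second point, the simple empirical formulae are often all that is needed for steady-state availability. The sketch below is illustrative only (the figures and function names are my own): it uses the familiar relation A = MTBF/(MTBF + MTTR) and combines units in series and in redundant pairs, giving answers a manager can check on the back of an envelope, with no Markov machinery.

```python
def availability(mtbf, mttr):
    """Steady-state availability of a single repairable unit."""
    return mtbf / (mtbf + mttr)

def series(*avails):
    """All units must be up: availabilities multiply."""
    a = 1.0
    for x in avails:
        a *= x
    return a

def parallel(*avails):
    """System is down only if every redundant unit is down."""
    u = 1.0
    for x in avails:
        u *= (1.0 - x)
    return 1.0 - u

a = availability(mtbf=1000.0, mttr=4.0)  # assumed figures: ~0.996 per unit
print(series(a, a))    # two units in series
print(parallel(a, a))  # duplicated (redundant) pair
```

With the assumed 1000 hour MTBF and 4 hour repair time per unit, the redundant pair is available about 99.998% of the time and the series pair about 99.2%, and the reasoning is visible to every engineer and manager involved.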

Reliability Testing

All engineering products must be tested during development to ensure that the design is correct in relation to performance, reliability, safety and other requirements. Then production items must be tested to ensure that only good ones are shipped. The logical and only effective approach to development testing for reliability (including durability and safety) is to generate failures as quickly and economically as practicable, so that product and process design weaknesses are discovered and corrected. This in turn implies that the stresses should be as high as can be applied, within the limits of the technology. For example, there is no point in testing an electronics assembly at temperatures exceeding the solder melting point. The same logic applies to production testing, with the proviso that the stresses applied must not damage good items, but only cause weak ones to fail.

This philosophy of test has been applied, for example in structural and fatigue testing and in environmental stress screening (ESS) of electronic assemblies. However, it is only recently that the logic has been fully applied, in the methods of highly accelerated life testing (HALT) and highly accelerated stress screening (HASS) developed by Hobbs (Reference 7).
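One reason higher stress pays off in conventional accelerated life testing can be seen from the widely used Arrhenius approximation for temperature-driven failure mechanisms. The sketch below is an illustration of that rule of thumb, not a law: the activation energy must itself be assumed and varies by mechanism, and HALT deliberately avoids this kind of life arithmetic altogether, using stress simply to precipitate weaknesses.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev):
    """Acceleration factor for a temperature-driven failure mechanism,
    per the Arrhenius approximation. Temperatures in degrees C."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Illustrative values: 0.7 eV activation energy (an assumption), 55 C use
# conditions versus a 125 C test.
print(round(arrhenius_af(55.0, 125.0, 0.7), 1))
```

Under these assumed values each hour at 125 °C stands in for roughly 78 hours at 55 °C, which is why raising stress, within the limits of the technology, is the economical way to force failures out into the open.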

It is an intriguing fact that the subject of testing is largely untaught on engineering degree courses, and that there are no books that cover the subject in terms of philosophy, physics, technologies, methods, economics and management. This probably explains why so much testing in industry, particularly in relation to reliability, is based upon inappropriate thinking and blind adherence to standards and traditions.

Recently I visited a company in the advanced communications sector, which was conducting long-term tests of multiple samples of expensive new production in a large environmental chamber. They explained that they were doing it “to measure the reliability”. I asked them if they had actual in-service reliability data, and they showed it to me: they already knew the reliability being achieved. I asked them if they were finding any failures on test that were different to those in service. They said no. The test was not a requirement of any of their customers. I explained that they were performing a very expensive test, and delaying shipments, to obtain zero information or improvement. Months later they were still doing the test, because “it was in their procedures”. I have come across numerous other examples of sub-optimal testing, in a wide range of industries.

Methods for “demonstrating” reliability, such as probability ratio sequential testing (PRST), the basis of US MIL-STD-781, are misleading because they imply that all failures have the same “value”, that causes of detected failures will not be removed, that no new failure causes will arise, and that the pattern of failures over time is constant. In nearly every case these implications are false. Products are tested to “measure” reliability, when the proper objective of development testing should be to find opportunities for improving reliability, by forcing failures using accelerated stresses.
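The constant-pattern assumption embedded in PRST can be made explicit. The sketch below is my own illustration of the MIL-STD-781 style of Wald sequential test, with hypothetical numbers and function names: the decision depends only on total test hours and failure count, and the arithmetic is valid only if failure times are exponential with an unchanging rate — precisely the implication criticised above.

```python
import math

def prst_decision(failures, test_hours, mtbf_good, mtbf_bad, alpha=0.1, beta=0.1):
    """Wald sequential probability ratio test for exponential failure times.
    Returns 'accept' (MTBF acceptable), 'reject', or 'continue'.
    Valid ONLY under a constant failure rate -- the assumption at issue."""
    llr = (failures * math.log(mtbf_good / mtbf_bad)
           - test_hours * (1.0 / mtbf_bad - 1.0 / mtbf_good))
    if llr >= math.log((1.0 - beta) / alpha):
        return "reject"
    if llr <= math.log(beta / (1.0 - alpha)):
        return "accept"
    return "continue"

# Hypothetical demonstration: acceptable MTBF 1000 h, rejectable MTBF 500 h,
# after 5000 failure-free test hours.
print(prst_decision(failures=0, test_hours=5000.0, mtbf_good=1000.0, mtbf_bad=500.0))
```

Note what the formula cannot see: which failures occurred, why, what they cost, or whether their causes have been removed — the information that actually drives reliability improvement.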

My forthcoming book (Reference 8) is intended to fill the need for a multidisciplinary book on testing in engineering.

Reliability Teaching and Literature

Despite all of this, nearly all reliability training at universities is provided by departments of mathematics or statistics, and the reliability literature, in journals and at conferences, is overwhelmingly mathematical.

International Standards for Reliability

ISO/IEC60300 is the international standard for “dependability”, a term that is supposed to include reliability, maintainability, availability and safety (“RAMS”). Every aspect of this family of documents reflects the kind of over-emphasis on inappropriate quantitative methods described above. The people who make up the drafting committee (IEC TC 56) and its various working groups seem to be unaware of the criticisms of these ideas, or of the fact that the companies which lead the world in reliability do not apply them.

ISO/IEC61508 is a recently released international standard for assurance of safety of systems that include electronics and software. This standard also demands the application of a wide range of inappropriate and controversial methods, including requirements for “independent” analyses of designs. Few companies involved in the creation of safety-related hardware and software seem to be aware of the new requirements, which have been written by “experts” who seem to have been divorced from the practical engineering and management realities.

ISO9000 AND MANAGEMENT OF QUALITY

The international standard for quality systems, ISO9000, has been developed to provide a framework for assessing the management system which an organisation operates in relation to the quality of the goods or services provided. The concept was developed from the US Military Standard for quality, MIL-Q-9858, which was introduced in the 1950s as a means of assuring the quality of products built for the US military services. In the ISO9000 approach, suppliers’ quality management systems (organisation, procedures, etc.) are audited by independent assessors, who assess compliance with the standard and issue certificates of registration. Today many organisations and companies rely on ISO9000 registration to provide assurance of the quality of products and services they buy, and to indicate the quality of their own products and services.

The major difference between ISO9000 and its defence-related predecessors is not in its content, but in the way that it is applied. The suppliers of defence equipment were assessed against the standards by their customers. By contrast, the ISO9000 approach relies on “third party” assessment. Certain organisations, such as the US Underwriters Laboratories (UL), the British Standards Institution (BSI), Lloyd’s Register, and several others, are “accredited” by the appropriate national accreditation services, which entitles them to assess companies and other organisations. The justification given for third party assessment is that it removes the need for every customer to perform its own assessment of all of its suppliers. However, the total quality philosophy demands close partnership between supplier and purchaser. A matter as important as quality cannot safely be left to be assessed spasmodically by third parties, who are unlikely to have the appropriate specialist knowledge, and who cannot be members of the joint supplier-purchaser team.

The other main difference is that ISO9000 is applied to every kind of product and service, and by every kind of purchasing organisation. Today, schools and colleges, consultancy practices, local government departments, and window cleaners, in addition to large companies in every industrial sector, are being forced by their customers to become registered or are deciding that registration is necessary for future business success. Some major industry sectors, notably the “big 3” US automotive companies and some US telecommunications and aerospace companies, have developed industry-specific variants (QS9000, TL9000, AS9000). It will be interesting to see how QS9000 influences the competitive position of the American automakers: their Japanese competitors have not followed this approach.

ISO9000 does not specifically address the quality of products and services. It describes, in very general and rather vague terms, the “system” that should be in place to assure quality. In principle, there is nothing in the standard to prevent an organisation from producing poor quality goods or services, so long as written procedures exist and are followed. Obviously an organisation with an effective quality system would normally be more likely to take corrective action and to improve processes and service, than would one which is disorganised. However, the fact of registration cannot be taken as assurance of quality. It is often stated that registered organisations can, and sometimes do, produce “well-documented rubbish”. An alarming number of purchasing and quality managers, in industry and in the public sector, seem to be unaware of this fundamental limitation of the standard.

The effort and expense that must be expended to obtain and maintain registration tend to engender the attitude that optimal standards of quality have been achieved. The publicity that typically goes with initial certification of a business supports this belief. The objectives of the organisation, and particularly of the staff directly involved in obtaining and maintaining registration, are directed at the maintenance of procedures and at audits to ensure that staff work to them. It becomes more important to work to procedures than to develop better ways of working.

Since its inception, ISO9000 has generated considerable controversy. Some companies and individuals question the value of the exercise: they do not see how the expensive process of preparing documentation and undergoing registration improves the quality of their products and services, and they query the benefits in relation to the high costs of compliance. The evidence is, however, variable. Some organisations have generated real improvements as a result of registration, and some consultants and registration bodies do provide good service in quality improvement.

The leading teachers of quality management all argue against the “systems” approach to quality, and the world’s leading companies do not rely on it. So why is the approach so widely used? The answer is partly cultural and partly coercion.

The cultural pressure derives from the tendency to believe that people perform better when told what to do, rather than when they are given freedom and the necessary skills and motivation to determine the best ways to perform their work. This belief stems from the concept of scientific management, as described earlier.

The coercion to apply the standard comes from several directions. For example, the UK Treasury guidelines to public purchasing bodies state that they should “consider carefully registered suppliers in preference to non-registered ones”. In practice, many agencies simply exclude non-registered suppliers, or demand that bidders for contracts must be registered. All contractors and their subcontractors supplying the UK Ministry of Defence must be registered, since the MoD decided to drop its own assessments in favour of the third party approach, and the US Defense Department has recently decided to apply ISO9000 in place of MIL-Q-9858. Several large companies, as well as public utilities, demand that their suppliers are registered. The European Community CE Mark regulations encourage ISO9000 registration.

Other malign effects of ISO9000 include the development and growth of a substantial industry of agencies, registration bodies and consultants, parasitic on productive industry. In the UK the annual direct costs of registration exceed $150 million and are growing rapidly. The “quality” literature, as represented by the journals of the major professional societies for the discipline, has become overwhelmingly devoted to ISO9000 and similar standards. Articles, training courses, etc. on traditional quality control and improvement activities such as measurement, SPC, etc. have almost disappeared. There is more advertising for ISO9000 services and training in these journals than for all other services and products combined.

Defenders of ISO9000 say that the total quality approach is too severe for most organisations, and that ISO9000 can provide a “foundation” for a subsequent total quality effort. However, the foremost teachers of modern quality management all argue against this view. (It is notable that none of these serve on the national or international committees that prepare and “update” the standard). They point out that any organisation can adopt the total quality philosophy, and that it will lead to far greater benefits than will registration to the standard, and at much lower costs. The ISO9000 approach seeks to “standardise” methods which directly contradict the essential lessons of the modern quality and productivity revolution, as well as those of the new management.

It is notable that ISO9000 is very little used in Japan, and then mainly by companies which perceive that it will provide advantages in Western markets, not because they believe that it will lead to improvements in quality. Companies that embrace TQM set standards for product and service quality, internally and from their suppliers, far in excess of the requirements of ISO9000. These are aimed at the actual quality achievements of the products and services, and at continuous improvement in these levels. Much less emphasis is placed on the “system”.

The recent changes to ISO9000 (“ISO9000/2000”) do not deal with these fundamental criticisms. They will lead to higher costs and greater controversy. Quality and reliability of products and services will not be assured or improved.

WHERE ARE THE RELIABILITY HEROES?

Names like Shewhart, Juran, Deming, Ishikawa and Crosby are recognised worldwide as contributors to quality philosophy and management. Their reputations and influence extend far beyond narrow perceptions of “quality” to the highest levels of industry and management. They all emphasised and taught practical, realistic, effective approaches. It is interesting by contrast that no similar “heroes” of reliability have emerged over the years since the discipline has been in existence. There have of course been notable contributors in specific areas, such as Shainin and Taguchi in test design and analysis, Weibull, Nelson and Crow in data analysis, Hobbs in accelerated test (mentioned earlier), and others in areas such as failure physics, etc. However, no name is associated with teaching and applying the wider philosophy of excellence, as taught by the “quality” heroes, to the upstream engineering activities of design and development, and to the higher levels of management.

We know, as Deming taught and as has been widely demonstrated, that continuous improvement in manufacturing quality (“kaizen”) leads to continuous gains in productivity and competitiveness. The potential gains from kaizen in engineering design and development are, in most cases, even greater. Some companies recognise this, but most still apply, in varying degrees, the inappropriate and sub-optimal approaches to design and development for reliability that have been described above.

Reliability needs a hero to lift the discipline out of its over-reliance on the ideas and methods that have misled and detracted from practical achievement, and which have resulted in justified scepticism and distrust from the wider engineering profession. The future must be based upon effective engineering and management applied to the whole product cycle, in addition to application of the reliability and quality engineering techniques that are effective. Therefore the hero must have a reputation that goes beyond the reliability “profession”.

THE WAY AHEAD

  • Hard as it may seem, the reliability and quality “professions” must accept that they have, in the ways discussed above, misled engineers and managers about how product reliability should be managed and achieved. We must realise and teach that reliability is achieved by excellent engineering, in the widest sense, with the objective of minimising all causes of failures. “Scientific” management of quality and reliability must be replaced by the methods that are consistent with Drucker’s new management.
  • Reliability must be managed as an integral aspect of total product design, development/test and manufacture. Since most modern engineering systems buy high proportions (typically 70% – 80%) of their failures from their sub-system and component suppliers, this integrated team approach must be extended to all key suppliers. Relying on concepts like ISO9000 provides almost no assurance in this respect.
  • In the integrated engineering approach to new product design, test and manufacture we must ensure that, as far as practicable, all variations that can affect performance, yield, reliability, durability and costs are identified, understood and controlled. At the same time we must also appreciate and teach the extent to which traditional mathematical/ statistical methods can misrepresent the true nature of variations and interactions in engineering.
  • We must eliminate the over-reliance on the “numbers game” in reliability, in relation to prediction, modelling and measurement. Contributions to symposia and journals should be subjected to tests of practical reality and applicability.
  • Reliability testing should be taught, applied and managed as an activity to stimulate failures as quickly and economically as practicable, rather than to generate statistics. This can be performed only through the application of accelerated combined stresses, not by using “typical” stresses. The methods of highly accelerated life testing (HALT) and highly accelerated stress screens (HASS) should be applied (References 7,8).
  • Reliability must be taught as an integral part of all engineering curricula, by engineering teachers, not by mathematicians. The curriculum should include manufacturing quality aspects and maintenance, and practical understanding of variation. The ASQ curricula for reliability and quality engineering already exist to provide the framework for this approach. Courses that do not cover the ASQ curricula should not be accredited.
  • We must stop the development and use of standards for reliability and quality, such as ISO/IEC60300, ISO/IEC61508 and ISO9000.
  • The main professional societies for engineering and quality should combine to eliminate the inappropriate ideas and methods that have been developed, taught and applied, and to force through the adoption of the practical, relevant methods. THEY ARE THE METHODS THAT ACTUALLY WORK.
  • We need a hero!

References:

  1. P.D.T. O’Connor: Practical Reliability Engineering, John Wiley and Sons Ltd. (3rd. edition, 1995).
  2. P.F. Drucker: The Practice of Management, Heinemann (1955).
  3. W.E. Deming: Out of the Crisis, MIT University Press (1981).
  4. P.D.T. O’Connor: The Practice of Engineering Management, John Wiley and Sons Ltd. (1985).
  5. W.A. Shewhart: Economic Control of Quality of Manufactured Product, Van Nostrand (1931).
  6. P.D.T. O’Connor: Quantifying Uncertainty in Reliability and Safety Studies, Microelectronics and Reliability Vol. 35 Nos. 9-10 pp. 1347-1356, (1995).
  7. G. Hobbs: Accelerated Reliability Engineering: HALT and HASS, John Wiley and Sons Ltd. (1999).
  8. P.D.T. O’Connor: Test Engineering, John Wiley and Sons Ltd. (2001, to be published).

Return to Books and Articles