STANDARDS IN RELIABILITY AND SAFETY ENGINEERING
Patrick D.T. O’Connor
HOW IS WORLD CLASS RELIABILITY AND SAFETY ACHIEVED?
If we think of the industries, systems and products that set the world standards for reliability and safety (R&S), those most likely to come to the minds of both the person in the street and the R&S professional are cars, commercial aircraft, electronics-based systems and electronic components. These all exist in very competitive international markets, in which the customer has a wide choice and in which innovation, improvement and performance versus price are all essential for the suppliers' survival. Each new generation of systems and products offers greater performance and complexity, at lower prices and with higher R&S.
For example, modern aircraft engines are more powerful and more economical than ever, and so reliable that twin-engined commercial jet aircraft have been certificated for long overwater flights. Modern integrated circuits such as microprocessors, memories and signal processors contain over a million separate components and tens of metres of interconnections, yet their reliability is so high that systems that use them in large quantities, such as mobile phones, computers and the huge range of other modern electronic products, are built at low cost and with very high reliability and safety. When modern electronic systems fail, the cause is very rarely the failure of one of these complex components: it is far more likely to be something much more mundane, such as a solder joint or a connector.
Why should it be necessary to describe these achievements, if they are so well known? I suggest that it is necessary in order to identify and concentrate on how such levels of excellence are attained, and to contrast the methods used with those that are applied in areas that do not achieve such levels.
These and other examples of R&S excellence are all achieved by the application of five very simple principles. These are:
1. Excellent design.
2. Excellent production.
3. Excellent support and maintenance (where appropriate).
4. Continuous improvement of design, production and support/maintenance so that excellence is driven ever upwards.
5. The realisation that such excellence and improvement actually reduce costs, by improving development and manufacturing productivity and market share. This last principle is, of course, at the heart of the teaching of the late W.E. Deming (reference 1).
HOW IS LOW R&S ACHIEVED?
The corollary of the above argument must be that relatively low R&S results from lower standards of design, production and support, and from slow or nonexistent improvement. It is well known to all practical engineers that it is expensive, time-consuming and heartbreaking to live with, or to try to improve, the reliability of a system or product that has been poorly designed or manufactured. Reliability improvement by modification of in-service systems is particularly difficult.
Low reliability is the result of failures. So are many accidents. When failures are analysed to determine their root causes, they are nearly always found to be the results of human errors. People make mistakes in design, manufacture, maintenance and use. Such mistakes are inevitable in modern engineering, because of the inherent fallibility of humans and the complexity and uncertainty of much of the work. Applying the five basic principles of excellence described above is not easy. Therefore the first responsibility of R&S management must be to reduce the possibilities of human error throughout the product cycle. Because failures are caused by people, very high R&S can be achieved only through management and leadership. This is in fact the foundation of the modern approach to total quality, used by all of the world's best companies.
THE ALTERNATIVE WAY?
At about the same time that Deming was explaining these principles in Japan, the US Department of Defense introduced a series of military standards on quality and reliability. The quality standard was MIL-Q-9858, which laid down requirements for suppliers to have in place a quality management system, to be audited by DOD quality staff. It was thought that this approach would provide greater assurance that quality products would be delivered, and would remove the need for government inspectors to inspect all delivered products. ISO9000 is the direct descendant of MIL-Q-9858: the only significant difference is the system of third-party assessment, rather than assessment by the customer.
In the reliability field, as a result of the work of the DOD Advisory Group on Reliability of Electronic Equipment (AGREE), standards were produced for reliability programme management (MIL-STD-785), reliability demonstration (MIL-STD-781), reliability prediction (MIL-HDBK-217), and other activities such as failure modes and effects analysis and testing.
These standards were widely used in defence equipment procurement, not only in the USA but throughout NATO. Their use spread beyond defence systems, and they became the models for similar standards, company procedures, textbooks and teaching worldwide. However, the Japanese, under the influence of Deming, resisted these approaches, with results that are familiar to all. It is also interesting to note that NASA, then embarking on its man-in-space and planetary exploration programmes, likewise deliberately avoided the Q&R standards, and set up its own system based on the philosophy mentioned earlier.
In the safety context, several of the methods developed for reliability analysis were adopted, and others were developed, notably fault tree analysis and HAZOPS.
THE DIFFERENCE
What then is the difference between the people-oriented philosophy taught by Deming, and the approach based on standards? The simple, fundamental difference is that the “standards” approach is based upon the principles of “scientific management”, introduced by F.W. Taylor early this century (reference 2). Scientific management teaches that people work most effectively when given specific instructions, which they must follow. It is the task of managers to produce the work instructions, and to ensure that they are followed. In effect, working people are treated in the same basic way as machines. Scientific management led to work study, demarcation of work boundaries, the mass production line, and the separation of labour and management.
It was Peter Drucker (reference 3) who exposed the error of scientific management. Drucker explained that people at work are not at all like machines. Unlike machines, people want to make contributions to how their work is planned and performed, and they want to do their jobs well. Their motivation and effectiveness are ultimately determined by the way that they are managed. In principle, there is no limit to the quality of work that people, as individuals and in teams, can produce, so that continuous improvement is a realistic objective. Drucker stressed that management’s role is not merely to encourage excellence and improvement, but to demand it. These lessons are particularly crucial in the context of engineering development work, since this represents the ultimate in modern human endeavour.
It is particularly interesting that Drucker predicted that the countries that would lead the world in industrial and economic development by the end of the century would be those that understood and applied the “new management”. Despite this, much management teaching in the West is still based on Taylor’s “scientific” principles. Engineers in particular often find it difficult to break out of this mould, since we are scientists by nature and training.
Everything that Deming taught in Japan in the 1950s is based upon Drucker’s philosophy. The Japanese quality circles movement, and Deming’s instruction in his “14 points for managers” to remove quantitative targets from people at work, are consistent with Drucker’s teaching. The evidence of the effectiveness of the Deming approach to managing quality and productivity is overwhelming. By contrast, there is considerable controversy regarding the costs and effectiveness of ISO9000 (for example, reference 4). The influence of ISO9000 and other standards is discussed in the following paragraphs.
THE INFLUENCE OF SCIENTIFIC MANAGEMENT ON QUALITY AND RELIABILITY
It is clear that the “standards” approach to quality and reliability is based on the principles of scientific management.
The ISO9000 requirement that products (and services) must be delivered according to a “system”, with which all concerned must comply, is totally in line with the “scientific” view of people at work. It neglects the fact that people who are left out of the day-to-day decisions affecting the way they work will perform below the levels they could achieve if given the freedom and responsibility to participate in planning and improvement. It implies that work is performed better if people do what they are told in written procedures, and if their managers have “visibility” of all methods and progress. Such thinking is clearly appropriate to machines, and can be appropriate when applied to simple mechanistic human processes such as maintaining calibration records. However, it is far from being the best way to manage knowledge-based work, as Drucker explained.
Of the many damaging effects of the standards-based approach to quality, two should be emphasised. The first is that it generates very low expectations. We are all familiar with the ballyhoo that surrounds the award of an ISO9000 certificate, with presentations by dignitaries, press announcements, and statements about commitment to quality. Yet in fact the award signifies only that a third-party organisation has been paid to declare that procedures are in place and that people are following them. The question of the actual quality of the product or service is not addressed. However, the recipients of the award and their customers often believe otherwise. There is nothing in the standard that prevents or discourages high quality and reliability, and there are of course many cases where the exercise of accreditation has resulted in improved processes and products. However, as reference 4 explains, the way the standard is implemented does not assure the achievement of high quality. The quality journey is seen as the road to ISO9000, particularly as it involves so much effort and has to be retrodden every year.
The second worrying feature of the ISO9000 approach is the cost involved. UK industry pays £80M per year to implement ISO9000, and that figure does not include the internal costs of management time, procedure writing, and so on. A very large number of people are employed in the essentially non-productive business of ISO9000 auditing, consulting and training. By contrast, the Deming approach costs nothing, though of course it does not provide a certificate.
Those reliability standards which apply mathematical and quantitative methods are also based on the inappropriate application of “scientific” thinking. An engineered system or component has no intrinsic property of reliability, expressible for example as a failure rate. Truly scientific properties of systems and components, such as mass and power output, can be predicted and measured with credibility. However, whether a missile or a microcircuit fails depends upon the quality of the design, production, maintenance and use applied to it. These are human contributions, not “scientific” ones. Therefore the standard methods and “databases” that have been developed for reliability and safety measurement and prediction are without true scientific foundation, and are not admissible in engineering.
Such methods similarly generate low expectations. The reliability predicted by standards such as MIL-HDBK-217 is very much lower than modern systems actually achieve, and they achieve it at lower development and production costs than less reliable equipment. System reliability demonstrations based on false assumptions, such as statistical independence of failures, constant failure rates and the inevitability of failures, can only be misleading. The arguments against the unwise use of quantitative methods in reliability and safety are stated in more detail in reference 5. In retrospect we should wonder why we ever allowed ourselves to embrace such meta-engineering.
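To make concrete the kind of calculation such standards prescribe, the sketch below (in Python) shows a minimal parts-count prediction under exactly the assumptions criticised here: constant part failure rates, statistical independence, and a series system in which any part failure fails the system. The failure-rate figures are invented for illustration only; they are not taken from MIL-HDBK-217 or any other handbook.

    import math

    # part: (assumed failure rate per million hours, quantity) -- illustrative values only
    parts = {
        "microprocessor": (0.10, 1),
        "memory device":  (0.05, 4),
        "connector":      (0.20, 6),
        "solder joint":   (0.001, 2000),
    }

    # series-system, constant-failure-rate assumption: part failure rates simply add
    lambda_system = sum(rate * qty for rate, qty in parts.values())   # per 10^6 hours
    mtbf_hours = 1e6 / lambda_system                                  # predicted MTBF

    t = 10_000   # operating hours of interest
    reliability = math.exp(-lambda_system * t / 1e6)   # R(t) = exp(-lambda * t)

    print(f"Predicted failure rate: {lambda_system:.2f} per million hours")
    print(f"Predicted MTBF: {mtbf_hours:,.0f} hours")
    print(f"Predicted reliability over {t} h: {reliability:.3f}")

The arithmetic is trivial; the objection is to what it assumes, namely that each part carries a fixed failure rate regardless of how well it is designed in, manufactured and used.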
These arguments against the use of standards for quality and reliability are covered in more detail in reference 6.
TOTAL QUALITY MANAGEMENT
Total quality management (TQM) is the integration of all activities that influence the achievement of ever-higher quality. In this context quality is all-embracing, covering customer perceptions, reliability, value, etc. TQM is the philosophy that drives the quality revolution described in the first section of this paper. In the engineering companies that embrace this philosophy we do not see any demarcation between “quality” and “reliability”, either in organisations or in responsibilities. The present Q&R standards, however, and nearly all Q&R teaching, perpetuate this unhelpful division. The planned “improvements” to ISO9000 and the series of new “dependability” standards from bodies such as ISO and CENELEC continue to do so.
The BSI has produced BS7850, the UK standard on TQM, in an attempt to provide guidance on the subject. This is analogous to attempting to produce a standard on how physicists should search for new theories, or on how engineers should design a new product. As reference 4 states, “the very nature of a standard is antithetical to the philosophy of TQM”.
SAFETY
The “Safety Case” Approach
In certain industries, in which there has been a public perception of high or uncertain risk as a result of the application of new technologies, regulatory authorities have instituted processes to provide public assurance that risks are understood and minimised. We have seen this particularly in the nuclear power, petrochemical and offshore oil and gas sectors. The process includes preparation and acceptance of a safety management system and assessment of risks.
The methods applied were developed initially in relation to nuclear power, where public unease regarding radiation leakage was the main driver. Formal risk identification and quantification techniques, particularly fault tree analysis (FTA) and failure data systems were developed and applied. The methods and data used in reliability engineering were also adapted for safety assessment purposes. Because of the novelty of most nuclear systems the analyses were required to be conducted in great detail, and related to criteria such as allowable probabilities of foreseen hazardous events.
Formal safety assessment work tends to emphasise the quantitative rather than the qualitative aspects. This is because the people involved are “experts”, who tend to use the methods and data that have been used for reliability assessment. However, safety assessments usually relate to such low probabilities that the quantification is highly uncertain and subject to wide bounds of credibility.
In the UK, the integration of the Nuclear Inspectorate into the Health and Safety Executive reinforced the trend towards harmonisation of the methods across different sectors. For example, the hazard and operability study (HAZOPS) method, developed for application to new chemical process plant, is now widely used in other sectors.
One feature of the safety assessment of these new systems is the insistence by the regulatory authorities that they be performed, or reviewed, by independent experts. Another is to attempt to prove that the likelihood of occurrence of specified hazards falls below levels considered to be acceptable. Since such levels are extremely low (of the order of not more than once per million years), the quantification of these assessments clearly involves considerable uncertainty, especially when new technology is involved.
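To illustrate the point, the sketch below quantifies a hypothetical fault tree with the gate structure (A AND B) OR C; all basic-event probabilities are assumed values chosen purely for illustration, not data from any real assessment. Shifting each input by a factor of ten, a modest allowance for rare events, moves the top-event estimate across several orders of magnitude.

    # Hypothetical fault-tree quantification; all probabilities are assumed
    # per-year values chosen only to show the sensitivity of the result.

    def top_event(p_a, p_b, p_c):
        """Top-event probability for (A AND B) OR C, assuming independent events."""
        p_and = p_a * p_b                      # AND gate: probabilities multiply
        return p_and + p_c - p_and * p_c       # OR gate (exact, inclusion-exclusion)

    nominal     = top_event(1e-3, 1e-4, 1e-8)  # "best estimate" inputs
    optimistic  = top_event(1e-4, 1e-5, 1e-9)  # each input a factor of 10 lower
    pessimistic = top_event(1e-2, 1e-3, 1e-7)  # each input a factor of 10 higher

    print(f"Nominal estimate    : {nominal:.1e} per year")
    print(f"Optimistic estimate : {optimistic:.1e} per year")
    print(f"Pessimistic estimate: {pessimistic:.1e} per year")

The calculation itself is exact; it is the basic-event probabilities that cannot be known to anything like this precision for rare events, which is why the bounds of credibility referred to above are so wide.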
The justification for the formal approach to safety assessment and management is that it ensures the existence of a fully documented, auditable safety management system, and identification and assessment of all potential risks. The implication is that if a safety management system exists, and if risks are identified and controlled, safety will be assured. Further assurance will be provided by the independent audit of the safety management system, and independent assessment and review of the risks.
Systematic identification and elimination of the risks presented by new systems is obviously an important aspect of design and development. It is relatively easy to accept that this approach is appropriate for a new system that involves perceived societal risks, so system providers and operators accept its imposition in the context of a new nuclear power station or petrochemical plant. In such systems there is usually a considerable amount of engineering novelty, and possible public anxiety must be assuaged. However, these factors do not generally apply to existing systems such as railways. The great majority of existing systems and operations present risks well below the levels considered acceptable (no one is afraid on a train!), and there are few novel technological applications that introduce new risks. The industries understand the risks and how to control them, as described below.
Aviation
The requirements for safety assurance of commercial aircraft are laid down in national and international airworthiness regulations. These are produced primarily by the Federal Aviation Administration (FAA) in the USA, the Civil Aviation Authority (CAA) in the UK, and the European Joint Aviation Authorities (JAA). There is a high degree of harmonisation between the different organisations, so that certification of an aircraft under one system covers practically all of the requirements of the others.
The airworthiness regulations cover design and construction of aircraft, as well as maintenance and operation. They do not cover “infrastructure” aspects such as air traffic control and airports, for which there are no equivalent regulations.
The recent certification programme for the Boeing 777 airliner provides a good illustration of how the system operates. The United Airlines aircraft use the Pratt and Whitney engine, and this variant received FAA certification to enable the airline to commence passenger services on schedule. However, British Airways specified a new General Electric engine, which had not yet been subjected to the whole range of mandatory tests, including bird ingestion. Delays in demonstrating compliance have led to a much later service introduction by British Airways. However, at all times the companies involved have known exactly what needed to be achieved in order to demonstrate compliance.
Road Transport and other Industries
Road vehicles, whether private, commercial or public service, must comply with the relevant national or international safety regulations. These regulations are clear and specific. The designer or operator of a new vehicle is not required to submit a safety case, and no independent risk assessment is required.
The same types of consideration apply to other industries, for example building and electrical installations, for which government regulations provide specific safety requirements. Rail transport is generally governed by similar policies, though in the UK the safety case approach is currently being applied to the railways, with results that well illustrate the problems that can arise when “systems”-based concepts replace more practical, realistic methods (reference 7).
For defence products, the US DOD has issued MIL-STD-882 and the UK MoD has issued Def Stan 00-55, which cover safety assurance requirements for newly designed systems. No safety cases are required, but the manufacturer must produce a risk assessment. The requirements are clear about the extent of work required for the type of system and the risks involved, and they are supported by further specific requirements.
Safety Case Limitations
The major limitation of the safety case approach is the fact that management “systems” cannot provide assurance that people will not make omissions or mistakes. The history of nearly all well-publicised accidents, such as the Clapham and Cowden rail crashes in the UK, Chernobyl, Bhopal and Piper Alpha, shows that accidents are caused primarily by people doing unexpected things, or by not doing expected things. Accidents are also caused by engineering design errors, such as the El Al Boeing 747 crash at Amsterdam (engine pylon bolt fatigue) and the Hyatt Regency hotel walkway collapse (a design change which reduced the stress margin). These errors often occur across interfaces; in other words, they result from inadequate communication in design, installation, maintenance or operation. Of course a safety management system, involving identification of risks and responsibilities, training and supervision, will reduce the likelihood of accidents. However, management systems alone cannot eliminate the possibility of accidents in systems such as these. The danger of placing too much emphasis on the management system and on inappropriate quantitative analysis is that important potential causes of accidents are not revealed, because all attention is devoted to complying with the formal, mandatory requirements. This effect is clearly apparent in the analogous approaches to quality and reliability discussed earlier.
CONCLUSIONS
Standards for quality, reliability and safety which are based upon “scientific” or “systems” thinking almost always have the opposite effect to that intended. Costs rise, while quality, reliability and safety are not generally improved and never attain world class levels. Against the improvements that are claimed must be counted the high direct and indirect costs of implementation, and the fact that the improvements generated are small compared with those attainable by other methods.
Therefore we should cease to develop and use the standards that I have discussed. ISO9000 should be relegated to a guidance document describing a minimal quality system, and the whole structure of accreditation and certification should be abandoned. The quantitative reliability standards and related documents and databases should likewise be reviewed, and should not be imposed or given “scientific” status. Work on similar standards currently under way, e.g. in ISO/IEC, should be stopped.
It is interesting to note that there is already a trend away from some of the quality and reliability standards. The European Foundation for Quality Management (EFQM) has produced guidelines for achieving world class Q&R, based on self assessment and the TQM philosophy. The UK MoD has relegated MIL-HDBK-217 and other “parts count” reliability prediction methods to a non-preferred status in the 1993 update of Def Stan 00-41. To these must be added the large number of excellent suppliers, as discussed earlier, who have never used such methods but have nevertheless led the world in quality and reliability achievement.
The safety case approach to formal safety assessment should be applied only to novel, few-of-a-kind systems that present possible large scale hazards. Safety assessments should be quantified only to the extent that is realistic and practicable, as discussed in reference 5.
Engineering is based on science, so scientific principles must be applied where they are appropriate. Quality, reliability and safety are the results of human performance, which is not governed by the laws of science. Therefore the levels of quality, reliability and safety that are necessary today cannot be achieved by “scientific” management methods. The only, proven, way is through the “new” management. This is an art, not a science.
© P.D.T. O’Connor 1996
References:
1. W.E. Deming, Out of the Crisis, MIT Press, 1986.
2. F.W. Taylor, The Principles of Scientific Management, Harper and Row, 1911.
3. P.F. Drucker, The Practice of Management, Heinemann, 1955.
4. UK Science and Engineering Policy Studies Unit, UK Quality Management – Policy Options, 1994.
5. P.D.T. O’Connor, Quantifying Uncertainty in Reliability and Safety Studies, paper presented to the Society of Reliability Engineers’ Symposium, Arnhem, 1993.
6. Achieving World Class Quality and Reliability: Science or Art?, Quality World (IQA), October 1995.
7. P.D.T. O’Connor, Safe and Reliable Railways: What can we Learn from Competing Transport Industries?, Proc. Railtech Conference, IMechE, UK, 1996.
