A CALL TO ACTION:
 
NATIONAL AND GLOBAL IMPLICATIONS OF THE
YEAR 2000 AND EMBEDDED SYSTEMS CRISIS
 
A Working White Paper
Part 2: The Embedded Systems Crisis: Immediate Actions Needed
 
by Paula D. Gordon, Ph.D.
Revised February 25, 1999
 
(also posted in the archives at http://www.year2000.com and http://www.itpolicy.gsa.gov/mks/yr2000/y2kconf/papers/paper64.htm)

Introduction to Part 2: The Year 2000 technology crisis involves computer software and hardware, date sensitive embedded systems, and connectivity and interdependency concerns. Those addressing Year 2000 challenges have typically focused on information technology aspects of the problem and have too often failed to fully understand and acknowledge the challenges and threats posed by date sensitive embedded systems. Efforts to address the technology crisis have also tended to be based on a limited awareness and understanding of the interconnected nature of the crisis, and the potential for the cascading of failures and problems. Efforts to understand and address the threats have too often overlooked the implications of the crisis for domestic stability and national security, not to mention global stability and security.

While information technology problems have the potential for extraordinary impacts on all aspects of the nation's and the world's infrastructure, date sensitive embedded systems failures can pose even greater challenges, particularly if they occur at the same time as other infrastructure disruptions. Embedded systems failures can pose additional threats to life, health, and safety, including additional threats to social stability and environmental sustainability.

Embedded systems failures can trigger technological disasters which can impede mobilization efforts to deal with infrastructure disruptions. Infrastructure disruptions could in and of themselves be expected to tax emergency response capabilities to the limit.

The Embedded Systems Crisis

What Are Embedded Systems?

The following definition is taken from the United Kingdom's Action 2000 web site:

"Embedded systems contain programmed instructions running via processor chips....They perform control, protection, and monitoring tasks....In broad terms embedded systems are programmable devices or systems which are generally used to control or monitor things like processes, machinery, environments, equipment, and communications."

It is estimated that there may be from 10 to 25 billion embedded systems in existence. Some small percentage of these are known to be date sensitive, and of those a small but significant percentage are not Year 2000 compliant. Estimates of the non-compliant share range from 0.2% to over 1% of all embedded systems. That would mean that from 20 million to 250 million embedded systems failures could occur owing to Year 2000-related non-compliance problems. (Source: The Gartner Group.)
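The range quoted above can be checked with a short calculation. This is only a back-of-envelope sketch; it assumes the 0.2% and 1% rates are applied to the total installed base, which is what the 20 million and 250 million figures imply. The function name is illustrative.

```python
# Back-of-envelope check of the failure range quoted above.
# Assumption: the 0.2% and 1% rates apply to ALL embedded systems,
# which is what the 20 million and 250 million figures imply.
BILLION = 10**9
MILLION = 10**6

def expected_failures(installed_base, noncompliant_per_thousand):
    """Integer arithmetic with per-thousand rates avoids float rounding."""
    return installed_base * noncompliant_per_thousand // 1000

low = expected_failures(10 * BILLION, 2)    # 0.2% of 10 billion
high = expected_failures(25 * BILLION, 10)  # 1% of 25 billion

print(low // MILLION, "million to", high // MILLION, "million")
```

The low end pairs the smallest population with the smallest rate, and the high end pairs the largest with the largest, so the range spans more than a factor of ten.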

These include small failures that can have major impacts. Malfunctions could occur in all manner of equipment, devices, appliances, and systems found in homes, hospitals, buildings, plants, facilities, and systems. Malfunctions could occur as well in everything from rail and subway systems to water purification plants, wastewater disposal plants, oil and gas pipelines, oil refineries, oil tankers, off shore oil platforms, chemical plants, manufacturing plants, coal-fired plants, nuclear power plants, nuclear and other hazardous waste facilities and laboratories, biological and chemical warfare storage facilities, and weapons systems of all kinds.

There is simply not sufficient time and manpower to identify, assess, repair, replace, or "work around" all of the date sensitive embedded systems prior to January 1, 2000. (Indeed, some malfunctions could be triggered well in advance of that date.) Efforts are destined to be far less than 100% successful in making necessary repairs or taking other preventive or mitigative actions. In many cases, shut downs will be the only viable alternative.

The failures that are bound to occur may be expected to have an impact on the health and safety of nearby populations, on social cohesion and civility, on food and water supplies, on the economy, on foreign relations, and on the sustainability of the environment. Such impacts could affect small areas, as well as large regions all over the world. Common sense dictates that greatly expanded efforts be made by the public and private sectors, nationally and globally, to identify, prioritize, and minimize the risks posed by those date sensitive embedded systems posing the greatest threats.

Current efforts to address Year 2000 computer software and hardware problems and embedded systems problems are grossly inadequate nationally and globally. In addition, efforts to address these problems tend to be based on a limited awareness and understanding of the nature and scope of the crisis. The problems are being poorly and unrealistically defined. Even the efforts to address the problems as presently understood are falling far short of the mark.

Indeed, efforts to address the problems have begun and are beginning much too late. The problems are widely understood as primarily involving computer technology, information systems, data processing systems, and communications technology. Resolving these problems involves making needed diagnoses and taking corrective action. Those who tend to define the problem in this narrow way are greatly underestimating the nature and the scope of the problem. There is an increasing chorus of others who see the problem as being much broader. They see the potential impacts as being much farther reaching. They see the societal infrastructure being significantly affected.

When embedded systems fail, they can fail in a variety of unpredictable ways. Small, seemingly insignificant failures can trigger other system failures.

A few illustrations of systems at risk, and of what can go wrong when an embedded system fails, are as follows:

     
  1. The absence of adequate attention to the embedded systems in nuclear power plants was cited by GAO in a document dated March 18, 1998 (GAO/AIMD-98-90R, March 6, 1998).
  2. The computers which control nuclear weapon systems may malfunction. (See appendices.)
  3. The vulnerability of a generator temperature control system at a power plant in the United Kingdom was reported in the Electric Power Research Institute's Proceedings from EPRI Embedded Systems Workshop, dated 10/4/1997. A compliance test was conducted in which "(t)he (control) valve (for generator cooling) closed (fail safe), tripping the unit on high generator temperatures." It was concluded that "(l)oss of numerous generating units simultaneously in the United Kingdom could be devastating to the country." (See appendices for a fuller account of this incident.)
  4. The vulnerability of an Energy Management System (EMS), along with its redundant standby system, has been described in findings from the Hawaiian Electric Company. This vulnerability first came to light in 1996. "The EMS is a redundant system, meaning that there are two computer systems: one that serves as the primary controller which scans remote points in the field, stores information in our database, and displays the information to load dispatchers in graphic display format and the other computer system acting as a hot standby system. These two systems pass information between them via an Ethernet cable and when the primary soft or hard "fails", the standby system assumes control.
In the year 2000 (Y2K) case, a failure would not fix the problem. Since both computer systems run the same software, if the primary had a problem with the date it would most likely have the same problem on the standby as well. This would result in both systems "failing" over to the other system (a kind of "thrashing"). While this would be a considerable headache for our team, it would most likely be very expensive for our company." (For a fuller overview of the scenarios that could occur given such a problem, see the appendices.)
     
  5. "(A) city discovered that their wastewater treatment plant would have dumped raw sewage into their bay on January 1, 2000 if they had not replaced the 286 chips that controlled the valves." (5-3-97 NTIS Symposium Presentation by Lt. Col. David C. Hall (USAFR) Wright-Patterson AFB, Ohio, "Year 2000 Problem: Infrastructure Aspects").

Why has so little attention been given to problems relating to date sensitive embedded systems?

The reasons can be simply stated:

In our highly specialized world, relatively few people even know about the existence of date sensitive embedded systems. Of those who do, fewer still understand the complex technology. Those who understand the technology best are software, firmware, and hardware engineers and programmers who specialize in embedded systems.

Certainly, political leaders, policy makers, and others in roles of public responsibility cannot be expected to readily understand the technical intricacies of software, firmware, and hardware engineering and programming as these relate to date sensitive embedded chips. In addition, they do not always have individuals on their staffs who have such technical expertise.

For all these reasons, very few public officials in any branch or at any level of government have readily grasped the significance that date sensitive embedded systems have in the context of the Year 2000 technology crisis.

Persons in key policymaking roles in emergency management may also lack the kind of technical background that would allow them to recognize the nature of the threats posed by the failure of date sensitive embedded systems. They may therefore fail to see the potential for technological disasters and may consequently fail to undertake necessary preparedness and mitigation measures. There has in fact been an apparent absence of sensitivity in the emergency management community to Year 2000 technology problems. An individual from the education sector who responded to the May WDCY2K survey noted that Y2K was not mentioned on the program of The International Emergency Management Society which met in Washington DC in June 1998. In the words of the respondent: "These are the people who help us recover from earthquakes, floods, hurricanes, etc. I was told that a spokesperson from FEMA mentioned Y2K as a problem that had to be solved. She said FEMA was not compliant now but had plans to become compliant. She did not mention any possible disruptions or disasters that might result from firms or systems not being compliant. Hence, Y2K was described as an IT problem with no connection to the subject matter of the conference....Judging from the informal conversations during the conference, my impression is that emergency management professionals from around the world do not yet regard Y2K as a matter of professional interest to them." (www.csis.org/html/y2kpress.html#3)

Those who tend to grasp the significance of malfunctioning embedded systems are often people who are used to thinking "systemically", people who are used to thinking in terms of the interconnections within a system, within a set of systems, and amongst systems and sets of systems.

Owing to the failure to understand the problems associated with the Year 2000, there has been a failure on the part of the public and private sectors to assign or assume responsibility for addressing the problems posed by Year 2000.

Some of the other reasons why embedded systems have not been adequately addressed are as follows:

There are a relatively small number of persons who understand how embedded systems failures can be forestalled.

It can be extraordinarily difficult to access the embedded systems that need to be assessed in order to ascertain whether or not they are date sensitive and, if they are, whether or not they are Year 2000 compliant.

It can be extraordinarily difficult to assess the internal logic of the embedded system.

The scale of the problems is so great that there are not enough trained technicians who are capable of working on them.

The actual testing of an embedded system can damage the system and cause a malfunction.

Because of the multitude of models and versions of embedded systems, it is not possible to extrapolate from one system to another based on the testing of one. In a real sense, each system has to be treated as if it were unique, because it may well be.

Identical chips may act differently in different systems.

In cases where a replacement chip is required, it may not be possible to identify the vendor or the vendor may be out of business. It may not be feasible to manufacture a replacement chip.

Malfunction of an embedded system may trigger other failures and the source of those failures may not necessarily be detectable.

Even if efforts were to bring the nation close to 100% success in addressing computer software and hardware problems, the threats posed by date sensitive embedded systems could bring those efforts to naught. Some date sensitive embedded systems are simply bound to fail. Even one accidental nuclear weapons launch or in-place accident is one too many. Join that possibility with a nuclear power plant failure like Chernobyl, a chemical plant disaster similar in magnitude to the disasters in Bhopal or Seveso, or a release of toxic emissions from a chemical or biological weapons facility. Were multiple such incidents to happen at once or in quick succession throughout the world, and in the middle of our winter months, there would be national as well as global impacts on an unprecedented scale.

At the same time there could also be other problems whose duration would not necessarily be known at the time. These problems could involve a lack of electricity, a working phone system, radio, drinkable water, food, and fuel for heating and cars and all other forms of transportation.

Technological disasters combined with infrastructure disruptions such as these could make the difficulties of recovery formidable.

No one in the world will be immune from harm if the present levels of understanding and effort are not exponentially increased as rapidly as humanly possible. This calls for leadership of a rare type, because one of the gravest concerns in this crisis is the possible dissolution of the social fabric, which must be kept intact if we are to work through the crisis.

Possible approaches are many. Common sense dictates that there needs to be an immediate prioritization of what needs to be done to minimize our risks to the greatest extent humanly possible.

Assigning or selecting people to do what needs to be done is next.

Priority List

A possible list of highest priorities for mitigating and preventing threats and exposures would include a focus on the following:

Nuclear Weapons Systems, including dismantled nuclear weapons - Domestic and Foreign
Biological and Chemical Warfare Plants - Domestic and Foreign
Nuclear and other Hazardous Stockpiles or Storage Facilities - Domestic and Foreign
Nuclear Power Plants - Domestic and Foreign
Chemical Manufacturing Plants - Domestic and Foreign

[G8 leadership and UN cooperation will be required in all of these top five priority areas. Non-aggression pacts will be in order, as significant disarming may well be required.]

Nuclear Weapons Systems - Domestic and Foreign
Biological and Chemical Warfare Laboratories, Plants or Storage Sites and Other Similarly Hazardous Sites - Domestic and Foreign
Nuclear Power Plants - Domestic and Foreign
Chemical Manufacturing Plants and Refineries - Domestic and Foreign
The Electric Power Grid
Oil and Gas Pipelines
Refineries
Mines
Dams
Ships and Tankers
Water Purification Plants
Wastewater Treatment Plants
Food Production and Distribution
Fuel Distribution Systems
Telecommunications Systems
Rail Systems
Air Transport
Mass Transit
Emergency and Disaster Preparedness and Management
Health Care
Shelter
Business and Industry
The Labor Force
Banking and Financial Sector
The Not-for-Profit Sector
Maintenance of Civil Order - Military, National Guard, Police, Fire, and Emergency Dispatch Services
Security of Jails, Prisons, and Mental Institutions
Readiness of Law Enforcement, National Guard/Reserve, and Other Essential Workers to Serve

By focusing efforts on the Priority List as proposed here, there would be the best chance of safeguarding all of the following:
Public Health and Safety
Social Stability
Global Stability
Environmental Sustainability

The top five concerns are assigned highest priority because of the great and possibly overwhelming threats and challenges they pose, which could result in a breakdown of the social fabric. If we are successful in minimizing these threats, we will have made our recovery efforts far easier. Were we to fail to address these threats, and were any or many of them to materialize, the consequences could be beyond imagination, rendering crisis management, restoration, and recovery efforts extraordinarily difficult.

This is indeed a case where an ounce of prevention could be worth tons of cure. In our case, billions of dollars invested in prevention, mitigation, and preparedness, along with a willful slowing of our economy to accomplish that, could result in the husbanding of trillions of dollars' worth of resources in the long run and could make the period of recovery significantly smoother.

What kind of organization is needed to accomplish these Herculean efforts?

A state of emergency could be declared and an organization could be put in place in the Executive Office of the President to orchestrate and carry out the entire range of tasks needed to address the crisis.

The mission, functions, and scope of purpose of such an organization are detailed in the White Paper. Other measures that would need to be taken or promoted nationally and globally are also detailed.

                                                        **********************************

Slide Presentation to Accompany Paper

Bruce Webster and Paula D. Gordon, Slide Presentation for a White Paper, "A Call to Action: The National and Global Implications of the Year 2000 Embedded System Crisis", June 1998.

References:

Roleigh Martin's embedded systems site: http://ourworld.compuserve.com/homepages/roleigh_martin

Mark A. Frautschi, "Embedded Systems and the Year 2000 Problem (The OTHER Year 2000 Problem)" (Draft), July 1, 1998. http://www.tmn.com/~frautsch/y2k2.html

Gene Bylinsky, "Industry Wakes Up to the Year 2000 Menace", Fortune, April 27, 1998

Dick Lefkon & Bill Payne, "The Practical Engineer: Making Embedded Systems Year 2000 Compliant", IEEE Spectrum, June 1998

The Gartner Group Source Documents on Year 2000 http://gartner12.gartnerweb.com/gg/static/itjournal/gspecial1.html

GAO Exposure Draft: "Business Continuity and Contingency Planning"
http://www.gao.gov/special.pubs/bcpguide.pdf

GAO Report "The Computing Crisis, an Assessment Guide"
http://www.gao.gov/special.pubs/y2kguide.pdf

Appendices:

Nuclear Weapons Systems:

CNN Broadcast, Sunday Morning News: Y2K: Nuclear Weapons Computers System Pose Worldwide Threat aired 6-14-98

Re: "Computers which control nuclear weapon systems (that) have software glitches in them" Statement made by John Pike, Federation of American Scientists:

"The fundamental problem is that we don't know what could happen. Probably nothing's going to happen. They're probably just going to sit there and tell their operators: I'm confused.

There's a real risk though that we could see the sort of computer malfunctions that we've seen in previous years, where the command and control systems erroneously report that an attack is in progress; erroneously direct missiles to shoot at the wrong target; and at a minimum, cause all of the countries to put their missiles on much higher levels of alert, because they're concerned about their unreliable warning systems.

So, in all probability, nothing is going to happen, but there is a small, finite risk that this could lead to an accidental nuclear war, simply because people fail to fix their computers. And that's obviously a totally unacceptable risk."

 

Electric Utility Embedded System Problem

Sunset for the British Empire?

Source: Electric Power Research Institute Proceedings from EPRI Embedded Systems Workshop, Proceedings dated 10/4/1997

Y2K testing was conducted on a generator temperature control system at a power plant in the United Kingdom. To test for Y2K compliance, the control system clock was set to just prior to midnight, Dec. 31, 1999. Twenty seconds past midnight, the unit tripped on high generator temperature.

It turns out that the process value (PV) for the control valve for generator cooling is integrated over time for smoothing, and when the time moved past midnight from '99 to '00, the PV was integrated over infinity. The valve closed (fail safe), tripping the unit on high generator temperature. If this were an isolated incident, the industry would evolve through the year 2000 with little difficulty. However, the algorithms used in this control system are common throughout Europe and most systems are vulnerable to the problem. Loss of numerous generating units simultaneously in the United Kingdom could be devastating to the country.
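The proceedings do not give the control algorithm itself, so the following is only a sketch of one way a time-based integration can go wrong at the rollover. It assumes, purely for illustration, that timestamps carry a two-digit year that legacy code always reads as 19yy. The function name, the `windowed` flag, the pivot of 50, and the process value are all hypothetical. The sketch yields a wildly wrong interval rather than the "infinity" reported above.

```python
from datetime import datetime

def elapsed_seconds(before, after, windowed):
    """Seconds between two (yy, month, day, hh, mm, ss) stamps.

    windowed=False mimics hypothetical legacy code that reads every
    two-digit year as 19yy. windowed=True maps years below 50 to 20yy
    (an illustrative pivot, not taken from the source).
    """
    def expand(stamp):
        yy, *rest = stamp
        if windowed and yy < 50:
            return datetime(2000 + yy, *rest)
        return datetime(1900 + yy, *rest)
    return (expand(after) - expand(before)).total_seconds()

before = (99, 12, 31, 23, 59, 50)  # ten seconds before midnight
after = (0, 1, 1, 0, 0, 10)        # ten seconds after midnight

dt_legacy = elapsed_seconds(before, after, windowed=False)
dt_fixed = elapsed_seconds(before, after, windowed=True)

pv = 40.0  # arbitrary process value for the illustration
print("integration step, legacy year handling:", pv * dt_legacy)
print("integration step, windowed years:      ", pv * dt_fixed)
```

With the legacy reading, the interval across midnight comes out as roughly a century in the wrong direction instead of twenty seconds, so any quantity accumulated as value times elapsed time is swamped by that single step.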

Second Example of an Electric Utility Embedded System Problem

Description of Hawaiian Electric Company Y2K Findings - Energy Management System
 
To: rcowles-Y2K
Rick Cowles (rcowles@waterw.com)
Subject: Y2K Notes
Date: Fri, 31 Oct 97 11:12:00 HST

Overview

In our annual project meeting for 1996, our team discussed the possibility of a problem existing with a century changeover on the Energy Management System (EMS). Fortunately for HECO one of the developers who worked for Rockwell International, the vendor who wrote the EMS software in 1981, is working on our team and was able to inform us of the probable failure of the EMS when a century changeover occurred.

The EMS is a redundant system, meaning that there are two computer systems: one that serves as the primary controller which scans remote points in the field, stores information in our database, and displays the information to load dispatchers in graphic display format and the other computer system acting as a hot standby system. These two systems pass information between them via an Ethernet cable and when the primary soft or hard "fails", the standby system assumes control.

..Since both computer systems run the same software, if the primary had a problem with the date it would most likely have the same problem on the standby as well. This would result in both systems "failing" over to the other system (a kind of "thrashing"). While this would be a considerable headache for our team, it would most likely be very expensive for our company. The following is a list of scenarios that "could" happen if the EMS were unavailable for an extended period of time:

     
  1. The load dispatchers cannot schedule generation, so power plant operators would have to monitor/control frequency at the plant. This "could" result in a variable range of frequency at customer sites.
  2. The load dispatchers cannot control breakers at major substations. As a result, HECO would have to send additional field personnel to substations and have them manually open/close breakers and give dispatchers "readings" of voltages and/or amperes.
  3. Normally if a major transmission line was severed, the EMS would be able to detect the failure and re-route electricity to provide power to customers as well as safeguard other transmission lines to prevent over-voltage failures (transmission lines have a recommended operating level - if more amps are sent across the line, the line will begin to sag and then either snap or catch fire). If the EMS were unavailable, HECO would not be able to detect a failure and an outage would occur.
  4. Part of the features of an EMS are engineering programs that schedule generation based on each generating unit's optimum or most efficient generation of electricity. If the EMS were unavailable, using a generator not at its optimum level to produce could cost the company hundreds of thousands of dollars.
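The failover "thrashing" described in the overview above can be sketched as a small simulation. This is a toy model, not HECO's software: `date_ok` is a hypothetical stand-in for whatever date handling fails at the century changeover, and the cap on the number of switches exists only so the loop terminates.

```python
def date_ok(two_digit_year):
    """Hypothetical shared date check: breaks once the year wraps to 00."""
    return two_digit_year >= 81

def run_failover(two_digit_year, max_switches=6):
    """Return the sequence of machines that held control.

    Both machines run the same software, so the same date check is
    applied no matter which one is currently active.
    """
    active, standby = "primary", "standby"
    history = [active]
    for _ in range(max_switches):
        if date_ok(two_digit_year):
            break                             # active machine is healthy
        active, standby = standby, active     # fail over to the other machine
        history.append(active)
    return history

print(run_failover(99))  # healthy: control stays with the primary
print(run_failover(0))   # shared bug: control bounces back and forth
```

Redundancy of this kind protects against independent faults such as a hardware failure on one machine. It offers no protection when the fault lives in software that both machines share.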

Year 2000 Fix

The EMS runs on Digital Equipment Corporation computers using the VAX/VMS architecture and operating system. The application software was originally written by Rockwell International, but since the software was delivered in 1981 it has been continually enhanced and modified by HECO engineering and software staff.

The EMS is comprised of over 20 VAX/VMS servers and workstations. By design, one server serves as the primary development server where all modified software and new applications are stored. Solving the year 2000 problem involved the following steps:

(Given: The VMS operating system does not itself have a Year 2000 problem.)

  1. Modules which we believed used the time and date services were examined. This involved using an editor to inspect the code for calls to the time and date services.
  2. After examination, modules that used the time and date services were found to call other modules stored in an EMS system library (subroutines compiled into an object library). These routines referenced a smaller set of subroutines written in assembler.
  3. At first, an attempt was made to use the C programming language and standard C library routines and system services provided by the operating system. Unfortunately, the EMS historical services use an unusual method of computing when the day ends. At exactly midnight (00:00:00.00) the EMS assumes that time is the previous day and at (00:00:01.00) the EMS sets the day to the next day.
  4. Instead of modifying hundreds of programs and the proprietary database that use the EMS historical method of computing the date, it was decided to dump the C code and modify the existing assembly code.
  5. The assembly code used two-digit fields for the month, day, and year. To convert the assembly language routines, an assumption was made: Any year value that is greater than or equal to 81 is assumed to be in the 20th century (1900) and any year less than 81 is assumed to be in the 21st century (2000). "81" is used since the system was delivered in 1981 and no data predates that date. Once the year field is obtained (either by an operating system service or from an application which provides it) the check is made. Several assembly language routines do computation on the date/time fields; these routines were carefully modified to conform with our assumption as the date and time is converted into a VMS compatible time, computed on, and then converted back to an EMS acceptable time format.
  6. Once the modification of the assembly language routines was complete, they were recompiled and put into a test library. All applications for the EMS were recompiled (several hundred programs, functions, and subroutines) for testing. A workstation onto it. The EMS system was brought online in a test mode and after fixing some minor bugs in other applications, the system was scheduled for an extended test. The workstation's date was set to 12/20/1999 and the test was run for approximately 35 days. During this period all reports were examined to see if there were any problems, and periodic system checks were performed to see if any applications aborted or had problems with the date/time conversion into the next century.
  7. Once the test was completed, the assembly language source code changes were presented in a group meeting and after the staff was satisfied with the changes, the EMS applications were recompiled again and installed in the real environment.

(Rick Cowles Web site: www.euy2K.com)
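The year assumption and the midnight convention described in the steps above can be sketched as follows. The actual fix was made in VAX assembly language; this Python version is only an illustration, and the names `PIVOT`, `expand_year`, and `ems_day` are hypothetical. It also assumes that only the exact instant 00:00:00.00 is attributed to the previous day, which the description leaves open.

```python
from datetime import datetime, timedelta

PIVOT = 81  # the system was delivered in 1981 and no data predates it

def expand_year(yy):
    """Map a two-digit year to four digits using the pivot described above."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 1900 + yy if yy >= PIVOT else 2000 + yy

def ems_day(ts):
    """Calendar day a timestamp is attributed to under the midnight rule.

    Assumption: only exactly 00:00:00.00 counts as the previous day.
    """
    if (ts.hour, ts.minute, ts.second, ts.microsecond) == (0, 0, 0, 0):
        return (ts - timedelta(days=1)).date()
    return ts.date()

for yy in (81, 99, 0, 80):
    print(f"{yy:02d} -> {expand_year(yy)}")

rollover = datetime(expand_year(0), 1, 1, 0, 0, 0)
print(ems_day(rollover))                         # exact midnight
print(ems_day(rollover + timedelta(seconds=1)))  # one second later
```

A fixed pivot like this covers a 100-year window, here 1981 through 2080. It defers the ambiguity rather than removing it, which is an acceptable trade when, as in this case, no stored data predates the pivot year.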

 



                                                                                                                                                                               Copyright © 1999, Paula Gordon
 