The following questions address issues that are often neglected in disaster recovery planning:
- Have events that could result in an interruption of the facility's operations been systematically identified?
- Do the internal events identified include such scenarios as loss of chilled water, failure of major electrical equipment, environmental contamination, or a hydrogen explosion in the battery room?
- Do the external events identified include such man-made scenarios as a natural gas pipeline rupture or aircraft impact and such natural events as flooding or earthquakes?
- Have the probabilities - and the consequences - of these events been assessed and quantified?
- Are cost-beneficial mitigation measures being taken to reduce the risk of an interruption of a facility's operations?
- How effective is the preventive maintenance program?
- Have the electrical supply systems been properly configured from a reliability standpoint?
- Are the fire detection systems adequate for the fire loading throughout the structure?
- Do facility-based contingency plans focus limited resources on programs and equipment associated with relatively high probability and more severe consequence events?
A risk assessment can provide an effective approach that will serve as the foundation for avoiding such disasters. Through risk analysis, it is possible to identify, assess, and then mitigate the risk. Such an analysis entails the development of a clear summary of the current situation and a systematic plan for risk identification, characterization, and mitigation.
Obvious benefits of risk assessment are that the results serve as the basis for cost savings through avoidance and for the judicious use of finite resources for risk mitigation. With respect to avoidance, it is often possible to undertake actions that will eliminate major downtime events. For example, if critical mechanical or electrical equipment is vulnerable to flooding (e.g., from water storage tanks, piping, or natural events), containment barriers or equipment relocation may eliminate the potential for such an incident. With respect to the allocation of resources, there is little need to control events with very low frequencies of occurrence. For example, there would be little need for redundancy in compressed air supplies to HVAC controls if a failure is expected only once in every 100 to 1,000 years. For high-probability events that cannot be effectively mitigated (e.g., seismic events), emphasis should be placed on contingency plans and disaster recovery plans to establish appropriate responses.
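The resource-allocation argument above can be illustrated with a simple expected-loss comparison. The following is a minimal sketch, not a prescribed method: the function names, the loss per outage, and the annual mitigation cost are all hypothetical, and the calculation assumes that risk can be summarized as frequency multiplied by consequence.

```python
def expected_annual_loss(frequency_per_year: float, loss_per_event: float) -> float:
    """Expected loss per year = event frequency (1/yr) x loss per event."""
    return frequency_per_year * loss_per_event


def is_cost_beneficial(frequency_per_year: float,
                       loss_per_event: float,
                       annual_mitigation_cost: float,
                       risk_reduction: float = 1.0) -> bool:
    """True if the loss avoided each year exceeds the yearly cost of mitigation.

    risk_reduction is the assumed fraction of the risk removed (1.0 = all of it).
    """
    avoided = expected_annual_loss(frequency_per_year, loss_per_event) * risk_reduction
    return avoided > annual_mitigation_cost


# Hypothetical figures: a $2,000,000 outage and $25,000/yr for redundant equipment.
for years_between_failures in (10, 100, 1000):
    frequency = 1.0 / years_between_failures
    print(years_between_failures,
          is_cost_beneficial(frequency, 2_000_000, 25_000))
```

With these illustrative numbers, redundancy pays for itself only for the failure expected once in 10 years; different loss or cost assumptions would move that threshold.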
The Risk Assessment Methodology
The foundation of the risk assessment methodology is the definition of a critical outage. Based on the services provided, for example, global financial institutions can experience serious losses in a matter of minutes, while insurance companies may be incapacitated for 12 hours or more before being seriously affected. Critical manufacturing processes can often be interrupted for as long as 24 hours without serious implications. The definition of critical outage establishes the basis for the identification and assessment of downtime events.
Again, based on the facility, process, or operation under review, specific areas of concern should be selected for examination. These typically include electrical systems, HVAC/mechanical systems, fire protection, physical security, and external events, both man-made and natural. Depending upon the operation, telecommunication systems and hazardous materials may also require examination.
Once the analysis framework has been established, initial efforts should be focused on data collection, evaluation, and the identification of downtime scenarios. This can be accomplished through on-site inspection, document review, interviews with key personnel, and acquisition of relevant historical data. As part of the scenario identification, estimates of the duration of an expected outage are made, with such factors as diagnosis, parts acquisition, repair time, and any start-up time considered. In addition to major catastrophic events, attention should also be paid to less severe failures and sometimes to selected common-mode failures. Scenarios of concern typically encompass electrical power loss, loss of cooling, fire hazards, and similar events.
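The outage-duration estimate described above can be sketched as a sum of its contributing factors, compared against the facility's definition of a critical outage. This is an illustrative sketch only: the class and field names are hypothetical, and it assumes the four phases occur one after another rather than in parallel.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Estimated hours for each phase of restoring service (assumed sequential)."""
    diagnosis_h: float
    parts_acquisition_h: float
    repair_h: float
    startup_h: float

    def total_hours(self) -> float:
        return (self.diagnosis_h + self.parts_acquisition_h
                + self.repair_h + self.startup_h)


def exceeds_critical_outage(estimate: OutageEstimate, critical_outage_h: float) -> bool:
    """True if the estimated outage meets or exceeds the critical-outage threshold."""
    return estimate.total_hours() >= critical_outage_h


# Hypothetical transformer failure with no on-site spare.
transformer = OutageEstimate(diagnosis_h=2, parts_acquisition_h=48,
                             repair_h=6, startup_h=4)
print(transformer.total_hours())                      # total estimated hours
print(exceeds_critical_outage(transformer, 24.0))     # against a 24-hour threshold
```

Separating the phases makes it clear where mitigation helps most; in this hypothetical case, parts acquisition dominates, which points toward on-site spares.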
Once identified, scenarios are described in detail and grouped by impact or equipment type. This activity allows multiple causes with similar outcomes to be combined, rather than being analyzed as a number of individual events. This is important for the overall prioritization process, since it prevents the significance of a group of related events from being concealed by subdivision into many individual events. It also aids in any subsequent contingency planning effort by establishing the major categories of outcomes.
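The grouping step can be sketched as follows. The scenario names and frequencies are hypothetical, and the sketch assumes that the causes are rare and independent, so that the frequency of the combined outcome is approximately the sum of the individual frequencies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    cause: str
    outcome: str                # the impact used for grouping
    frequency_per_year: float
    outage_hours: float


def group_by_outcome(scenarios):
    """Combine scenarios that share an outcome into one summary entry each."""
    grouped = defaultdict(list)
    for s in scenarios:
        grouped[s.outcome].append(s)
    summary = {}
    for outcome, members in grouped.items():
        summary[outcome] = {
            # Sum is an approximation valid for rare, independent causes.
            "frequency_per_year": sum(m.frequency_per_year for m in members),
            "worst_outage_hours": max(m.outage_hours for m in members),
            "causes": [m.cause for m in members],
        }
    return summary


# Hypothetical inputs for illustration only.
scenarios = [
    Scenario("chilled water pump failure", "loss of cooling", 0.02, 8),
    Scenario("cooling tower failure", "loss of cooling", 0.01, 36),
    Scenario("utility feeder fault", "loss of power", 0.05, 4),
]
for outcome, info in group_by_outcome(scenarios).items():
    print(outcome, info)
```

Here the two cooling-related causes, each small on its own, are reported as a single "loss of cooling" outcome with a combined frequency, which is the effect the grouping is meant to achieve.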
In general, two scenario attributes are of primary concern - outage duration and the expected frequency of occurrence. Recognizing that the scenarios ultimately need to be characterized with respect to overall risk level, it is useful to define suitable ranges of likelihood and consequence. These ranges should be developed on the basis of the type of facility, process, or operation, and should take into account both the criticality of the operation and the level of uncertainty associated with the estimates.
Some failure scenarios are amenable to relatively precise quantification with respect to frequency and outage duration. For example, some electrical equipment failures can be assessed using published historical failure data in applications similar to the situation under evaluation. Other scenarios allow only very general characterization. For example, historical data on building fires must be judgmentally adapted to particular structures or occupancies. Additionally, events initiated in adjacent facilities, or on nearby streets and highways, may be difficult to characterize because of a lack of information on the operations. In such cases, experience in similar situations and limited available data can be combined with engineering judgment to determine the appropriate overall outage and frequency categories.
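The assignment of scenarios to frequency and outage categories can be sketched with simple range lookups. Only the "somewhat likely" band (0.0001 to 0.01 per year) is taken from this article's discussion of fire scenarios; the other labels, the outage-duration boundaries, and the additive ranking are hypothetical and would need to be set for the facility under review.

```python
from bisect import bisect_right

# Frequency bands in events per year. The middle band follows the article's
# "somewhat likely" range; the outer labels are assumptions.
FREQ_EDGES = [1e-4, 1e-2]
FREQ_LABELS = ["unlikely", "somewhat likely", "likely"]

# Outage bands in hours. All boundaries and labels here are hypothetical.
OUTAGE_EDGES_H = [1.0, 24.0, 168.0]
OUTAGE_LABELS = ["minor", "moderate", "major", "severe"]


def category_index(value: float, edges: list) -> int:
    """Index of the band containing value (lower edge inclusive)."""
    return bisect_right(edges, value)


def categorize(value: float, edges: list, labels: list) -> str:
    return labels[category_index(value, edges)]


def risk_rank(frequency_per_year: float, outage_hours: float) -> int:
    """Illustrative ranking: higher means more frequent and/or longer outages."""
    return (category_index(frequency_per_year, FREQ_EDGES)
            + category_index(outage_hours, OUTAGE_EDGES_H))


print(categorize(1e-3, FREQ_EDGES, FREQ_LABELS))
print(categorize(200.0, OUTAGE_EDGES_H, OUTAGE_LABELS))
print(risk_rank(1e-3, 200.0))
```

Working with bands rather than point values reflects the uncertainty in the underlying estimates: a scenario only changes category when new information moves it across a boundary.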
Provision of adequate cooling to a data center is critical for operation. Should cooling flow be interrupted, shutdown of a data center may occur in a matter of minutes, depending on the quantity and types of computer equipment in use. Well designed critical computer facilities generally feature redundant capacity in critical mechanical equipment, such as chilled water pumps. As data centers expand, increased cooling capacity demand and/or expansion into areas not originally intended for critical processing can result in a reduction or elimination of design redundancy. Single-point-failure scenarios then arise from the potential failure of such critical equipment. Corrective actions may include redistribution of load or the provision of additional redundant backup equipment.
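The erosion of design redundancy as load grows can be checked with simple arithmetic. The following sketch assumes identical units that share the load equally; the pump count, unit capacity, and load figures are hypothetical.

```python
import math


def units_required(load: float, unit_capacity: float) -> int:
    """Minimum number of identical units needed to carry the load."""
    return math.ceil(load / unit_capacity)


def spare_units(installed_units: int, load: float, unit_capacity: float) -> int:
    """Units that can fail while the remaining ones still carry the load.

    Zero means any single failure interrupts service (a single-point failure);
    a negative value means the installed capacity is already insufficient.
    """
    return installed_units - units_required(load, unit_capacity)


# Hypothetical plant: three chilled water pumps rated at 500 gpm each.
original_load_gpm = 900      # design load: two pumps suffice, one is spare
expanded_load_gpm = 1200     # after data center expansion

print(spare_units(3, original_load_gpm, 500))
print(spare_units(3, expanded_load_gpm, 500))
```

In this hypothetical case, the expansion consumes the spare pump, so the failure of any one pump becomes a downtime scenario even though no equipment was removed.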
Electrical Equipment Scenarios
Electrical supply systems in critical facilities often include multiple services, backup generation, and uninterruptible power supplies (UPS). Such apparent redundancy may, however, be ineffective because of capacity limitations or critical equipment items that are not connected to backup systems. Scenarios of this nature may arise, for example, during system expansion when additional loads are connected to the power systems. UPS systems have sometimes been found to lack adequate bypass provisions for restoring power promptly in the event of failure and for facilitating maintenance. In other cases, critical equipment (in the sense of single-point failures) is not backed up by on-site spares, with the resulting potential for a prolonged outage because of the lead time required to acquire and replace failed components. The implications of such scenarios are highly site-dependent and typically involve business decisions, such as cost/benefit trade-offs between the likelihood of failure and the cost of mitigation actions.
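The two weaknesses noted above, capacity limitations and critical loads left off the backup system, can be screened for with a simple load audit. This is a sketch under stated assumptions: the load list, ratings, and field names are hypothetical, and a real assessment would also consider power factor, inrush, and battery run time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    name: str
    kw: float
    critical: bool
    on_ups: bool


def audit_backup(loads, ups_capacity_kw: float):
    """Return (critical loads not on UPS, total kW on UPS, overloaded flag)."""
    unprotected = sorted(l.name for l in loads if l.critical and not l.on_ups)
    connected_kw = sum(l.kw for l in loads if l.on_ups)
    return unprotected, connected_kw, connected_kw > ups_capacity_kw


# Hypothetical load list after a system expansion.
loads = [
    Load("mainframe", 120, critical=True, on_ups=True),
    Load("disk storage", 60, critical=True, on_ups=True),
    Load("new network room", 40, critical=True, on_ups=False),
    Load("office lighting", 30, critical=False, on_ups=True),
]
unprotected, connected_kw, overloaded = audit_backup(loads, ups_capacity_kw=200)
print(unprotected, connected_kw, overloaded)
```

The hypothetical audit flags both problems at once: a critical room added during expansion is not on the UPS, and a non-critical load is consuming capacity that pushes the connected total past the rating.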
Fire and Explosion Scenarios
A data center often has characteristics that present a fire hazard. Typically, numerous pieces of computer equipment are situated on raised floors. Ordinary combustibles are generally limited to stacks of paper or boxes. Underneath the raised floor, there is often extensive cabling, with PVC insulation representing more mass than vacant space. Fixed fire protection may include a Halon 1301 system tripped by cross-zoned smoke detection, with no protection provided under the raised floor. Mobile fire protection may be limited to hand-held fire extinguishers, with the principal agent being carbon dioxide.
Fires in data center areas can arise from problems with the wiring, electrical distribution system components, and electronic equipment (computer hardware, power switchgear, overcurrent protection devices, etc.). The extensive cabling, particularly that below the raised floors, enhances the fire risk. Based on historical accident data, in combination with site-specific data, such as the mass and arrangement of the cabling, the probability of a fire occurring in the data center area is usually estimated to be “somewhat likely” (0.0001 to 0.01 per year). A fire in such a data center, particularly under the raised floor, could result in significant downtime of a week or more. This would typically be attributed to the mass of cabling in many areas and the toxic and corrosive combustion products resulting from a PVC fire.
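An annual frequency can be translated into the chance of at least one fire over a planning horizon. The sketch below assumes that fires occur as a Poisson process with a constant rate, which is a modeling assumption rather than something established here; the 20-year horizon is hypothetical.

```python
import math


def probability_of_at_least_one(frequency_per_year: float, years: float) -> float:
    """P(at least one event in the period) = 1 - exp(-frequency * years).

    Assumes a constant-rate Poisson process. expm1 keeps precision when the
    product frequency * years is very small.
    """
    return -math.expm1(-frequency_per_year * years)


# The two ends of the "somewhat likely" band over a hypothetical 20-year horizon.
for frequency in (0.0001, 0.01):
    p = probability_of_at_least_one(frequency, years=20)
    print(f"{frequency}/yr over 20 years: {p:.4f}")
```

At the upper end of the band the result is roughly 18 percent over 20 years, which, combined with an outage of a week or more, helps explain why such a scenario tends to rank high despite a low annual frequency.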
In considering risk control options, replacement of the halon systems would be considered a high priority because of restrictions by the Montreal Protocol and the new Clean Air Act Amendments to control ozone depletion. Alternatives include pre-action sprinkler systems, which could reduce the risk of water damage from a failed head. These systems can often use the smoke detection system from the original halon system to reduce costs.
Beneath a raised floor, an obvious but generally difficult and probably impractical risk control option is to remove obsolete cabling. A lower cost alternative is to install a “very early smoke detection system” in which the smoke induction associated with such a unit reduces the activation time. Other options include a carbon dioxide system or a similar environmentally friendly gaseous extinguishing system, improved fire detection (e.g., line detectors), and/or improved passive fire protection (e.g., fire-resistant intumescent paints or fire stops).
Facilities are frequently located in densely populated urban areas. Consequently, risks exist from man-made events, including natural gas pipeline or steam line rupture. The latter sometimes results in an airborne asbestos release. Such events could damage a building or harm its occupants; in such an event evacuation of the building may be necessary to avoid exposure.
Natural phenomena also present risks to facilities worldwide. Not surprisingly, the phenomena of concern vary with geographical location.
Risk-based Contingency Planning
To facilitate risk-based contingency planning for a particular facility, a matrix can be generated, which prioritizes events and indicates responsible and affected departments. Such a matrix would allow each department to establish the relative significance of each event to its own activities. The result could then be used to support development or enhancement of the facility's contingency plan. Organizationally it would define responsibility for responding to specific situations.
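One possible form of such a matrix is sketched below: events are sorted by priority, and each department receives a list of the events for which it is responsible or by which it is affected. The event names, department names, and numeric risk ranks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    risk_rank: int          # higher = higher priority (hypothetical scale)
    responsible: str        # department that leads the response
    affected: tuple         # departments whose activities are disrupted


def department_matrix(events):
    """Map each department to its events, highest priority first."""
    matrix = {}
    for event in sorted(events, key=lambda e: e.risk_rank, reverse=True):
        matrix.setdefault(event.responsible, []).append((event.name, "responsible"))
        for dept in event.affected:
            if dept != event.responsible:
                matrix.setdefault(dept, []).append((event.name, "affected"))
    return matrix


# Hypothetical events for illustration only.
events = [
    Event("loss of chilled water", 3, "Facilities", ("Data Center Operations",)),
    Event("fire under raised floor", 5, "Security", ("Data Center Operations", "Facilities")),
    Event("utility power loss", 4, "Facilities", ("Data Center Operations", "Security")),
]
for dept, rows in department_matrix(events).items():
    print(dept, rows)
```

Because each department's list is already in priority order, it can be read directly as that department's view of the most significant events.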
The risk assessment results should be integrated with existing plans into a single document, combining business continuity concerns with environmental/health/safety and regulatory issues. At the facility level, these areas are often interrelated and should be managed accordingly. For example, if a facility experiences a fire, the downtime could be significantly reduced through a coordinated and technically appropriate response, including means of egress, emergency lighting, evacuation, and head counts. On this particular topic, the United States has promulgated relevant regulations, and thus a facility contingency plan should reflect compliance with 29 CFR 1910.38.
Other issues with respect to the efficacy of contingency plans are represented in the questions that follow:
- What is the plan update procedure and schedule?
- Are there primary and secondary emergency operation centers?
- What are the procedures for notifying corporate management?
- What are the procedures for event diagnosis and assessment?
- Are there designated assembly areas?
- Are facility shutdown procedures coordinated with the disaster recovery plan?
- Is medical emergency preparedness adequate to handle anticipated situations?
- If the use of respiratory protection is anticipated, are the type, number, and location of units appropriate, and have personnel been trained in their use?
- Is there a designated media briefing location?
- Is there a policy to provide assistance to employees laid off because of the incident?
- Is there an established source for photography and videotaping services?
These issues reinforce the fact that an effective and thorough contingency plan is an integral part of minimizing downtime events at a critical operation facility. In fact, the plan complements the risk control measures generated as part of the risk assessment. It can focus on events that cannot be eliminated practically, and that are relatively probable or more significant in terms of business interruption.
Marian H. Long, PE, CSP, is a senior consultant in the facility Safety and Risk Unit at Arthur D. Little.