
Business Interruption Risk Assessment: A Multi-Disciplinary Approach
By Marian H. Long
Effective contingency planning and disaster
recovery coordination require expertise in all aspects of disaster management,
including avoidance and recovery. It is too late to plan an effective response
after a disaster has struck and significant downtime has been incurred. The
resulting outage from such a disaster can have serious effects on the viability
of a firm's operations, profitability, quality of service, and convenience.
In fact, these consequences may be more severe because of the lost time that
results from inadequate planning. After such an event, it is typical for senior
management to become concerned with all aspects of the occurrence, including
the measures taken to limit losses. Their concerns range from the initiating
event, and contributing factors, to the response plans, equipment,
training, and recovery operations used to counter it. Rather than delegate disaster
avoidance to the facilities or building security organizations, it is preferable
for a firm's disaster recovery planner(s) to understand fully the risks to operations
and the measures that can minimize the probabilities and consequences, and to
formulate their disaster recovery plan accordingly.
Some questions on issues that are often neglected in disaster recovery planning are listed below:
Have events that could result in an interruption of the facility's operations been systematically identified?
Do the internal events identified include such scenarios as loss of chilled water, failure of major electrical equipment, environmental
contamination, or a hydrogen explosion in the battery room?
Do the external events identified include such man-made scenarios as a natural gas pipeline rupture or aircraft impact and such
natural events as flooding or earthquakes?
Have the probabilities - and the consequences - of these events been assessed and quantified?
Are cost-beneficial mitigation measures being taken to reduce the risk of an interruption of a facility's operations?
How effective is the preventive maintenance program?
Have the electrical supply systems been properly configured from a reliability standpoint?
Are the fire detection systems adequate for the fire loading throughout the structure?
Do facility-based contingency plans focus limited resources on programs and equipment associated with relatively high probability
and more severe consequence events?
A risk assessment can provide an effective approach that will serve as the foundation for avoiding such disasters. Through risk
analysis, it is possible to identify, assess, and then mitigate the risk. Such an analysis entails the development of a clear summary of
the current situation and a systematic plan for risk identification, characterization, and mitigation.
Obvious benefits of risk assessment are that the results serve as the basis for cost savings through avoidance and the judicious use
of finite resources for risk mitigation. With respect to avoidance, it is often possible to undertake actions that will eliminate major
downtime events. For example, if critical mechanical or electrical equipment is vulnerable to flooding (e.g. from water storage tanks,
piping, or natural events), containment barriers or equipment relocation may eliminate the potential for such an incident. Considering
the allocation of resources, there is little need to control events with very low frequencies of occurrence. For example, there would
be little need for redundancy in compressed air supplies to HVAC controls if a failure is expected only once in every 100 to 1,000
years. For high-probability events which cannot be effectively mitigated (e.g. seismic events), emphasis should be placed on
contingency plans and disaster recovery plans to establish appropriate responses.
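The allocation logic described above can be sketched as a simple screening rule. The Python below is a minimal illustration, not a prescribed method: the `Event` class, the `triage` function, and the once-in-100-years cutoff are assumptions chosen to match the examples in the preceding paragraph, and a real screening would also weigh the severity of the consequence.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    frequency_per_year: float  # expected occurrences per year
    can_mitigate: bool         # True if a practical avoidance measure exists


# Assumed cutoff: events expected no more than once in 100 years are screened out.
LOW_FREQUENCY = 1 / 100


def triage(event: Event) -> str:
    """Return a coarse disposition for an event (illustrative categories)."""
    if event.frequency_per_year <= LOW_FREQUENCY:
        return "accept"            # little need to spend resources on control
    if event.can_mitigate:
        return "mitigate"          # e.g. containment barriers or relocation
    return "contingency plan"      # rely on response and recovery plans


print(triage(Event("seismic event", 0.05, False)))  # contingency plan
```

The frequencies passed in are placeholders; in practice they would come from the data collection and estimation steps of the assessment.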
The Risk Assessment Methodology
The foundation of the risk assessment methodology is the definition of a critical outage. Based on the services provided, for
example, global financial institutions can experience serious losses in a matter of minutes, while insurance companies may be
incapacitated for 12 hours or more before being seriously affected. Critical manufacturing processes can often be interrupted for as
long as 24 hours without serious implications. The definition of critical outage establishes the basis for the identification and
assessment of downtime events.
Again, based on the facility, process, or operation under review, specific areas of concern should be selected for examination.
These typically include electrical systems, HVAC/mechanical systems, fire protection, physical security, and external events, both
man-made and natural. Depending upon the operation, telecommunication systems and hazardous materials may also require
examination.
Once the analysis framework has been established, initial efforts should be focused on data collection, evaluation, and the
identification of downtime scenarios. This can be accomplished through on-site inspection, document review, interviews with key
personnel, and acquisition of relevant historical data. As part of the scenario identification, estimates of the duration of an expected
outage are made, with such factors as diagnosis, parts acquisition, repair time, and any start-up time considered. In addition to
major catastrophic events, attention should also be paid to less severe failures and sometimes to select common mode failures.
Scenarios of concern typically encompass electrical power loss, the loss of cooling, fire hazards, and similar events.
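As a rough illustration of the duration estimate, the expected outage can be treated as the sum of its phases and compared with the facility's definition of a critical outage. The sketch below assumes the phases occur one after another; the class and function names and all of the hour values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    diagnosis_h: float          # time to identify the failure
    parts_acquisition_h: float  # lead time for replacement parts
    repair_h: float             # time to install and repair
    startup_h: float            # time to bring systems back on line

    def total_hours(self) -> float:
        # Assumes the phases are sequential and do not overlap.
        return (self.diagnosis_h + self.parts_acquisition_h
                + self.repair_h + self.startup_h)


def is_critical(estimate: OutageEstimate, critical_outage_h: float) -> bool:
    """True if the estimated outage meets or exceeds the critical outage."""
    return estimate.total_hours() >= critical_outage_h


# Hypothetical chiller failure with no on-site spare.
chiller = OutageEstimate(diagnosis_h=2, parts_acquisition_h=48, repair_h=6, startup_h=1)
print(chiller.total_hours())      # 57
print(is_critical(chiller, 12))   # True
```

Parts acquisition dominates this example, which is one reason the availability of on-site spares matters for single-point failures.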
Once identified, scenarios are described in detail and grouped by impact or equipment type. This activity allows multiple causes
with similar outcomes to be combined, rather than being analyzed as a number of individual events. This is important for the overall
prioritization process, since it prevents the significance of a group of related events from being concealed by subdivision into many
individual events. It also aids in any subsequent contingency planning effort by establishing the major categories of outcomes.
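The effect of grouping can be shown with a short sketch. It assumes the causes are rare and independent, so that their frequencies can simply be added; the scenario names and numbers are made up for illustration.

```python
from collections import defaultdict


def combine_by_outcome(scenarios):
    """Sum the annual frequencies of scenarios that share an outcome.

    `scenarios` is an iterable of (outcome, frequency_per_year) pairs.
    Adding frequencies assumes rare, independent causes.
    """
    totals = defaultdict(float)
    for outcome, frequency in scenarios:
        totals[outcome] += frequency
    return dict(totals)


# Hypothetical causes, each individually below 0.01 per year.
scenarios = [
    ("loss of cooling", 0.004),  # chilled water pump failure
    ("loss of cooling", 0.004),  # cooling tower failure
    ("loss of cooling", 0.004),  # control system failure
    ("power loss", 0.02),
]
combined = combine_by_outcome(scenarios)
```

Taken separately, each cooling cause might be screened out at a 0.01 per year threshold; combined, the loss-of-cooling outcome is about 0.012 per year and would not be.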
In general, two scenario attributes are of primary concern - outage duration and the expected frequency of occurrence. Recognizing
that the scenarios ultimately need to be characterized with respect to overall risk level, it is useful to define suitable ranges of likelihood
and consequence. These ranges should be developed on the basis of the type of facility, process, or operation, and both the
criticality with respect to operation and the level of uncertainty associated with estimation should be considered.
Some failure scenarios are amenable to relatively precise quantification with respect to frequency and outage duration. For example,
some electrical equipment failures can be assessed using published historical failure data in applications similar to the situation under
evaluation. Other scenarios allow only very general characterization. For example, historical data on building fires must be
judgmentally adapted to particular structures or occupancies. Additionally, events initiated in adjacent facilities, or on nearby streets
and highways, may be difficult to characterize because of a lack of information on the operations. In such cases, experience in
similar situations and limited available data can be combined with engineering judgment to determine the appropriate overall outage
and frequency categories.
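One way to express such ranges is as coarse bands that a scenario is assigned to, which also accommodates estimates based largely on judgment. In the sketch below, the middle likelihood band reuses the 0.0001 to 0.01 per year range that this article later calls "somewhat likely"; every other boundary, label, and the scoring rule are assumptions for illustration only and would be set per facility.

```python
import bisect

# Upper bounds of the likelihood bands, in events per year.
LIKELIHOOD_BOUNDS = [1e-4, 1e-2]
LIKELIHOOD_LABELS = ["unlikely", "somewhat likely", "likely"]

# Upper bounds of the outage bands, in hours (assumed: 12 hours and one week).
OUTAGE_BOUNDS = [12, 168]
OUTAGE_LABELS = ["minor", "serious", "severe"]

# Assumed mapping from the summed band indices (0 to 4) to an overall level.
RISK_LEVELS = ["low", "low", "medium", "high", "high"]


def band(value, bounds, labels):
    """Return the label of the band containing value (upper bounds inclusive)."""
    return labels[bisect.bisect_left(bounds, value)]


def risk_level(frequency_per_year, outage_hours):
    """Combine the likelihood and outage bands into an overall risk level."""
    score = (bisect.bisect_left(LIKELIHOOD_BOUNDS, frequency_per_year)
             + bisect.bisect_left(OUTAGE_BOUNDS, outage_hours))
    return RISK_LEVELS[score]


print(band(0.001, LIKELIHOOD_BOUNDS, LIKELIHOOD_LABELS))  # somewhat likely
print(risk_level(0.001, 200))                             # high
```

Wide bands are deliberate here: when a frequency can only be estimated to within an order of magnitude, a category is more defensible than a point value.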
HVAC Scenarios
Provision of adequate cooling to a data center is critical for operation. Should cooling flow be interrupted, shutdown of a data
center may occur in a matter of minutes, depending on the quantity and types of computer equipment in use. Well-designed critical
computer facilities generally feature redundant capacity in critical mechanical equipment, such as chilled water pumps. As data
centers expand, increased cooling capacity demand and/or expansion into areas not originally intended for critical processing can
result in a reduction or elimination of design redundancy. Single-point-failure scenarios then arise from the potential failure of such
critical equipment. Corrective actions may include redistribution of load or the provision of additional redundant backup equipment.
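The erosion of redundancy as load grows can be checked with simple arithmetic. The following sketch assumes identical units that share load evenly; the function name, unit capacity, and load figures are hypothetical.

```python
import math


def spare_units(installed_units: int, unit_capacity: float, load: float) -> int:
    """Number of installed units beyond those needed to carry the load.

    A result of 0 means every unit is needed, so any one failure is a
    single-point failure; a negative result means capacity is already short.
    """
    required = math.ceil(load / unit_capacity)
    return installed_units - required


# Three hypothetical chilled water pumps rated at 100 units of load each.
print(spare_units(3, 100, 180))  # 1 (one redundant pump)
print(spare_units(3, 100, 250))  # 0 (redundancy eliminated by load growth)
```

Repeating this check whenever load is added is one way to notice that a design margin has disappeared before a failure reveals it.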
Electrical Equipment Scenarios
Electrical supply systems in critical facilities often include multiple services, backup generation, and uninterruptible power supplies
(UPS). Such apparent redundancy may, however, be ineffective because of capacity limitations or critical equipment items that are
not connected to backup systems. Scenarios of this nature may arise, for example, during system expansion when additional loads
are connected to the power systems. UPS systems have sometimes been found to lack suitable bypass provisions for restoring
power promptly after a failure and for facilitating maintenance. In other cases, critical equipment (in the sense of single-point
failures) is not backed up by on-site spares, creating the potential for a prolonged outage because of the lead time required to
acquire and replace failed components. The implications of such scenarios are highly site-dependent and typically involve business
decisions, such as cost/benefit trade-offs between the likelihood of failure and the cost of mitigation actions.
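A cost/benefit trade-off of this kind can be sketched by comparing the expected annual loss avoided with the annualized cost of the mitigation. The function names and every figure below are hypothetical; actual downtime costs and failure frequencies are site-specific.

```python
def annual_benefit(failures_per_year: float, hours_saved: float,
                   cost_per_hour: float) -> float:
    """Expected annual loss avoided by shortening each outage."""
    return failures_per_year * hours_saved * cost_per_hour


def is_cost_beneficial(failures_per_year: float, hours_saved: float,
                       cost_per_hour: float, annual_cost: float) -> bool:
    """True if the expected loss avoided exceeds the annualized cost."""
    return annual_benefit(failures_per_year, hours_saved, cost_per_hour) > annual_cost


# Hypothetical on-site spare: cuts a 240-hour outage to 24 hours.
benefit = annual_benefit(failures_per_year=0.02, hours_saved=240 - 24,
                         cost_per_hour=10_000)
print(round(benefit))  # 43200
```

With these assumed numbers, a spare costing 30,000 per year would be justified and one costing 50,000 per year would not.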
Fire and Explosion Scenarios
A data center often has characteristics that present a fire hazard. Typically, numerous pieces of computer equipment are situated on
raised floors. Ordinary combustibles are generally limited to stacks of paper or boxes. Underneath the raised floor, there is often
extensive cabling, with PVC insulation representing more mass than vacant space. Fixed fire protection may include a Halon 1301
system tripped by cross-zoned smoke detection, with no protection provided under the raised floor. Mobile fire protection may be
limited to hand-held fire extinguishers, with the principal agent being carbon dioxide.
Fires in data center areas can arise from problems with the wiring, electrical distribution system components, and electronic
equipment (computer hardware, power switchgear, overcurrent protection devices, etc.). The extensive cabling, particularly that
below the raised floors, enhances the fire risk. Based on historical accident data, in combination with site-specific data, such as the
mass and arrangement of the cabling, the probability of a fire occurring in the data center area is usually estimated to be somewhat
likely (0.0001 - 0.01/yr). A fire in such a data center, particularly under the raised floor, could result in significant downtime - in
excess of a week. This would typically be attributed to the mass of cabling in many areas and the toxic and corrosive
combustion products resulting from a PVC fire.
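To put an annual frequency in perspective, it can be converted to the chance of at least one fire over a planning horizon. The sketch below assumes fires arrive as a Poisson process with a constant rate, which is a modeling assumption rather than something the data above establish; the 20-year horizon is also hypothetical.

```python
import math


def probability_of_at_least_one(frequency_per_year: float, years: float) -> float:
    """P(at least one event within `years`), assuming a Poisson process."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is small.
    return -math.expm1(-frequency_per_year * years)


HORIZON_YEARS = 20  # assumed planning horizon
low = probability_of_at_least_one(0.0001, HORIZON_YEARS)
high = probability_of_at_least_one(0.01, HORIZON_YEARS)
print(f"{low:.4f} to {high:.4f}")  # 0.0020 to 0.1813
```

Under these assumptions, the "somewhat likely" range corresponds to roughly a 0.2% to 18% chance of a fire over 20 years.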
In considering risk control options, replacement of the halon systems would be considered a high priority because of restrictions by
the Montreal Protocol and the new Clean Air Act Amendments to control ozone depletion. Alternatives include pre-action sprinkler
systems, which could reduce the risk of water damage from a failed head. These systems can often use the smoke detection system
from the original halon system to reduce costs.
Beneath a raised floor, an obvious but generally difficult and probably impractical risk control option is to remove obsolete cabling.
A lower-cost alternative is to install a very early smoke detection system, in which the induction of smoke into the unit
reduces the activation time. Other options include a carbon dioxide system or a similar environmentally friendly gaseous
extinguishing system, improved fire detection (e.g., line detectors), and/or improved passive fire protection (e.g., fire-resistant
intumescent paints or fire stops).
External Events
Facilities are frequently located in densely populated urban areas. Consequently, risks exist from man-made events, including natural
gas pipeline or steam line rupture. The latter sometimes results in an airborne asbestos release. Such events could damage a building
or harm its occupants; in such an event evacuation of the building may be necessary to avoid exposure.
Natural phenomena also present risks to facilities worldwide. Not surprisingly, the phenomena of concern vary with geographical
location.
Risk-based Contingency Planning
To facilitate risk-based contingency planning for a particular facility, a matrix can be generated, which prioritizes events and
indicates responsible and affected departments. Such a matrix would allow each department to establish the relative significance of
each event to its own activities. The result could then be used to support development or enhancement of the facility's contingency
plan. Organizationally it would define responsibility for responding to specific situations.
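Such a matrix can be represented as a list of prioritized events, each tagged with its responsible and affected departments, from which each department extracts its own view. The sketch below is one possible layout; the event names, priorities, and department names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MatrixRow:
    event: str
    priority: int                                  # 1 = highest priority
    responsible: str                               # department leading the response
    affected: list = field(default_factory=list)   # other departments impacted


def events_for(department, matrix):
    """Events involving a department, highest priority first."""
    rows = [r for r in matrix
            if department == r.responsible or department in r.affected]
    return [r.event for r in sorted(rows, key=lambda r: r.priority)]


# Hypothetical matrix for a single facility.
MATRIX = [
    MatrixRow("Raised-floor fire", 1, "Facilities", ["Data Center Operations", "Security"]),
    MatrixRow("Loss of chilled water", 2, "Facilities", ["Data Center Operations"]),
    MatrixRow("Steam line rupture", 3, "Security", ["Facilities"]),
]

print(events_for("Security", MATRIX))  # ['Raised-floor fire', 'Steam line rupture']
```

Keeping the responsible department separate from the affected ones makes it explicit who leads the response and who only needs to plan around the event.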
The risk assessment results should be integrated with existing plans into a document, combining business continuity concerns with
environmental/health/safety and regulatory issues. At the facility level, these areas are often interrelated and should be managed
accordingly. For example, if a facility experiences a fire, the downtime could be significantly reduced through a coordinated and
technically appropriate response, including means of egress, emergency lighting, evacuation, and head counts. On this particular
topic, the United States has promulgated relevant regulations, and thus a facility contingency plan should reflect compliance with
29 CFR 1910.38.
Other issues with respect to the efficacy of contingency plans are represented in the questions that follow:
What is the plan update procedure and schedule?
Are there primary and secondary emergency operation centers?
What are the procedures for notifying corporate management?
What are the procedures for event diagnosis and assessment?
Are there designated assembly areas?
Are facility shutdown procedures coordinated with the disaster recovery plan?
Is medical emergency preparedness adequate to handle anticipated situations?
If the use of respiratory protection is anticipated, are the type, number, and location of units appropriate, and have personnel been
trained in their use?
Is there a designated media briefing location?
Is there a policy to provide assistance to employees laid off because of the incident?
Is there an established source for photography and videotaping services?
These issues reinforce the fact that an effective and thorough contingency plan is an integral part of minimizing downtime events at a
critical operation facility. In fact, the plan complements the risk control measures generated as part of the risk assessment. It can
focus on events that cannot be eliminated practically, and that are relatively probable or more significant in terms of business
interruption.
Marian H. Long, PE, CSP, is a senior consultant in the facility Safety and Risk Unit at Arthur D. Little.
Disaster Recovery World© 1997, and Disaster Recovery Journal© 1997, are copyrighted by Systems Support, Inc. All rights reserved. Reproduction in whole or
part is prohibited without the express written permission from Systems Support, Inc.