Risk Analysis
One area of great importance to disaster recovery planners that has received very little attention is the question of how much to spend on the disaster recovery planning (DRP) effort. Through application of a “worst case” risk analysis process, corporate officers can be effectively “sold” on the need for effective DRP, but what guidelines can be used to determine the amount of corporate resources that should be devoted to recoverability?
An excellent example of how to make such a determination is available in the insurance industry. Actuarial science is a field specifically directed at determining a reasonable price (including profit) for risk-specific coverage. Actuaries use event frequency statistics to determine insurance rates. While it is unreasonable to expect DR planners to become proficient actuaries, we can certainly use actuarial methods in efforts to decide how much money should be spent on DRP.
An extension of simple “worst case” risk analysis methods can yield estimates of a company’s probable annual loss due to specific risk factors. Once an accurate picture of probable annual loss is developed, that loss figure can be utilized as a budgetary guidance tool.
Let’s look at a simplified example. Suppose X corporation has an IBM based host DP facility that supports their management of manufacturing operations. An impact analysis has demonstrated that in the absence of an effective DRP, X corporation will stand to lose $100 million if this data center is destroyed by fire. The DR Coordinator obtains information from his insurance carrier which indicates that a facility configured like X corporation’s data center can expect to experience a total loss by fire once every 300 years. Assistance in determining event frequency can be obtained from insurance carriers and governmental agencies.
We now know how often the event (fire) is likely to occur, and what its impact will be. From this information we can estimate a level of probable annualized loss due to a totally destructive fire by utilizing the formula:
Annual Loss Exposure (ALE) = impact x frequency
We can express the frequency of once every 300 years as the ratio 1/300. So our formula yields:
ALE = $100,000,000 x 1/300 or
ALE = $100,000,000/300 = $333,333
This calculation tells us that the X corporation has an annual loss exposure of $333,333 due to totally destructive fire at the data center in question. To determine the total ALE for the operation, calculate the ALEs for all the other risk factors we wish to consider (earthquake, tornado, employee sabotage, etc.) and total them. Once our total ALE figure is determined, it can reasonably be used to guide budgetary decision making. It is simply bad business to spend more on DRP than you are “losing” on an annual basis.
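The calculation above can be sketched in a few lines of Python. The fire figures come from the example; the earthquake and sabotage figures are hypothetical placeholders, included only to show how the individual ALEs are totaled.

```python
def annual_loss_exposure(impact_dollars, years_between_events):
    """ALE = impact x frequency, with frequency expressed as 1 / years."""
    return impact_dollars / years_between_events


# Each entry: (impact in dollars, expected years between events)
risk_factors = {
    "fire": (100_000_000, 300),          # figures from the example above
    "earthquake": (100_000_000, 1_000),  # hypothetical
    "sabotage": (5_000_000, 50),         # hypothetical
}

factor_ales = {
    name: annual_loss_exposure(impact, years)
    for name, (impact, years) in risk_factors.items()
}
total_ale = sum(factor_ales.values())

for name, ale in factor_ales.items():
    print(f"{name:<12} ${ale:>12,.0f}")
print(f"{'total':<12} ${total_ale:>12,.0f}")
```

With these placeholder inputs the total would serve as the rough annual ceiling for DRP spending that the article describes.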
Any competent risk analysis will include calculation of ALE. In fact, it is a cornerstone of the risk analysis method recommended by the National Bureau of Standards (NBS) in FIPS Publication 65. The NBS methodology presents a simplified approach to ALE estimation which utilizes indexed tables. This method was originally developed by Robert H. Courtney Jr. of IBM, who gave permission to NBS to adapt the method to their needs. While this indexed table method does not yield ALE estimates quite as accurate as individual calculation, it is a viable way to obtain ALE figures that can be used as a broad budget guidance tool. FIPS Publication 65 and other NBS guidelines pertinent to DR and data security can be obtained from the National Technical Information Service:
National Technical Information Service
5285 Port Royal Road, Springfield, VA 22161
NTIS information: (703) 487-4600
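The indexed-table idea can be illustrated with order-of-magnitude indices. The sketch below assumes an impact index i that stands for roughly 10^i dollars, a frequency index f that starts at 1 for "once in 300 years" and rises by one for each tenfold increase in frequency, and the combining rule ALE = 10^(f + i - 3) / 3. These index definitions are assumptions made for illustration, not a quotation from FIPS Publication 65, and should be checked against the publication before use.

```python
import math

# Assumed index scheme (illustrative only; verify against FIPS PUB 65):
#   impact index i    : impact is roughly 10**i dollars
#   frequency index f : 1 = once in 300 years, 2 = once in 30 years,
#                       3 = once in 3 years, and so on


def impact_index(impact_dollars):
    """Round the impact to the nearest power of ten."""
    return round(math.log10(impact_dollars))


def frequency_index(events_per_year):
    """Index f such that events_per_year is roughly 10**(f - 3) / 3."""
    return round(math.log10(3 * events_per_year) + 3)


def indexed_ale(f, i):
    """ALE estimate from the two indices."""
    return 10 ** (f + i - 3) / 3


# The fire example from the article: $100,000,000 once every 300 years.
f = frequency_index(1 / 300)
i = impact_index(100_000_000)
print(f, i, f"${indexed_ale(f, i):,.0f}")

# A $40,000,000 impact rounds to the same index, so the indexed estimate
# overstates the individually calculated ALE of about $133,333.
coarse = indexed_ale(frequency_index(1 / 300), impact_index(40_000_000))
exact = 40_000_000 / 300
print(f"${coarse:,.0f} vs ${exact:,.0f}")
```

The second case shows why an indexed estimate is less accurate than an individual calculation: every input is rounded to an order of magnitude before the figures are combined.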
Regardless of the calculation method used, as the number of event types under consideration increases so does the volume of calculations to be performed. A PC based spreadsheet package can be an invaluable aid in these calculations. In addition, there are an increasing number of risk analysis consultants available to assist you. In any event, it pays to be an informed consumer when buying such services, and it certainly pays to have a financial yardstick available when cost analyzing your DRP alternatives.
Andrew M. Munro is a Disaster Recovery Planner with MCI Communications.
This article adapted from Vol. 2 No. 2, p. 45.
It is good news that many organizations are jumping on the disaster recovery bandwagon. Information security and disaster recovery practitioners have clearly scored some impressive successes. Management has become more aware of the need and has begun to allocate funds for security measures that we all knew to be important but found more difficult to sell in the past.
Disaster recovery is clearly an important means of containing loss when a disaster occurs. The key phrase here is “containing loss.” In any disaster, there will be substantial losses, no matter how carefully conceived and implemented the disaster recovery plan and disaster preparedness are.
Despite the increased comfort level we can enjoy with a carefully conceived and implemented contingency plan, something is missing. The barriers to loss are still incomplete. Contingency plans are effective weapons against unmitigated loss from a disaster, but they do absolutely nothing to prevent the disaster from happening. There are also many lesser threats that never become the kind of disaster a typical disaster recovery plan addresses. Misuse/abuse, fraud, theft of data, and data sabotage are only a few of the threats that fall into this “non-disastrous” yet potentially very costly category.
It is unquestionably worthwhile to have a tried and trusted disaster recovery plan in place. We get a warm, fuzzy feeling of security when we conduct successful disaster recovery plan tests and disaster scenarios. We are thus better prepared to cope with the real thing when it happens. But everyone hopes never to have to deal with a real disaster, and that warm, fuzzy feeling obscures the reality of potential losses that will still be incurred. Management is often particularly vulnerable to a false sense of security, especially when it has just spent tens to hundreds of thousands of dollars on disaster recovery planning--with ongoing costs of the same magnitude to keep the plan viable.
A real disaster will be costly in terms of denial-of-use (however well it is limited by the disaster recovery plan), disruption, destruction and human impact, no matter how well prepared we are. Therefore, it is clear that more should be done.
The missing link should be set in place to form a unified barrier to risk.
The missing link is Integrated Risk Management, as viewed from the information security perspective including all organizational and functional activities and controls that serve to assure the availability, integrity and confidentiality of information. Risk management is a familiar term in the insurance industry, but that definition is inadequate for the purposes of the information security practitioner and his interest in “managing” risk.
For information security purposes, risk management is the multifaceted process that includes the following:
- What can happen (threat occurrence)
- How bad will it be if it happens (consequences)
- How often will it happen (frequency)
- How certain the answers are to these questions (uncertainty)
- Identifying vulnerabilities that increase risk exposure by allowing threats to occur with greater frequency, greater consequences, or both
- Identifying cost-effective safeguards that serve to mitigate or eliminate vulnerabilities and reduce associated risk
This risk reduction is best achieved by first executing a credible risk assessment. The risk assessment supports risk avoidance/acceptance decision-making, i.e. risk management, by identifying probable loss exposures associated with the threats for which there are vulnerabilities at the target site. The complete risk assessment will also include recommendations for safeguards that cost-effectively reduce these loss exposures. The emerging concept of risk management may thus be represented as an organizational integration or coordination of classic risk management (insurance), physical security, data security and disaster recovery that enables a coherent orchestration of these often unconnected activities and their common goal of managing risk.
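One way to picture the cost-effectiveness test for a safeguard is to compare its annual cost with the reduction in probable annual loss it buys. The sketch below is a minimal illustration under that assumption; the safeguard names, costs, and loss figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Safeguard:
    name: str
    annual_cost: float
    loss_before: float  # probable annual loss without the safeguard
    loss_after: float   # probable annual loss with the safeguard in place

    @property
    def annual_savings(self):
        return self.loss_before - self.loss_after

    @property
    def net_benefit(self):
        return self.annual_savings - self.annual_cost

    @property
    def is_cost_effective(self):
        return self.net_benefit > 0


# Hypothetical candidates for a single threat.
safeguards = [
    Safeguard("fire suppression upgrade", 40_000, 333_333, 100_000),
    Safeguard("redundant site", 500_000, 333_333, 50_000),
]

# Recommend only safeguards that save more than they cost, best first.
recommended = sorted(
    (s for s in safeguards if s.is_cost_effective),
    key=lambda s: s.net_benefit,
    reverse=True,
)
for s in recommended:
    print(f"{s.name}: net benefit ${s.net_benefit:,.0f} per year")
```

A safeguard that fails this test is not automatically rejected; as the article notes, accepting or avoiding the risk remains a management decision.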
To make decisions whether to avoid, minimize or accept risk, management must know what the risks are, what their probable consequences (losses) are, what the vulnerabilities to risk are, and what steps can be taken to cost-effectively avoid or minimize risk. Note that risk acceptance is a legitimate management prerogative.
However, risk acceptance through ignorance of the facts has never been an acceptable excuse to executive management, the board, shareholders or constituents. The worst-case result of uninformed risk acceptance in the past has often been an unplanned and abrupt change in responsibilities. In the future, however, we will almost certainly see the Foreign Corrupt Practices Act of 1977 invoked when risks are accepted through ignorance and some substantial loss is suffered.
There is a trend toward greater government interest in the security of information in both the public and private sectors. This trend, as manifested in BC-177 (Disaster Recovery Requirements from the Comptroller of the Currency for the banking industry), OCC 220 and OCC 229, among other directives and regulations, is driven by a recognition that information processing is often critical to the successful pursuit of American business interests. The Foreign Corrupt Practices Act imposes significant penalties (felony fines and imprisonment) in the prosecution of both responsible management and the company when they fail to maintain effective control over resources to the detriment of an organization and its shareholders.
While there are various ways to manage risk, the most effective approach to an Integrated Risk Management program is to establish and maintain a probabilistic risk model of the information processing environment in its broadest context. One of the best and most cost-effective tools for building, analyzing and maintaining a risk model is an automated probable risk assessment system.
Probable risk assessment does not presume to dictate whether management should avoid, minimize or accept risk. It does, however, provide management with reliable decision support information based on a defensible and substantially objective quantification of risk as opposed to a subjective qualitative ranking of risk. Therefore, with an effective Integrated Risk Management program, the information security and disaster recovery practitioner (the “risk manager”) can help management assure that risks (especially avoidable risks that could later result in disasters or other costly experiences) are not accepted through ignorance of the facts.
Yes, the contingency plan may very well “contain” losses arising from risks accepted ignorantly. But what if the disaster could--and should--have been avoided?
Will Ozier is President of Ozier, Perry & Associates.
This article adapted from Vol. 3 No. 1, p. 40.
In today's competitive environment, a business must achieve continual improvement just to stay even in the market place. Any interruption in one's presence in the market place is devastating. It is, therefore, incumbent upon management to respond immediately to any catastrophic event which interrupts the business and restore its operation as quickly as possible.
Subsequent to a catastrophe, many executives become distracted by the challenge of getting the building and equipment repairs completed rather than continuing their business function. This distraction may be challenging, but it is deadly. Businesses, large or small, begin dying the moment a catastrophe occurs. Restoration of the business must proceed with the highest level of urgency. After a serious catastrophe at a BASF Corporation facility, director of insurance Karl Heinz Jaeger stated, “Business interruption losses can be a major threat to a company and in the worst cases could lead to bankruptcy for even the biggest of companies.”
Focus on the Customer
Customers, be they retail, wholesale, or service-oriented, must continue their supply from some source. Even if the damaged business can maintain a continued supply by virtue of partial operations, the customers feel it necessary to look for secondary sources of supply in case their now-damaged primary source of supply fails. If supply is interrupted, these customers must go elsewhere immediately, and their orders may be difficult to regain.
Beware of Hidden Costs
In addition to the strong potential for loss of business, there are other hidden, and often uninsurable, costs which combine to create a devastating effect on the business. These hidden costs begin accumulating immediately after the disaster occurs. Some of these costs include:
- Vastly increased unemployment compensation premiums resulting from the layoffs in the work force.
- Substantial increases in advertising and special promotions expenditures necessary to rebuild the volume of business.
- Often underestimated and significant cost of training new employees or eliminating the “rust” from old employees who have been idle for a period of time.
- Increased production mistakes inherent in a restart with new or rusty former employees.
- Overall lowered level of efficiency in the operation which adds significantly to the cost of production.
These hidden costs may sound innocuous; however, they are deadly in 71% of catastrophes which produce a “temporary” facility closure.
Even when the damaged business regains its pre-catastrophe volume, generally there will be a significantly reduced profit. In a worst case scenario, after a catastrophe there will be a net loss where that same volume during the pre-catastrophe period would have resulted in a reasonable profit. This is due to the combined effect of the hidden losses which accounting systems are generally not set up to track. Consequently, the business person is often unaware of the problems which are causing cash flow difficulty.
These circumstances contribute to statistics cited by BASF/Wyandotte which show that 43% of businesses closed by a catastrophe never reopen. Twenty-eight percent of those that do reopen experience financial failure within three to five years. Those that never reopen simply do not have the financial resources to weather the period of time they are closed due to the catastrophe.
These numbers include those which are well insured because many of the hidden costs are not insurable expenses. Those that are insurable are often under-insured due to underestimating the maximum foreseeable loss. Clearly, immediate action must be taken if a business is to have any chance of recovery.
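Combining the two cited figures gives the overall share of closed businesses that are ultimately lost. The short calculation below assumes, as the wording suggests, that the 28% applies only to the businesses that do reopen.

```python
never_reopen = 0.43          # share of closed businesses that never reopen
fail_after_reopening = 0.28  # share of reopened businesses that later fail

reopen = 1 - never_reopen
fail_later = reopen * fail_after_reopening
total_lost = never_reopen + fail_later

print(f"reopen but fail later: {fail_later:.1%}")
print(f"lost overall:          {total_lost:.1%}")
```

Under that reading, roughly six in ten businesses closed by a catastrophe either never reopen or fail within three to five years.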
After a catastrophe, the insured should immediately concentrate on the health and continuation of the business. Sales staff should contact customers, thank them for their past loyalty, and assure them an aggressive effort is being taken to restore the business and, therefore, the supply. Appropriate management staff should have immediate and frequent communications with the employees so they are available when the business reopens. Accounting staff should follow through on collections, billings, payables, and vendor communications. Furthermore, management should focus on locating additional inventory, preparing reopening advertising, and developing new promotions to restore the business.
The restoration of a facility should be left to professionals capable of doing so at a high rate of speed, while working closely with the insurance provider. It should be obvious by now that the fastest restoration of the facility and equipment is crucial for a business unable to relocate.
Utilizing a team approach, with the insured focusing on the continuation of the business, a reputable high-speed specialist restoring the building and equipment, and rapid funding of the restoration by the insurer, the facility should be back into operation in the least amount of time. Anything which slows the process can be devastating for the business.
Other alternatives that take additional time will, with rare exception, prove to be devastating to the business regardless of advantages they may appear to have.
Nelson Bean is president of The Evans American Corporation, Houston, Texas.
Post-incident review (PIR) is an evaluation of incident response used to identify and correct weaknesses, as well as determine strengths and promulgate them. PIRs are normally used to support program revision. Despite its importance, PIR is one of the most neglected components of disaster recovery planning.
Imagine you have just survived a natural disaster. After weeks of intense response and recovery efforts, fortunately you are still in business. You’re exhausted and glad it is over. But a critical task awaits. Now, while your memory is fresh, is the time to learn from what happened and use the lessons to enhance your program and plans; don’t assume they will be remembered. All too often, managers fall into the common trap of waiting until later and losing the opportunity. This is the moment to exploit your boss’s fear that this could happen again in order to get the support you need. The organizations best equipped to survive and thrive are those that mature beyond the normal reflex of respond, recover and continue.
Applying hard learned lessons to a total disaster management program just makes sense. Better yet, go beyond disaster management, with its site specific focus, to crisis management and look at the bigger strategic picture. There are several things you should ask yourself:
- What can be learned from what happened?
- How do you avoid repeating mistakes?
- How do you assess what is and is not working?
- What are the implications of what just happened not only on you, but on your whole corporation or industry?
- Are program and plan revisions needed?
How do these questions get answered? The best way to answer these and more is to conduct a post-incident review. Here is how the process works.
The post-incident review process begins with determining who will conduct the PIR. An effective review depends heavily on the objectivity of the review team. For that reason, you should select a team of individuals that are not part of your local organization, or, if from your site, were not involved with the response to or management of the incident. (The responders and managers will have an opportunity to provide their input later in the process.) The team should provide expertise in management, human factors, communications, planning and training. The team should include specialists that are technical experts in particular areas of concern for the specific incident. Specialty areas may include disaster response and management, fire, hazardous materials, environmental impacts and regulations or hostage situations. Several members of the team should also have strong interpersonal skills to facilitate capturing information through discussions and interviews with incident managers and responders. The team should have access to an advisory group of managers and senior leadership from within the organization that experienced the incident. These advisors help guide the activities of the team toward the philosophy of the organization. Their direct experience also assists with the assessment of how management responded to the incident and what long term effects have occurred as a result of their actions or the incident itself.
Once the team is assembled, its first step is to determine goals and objectives. What do we want to get out of this effort? A primary objective is to learn from what happened so your disaster management, response and recovery programs can be enhanced. Clearly defining the areas that the team will analyze should enable the team to make specific recommendations for improvement. Key areas of consideration include:
- Mobilization procedures for personnel and equipment;
- Implementation plans and procedures;
- Management and coordination of emergency response;
- Stakeholder reaction;
- Internal and external communications;
- Post-incident perception; and
- The short and long term consequences of the incident.
Based upon the objectives and areas of consideration, review questions are developed. These questions will, among other things, seek to explore each important aspect of the incident. They should be applied to each available source of information on the incident: plans, procedures, records and participants (through interviews). While the questions are being developed, another part of the team will begin a records review to build a list of incident participants.
The next step is to conduct interviews. During interviews, everyone involved with the actual response, management, or recovery effort should be provided the opportunity to supply input. No one person can see, hear, or know everything that happened. Often it is not practical to interview everyone; however, it is necessary to ensure that an adequate cross section of those involved with the incident is covered. During the interview process, several key pieces of the puzzle should be obtained.
The first piece is the basic, “What happened?” This information is used to build a time line of participants’ actions separate from those found in incident records. Another piece is the cause of the incident. Often, participants can provide valuable insight into why the incident occurred and what might be done to prevent it from happening again.
The short and long term consequences of the incident are another piece of the puzzle that can be obtained through the interview process with assistance provided by management. Participants can also impart the reactions and post-incident perceptions of the community and other organizational stakeholders. The participants’ perception of the strengths and weaknesses of the actions of the organization should also be documented.
Concurrent with the interviews, portions of the team will begin to analyze the implementation plans and procedures while other portions continue an in-depth records review. The records and plans review efforts will also develop time lines of what happened and what should have happened.
These documents are further surveyed to reveal strengths, weaknesses, and concerns based upon organizational standards and the disaster recovery and crisis management expertise of the reviewers. These portions of the team should develop checklists from the review questions used by the interviewers. Using a checklist with a comprehensive description of each area of consideration during plans analysis and record reviews helps keep these parts of the PIR objective and complete.
During the review phase, it is important to begin looking at the values and rationale that were applied during the planning process and by managers and responders in reaching decisions concerning response and recovery operations. This is especially important if it appears that deviations from the organization's values occurred and if that variance had a direct effect on the response and recovery operations.
After the records review, plans analysis and interviews are completed, the team reconvenes to discuss and analyze their findings and develop a post-incident review report. Time lines developed by each group should be evaluated to identify points of divergence and convergence. Checking areas of divergence closely to determine where the plan was not followed will help identify candidate areas for planning or training enhancements. The individual perceptions of strengths, weaknesses, and concerns will be compared with the impressions and findings of the team’s record review. The team should emerge with a clear picture of what happened, what should have happened, and what should happen next. The picture is then assembled into a report of the post-incident review. A PIR report does not have to follow any special format and should only be as detailed as necessary to be a useful tool for crisis, disaster, and emergency planners and managers. The report should include recommendations for program enhancement or other modifications. It should address the following items:
- A consolidated event time line;
- Incident cause and recommendations for future correction or prevention;
- Mobilization process, including notification of personnel and activation of facilities (this is particularly important in reviewing the time required to respond to an incident involving hazardous materials that could pose a threat to the surrounding community);
- Prevention, mitigation and response equipment performance and procedures;
- Implementation and performance of disaster response and crisis management plans and procedures including strengths, weaknesses, and concerns;
- Management and coordination of disaster response and crisis management actions of those involved in responding to the incident;
- Community and other stakeholder reactions, especially any actions initiated by community emergency managers to protect its citizens;
- Post-incident perception of organization performance, as revealed during interviews, in press reports, by changes in stock price, by investor reactions, etc.;
- Company, corporation, or industry consequences, especially if alternative technologies are available;
- Key “lessons learned” listed separately, to facilitate the implementation of enhancements that may be required.
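The time line comparison that feeds the consolidated event time line can be sketched as a simple check of what happened against what should have happened. The example below is illustrative only: the action names, times, and the 15-minute tolerance are hypothetical, and a real review would draw them from the plan, the incident records, and the interviews.

```python
from datetime import datetime, timedelta


def compare_time_lines(planned, actual, tolerance=timedelta(minutes=15)):
    """Return (action, finding) pairs where the actual time line left the plan.

    planned and actual map an action name to the time it should have
    occurred or did occur.
    """
    divergences = []
    for action, planned_time in planned.items():
        if action not in actual:
            divergences.append((action, "not performed"))
            continue
        offset = actual[action] - planned_time
        if abs(offset) > tolerance:
            minutes = offset.total_seconds() / 60
            divergences.append((action, f"{minutes:+.0f} min vs plan"))
    for action in actual:
        if action not in planned:
            divergences.append((action, "not in plan"))
    return divergences


start = datetime(2000, 1, 1, 9, 0)  # hypothetical incident start
planned = {
    "notify response team": start + timedelta(minutes=10),
    "activate command center": start + timedelta(minutes=30),
    "brief community officials": start + timedelta(minutes=60),
}
actual = {
    "notify response team": start + timedelta(minutes=12),
    "activate command center": start + timedelta(minutes=75),
    "call vendor for generator": start + timedelta(minutes=90),
}

for action, finding in compare_time_lines(planned, actual):
    print(f"{action}: {finding}")
```

Each divergence is only a candidate for a planning or training enhancement; an unplanned action that worked well may instead belong in the revised plan.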
Based on the PIR, the disaster recovery and crisis management programs should be revised to improve future performance. This could lead to revisions in several areas:
- If the incident had not been previously identified as a potential hazard or vulnerability in the disaster and crisis plans then it should be added, and the hazard and vulnerability analysis should be reviewed;
- If the report revealed weaknesses or gaps in the organization, the disaster response and/or crisis management structure should be modified;
- If the policies and procedures did not address issues that became important during the incident, policies and procedures would need to be developed for those areas;
- If response went poorly due to a lack of training, exercising or planning, these areas should be enhanced or modified and personnel should be familiarized with the changes; and
- In areas where participants diverged from their existing plans and response or management operations went especially well, the disaster response and/or crisis management plans should be modified to reflect the reality of success.
The post-incident review process clearly provides an opportunity to learn from disasters and crises. Applying lessons learned to your disaster and crisis management program allows you to bring your procedures into focus with reality, and more importantly, it enables you to use the incident as a means of improving your program to better prepare for future situations.
While we never hope for another disaster, if one should occur again, your response, management and recovery operations should be smoother and more successful due to your post-incident review efforts.
By remembering the past, reinforcing strengths and enacting enhancements, we will heed the warnings and not be condemned to repeat history.
Mark Morgan is a Senior Associate with the Corporate Response Group, Inc. in Washington D.C.
Effective contingency planning and disaster recovery coordination require expertise in all aspects of disaster management, including avoidance and recovery. It is too late to plan an effective response after a disaster has struck and significant downtime has been incurred. The resulting outage from such a disaster can have serious effects on the viability of a firm's operations, profitability, quality of service, and convenience. In fact, these consequences may be more severe because of the lost time that results from inadequate planning.
After such an event, it is typical for senior management to become concerned with all aspects of the occurrence, including the measures taken to limit losses. Their concerns range from the initiating event, and contributing factors, to the response plans, equipment, training, and recovery operations used to counter it. Rather than delegate disaster avoidance to the facilities or building security organizations, it is preferable for a firm's disaster recovery planner(s) to understand fully the risks to operations and the measures that can minimize the probabilities and consequences, and to formulate their disaster recovery plan accordingly.
Crucial to the effective management of response to accidental loss is the ability to recognize risk. Colloquially, we use the term risk to refer to the possibility of any loss, regardless of its size. For example, I might casually comment to a friend over dinner that she risks indigestion by eating spicy foods. The disaster recovery professional, however, is primarily concerned with accidental losses that can have a serious, harmful impact on company finances. From this perspective, risk is the possibility of significant financial impact.
The disaster recovery planner’s intuition of risk, for a hypothetical firm, is shown in the accompanying diagram. When the probability and size of loss (indicating possibility and financial significance, respectively) are both high, risk exists. On the other hand, risk is not associated with very low probability of occurrence, or with losses that under any other circumstances would be considered “affordable”. Note that there is a gray area between probability/loss combinations that are truly risky, and those that are not. This reflects the fact that the boundary between risky and non-risky events is fuzzy, not exact. We simply do not know enough about the real world properties of risk to be able to apply the concept precisely.
To assess the risk faced by the organization, the planner matches the probability and loss characteristics of various exposures to his or her intuition of risk. This exposure analysis can be most effectively carried out using loss scenarios. A scenario is a synopsis of events or conditions leading to an accidental loss. Scenarios may be specified informally, in the form of narrative, or formally using diagrams and flow charts.
THE RISK ASSESSMENT PROCESS
Risk assessment using scenarios is straightforward. Consider three loss scenarios facing our hypothetical company. For concreteness, let us assume the firm is in the business of transporting various cargoes, some hazardous. The three scenarios we will limit ourselves to all involve the legal liability arising from use of company autos on public roads. The probability/ loss combinations associated with these scenarios are shown on the diagram on page 69. Point A represents the scenario of an upset or overturn of a truck carrying dangerous cargoes in a populated area. It is further assumed that the spill leads to an explosion or release of toxic chemicals. Point B represents the company’s liability for an accident involving bodily injury and property damage from relatively “ordinary” road hazards. No spill or disruption of cargoes is involved. Finally, point C identifies a scenario involving multiple simultaneous catastrophes involving the company fleet.
The identification of probabilities and loss potentials associated with a scenario is usually performed by engineers and actuaries, based on statistical data and expert judgement. Scenario A has a probability of occurrence of 10^-3 (.001, or one chance in one thousand) and a loss potential of $50 million. It is deemed sufficiently “possible” and significant so as to be unequivocally classified as “risky”. Scenario B, on the other hand, while more probable than A, involves losses that this firm considers “affordable”. As such, it is confidently rated not risky. Not so easy to classify is scenario C. While the probability of multiple catastrophes is not strictly zero, it is rare (around 10^-6, or the proverbial “one chance in a million”!). So while the loss potential is great, occurrence is “virtually impossible”. Scenario C, nonetheless, resides in that gray area of risk that results in considerable anxiety over its classification.
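The classification of scenarios A, B and C can be illustrated with a minimal sketch. The article gives no numeric boundaries for the risky, gray and non-risky regions, so the three thresholds below are assumptions, as are the figures for scenario B and the loss potential for scenario C. Only the probability and loss for A and the probability for C come from the text.

```python
from dataclasses import dataclass

# Assumed thresholds, chosen only to reproduce the A/B/C ratings in the text.
AFFORDABLE_LOSS = 1_000_000   # losses at or below this are "affordable"
NEGLIGIBLE_PROB = 1e-7        # below this, occurrence is treated as impossible
GRAY_PROB = 1e-5              # from NEGLIGIBLE_PROB up to this: gray area


@dataclass
class Scenario:
    name: str
    probability: float  # probability of occurrence
    loss: float         # loss potential in dollars


def classify(s: Scenario) -> str:
    """Place a scenario in the risky, gray, or non-risky region."""
    if s.loss <= AFFORDABLE_LOSS or s.probability < NEGLIGIBLE_PROB:
        return "not risky"
    if s.probability < GRAY_PROB:
        return "gray area"
    return "risky"


scenarios = [
    Scenario("A: hazardous cargo spill", 1e-3, 50_000_000),   # from the text
    Scenario("B: ordinary road accident", 1e-1, 250_000),     # assumed figures
    Scenario("C: multiple catastrophes", 1e-6, 500_000_000),  # loss assumed
]

for s in scenarios:
    print(f"{s.name}: {classify(s)}")
```

A hard threshold is a simplification. The gray band is kept as its own category precisely because the boundary between risky and non-risky events is fuzzy.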
In practice, many more scenarios can be added to the diagram. This gives the analyst a complete risk profile of the organization’s exposure to accidental loss. Scenarios can also be constructed by individual departments or operating units within the organization. These individualized scenarios are easily combined to give an organization-wide picture of risk.
Often, the analyst’s measurements of probability and loss potential will themselves be inexact. Uncertainty is easily accommodated by ascribing a range of probabilities and/or loss potentials to a scenario. The degree of overlap of these ranges with the analyst’s definition of risk determines the overall “riskiness” of the scenario. While the organization’s picture of risk may be rather rough, it can provide valuable guidance to disaster planners and others responsible for the effective management of risk.
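One possible way to turn "degree of overlap" into a number is sketched below. It is an illustration, not the article's method: the "risky" region is assumed to be everything above a probability threshold and a loss threshold, ranges are measured on a logarithmic scale, and the two overlaps are multiplied. The thresholds and the sample scenario are hypothetical.

```python
import math


def overlap_fraction(low: float, high: float, threshold: float) -> float:
    """Share of the range [low, high], on a log10 scale, at or above threshold."""
    lo, hi, t = math.log10(low), math.log10(high), math.log10(threshold)
    if hi == lo:  # an exact estimate rather than a range
        return 1.0 if lo >= t else 0.0
    return min(1.0, max(0.0, (hi - t) / (hi - lo)))


def riskiness(prob_range, loss_range, prob_threshold, loss_threshold):
    """Product of the two overlaps: 0.0 is clearly not risky, 1.0 clearly risky."""
    return (overlap_fraction(*prob_range, prob_threshold)
            * overlap_fraction(*loss_range, loss_threshold))


# Assumed boundaries of the "risky" region and an assumed uncertain scenario.
PROB_THRESHOLD = 1e-5
LOSS_THRESHOLD = 1_000_000
score = riskiness((1e-6, 1e-4), (10_000_000, 100_000_000),
                  PROB_THRESHOLD, LOSS_THRESHOLD)
print(f"riskiness score: {score:.2f}")
```

Here the loss range lies wholly inside the risky region while the probability range straddles its boundary, so the scenario scores between the two extremes.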
It is important that when uncertainty exists, it be properly communicated. Studies have shown that decision makers react differently to uncertain information than to exact information. Under conditions of uncertainty, decision makers tend to make their responses more flexible. Masking the uncertainty involved in an estimate of risk can, therefore, lead to inferior decisions.
It is equally important that the analyst not introduce too much vagueness into the process, by using undefined qualitative expressions. Simply rating the probability of disaster associated with some scenario as “high”, “medium” or “low”, for example, introduces such vagueness. In communicating risk, the disaster planner must make sure that the range of probabilities represented by these words is understood. By specifying the range of probabilities associated with words, as in the accompanying risk diagram, we can prevent such confusion. Ranges, possibly graded by confidence level, provide a mathematical structure that can be manipulated just like exact estimates. The difference is that the uncertainty of the estimate is preserved.
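As a small illustration of ranges being manipulated like exact estimates, the sketch below attaches a probability range to each qualitative word and multiplies it by a loss range. The word-to-range table is hypothetical, and multiplying probability by loss is just one manipulation an analyst might choose.

```python
# Hypothetical vocabulary: each word stands for an agreed probability range.
PROBABILITY_WORDS = {
    "low": (1e-6, 1e-4),
    "medium": (1e-4, 1e-2),
    "high": (1e-2, 1.0),
}


def expected_loss_range(word: str, loss_range: tuple) -> tuple:
    """Multiply a probability range by a loss range, keeping both ends.

    All values are non-negative, so the product of the two ranges runs
    from (low * low) to (high * high).
    """
    p_low, p_high = PROBABILITY_WORDS[word]
    loss_low, loss_high = loss_range
    return (p_low * loss_low, p_high * loss_high)


low, high = expected_loss_range("medium", (10_000_000, 50_000_000))
print(f"expected loss between ${low:,.0f} and ${high:,.0f}")
```

The result is itself a range, so the uncertainty in the inputs is carried through rather than masked.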
The uses of scenario-based risk analysis are many and varied. The explicit analysis of scenarios may suggest ways of reducing or eliminating exposures through loss control activities. Loss control actions have the effect of shifting where scenarios lie on our risk diagram by reducing probability of loss, amount of loss, or both. Often, scenarios are posited on the basis that loss potential is as low as reasonably achievable (“ALARA”). This type of analysis recognizes that even under the best of loss control programs, accidents will happen.
As the cornerstone of disaster recovery planning, scenario-based risk analysis allows identification and prioritization of disaster potential. Knowing what can happen, and the risk involved, allows the analyst to make effective plans for business recovery in the event of disaster. By concentrating on risky scenarios, the disaster recovery planner can tailor recovery actions to exposures. This ensures the best allocation of resources in the time of crisis.
The diagrammatic approach demonstrated above is easily incorporated into disaster recovery plans. It provides a basis for the formal, yet realistic, analysis of risk. During its construction, company management becomes aware of the various potentials for serious accidental loss within the organization, as well as their probabilities. The added focus makes for better plans.
Mark Jablonowski, CPCU, ARM, is a Risk Manager for the Hamilton Standard Division of United Technologies Corporation.
California businesses face numerous disasters every day: Earthquakes, Flooding, Hazardous Material Accidents, High Winds, Power Outages, and the occasional Structural Collapse due to metal fatigue. Who knows when such an incident will happen in YOUR VICINITY? It may cause damage and destruction to your company, seriously injure your employees, shut down your business operations, and cost you tens of thousands of dollars in litigation and compensation.
To reduce or eliminate these costly problems, you must be prepared. Contingency and Emergency Action Planning is essential to ensure the safety of employees and to keep your company operating through an Environmental or Technological "disruption".
Proper preparedness for such a catastrophe consists of a well coordinated on-site implementation of Disaster Recovery and Resumption Systems and Emergency Supply placement. Your Safety Committee should start with an assessment of the following five conditions, as outlined:
- Remote Sites versus Hot Sites: Permanent locations (satellites and subsidiaries which are distant from your main operation), as compared to temporary (on-site) operating centers, such as trailers.
- Complete plan testing and evaluation: Simulated Exercises which put theory into action. This coordinates all areas into an effective operation, where everyone does their assigned jobs to test their strengths and weaknesses. This also helps to prevent panic, and restore professionalism in an actual crisis.
- Hazards Identification: Structural Facilities, Office and Data Systems, Equipment Security, Lifelines and Utilities, Stock and Vehicular storage areas; the list is almost endless.
- Survival Supplies: Food and Water for a minimum of three days, for EACH PERSON...including your stranded clients and delivery persons. Blankets, First Aid and Medical "Trauma" kits; Hygiene and other sanitation items; and don't forget about the temporary Morgue.
- Personnel: Life Safety skills and techniques: Your staff should be trained in Disaster First Aid, CPR, Corporate Survival and Industrial Safety, and the fundamentals of Search and Rescue. You may want to consider Basic Fire Suppression, Survey and Control of Hazards, and possibly even an orientation to Emergency Radio Communications.
- Helpful Techniques of Employee Education: Personnel will learn more effectively in an environment they are comfortable in, rather than an offsite "classroom". You may use a lounge, break room, a large work area (clean, of course), or even a conference room. Providing proper refreshments helps with attentiveness: fresh fruit, cool juices and herbal teas, as compared to certain hot drinks, donuts, and other sugar-laden snacks, which seem to cause drowsiness or nervous fidgeting in some people.
The instructors you choose to train your personnel should have the proper experience in the respective fields of Emergency Management and related specialties, and have taught through various agencies which would prove their versatile experience. Examples are the American Heart Association and the Mine Safety and Health Administration (U.S. Department of Labor), to name a few.
There are legal and theoretical reasons for personnel training. When a disaster occurs, the personnel on site are the "First Responders". Mainly, they need this education to be able to help their injured co-workers, especially if the disaster covers a large geographical area and the Fire Department is unable to respond.
Also, under current Occupational Safety and Health regulations, Disaster is classified as a "potential and unforeseen hazard", and therefore must be included in the SB 198 program. This means there must be a curriculum of continuing training and safety updates, which is necessary education for your own protection, both in the physical and liability sense.
All your Contingency plans and employee training will then go hand in hand with your completed Cal-OSHA Injury and Illness Prevention Plan. This is one of those OSHA "Gray Area traps". As we should all know, "Better safe, than sorry" IS BETTER than "Live, and learn (the hard way!)."
Scott Garig is the Chief Executive Officer, and co-founder of the California Regional Emergency and Disaster Services, which provides Consulting and Education.
The enlightened businessman prepares plans to deal with emergencies. He knows that being ready to deal with disasters--both natural and human-made--can make the critical difference to the bottom line should disaster strike. Detailed emergency planning is particularly important for those businesses that:
- Are just-in-time suppliers, in particular those that are liable to incur large penalty charges when they cannot deliver on time due to a production interruption.
- Are supplied by just-in-time vendors, when a vendor's failure to deliver can result in costly production stoppage.
- Operate on tight margins where significant downtime can be crippling.
Serious emergency planning requires: first, specifying the critical aspects of the business--those that really require protection and rapid restoration; second, defining the threats to those critical aspects; and third, developing detailed action plans for prevention, protection and mitigation.
Good emergency planners follow this prescription, but generally do so in a qualitative or deterministic way.
More advanced, comprehensive plans go a step further and apply probabilistic methodologies in such a way as to quantify risks.
Quantitative Risk Assessment (QRA) techniques are routinely used today in situations where the effects of a disaster might be of such magnitude that a higher level of planning detail is required. Examples of operations where QRA techniques are routinely applied include commercial nuclear electric power stations and the Department of Energy's nuclear weapon production facilities.
With the manufacturing sector's recent shift towards just-in-time stocking practices, interruption of production flows can have far reaching and serious consequences.
The failure of a small supplier might cause downstream effects, ones that ripple through an industry where operations occur at facilities sequentially.
The microelectronics industry provides a good example. Consider a chip maker that provides custom microprocessors on a just-in-time basis to a manufacturer of automated machine tools.
A natural disaster that interrupts the microprocessor production line will shut down the machine tool line in turn. QRA's of such operations can help the manager to understand the risks involved, and where to focus capital in an effort to minimize risk.
In order to conduct a QRA of an industrial operation, a considerable amount of research is required. The production operation itself must, of course, be well understood. Environmental factors specific to the manufacturing site must be researched. Here weather history, seismic activity, flooding vulnerability (including seiche and tsunami) and other similar phenomenology must be understood in detail.
Site susceptibility to hazards released from or generated by nearby industrial operations and transportation systems must also be included in the analysis. Accordingly, QRA methodology can be applied to an industrial operation in a six-step process. The analysis requires:
- The identification of the critical paths in the production operation. Here a thorough knowledge of processes, supply stock levels, normal and emergency energy supplies, etc., is required.
- The identification of the utilities, supplies and services that are critical in supporting the production operation.
- The identification of the threats, natural and human-made, that can interrupt production directly or interrupt the flow of supporting supplies and services. Development of frequency distributions for these threats is also required.
- The determination of the time required to restore interrupted operations, taking into account emergency plans and capabilities of the company, various levels of government, and individual supply/service providers. Frequency distributions are required since time-to-restore is generally a function of severity of threat. In these distributions, the impact of each level of threat on each operation/supply/service is developed.
- The rank-ordering of the threats to production. Mathematical methods for convolving the probability matrices are used in this process.
- And, following careful review of the analytic results, preparation of recommendations focused on consequence mitigation. Here both preventive measures and planned emergency actions are considered. Recommendations for prevention and preparedness actions can be ranked in order of cost-effectiveness.
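The rank-ordering step can be sketched in a few lines. This is a deliberately simplified stand-in for the matrix methods mentioned above: it uses a plain expected-value sum over severity levels and assumes that service interruptions are independent. Every frequency, probability and restore time below is hypothetical.

```python
# threat -> severity levels, each given as
# (annual frequency, {service: probability of interruption}, days to restore)
THREATS = {
    "earthquake": [
        (0.02, {"water": 0.5, "electric power": 0.6}, 3.0),    # moderate
        (0.002, {"water": 0.9, "electric power": 0.9}, 14.0),  # severe
    ],
    "snow/ice storm": [
        (0.5, {"site access": 0.3, "electric power": 0.1}, 1.0),   # moderate
        (0.05, {"site access": 0.8, "electric power": 0.4}, 3.0),  # severe
    ],
}


def expected_downtime(threats: dict) -> dict:
    """Expected days of production downtime per year, by threat."""
    totals = {}
    for threat, levels in threats.items():
        total = 0.0
        for frequency, impacts, days_to_restore in levels:
            # Probability that at least one critical service is interrupted,
            # assuming the services fail independently of one another.
            p_none = 1.0
            for p in impacts.values():
                p_none *= 1.0 - p
            total += frequency * (1.0 - p_none) * days_to_restore
        totals[threat] = total
    return totals


def rank(totals: dict) -> list:
    """Threats ordered from most to least expected downtime."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


totals = expected_downtime(THREATS)
for threat, days in rank(totals):
    print(f"{threat}: {days:.3f} expected downtime days per year")
```

A full analysis would carry complete frequency distributions for threat severity and time-to-restore rather than the two severity levels per threat used here.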
To illustrate this process, an example is useful. Emergency planning managers at an electronics manufacturing firm engaged in just-in-time supply of custom microprocessors were in the process of updating emergency response plans.
In the course of conducting a qualitative review of their operations, they identified a large number of threats with the potential to interrupt an almost equally large number of suppliers and services, all critical to the operation.
The planners determined the following threats and critical supplies and services were of sufficient concern to warrant application of QRA techniques:
- Threats: aircraft crash, earthquake, flooding, nearby industrial accident, lightning strike, snow/ice storm, tornado/wind, transportation accident.
- Critical Supplies/Services: bottled compressed gasses, bulk chemicals, electric power, natural gas, water, sewer service, site access.
In conducting the analysis, the probability of each of the threats was developed for several levels of threat severity. The effect the various threat levels had on each of the critical supplies/services was assessed. The ability of emergency response groups to restore supplies/services was determined, and last, the threats were rank ordered.
As can be seen, the loss of water supply is the most likely production-interrupting event, followed by loss of site access.
The analysis also provided a rank-order of threats within each supply/service category. As a result of the QRA analysis, many actions were proposed to reduce risk to production.
The two proposed mitigating actions ranked highest in terms of cost-benefit were:
- Construction of an on-site water holding tank with sufficient capacity to supply production for one day. Having such a tank would significantly reduce the overall risk of production stoppage.
- Requesting a neighboring industrial facility to move or reduce quantities of stored chlorine gas. Required evacuation (loss of access) due to an inadvertent toxic gas release at this neighboring facility was found by QRA analysis to be the second most likely cause of loss of production capability. The potential impact of this threat had not been understood prior to the conduct of the QRA analysis.
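Ranking proposed actions by cost-benefit can be sketched as follows. The costs, the downtime avoided and the cost of a lost production day are all assumed figures, and the third option (a second electric power feed) is a hypothetical addition included only for comparison.

```python
from dataclasses import dataclass

COST_PER_DOWNTIME_DAY = 200_000  # assumed dollars lost per day of stoppage


@dataclass
class Mitigation:
    name: str
    annual_cost: float            # assumed annualized cost in dollars
    downtime_days_avoided: float  # assumed expected days avoided per year


def benefit_cost_ratio(m: Mitigation) -> float:
    """Dollars of expected loss avoided per dollar spent."""
    return m.downtime_days_avoided * COST_PER_DOWNTIME_DAY / m.annual_cost


options = [
    Mitigation("second electric power feed (hypothetical)", 150_000, 0.3),
    Mitigation("neighbor moves or reduces chlorine storage", 10_000, 0.15),
    Mitigation("on-site water holding tank", 40_000, 0.8),
]

ranked = sorted(options, key=benefit_cost_ratio, reverse=True)
for m in ranked:
    print(f"{m.name}: benefit/cost = {benefit_cost_ratio(m):.1f}")
```

The downtime-avoided figures would in practice come from rerunning the risk analysis with each mitigation in place.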
Generally, in conducting QRA analyses, specific local data can be found that characterize threat levels for a facility. Where local specific data is not available, point estimates can be made based on Federal Emergency Management Agency (FEMA) and other federal, state and local agency generic data.
In assessing the threats, computer programs have been developed to operate on the probability distributions of threats, threat severity, and emergency response capacity. This allows a rapid calculation of rank-orders, and varied ways of quickly displaying the results.
The conduct of a QRA analysis requires the use of professional emergency preparedness and planning analysts.
The development of a QRA for a typical mid-sized manufacturing facility might involve two or three months of analytic effort. However, this small investment in detailed analysis of threats to production will have high payoff in mitigating the effects of potential disasters.
James R. Lynch is a Senior Engineer with Science & Engineering Associates, Inc., of Albuquerque, N.M.
The risk analysis process provides the foundation for the entire recovery planning effort
There may be some terminology and definition differences related to risk analysis, risk assessment and business impact analysis. Although several definitions are possible and can overlap, for purposes of this article, please consider the following definitions:
- A risk analysis involves identifying the most probable threats to an organization and analyzing the related vulnerabilities of the organization to these threats.
- A risk assessment involves evaluating existing physical and environmental security and controls, and assessing their adequacy relative to the potential threats of the organization.
- A business impact analysis involves identifying the critical business functions within the organization and determining the impact of not performing the business function beyond the maximum acceptable outage. Types of criteria that can be used to evaluate the impact include: customer service, internal operations, legal/statutory and financial.
Most businesses depend heavily on technology and automated systems, and their disruption for even a few days could cause severe financial loss and threaten survival. The continued operations of an organization depend on management’s awareness of potential disasters, their ability to develop a plan to minimize disruptions of mission critical functions, and the capability to recover operations expediently and successfully. The risk analysis process provides the foundation for the entire recovery planning effort.
A primary objective of business recovery planning is to protect the organization in the event that all or part of its operations and/or computer services are rendered unusable. Each functional area of the organization should be analyzed to determine the potential risk and impact related to various disaster threats.
All too often, presentations on how to perform a 'business impact analysis' fall short on the 'analysis' component of the project. There have been scores of wonderful presentations on 'risk analysis' questionnaires, describing which areas should be surveyed, with in-depth discussions on what to look for, but all lacking substance on how to proceed once the data has been collected.
This lack of direction is primarily due to a failure to develop a scoring process for responses along with guidelines which would help to standardize responses into a discrete range of possible answers. In addition, most surveys fail to compare the survey responses to distinct threat probabilities to determine the level of risk. And most importantly, they do not provide a means for measuring the effects of 'what-if?' scenarios to illustrate various strategies which would mitigate the effects of disasters.
Most people believe that the questions that need to be asked are subjective, and any attempt to score responses would result in arbitrary numbers without any true relationship to their importance. This belief would be true if there were no structure given to the range of responses and no intelligent thought given to the phrasing of the questions posed.
For example, some questionnaires ask if there is physical security protection in the building, and perhaps allow its reliability to be rated on a scale of 1 to 10. Of course, depending on how they are viewed by the respondent, answers could vary widely. Also, there is no indication, in the form of a guideline, as to what to evaluate, or how to rate each element of the category.
Since most surveys are performed on paper questionnaires, any attempt at providing imbedded guidelines only lengthens and complicates the survey forms. This results in a reduction of responses or a lack of cooperation due to the limited amount of time the respondents have to dedicate to the task.
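To make the idea of a structured scoring process concrete, here is a minimal sketch for the physical security question discussed above. The rubric, its four answer choices, the threat probability and the way the two are combined are all assumptions for illustration, not a published method.

```python
# Hypothetical rubric: each allowed answer has a fixed, documented score,
# so respondents choose a description instead of inventing a number.
PHYSICAL_SECURITY_RUBRIC = {
    "none": 0,
    "locks only": 1,
    "locks and alarm": 2,
    "card access, alarm, and guards": 3,
}
MAX_SCORE = max(PHYSICAL_SECURITY_RUBRIC.values())


def vulnerability(response: str) -> float:
    """1.0 means fully exposed, 0.0 means the best rubric answer."""
    return 1.0 - PHYSICAL_SECURITY_RUBRIC[response] / MAX_SCORE


def risk_level(threat_probability: float, response: str) -> float:
    """Compare a survey response against a threat probability."""
    return threat_probability * vulnerability(response)


INTRUSION_PROBABILITY = 0.1  # assumed annual probability of the threat

current = risk_level(INTRUSION_PROBABILITY, "locks only")
what_if = risk_level(INTRUSION_PROBABILITY, "locks and alarm")
print(f"current risk level:   {current:.3f}")
print(f"what-if (add alarm):  {what_if:.3f}")
```

Because each answer maps to a fixed score, a 'what-if?' comparison amounts to rescoring the same question with a different answer.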
Through project planning, providing guidelines, scoring responses, and analyzing results, the survey team can develop a methodology that will result in documented evidence to support their conclusions and recommendations. The following are guidelines which will assist in developing a successful analysis.