Will your Business Recovery Plan stand up to a thorough, well structured audit?
Most corporations have a Corporate Audit Department which is charged with also auditing these plans. But most such audits tend to consist of a request to see the hard copy of the plan, to check the most recent review date and the most recent test date. Even when the corporation is subject to audit by Federal and State Examiners these tend to look only for the same physical evidence.
Unfortunately most audits, internal and external, tend to be viewed as a nuisance at best, which is a shame, because a good audit program can be enormously helpful in identifying weaknesses in a plan, or areas that should be reviewed.
Is the current, generally cursory type of audit enough? Will such an audit determine whether the plan is sound, whether the corporation really would be protected if it relied wholly on the plan? I would suggest that such audits should be much more searching to be of real value.
The following questions are among those that should be asked by an auditor in determining the adequacy of the Business Recovery Plan and the process.
The Plan Manager
By Damom Arber
What experience does the Plan Manager have? Is the Plan Manager considered to be a professional, or is this a part time function? Is this a job that was given to someone because the corporation either didn't have another slot for the individual or because they 'had to have someone seen to do the job'?
Did the Plan Manager receive any training? Did he develop and write the plan? Does he actively participate in review and exercising as someone whose function needs to be restored, or does he manage these processes? In a plan for a small unit the Plan Manager may well manage all three aspects and be a participant in the recovery activities but in a large unit he should establish the criteria and have these done by the department.
In exercising the plan, he should establish the goals, the parameters and the success factors. He may well be the test manager but should not be actively engaged in the details of the recovery process.
Determination of Criticality
How was the plan developed? Was a Business Impact Analysis completed? Did completion of the BIA involve the department executive and management and at least a representation of the department staff?
How was the criticality of the functions that are performed by the unit determined? Was it based on the unit manager's decision that 'of course this is a critical function'? Or was it determined by using parameters established through use of one of the expert systems available?
If an expert system was used and there was a discrepancy between the unit manager's determination and that of the expert system how was this discrepancy resolved? Generally any responsible unit manager's thoughts on the criticality or otherwise of that unit's functions are pretty accurate, but should be objectively or externally confirmed.
Was a Critical Resources Analysis performed, or were the critical numbers and requirements plucked out of the air based on the perception of the person(s) determining the criticality of the functions? This is not to say that these numbers are necessarily incorrect; the people working in and managing the department generally have a good 'feel' for what is critical. The danger with relying on this is that too much may be considered 'critical' rather than too little.
A rule of thumb for people not professionally involved in plan development and maintenance seems to be 50%; that is, 50% of current resources would be required to restore the critical functions of the unit, (most people consider their own job at least to be critical).
In fact critical resources may be nearer 20% - or 70% in the context of immediate short term recovery, say the first 30 days. But this will only be determined if the CRA was done and done properly. Remember that if the unit considers that 50% of current resources is necessary, when in fact only 20% is required, the cost of maintaining these extra resources could be considerable, money that could be better put to use elsewhere. Conversely if 70% is actually required any recovery could be severely hampered due to lack of resources if only enough are available to restore 50%.
Siting of Recovery Facility
Does the siting of the recovery facility make sense, or is it just a convenience, or perhaps an inconvenience? If a disaster were to strike the current work site would the recovery site also be impacted because it is too close, or on the same power or communication grid?
On the other hand, is the recovery site inconveniently far away merely because of a policy that states something like 'the recovery site must be at least 10 miles from the original site'? This sort of blanket statement makes sense in an environment subject to hurricanes, tornados, floods or earthquakes, but in those circumstances even 10 miles may be inadequate.
However, in a geographically stable location that is not subject to such extreme climatic disturbances such a blanket ruling may be unnecessarily broad. Most disasters that businesses face are of the nature of a fire or explosion or internal flood caused by triggering of sprinklers.
In the case of a fire or explosion there is every likelihood that the fire and police chiefs will put up a cordon going out a couple of or a few blocks at most. An internal flood causing evacuation of the building would not affect buildings in the immediate vicinity. It can be and is argued that the Chicago flood is evidence that a recovery site should be situated well outside the city limits, but I would suggest that that circumstance was an aberration. There are few cities outside the recognized earthquake zones and flood plains with the kind of situation and structure that would lead to their being subject to that sort of catastrophe.
The city with probably the greatest experience of coping with the kind of disasters that most businesses face is London, England, which has been hit by a number of IRA terrorist bombings over the past several years. In the experience of the London Metropolitan Police, damage is limited to an area of approximately a quarter mile radius. Any businesses outside that distance would be unaffected, except perhaps for power and communications. (This is the reason I mentioned above that a recovery site should be on different grids from the original site.)
Notwithstanding the above comments there may well be a very practical reason why the recovery site is a hundred or a thousand miles from the original; another plant with additional capacity available for instance. But siting is something that should be objectively questioned.
Copies of Plan
How many copies of the plan are there, and where are they? Is there one for the whole department, perhaps locked in the Plan Manager's credenza? Do the rest of the staff have access to the plan - should they have access? Is the plan considered confidential, perhaps because it contains material of a sensitive nature? If so is it treated like a confidential document?
Is there an alternate to the Plan Manager, in case a disaster happens during the absence of the latter? Does this individual have a copy of the plan? Do both the Manager and alternate have copies offsite for the eventuality that they are unable to get into the original site? Is a copy maintained at the recovery site? Are all the available copies of the same vintage?
How are the copies controlled? Are they numbered and the numbers and location recorded? Is there a process for transferring the copies when the Plan Manager or Alternate change? Is it necessary that the copies should be so controlled?
Staff Training and Awareness
Have all those who would be directly affected in a recovery been made aware of the existence of a recovery plan, what their activities and functions would be in the recovery process? Are they kept up to date on amendments and changes and is there a process for ensuring this is done? It's quite surprising how often it's taken for granted that everybody affected knows what's going on and what would be expected of them, when in fact just the opposite is true.
Do the rest of the personnel know there is a plan and what they would be expected to do in the event of a disaster, even if it is only to go home and await further instructions? Are the staff provided with any kind of document on which basic information is recorded, e.g. emergency telephone numbers, address of recovery site? If so, is there a process for keeping this up to date?
Are the executive included in the training and awareness process, as an integral part of the plan rather than just to be seen to be on site?
Is there a forum and process for staff to question aspects of the plan or recorded recovery process, and to add their experience and expertise?
Off-Site Storage of Documentation
Is there an adequate procedure for off-site storage of data tapes and any documentation considered critical to a recovery? How frequently is the data sent off site: daily, weekly or less frequently? Is the frequency realistic? If it is weekly, for instance, how would the unit recover the information lost between the time of the most recent tape and the date of a disaster - which may be as much as six days later?
Is the off-site storage company professional, are their premises secure, are the tapes picked up in a secure container?
And probably the SINGLE BIGGEST CAUSE OF FAILURE in a recovery: HAVE THE BACK-UP TAPES BEEN TESTED? Tapes have a useful 'shelf life' of no more than several months. If they are continuously recycled over a period of a year or more, the unit may well find, when it tries to access the information thought to be stored, that it is in fact unrecoverable. Tapes more than, say, six months old should be replaced with new ones; dates of tape usage should be recorded and responsibility acknowledged.
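An auditor could check the tape-age rule mechanically if the dates of tape usage are recorded. The following is only a sketch: the register layout, the tape labels, the dates and the 183-day reading of "six months" are all illustrative assumptions, not part of any particular tape management system.

```python
from datetime import date, timedelta

# "Say, six months": the exact threshold is a judgment call (assumed here).
MAX_TAPE_AGE = timedelta(days=183)

def tapes_due_for_replacement(first_used, today):
    """Return labels of tapes first put into service more than MAX_TAPE_AGE ago."""
    return sorted(
        label for label, used_on in first_used.items()
        if today - used_on > MAX_TAPE_AGE
    )

# Hypothetical register of tape labels and first-use dates.
register = {
    "TAPE-001": date(2024, 1, 10),
    "TAPE-002": date(2024, 6, 3),
    "TAPE-003": date(2023, 11, 20),
}

due = tapes_due_for_replacement(register, today=date(2024, 9, 1))
print(due)
```

A register of this kind only shows which tapes are old; it does not replace an actual test restore from the tapes.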
Has the ability of the off-site storage company to respond in an emergency been tested? During a regular exercise the company is given notice as to when and where the tapes would be required and can position themselves accordingly.
But what would be their response to a call for service at 2:00 in the morning? Any responsible off-site storage company should be willing to acknowledge that the corporations it serves need it to respond twenty four hours a day, and should be willing to have this level of service tested.
Is the unit dependent on other units from which it receives or to which it provides hard or soft copy, materials, work in progress etc.? If so do the other units have viable recovery plans which also acknowledge these same interdependencies? If the recovery sites of the affected units are not in the same building are there procedures built into both units' plans that will enable the interdependent processes to be restored? In fact, even if the recovery sites of the impacted units are in the same building, are there similar procedures in the plans? Remember that the means of communication, delivery and transportation will have been affected by any disaster.
Emergency Response & Recovery Teams
Are the phone numbers of the civic, emergency response teams current? Do the police and fire departments have a copy of the floor plans of the unit's building? Do they have the phone numbers of the plan manager and alternate? The authorities may not want to keep these numbers on file because of the need for them to be maintained on a regular basis but they should have been approached.
Are the members of the various recovery teams aware of their responsibilities and functions, do they have a copy of an action plan? Are their phone numbers, business, home, emergency, cell, pager etc. on file and current? These should be randomly spot checked.
Testing and Exercising
Is there a written schedule for exercising the plan? Is the frequency adequate? Has the plan been exercised in accordance with this schedule? Are the exercises devised to determine the adequacy of the plan, or just to show to the executive and corporate audit that the plan has been tested?
Are the exercises comprehensive, i.e. are various parts of the plan exercised over a period of time or are the same sections exercised each time? If the exercises start 'small' do they increase in complexity? Are the exercises fully scheduled such that all staff know in advance what is being exercised, or are some of a 'surprise' nature? Do the exercises include the interdependencies?
Is there a written report made of the exercise, does it compare the results against pre-established goals and standards? If the report indicates one or more deficiencies in the plan are these evaluated as to the likely effect on the ability to recover? If the deficiencies are seen as serious has action been taken to correct them and has the plan been updated accordingly?
Are the critical staff rotated in the exercises so that a number get to take part over a period? Is the executive directly involved in the exercises? Are the recovery teams alerted and included?
Maintenance of the Plan
Is there a schedule for review and maintenance of the unit's plan? Is the frequency adequate? Does the schedule include provision for out of step review depending on major changes to the unit; function, processes, staffing, line of business, hardware/software requirements etc.? Has the plan been reviewed in accordance with the schedule? Who has reviewed the plan, was it the same level of staff who developed it? Does the review compare the plan with the original Business Impact Analysis? Once reviewed is it again signed off by the responsible executive? Once reviewed do current copies replace the older ones, ALL the older copies?
Does the Plan make Sense?
One unexpected advantage of an audit is that it is conducted by someone not directly involved with the development, maintenance and exercising of the plan. Such an individual has the opportunity to view the adequacy of the plan from the aspect of 'does it make sense'? Is the plan realistic, or just an exercise to show to anyone who shows interest that a plan exists, relying on the naivete of the questioner not to be able to determine the real adequacy and practicality of the plan and hoping that a crisis will never happen?
An audit well done is an invaluable tool in the whole business recovery planning process and should be used as such rather than seen as a nuisance.
Conversely, to be of consequence such an audit should be well thought out and implemented, viewing the need for a realistic Business Recovery Plan as being as vital as the financial stability of the corporation.
No matter how financially sound a corporation may be if it cannot recover in good time following a disaster, through lack of a well developed Recovery Plan, it may well be faced with an inability to recover at all.
Damom Arber, MBCI, is the manager of contingency planning for Corporate and Treasury Divisions of the Bank of Montreal.
This article adapted from 10#1.
Effective disaster planning is systematic. Good plans are purposeful, methodical and, above all, built on a firm foundation. The best framework for plan foundation-building is a careful and complete risk analysis. Risk analysis attempts to identify the conditions that can lead to disastrous outcomes, and their relative likelihoods. By reasoning through the possibilities, the disaster planner gets a better idea of what's important. He or she also gains a valuable understanding of the mechanism of disaster, resulting in more useful plans. This is in contrast to the 'be ready for anything' philosophy espoused by some planners. A scattershot approach can result in a serious lack of focus that may actually hinder an organization's ability to effectively respond to disaster. In practice, most planners do prioritize planning on the basis of at least some rough estimate of the likelihood and costs associated with possible disasters. I might, quite rationally, choose to dispense with earthquake planning in a seismically inactive area of New England. To a planner in Southern California, on the other hand, earthquakes are a major concern.
With its roots in the analysis of safety critical systems, like nuclear power plants, the scenario-based approach is an amalgam of formal methods. These include ideas from systems engineering and the theory of probability. To be truly useful, however, risk analysis must be easy to apply in practice. The perfect blend of rigor and simplicity is provided by a scenario-based risk analysis. A scenario-based analysis helps us develop a detailed analysis of disaster potential by providing a logical structure for the analysis. We make the logical structure of a scenario-based approach easy to develop, use and understand by integrating an intuitive graphical structure, in the form of flow charts.
This guide to systematic disaster planning is divided into two parts. In part 1, we describe a simple method for the formal analysis of disaster potential based on flow charts. Once completed, this analysis can serve as a rational basis for plan development and testing. This process is described in part 2 of the series. Taken together, these parts describe a ready-to-use methodology for effective disaster planning.
The Nature of Disaster
Disasters don't just happen. They develop through a dynamic chain of events. This chain always starts from some initiating event. The initiating events of most concern to modern disaster planners are things like fires, earthquakes, windstorms and chemical spills. Properly mitigated, outcomes stemming from these initiating events can turn out to be relatively minor, for example when sprinklers act to quell a fire at its incipient stages. Other times, initiating events lead to serious, adverse outcomes ... in a word, disaster.
What governs the path from initiator to outcome is the idea of randomness, or chance. We don't, and can't, know for sure what will happen next in the chain of events. We only have some idea of each event's relative likelihood, or probability. The concept of randomness that governs the process can be illustrated using simple gambling devices like cards, dice or coin tosses. We flip a penny into the air (introducing 'randomness') and then guess 'heads' or 'tails'. The nature of the process is well known. What we can't know for sure is whether the coin will land heads or tails. We know from the physical properties of the coin, as well as past experience, that a fair coin will land heads one out of two times in repeated tosses. The probability of heads is therefore 1/2 or .5. This number serves as a guide to action (e.g., when placing bets in a gambling situation) as well as an indicator of how 'expected' the event is to occur on the next try.
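The 'one out of two times in repeated tosses' idea can be tried out with a short simulation. This is a sketch only; the function name, the seed and the number of tosses are arbitrary choices made for illustration.

```python
import random

def heads_frequency(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the share that land heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

freq = heads_frequency(10_000)
print(f"share of heads in 10,000 tosses: {freq:.3f}")
```

No single toss can be predicted, but over many tosses the share of heads settles close to the probability of .5.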
The chain of random events that make up the path to possible disaster can be conveniently visualized using flow charts. Most of us are familiar with flow charts. They provide a schematic representation of a sequence of events, and their outcomes. By allowing us to visualize the flow of events, flow charts give us a better understanding of the underlying processes and how they all fit together to make up the systematic whole. They also provide a structure for the systematic calculation of event probabilities. To properly respond to disaster, we need to identify possible disasters, and assess their likelihood and consequences. Flow charts help us do just that.
Creating Disaster Flow Charts
Getting a flow chart for potential disaster on paper is simple: As a general rule, if you can think of a scenario, you can flow chart it. Disaster flow chart creation starts with the 'brainstorming' of possible scenarios arising from some initiating event. The results can be captured, initially, in the form of a narrative, or story. The various scenarios developed by this 'thinking out loud' method are then plotted in flow chart form. Flow charts have the advantage of helping us to better visualize processes which may be obscured by words alone. They also provide us with a structure on which to base probability calculations. For planning purposes, scenario outcomes can be prioritized according to their probability/consequence characteristics.
The figure on the following page shows a simple flow chart of the disaster potential of a firm engaged in the transport of hazardous chemicals. The initiating event here is a truck accident. To begin, we need to get an estimate of the likelihood of a truck being in some kind of accident during the course of a year. Company statistics show that this occurs roughly once every 5 years resulting in a probability estimate of 1/5 or .2. Now we just follow the logical progression from initiator to next steps. When a tanker truck is in an accident it can either spill its cargo, or not. Note that, in reality, the event 'cargo spill' can range continuously from 0 to the total load of the truck. Usually, one or a few options can capture the essence of events. We might, if we wanted to be a little more precise, expand the spill event to include minor, moderate and major spills, for example. Doing so, however, complicates the analysis. How complex we make a tree is a judgment made by the analyst, with the purpose of the exercise in mind. In many cases, even a very simple analysis can provide great insight into the process.
Using company records, as well as industry experience, we find that the probability of a spill given that a truck accident has occurred, is around 1/10, or .1. This means that, on average, we can expect one out of every ten truck accidents to result in some kind of spill. One branch of this event 'tree' now becomes terminal: The truck has an accident, there is no spill, and property damage to the truck amounts to approximately $45,000. This branch represents a final outcome or end-state. To determine the probability of any outcome we simply multiply the probabilities of events along the way. For example, the scenario in which a truck accident occurs, the truck is damaged and no spill of cargo occurs has a probability of .2 (truck accident) times .9 (no spill), or .18.
Focusing now on the other branch emanating from a possible accident, we notice that a cargo spill itself can be followed by various events. A spill can result in loss of cargo only (with substantial clean up costs), a fire or, in the worst case, a fire and explosion. These events are represented by a further branching of our tree. Using expert opinion and perhaps some actual accident data we determine the mutually exclusive probabilities of the events that occur given that an accident has resulted in a cargo spill. The most likely result is a spill with no fire. This happens 90 percent of the time when an accident-initiated spill occurs. There is a far lower chance (.099, or roughly one in ten) that the spill catches fire. Should the spill catch fire, the results are serious. Monetary damage to persons and property can run as high as $1,000,000. At this stage of the analysis we are looking at outcomes that could truly be labeled as 'disasters', at least from the perspective of our transporter. In some rare cases (one in a thousand) the cargo can actually explode. The resulting damage of this outcome is $2,500,000. Once again, to determine the probability of these final outcomes we multiply the probabilities along the tree. The probability of the worst case scenario, a truck accident that results in a cargo spill that ultimately catches fire and explodes (causing $2,500,000 in damages), is .2 x .1 x .001 = .00002. This amounts to a probability of two in one hundred thousand. We can look at this number in terms of annual event frequency - we expect two such events every one hundred thousand years of operation - or as the probability of such an event occurring this year among a population of one hundred thousand similar firms (we would expect two of these to suffer a $2,500,000 disaster).
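The multiplication along the tree is easy to set up in a few lines of Python. The probabilities and dollar figures below are the ones quoted in the tanker example; the data layout and names are illustrative assumptions. The example gives no dollar figure for the spill-only outcome, so its damage is left as None.

```python
import math

P_ACCIDENT = 0.2  # truck accident in a given year
P_SPILL = 0.1     # spill, given that an accident has occurred

# (outcome, conditional probabilities along the path, damage in dollars)
PATHS = [
    ("accident, no spill", [P_ACCIDENT, 1 - P_SPILL], 45_000),
    ("spill, no fire", [P_ACCIDENT, P_SPILL, 0.9], None),  # cost not stated
    ("spill and fire", [P_ACCIDENT, P_SPILL, 0.099], 1_000_000),
    ("spill, fire and explosion", [P_ACCIDENT, P_SPILL, 0.001], 2_500_000),
]

def path_probability(probs):
    """Multiply the probabilities met along one path of the tree."""
    return math.prod(probs)

outcomes = {name: path_probability(probs) for name, probs, _ in PATHS}

for name, prob in outcomes.items():
    print(f"{name:28s} {prob:.5f}")
```

As a check on the tree, the four end-state probabilities should add back up to the probability of the initiating event, .2.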
While some of these numbers appear imperceptibly small they become more tangible when we look at them from the perspective of the collective. In a group of one thousand entities, each facing a seemingly small annual probability of disaster of 1/10,000, or .0001, we can expect, on average, one of these to face some serious event within the next ten years. The question for the planner is: If that firm is yours, will you be ready? This is where a systematic plan for disaster recovery comes in.
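The collective figure can be checked directly. The sketch below assumes that 1/10,000 is an annual probability and that each firm in each year is an independent trial; both are simplifying assumptions, and the function name is illustrative. Under them, the group should expect about one such event per decade, and the chance of at least one comes out near 63 percent.

```python
def prob_at_least_one(p, n_firms, years):
    """Chance of at least one event among independent firm-years."""
    trials = n_firms * years
    return 1.0 - (1.0 - p) ** trials

expected_events = 0.0001 * 1000 * 10         # about one event per decade
chance = prob_at_least_one(0.0001, 1000, 10)

print(f"expected events over ten years: {expected_events:.1f}")
print(f"chance of at least one event:   {chance:.3f}")
```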
As noted above, the do-it-yourself potential of flow charting is high. Initial tries can be carried out with pencil and paper. Added structure, and a neater appearance, can be gained through the use of one of the many computer flow charting programs available. Often, flowcharts can be set up using computer spreadsheet programs. These permit the rapid calculation and recalculation of event probabilities as well. The flow charting of disaster scenarios is very much a learn-by-doing exercise. Computer tools make this learning process all that much easier.
The need to establish probability estimates is perhaps the most daunting task in creating a good flow chart analysis of disaster potential. Statistical data is usually very limited. Expert judgement can often be substituted for data, with good results. When uncertainty enters, it can be communicated using interval estimates. For example, we may estimate the uncertain probability of a truck accident as a range from one in three (1/3, or .33) to one in ten (1/10, .1). The width of this interval can serve as a measure of uncertainty. The analysis can then be run using 'high' and 'low' estimates, along with perhaps a 'best guess' (in our example, 1/5, or .2). When uncertainty exists it is important that it is adequately captured. What we don't know can be as important as what we do know.
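Running the analysis with 'low', 'best guess' and 'high' estimates amounts to repeating the same multiplication with each end of the interval. A minimal sketch for the worst-case path of the tanker example follows; the accident estimates are the ones given above, while the dictionary layout and names are illustrative assumptions.

```python
P_SPILL = 0.1        # spill, given an accident
P_EXPLOSION = 0.001  # fire and explosion, given a spill

# Interval estimate for the annual probability of a truck accident.
accident_estimates = {"low": 1 / 10, "best": 1 / 5, "high": 1 / 3}

worst_case = {
    label: p_accident * P_SPILL * P_EXPLOSION
    for label, p_accident in accident_estimates.items()
}

for label, prob in worst_case.items():
    print(f"{label:5s} {prob:.7f}")
```

The spread between the low and high results carries the uncertainty in the accident estimate through to the final outcome, rather than hiding it behind a single number.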
It is, of course, axiomatic that we can't capture every possibility in our charts. This is no reason, however, for us to not at least make the attempt. If done properly, we can take comfort in knowing that most, and the most serious, disaster scenarios facing our organization will be properly accounted for. It is only the most pessimistic among us that can genuinely believe that nature somehow conspires to present us only with those disasters that we have failed to account for.
Using the Results
While this example is highly simplified, it does bring out the points of value in a well thought out analysis of possible disaster scenarios. We gain a deeper understanding of how the process proceeds, as well as an estimate of the probabilities of various outcomes along the way. These probabilities allow us to prioritize our recovery and planning efforts. In the happy case where the probability of disaster is virtually nil, or where the consequences of an unexpected event are relatively minor, we might dispense with such preparation altogether. This frees resources for other uses. For more serious situations, the charts themselves serve as a framework for action. We leave a more detailed description of how this may be accomplished for part 2 of Dynamic Disaster Planning: From Ideas to Actions.
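One simple way to turn probabilities and consequences into a priority list is to rank outcomes by expected annual loss (probability times damage). This ranking rule is an assumption made for illustration, not something the example prescribes, and the names below are hypothetical. The figures are those of the tanker example; the spill-only outcome is omitted because no cost was given for it.

```python
# (outcome, annual probability, damage in dollars)
outcomes = [
    ("accident, no spill", 0.2 * 0.9, 45_000),
    ("spill and fire", 0.2 * 0.1 * 0.099, 1_000_000),
    ("spill, fire and explosion", 0.2 * 0.1 * 0.001, 2_500_000),
]

def expected_loss(probability, damage):
    """Expected annual loss for one outcome."""
    return probability * damage

ranked = sorted(outcomes, key=lambda o: expected_loss(o[1], o[2]), reverse=True)

for name, prob, damage in ranked:
    print(f"{name:28s} {expected_loss(prob, damage):>10,.0f}")
```

On expected loss alone the frequent, minor accident ranks first and the explosion last. A planner may reasonably weigh severity more heavily than this, since a rare $2,500,000 loss can threaten the firm in a way that a routine $45,000 loss does not.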
The branch points along a scenario 'tree' also provide us with guidance as to where and how the probability of disaster could be mitigated. For example, disaster probability could be greatly reduced in our example by increasing the probability of early notification and successful evacuation. To the extent that all reasonable actions can reduce this probability no further, we can at least go into future decisions with an idea of the risk involved. This risk may, or may not, be acceptable. At any rate, further damage could be mitigated with an effective plan of disaster recovery. Financial damages may be addressed with insurance or the sharing of community resources (e.g., disaster relief).
Scenario-based analysis of exposure to adversity using flow charts can be applied to a variety of perils at the enterprise, societal and even personal levels. While these perils may be very different in each case, there is a commonality in terms of the 'flow' from initiator to outcome that the graphical approach captures so well. This means that the knack for developing flow charts, once gained, can easily be applied to many different exposures. Flow chart analysis is also very modular, in the sense that we can start with simple representations and build from there. This allows for incremental construction of charts as the need for more detail arises. Our truck accident analysis, for example, could be expanded to identify the effectiveness of different types of evacuation and notification processes. Detailed flow chart analysis can also be focused on a particular event along the tree.
Systematic disaster planning starts with an understanding of the causal mechanisms of disaster. An easy and effective way to gain this understanding is through the construction of scenario flow charts. Is it worth the effort? Much disaster planning today is based on a 'seat-of-the-pants' approach. Indeed, informal analysis based on planners' intuition of disaster potentials has generally been rather successful. The problem is that the world isn't getting any less complex - only more. This means we have to keep one step ahead of the potential for disaster in our planning efforts. To do so, we need to introduce more formal methods of analysis - like the scenario-based approach to risk analysis. So the question is really not whether we can afford to introduce a more formal approach to disaster planning, but rather, how can we afford not to?
A disaster plan is a design, or blueprint, for action in the face of adversity. To be effective, the plan must be well thought out, taking into consideration the many complex factors that comprise a disaster.
In part 1 of this two-part series, we suggested a rigorous, yet simple to apply, method for the analysis of disaster based on flow charts. By flow charting the course of disaster we gain a detailed understanding that forms the basis of a systematic plan.
In this article, we discuss in more detail how a scenario-based risk analysis using flow chart techniques can be incorporated into the planning process. Also discussed is the critical phase of plan testing. Testing serves as more than just a means to make sure that our plans are working as intended. It can give valuable insight into the process that can be 'fed back' into the plan to improve performance. Iterative construction of systematic disaster plans - through a cycle of development and testing - adds to the assurance that the plan will work when we really need it.
How to Plan
There currently exists a huge volume of articles, books and seminars on the process of planning for disaster. These provide a wealth of knowledge on the planning process. Experts in specialized areas such as off-site computer system backup, restoration of fire damage, public relations, access to replacement machinery and equipment, and many others crucial to the recovery effort provide information on coping with the effects of disaster in the most efficient manner. There are also many case studies of actual disasters available that detail how responses were handled and, perhaps more importantly, how they might have been handled better. In this way, we learn from past disasters how to better cope with future ones.
Perhaps the greatest aid to disaster planners is the computerization of the process using a variety of software tools. These tools range from word-processing based programs that help us format, maintain and distribute our plans to true 'expert systems' that integrate the knowledge of disaster planning experts in an effort to provide guidance to beginning planners.
Disaster planning can get complicated. The speed, accuracy and memory capacities of modern electronic computers greatly reduce the attendant complications of planning. Most disaster planning involved is provided by the flow chart analysis.
After the need for action has been determined, disaster recovery planners as well as those in charge of the operation of our tanker fleet can be made more directly involved. Here are the 'paths' along our flow chart that we are concerned about. Now, how do we handle them? To help answer this question, we incorporate all that specialized knowledge of disaster recovery planning that exists.
We also identify any blind spots for which information might not exist, and attempt to create it from scratch. These pioneering efforts will, in turn, help those who may face similar scenarios in the future. Implementation of the focused plan is now accomplished using a variety of aids, including computer programs that let us encapsulate the plan and make it readily available for when the need arises.
Once planning methodologies have been mapped to potential disasters, there remains the not-so-trivial aspect of linking planning to the resources available to our specific organization. In this final phase of the systematic disaster planning process, we identify who will be responsible for the disaster recovery process at each stage outlined in the plan. The 'who' of disaster recovery planning includes disaster planners proper, organizational resources including the department or departments affected, and outside service providers.
Lining up the proper outside resources is critical. Disaster planners soon find that the organization cannot do everything by itself, especially when potentially crippling disasters strike. Restoration companies, equipment vendors, alternate site providers, and others make up an essential part of the disaster recovery process specific to the organization. Whenever possible, these outside resources should be privy to the planning process (or at least the portion that involves them), so that they may offer constructive input based on their knowledge and experience.
The importance of teamwork among internal resources goes without saying. Assurances of commitment and competence in the area of assignment should be a part of the systematic disaster recovery plan. Perhaps above all is the requirement of senior management commitment to the process. As this management is ultimately responsible (at least ethically, and often legally) for effective disaster recovery, this commitment should be readily forthcoming in most organizations. The next line of authority falls to the disaster planning organization.
This group, which often consists entirely of operating personnel, is responsible for direct administration of the plan. All members of the disaster recovery team are important, and they should be recognized as such.
Who is responsible for what in the disaster planning organization can be detailed using a variety of techniques within the plan. Responsibility charts and 'call lists' are a part of every comprehensive planning effort. By attaching names to duties, and by obtaining the individuals' commitment to these duties, we literally make the plan come to life.
Computerization becomes indispensable to this part of the process. The volume, complexity and dynamics of the interrelationships mean that we will need to develop efficient planning aids that can quickly respond to changes.
Chart #1 shows how scenario-based risk analysis, planning techniques and organization-specific processes come together to form the core of a strategic disaster recovery plan.
As demonstrated above, all components are essential to the effective operation of the plan. Skimping on any of these components will result in a commensurate degradation of plan performance. Those responsible for the overall planning effort must make sure that all the pieces come together.
An absolutely essential component of the disaster planning process is testing. Testing refers to the exercise of the plan under 'simulated' conditions. In effect, testing allows us to 'try out' our plans before they are actually needed.
The idea is that the worst time to find out your disaster plan is in some way defective is when you are faced with a real disaster. A scenario-based analysis can provide the structure needed for realistic plan tests.
Scenarios can be utilized for plan testing in a variety of ways. Most simply, we could run through the possibilities, i.e., the 'branches' of the flow chart, providing our test participants with a realistic representation of events constituting the disaster scenario under study.
This allows specialized testing while maintaining the simulation framework. Added realism can be introduced by simulating adverse consequences of initiating events based on their actual probability of occurrence.
Obviously the simulation would need to be sped up to compress the very large time frame within which small probability events occur to within a reasonable time period for study. This can easily be done by running many computer generated random numbers, based on the underlying event probabilities, through the flow chart and noting the outcomes.
If the flow chart has been set up using a computer spreadsheet program, this task is relatively easy. Most spreadsheet programs have random number generators built in that can be used to emulate a variety of underlying probability distributions. In this way, planners can observe literally thousands of 'years' of experience in a relatively short time frame.
These outcomes would then be used as the cues to trigger the proper response. These random simulations, also known as Monte Carlo simulations, add a heightened sense of reality and excitement to the exercise.
The accompanying chart shows the outcome of 250 simulated 'years' of operation of a hypothetical transporter of hazardous chemicals. It is based on the probability numbers we developed in the risk analysis given in part 1 of this series. As we might expect, the most common outcome is 'no accident'.
When accidents do happen, most are minor (property damage only). However, the potential for disaster exists. This potential was in fact realized during our simulation, in year 205. There, an accident resulted in a cargo spill and subsequent fire. Damages totaled approximately $1,000,000.
An event chain such as this should trigger an appropriate recovery sequence.
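The random simulation described above can also be sketched in a few lines of Python rather than a spreadsheet. This is a minimal illustration only: the branch probabilities below are placeholder assumptions (they are not the figures developed in part 1), and the names `simulate_year` and `simulate` are hypothetical.

```python
import random
from collections import Counter

# Hypothetical per-year branch probabilities for the flow chart.
# Illustrative placeholders only, NOT the figures from part 1.
P_ACCIDENT = 0.10              # an accident occurs in a given year
P_SPILL_GIVEN_ACCIDENT = 0.10  # the accident causes a cargo spill
P_FIRE_GIVEN_SPILL = 0.25      # the spill is followed by a fire


def simulate_year(rng):
    """Walk one simulated 'year' through the flow chart and return the outcome."""
    if rng.random() >= P_ACCIDENT:
        return "no accident"
    if rng.random() >= P_SPILL_GIVEN_ACCIDENT:
        return "property damage only"
    if rng.random() >= P_FIRE_GIVEN_SPILL:
        return "cargo spill"
    return "cargo spill and fire"


def simulate(years, seed=None):
    """Return the list of outcomes for the requested number of simulated years."""
    rng = random.Random(seed)
    return [simulate_year(rng) for _ in range(years)]


if __name__ == "__main__":
    outcomes = simulate(250, seed=1)
    # Tally how often each branch of the flow chart was reached.
    for outcome, count in Counter(outcomes).most_common():
        print(f"{outcome}: {count}")
```

Each call to `simulate` produces one simulated history; changing the seed produces another. The year-by-year outcomes can then serve as the cues for test participants.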
Flow chart analysis can also be used to test plans on a more selective basis. A selective analysis can help identify 'blind spots' in the planning process.
For example, a branch of a flow chart could be chosen (perhaps at random) and the affected departments asked how they would respond. Unsatisfactory responses would indicate the need to bolster disaster planning in that area.
The information gained by such a scenario-guided mission is far greater than that obtained by simple questionnaires that ask 'Do you plan for a disaster?', or even 'How do you plan for disaster?'
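As a rough sketch of the selective test described above, the flow chart can be held as a nested structure, its branches enumerated, and one chosen at random to put to the affected departments. The events in `FLOW_CHART` and the function names are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical flow chart: each key is an event, each value holds the events
# that can follow it. An empty dict marks the end of a branch.
FLOW_CHART = {
    "accident": {
        "property damage only": {},
        "cargo spill": {
            "spill contained": {},
            "fire": {},
        },
    },
}


def branches(node, path=()):
    """Yield every start-to-end path through the flow chart as a tuple of events."""
    for event, followers in node.items():
        new_path = path + (event,)
        if followers:
            yield from branches(followers, new_path)
        else:
            yield new_path


def pick_branch(chart, rng=random):
    """Choose one complete branch at random to use as a test scenario."""
    return rng.choice(list(branches(chart)))


if __name__ == "__main__":
    scenario = pick_branch(FLOW_CHART)
    print("Ask the affected departments how they would respond to:")
    print(" -> ".join(scenario))
```

Choosing branches uniformly is a design choice for the sketch; a planner might instead weight the choice toward branches that have not been tested recently.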
Plan testing in this fashion also encourages a top-down approach to the management of disaster recovery planning. The outcomes of a scenario-based risk analysis show in detail the potential impacts of untoward events on the organization, and their relative likelihoods.
This information relates directly to the financial and operational management of the organization. These are serious outcomes that will surely pique the interest of senior management.
A natural response to potential calamities is 'What are we going to do about them?' Part of the answer comes from those responsible for the management of safety and the financial effects of such events.
Crucial is the response of those who will manage recovery in the face of disaster. When faced with these tough issues, the disaster planner must be able to reference a systematic plan for disaster recovery.
Mark Jablonowski, CPCU, ARM, is Risk Manager for the Hamilton Standard Division of United Technologies Corporation in Windsor Locks, CT.
This article is intended to depict the various stages of 'Disaster Recovery' activity that I felt necessary for us, at AMD, to follow. I will also cover some of the misconceptions, false senses of security, vulnerabilities, etc., involved with this subject.
We began this project by evaluating four different companies and selecting SunGard as our professional 'Disaster Recovery' company. I believe these types of companies cannot do the work for you but they are of major importance in organizing/conducting interviews, documenting, employee training, and assisting in developing the test and monitoring portion of the plan.
In other words, if I don't specifically mention them throughout this paper, I want it clear that they were an integral part of the process throughout.
'Disaster Recovery Planning' can be such a major, intimidating, costly, and 'doomsdayish' type of activity that no one wants to be faced with the task. The attitude is that we probably won't have a disaster, and if we do, it's going to be a long time off.
Another paradigm is that, to qualify as a disaster, the whole area will be destroyed and there will be no manufacturing area, so who cares about the computer systems and information.
Disasters fall into two major categories, 'local' and 'regional.' Simply put, a local disaster is one that affects one building/location, whereas a regional one can affect blocks, miles or counties.
The plan must accommodate the 'Worst Case' scenario which in the interactive world of Semiconductor manufacturing implies that 1) a secondary computer site is available, and 2) real time communications, whether it be 'LAN' or 'Wide Area' must be intact to the selected alternate computer in the event of any disaster.
A common disaster could consist of:
1) a mistaken or erroneous situation causing the fire sprinklers to go off over the computers.
2) a small fire caused by an electrical short
3) Lightning striking
Disasters do not have to be 100 year floods or eight-point earthquakes. All it takes to be a disaster is something that could mess up approximately 1,500 square feet of very important real estate.
The Data Center
During the BIA (Business Impact Analysis) process early in any disaster plan, no matter what eventual price tag is placed on the plan selected, considering all aspects of financial and intangible losses, overwhelming justification becomes intuitive. Keeping this in mind, 'Disaster Recovery Planning' should not be something that we decide whether we're going to do; it should be something driven down from management demanding that this activity begin immediately, even if it takes additional personnel and funding to accomplish. 'Disaster Recovery Plans' are not something that is done and placed on a shelf until the BIG ONE hits.
They are living and breathing contingency plans that represent approaches to recover from all levels of failure.
This covers everything from a user requiring a file restored from backup because he inadvertently deleted it, to a major act of God like an earthquake or flood, to a stupid human error like a wiring short or a broken sprinkler. All of these need to be anticipated in a good 'Disaster Recovery Plan'. It must also be designed to react to the ever changing application and hardware approaches.
The California CIM configuration addressed in this Disaster Plan is a centralized 24-hour, seven-day week VAX environment supporting approximately 1,200 concurrent users exercising a large 'Shop Floor Control' package. On a weekly schedule (Friday-Sunday), all of the 160 gigabytes of disks are backed up to tape, and on the following Wednesday the tapes are relocated to an offsite vault where they are retained per departmental retention schedule pending future requirements to recover lost data. Copies of these tapes are also retained on site in the computer center. Incremental back-ups are taken daily (Sunday-Thursday), but those tapes remain in-house, thereby vulnerable in case of a situation damaging the local tape repository.
Prior to implementation of the subject Disaster Recovery Plan, offsite backup tape storage was the definition of a 'Disaster Recovery Plan'. In many cases, this may be adequate, but realistically it implies that, at AMD CIM, for example, we could have been as much as 10 days out of sync with the data recovered. The number of transactions necessary to recreate those 10 days of activity would be in the hundreds of thousands. This would constitute a horrendous manual effort, taking hours and/or days of precious time, resulting in a staggering impact on company revenues and ultimately customer satisfaction.
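The 10-day figure can be checked with a little date arithmetic. The sketch below assumes, as a reading of the schedule described above, that the weekly full backup is complete on Sunday, that those tapes reach the offsite vault on the following Wednesday, and that a disaster on a Wednesday strikes before that day's tapes have left the building. The function names and the sample dates are hypothetical.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() numbering, Monday = 0


def latest_offsite_backup(disaster_day):
    """Return the Sunday on which the newest backup already offsite was completed."""
    # Most recent Wednesday strictly before the disaster day (assumption: tapes
    # scheduled to move on the disaster day itself never leave the building).
    days_since_move = (disaster_day.weekday() - WEDNESDAY - 1) % 7 + 1
    last_move = disaster_day - timedelta(days=days_since_move)
    # Tapes moved on a Wednesday hold the backup completed three days earlier, on Sunday.
    return last_move - timedelta(days=3)


def exposure_days(disaster_day):
    """Days of activity that would have to be recreated after restoring from offsite tape."""
    return (disaster_day - latest_offsite_backup(disaster_day)).days


def worst_case_exposure(start=date(2024, 1, 1)):
    """Largest exposure over any one week; the start date is arbitrary."""
    return max(exposure_days(start + timedelta(days=i)) for i in range(7))


if __name__ == "__main__":
    print("Worst-case days out of sync:", worst_case_exposure())
```

Under these assumptions the exposure is smallest the day after the tapes are moved and grows until the next Wednesday, which is where the 10-day worst case comes from (seven days in the weekly cycle plus the three days between Sunday and Wednesday).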
We, at AMD, subscribe to a Digital Equipment Corporation service referred to as 'Recoverall' which guarantees priority replacement of any damaged piece of hardware. Even with this service, replacement would take a minimum of one week. This equipment replacement solution does not solve the problem of where to put the replacements if the Data Center is extensively damaged or how to achieve adequate electrical power or network connectivity.
There are many approaches intended to respond to equipment/data center replacement. Some supply large vans with Air Conditioning and Power Distribution Units, others are passive duplicate computer centers on some other company's premises (preferably in another geographic area). In the case where vans are brought to your parking lot, power and network become the major stumbling blocks for reinstatement of the Data Center, either of which would probably take longer to replace than the Computer hardware itself.
The offsite Standby Computer Center is difficult to sustain: maintaining enough network bandwidth to this offsite center to keep crucial files current enough to be useful in case of failure would require an extremely large expense for a capability that we hope would never be used. The other network consideration is that many regional disasters damage long-line connectivity, which would render the Disaster Recovery approach useless.
The chosen approach, along with its limitations and vulnerabilities, has to be totally understood by all organizations involved. The user management must be aware of the potential recovery delays and the amount of work required on their part for recovery. The CIM management must be aware of the overall best and worst case response time through all phases of recovery.
This obviously implies that full functionality recovery will not be immediate, and priorities must be established in advance as to what capabilities must be restored and in what order; e.g., engineering analysis surely would fall behind reinstatement of Work in Process scheduling activity (get the Factory running).
We have to assume that the goal of any disaster recovery plan would be complete recoverability immediately. Our previous (offsite tape storage) approach could take days or weeks to reestablish the databases alone. As stated before, a BIA (Business Impact Analysis) must be performed to determine the threshold of pain acceptable, versus the amount of resources (money and effort) to invest in 'Disaster Recovery.' This has to be driven from the 'User' organization because they're the only ones that can truly realize the total cost and value of losses incurred by unexpected downtime affecting the manufacturing process. Acceptable compromises will result from the above analysis, indicating the worst impact (4 hours downtime maximum at AMD) that the manufacturing environment could endure.
That answer will direct the 'Disaster Planning' activity to a limited number of alternatives that will satisfy the requirement.
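One simple way to picture this narrowing is to compare each alternative's estimated recovery time against the downtime ceiling from the BIA. In the sketch below only the 4-hour ceiling comes from the text; the alternative names and hour estimates are illustrative assumptions (guided loosely by the 'days, or weeks' and 'minimum of one week' figures mentioned earlier), and the function name is hypothetical.

```python
# Maximum tolerable downtime from the BIA (4 hours at AMD, per the text).
MAX_DOWNTIME_HOURS = 4

# Estimated hours to restore service for each alternative.
# Illustrative assumptions only; a real plan would use measured or tested values.
ALTERNATIVES = {
    "restore from offsite tape": 72,
    "priority hardware replacement": 168,
    "hot standby with mirrored disks": 1,
}


def acceptable(alternatives, max_hours):
    """Return the names of the alternatives that recover within the ceiling."""
    return sorted(name for name, hours in alternatives.items() if hours <= max_hours)


if __name__ == "__main__":
    for name in acceptable(ALTERNATIVES, MAX_DOWNTIME_HOURS):
        print("Meets the requirement:", name)
```

In practice cost would be weighed alongside recovery time, but the ceiling set by the user organization is what eliminates most options first.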
Although there are some commonalities, the solutions within our company are different, for several reasons, between the California and Texas CIM organizations. This paper addresses only the California Disaster Recovery solution.
A Given: The worst position a CIM support organization could be in is for the FAB to be physically capable of operating while the CIM computers are disabled, preventing manufacturing productivity.
If the FAB is physically damaged by a disaster, immediate CIM computer availability is of little consequence.
This philosophy mandated the decision to physically place the backup computer environment in a sub area immediately beneath the manufacturing floor (FAB). The theory is that if the FAB is unharmed, there is a good chance the Disaster Recovery computer is o.k.
There are several variables that come into play in deciding the proper Disaster Recovery solution for your specific site. These obviously include, as mentioned before, costs, network capabilities, the equipment involved, and others. In a VAX world, other considerations have to do with whether your environment is a single cluster or you are running multiple clusters.
In California we are running a single production cluster, which gives some latitude in product selection. Consequently we were able to select what we consider to be the Cadillac of Disaster Recovery Systems. Because of our environment, the Business Recovery System (B.R.S.) Package from Digital Equipment Corp. is a perfect fit. This package allows us to physically have the Cluster split between two locations (Data Center and Sub Fab) with FDDI replacing the Computer Interconnect logic connection.
The VAXCluster console is replaced with two Operation Management Servers (OMS), one in each location. Either of the OMS's can control and give visibility to either location or the entire cluster. The major benefit of this dual location configuration, as you may have guessed by now, is the ability to 'Shadow' (mirror for you UNIX people) all critical disks in both locations thus allowing the secondary location to be the primary user system whenever an incident is experienced with the main computer center or network. This takes place without the loss of one transaction.
Another major benefit, differing from the normal Disaster Recovery environment, is that the backup (sub fab) system can share as a partner in day to day production responsibilities.
We elected to assign reporting activity to it because that could be easily suspended in the event of a disaster. The only downside is a limitation to the number of concurrent users in a disaster mode (approximately 100). This is strictly a capacity issue (3 computers replaced by 1) arrived at by user management and could be altered simply by adding computer horsepower.
There is an additional opportunity that appears to be a logical extension while planning and implementing a 'Disaster Recovery Approach':
- Near lights out computer room with robotic tape handling seems to be a logical activity to pursue at this time because on-site vs. off-site storage of tape is a fundamental requirement and the more this can be streamlined, the more current offsite information will be, i.e. daily vs. weekly.
- Implementing a tape silo (jukebox) has several benefits which reduce operator interaction, thereby positioning us in line for a 'near' lights out environment. This itself has great merit, but in this paper I want to address the use of silos in a Disaster Recovery role. I have not yet implemented this at AMD, but my vision is that silos will be located in each location and all key information will be duplicated in both locations. As long as the two sites are in the same proximity, offsite backup retention will continue to be a requirement but would be rarely utilized because of the low probability of both sites being destroyed in a single incident.
- If your Disaster Recovery solution, however, included a geographical separation between your primary and secondary computer centers, offsite storage would not be a requirement, thus a full 'Lights Out' DataCenter environment could be achieved. Keep in mind that each site is unique and your Disaster Recovery approach must be selected to complement that individuality. For example, the data storage and recovery approach of a centralized computer environment, such as ours, differs drastically from that of a distributed environment (client-server), but the bottom line requirement is the same (complete recoverability in an acceptable period of time).
As we evolve from a centralized, monolithic to a 'Client-Server' environment, all of the 'Disaster Recovery Planning' must be reevaluated. The goal of this paper is to convince its readers that Disaster Recovery Planning is one of the most important activities that we should be addressing at this time, and for the foreseeable future. I believe that we not only have to be concerned about earthquakes, fires, floods and storms; the corporate world is becoming more and more a target of terrorists, disgruntled employees and other two legged dangers, causing the occurrences of disabling disasters to become more and more prevalent.
To achieve 'WORLD CLASS' FAB Status, you must achieve 'WORLD CLASS' Disaster Recovery capability and I am confident we, at AMD CIM, have done that.
I must thank AMD's senior management for recognizing the need for a CIM Disaster Recovery capability, supporting me and my organization in the design and implementation of this plan, and opening the pocketbooks allowing us to implement a quality product.
Dan S. Perry has managed different Computer Systems support organizations in excess of twenty years. Currently he is the Department Manager over the California Systems and Operations organization within Information Technology Management (ITM).
So, you have been tasked with developing a business impact analysis as a prelude to your corporate business resumption plan. Congratulations, I am both happy and sad for you! Seriously speaking, this will be a most visible project with ample opportunity to shine in the eyes of executive management. On the other hand, it could backfire miserably on you if you proceed carelessly.
It's impossible to list every way people go wrong in this endeavor in a single article. Even so, we can hit some of the more common secrets of success to assure your business impact analysis (BIA) is accepted by management and is understandable enough to guarantee future support and funding for the project. Be forewarned! The BIA is often short-sold in the interest of getting the plan done cheaper or faster, only to cause undesirable consequences later. Don't let this happen to you. In this article we will discuss some common mistakes made in conducting a BIA, centering principally around who does it, how, with what tools, and most importantly, how it is presented to decision-making executives.
If you have been given the responsibility for conducting a BIA, chances are your boss raised your budget, lightened your work load, and perhaps even let you go out and hire one or two bodies to assist you in the project, correct? Yeah, right! In actuality, the worst case is probably true. You will be expected to add contingency planning to your 'to do' list and knock out a plan in your copious spare time.
This leads us to the first major mistake most corporations make in putting together their BIA, as well as business resumption plans in general. More often than not, responsibility for developing a long term recovery plan is placed squarely where it does not belong, in operations departments. This is not to say that operations managers are not qualified to pursue this goal. But think about it, when was the last time you found an operations person in your organization with extra time on his hands? That means business resumption planning becomes 'kitchen table' work which gets taken home at night, as it certainly does not fit inside the working day of today's overwhelmed operations personnel.