DISASTER RECOVERY PLANNING
Why is it that most data processing departments don't have disaster recovery plans? Probably for some of the same reasons you
don't. Some typical excuses include "Not enough time," "It isn't going to happen to us," and "It's not in the budget." Sound
familiar? If you have used one of these excuses lately and still don't have a disaster recovery plan, then you are not alone. Truth is,
most data centers lack the expertise to develop a good contingency plan, so they use most of their creative energies developing
reasons for not planning, instead of creating the plan.
This article recaps the steps involved in the development of a contingency plan. It is based on a project conducted by the Association of Contingency Planners (ACP), Orange County Chapter, in 1987. The chapter educated and guided a volunteer company through the steps involved in creating a real contingency planning capability. The experience began with the company giving a presentation about itself. The chapter then formulated a plan for guiding the company through the stages of contingency plan development. At each meeting, speakers presented information on a topic in which they had a high level of expertise. The speakers then gave the company specific advice and instructions, and the ACP members made additional recommendations. The company had until the next meeting to complete that phase.
At the following meeting, the company would present its progress by sharing documents or information with the group. (Again, speakers dealing with the next phase of the project presented information, and the membership made specific recommendations to the company.) This continued for six sessions until all the ingredients of a contingency plan had been covered. The experiment was an unusual opportunity for members to experience first-hand the realities of developing a plan. It was assumed right up front that real-life problems would occur (they always do), and it was the problems that made the experience worthwhile. By going through the process step-by-step, members learned that developing a contingency plan is a formidable task. They also learned some of the stumbling blocks that can make it difficult, if not impossible. By following each step in the series, association members also had the opportunity to develop their own plans while having access to experts at each step.
In this article you will learn the realities of contingency planning by going through the process step-by-step. The experience wasn't perfect, but we learned a great deal from it. We believe it is worth sharing and sincerely hope you will learn a great deal from it too.
Keep in mind that your requirements may vary. What we developed was for a particular company, in a specific industry, with unique needs. This article is not designed to teach you everything you need to know about contingency planning, but rather to recap our experience and point out some important factors. It is impossible to reiterate all the discussions that occurred at the meetings, and you missed some good ones, but notes taken by Mr. Joe Hernandez and materials provided by speakers have been used to develop this article. Did I forget to mention there is a surprise ending? I hope you will stick with me until the ending--it's worth it.
The ACP's Executive Committee, which organized and oversaw the project, divided it into six sessions.
Session One covered:
(1) An Overview of the InsurAll company,
(2) Planning Steps and Who is Involved, presented by Mike Noyes of the Automobile Club of Southern California,
(3) Application Priorities, presented by David Williams of Knott's Berry Farm, and
(4) Risk/Impact Analysis, presented by Greg Staininger of Computer Risk Management Company.
Session Two covered:
(1) Why Bother, It Won't Happen on My Shift, presented by Tom Wilson of Rockwell International,
(2) Hardware and Software Backup Strategies, presented by Kristin Kiefer of Digital Equipment Corporation (DEC), and
(3) Telecommunications Strategies, presented by Jim Stratman of Total Asset Protection.
Getting To Know InsurAll
Before making recommendations for courses of action, the group had to get to know the company. To make things easier, I'm going to refer to the company as InsurAll. This name is completely fictitious and any similarity to an existing company is purely coincidental. InsurAll, as its name implies, is a company that administers insurance policies. They do not directly insure clients, but administer the policies for a percentage of the policy premium, or for a specific fee based on the type of service they are providing. The administration of policies involves negotiating fees with hospitals, physicians, and ancillary services; independently reviewing medical treatments to determine whether the care prescribed is necessary and reasonable; encouraging the use of generic drugs; and providing other services that help to keep costs low while maintaining a high level of quality medical care. Most companies find this type of service is an effective way to cut their company's medical-related costs.
InsurAll started operating a Digital Equipment Corporation computer about a year ago and processes more than 50,000 claims monthly. They operate 12 hours daily, from 6 a.m. to 6 p.m., Monday through Friday, and have about 50 on-line users at their location. They currently have three major software systems that are critical, meaning they must be able to recover within seven days of an outage or face stiff financial consequences. They have manual procedures for back-up, but their ability to use them diminishes as new employees are hired and experienced employees are lost.
At our first session, InsurAll was asked detailed questions about their operations and computer system, some of which were proprietary in nature, so if it seems you are not getting all the information about the company and the contingency plan, there is a reason. (Hopefully, the information mentioned earlier covered enough basics about the company to understand the nature of their business.) Every company has different needs and requirements. However, the planning process is basically the same. As a matter of fact, the planning process is much the same as for other projects. Whether you are building a house, an information system, or a contingency plan, the basics are similar. Think of your contingency plan as you would any other project and it becomes less frightening and easier to manage, because you already know the steps you need to go through to get the job done. With a little more specific information about contingency planning, the task becomes just another project.
Who is Involved in Your Plan?
Before getting started on the plan, InsurAll and the ACP chapter needed to define who would be involved in the development of the plan. Usually the task is left to an overworked programmer who won't be with the company long enough to get the job done. For InsurAll, the MIS Director, Manager of Systems Support, Manager of Computer Operations, and the MIS Director's assistant were primarily responsible for the task. Our first guest speaker, Mike Noyes, pointed out that a good contingency plan would eventually involve several other people. Once the scope of the problem had been defined and recovery strategies examined, a presentation would bring top management into the planning process. Additionally, company personnel such as purchasing, the company controller, building maintenance, security, the entire MIS staff, and system users would participate at different levels to give the plan a testable capability. Outside of the company, vendors of contract support services would, as needed, also be active participants in plan testing and maintenance.
Defining Your Contingency Needs
One of the greatest misconceptions about contingency planning is that the manual or written document produced is the plan. It is actually a recovery manual with procedures, responsibilities, and critical information needed to execute the recovery. The manual is a critical part of the plan, but a good contingency plan involves more than a written document. It involves people, procedures, and a commitment from management. Without management's support for a strong contingency capability, the plan is doomed to mediocrity. It will be better than nothing, but it won't be as good as it could be. You cannot get management support until you define the problem, document the need, and develop a recovery strategy.
The first step in developing a contingency plan involves understanding the nature of your needs. How will you present a request for budget or justify expenses without first knowing the extent of the problem? You must get a handle on the problem first. You certainly can't write a plan without knowing what you are planning for. By examining the problem, you will determine why you need a plan, and how much of a plan you need. Both of these elements are critical to your success.
One method used by experts to document the need for a contingency plan is to perform either a Risk Analysis or an Impact Analysis. These are sometimes done together in a Risk/Impact Analysis. Greg Staininger of Computer Risk Management Company presented Risk and Impact Analysis to our group.
Greg defined Risk Analysis as a subjective process that identifies levels of risk and compensating levels of protection for vital assets to prioritize vulnerability qualitatively or quantitatively. The Risk Assessment examines three major areas: (1) Asset Value: prioritize the business assets to be considered and determine their replacement value; (2) Threats: prioritize threats and determine a probability of occurrence; and (3) Protection Costs: divide protection into prevention, defense, and recovery, and evaluate the levels of each type of protection. By using these factors, you can calculate an Annual Loss Exposure (ALE). If the protection costs are small and the ALE is high, you should be able to justify the need for protection systems such as water detection units, smoke and fire detection and suppression systems, new and better facilities, etc.
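The presentation did not spell out the arithmetic behind the ALE, so the following is only a minimal sketch under a commonly used assumption: each threat contributes asset value times the fraction of value lost per occurrence times the estimated occurrences per year. The names Threat, annual_loss_exposure, and protection_is_justified, and all of the figures, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    annual_probability: float  # estimated occurrences per year (assumption)
    loss_fraction: float       # share of asset value lost per occurrence (assumption)


def annual_loss_exposure(asset_value, threats):
    """Sum the expected yearly loss over all threats for one asset."""
    return sum(asset_value * t.loss_fraction * t.annual_probability for t in threats)


def protection_is_justified(ale_before, ale_after, annual_protection_cost):
    """True when the yearly reduction in exposure exceeds the yearly cost."""
    return (ale_before - ale_after) > annual_protection_cost


# Hypothetical figures for illustration only.
threats = [Threat("fire", 0.02, 0.5), Threat("water damage", 0.05, 0.1)]
ale = annual_loss_exposure(500_000, threats)
print(f"Annual Loss Exposure: ${ale:,.0f}")
```

Comparing the ALE before and after a proposed protection measure against the yearly cost of that measure gives management a single number to react to, which is the kind of justification described above.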
The Impact Analysis, on the other hand, is a profile of the anticipated consequences of being without a computer system to support critical functions for a prolonged period of time ("prolonged" may mean seconds to the stock exchange and days to a bookkeeper). An Impact Analysis defines the Maximum Allowable Downtime (MAD) of the computer before the downtime begins to have a significant impact on a company's ability to function. This is usually accomplished through a questionnaire distributed to computer system users. The questionnaire is used to answer questions such as:
(1) How long they can function without the computer, in hours, days, or weeks,
(2) What the anticipated impact will be on the company if they cannot function (legal, financial, etc.),
(3) Whether or not they have a manual system that they can fall back on, and
(4) Whether there is any additional information you could provide that would help them function during a computer outage, or would protect them against unauthorized database changes.
The questionnaire is obviously subjective, but it does give you some basis to work from. Your questions will vary and reflect the nature of the business. The questionnaire can also be filled out by an MIS person who interviews the main users of computer systems. You will get back more of your questionnaires this way and be able to evaluate responses against a standard. What you want to determine is how long the user can function without the computer.
Normally, you would use their responses to develop a recovery strategy based on their apparent need to recover. That is to say, if they must be up in three days, a strategy to recover in three days maximum would be developed. But what if their downtime could be extended to a week, two weeks, or a month? The recovery strategy might be significantly different. The key here is to take the interview a step further. Try to determine whether there is support information you could provide, such as a report or more information on an existing report, that would help them extend the time in which they could function without a computer. A local company does something similar to this nightly, by creating a spooled printer file of critical reports that could be used if they lost their computer. They save the spooled file each night to a back-up tape and store it off-site. If they were ever to need the reports, they could simply have the spooled file loaded onto another system, and the reports printed and distributed to the appropriate users. This changed their critical recovery need from one day to almost thirty, changed their entire strategy, and saved their company many thousands of both hard and soft dollars. The Impact Analysis also helps to prioritize systems for recovery, increases the visibility of computer dependency, and makes everyone more aware of the overall impact of losing computer support.
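The nightly save of critical reports can be pictured with a small modern analogue. This is not how the local company did it (they wrote a spooled printer file to tape); it is a hedged sketch that assumes the critical reports already exist as text files in one directory and bundles them into a dated archive ready to go offsite. The function name, the *.txt pattern, and the archive naming scheme are all assumptions.

```python
import tarfile
from datetime import date
from pathlib import Path


def archive_critical_reports(report_dir, archive_dir, today=None):
    """Bundle every *.txt report in report_dir into one dated .tar.gz file."""
    today = today or date.today()
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / f"critical-reports-{today.isoformat()}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store each report under its bare file name so it restores cleanly
        # on another system with a different directory layout.
        for report in sorted(Path(report_dir).glob("*.txt")):
            archive.add(report, arcname=report.name)
    return archive_path
```

Run from a nightly job, a routine like this produces one small file per day. That file, rather than the whole system, is what the users would need in hand to keep working through an extended outage.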
Both Risk Analysis and Impact Analysis can be difficult processes for the novice. In most cases, especially in the small DP center, all you will need to perform is the Impact Analysis, with the exception of threat identification found in the Risk Analysis. A full Risk Analysis is usually done when a corporate contingency planning study is undertaken. And in the case of InsurAll, the Impact Analysis was fairly simple and performed by their own MIS department. If you want someone outside of your organization to perform your Risk or Impact Analysis, there are several consultants available and their fees vary greatly, but average about $10,000. There are Risk and Impact Analysis software programs available that can help you by crunching the numbers, and printing graphs and charts for management. It is much easier for management to understand risks and impacts when they are visual, and your chances for support will be much greater.
Once you have completed your Impact Analysis, analyze the results to determine how long you can afford to be without computer support, the priority in which you will recover your systems, and the types of threats from which you are going to protect yourself. The InsurAll company found they had to recover three of their major systems within seven days, while several smaller systems could wait up to two or three weeks. They also determined the greatest threats to their system, which included hardware malfunctions, fire, employee sabotage, water, and minor earthquake. It is unrealistic to believe that a small data center could ever get the budget to survive all potential disasters.
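Turning questionnaire answers into a recovery order is simple enough to sketch. The example below assumes each system has been given a Maximum Allowable Downtime in days. The system names are hypothetical (InsurAll's real systems were not named), while the seven-day threshold and the two-to-three-week figures follow the findings above.

```python
def recovery_order(systems):
    """Return system names sorted by Maximum Allowable Downtime, shortest first.

    systems maps a system name to its MAD in days.
    """
    return sorted(systems, key=lambda name: systems[name])


def critical_systems(systems, threshold_days=7):
    """Systems that must be recovered within threshold_days."""
    return [name for name in recovery_order(systems) if systems[name] <= threshold_days]


# Hypothetical system names; downtime figures mirror the article's ranges.
mad_days = {
    "claims processing": 7,
    "provider fee schedules": 7,
    "utilization review": 7,
    "management reporting": 14,
    "archive lookup": 21,
}

for name in recovery_order(mad_days):
    print(f"{mad_days[name]:>3} days  {name}")
```

A list like this is also a convenient one-page exhibit for management, since it shows at a glance which systems drive the recovery strategy.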
Objectives, Assumptions, and Limitations
The results of InsurAll's Impact Analysis were then used to develop Objectives, Assumptions, and Limitations. These three areas are important to identify in writing.
The Objectives let everyone know what the plan IS designed to do, and what it IS NOT designed to do. By stating the objectives, you can determine whether or not they meet top management's organizational goals. If it is a primary goal of top management to provide the best customer service in the industry and they believe this is do-or-die, your objective of a seven-day recovery may be in trouble. By stating your objectives, you protect yourself by having a specific point against which you can make strategic decisions.
The Assumptions state right up front that the plan is designed to work within certain guidelines, and specifically what those guidelines are. Occurrences outside of the guidelines may result in additional downtime. For example, you may assume that the minimum number of personnel required to recover the system will be available. If you are planning on using a hot-site, you assume that it has survived the disaster and will be available. You also assume that support services such as plumbers, electricians, your computer vendor, communications vendors, and others have survived the disaster. Don't assume too much, or your plan may not be good enough to survive even the slightest glitch, and don't assume too little, or you will be expected to survive anything and everything.
The Limitations tell management that the plan is limited to recovering computer support and is not designed to "Save the Company." If a disaster of major proportions occurs, remember that you are not writing the company contingency plan. Yours is MIS only! Of course, after writing such a great MIS recovery plan, you will probably be assigned the task of writing the company plan ("other tasks as assigned," right?).
Based on InsurAll's findings, Objectives, Limitations, and Assumptions were written and presented to the members at the next monthly meeting. These were used in developing their recovery strategy.
After the Impact Analysis was finished and the Objectives, Assumptions, and Limitations written, a recovery strategy had to be developed that could get them back onto the computer system in seven days.
This part of the project involved presentations on contingency planning approaches, support, awareness, planner characteristics, planning pitfalls, alternative hardware back-up strategies, media storage and retrieval systems, and communications recovery options.
Contingency Planning Approaches
We began the second session with a presentation by Tom Wilson, Manager of Contingency Planning at Rockwell, who covered more of the basics. He defined three common approaches to contingency planning:
(1) the Ostrich Approach,
(2) the Gentlemen's Agreement, and
(3) the Theoretical Concept.
The Ostrich Approach is the most common approach. It is one of risk denial: "It can't happen to me, but if it does, I can handle it." The Gentlemen's Agreement is just as bad. It is a misunderstanding of need: "But you said your company would help me!" The Theoretical Concept can also be disastrous. It consists of undocumented and untested ideas: "I have things under control. We know what to do. Not to worry." Another simple approach is analytical: there is a low probability and the cost is high, which means you do nothing. "Why bother? It won't happen on my shift."
As I mentioned earlier, most of these excuses stem from one basic premise: people don't know what to do. To make things easier, Tom broke contingency planning into ten main components that simplify planning requirements. They include: Policy, Management Participation, Offsite Storage, Telecommunications, Hardware/Software Considerations, Capacity Planning, Critical Systems, Staffing, Documentation/Procedures, and Testing. In addition to the ten, he pointed out that maintaining the plan was an important issue, one that will be covered in more detail in the third article.
One of the key elements in his presentation was his use of the term "Recovery System." He didn't refer to all of these components as comprising a manual, but rather as being part of a system or capability. There are components to disaster recovery planning that are outside of the manual. They are departmental procedures and practices that occur daily. They are a part of a good contingency plan, but do not have to be in a recovery manual. For example, backing up your system to tape or Hot-Disk is just good business practice and an integral part of your contingency plan. This should be a part of your everyday routine, not something special. Having adequate backups is critical to your recovery, but who needs pages and pages of backup documentation in the recovery manual? That should be taken care of elsewhere.
Tom also emphasized that an important part of getting a plan off the ground would involve creating an awareness of the need for a contingency plan. One way to increase awareness is to test the ability to recover a system. The test should be scheduled in advance, and all appropriate personnel should be notified. Work with the users to create manual procedures to be used during the downtime. Test their downtime procedures before doing a live test, just to make sure their manual procedures work well. If their manual procedures don't work, you might not be able to afford the downtime, and it is much better the first time through to have the ability to bring the system up if necessary. By performing a simple backup test, you will begin to create a general awareness of the need for both the contingency plan and manual backup procedures.
The Contingency Planner
The presentation also included a rather interesting thought. What characteristics should we look for in a contingency planner? Tom summed up the ideal candidate as technically knowledgeable. The person should have a solid financial understanding of company requirements and be able to conceptualize how to use all available resources. The planner also needs personal characteristics that include patience, a sense of humor, and excellent communication skills. Excellent writing and planning skills are important as well.
It was also pointed out that several planning pitfalls should be avoided. He warned against grandstand approaches to enlisting management support, and especially against the use of scare tactics. Instead, ensure a solid understanding of the cost and manpower implications of the proposed solutions. Rather than merely looking at risks and identifying critical systems, search for enabling capabilities that can be implemented to facilitate recovery. He also warned against allowing contingency planning to become confused with security and control. You must understand your mission, and you cannot expect management to support mutually exclusive objectives. You must let management know that, when it comes down to the bottom line, contingency planning is a good business decision.
One of the most critical components of a recovery capability is offsite storage of your system and vital records. Simply put, you can't recover what you don't have. Our guest speaker, Kristin Kiefer of Digital Equipment Corporation, spoke about offsite storage as being the backbone of any disaster recovery plan.
According to Kristin, there are some important elements to look for in an offsite storage facility. The facility must be physically secure from intrusion, and strong enough to withstand natural hazards. It should be located near fire and police stations. The facility must be environmentally controlled, with the temperature between 60 and 70 degrees and the humidity between 40 and 50 percent, to prevent damage to tapes from heat, dust particles, and condensation. The pickup and delivery vehicles used should also be environmentally controlled, in good working condition, and equipped with a modern fire protection system such as Halon. The employees delivering and picking up the containers should understand the importance of their tasks and be professional. The containers in which the offsite media are stored should be well constructed to prevent damage to the contents if dropped, and should provide added protection against dust and water. The facility should provide easy access to your backups, 24 hours a day, seven days a week, and allow you to audit your backups with 24 hours of notice. In the event of a disaster, the facility should have the capability to transport all the offsite storage containers required for recovery to the recovery center in a secure and timely manner. It should have an effective inventory management system, and you may want to tour the facility before signing on the dotted line. Find out how long they have been in business. Ask them for client references. If they give them to you, you may not want to do business with them: they may give out your name to someone else, and that level of security is unacceptable. Find out what other types of goods are stored at that location. Are they in the business of storing vital information, or will they take anything? Drive through the local neighborhood to find out what types of businesses are located near the offsite facility. Is it at risk of a potential disaster caused by its neighbor?
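The environmental limits Kristin quoted lend themselves to a simple automated check when you audit a facility or review its logs. The sketch below is illustrative only: the FacilityReading type and the function name are hypothetical, and the temperature is assumed to be on the same scale as the 60 to 70 degree range quoted above.

```python
from dataclasses import dataclass

# Ranges quoted in the presentation.
TEMPERATURE_RANGE = (60, 70)   # degrees, same scale as the article's figures
HUMIDITY_RANGE = (40, 50)      # percent


@dataclass
class FacilityReading:
    temperature: float
    humidity: float


def environment_problems(reading):
    """Return a list of human-readable problems; an empty list means in range."""
    problems = []
    low, high = TEMPERATURE_RANGE
    if not low <= reading.temperature <= high:
        problems.append(f"temperature {reading.temperature} outside {low}-{high}")
    low, high = HUMIDITY_RANGE
    if not low <= reading.humidity <= high:
        problems.append(f"humidity {reading.humidity}% outside {low}-{high}%")
    return problems


# Hypothetical audit readings.
for reading in (FacilityReading(65, 45), FacilityReading(75, 30)):
    print(reading, environment_problems(reading) or "OK")
```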
InsurAll already had this consideration under control. Like most data centers, they backed up data daily, weekly, and monthly as needed, and the data was stored offsite at the DEC offsite storage center.
One of the most difficult decisions facing InsurAll was deciding what to do about computer hardware. Since InsurAll was a DEC shop, and still is, we asked Kristin Kiefer to help us develop hardware backup alternatives.
Basically, InsurAll had three choices. They could use a vendor hotsite where they would be able to recover within one or two days. They could use a vendor coldsite in which case they would have their surviving equipment shipped to the coldsite, and wait for any additional equipment to be shipped there by either DEC or a second party. Or, they could create their own coldsite in an adjacent building currently under their control, and handle the equipment problem much the same as with a vendor coldsite.
Each choice has its advantages and disadvantages. For example, hotsites can be costly and, in the case of InsurAll, might be considered overkill. InsurAll can wait as long as seven days to recover without serious consequences. Coldsites, while less expensive, are usually unable to guarantee the arrival and installation of equipment within a week. At best, it is a gamble, and during a complete recovery, the stakes are very high. Still, there is a certain level of reality we are all faced with: it is dollars versus risk. Do the costs outweigh the risks? In less than a month, the decision was made.
For InsurAll, the decision became relatively simple after examining the facts and researching the options. They decided not to go with the vendor hotsite or the in-house coldsite for the following reasons. First, InsurAll could not afford a hotsite. It simply wasn't in the budget and approval would have been very difficult. With the other options available, it would have been viewed as overkill, and rapid company expansion made budget dollars very tight for anything but the most important items. Second, an in-house coldsite was impractical. They didn't have the room to dedicate to a recovery center that might never be used. InsurAll had expanded to the point where facilities were already being over-utilized, and there were rumors that the MIS division would be relocating to another building in the near future.
Instead, InsurAll opted for the vendor coldsite. Without trying to sound like a DEC salesperson, the reasons were as follows. First, InsurAll's offsite storage is through the DEC offsite storage program, so their backups are already in close proximity to the recovery location. InsurAll was also a DEC ReCoverall subscriber. ReCoverall is a new service available to DEC customers who have their service contract with DEC. Basically, it guarantees the subscriber company the first equipment off the line in the event of a disaster, thereby shortening the recovery period. It pays for cleanup costs, refilling of Halon cylinders, hotsite or coldsite notification, and daily usage fees, and there is no deductible or depreciation on replacement equipment. The cost is about 9% of their field service contract. For InsurAll, the service costs about $200 monthly. (For more information about the ReCoverall program, call your local DEC representative.) This meant that the equipment would arrive at the coldsite within 72 hours and InsurAll would be up and running within five days, well within the time requirement of seven days. In addition to utilizing DEC's offsite storage and ReCoverall programs, InsurAll decided to subscribe to DEC's coldsite service. The coldsite costs about $400 monthly and put the finishing touches on their hardware and software backup strategy. In the event of a disaster, InsurAll would call one 800 number that would set in motion all their recovery needs: field service, offsite storage retrieval, ReCoverall, and coldsite readiness. For about $600 a month, the safety net was set in place.
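The figures above can be checked with a few lines of arithmetic. The dollar amounts, the 9% rate, the 72-hour delivery, and the five- and seven-day limits come from the discussion above. The size of the field service contract is not stated anywhere, so the value computed here is only what the rounded figures imply, and the function names are hypothetical.

```python
RECOVERALL_MONTHLY = 200      # dollars, approximate, from the article
COLDSITE_MONTHLY = 400        # dollars, approximate, from the article
RECOVERALL_RATE = 0.09        # about 9% of the field service contract

EQUIPMENT_ARRIVAL_HOURS = 72  # equipment at the coldsite within 72 hours
ESTIMATED_RECOVERY_DAYS = 5   # up and running within five days
REQUIRED_RECOVERY_DAYS = 7    # maximum allowable downtime


def monthly_cost(*services):
    """Total monthly cost of the subscribed services."""
    return sum(services)


def implied_field_service_contract(recoverall_monthly, rate):
    """Monthly contract value implied by the rounded figures (an estimate)."""
    return recoverall_monthly / rate


def recovery_slack_days(required_days, estimated_days):
    """Days of margin between the estimated recovery and the requirement."""
    return required_days - estimated_days


total = monthly_cost(RECOVERALL_MONTHLY, COLDSITE_MONTHLY)
print(f"Monthly cost: ${total}, yearly cost: ${total * 12}")
print(f"Equipment arrives in {EQUIPMENT_ARRIVAL_HOURS / 24:.0f} days")
print(f"Slack: {recovery_slack_days(REQUIRED_RECOVERY_DAYS, ESTIMATED_RECOVERY_DAYS)} days")
```

The two days of slack is the real product being bought here: it is the margin that makes a coldsite an acceptable gamble for a company with a seven-day limit.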
There is no doubt that telecommunications must be considered in detail when creating a contingency plan. For many companies, this single issue is the most confusing. It is usually the stumbling block that stops the plan. However, our presentation by Jim Stratman of Total Asset Protection, Arlington, Texas, simplified the alternatives available to InsurAll. With his help, the communications problem was solved relatively easily.
InsurAll did not have complex communications needs at the time their plan was being created. They needed to be able to communicate between the recovery center and the user location. The vendor coldsite had been built with communications in mind and had ample telephone lines that could be used to create a dial-up communications system. No matter where the users would be located, the communications, via telephone, could be in place when needed. Modems at each end of the telephone line would transmit communications between the users and the recovery center. It was noted that the response time in the dial-up mode would be considerably longer than when devices are locally attached, but under the circumstances, even a long response would be better than none.
Session three began with a presentation by Linda Burkett, Computer Operations Manager with Bergen Brunswig Corporation. The presentation covered the Disaster Recovery Manual and its components. Linda described a contingency plan as having three major components.
The first major component of the contingency plan is the recovery manual. The recovery manual consists of crisis operating procedures used to evaluate, react to, and recover from an incident. The recovery manual should not be confused with the contingency plan. Frequently the two terms are used interchangeably, but they are two separate things, the manual being a part of the contingency plan.
The second major component of the contingency plan is the recovery center. In the event that your main processing facility is destroyed, an alternate recovery site will be necessary. Without proper advance planning, your recovery may come to a halt. One company in Southern California that was struck by a disaster found itself with a backup CPU made available within three days by their hardware vendor, and no place to put it. Their data center had been gutted, and arranging an alternate facility took almost three weeks.
The third major component of the contingency plan is offsite storage. Several items of importance may be stored offsite, but most important are backups of your data, programs, and system. Simply put, you can't recover what you don't have. If it's not stored offsite, you are putting your company and career in great jeopardy. In addition to the above, there are other items that you may want to store offsite, including a copy of your contingency plan, a supply of custom forms, a supply of magnetic tapes/cartridges, cash, system manuals, and run books used for daily jobs. Your offsite requirements are dictated by your operating environment, so everyone's essentials will vary.
On October 1, 1987, an earthquake measuring 5.9 struck the city of Whittier, California, causing massive destruction at California Federal, a California-based savings and loan. Based on their experience, a list of on-hand essentials was compiled. While most of the items listed above should be stored in environmentally controlled, high-security facilities, these items can be stored in almost any secure location. They include:
Tables and Chairs
Minimum Processing Schedules
Spare Controllers and Terminals
Hand Held Tape Recorders
At this point in the project, we were concerned with the recovery manual, and by using the above description, we began putting the skeleton of a manual together. We started with a simple table of contents that outlined the major sections, which included:
1. Introduction to the Contingency Plan
2. Procedures to Follow for the Activation of the Contingency Plan
3. Damage Assessment
4. Personnel Assessment
5. Contingency Planning and Recovery Teams
6. EDP Supplies
7. Offsite Storage
In addition to those listed above, other sections were discussed, including:
1. Production Status and Continuity
4. Software Recovery and Technical Support Responsibilities
5. Cash Fund
6. Restoration Planning
7. Testing the Plan
8. Maintaining the Plan
9. Contingency Strategies for Minor Service Disruptions
Using the above outline, we began developing a more detailed manual by dividing the major sections into sub-sections. Due to space limitations, I am unable to recap every section and sub-section, but the following should give you enough information about the planning process to understand the major concepts.
Introduction to the Contingency Plan
The Introduction was divided into five sub-sections: Plan Purpose, Assumptions, Objectives, Constraints, and General Information. Each of these sub-sections gives the reader a more defined concept of what the plan is supposed to do, and what it is not supposed to do. These sub-sections outline the goals of the plan, and let us know why the manual and plan are designed in this particular manner.
The purpose of InsurAll's plan was to reduce the number of decisions to be made when a contingency occurs, thereby minimizing the effects of a disaster on the company. In addition, it states that the plan represents a commitment by management to provide the required resources to prepare for the contingency before it occurs.
The plan was also designed within certain assumptions. First, that some type of contingency could occur that would be of significant duration so as to incur substantial losses to the company. Second, that a recovery to a limited production environment in the shortest amount of time, followed by recovery to a normal production environment, is the most desirable objective. Third, that the company will be recovering from something less than complete destruction. Fourth, that the plan is not a rigid set of rules. Finally, that some occurrences may be outside the manual's stated goals and that good judgement should be used to handle undocumented situations.
In addition to the other elements, the plan's Introduction contained a sub-section of general information. It stated that based on the risk/impact analysis, the company could operate without data services for a specific amount of time. If the contingency was expected to exceed that amount of time, Computer Operations Management would be responsible for establishing alternate data center operations as soon as possible. It also stated that media offsite storage was utilized and described where more information about offsite media could be located.
Procedures to Follow for Activation of the Contingency Plan
The Activation Procedures were divided into six sub-sections: Procedures for Activation of the Contingency Plan, Flowchart of the Entire Contingency Plan, Notification Procedures, Team Assignments, Contingency Plan Meeting Locations, and MIS Position Cross Reference Listing.
The activation procedures developed for InsurAll were of the pyramid type. The procedures outlined how the senior ranking employee on the scene would notify Computer Operations Management, who would in turn notify upper management. It also outlined who would make the activation decision and what steps and consequences were involved in that process.
To give the reader a better understanding of how the contingency plan was designed to work, an illustration or flowchart (See Illustration 1) of the overall plan concept was developed.
The notification procedures are somewhat like the plan activation procedures, except that they involve notifying senior management and off-duty MIS employees if the plan is activated. The notifications in this section are made by a notification status team. Within the notification sub-section is a checklist that is used to keep track of notifications, each person's status (whether or not they are available), recovery location, and expected arrival time. This checklist is used to assign alternates, if necessary, and to coordinate activity timing.
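A checklist of this kind maps naturally onto a small record per person. The sketch below is only an illustration of the idea, not InsurAll's form: the field names, job titles, and the rule for picking alternates are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationEntry:
    """One line of a hypothetical notification checklist."""
    title: str                          # job title of the person being called
    available: Optional[bool] = None    # None means not yet reached
    recovery_location: str = ""
    expected_arrival: str = ""          # free text, e.g. "09:30"
    alternate_title: Optional[str] = None


def alternates_needed(checklist):
    """Return the alternates to call for everyone reported unavailable."""
    return [entry.alternate_title for entry in checklist
            if entry.available is False and entry.alternate_title]


checklist = [
    NotificationEntry("Computer Operations Manager", True, "Conference room", "09:30"),
    NotificationEntry("Systems Programmer", False, alternate_title="Senior Operator"),
    NotificationEntry("Lead Operator"),  # not yet contacted
]
to_call = alternates_needed(checklist)
```

People who have not yet been reached are deliberately left out of the alternates list, so the team keeps trying them before falling back to a replacement.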
The Activation Procedures Section also contains information about meeting locations. In the event of a disaster, contingency operations will be handled out of a specific conference room, but in the event that the facility has been damaged, alternate meeting locations have been selected. Prior to the notification of employees, management will decide which meeting location to utilize, and when each person is notified, they will be told which meeting location has been selected, and what time to meet. Within this sub-section are verbal directions to each location as well as maps, that can be given to employees, that outline travel routes. The condition of roadways will be taken into account before deciding on a meeting location.
The final sub-section is the Position Cross Reference List. Each job title within the company is listed along with the name of the employee currently filling the position. Throughout the manual, responsibilities are linked to the job title. When a different employee fills a job title, only the position cross reference list needs changing.
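The benefit of this indirection is easy to see in a short Python sketch. The titles, names, and task wording here are hypothetical and are not taken from InsurAll's manual.

```python
# Hypothetical position cross reference: job title -> current employee.
position_xref = {
    "Computer Operations Manager": "A. Example",
    "MIS Director": "B. Sample",
}

# Elsewhere in the manual, responsibilities refer only to job titles.
responsibilities = [
    ("Computer Operations Manager", "Notify upper management"),
    ("MIS Director", "Review damage assessment forms"),
]


def who_does_what(xref, tasks):
    """Resolve each title-based responsibility to the person now in that job."""
    return [(xref.get(title, "UNFILLED"), title, task) for title, task in tasks]


# A staffing change touches only the cross reference, never the task lists.
position_xref["MIS Director"] = "C. Newhire"
resolved = who_does_what(position_xref, responsibilities)
```

A title with no entry in the cross reference resolves to "UNFILLED", which makes a vacant position visible instead of silently dropping its tasks.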
Damage Assessment Planning
Damage Assessment Planning was divided into four sub-sections: Physical Site, Computer Room, Equipment, and EDP Equipment Outside Computer Operations. Though there are four separate sub-sections, there are really only two main areas of concern: physical site damage and equipment damage.
Physical site damage, whether or not it is in the computer room, is concerned with such items as flooring, electrical, ceiling, walls, doors, windows, plumbing, fire suppression equipment, and water detection devices. In any case, a checklist is used to assess damage. For the computer room, specific items are listed for inspection, whereas outside of the computer room, the form is generic, and the assessment team has to fill in damage items specifically. Basically, the form asks who is completing the document, the item assessed, whether the item is repairable, and the approximate date of repair or replacement. In addition to this information, each damaged item must be detailed on a separate sheet of paper for each area, which is then attached to the checklist.
Damage to equipment is handled much the same way, with a checklist (See Illustration 2). Equipment damage assessment, however, requires the presence of a customer engineer from the respective computer hardware company, and a telephone company representative for communications damage assessment. Much like the physical site checklist, the equipment checklist records whether the item examined is damaged, whether or not it is repairable, the approximate repair or replacement date, and who performed the damage assessment. For equipment not in the computer room, the department and location are also required.
To help speed the recovery process, all damage assessment forms are required to be submitted within three hours of a disaster alert.
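The form fields and the three-hour rule can be expressed as a simple record plus a deadline check. This is a rough sketch only: the field names, sample items, and times are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# From the plan: forms are due within three hours of the disaster alert.
SUBMISSION_WINDOW = timedelta(hours=3)


@dataclass
class DamageItem:
    """One line of a hypothetical damage assessment checklist."""
    item: str
    assessed_by: str
    repairable: bool
    est_ready: str  # approximate repair or replacement date, free text


def is_late(alert_time, submitted_time):
    """True if a form came in after the submission window closed."""
    return submitted_time - alert_time > SUBMISSION_WINDOW


def needs_replacement(items):
    """Names of items the assessor marked as not repairable."""
    return [entry.item for entry in items if not entry.repairable]


alert = datetime(2000, 1, 1, 8, 0)  # hypothetical alert time
form = [
    DamageItem("Raised flooring", "Operations Manager", True, "within a week"),
    DamageItem("Tape drive", "Customer engineer", False, "two weeks"),
]
late = is_late(alert, datetime(2000, 1, 1, 11, 30))
replace_list = needs_replacement(form)
```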
The Personnel Assessment section was divided into two sub-sections consisting of Personnel Assessment and a Personnel Assessment Form/Checklist.
The Personnel Assessment sub-section included such items as the phone numbers of the computer operations department, technical support department, and data processing management. The list would be used by the personnel assessment coordinator to assess employees' availability to participate in the recovery process and to assess transportation needs for the transportation team.
As the personnel assessment coordinator makes contact with employees, the personnel assessment list (See Illustration 3) is updated with information, such as if the employee was injured, and when they would be available to work. Transportation issues are also noted at this time, such as if an employee can get to the data center, or the backup center, without assistance.
Contingency Planning and Recovery Teams
The Contingency Planning and Recovery Teams section was divided into four sub-sections: Contingency Plan Team Concept, Contingency Plan Management Team, Damage Assessment Team, and Personnel Assessment Team.
The most widely accepted approach to disaster recovery is the team approach. Since companies vary in size and design, and the talents of each staff vary, the number of teams you will have and their functions may vary considerably from what was proposed in this project. What really is important here is the concept, and there are some team characteristics you may want to consider. A team may consist of one or more persons, and a person may be on more than one team. A team member does not have to be from within MIS and a team member should always have a replacement or alternate. Based on these rules almost any combination is acceptable. The key to a successful team approach is making sure that a person is used effectively.
There are many types of recovery teams. Some of the most widely used include:
4. Damage Assessment
5. Personnel Assessment
6. Offsite Storage
8. Systems Programming
9. Application Programming
11. Original Site Restoration
12. Public Relations
14. User Interface
16. Contingency Plan Status
17. Offsite Non-Operations Requirements
18. Contingency Center Operations
A large data center with many employees and sophisticated systems may require most of the recovery teams listed above. While each employee has a single or few tasks to perform, each task may be logistically difficult and very time consuming.
Realistically, a recovery team member of a small shop may be expected to perform several tasks of a shorter duration simultaneously. Some small MIS shops have as few as three teams: management, operations, and programming. But they still perform the same basic functions as larger shops that have more teams and team members. For example, the management team may act as the notification, public relations, contingency plan status, damage assessment, clerical support, and purchasing team. And the team may be made up of more than just MIS personnel, such as the Vice-President of Finance, MIS Director, Controller, Computer Operations Manager, Manager of Systems and Programming, a purchasing specialist, and a secretary. In addition to their duties on this team, team leaders may be leaders or participants on other teams. The key to successful development of teams is knowing the strengths and weaknesses of personnel in a stressful environment. Everyone acts differently in a crisis, and knowing how each person acts may influence the manner in which you organize teams.
Within each team, knowledge of several items is needed: the team leader's title, the alternate team leader's title, the team members' titles, and the tasks to be performed by each member. Frequently, task checklists contain spaces for notes and signoffs, and reference materials for more information about executing the tasks. They may also include statements of critical paths that show how tasks interrelate with other recovery team tasks.
Checklists may also be designed as a small booklet that can be removed from the manual, so that those performing the tasks have their own copies and do not have to refer back to the team leader manual. In a crisis, no one will want to carry around a two-hundred page book when only a few pages may be necessary.
As tasks are performed and checklists updated, they can be inserted into, or used to update, a command center notebook so that all recovery teams have access to completed task information at a central location.
EDP Supplies was divided into four sub-sections: Custom Forms, Stock Forms, Magnetic Computer Cartridges, and Miscellaneous. In this section, the intent was to list the supplies that are used in daily operations, not what is stored offsite.
Each sub-section has two lists. The first is a checklist that can be used to assess available stock levels against minimum stocking quantities. In the event an item is partially or fully destroyed, the supplies person would indicate how much of the item is in stock, and determine the amount that needs to be ordered. The second list contains the names, addresses, and phone numbers of suppliers and backup sources for each supply item. In an area where there is the threat of a regional disaster, the list should contain the names of suppliers outside the immediate area.
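The arithmetic behind the first list is simple: order whatever is needed to bring each item back up to its minimum stocking quantity. A minimal sketch, with made-up item names and quantities:

```python
def order_quantities(minimum_stock, on_hand):
    """Return {item: quantity to order} for every item below its minimum.

    Items missing from on_hand are treated as fully destroyed (zero in stock).
    """
    orders = {}
    for item, minimum in minimum_stock.items():
        shortfall = minimum - on_hand.get(item, 0)
        if shortfall > 0:
            orders[item] = shortfall
    return orders


# Hypothetical minimum stocking quantities and a post-disaster count.
minimum_stock = {"Custom invoice forms": 10, "Stock paper": 20, "Cartridges": 50}
on_hand = {"Custom invoice forms": 4, "Stock paper": 25}

orders = order_quantities(minimum_stock, on_hand)
```

Here the custom forms are partially destroyed, the stock paper is above its minimum and needs no order, and the cartridges are counted as a total loss.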
The Offsite Storage section was divided into nine sub-sections: Offsite Storage Location, Retrieval Procedures, Employee with Offsite Access Cards, Offsite Backup Procedures, Backup Information, System Backups, Forms Stored Offsite, Transportation of Backup Datasets, and Audit Procedures.
As mentioned earlier, offsite storage of essentials is a major component of a contingency plan. This section lists the items stored offsite, and sets procedures for retrieving those items. The offsite storage of data is a primary example of how a contingency plan should be integrated into your daily routine in such a manner that it is no longer thought of as extra duty. It becomes just another part of doing business, which is exactly what contingency planning should be. It should be considered a cost of doing business; it is not a luxury.
By now, you should have an idea of how the recovery manual fits into a contingency plan, and its basic structure. All of the different sections and sub-sections are designed basically the same, which makes the manual easier to use. Trying to keep it simple while covering all the bases can be challenging, but in this particular case, the use of checklists was one manner in which the recovery process could be simplified while making sure that no item was neglected. Also, the use of job titles instead of names simplifies the task of maintaining the manual.
Our fourth session began with a presentation by David Williams. The presentation focused on user contingency plans, and the user's role in a crisis.
When a contingency plan is developed, it is usually designed with one thought in mind: to save and restore the data center. We tend to forget about the user who, during a crisis, is left to fend for himself in a world of impatient (and often irate) customers and suppliers. Customers don't care if there is a crisis; it's not their problem. They want service. And suppliers want to get paid, or they won't deliver.
Although it may not be a desirable task, users will need assistance with developing manual business procedures for system interruptions. In many cases they can't develop backup procedures without help, and in most cases, without brow-beating, they simply won't do it.
For the most part, we are the experts, and have the technology or the expertise to solve system downtime situations. And problems don't have to be of crisis proportions to require the implementation of a manual backup system. Manual procedures may also be used during system upgrades, maintenance, system backups, or any time the system is brought down during operating hours.
The easiest way to develop good manual backup procedures is during the development of a system. Many businesses are now incorporating contingency planning, both MIS and System User Plans, into their Systems Development Methodology (SDM). When a new system is proposed, contingency planning is built into the system's design. For example, one company has an online system that is used to make physicians' appointments. If the system were to be unavailable, total chaos would occur. First, the receptionist wouldn't know who's coming in, and second, she wouldn't be able to set future appointments. The solution was quite simple. Each night, a copy of the appointment calendar is sent to a spooled output queue, and copied to diskette. If the system is unavailable, the appointment calendar is simply printed from the spooled file on another machine. By distributing the report, they then have the ability to operate the system manually for thirty days. In this case, a small addition to the original system design specifications created the ability to continue conducting business and to provide timely service to customers even during a crisis situation.
Adding contingency planning to SDM is the ideal situation, but in most cases, it will be a retrofit. Creating a contingency plan after a system has gone live can be a nightmare. When automating, the old manual procedures are still fresh in the user's mind, and the manual forms are still around. But after a few months, no one will remember how they did it in the good old days and the manual forms will be long gone. Can you imagine trying to recreate them?
I recently had the opportunity to assist a group of users in the development of manual backup procedures, and after several phone calls, meetings, and memos, produced a user plan that has been used successfully on numerous occasions.
The plan was developed for Knott's Berry Farm's Information Center. Those of you unfamiliar with Knott's should know that it is an amusement park with dining and shopping facilities, and is a major producer of food products in Southern California (jams, jellies, and preserves). It is America's oldest themed amusement park and has more than three and one-half million guests yearly, and its Chicken Dinner Restaurant serves more than one and one-half million dinners yearly. The Information Center sells annual passes for admittance into the amusement park, gift certificates for shopping, and certificates for breakfast or dinner for either of two main restaurants. The system is online and provides a valued service to guests. Losing the ability to sell passes and gift certificates means disappointing patrons and losing revenue. The challenge was to develop a plan for maintaining the high level of guest service even during system downtime.
We began the process by defining end-user contingency operating procedures as "alternative written procedures used by computer end-users to continue business operations during an extended computer outage." Armed with that definition, we divided the user contingency plan into three major components.
The first component of the plan was a manual that contained the actual procedures that would be used to implement the manually operated system, and enter data into the system once the computer was again available. The procedures were designed to tell them what they needed to get started manually, and where to find it. After setting up, the procedures explained in detail how to sell the annual passes and gift certificates using the second component of the plan, manual tracking sheets.
The automated system is designed to assign serial numbers to annual passes and gift certificates. For sales to be sequenced correctly when the computer comes back up, staff need to record them on tracking sheets. These forms capture the sales information that must be input into the computer once it is recovered. Using the sales tracking sheets, the data can be input sequentially into the system by transaction, a requirement for accurate sales reporting and accounting.
The third major component of the manual system consisted of reports containing product pricing information that would be necessary at the time of the sale. The reports are printed whenever a price change is made in the system, and inserted as an appendix item into the backup manual (See Illustration 4).
With those three major components (the procedures manual, backup forms, and pricing reports), the system users could fully implement a manual system in less than five minutes. During the last year the manual system was implemented more than a half dozen times.
We learned that the major obstacle to the development of the manual system was a lack of skills. No one at Knott's had created a manual backup system before. The process, however, began with familiarizing ourselves with the system and how it's used (analysis). We met with the manager of the Information Center and explained our concerns about system downtime and the impact it may have on the company's ability to provide guest services (getting management support). We later met with the manager and lead personnel to explain the contingency planning process and what would be expected of each person (user education). We again met and agreed upon the content of the manual and listed each person's tasks and responsibilities (concept and design). In less than two weeks, the entire manually operated system was developed and was ready for testing (system development). The manual backup system, successfully tested, was implemented in less than three weeks.
Having a good manual backup system does more than just allow the user to conduct business when the computer is down. It also has a significant impact on the type of contingency plan you eventually develop. By reducing user dependency on the system, you buy extra recovery time, directly influencing the complexity of your contingency plan. The bottom line is, the simpler the plan, the easier it is to implement, test, and maintain.
It has not been my intention to oversimplify the development of user contingency plans. They can be frustrating when there is little or no support, and complex systems may require complex solutions. But the point is that user contingency plans for important systems should be considered as a part of a corporate contingency plan, and if you can include contingency planning into systems development methodology, it will become a routine that will save time and money.
According to our project timeframe, in Session Five, InsurAll would be presenting an overview of their contingency plan and a draft of their disaster recovery manual.
The concept of our project was to assist a real company through all the steps in the development of their contingency plan. Usually, disaster recovery groups present guest speakers and spend countless hours theorizing about disaster recovery, and our group was no exception. While the information presented is invaluable, without a practical application, the ability to utilize the information is never manifested and the information becomes useless. This project was designed to be a real-life experience where the information could be presented in a logical order. Forget theory! We wanted to know what really happens in the development of a contingency plan. What succeeds and what fails? Where are the pitfalls? Up to this point, we had succeeded. We had succeeded in finding several differences of opinion. Perhaps the most profound argument concerned the concept of contingency planning itself. Some members felt that the disaster recovery plan was contained in a manual, while others felt that contingency planning is made up of everyday activities that are not necessarily written (i.e. manual backup procedures for users, regular daily data backups, and security features). There were disagreements about whose responsibility contingency planning is and is not (i.e. user, MIS, management, all of the above), and who should lead the planning effort. We differed in our team recovery approaches and how to budget for contingency planning. Each member had his own concept of how a recovery manual should be designed and they varied greatly in their ideas of recovery manual content. But even with these fundamental differences, generally the group agreed on the concept and basic design of InsurAll's recovery strategy. After five months of planning, our members were ready for InsurAll's presentation.
Anyone who has ever developed or attempted the development of a contingency plan understands the commitment required to create and implement a plan. THE FATE OF CONTINGENCY PLANS IS HELD BY A FINE THREAD. The thread consists of management's support, a very limited budget, uninterested system end users, and an overworked data processing staff. Few companies have the luxury of a budget or staff that adequately supports contingency planning needs. When any of the thread's components are lost, the thread breaks. In our fifth session, the thread broke.
Somewhere between our fourth and fifth sessions, I met with InsurAll's representative, who informed me that the person developing the plan had resigned and that InsurAll would be unable to continue with the project because of staffing limitations. It appeared as if one of the classic fates for contingency planning was happening to us. When staffing gets tight, contingency planning takes a back seat to other projects. It was my task to break the bad news to the executive committee.
Understandably, there was great disappointment. The project to which we had dedicated much time and effort was at an end and we were unsuccessful in our endeavor. Our project to create a contingency plan for InsurAll and to guide our members through the contingency planning cycle had failed.
In our final session, however, all of our efforts proved worthwhile. At the end of the session, I was approached by a member who had been attending each of our sessions. Following along with our planning, he had developed a contingency plan for his company (exactly what we had hoped would happen), and he asked if I would review his plan.
His recovery plan was structured much like the one suggested by Mr. R.P.R. Gaade, in his article "Picking Up the Pieces," published in the January 1980 edition of Datamation. The recovery team responsibilities were based on the U.S. Dept. of State Information System Security Handbook.
Basically the plan was divided into 8 sections. The first seven comprised the main body of the recovery manual and the eighth section was Appendix information that would be used during the recovery process. It contained a recovery checklist and the recovery plan overviews. The main body included: an introduction; a list of the recovery teams, their team leader and alternate, and their tasks and responsibilities; emergency response information for handling power failures, fire, earthquakes, flooding, hazardous material, bomb threat, and other calamities that frequently incapacitate data centers; the actions for assessing damage during a major incident; steps required to implement processing at a remote location; how to restore processing at the primary location after the disaster is over; and how to maintain the plan. The entire plan consisted of about 70 pages which provided a framework for recovery capability (see Figure 1).
His company's recovery plan was designed as a methodology for recovering MIS in the event of a disaster, but did little for the end user during the outage. After carefully reviewing the manual, I returned his document along with a few notes. All in all, the manual was well done but lacked some information in areas that are typically overlooked in a recovery manual.
One item I noted was the definition of a disaster. There are several levels of a disaster. An example of a multilevel definition is:
1. Level 1 - disaster can be handled by company personnel alone.
2. Level 2 - the disaster recovery will require some outside intervention (police, fire department, plumber, etc.).
3. Level 3 - the disaster will require help from multiple external organizations.
Some companies define the disaster in terms of the length of the computer outage:
1. Level 1 - an outage of short duration with no impact.
2. Level 2 - the outage requires users to implement manual procedures.
3. Level 3 - the outage will be for an extended period of time and some data has most likely been lost.
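The first definition above, which grades a disaster by how much outside help it needs, could be encoded roughly as follows. The cut-off between Level 2 and Level 3 is an assumption made for illustration: one outside organization is treated as "some outside intervention" and two or more as "multiple external organizations". Your own plan should state its thresholds explicitly.

```python
from enum import IntEnum


class DisasterLevel(IntEnum):
    LEVEL_1 = 1  # handled by company personnel alone
    LEVEL_2 = 2  # some outside intervention required
    LEVEL_3 = 3  # help needed from multiple external organizations


def classify(outside_organizations):
    """Grade a disaster by the number of distinct outside organizations needed.

    Assumption for this sketch: exactly one organization means Level 2,
    two or more means Level 3.
    """
    count = len(set(outside_organizations))
    if count == 0:
        return DisasterLevel.LEVEL_1
    if count == 1:
        return DisasterLevel.LEVEL_2
    return DisasterLevel.LEVEL_3


level = classify(["fire department", "telephone company"])
```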
There are several ways to define a disaster, but once a disaster has been declared, a methodology for notifying each team member should be implemented. His manual did not outline the method for notifying team leaders or alternates. The pyramid approach is perhaps the most effective and the most popular, although many companies decide to have notification teams. The determining factor should be the size of your recovery group. The pyramid calling scheme works well for small data centers while the notification team approach works well for large MIS groups. In the pyramid scheme the disaster recovery coordinator notifies the team leaders about the disaster, and they in turn notify their team members. A notification team will call all personnel and report the availability of each team member to the team leaders. Alternate team members are contacted if necessary.
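The pyramid scheme is essentially a call tree walked level by level. The sketch below shows one way to list who calls whom; the team names and the people in them are hypothetical.

```python
from collections import deque

# Hypothetical call tree: each person calls the people listed under them.
call_tree = {
    "Recovery Coordinator": ["Operations Team Leader", "Programming Team Leader"],
    "Operations Team Leader": ["Operator A", "Operator B"],
    "Programming Team Leader": ["Programmer A"],
}


def calls_in_order(tree, root):
    """Breadth-first walk of the tree, returning (caller, callee) pairs."""
    calls = []
    queue = deque([root])
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            calls.append((caller, callee))
            queue.append(callee)
    return calls


calls = calls_in_order(call_tree, "Recovery Coordinator")
```

Each person makes only a handful of calls, which is why the scheme suits small data centers; in a large group the tree gets deep, and a dedicated notification team may be the better choice.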
Another area of concern I had was in identifying an assembly area or command post following a disaster. In a large disaster the data center may be unusable and an alternate location necessary.
Arrangements for an alternate command center should be made in advance with provisions for enough telephones to communicate with company personnel, government agencies and other outside resources. Determine in advance who in your company has a mobile phone that can be used following a disaster. After an area-wide disaster the telephone service may be unavailable and a mobile phone could be your only outside link.
I had several other notes in the manual but most of them were specific to that particular company. Two additional comments I would like to make concerning disaster recovery are general in nature. First, during a crisis people need to be organized and need to know exactly what to do. Unfortunately, and primarily for security reasons, only a few copies of the disaster recovery plan are printed and distributed. Following a crisis, each team member will need a list of the tasks for which he is responsible. Many people argue that during the testing phase all members learn what to do during a crisis, but it's been my experience that recovery is rarely executed by exactly the same people as those who were trained. I suggested that multiple copies of each recovery team's duties be printed and inserted into the appendix of the recovery manuals. If a disaster occurs, the copies of the team tasks are given to each team leader for distribution to the team members. My second comment was for the inclusion of a recovery overview that is also pre-printed for distribution to all recovery teams. The overview would show the sequence in which recovery tasks should be executed. A good format to use is a PERT chart of disaster recovery tasks. Each team can understand how their duties fit into the total recovery scheme and how their activities are related to other team tasks. The chart will identify the critical nature of each team completing their task on time so that other teams can begin or continue their duties. The chart can also be used as a check-off list, noting the completion of various tasks.
While this plan was mostly typical, in another way it was quite unusual. Most disaster plans use the "Superman" approach: one superhuman who does just about everything, and generally that person is from data processing. Disaster recovery plans rarely make use of non data processing personnel in the recovery process. This plan enlisted the help of several outsiders. Some team members included accountants for verifying the accuracy of recovered data, secretaries for preparing purchase orders, accounts payable clerks for preparation of C.O.D. checks for small purchases and maintaining a cash fund, and building maintenance personnel for the recovery of utility services and structural repairs.
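To show what such an overview captures, here is a small sketch of a PERT-style task network that computes the earliest time each task can finish. The tasks, durations, and dependencies are invented for illustration, and the sketch relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: name -> (duration in hours, prerequisite tasks).
tasks = {
    "Assess damage": (3, []),
    "Retrieve offsite backups": (4, []),
    "Ready alternate site": (8, ["Assess damage"]),
    "Restore system": (6, ["Retrieve offsite backups", "Ready alternate site"]),
    "Verify recovered data": (2, ["Restore system"]),
}


def earliest_finish(tasks):
    """Earliest finish time of each task, measured in hours from the alert."""
    graph = {name: set(prereqs) for name, (_, prereqs) in tasks.items()}
    finish = {}
    # static_order() yields every task after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        duration, prereqs = tasks[name]
        start = max((finish[p] for p in prereqs), default=0)
        finish[name] = start + duration
    return finish


finish = earliest_finish(tasks)
```

In this example the restore cannot begin until the alternate site is ready at hour 11, even though the backups arrive at hour 4, so any delay in readying the site delays everything after it. That is the kind of dependency the chart makes visible to every team.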
The plan also made good use of checklists to assess the functionality of equipment, the physical site, backup data, supplies and documentation, and notification of clients and vendors.
The development of a disaster recovery capability is a complex task full of pitfalls and is usually thankless until needed. Plans are difficult to create and just as hard to maintain. During our project we learned several important points:
1. A disaster recovery plan requires top management's support in terms of financial and human resources to be successful.
2. A solid recovery capability includes the ability of system users to function manually during an extended outage.
3. A disaster recovery plan is more than just a manual. It also consists of tasks that are executed as a part of the everyday operations (i.e. backing up data regularly, differing levels of security, keeping the data center clean, installing a fire suppression system and maintaining it in good working order, etc.).
4. A recovery plan must be reviewed and updated on a regular basis to remain current.
5. A recovery plan is only as good as the people who will be executing it. Those people must be trained on how to perform their duties.
6. No matter how good the plan and the people are, testing will reveal weaknesses that no one could have anticipated, along with some that should have been. One point to remember during testing is that people don't fail; plans fail. People are not being tested, they are being trained.
7. Your plan must be practical. Reams of unnecessary paper are impossible to maintain and can hamper the recovery process. The plan can be so complex that it becomes unusable.
8. People act differently in a crisis. When you select a team leader, take into account the person's ability to function in a crisis. The hierarchy of a disaster recovery team may differ from the company's traditional hierarchy: subordinates may become team leaders, and vice versa.
9. Disasters are frequently preceded by advance warnings, and a mechanism for preventing the disaster from occurring should be in place. Steven Fink, in his book Crisis Management, defines a crisis as having four phases. The first is the prodromal, or warning, stage, in which quick action can avert a major disaster.
10. You cannot develop a good contingency plan by yourself. You are going to need help from your system users, data processing staff, management, vendors, telephone company and others, most of whom will be willing to help if you would just ask.
Disaster recovery plans that really work take time and patience to develop. Most of all, for a plan to be successful, it must be practical. In this series of articles on contingency planning, I have tried to present one group's experience in the development of a contingency plan. I hope you have found it informative.
My thanks to the InsurAll company, the ACP Orange County Executive Committee, our guest and member speakers, and most of all to our ACP members for a job well done.
David Williams is a Senior Programmer/Analyst at Knott's Berry Farm, Buena Park, California. He served two years as Information Manager for the Association of Contingency Planners (ACP) Los Angeles Chapter, was a member of the ACP National Board of Directors, and was co-founder and president of the ACP Orange County Chapter in 1987. He completed his Master's Degree at the University of Redlands, California, where his thesis was titled "Disaster Recovery Planning for the Small MIS Department."
This article was adapted from Vol. 1 No. 2, p. 5; No. 3, p. 4; and Vol. 2 No. 1, p. 4.
Disaster Recovery World © 1999 and Disaster Recovery Journal © 1999 are copyrighted by Systems Support, Inc. All rights reserved. Reproduction in whole or part is prohibited without the express written permission from Systems Support, Inc.