Developing a Contingency Plan for Your
By Jim Chettle
The contingency planning information in this article is limited in scope to the tele-processing (TP) network and does not address the
requirement for full restoration of the data center. The TP network discussed here consists of everything outside the main frame.
The network includes the front-end communications processors, the telecommunications lines, modems, multiplexers, and the
remote user devices.
Each of your TP networks is different in some way from the others, and therefore plans can best be formulated after a careful examination of all the components in each network. This implies that a thorough inventory be undertaken.
Following a disaster, not all of your TP networks will have the same level of importance in becoming operational again. You must obtain the recovery priorities from the users of the network. The users' opinions of the critical nature of each network and the resultant cost to recover must be approved by a management level high enough to warrant continuing your planning efforts and expenditures. More simply put, verify the critical nature of each network before expending a lot of effort and money.
Portions of your network may not be critical now to the continuing business functions of your company. However, these portions should be inventoried with the others, included in the plan, and identified as non-critical. The reason for including these is covered later in this article.
Other portions of your network may be identified as extremely critical, and therefore may require special treatment in your recovery plans. You may want to address recovery of extremely critical services prior to completing your entire plan.
Contingency plans are critical to the survival of your company following a disaster. This implies that they should be frequently reviewed and updated. It is wise to put the plans together in a modular fashion to facilitate updating. A modular structure allows the updating task to be distributed to several individuals or groups.
If contingency plans are critical to your company's survival, they should be tested frequently. Testing is the only way you will know if your contingency plan works. Your hardware, software, networks, and applications continually change, and your plans must stay up-to-date with your environment. Testing will be an ongoing task to ensure that your plan is still current.
After you complete your contingency plans, you may want to share them with your vendors of communication equipment and facilities. The vendors can verify and critique your plan, and they will be able to react more quickly to your recovery plan activation request if they already know what is required of them. They may accept standing orders to activate their portion of the plan upon a call from your network staff.
Most data center recovery plans will contain thorough lists of vendor contacts, employee home telephone numbers, etc. I recommend that the TP recovery plan contain a customized contact list, which can be used to quickly organize the communications recovery team. The list can contain the vendors' phone numbers used for reporting trouble, as well as home numbers of key vendor contacts.
Keep a copy of your recovery plan, inventory list, and contact lists at an off-site location. Most companies store magnetic media off-site, and the complete recovery plans could be kept with the media. I recommend that key individuals also keep a copy at home. Additional copies can be stored at your designated hotsite or coldsite.
The task of gathering information can be completed in several ways, but your completed work should be an auditable document containing every component of the TP network. This is important, since you never know which part of the network you will be recovering. You may be asked to recover a large user site, all sites in a geographical area, or the entire data center communications.
The inventory listing can be organized in several ways, and you must determine which works best for your situation. One method is to set up a master list of all of your organization's networks. Identify each by the name used in everyday discussions, such as teller terminal network, sales support network, plant multipoint network, etc. Another method is a list by facility type or common carrier service names. This would allow you to address the recovery of types such as analog multipoints or high speed digital pipes in separate sections of your plan.
Make sure that you list all networks and devices using host access. List and classify all TP networks, even those classified non-critical, because they are currently being used by someone; otherwise they would not be part of the network. The critical nature of any component could change at any time, and without a complete inventory list, recovery would be difficult.
Some sources for compiling your inventory list are: host/front end processor line listings; invoices from vendors of communication facilities, modems, CRTs, etc.; service charge detail listings from the telephone company; maintenance contracts; and trouble call phone lists. It is possible that your existing network support documentation contains all of the information necessary to complete the list. My experience has shown that this documentation is as good a place as any to start.
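One way to put these sources to work is to cross-check them against the inventory list and flag anything that shows up in a source but not in the inventory. The sketch below is only an illustration: the function name `find_unlisted` and the circuit identifiers are hypothetical, and a real listing would need its identifiers normalized before comparison.

```python
def find_unlisted(inventory, *source_listings):
    """Return items found in any source listing but absent from the inventory."""
    seen = set()
    for listing in source_listings:
        seen.update(listing)
    return sorted(seen - set(inventory))


# Hypothetical circuit identifiers, for illustration only.
fep_line_listing = ["LINE-001", "LINE-002", "LINE-003"]
carrier_invoice = ["LINE-002", "LINE-003", "LINE-004"]
inventory = ["LINE-001", "LINE-002", "LINE-003"]

missing = find_unlisted(inventory, fep_line_listing, carrier_invoice)
print(missing)  # ['LINE-004']
```

Here the line that is billed by the carrier but absent from the inventory is the one to investigate before the list is called complete.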
Whichever inventory listing method you choose, it should comfortably fit into a scheme which allows the users to identify those portions which they consider critical to their business plans. The users should classify the items on the listings as:
- extremely critical: must be recovered in 1 day
- very critical: must be recovered within 3 days
- critical: must be recovered within 5 days
- and so on, down to non-critical: do not recover
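The classification scheme can be kept as a small lookup table and used to produce a recovery order. This is a minimal sketch, assuming only the four classes named above (your scheme may have intermediate levels); the function name and the classification assigned to each sample network are hypothetical.

```python
# Recovery deadline in days for each class; None means "do not recover".
RECOVERY_DAYS = {
    "extremely critical": 1,
    "very critical": 3,
    "critical": 5,
    "non-critical": None,
}


def recovery_order(classified):
    """Return the networks to recover, most urgent first, skipping non-critical ones."""
    to_recover = [
        (RECOVERY_DAYS[level], name)
        for name, level in classified.items()
        if RECOVERY_DAYS[level] is not None
    ]
    return [name for _days, name in sorted(to_recover)]


# Hypothetical user classifications, for illustration only.
classified = {
    "sales support network": "critical",
    "teller terminal network": "extremely critical",
    "plant multipoint network": "non-critical",
}

order = recovery_order(classified)
print(order)  # ['teller terminal network', 'sales support network']
```

Non-critical networks stay in the table even though they are left out of the recovery order, so they can be reclassified later without rebuilding the inventory.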
When you ask the users to classify these networks, don't be surprised if they can't respond quickly. Sessions with the users can be very time consuming, but it is essential that the user is the one who classifies the networks.
The users are ultimately paying for the recovery scheme and they may be directly charged for their portion of the recovery costs. You should explain to them how you are going to recover, what you can recover, an estimated cost of the recovery, and the recovery scenario. The scenario should be a brief description of the recovery method you will use for most networks. Two sample scenarios are:
Sample 1
- all processing will be done at a hotsite located out of town
- all user sites will use dial-up access to the hotsite location
- dial-up access will double the normal response delay
- users will call the hotsite only when they can batch the work
Sample 2
- processing will be done at ABC Co. after 5:00 pm
- in-town users will gather all input and drive to ABC Co. for key entry
- out-of-town users will dial in via XYZ packet network to ABC Co.
Recovery schemes which will reduce performance or response time should be discussed with the users. For example, if your recovery scheme reduces line speed, response time would suffer. Although this may be acceptable to data entry locations which could re-allocate some of their work to off-shift times, it may not be acceptable to locations which have customer service functions such as teller windows. Customer service users may require the same response level as normal and this can create a more expensive recovery scheme than for the data entry locations. If the user does not agree with your recovery scheme, negotiations will be necessary since various TP networks may be more vital to the users than you surmised. Upper management will make the final decision based upon the facts presented and their assessment of various business functions.
Approvals by Management:
Cost vs. Risk
Very early in the classification process, upper management should be involved to provide direction. They should be the driving force behind all contingency planning and should express their opinion in the area of cost vs. risk. If you could recover the entire company network for $XX,XXX or could recover just the critical networks for a lot less, you can be assured that management will choose the "lot less." Even though the "lot less" may be your general guideline, I recommend that you go through the exercise to cost out the complete recovery, since it may not be too far out of line with a partial recovery, depending on the size of your network.
When you present management with your recommended plan and the alternative plans, try to include your best estimate of risk-of-failure with each plan. For example, the dial backup scheme may be occasionally problematic due to the lack of inter-city lines in a regional disaster. A completely redundant facility linked to a remote hotsite could provide instant availability following a disaster, and have a high likelihood of being available even after a regional disaster. The risk factor in the recovery plan is sometimes difficult to determine. Your vendors or a consultant may be able to help you determine this factor.
Putting the Plan Together
Let's assume that you have completed the inventory classification, set up a skeleton recovery document, received approvals from management, and are ready to write a formal plan. Through past experience I know that you will discover updating and rewriting your plan is a never-ending process. User business requirements change, networks change, main frame applications change, and the critical classification of networks changes. The structure of your recovery plan should be modular so it can accommodate changes easily. A modular plan could contain sections such as:
1. Statement of overall intent of the recovery plan
2. Inventory of the communications network components
a. Multipoint plant and sales office lines
b. Teller network
3. Host site inventory
a. Front end processor
4. Contact lists
b. Network support personnel
c. User sites
5. Step-by-step recovery
a. Extremely critical networks
b. Very critical networks
c. Critical networks
a. Line listings for front-end processor
b. Host command lists
7. General support information
a. Modem/multiplexer manuals
b. Dial backup numbers at remote sites
c. Access numbers for switched/packet networks
Customize each module in the plan based upon the expertise available to update it. For example, if a Software Support Group handles the front-end processor GENs, they can handle the software section of the plan. Similarly, a Network Operations/Control Group could handle the contact lists and the general support information sections.
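Assigning each module to a group also makes it possible to track which modules are overdue for review. The following is a rough sketch only: the 90-day review interval, the review dates, and the function name `overdue_modules` are assumptions made for illustration, not recommendations from any standard.

```python
from datetime import date

# (module, owning group, date last reviewed). Dates are hypothetical.
MODULES = [
    ("Software", "Software Support Group", date(1999, 1, 15)),
    ("Contact lists", "Network Operations/Control Group", date(1999, 5, 1)),
    ("General support information", "Network Operations/Control Group", date(1998, 11, 30)),
]


def overdue_modules(modules, today, max_age_days=90):
    """Return (module, owner) pairs whose last review is older than max_age_days."""
    return [
        (name, owner)
        for name, owner, reviewed in modules
        if (today - reviewed).days > max_age_days
    ]


stale = overdue_modules(MODULES, today=date(1999, 6, 1))
for name, owner in stale:
    print(f"{name}: ask {owner} to review")
```

Passing `today` explicitly keeps the check repeatable, which helps when the same list is used to prepare for a scheduled test of the plan.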
The Step-by-step Recovery Plan
The Step-by-step portion of the plan is the most difficult module to write. This part can only be proven by testing. The writing will require a great deal of thought because things you take for granted as common knowledge may be unknown to the person who ultimately executes the plan. The best guideline I can give you in this endeavor is to bore them to death with detail. One of the steps should be the ACTIVATION SEQUENCE which contains at a minimum the following items:
1. List of vendors and key individuals who must be notified, the method of notification, and the follow-up/verification agreements.
2. List of the actions you expect each vendor and individual to perform. This could contain the standing orders mentioned earlier, but must be very detailed, including every step each vendor or individual involved must perform. (An example of this is listed below.)
3. Location and phone number of the recovery command center for vendors and individuals to report status/completion.
4. Employee travel and expense arrangements, and location where each function will be performed. This might be a hotsite, a coldsite, or another out of town facility which would require manpower. Some plans have this section in the overall master plan, but it is wise to repeat it in the communications plan, even if it is abbreviated.
5. The activation priority list for each network type.
Detailed example of step 2 in activation sequence
Step 2 above is very critical to the success of your plan. The details in this section should include every task to be performed. If for example your plan depends upon dial backup from a user site to a remote hotsite, then your instructions should contain:
- the hotsite dial-up (modem) telephone number and the alternate number
- the hotsite contact telephone number
- the user location information; names, phone numbers, home phone numbers of supervisors (include the modem dial-up numbers)
- a general description of the type of processing performed by the user. This could be: 3278 CRTs accessing CICS DDA
- the type of modem at the user site, speed, terminal type and protocol
- the communications company for both the phone line at the user site and the long distance carrier, with phone numbers for problem reporting
- the procedure for establishing a connection, who calls who, on which phone line, switches at which location operated in what order, etc.
- the timing of connections, such as 24 hours or 8:00 am - 5:00 pm
- name and phone numbers of support organization(s) to call if the equipment doesn't work at the user site
- very important: give the users a copy of this plan and keep it updated. The users will need the phone numbers for problem reporting, etc.
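Because every item on the list above must be present for every user site, a simple completeness check can catch omissions before a test does. This is a minimal sketch: the field names, the function `missing_items`, and the sample record (including its telephone numbers) are hypothetical placeholders.

```python
# One entry per item in the dial backup instructions listed above.
REQUIRED_ITEMS = [
    "hotsite_dialup_number",
    "hotsite_alternate_number",
    "hotsite_contact_number",
    "user_location_contacts",
    "processing_description",
    "modem_type_speed_protocol",
    "carrier_trouble_numbers",
    "connection_procedure",
    "connection_hours",
    "support_contacts",
]


def missing_items(site_record):
    """Return the required items that are absent or blank in a site's instructions."""
    missing = []
    for item in REQUIRED_ITEMS:
        value = site_record.get(item)
        if value is None or not str(value).strip():
            missing.append(item)
    return missing


# Hypothetical, partly completed record for one user site.
site = {
    "hotsite_dialup_number": "555-0100",
    "hotsite_alternate_number": "555-0101",
    "hotsite_contact_number": "555-0102",
    "processing_description": "3278 CRTs accessing CICS DDA",
    "connection_hours": "8:00 am - 5:00 pm",
    "support_contacts": "",
}

gaps = missing_items(site)
print(gaps)
```

A check like this only confirms that something has been written for each item; whether the numbers and procedures are still correct can only be established by testing.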
If your plan requires your communications vendor to switch your computer site leased lines to a remote location which has standby lines and a switching arrangement, then your instructions should contain:
- overview of the switching scenario, including names of the services as the vendors know them (diagrams might help)
- the telephone number of the communications vendor for activating the plan, plus an alternate number, preferably of a supervisor or manager level person at the facility responsible for controlling the switching
- the passwords or other information required by the vendor to authorize the connections
- steps required by your operators at both the user end and the remote computer location which are different from the normal operation
This type of detail plan should be shared with your communications vendor. They can verify your instructions and keep a copy for their own information.
Another section of the plan should contain the PHYSICAL SETUPS at your recovery site. This must be in enough detail to allow a technician to establish your network from your plan write-up, without prior knowledge of your operation. A sample write-up is:
a. Brief overview of each type of line in the inventory, describing the communications controller line type, modem type and speed, communications line type, and user site device types. The vendor names for each of the above should be included with reference to the section in the plan which covers the operating instructions. This overview should allow an individual to understand the whats, wheres, and whys of each type of network, so that a smooth activation can take place, even if some components have changed since the last update.
b. Repeat the activation priority list in this section
c. Software specifics, for communications control processors
d. Connections to be made either by cabling, patching, or matrix switch, with each component mentioned by name/type/vendor, etc.
e. Location of each component (maps and diagrams may help)
f. Vendor support information in case of problems with components. This problem assistance information should be verified regularly.
Summary, part 1
As you probably have guessed by now, the details can be endless. It has been suggested that the only way to verify that the instructions contain sufficient detail is to test them with individuals who haven't previously read the plan. I tend to agree with this idea, and suggest that you try it, but only after you have tested your plan thoroughly with your best qualified support personnel. The tests by qualified personnel prove that your techniques will work. The other tests will surface any missing details.
Information given in the previous sections should enable you to construct a skeleton communications recovery plan. It's up to you to fill in the details.
Once your general structure is in place, any missing items will glare at you when you test the plan. Testing is the most important part of your plan. Without regular testing, you cannot honestly tell your company management that you have a viable recovery plan. Without regular testing, you will not ferret out omissions or changes in your environment. Test the plan!
PART II - DISASTER RECOVERY METHODS FOR TELEPROCESSING NETWORKS
As discussed in Part I, you can begin the formulation of your recovery plan after you have completed your network component inventory. Methods to recover your particular network are numerous, but are certainly influenced by the items listed below. Before discussing recovery methods, I'll discuss these items and offer a few general observations.
ITEMS WHICH INFLUENCE RECOVERY METHODS
- Type of Network
- Time Requirement for Recovery
- Percent of Network to be Recovered
- Budget Considerations
The cost of recovering networks varies greatly by the network type, by the time required for recovery, and by the percent of your network you must recover. You can spend a lot of time dreaming up new ways to recover your network, but the result probably will resemble other schemes supported by your hot site, cold site, and/or communications and equipment vendors. Ask them for recommendations for your network recovery scheme. You can customize as required. These vendors may be critical to the success of your plan, and will do a better job supporting a plan which they helped develop. In the end, you and your management must decide which scheme best fits the goals of your overall contingency plan.
Interest in a contingency plan was probably generated by management's assessment of the criticality of the data processing function to the financial health of your company. If the teleprocessing network is judged to be a critical component, then the cost to recover becomes less of an issue than the time required to recover. Cost of course must be balanced against risk, and that is management's job. Give them the facts and options, and they will select the method best suited to their goals. For example, if a particular network is extremely critical, and must be recovered immediately, fully redundant lines and equipment may be required. Short of full redundancy, recovery time and cost can vary considerably.
If the plan is the first for your company, the expenditures that you propose will be eyeballed for their financial worth. As discussed in the previous article, the management approval process will come down to dollars vs. risk. I suggest that you give management several proposals from which to choose. Include performance expectations as well as cost, and paint a scenario in plain English explaining how each will work. You may want to include testimonial information from vendors or other companies which use similar schemes. Your recommendation should be the one which makes the best sense from an operational as well as cost standpoint.
Following are recovery methods commonly used. The descriptive information will be somewhat abbreviated, but hopefully in enough detail to give you an understanding of the technique. Keep in mind that the methods shown are only one solution, not necessarily the best solution for your network. The various communications and recovery vendors and consultants can offer customized recovery options, and they may have a plan which is better/faster/cheaper than the example.
Please note that the examples cover only dedicated networks, and do not include fully redundant schemes. They are organized by network/line type. More complex methods will be discussed in a future article.
DEDICATED NETWORK RECOVERY METHODS TO BE DISCUSSED
- Voice Grade Point-to-Point Lines -- Figure A
- Voice Grade Multipoint Lines -- Figures B, C, & D
- Sub-Rate Digital Lines -- Figures C & E
* Voice Grade Point-to-Point Lines (See Figure A)
The most common and inexpensive method of recovering a failed dedicated point-to-point line is dial backup via a switched network. This scheme can also be used as your disaster recovery method, as long as the alternate computer recovery site is equipped with matching modems and sufficient dial line facilities. Figure A depicts a four wire backup scheme which requires two telephone lines at each end.
Some new modems offer single call dial backup, therefore saving the cost of a telephone line at each end. In the long term, this savings in monthly telephone line rental, and the savings in long distance charges, may offset the higher cost of these modems.
Note that dial backup, while commonly used, has drawbacks. The switched networks usually have a higher bit error rate than dedicated lines. They are not guaranteed for high speed data transmission, and are not guaranteed to be available. A regional disaster could bring about network busy situations. I recommend that you keep access codes for several alternate long distance carriers as part of your plan. This may improve your chances of completing dial backup calls.
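The idea of keeping several carriers on hand amounts to a simple fallback loop: try each carrier in turn until a call completes. The sketch below only illustrates that logic. The function `dial_with_fallback`, the `place_call` callback, and the carrier labels are hypothetical, and the call itself is simulated rather than placed through any real dialing equipment.

```python
def dial_with_fallback(number, carrier_codes, place_call):
    """Try each carrier in order; return the first code that connects, or None."""
    for code in carrier_codes:
        if place_call(code, number):
            return code
    return None


# Simulation only: pretend the first carrier is busy after a regional disaster.
busy_carriers = {"CARRIER-A"}


def simulated_call(code, number):
    """Stand-in for the real dialing step; succeeds unless the carrier is busy."""
    return code not in busy_carriers


used = dial_with_fallback(
    "555-0100", ["CARRIER-A", "CARRIER-B", "CARRIER-C"], simulated_call
)
print(used)  # CARRIER-B
```

Listing the carriers in the order you want them tried, and writing that order into the plan, spares the person executing the plan from making the choice under pressure.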
* Voice Grade Multipoint Lines (See Figures B, C, & D)
Dial backup can also be used to recover multipoint lines. Dial backup for multipoint lines requires either special bridging arrangements, as shown in Figure B, or alternate master stations, as shown in Figure C. Figure D depicts an alternate master at a hot site using a switching arrangement at the Telco office. Recovering multipoint lines with dial backup can be costly, and may be troublesome, depending on the quality of the switched network lines.
The bridging arrangement depicted in Figure B is most successful when the bridges used are of the new type which cancel noise on bridge legs not transmitting data. This arrangement also works best with modems without sideband diagnostic channels. Each remote site would be called, using two telephone lines, therefore requiring an enormous number of telephone lines to recover a large network. This arrangement is used as a recovery scheme for day to day line failures in many networks, and is offered as a standard product offering by major modem manufacturers. Several hot site vendors offer the bridging arrangement at nominal cost.
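A quick calculation shows how fast the line count grows. This is a rough sketch under stated assumptions: two telephone lines per remote site as described above, one matching line at the recovery end for each remote line (an assumption, since the article does not give the count at the bridge), and a hypothetical network of 50 remote sites. The single-line case is included only for comparison, on the assumption that single call dial backup modems of the kind mentioned for point-to-point lines could be used.

```python
def dial_lines_needed(remote_sites, lines_per_site=2):
    """Total telephone lines needed at the remote sites plus the recovery end.

    Assumes one matching line at the recovery end for every remote line.
    """
    remote_total = remote_sites * lines_per_site
    recovery_end_total = remote_sites * lines_per_site
    return remote_total + recovery_end_total


two_line = dial_lines_needed(50)                       # two lines per remote site
single_line = dial_lines_needed(50, lines_per_site=1)  # single call dial backup
print(two_line, single_line)  # 200 100
```

Running the numbers for your own site count is a useful input to the cost figures you present to management.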
The alternate master scheme depicted in Figures C and D can only be used in disaster situations, but can lower the cost if dial backup is not required at each remote site for day-to-day operations. A switching arrangement is required at the common carrier central office. The alternate master leg of the data line should be terminated at a site some distance from the computer center. This can be a cold site, a hot site, or a company in your town which allows you to terminate the alternate master lines on their premises. Figure C shows the alternate master terminating at a hot site. Figure D shows the alternate master at a site in the same city as your computer center, with back-to-back modems used to link the alternate master to the recovery site.
* Sub-Rate Digital Lines (See Figures C and E)
Sub-rate digital in this discussion refers to 2.4, 4.8, and 9.6 kilobit-per-second services. Recovery for digital can be done in the same manner as the analog networks, using dial backup. To use dial backup as depicted in Figure E requires a standby analog modem and a switching arrangement at each site. This can get quite expensive, and is not practical for large networks. The alternate master scheme shown in Figure C can be used for analog or digital.
Part II SUMMARY
The methods discussed are in use and can be applied with variations to many small to medium size networks. Larger networks may require more sophisticated methods.
As mentioned previously, I recommend that you seek your vendors' and consultants' advice when laying out your teleprocessing recovery plans.
This article is adapted from Vol. 1 No. 2, p. 3; No. 3, p. 6.
Disaster Recovery World © 1999 and Disaster Recovery Journal © 1999 are copyrighted by Systems Support, Inc. All rights reserved. Reproduction in whole or part is prohibited without the express written permission from Systems Support, Inc.