This article is concerned only with planning for disasters. Although many of the activities associated with incident planning are applicable to disaster planning, disasters have been chosen as the focal point because too many organizations see disaster planning as an add-on to incident planning. We believe it is necessary to adopt a viewpoint that evaluates the effects of a disaster first. Analyzing the total system that delivers communications to the organization in this fashion builds a different perspective of its vulnerabilities and, therefore, promotes a more comprehensive approach to the network’s survivability.
The first and most effective step in disaster recovery planning for your network is to prevent the disaster from ever happening. This step is frequently overlooked in the planning process. The security of many of the network components is completely beyond your control. The wires and cables of the lines, the towers, the dishes, and the exchange offices themselves are in the control of the communications carriers.
However, many of the components are within your own premises or under the control of your landlord. These are the components of your network that you should evaluate with respect to security. A simple checklist is presented to assist in the evaluation process:
- All telephone switch rooms and closets should be locked at all times. Keys to the closets should only be given to those truly requiring access.
- All telephone rooms should be as well protected and monitored against threats as computer rooms. Protection should include Halon fire suppression systems, sprinklers, fire and smoke detection, rate of temperature rise monitoring, water detection and monitoring, and intrusion detection.
- Telephone switching equipment and local controllers should not both be in the computer room. This is an unnecessary concentration of risks in one place.
- Modems, controllers, and multiplexers should be subjected to safeguards similar to telephone rooms. Preferably, they should not be placed in open office areas.
- Software and configuration data from programmable telephone switching gear should be backed up regularly and stored offsite.
- Central switchboards should be as well protected from threats as the telephone rooms themselves. Loss of the switchboard can be as disruptive as the loss of the PBX.
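The checklist above lends itself to a simple audit record kept for each site. The following is a minimal sketch only: the item wording is abbreviated from the checklist, and the function name `open_items` and the sample answers are hypothetical, introduced purely for illustration.

```python
# Abbreviated versions of the six checklist items above.
CHECKLIST = [
    "Switch rooms and closets locked; keys restricted",
    "Telephone rooms protected and monitored like computer rooms",
    "Switching equipment and local controllers not both in the computer room",
    "Modems, controllers, and multiplexers safeguarded",
    "Switch software and configuration data backed up offsite",
    "Central switchboard protected like the telephone rooms",
]


def open_items(answers):
    """Return the checklist items not confirmed as satisfied.

    `answers` maps item text to True or False. An item with no
    answer is treated as unsatisfied.
    """
    return [item for item in CHECKLIST if not answers.get(item, False)]


# Illustrative answers for one hypothetical site: everything is in
# place except the offsite backup of switch configuration data.
answers = {item: True for item in CHECKLIST}
answers[CHECKLIST[4]] = False
gaps = open_items(answers)
```

Treating an unanswered item as a gap is a deliberate choice here: an item nobody has checked should not be counted as secure.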
COMPUTER RECOVERY PLANNING
Once the elementary physical precautions have been taken, the real disaster recovery planning can begin. In any network, one or more nodes are driven by some form of computer. In data networks, these computer nodes are the most important.
Good computer recovery planning consists of five critical elements:
1. An Alternate Processing Strategy
The Alternate Processing Strategy dictates how and where processing will continue after a disaster that destroys or denies use of the home computer center. The Alternate Processing Strategy can provide for the required backup computing capacity through:
- a hot-site
- a cold-site (or shell)
- crate and ship replacement
- guaranteed replacement
- vendor replacement
- a development center
- another computer within the organization
- a reciprocal agreement
The strategy (or combination of strategies) chosen is dictated by the speed with which the recovery must happen and the budget available to pay for the guaranteed availability of the facility prior to the disaster.
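The two constraints named above, recovery speed and standby budget, can be applied as a simple filter over the candidate strategies. The sketch below assumes each strategy has been given an estimated recovery time and an annual standby cost. Every figure shown is a placeholder invented for illustration, not a real estimate, and the names `Strategy` and `feasible` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    recovery_hours: float  # placeholder estimate, not a real figure
    annual_cost: float     # placeholder standby cost, not a real figure


def feasible(strategies, max_hours, budget):
    """Return strategies that meet both constraints, cheapest first."""
    ok = [
        s for s in strategies
        if s.recovery_hours <= max_hours and s.annual_cost <= budget
    ]
    return sorted(ok, key=lambda s: s.annual_cost)


# Placeholder figures only; substitute the organization's own estimates.
CANDIDATES = [
    Strategy("hot-site", 24, 100_000),
    Strategy("cold-site", 240, 30_000),
    Strategy("reciprocal agreement", 72, 5_000),
]

choices = feasible(CANDIDATES, max_hours=72, budget=120_000)
```

An empty result is itself useful information: it means either the recovery deadline or the budget has to change before a strategy can be chosen.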
2. A File Backup and Offsite Storage Program
Fundamental to the ability to recover processing after a disaster is the storage of all data, program, and operating system files in a secure location remote from the data center where they are created and used. It is vital that everything be offsite. A disaster may completely destroy the data center. In such a disaster, every piece of electronic data is critical.
3. A Formal Recovery Plan
In recovering from a disaster, hundreds of activities must be performed by the recovery teams. In large installations, these activities are performed by dozens of team members. Control must be precise and consistent among the various team members.
In small installations, these activities are performed by just a few team members. Control must be equally precise.
A formal computer recovery plan contains the activities, task assignments, resources, and information to accomplish the recovery without having to invent or innovate.
4. On-going Testing and Administration of the Plan
Once the plan is developed, it must be tested. Testing will uncover the inconsistencies normally found in initial plans. As the computer installation changes, so too must the plan change. A plan designed to recover an obsolete operating system is worse than useless.
5. A Communications Recovery Strategy
The Communications Recovery Strategy is determined exclusively by the nature of the network in place and the Alternate Processing Strategy. For computer recovery planning, a Communications Recovery Strategy cannot be designed without first knowing the Alternate Processing Strategy.
The Communications Recovery Strategy for a data center must provide for the return to service of two distinct groups of users.
The first group is the most obvious. In a disaster that disrupts service from a data center, users of the system who are remotely connected through the production network must be reconnected to the alternate site as soon as the critical needs of the organization demand. For example, if remote branches communicate with the central host processor over Dataroute lines and the users must have their service restored within 24 hours, then the Communications Recovery Strategy must provide for the reconnection of those users to the backup site within 24 hours.
The second group of users who must be reconnected to the backup site is less obvious. Users of the data center who are in the same building are locally connected. The Communications Recovery Strategy must provide for these users to be reconnected to the backup site as remote users. In fact, this is the more difficult task in designing a Communications Recovery Strategy. Connecting formerly local users as remote users most frequently requires different controllers and multiplexers in addition to the modems.
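One way to keep both groups in view is to record a required restoration time for each and compare it with the plan's estimate. The sketch below is an illustration under stated assumptions: the 24-hour requirement for remote branches comes from the example above, while the group labels, the requirement for local users, the estimated times, and the function name `late_groups` are hypothetical.

```python
def late_groups(required_hours, estimated_hours):
    """Return the user groups the plan cannot reconnect in time.

    A group with no estimate at all is reported as late, since the
    plan has not provided for it.
    """
    late = []
    for group, limit in required_hours.items():
        estimate = estimated_hours.get(group)
        if estimate is None or estimate > limit:
            late.append(group)
    return late


# Requirement for remote branches is taken from the 24-hour example;
# the other values are hypothetical.
required = {
    "remote branches": 24,
    "local (same-building) users": 24,
}
estimated = {
    "remote branches": 8,
    # No entry for local users: the plan overlooked them.
}

unmet = late_groups(required, estimated)
```

Here the check flags the locally connected users, which is exactly the group the text warns is easy to overlook.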
OPTIONS FOR RECOVERING REMOTE USERS
Remote users can be connected to their production data center using any of the following services. For each service, the options for reconnecting these users to a backup site are outlined below.
In a disaster, users of dial-up services over the public voice network need only dial the backup site instead of the production site. This means that the backup site must have the measured business lines, modems, and software necessary to receive the transmissions.
Note that procedures for users must be explicit and carefully thought out. Many users have no idea where they are dialing, particularly if the dialing procedure is built into the software that they use.
Backup of leased, dedicated services such as DATAROUTE and INFODAT has traditionally been by alternate dialed lines. To achieve this, modems capable of running at the necessary speeds are needed at the remote site and the backup site. This strategy can be combined with an incident recovery strategy that includes like modems at the host site so that failed lines can be dialled back into operation over the public voice network.
It can be cost prohibitive to buy and store modems only for an organization’s disaster recovery plans. Commercial hot-sites have capitalized on this aspect by acquiring their own modems which can then be resold on a term contract basis in conjunction with the cost of the hot-site service. In the event of a disaster, the commercial hot-site will immediately ship the necessary modems to the organization’s remote sites so that the communications recovery can proceed.
With today’s technology, speeds of up to 9600 b.p.s. are reliably obtained. In the recent Penn Mutual disaster supported by SunGard Recovery Services’ Philadelphia Megacenter, most communications links were recovered using SunGard’s own SunNet modems running at 9600 b.p.s. over the public voice network. Although dial-up modems are available running at 19,200 b.p.s., they have yet to be proven reliable enough for extended use in a disaster.
More advanced leased services in Canada are digital and can be readily switched at exchange offices by the communications carriers. T-1 services running through DCC devices are very easy to switch in a disaster. With proper planning and preparation, the switch will normally require only a phone call to the carrier’s center.
In fact, the less-than-T-1 bandwidth capabilities of MACH III and MEGASTREAM are particularly suited to disaster recovery. Circuit reconfiguration is very simple using the digital switching capabilities inherent in their delivery. Both services will introduce PC-based customer management features that will permit the switching of a production network to its alternate backup configuration in 15 minutes.
Packet Switched Lines
With DATAPAC and INFOSWITCH, communications recovery in a disaster is relatively simple. For dialled lines, it is necessary only to provide a second access point at the backup site to enable a redial by the user in a disaster. For dedicated packet switched lines, the options are a bit more varied. Shared virtual circuits can easily use the alternate access number in the group to get to the backup site. Again, the backup site must have an access point to the packet switched network.
Permanent virtual circuits may have a completely duplicated hot access point at the backup site. Conversely, an organization may choose to wait until the packet switched network is modified through its normal maintenance routine. DATAPAC, for example, is changed each weekend.
With the introduction of Very Small Aperture Terminals (VSATs), disaster recovery through satellite became a very real possibility. Even terrestrial networks can be fully backed up through alternate satellite networks. However, the costs of this technology typically outweigh its benefit if disaster recovery is the only planned use. If a network is completely satellite-based, the only additional recovery requirement is for a dish at the backup site.
An interesting alternative is emerging for larger network users. Some companies are moving to remote communications centers, whereby the intelligence for management of the network is located in a center removed from the host computer center. Either through channel extension technology or duplication of the front end, the network is concentrated at the remote communications center and then routed to the host computer center over high speed links. In a disaster, only the high speed link between the remote communications center and the host computer center need be rerouted. Moreover, the variety of links from the remote communications center to the various other remote locations is invisible at disaster time. The nature and number of the remote links do not matter. Only the high speed link(s) need be recovered.
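The advantage of the remote communications center can be seen in a small model of the topology. This is a sketch under assumed names: the node labels, the `links` table, and the `fail_over` function are hypothetical and stand in for whatever inventory the organization actually keeps.

```python
def fail_over(links, hub, primary, backup):
    """Return a copy of `links` with the hub's uplink moved to `backup`.

    `links` maps each node to the node it connects to. Only the entry
    for the hub is changed; the original table is left untouched.
    """
    if links.get(hub) != primary:
        raise ValueError(f"{hub} is not currently linked to {primary}")
    rerouted = dict(links)
    rerouted[hub] = backup
    return rerouted


# Hypothetical network: three branches concentrated at a remote
# communications center, which has one high speed link to the host.
links = {
    "branch-1": "comm-center",
    "branch-2": "comm-center",
    "branch-3": "comm-center",
    "comm-center": "host-center",
}

after = fail_over(links, "comm-center", "host-center", "backup-site")
changed = [node for node in links if links[node] != after[node]]
```

However many branches are added to the table, `changed` contains only the communications center, which mirrors the point made above: the number and nature of the remote links do not affect the recovery.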
OPTIONS FOR RECOVERING LOCAL USERS
There are no very easy answers that provide for the recovery of users who are locally attached. Some of the options are outlined below:
1. An organization can choose to run two cables from each terminal. One would lead from the terminal to the computer room for the local attachment. The other would lead to an opposite corner of the building where a controller or multiplexer with the appropriate modem would be reserved for dialing to the backup site. This is a standard option, but many buildings are not designed to accommodate even the one wire, let alone two.
2. An organization can choose to make all normally local users remote. Simple local loops could be used to connect the controllers or multiplexers on the various floors of the building to the data center. In the event of a disaster, the dial-up capable modems could be used to dial to the backup site. This option, of course, would require the modems and controllers to be located outside the data center.
3. If a commercial hot-site is being used for backup, it is probable that the contract could include the shipment of the necessary modems and controllers or multiplexers to create the dial-up linkage. This option depends on the speed of the commercial hot-site in shipping the modems and controllers.
4. For local area networks that use a gateway to host computer services, a simple dial out capability will permit a link to the backup site at the time of a disaster.
THE FUTURE OF COMMUNICATIONS RECOVERY
Over time, networks have become easier to design and implement. The same is true for communications recovery. Only a few years ago, the only viable recovery option was dial-up backup. Now the options vary with the service used. The options available are easier to design and implement. They are particularly easy to test.
As users of telecommunications services, you can expect improvements in the survivability and recoverability of your networks. The two carriers are showing signs of recognizing the requirement for recoverability. In any event, the job of a disaster recovery planner is becoming somewhat easier.
Michael G. W. Smith is Vice President of Corporate Business Systems, Inc., Toronto, Ontario, Canada.
This article was adapted from Vol. 3 No. 1 p. 6.