
A Failure to Communicate
By Michael G. W. Smith
The most prominent feature of the Information Age has been the marriage of computers and telecommunications. This is a positive
union that supports revolutionary innovations in information technology, office automation, and business practices.
Not all the effects have been positive, however. The sophisticated abilities to concentrate data and transmit it to points
anywhere on earth depend completely on telecommunications connectivity. Telecommunications is deliberately
engineered to be transparent to users. The various levels of a communications architecture are hidden from ordinary users. A
company's technicians deal only with the level presented to them. They do not need to know what lies beyond.
Even in the best of circumstances, planning the installation of a new network is a demanding and time-consuming task. Planning for
disasters is even more complicated.
A communications network tends to be taken for granted until an incident or disaster occurs that stops the network from doing its
job.
At this point, it is necessary to define some terms.
Incident: A disruption in service that lasts less than 24 hours.
Disaster: A disruption in service that lasts more than 24 hours.
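The two definitions amount to a single threshold test. A minimal sketch follows; the function name is hypothetical, and the handling of an outage of exactly 24 hours is an assumption, since the definitions above do not cover that boundary.

```python
from datetime import timedelta

# Threshold taken from the two definitions above.
DISASTER_THRESHOLD = timedelta(hours=24)


def classify_disruption(outage: timedelta) -> str:
    """Return "incident" or "disaster" for a service disruption of the given length."""
    if outage < timedelta(0):
        raise ValueError("outage duration cannot be negative")
    # Assumption: an outage of exactly 24 hours is treated as a disaster;
    # the definitions only address "less than" and "more than" 24 hours.
    return "incident" if outage < DISASTER_THRESHOLD else "disaster"
```

For example, `classify_disruption(timedelta(hours=3))` falls under incident planning, while a two-day outage falls under the disaster planning discussed in this article.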
This article is concerned only with planning for disasters. Although many of the activities associated with incident planning are
applicable to disaster planning, disasters have been chosen as the focal point because too many organizations see disaster planning
as an add-on to incident planning. We believe it is necessary to adopt a viewpoint that evaluates the effects of a disaster first.
Analyzing the total system that delivers communications to the organization in this fashion builds a different perspective of its
vulnerabilities and, therefore, promotes a more comprehensive approach to the network's survivability.
DISASTER PREVENTION
The first and most effective step in disaster recovery planning for your network is to prevent the disaster from ever happening. This
step is frequently overlooked in the planning process. The security of many of the network components is completely beyond your
control. The wires and cables of the lines, the towers, the dishes, and the exchange offices themselves are in the control of the
communications carriers.
However, many of the components are within your own premises or under the control of your landlord. These are the components
of your network that you should evaluate with respect to security. A simple checklist is presented to assist in the evaluation process:
* All telephone switch rooms and closets should be locked at all times. Keys to the closets should only be given to those truly
requiring access.
* All telephone rooms should be as well protected and monitored against threats as computer rooms. Protection should include
Halon fire suppression systems, sprinklers, fire and smoke detection, rate of temperature rise monitoring, water detection and
monitoring, and intrusion detection.
* Telephone switching equipment and local controllers should not both be in the computer room. This is an unnecessary
concentration of risks in one place.
* Modems, controllers, and multiplexers should be subjected to safeguards similar to telephone rooms. Preferably, they should not
be placed in open office areas.
* Software and configuration data from programmable telephone switching gear should be backed up regularly and stored offsite.
* Central switchboards should be as well protected from threats as the telephone rooms themselves. Loss of the switchboard can
be as disruptive as the loss of the PBX.
COMPUTER RECOVERY PLANNING
Once the elementary physical precautions have been taken, the real disaster recovery planning can begin. In any network, one or
more nodes are driven by some form of computer. In data networks, these computer nodes are the most important.
Good computer recovery planning consists of five critical elements:
1. An Alternate Processing Strategy
The Alternate Processing Strategy dictates how and where processing will continue after a disaster that destroys or denies use of the
home computer center. The Alternate Processing Strategy can provide for the required backup computing capacity through:
* a hot-site
* a cold-site (or shell)
* crate and ship replacement
* guaranteed replacement
* vendor replacement
* a development center
* another computer within the organization
* a reciprocal agreement
The strategy (or combination of strategies) chosen is dictated by the speed with which the recovery must happen and the budget
available to pay for the guaranteed availability of the facility prior to the disaster.
2. A File Backup and Offsite Storage Program
Fundamental to the ability to recover processing after a disaster is the storage of all data, program, and operating system files in a
secure location remote from the data center where they are created and used. It is vital that everything be offsite. A disaster may
completely destroy the data center. In such a disaster, every piece of electronic data is critical.
3. A Formal Recovery Plan
In recovering from a disaster, hundreds of activities must be performed by the recovery teams. In large installations, these activities
are performed by dozens of team members. Control must be precise and consistent among the various team members.
In small installations, these activities are performed by just a few team members. Control must be equally precise.
A formal computer recovery plan contains the activities, task assignments, resources, and information to accomplish the recovery
without having to invent or innovate.
4. On-going Testing and Administration of the Plan
Once the plan is developed, it must be tested. Testing will uncover the inconsistencies normally found in initial plans. As the
computer installation changes, so too must the plan change. A plan designed to recover an obsolete operating system is worse than
useless.
5. A Communications Recovery Strategy
The Communications Recovery Strategy is determined exclusively by the nature of the network in place and the Alternate
Processing Strategy. For computer recovery planning, a Communications Recovery Strategy cannot be designed without first
knowing the Alternate Processing Strategy.
The Communications Recovery Strategy for a data center must provide for the return to service of two distinct groups of users.
The first group is the most obvious. In a disaster that disrupts service from a data center, users of the system who are remotely
connected through the production network must be reconnected to the alternate site as soon as demanded by the critical needs of
the organization. For example, if remote branches communicate with the central host processor over Dataroute lines and the users
must have their service restored within 24 hours, then the Communications Recovery Strategy must provide for the reconnection of
those users to the backup site within 24 hours.
The second group of users who must be reconnected to the backup site is less obvious. Users of the data center who are in the
same building are locally connected. The Communications Recovery Strategy must provide for these users to be reconnected to the
backup site as remote users. In fact, this is the more difficult task in designing a Communications Recovery Strategy. Connecting
formerly local users as remote users most frequently requires different controllers and multiplexers in addition to the modems.
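The two user groups and their restoration deadlines can be captured in a small planning record. The following is only a sketch under stated assumptions: the class, field, and function names are hypothetical, the equipment lists are simplified from the description above, and the estimated reconnection times would come from the planner's own testing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UserGroup:
    name: str
    locally_attached: bool        # True if wired directly to the home data center
    restore_within_hours: float   # how quickly service must be restored


def reconnection_needs(group: UserGroup) -> List[str]:
    """List the equipment a group needs in order to reach the backup site."""
    # Assumption: every group needs modems to reach the backup site.
    needs = ["modems"]
    if group.locally_attached:
        # Formerly local users are reconnected as remote users, which most
        # often calls for different controllers and multiplexers as well.
        needs += ["controllers", "multiplexers"]
    return needs


def missed_deadlines(groups: List[UserGroup],
                     estimated_hours: Dict[str, float]) -> List[str]:
    """Return the names of groups whose estimated reconnection time exceeds their deadline."""
    return [g.name for g in groups
            if estimated_hours[g.name] > g.restore_within_hours]
```

A planner could run `missed_deadlines` after each test of the plan to see which groups the current strategy fails to reconnect in time.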
OPTIONS FOR RECOVERING REMOTE USERS
Remote users can be connected to their production data center using any of the following services. Options for recovery of these
users by connecting to a different backup site are outlined.
Dial-up Lines
In a disaster, users of dial-up services over the public voice network need only dial the backup site. This means that the
backup site must have the measured business lines, modems, and software necessary to receive the transmissions.
Note that procedures for users must be explicit and carefully thought out. Many users have no idea where they are dialing,
particularly if the dialing procedure is built into the software that they use.
Leased Lines
Backup of leased, dedicated services such as DATAROUTE and INFODAT has traditionally been by alternate dialed lines. To
achieve this, modems capable of running at the necessary speeds are needed at the remote site and the backup site. This strategy
can be combined with an incident recovery strategy that includes like modems at the host site so that failed lines can be dialled back
into operation over the public voice network.
It can be cost-prohibitive to buy and store modems solely for an organization's disaster recovery plans. Commercial hot-sites have
capitalized on this aspect by acquiring their own modems, which can then be resold on a term contract basis in conjunction with the
cost of the hot-site service. In the event of a disaster, the commercial hot-site will immediately ship the necessary modems to the
organization's remote sites so that the communications recovery can proceed.
With today's technology, speeds of up to 9600 b.p.s. are reliably obtained. In the recent Penn Mutual disaster supported by
SunGard Recovery Services' Philadelphia Megacenter, most communications links were recovered using SunGard's own SunNet
modems running at 9600 b.p.s. over the public voice network. Although dial-up modems are available running at 19,200 b.p.s., they
have yet to be proven reliable enough for extended use in a disaster.
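To put these line speeds in perspective, the time needed to move a given volume of data over a dial backup link is simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes 8 bits per byte with no allowance for start/stop bits, protocol overhead, or retransmissions, so real transfers will take longer.

```python
def transfer_seconds(payload_bytes: int, line_bps: int, bits_per_byte: int = 8) -> float:
    """Idealized time, in seconds, to send payload_bytes over a line running at line_bps."""
    if line_bps <= 0:
        raise ValueError("line speed must be positive")
    # Assumption: no framing or protocol overhead; raw bits divided by line speed.
    return payload_bytes * bits_per_byte / line_bps


# One million bytes at the two speeds mentioned above.
at_9600 = transfer_seconds(1_000_000, 9600)     # about 833 seconds, roughly 14 minutes
at_19200 = transfer_seconds(1_000_000, 19200)   # about 417 seconds, roughly 7 minutes
```

Doubling the line speed halves the idealized transfer time, which is why the reliability of the faster modems matters for extended use.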
More advanced leased services in Canada are digital and can be readily switched at exchange offices by the communications
carriers. T-1 services running through DCC devices are very easy to switch in a disaster. With proper planning and preparation, the
switch will normally require only a phone call to the carrier's center.
In fact, the less-than-T-1 bandwidth capabilities of MACH III and MEGASTREAM are particularly suited to disaster recovery.
Circuit reconfiguration is very simple using the digital switching capabilities inherent in their delivery. Both services will introduce
PC-based customer management features that will permit the switching of a production network to its alternate backup configuration in
15 minutes.
Packet Switched Lines
With DATAPAC and INFOSWITCH, communications recovery in a disaster is relatively simple. For dialled lines, it is necessary
only to provide a second access point at the backup site to enable a redial by the user in a disaster. For dedicated packet switched
lines, the options are a bit more varied. Shared virtual circuits can easily use the alternate access number in the group to get to the
backup site. Again, the backup site must have an access point to the packet switched network.
Permanent virtual circuits may have a completely duplicated hot access point at the backup site. Alternatively, an organization may
choose to wait until the packet switched network is modified through its normal maintenance routine. DATAPAC, for example, is
changed each weekend.
Satellite Communications
With the introduction of Very Small Aperture Terminals (VSATs), disaster recovery through satellite became a very real
possibility. Even terrestrial networks can be fully backed up through alternate satellite networks. However, the costs of this
technology typically outweigh its benefit if disaster recovery is the only planned use. If a network is completely satellite-based, the
only additional recovery requirement is for a dish at the backup site.
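The options described for each kind of service can be collected into a simple lookup that a planner might keep alongside the network inventory. This is only a sketch: the dictionary keys, the one-line summaries, and the function name are illustrative condensations of the text above, not terms used by any carrier.

```python
# Summaries condensed from the options described above (wording is illustrative).
RECOVERY_OPTIONS = {
    "dial-up": "users redial to lines and modems installed at the backup site",
    "leased": "dial backup over the public voice network, or carrier switching of digital circuits",
    "packet-switched": "second access point at the backup site",
    "satellite": "dish at the backup site",
}


def recovery_option(service: str) -> str:
    """Look up the recovery option recorded for a service type."""
    key = service.strip().lower()
    if key not in RECOVERY_OPTIONS:
        raise KeyError(f"no recovery option recorded for service {service!r}")
    return RECOVERY_OPTIONS[key]
```

Raising an error for an unrecorded service, rather than returning a default, makes gaps in the plan visible when the inventory is checked.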
An interesting alternative is emerging for larger network users. Some companies are moving to remote communications centers,
whereby the intelligence for management of the network is located in a center removed from the host computer center. Either
through channel extension technology or duplication of the front end, the network is concentrated at the remote communications
center and then routed to the host computer center over high speed links. In a disaster, only the high speed link between the remote
communications center and the host computer center need be rerouted. Moreover, the variety of links from the remote
communications center to various other remote locations is invisible at disaster time. The nature and number of the remote links
do not matter. Only the high speed link(s) need be recovered.
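The advantage can be seen by counting the links that must be rerouted after the host computer center is lost. The sketch below uses a hypothetical function name and assumes that, without a remote communications center, every remote link terminates at the host computer center and must be reconnected individually.

```python
def links_to_reroute(remote_links: int, high_speed_links: int,
                     uses_remote_comm_center: bool) -> int:
    """Count the links that must be rerouted when the host computer center is lost."""
    if remote_links < 0 or high_speed_links < 0:
        raise ValueError("link counts cannot be negative")
    if uses_remote_comm_center:
        # Remote links end at the communications center, which survives,
        # so only the high speed link(s) to the host need be recovered.
        return high_speed_links
    # Assumption: otherwise every remote link ends at the host site itself.
    return remote_links
```

With, say, 200 remote links concentrated onto 2 high speed links, the recovery effort drops from 200 reconnections to 2.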
OPTIONS FOR RECOVERING LOCAL USERS
There are no easy answers for the recovery of users who are locally attached. Some of the options are outlined
below:
1. An organization can choose to run two cables from each terminal. One would lead from the terminal to the computer room for the
local attachment. The other would lead to an opposite corner of the building where a controller or multiplexer with the appropriate
modem would be reserved for dialing to the backup site. This is a standard option. However, many buildings are not designed to
accommodate even the one wire, let alone two.
2. An organization can choose to make all normally local users remote. Simple local loops could be used to connect the controllers
or multiplexers on the various floors of the building to the data center. In the event of a disaster, the dial-up capable modems could
be used to dial to the backup site. This option, of course, would require the modems and controllers to be located outside the data
center.
3. If a commercial hot-site is being used for backup, it is probable that the contract could include the shipment of the necessary
modems and controllers or multiplexers to create the dial-up linkage. This option depends on the speed of the commercial hot-site
in shipping the modems and controllers.
4. For local area networks that use a gateway to host computer services, a simple dial-out capability will permit a link to the backup
site at the time of a disaster.
THE FUTURE OF COMMUNICATIONS RECOVERY
Over time, networks have become easier to design and implement. The same is true for communications recovery. Only a few years
ago, the only viable recovery option was dial-up backup. Now the options vary with the service used. The options available are
easier to design and implement. They are particularly easy to test.
As users of telecommunications services, you can expect improvements in the survivability and recoverability of your networks.
The two carriers are showing signs of recognizing the requirement for recoverability. In any event, the job of a disaster recovery
planner is becoming somewhat easier.
Michael G. W. Smith is Vice President of Corporate Business Systems, Inc., Toronto, Ontario, Canada.
This article is adapted from Vol. 3, No. 1, p. 6.
Disaster Recovery World © 1999 and Disaster Recovery Journal ©
1999 are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.