
Up and Running: How to Ensure Disaster Recovery
By Philip J. Rothstein
You can imagine the movie advertisements: A sea of flames engulfs telco switch... phones dead... even beepers bite the dust... it's...
"The Telco Switching Center Disaster." Somehow, it's difficult to believe that even an all-star cast could make it a box-office hit.
The fact is, the cause of most computer room disasters is far more mundane than the images of towering infernos and devastating
floods conjured up by the word disaster. Nonetheless, when a recent fire damaged a telephone company switch in Hinsdale, Illinois,
business at dozens of Illinois companies was severely disrupted. While such a fire may not have much dramatic potential, it could
have grave implications for those companies affected.
Unfortunately, most companies are ill-prepared to recover from the typical computer disaster, as mundane as its origins may be.
Indeed, despite the best of intentions, significant investment, and mass quantities of documentation, most disaster recovery plans are
likely to fail just when they are needed most. Despite positive test results, few plans succeed on their own merits. More often than
not, luck plays as large a role in successful disaster recovery as skill and effort.
Jack Bannan is the manager of information security for General Electric and the cofounder and president of the Delaware Valley
Disaster Recovery Information Exchange, the oldest and perhaps largest user group in this field. He points to a "residual situation...
where plans are written to satisfy auditors or outside accounting firms, and really don't do an effective job. The plans are just put on
a shelf." He admonishes: "Don't just give it lip service."
In the simplest terms, a disaster recovery plan ensures a business's survival in the face of a traumatic IS disruption. A good disaster
recovery plan, like a good insurance policy, will be most effective if all the risks and threats are carefully and realistically assessed.
Unfortunately for some businesses, this is not always the case.
In the most fundamental of terms, the components most often missing from such plans are commitment and integrity. Answering the
following questions should help you ascertain the viability of your plan in this regard.
At what level in the organization is the commitment to disaster recovery? Is there an explicit, documented, corporate mandate to
protect critical business functions?
In the corporate environment, for disaster recovery to be effective, commitment must come from the highest level and permeate
every area of the organization. If the disaster recovery mandate comes from the C.E.O., President, or Board of Directors, it stands a
much better shot at success than if it originates within IS, audit, or another line organization. According to Bannan,
"Very few board chairmen, presidents, or general managers would run a business without insurance. And yet [they] don't look at
disaster recovery planning in that same light... or even as a meaningful function."
Is the disaster recovery function adequately funded and staffed or is it constantly struggling to survive?
Many contingency planning/disaster recovery departments are in a constant battle for budget and staffing. In the face of more
glamorous new development projects, disaster recovery often takes a back seat, especially during lean times. While it is perfectly
reasonable to review the cost-effectiveness of the contingency planning function, the disaster recovery plan should not be justified
primarily on the basis of cost-effectiveness, unless it is done in a truly broad sense, just as someone would evaluate insurance
coverage. Justifying a disaster recovery plan within the context of insurance premiums, policy coverage, probability, and the scope
of loss may be particularly effective.
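To make the insurance analogy concrete, the sketch below treats the plan the way an underwriter might: expected annual loss is the sum, over threat scenarios, of annual probability × daily loss × outage days, computed once without the plan and once with it. The difference is the "coverage" the plan buys, to be weighed against its yearly cost much as one weighs a premium. The scenario names, probabilities, dollar figures, and outage durations are hypothetical placeholders, not data from any study; a real analysis would draw them from a business impact analysis.

```python
# A minimal, insurance-style sketch of justifying a disaster recovery plan.
# Every figure below is a hypothetical placeholder, not measured data.
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # assumed chance the event occurs in a given year
    daily_loss: float          # assumed business loss per day of outage, in dollars
    days_without_plan: float   # assumed outage length with no recovery plan
    days_with_plan: float      # assumed outage length with a tested plan


def expected_annual_loss(scenarios, with_plan):
    """Sum of probability x daily loss x outage days across all scenarios."""
    total = 0.0
    for s in scenarios:
        days = s.days_with_plan if with_plan else s.days_without_plan
        total += s.annual_probability * s.daily_loss * days
    return total


def plan_is_justified(scenarios, annual_plan_cost):
    """True if the expected loss the plan avoids exceeds what it costs each year."""
    avoided = (expected_annual_loss(scenarios, with_plan=False)
               - expected_annual_loss(scenarios, with_plan=True))
    return avoided > annual_plan_cost


scenarios = [
    Scenario("telco switch fire", 0.02, 250_000, 10, 1),
    Scenario("water under raised floor", 0.10, 250_000, 3, 0.5),
]
print(expected_annual_loss(scenarios, with_plan=False))
print(expected_annual_loss(scenarios, with_plan=True))
print(plan_is_justified(scenarios, annual_plan_cost=60_000))
```

The with/without comparison matters more than either total alone: it shows what the plan buys, not merely what a disaster might cost.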
An ongoing commitment of resources and dollars defines the difference between a functional disaster recovery plan and an
ineffectual one. The commitment clearly should include maintenance, testing, and auditing, which are likely to be overshadowed by
the major expenses of a hot-site agreement and offsite media storage.
Was the development and implementation of a disaster recovery plan preceded by, and based upon, a Business Impact Analysis?
There isn't a whole lot of protective value to a disaster recovery plan if it is based upon an incomplete picture of what is being
protected, and of what is likely to be a threat. A business impact analysis thoroughly and objectively examines all of a firm's risks
and obligations, identifying and prioritizing critical processes, functions and resources. All too often, the mere survivability of the
data center is the myopic focus of the plan. You have to be aware, however, of how all facets of the business interrelate and what
the role of IS is in relation to them. The business impact analysis process is likely to uncover areas or resources that may not have
been addressed by the disaster recovery plan.
Is Disaster Avoidance an integral aspect of the plan - that is, has there been a sincere effort to ensure that the integrity of the firm is
not unnecessarily compromised?
Very few disaster recovery plans focus directly on Disaster Avoidance, which can minimize the probability of activating the plan in
the first place. Disaster avoidance combines engineering, maintenance, reliability, safety, training, and testing. If effectively
implemented, the disaster avoidance plan will pay handsome dividends through the improved level of reliability and quality brought
to day-to-day business functions, in addition to the reduced exposure to major outages. Another bonus of an aggressive Disaster
Avoidance program is the enhanced ability to recover from a disaster - that is, the recovery process is likely to be a whole lot less
painful.
Are Disaster Recoverability and Disaster Avoidance integral to planning throughout the organization?
The least painful way to achieve a reasonable and appropriate level of recoverability, as well as a prudent, minimal level of risk, is to
include contingency planning in any new business or functional plans. Aside from obvious activities, such as the startup of a new
data center or turnover of a new production application, any substantial functional, technological, and business change warrants a
fresh examination of the exposure to disruption, as well as of the possibility of creating new sources of threat.
Are there adequate, impartial controls and reviews of the disaster recovery plan's effectiveness?
The internal or external audit role is crucial to the integrity of the plan. In addition, the use of impartial, external consultants to review
the technical, technological, business, or organizational aspects of the plan may detect weaknesses that are not obvious from within.
Is your disaster recovery plan preceded by a realistic assessment of your needs or has it evolved as a function of vendor offerings?
Many firms elect to use external hot-site vendors that provide access (for a fee) to fully configured backup data centers and even
office facilities. These firms provide a valuable service to many companies. Unfortunately, in all too many cases, the commitment to
a hot-site approach or vendor comes before a full awareness of the business contingency requirements.
It should be clear that a hot-site agreement is only a basic tactic for providing a backup; the focus should first be on what kind of
strategy to use for the disaster recovery plan. It may be that a physical second site is a more appropriate solution for your business.
Is the plan maintained, updated, and tested continually and effectively, with genuine commitment?
Creating a disaster recovery plan without a commitment to periodic testing and ongoing maintenance can actually be worse than
doing nothing at all. There is the tendency to assume that the plan is the company's salvation when disaster strikes, but a poorly
maintained or inadequately tested disaster recovery plan is certain to fail when the going gets tough. Even seemingly obvious aspects
of the plan, such as telephone contact information or configuration details, can quickly become outdated, impeding recovery efforts.
Without exercise, a disaster recovery plan, like the human body, is likely to become flabby and ineffectual.
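One low-cost way to keep contact lists and configuration details from going stale is to record when each entry was last verified and flag anything past a review interval. The sketch below assumes a 90-day interval and uses hypothetical entry names and dates; the right interval depends on how quickly your staff, vendors, and equipment change.

```python
# Hypothetical sketch: flag disaster recovery plan entries that have not
# been verified recently. Entry names, dates, and the 90-day interval are assumptions.
from datetime import date, timedelta


def stale_entries(last_verified, today, max_age_days=90):
    """Return, sorted by name, entries last verified more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, verified in last_verified.items() if verified < cutoff)


last_verified = {
    "on-call technician home phone": date(1999, 1, 5),
    "hot-site vendor declaration line": date(1999, 5, 20),
    "backup server configuration": date(1998, 11, 30),
}
print(stale_entries(last_verified, today=date(1999, 6, 1)))
```

Run routinely, for example alongside company phone directory maintenance, such a check costs almost nothing and turns "someone should update the plan" into a concrete list of entries to verify.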
Where in the organization does the responsibility for Disaster Recovery and Contingency Planning reside?
In the typical corporate setting, disaster recovery is headquartered in the IS organization. The risk to the company, however, is not
confined to IS. The bottom line is this: Survivability of the organization in the face of a catastrophe is the responsibility of every
single employee. The most effective contingency plans are based upon an organizational commitment to integrity and survivability.
This is often initiated by a clear, concise management mandate, which is incorporated into the job descriptions of all employees.
Does the Contingency Planning function have enough clout to rise above the politics and personalities?
Objectivity is critical to the success of a disaster recovery plan. Too often, the politics overshadow the pragmatic considerations of
disaster recovery. In one major Wall Street organization, a small, highly visible group with a potential financial exposure on the order
of $50,000 to $100,000 a day obtained a commitment to support processing recovery in a matter of seconds after a disruption.
Meanwhile, a bread-and-butter, back-office department with a financial risk considerably over $1 million for each day of an outage
was positioned to recover in a 36- to 48-hour period.
The corollary risk to politics is personality. Face it: in establishing business priorities for recovery, how many employees or
managers would come out and say, "I'm not very important"? You are dealing with human nature: the "me first" syndrome can overwhelm
what should otherwise be an orderly procedure. The effective contingency planner will work through the scenario where every process
is assumed to be the first priority.
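The Wall Street example suggests a simple, objective cross-check the planner can run before politics or personalities intervene: rank functions by estimated daily financial exposure, then flag any pair where a lower-exposure function has been promised faster recovery than a higher-exposure one. The figures below loosely follow that example (the upper end of $100,000 a day versus roughly $1 million a day; recovery in seconds versus 36 hours); the function names are hypothetical, and a real ranking would come from the business impact analysis rather than a single dollar figure.

```python
# Hypothetical sketch: detect "priority inversions" where recovery commitments
# do not follow financial exposure. Names and figures are illustrative.
from dataclasses import dataclass


@dataclass
class BusinessFunction:
    name: str
    daily_exposure: float            # estimated loss per day of outage, in dollars
    committed_recovery_hours: float  # recovery time the current plan promises


def rank_by_exposure(functions):
    """Highest daily exposure first: the order an objective plan would favor."""
    return sorted(functions, key=lambda f: f.daily_exposure, reverse=True)


def priority_inversions(functions):
    """Pairs (faster, slower) where the faster-recovered function has the smaller exposure."""
    return [
        (a.name, b.name)
        for a in functions
        for b in functions
        if a.daily_exposure < b.daily_exposure
        and a.committed_recovery_hours < b.committed_recovery_hours
    ]


functions = [
    BusinessFunction("visible front-office group", 100_000, 10 / 3600),  # about 10 seconds
    BusinessFunction("back-office processing", 1_000_000, 36),
]
print([f.name for f in rank_by_exposure(functions)])
print(priority_inversions(functions))
```

An empty inversion list does not prove the priorities are right, but a non-empty one is a concrete, impersonal question to put to management.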
Is the disaster recovery plan concise, directed, and effective as implemented?
The most effective disaster recovery plans are often the least impressive. One insurance company's contingency planner recently
pointed with pride to five 3-inch binders containing that company's disaster recovery plan. It is not impossible for a plan that big to
be effective, but it becomes exceedingly difficult to maintain a plan so large and complex.
Clearly, there are benefits of both effectiveness and cost in keeping the plan simple. One of the best ways to do this is by integrating
disaster recovery plan-related functions, responsibilities, and maintenance directly into the day-to-day business environment. For
example, maintenance of the emergency contact information for employees and vendors could be routinely handled as part of the
company phone directory maintenance. Restart/recovery and control information for production processing could be captured at
production turnover of new or modified systems. Management of offsite data backup could be largely automated.
Is the disaster recovery plan activation or declaration process and responsibility explicitly defined?
The best plans are worthless if not activated when calamity strikes. Many disasters do not involve obvious physical destruction.
Some may be essentially invisible, such as the corruption of critical data or a major computer failure. Experience has shown that the
tendency of many professionals, particularly technical and operational personnel in these kinds of situations, is to deny the extent of
a disaster initially: "We'll be back to normal in an hour... maybe another three hours," etc., until time is measured by the calendar,
not the clock.
Declaration of a disaster is a business decision, not a technical decision. Therefore, the individuals responsible for declaring the
disaster should be identified by name and function and the declaration process should be explicitly documented. Clearly, some
flexibility will be built into this process; the caveat is to ensure that this flexibility isn't fatal. While there is usually a significant, direct
cost - as well as risk - associated with declaring a disaster, odds are that denying the disaster will increase the costs and risk
exponentially.
Upon a disaster declaration, the corporate hierarchy is going to be shaken mightily. Unusual skills, methods, strategies, and
relationships will be needed. The traditional hierarchy simply will not work - a crisis management organizational structure must be
defined explicitly, and that new structure must be empowered through a mandate from the highest level.
Activation of the disaster recovery plan does not necessarily mean, in the case of a hot-site subscription, incurring large vendor
declaration fees. It may be nothing more than advising the vendor to stand by, and beginning the preliminary processes, such as
locating backup media and warning key vendors and staff. However, an understanding of the escalation process and the timing must
be clear to all parties.
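Writing the escalation timing down in advance helps keep "maybe another three hours" from stretching into days. The sketch below maps elapsed outage time to the next agreed step, ending at a point where the named declaration authority must either declare or record why not. The thresholds and the wording of each step are hypothetical assumptions for illustration; the article prescribes no specific times, and the declaration itself remains a business decision made by people, not by a script.

```python
# Hypothetical escalation ladder: elapsed outage time -> agreed action.
# Thresholds are illustrative assumptions, not recommended values.
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_hours: float  # elapsed outage time at which this step applies
    action: str


LADDER = (
    EscalationStep(0.0, "assess: notify the named declaration authority"),
    EscalationStep(1.0, "stand by: alert the hot-site vendor; locate backup media"),
    EscalationStep(4.0, "warn: alert key vendors and recovery staff"),
    EscalationStep(8.0, "decide: declare a disaster or record why not"),
)


def current_step(elapsed_hours, ladder=LADDER):
    """Return the latest step whose threshold has been reached.

    Assumes the ladder's first step starts at 0 hours.
    """
    if elapsed_hours < 0:
        raise ValueError("elapsed_hours must be non-negative")
    reached = [step for step in ladder if elapsed_hours >= step.after_hours]
    return max(reached, key=lambda step: step.after_hours)


for hours in (0.5, 2, 9):
    print(hours, "->", current_step(hours).action)
```

Because the steps short of declaration (standing the vendor by, locating media) are cheap, fixed thresholds let the organization start moving early without committing to declaration fees.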
Is the human element consciously and explicitly considered in the disaster recovery plan?
Human nature presents many conflicts in an actual disaster, the major implication being unpredictability. Explicitly allowing for the
uncertainty introduced by the human element is the best way to deal with this issue. Providing fallback options is another.
One company's recent experience after a physical disaster exemplifies the human element. One of the key technicians needed for the
initial recovery was contacted by phone. His wife took the call and assured the caller that the technician would be told immediately.
For whatever reason, the wife didn't mention the phone call. As a result, several hours were lost in recovering to a backup site.
A few companies are actually being advised to incorporate an industrial psychologist into their disaster recovery plan development
and testing process. The psychologist can be particularly valuable in attending to the human dimension of disaster recovery, namely,
stress. This can be the result of either physical injury that may have been suffered by others or of the extended, unreasonable
demands placed upon individuals during the recovery process. Fatigue, frustration, anger, denial, resentment, even guilt and
depression, are very real and potentially devastating aspects of recovering from a disaster.
Providing a nurturing and supportive environment for the recovery team can make or break the recovery process. Even the slightest
creature comforts should not be overlooked; individual needs, including support in handling personal or family issues, should be
addressed, preferably through a dedicated staff position.
Does the disaster recovery plan address the management of exceptional risk during the recovery period, as well as restoration of
operations following a disaster?
Most disaster recovery plans focus on the critical initial period of recovery of basic operations following a catastrophe. Once the
initial recovery period is over and the backup-mode operation is reasonably stable, the focus needs to return to restoration - that is,
going back to the way things were before the catastrophe.
The disaster recovery plan should explicitly address the considerations and steps in this reverse process. After all, the transition
back can be as fraught with risk as the precipitous cutover to backup operation had been. Even physical restoration of damaged
premises, documents, media, or equipment should be considered. A further risk during both the recovery and restoration phases is,
simply, too few warm bodies. Key people are stretched to the breaking point; nerves are frayed; more often than not, there simply
aren't enough hands to get everything done.
An explicit triage function should be staffed to address damage assessment and salvaging, in parallel with the teams supporting
recovery. This team will be particularly valuable in coordinating the rollback once the crisis has subsided.
Is your Contingency Planning function staffed by professionals?
Frequently, newly appointed contingency planners are former operations, tech support, or line personnel. In any other technological
or business role, training and experience make the difference between success and failure; contingency planning is no exception.
Support contingency planners with training and external consulting; provide opportunities for growth through a contingency
planning user group.
The bottom line is this: whether or not your business exposure is significant, and regardless of the existence or lack of an explicit
disaster recovery plan, it is better to deal with the issues of disaster recovery from a position of knowledge than from one of
assumptions. The "it can't happen here" mentality is not going to help you or your company when it happens!
Disaster Avoidance: Taking the Preventive Approach
An ounce of disaster prevention may be worth a pound of disaster recovery cure, but fewer than 50 sites nationwide have included
disaster avoidance concepts in their risk-management planning. In most organizations, disaster avoidance is such an obvious issue
that it is everyone's responsibility, and yet no one is in charge. Kenneth Brill, president of Computersite Engineering of Cambridge,
Massachusetts, and a pioneer in the emerging field of disaster avoidance, says, "Avoiding a disaster in the first place must be given
an even greater priority," than planning disaster recovery. "Physical disasters don't happen randomly. They are caused by
preexisting, identifiable, disaster-prone conditions... Every data center has physical vulnerabilities which are often unknown to senior
DP management," he warns.
For example, every year, water abruptly shuts down hundreds of sites, sometimes for days at a time. The problem rarely originates
within the computer room, but the computer room is affected because inadequate planning enables the water to get in. Broken pipes,
backed up drains, failed condensate pumps, roof leaks, ground or flood water, or discharging fire sprinklers can deliver hundreds of
gallons of water per minute. Where will it flow? If your computer room is at the low point on the floor, you know where! Lest you
suffer a similar soggy fate, give these questions some thought:
Does your computer room have dams, moats, pumps and alarms?
Do they work?
When was the last time someone checked?
If water were to leak from overhead, are the openings between floors for piping and electrical wiring sealed?
How would you know if water were under your raised floor before an electrical short circuit crashed processing?
How would you get the water out?
Where are the emergency water shutoff valves?
Do you have water pipes that run above the electrical equipment or panels, or above the computer itself?
Do you have tarpaulins to cover equipment?
According to Brill's research, over 75% of the sites declaring disasters could have avoided major losses had they had a disaster
avoidance program in place. Brill advocates a multidisciplined, proactive approach to the process of avoiding disaster, which
includes such diverse considerations as engineering and functional design, physical security, fire protection, preventive maintenance,
operational procedures, personnel policies, equipment selection, and so forth - in short, all of the factors that contribute to the
operational reliability and integrity of the data center, as well as to the business areas. He stresses the need for an annual physical
audit in addition to plan review, updating and maintenance.
Clearly, avoiding a corporate heart attack makes a lot more sense than the risk, pain, and expense of an attempt to recover after one
strikes.
Written by Philip Rothstein, President, Rothstein. Article reprinted with permission of DATAMATION. 3 Director Court, Suite 103 Woodbridge, Ont. L4L4S5
(416) 748-1191
This article adapted from Vol. 2 No. 4, p. 36.
Disaster Recovery World© 1999, and Disaster Recovery Journal©
1999, are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.