Up and Running: How to Ensure Disaster Recovery
By Philip J. Rothstein
You can imagine the movie advertisements: A sea of flames engulfs telco switch... phones dead... even beepers bite the dust... it's...
The Telco Switching Center Disaster. Somehow, it's difficult to believe that even an all-star cast could make it a box-office hit.
The fact is, the cause of most computer room disasters is far more mundane than the images of towering infernos and devastating floods conjured up by the word "disaster." Nonetheless, when a recent fire damaged a telephone company switch in Hinsdale, Illinois, business at dozens of Illinois companies was severely disrupted. While such a fire may not have much dramatic potential, it could have grave implications for those companies affected.
Unfortunately, most companies are ill-prepared to recover from the typical computer disaster, as mundane as its origins may be. Indeed, despite the best of intentions, significant investment, and mass quantities of documentation, most disaster recovery plans are likely to fail just when they are needed most. Despite positive test results, few plans succeed on their own merits. More often than not, luck plays as large a role in successful disaster recovery as skill and effort.
Jack Bannan is the manager of information security for General Electric and the cofounder and president of the Delaware Valley Disaster Recovery Information Exchange, the oldest and perhaps largest user group in this field. He points to a "residual situation... where plans are written to satisfy auditors or outside accounting firms, and really don't do an effective job. The plans are just put on a shelf." He admonishes: "Don't just give it lip service."
In the simplest terms, a disaster recovery plan ensures a business's survival in the face of a traumatic IS disruption. A good disaster recovery plan, like a good insurance policy, will be most effective if all the risks and threats are carefully and realistically assessed. Unfortunately for some businesses, this is not always the case.
In the most fundamental of terms, the components most often missing from such plans are commitment and integrity. Answering the following questions should help you ascertain the viability of your plan in this regard.
At what level in the organization is the commitment to disaster recovery? Is there an explicit, documented, corporate mandate to protect critical business functions?
In the corporate environment, for disaster recovery to be effective, commitment must come from the highest level and permeate every area of the organization. If the disaster recovery mandate comes from the C.E.O., President, or Board of Directors, it stands a much better shot at success than if it originates within IS, audit or another line organization. According to Bannan,
"Very few board chairmen, presidents, or general managers would run a business without insurance. And yet [they] don't look at disaster recovery planning in that same light... or even as a meaningful function."
Is the disaster recovery function adequately funded and staffed or is it constantly struggling to survive?
Many contingency planning/disaster recovery departments are in a constant battle for budget and staffing. In the face of more glamorous new development projects, disaster recovery often takes a back seat, especially during lean times. While it is perfectly reasonable to review the cost-effectiveness of the contingency planning function, the disaster recovery plan should not be justified primarily on the basis of cost-effectiveness, unless it is done in a truly broad sense, just as someone would evaluate insurance coverage. Justifying a disaster recovery plan within the context of insurance premiums, policy coverage, probability, and the scope of loss may be particularly effective.
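The insurance-style justification described above can be sketched as a simple expected-loss comparison. This is only an illustration in Python: the scenario, the probabilities, the outage durations, and the plan cost are hypothetical figures chosen for the example, not data from any company, and a real analysis would weigh many scenarios and less tangible losses.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """One hypothetical disruption; every figure here is an assumption."""
    name: str
    annual_probability: float        # estimated chance of the event in a year
    outage_days_without_plan: float  # expected downtime with no working plan
    outage_days_with_plan: float     # expected downtime with a tested plan
    loss_per_day: float              # business loss per day of outage


def expected_annual_loss(probability: float, outage_days: float,
                         loss_per_day: float) -> float:
    """Probability-weighted loss, the way an insurer would price a risk."""
    return probability * outage_days * loss_per_day


def annual_risk_reduction(s: OutageScenario) -> float:
    """Expected yearly loss avoided by having the plan in place."""
    without_plan = expected_annual_loss(
        s.annual_probability, s.outage_days_without_plan, s.loss_per_day)
    with_plan = expected_annual_loss(
        s.annual_probability, s.outage_days_with_plan, s.loss_per_day)
    return without_plan - with_plan


# Hypothetical example: a 2% yearly chance of a fire, 10 days of outage
# without a plan versus 2 days with one, at $1 million per day.
scenario = OutageScenario("data center fire", 0.02, 10, 2, 1_000_000)
annual_plan_cost = 120_000  # hypothetical yearly cost of the plan

reduction = annual_risk_reduction(scenario)
print(f"Expected annual loss avoided: ${reduction:,.0f}")
print(f"Annual plan cost:             ${annual_plan_cost:,.0f}")
```

Like an insurance premium, the plan cost is weighed against a probability-weighted loss rather than against a guaranteed return.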
An ongoing commitment of resources and dollars defines the difference between a functional disaster recovery plan and an ineffectual one. The commitment clearly should include maintenance, testing, and auditing, which are likely to be overshadowed by the major expenses of a hot-site agreement and offsite media storage.
Was the development and implementation of a disaster recovery plan preceded by and based upon a Business Impact Analysis?
There isn't a whole lot of protective value to a disaster recovery plan if it is based upon an incomplete picture of what is being protected, and of what is likely to be a threat. A business impact analysis thoroughly and objectively examines all of a firm's risks and obligations, identifying and prioritizing critical processes, functions and resources. All too often, the mere survivability of the data center is the myopic focus of the plan. You have to be aware, however, of how all facets of the business interrelate and what the role of IS is in relation to them. The business impact analysis process is likely to uncover areas or resources that may not have been addressed by the disaster recovery plan.
Is Disaster Avoidance an integral aspect of the plan - that is, has there been a sincere effort to ensure that the integrity of the firm is not unnecessarily compromised?
Very few disaster recovery plans focus directly on Disaster Avoidance, which can minimize the probability of activating the plan in the first place. Disaster avoidance combines engineering, maintenance, reliability, safety, training, and testing. If effectively implemented, the disaster avoidance plan will pay handsome dividends through the improved level of reliability and quality brought to day-to-day business functions, in addition to the reduced exposure to major outages. Another bonus of an aggressive Disaster Avoidance program is the enhanced ability to recover from a disaster - that is, the recovery process is likely to be a whole lot less painful.
Are Disaster Recoverability and Disaster Avoidance integral to planning throughout the organization?
The least painful way to achieve a reasonable and appropriate level of recoverability, as well as a prudent, minimal level of risk, is to include contingency planning in any new business or functional plans. Aside from obvious activities, such as the startup of a new data center or turnover of a new production application, any substantial functional, technological, and business change warrants a fresh examination of the exposure to disruption, as well as of the possibility of creating new sources of threat.
Are there adequate, impartial controls and reviews of the disaster recovery plan's effectiveness?
The internal or external audit role is crucial to the integrity of the plan. In addition, the use of impartial, external consultants to review the technical, technological, business, or organizational aspects of the plan may detect weaknesses that are not obvious from within.
Is your disaster recovery plan preceded by a realistic assessment of your needs or has it evolved as a function of vendor offerings?
Many firms elect to use external hot-site vendors that provide access (for a fee) to fully configured backup data centers and even office facilities. These firms provide a valuable service to many companies. Unfortunately, in all too many cases, the commitment to a hot-site approach or vendor comes before a full awareness of the business contingency requirements.
It should be clear that a hot-site agreement is only a basic tactic for providing a backup; the focus should first be on what kind of strategy to use for the disaster recovery plan. It may be that a physical second site is a more appropriate solution for your business.
Is the plan maintained, updated, and tested continually, effectively and committedly?
Creating a disaster recovery plan without a commitment to periodic testing and ongoing maintenance can actually be worse than doing nothing at all. There is the tendency to assume that the plan is the company's salvation when disaster strikes, but a poorly maintained or inadequately tested disaster recovery plan is certain to fail when the going gets tough. Even seemingly obvious aspects of the plan, such as telephone contact information or configuration details, can quickly become outdated, impeding recovery efforts. Without exercise, a disaster recovery plan, like the human body, is likely to become flabby and ineffectual.
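One way to keep contact and configuration details from going stale is a routine check of when each item was last verified. The Python sketch below is a minimal illustration: the item names, the dates, and the 90-day threshold are all hypothetical assumptions, not a recommended standard.

```python
from datetime import date, timedelta


def stale_entries(last_verified: dict[str, date], today: date,
                  max_age_days: int = 90) -> list[str]:
    """Return the names of plan items not verified within max_age_days.

    The 90-day default is an arbitrary assumption for illustration.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, verified in last_verified.items()
                  if verified < cutoff)


# Hypothetical plan items and the dates they were last confirmed.
plan_items = {
    "hot-site vendor hotline": date(2024, 5, 10),
    "on-call technician phone": date(2023, 11, 2),
    "router configuration": date(2024, 1, 15),
}

for item in stale_entries(plan_items, today=date(2024, 6, 1)):
    print(f"Needs re-verification: {item}")
```

A check like this could run alongside routine directory maintenance, so that stale entries surface before a test or an actual disaster does the job.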
Where in the organization does the responsibility for Disaster Recovery and Contingency Planning reside?
In the typical corporate setting, disaster recovery is headquartered in the IS organization. The risk to the company, however, is not confined to IS. The bottom line is this: Survivability of the organization in the face of a catastrophe is the responsibility of every single employee. The most effective contingency plans are based upon an organizational commitment to integrity and survivability. This is often initiated by a clear, concise management mandate, which is incorporated into the job descriptions of all employees.
Does the Contingency Planning function have enough clout to rise above the politics and personalities?
Objectivity is critical to the success of a disaster recovery plan. Too often, the politics overshadow the pragmatic considerations of disaster recovery. In one major Wall Street organization, a small, highly visible group with a potential financial exposure on the order of $50,000 to $100,000 a day, obtained a commitment to support processing recovery in a matter of seconds after a disruption. Meanwhile, a bread-and-butter, back-office department with a financial risk considerably over $1 million for each day of an outage was positioned to recover in a 36- to 48-hour period.
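The mismatch in this example can be made visible with a simple comparison of financial exposure against planned recovery time. The Python sketch below is an illustration only: the two entries loosely follow the figures quoted above (using the upper bound of $100,000 a day and a lower bound of $1 million a day), while the one-second recovery time and all names are assumptions.

```python
from typing import NamedTuple


class BusinessFunction(NamedTuple):
    name: str
    loss_per_day: float     # estimated financial exposure per day of outage
    recovery_hours: float   # planned time to recover after a disruption


def misordered_pairs(functions: list[BusinessFunction]) -> list[tuple[str, str]]:
    """Find (higher_risk, lower_risk) pairs where the function with the
    greater daily exposure is nonetheless scheduled to recover later."""
    pairs = []
    for a in functions:
        for b in functions:
            if (a.loss_per_day > b.loss_per_day
                    and a.recovery_hours > b.recovery_hours):
                pairs.append((a.name, b.name))
    return pairs


# Illustrative figures based on the example in the text.
functions = [
    BusinessFunction("high-visibility group", 100_000, 1 / 3600),
    BusinessFunction("back-office department", 1_000_000, 48),
]

for higher, lower in misordered_pairs(functions):
    print(f"'{higher}' carries more risk than '{lower}' but recovers later")
```

An objective listing of this kind gives the contingency planner something other than politics to point to when recovery priorities are negotiated.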
The corollary risk to politics is personality. Face it, in establishing business priorities for recovery, how many employees or managers would come out and say, "I'm not very important"? You are dealing with human nature: the "me first" syndrome can overwhelm what should otherwise be an orderly procedure. The effective contingency planner will work through the scenario where every process is assumed to be the first priority.
Is the disaster recovery plan concise, directed, and effective as implemented?
The most effective disaster recovery plans are often the least impressive. One insurance company's contingency planner recently pointed with pride to five 3-inch binders containing that company's disaster recovery plan. It is not impossible for a plan that big to be effective, but it becomes exceedingly difficult to maintain a plan so large and complex.
Clearly, there are benefits of both effectiveness and cost in keeping the plan simple. One of the best ways to do this is by integrating disaster recovery plan-related functions, responsibilities, and maintenance directly into the day-to-day business environment. For example, maintenance of the emergency contact information for employees and vendors could be routinely handled as part of the company phone directory maintenance. Restart/recovery and control information for production processing could be captured at production turnover of new or modified systems. Management of offsite data backup could be largely automated.
Is the disaster recovery plan activation or declaration process and responsibility explicitly defined?
The best plans are worthless if not activated when calamity strikes. Many disasters do not involve obvious physical destruction. Some may be essentially invisible, such as the corruption of critical data or a major computer failure. Experience has shown that the tendency of many professionals, particularly technical and operational personnel in these kinds of situations, is to deny the extent of a disaster initially: "We'll be back to normal in an hour... maybe another three hours," etc., until time is measured by the calendar, not the clock.
Declaration of a disaster is a business decision, not a technical decision. Therefore, the individuals responsible for declaring the disaster should be identified by name and function and the declaration process should be explicitly documented. Clearly, some flexibility will be built into this process; the caveat is to ensure that this flexibility isn't fatal. While there is usually a significant, direct cost - as well as risk - associated with declaring a disaster, odds are that denying the disaster will increase the costs and risk exponentially.
Upon a disaster declaration, the corporate hierarchy is going to be shaken mightily. Unusual skills, methods, strategies, and relationships will be needed. The traditional hierarchy simply will not work - a crisis management organizational structure must be defined explicitly, and that new structure must be empowered through a mandate from the highest level.
Activation of the disaster recovery plan does not necessarily mean, in the case of a hot-site subscription, incurring large vendor declaration fees. It may be nothing more than advising the vendor to stand by, and beginning the preliminary processes, such as locating backup media and warning key vendors and staff. However, an understanding of the escalation process and the timing must be clear to all parties.
Is the human element consciously and explicitly considered in the disaster recovery plan?
Human nature presents many conflicts in an actual disaster, the major implication being unpredictability. Explicitly allowing for the uncertainty introduced by the human element is the best way to deal with this issue. Providing fallback options is another.
One company's recent experience after a physical disaster exemplifies the human element. One of the key technicians needed for the initial recovery was contacted by phone. His wife took the call and assured the caller that the technician would be told immediately. For whatever reason, the wife didn't mention the phone call. As a result, several hours were lost in recovering to a backup site.
A few companies are actually being advised to incorporate an industrial psychologist into their disaster recovery plan development and testing process. The psychologist can be particularly valuable in attending to the human dimension of disaster recovery, namely, stress. This can be the result of either physical injury that may have been suffered by others or of the extended, unreasonable demands placed upon individuals during the recovery process. Fatigue, frustration, anger, denial, resentment, even guilt and depression, are very real and potentially devastating aspects of recovering from a disaster.
Providing a nurturing and supportive environment for the recovery team can make or break the recovery process. Even the slightest creature comforts should not be overlooked; individual needs, including support in handling personal or family issues, should be addressed, preferably through a dedicated staff position.
Does the disaster recovery plan address the management of exceptional risk during the recovery period, as well as restoration of operations following a disaster?
Most disaster recovery plans focus on the critical initial period of recovery of basic operations following a catastrophe. Once the initial recovery period is over and the backup-mode operation is reasonably stable, the focus needs to return to restoration - that is, going back to the way things were before the catastrophe.
The disaster recovery plan should explicitly address the considerations and steps in this reverse process. After all, the transition back can be as fraught with risk as the precipitous cutover to backup operation had been. Even physical restoration of damaged premises, documents, media, or equipment should be considered. A further risk during both the recovery and restoration phases is, simply, too few warm bodies. Key people are stretched to the breaking point; nerves are frayed; more often than not, there simply aren't enough hands to get everything done.
An explicit triage function should be staffed to address damage assessment and salvaging, in parallel to the teams supporting recovery. This team will be particularly valuable in coordinating the rollback once the crisis has subsided.
Is your Contingency Planning function staffed by professionals?
Frequently, newly appointed contingency planners are former operations, tech support, or line personnel. In any other technological or business role, training and experience make the difference between success and failure; contingency planning is no exception. Support contingency planners with training and external consulting; provide opportunities for growth through a contingency planning user group.
The bottom line is this: whether or not your business exposure is significant, and regardless of the existence or lack of an explicit disaster recovery plan, it is better to deal with the issues of disaster recovery from a position of knowledge than from one of assumptions. The "it can't happen here" mentality is not going to help you or your company when it happens!
Disaster Avoidance: Taking the Preventive Approach
An ounce of disaster prevention may be worth a pound of disaster recovery cure, but fewer than 50 sites nationwide have included disaster avoidance concepts in their risk-management planning. In most organizations, disaster avoidance is such an obvious issue that it is everyone's responsibility, and yet no one is in charge. Kenneth Brill, president of Computersite Engineering of Cambridge, Massachusetts, and a pioneer in the emerging field of disaster avoidance, says, "Avoiding a disaster in the first place must be given an even greater priority," than planning disaster recovery. "Physical disasters don't happen randomly. They are caused by preexisting, identifiable, disaster-prone conditions... Every data center has physical vulnerabilities which are often unknown to senior DP management," he warns.
For example, every year, water abruptly shuts down hundreds of sites, sometimes for days at a time. The problem rarely originates within the computer room, but the computer room is affected because inadequate planning enables the water to get in. Broken pipes, backed up drains, failed condensate pumps, roof leaks, ground or flood water, or discharging fire sprinklers can deliver hundreds of gallons of water per minute. Where will it flow? If your computer room is at the low point on the floor, you know where! Lest you suffer a similar soggy fate, give these questions some thought:
Does your computer room have dams, moats, pumps and alarms?
Do they work?
When was the last time someone checked?
If water were to leak from overhead, are the openings between floors for piping and electrical wiring sealed?
How would you know if water were under your raised floor before an electrical short circuit crashed processing?
How would you get the water out?
Where are the emergency water shutoff valves?
Do you have water pipes that run above the electrical equipment or panels, or above the computer itself?
Do you have tarpaulins to cover equipment?
According to Brill's research, over 75% of the sites declaring disasters could have avoided major losses had they had a disaster avoidance program in place. Brill advocates a multidisciplined, proactive approach to the process of avoiding disaster, which includes such diverse considerations as engineering and functional design, physical security, fire protection, preventive maintenance, operational procedures, personnel policies, equipment selection, and so forth - in short, all of the factors that contribute to the operational reliability and integrity of the data center, as well as to the business areas. He stresses the need for an annual physical audit in addition to plan review, updating and maintenance.
Clearly, avoiding a corporate heart attack makes a lot more sense than the risk, pain, and expense of an attempt to recover after one strikes.
Written by Philip Rothstein, President, Rothstein. Article reprinted with permission of DATAMATION. 3 Director Court, Suite 103 Woodbridge, Ont. L4L4S5
This article adapted from Vol. 2 No. 4, p. 36.
Disaster Recovery World© 1999, and Disaster Recovery Journal©
1999, are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.