THE CORPORATE CASTE STRUCTURE
The development, testing, and maintenance of disaster recovery plans are frequently treated as internal, and largely transparent, functions of the I.S. department. The people in the organization who would be most affected by an actual disaster may not even be aware that recovery plans exist.
The people who would be most affected by a disaster are those who depend upon computer technology to conduct the day-to-day business of the organization. In computer jargon, these people are labelled “Users,” a somewhat degrading term implying some form of helpless addiction. The term is not inappropriate, however, since any Users deprived of their computer systems can experience withdrawal symptoms of the worst kind.
Within most large organizations, Users occupy the bottom level of a kind of corporate caste structure. Despite the fact that Users are the people who best understand how the organization functions, they are often seen as mere appendages to the organization’s computer systems. For this reason, they are ranked below the Systems People in the caste structure.
Systems People are those individuals who design, implement, operate, and support the organization’s computer technology. They usually hold themselves in very high esteem, and look down upon the poor Users because of their lack of technological sophistication. Unfortunately, while Systems People may know all there is to know about the technology, they often don’t know very much about the business. For this reason, they are often viewed with some suspicion by the people at the top of the caste structure - the Executives.
The Executives are those individuals in senior management positions who wield the overall decision-making authority in the organization. They look down upon both Users and Systems People as necessary evils; however, they are usually forced to form an uneasy alliance with the Systems People. The purpose of this alliance, theoretically at least, is to ensure that computer technology is exploited to the organization’s strategic advantage. Sometimes, both the Executives and Systems People lose sight of the fact that this technology must serve the Users, not vice versa.
THE TRADITIONAL APPROACH TO DRP
In most organizations, the impetus for disaster planning comes from the top down, with the Executives mandating the development of recovery plans. This is usually not their own idea, however. Credit generally goes to a group of individuals that sit somewhat outside the corporate caste structure: the Auditors. These organizational pariahs are continual critics, always harping on about such things as the confidentiality, integrity, and availability of computer systems.
The persistence of the Auditors is the reason many organizations have a disaster recovery plan. Their constant nagging, and talk of gloom and doom, eventually forces the Executives to do something to get the Auditors off their backs. What the Executives do, of course, is tell the Systems People to develop a recovery plan. This is usually followed by a prolonged period of negotiations between the Systems People and the Executives over how much to spend on such a plan. Eventually, a budget is set and the Systems People go off to do their thing.
This top down approach would be fine if the Systems People then went off to work with the Users on plan development. This rarely happens, however. Most Systems departments wouldn’t dream of involving the Users in something as technically complex as recovery of computer hardware, communications networks, operating software, application systems, etc. In any event, since the Systems department is now on the hook to deliver a recovery plan as soon as possible, the last thing they want is the aggravation of dealing with Users who would do nothing except slow the process down.
This, unfortunately, is a very short-sighted approach. While the absence of User involvement may well hasten the delivery of a recovery plan, the end product may not be worth the paper it is written on.
IMPLICATIONS FOR RECOVERY
Why is User involvement in recovery planning critical? Well, suppose, heaven forbid, that the Systems People actually have to invoke the recovery plan. If this is the first time the Users discover that such a thing exists, it may come as quite a shock.
To illustrate, let’s look at four hypothetical disaster scenarios: the transparent disaster, the almost transparent disaster, the visible disaster, and the highly visible disaster.
1. The Transparent Disaster--Let’s suppose it’s Friday evening. The Users have all gone home for a long holiday weekend. The I.S. department has shut down all the on-line systems, run all the overnight batch jobs, and taken full backups of all the software and data. Moments after shipping the backups off-site, disaster strikes the data center, putting it out of commission.
The recovery teams are notified immediately, and spring into action. Working around the clock, the teams recover all the applications at the alternate processing facility and reroute the data communications networks to the hot site. When the Users come back to work on Tuesday morning, refreshed from their long weekend, they sit down at their terminals and start working as if nothing had happened.
In this scenario, the fact that the Users know nothing about the I.S. department’s disaster recovery plan is immaterial, since the entire exercise was transparent to them. Unfortunately, this scenario is a pipe dream. Disasters are not this neat. Building a recovery plan on the assumption that a disaster will strike at the best possible time is suicide.
2. The Almost Transparent Disaster--Let’s look at what could happen if we change a few parameters of the disaster. First of all, let’s assume that everything is the same as in the first scenario, with one exception: disaster strikes before the latest backups are shipped off-site. The recovery teams still spring into action, and recover all the applications, but using previous backups which are a day or a week out of date, depending on the backup cycle.
Now when the Users return to work Tuesday morning, it is soon apparent that there is something not quite right with their applications. The phones start ringing in the I.S. department, confusion erupts in the User community, processing errors begin to be made, and the nice, neat recovery effort starts falling apart. Without a clear understanding of what has happened, or what they are supposed to do about it, the Users make erratic and unpredictable attempts to fix up their systems. After a day or two of these uncoordinated activities, business operations are in a shambles: transactions have been lost or duplicated, files are out of whack, the systems are spewing out error messages, and normally routine procedures are becoming chaotic. The original disaster now seems trivial in comparison to the aftermath.
3. The Visible Disaster--Let’s look at another scenario. It’s the middle of the week. The Users are hammering away at their computer terminals, when they suddenly go dead (the terminals that is, not the Users). After banging on their terminals for a while, the Users start calling the Help Desk, only to get a recorded message that “service has been temporarily disrupted”. For the next hour or so, the Users sit around cursing the Systems department, until word starts coming down the grapevine that some serious calamity has struck the data center. The balance of the day is spent in confusion and frantic phone calls to the Director of MIS. Eventually, the official position of the MIS department gets out: “We have activated our disaster recovery plan. We’ll get back to you in a couple of days.”
In the MIS department, the recovery teams have swung into action and are going through their paces with carefully rehearsed precision. After 48 hours of around the clock activity, the systems are brought up at the alternate facility, and the exhausted recovery teams go home. Meanwhile, back at the ranch, it has been a totally different story. The Users had no carefully rehearsed plans to fall back on. For the most part, they had no plans at all for losing their systems for this length of time. Some areas simply sat around idly waiting for the systems to come back up; others improvised “workarounds” on the fly; still others tried to re-activate long dormant manual procedures. What no one did was figure out what they would do when the systems did come back up.
The successful recovery effort performed by the Systems People did not, unfortunately, result in a return to “business as usual.” The systems came back up alright, late in the week, but no one had explained to the Users that they would now have to: a) reprocess the transactions lost during the time period between the last backup and the point of failure; b) process the backlog of transactions that had built up during the recovery period; and c) reconcile the systems with any manual processing, or “workarounds,” that had been performed while the systems were down. Since the recovery teams had all gone home, there was no one around to help them sort out the mess. In despair, the Users simply went home for the weekend, hoping that Monday would never come. Unfortunately, it always does.
4. The Highly Visible Disaster--There is one more scenario which needs to be presented. This is the one that people responsible for computer recovery plans rarely address, or are rarely allowed to address: the disaster that directly affects the User’s working environment, as well as the data center. In this scenario, the data center and the Users are all in the same building. When the disaster strikes, the Users not only lose their systems, they lose access to their offices, their terminals, their PCs, their phones, their faxes, their files, their work-in-progress - everything! Sure, the computer recovery plan still works, but when the recovery teams have finished recovering the systems, there’s no one there to use them! All the Users are at home, or wandering aimlessly around the streets.
This type of disaster is obviously highly visible to the User community. They may actually have watched their offices burn to the ground on the evening news. You can rest assured that their first reaction wasn’t “Oh my God! We’ve lost our computer systems.” No doubt their only thoughts were about when and where they would return to work, if indeed they ever would return to work. Since the Users had never been involved in any aspects of disaster planning, the thought that something like this could happen to them had probably never crossed their minds. Certainly no one had given any thought to how the Users’ office facilities could be replaced in the event of a disaster.
The Systems People are no help in this scenario. They are too concerned with their precious computer systems to worry about mundane things like arranging for alternate office space and furnishings, replacing phones and office equipment, recovering lost documents and work-in-progress, installing new terminals and PCs, etc. Presumably, they felt that it wasn’t their job to worry about these aspects of disaster planning. But if it wasn’t their job, whose was it? The purpose of disaster planning is not merely to recover computer systems, but to ensure that the organization can survive a disaster. Without plans that address every eventuality, including this “worst case” scenario, this assurance cannot be given.
PART II OF III
In Part I of this article, we looked at four hypothetical disaster scenarios. In three of these scenarios, something was missing from the Recovery Plan. In each case, the plan only addressed what the Systems People would be doing after a disaster, not what the Users would, or should, be doing. Since User involvement was the critical missing element in the Recovery Planning exercise, this should not be a great surprise to anyone. To identify the specific components of the Recovery Plan that were overlooked, and to highlight the criticality of User involvement, each of the four scenarios will be re-examined in detail.
In the first scenario, the disaster occurred on a Friday evening, after all the Users had gone home for a long weekend, and after full backups had been taken off site. In this scenario, nothing is missing from the Recovery Plan. The off-site backups were taken after the Users had finished their normal processing for the week, the disaster occurred while no processing was taking place, and the systems were restored before the Users needed to resume normal processing. The timeline for the recovery looks pretty simple. (See Exhibit 1).
In the second scenario, the disaster also occurred on a Friday evening before a long weekend, but before backups had been taken off site. In this scenario, and in most real-life disasters, a certain amount of User processing is lost when the systems are recovered. Unless the organization can afford to take continuous off-site backups, via electronic vaulting, the data that is recovered will not be completely up to date. This means that the Users must re-process all of their lost transactions in order to restore their systems to a usable state. (See Exhibit 2).
It is absolutely critical that detailed plans for this process of System Restoration be developed and tested. Otherwise, the recovery time period may be extended well beyond acceptable limits, and the integrity of the systems may be seriously jeopardized. For each system, there are essentially three requirements that must be addressed:
1. User Notification - Procedures must be in place to ensure that key User personnel are notified when the Recovery Plan is invoked. These Users would have to arrange for the required staff to be on site as soon as the system is recovered, and would then have to coordinate the System Restoration.
2. Transaction Identification - The User personnel must be able to identify those transactions that were processed after the last off-site backup. This requires that workflows in the User departments be examined, and possibly modified. It also requires that the Users fully understand the off-site backup cycles.
3. Transaction Re-processing - The Users must have routines for re-processing the lost transactions as expeditiously as possible. These are not normal system routines, and must be developed in conjunction with the Users to ensure that they can restore the system to its pre-disaster state without duplication of any completed processes (e.g. re-processing of an Accounts Payable transaction shouldn’t generate another payment if one had already been made).
The process of System Restoration is not something that can be worked out “on the fly” after a disaster has occurred. It must be planned, documented, and tested well in advance. System interdependencies must be identified, and the overall effort must be carefully coordinated to ensure that all affected systems are brought back on stream as cleanly and quickly as possible. All of this can only be accomplished with active User involvement in the Recovery Planning process.
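The transaction identification and re-processing requirements above can be sketched in pseudocode form. The following is a minimal sketch, not a real system: the log structure, timestamps, and function names are all hypothetical, and the duplicate-payment check stands in for whatever completed-process record the Users actually maintain.

```python
from datetime import datetime

def find_lost_transactions(txn_log, last_backup_time):
    """Identify transactions processed after the last off-site backup;
    these are the ones lost when the system is restored from that backup."""
    return [t for t in txn_log if t["timestamp"] > last_backup_time]

def reprocess(transactions, completed_payments):
    """Re-apply lost transactions, skipping any whose side effects (e.g. an
    Accounts Payable payment) already occurred, to avoid duplication."""
    applied = []
    for t in transactions:
        if t["txn_id"] in completed_payments:
            continue  # payment already made; re-posting would duplicate it
        applied.append(t["txn_id"])
    return applied

# Hypothetical example: backup shipped Friday 18:00, two later txns lost.
last_backup = datetime(2024, 5, 3, 18, 0)
log = [
    {"txn_id": "AP-101", "timestamp": datetime(2024, 5, 3, 9, 0)},
    {"txn_id": "AP-102", "timestamp": datetime(2024, 5, 3, 19, 30)},
    {"txn_id": "AP-103", "timestamp": datetime(2024, 5, 3, 20, 15)},
]
lost = find_lost_transactions(log, last_backup)
print([t["txn_id"] for t in lost])    # the two post-backup transactions
print(reprocess(lost, {"AP-102"}))    # AP-102 already paid, so skip it
```

Even a sketch this simple makes the point: the Users can only perform the identification step if their workflows record when each transaction was processed relative to the backup cycle.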
In the third scenario, the disaster occurred during an active business cycle. In most organizations, it is not acceptable to simply shut down shop for a few days until the systems are restored. Critical business functions must continue, in one fashion or another, during the disruption in service. Consequently, the Users must be able to bypass the computer systems and perform these functions in an alternate way. These Bypass Procedures must be developed, on a system by system basis, as follows:
1. Identify Critical Functions - The Users must determine the business functions which need to continue during the outage.
2. Document Bypass Procedures - The Users must develop, and document in detail, the processes to be followed to perform these critical functions while the systems are unavailable. Depending on the circumstances, these may be manual processes, PC-based processes, etc.
3. Modify Current Systems - Systems People typically design systems to maximize the Users’ dependence on those systems. The Users must work with the Systems People to ensure that sufficient flexibility is built into the systems to make the Bypass Procedures viable. For example, if the Bypass Procedures are manual, the information required to support those procedures must be available to the Users in hardcopy form, even if it is normally only available on-line. Similarly, if the procedures are PC-based, the information must be available on the PC, even if it is normally only available on the mainframe.
Development of Bypass Procedures is rarely given the attention it deserves, by either Users or Systems People. Users have a tendency to underestimate their dependence on computer systems, and hence overestimate their ability to function in the absence of those systems. Systems People, on the other hand, seem to have an abhorrence of any form of manual processing: if they can’t program it, they don’t want anything to do with it. Both of these tendencies must be overcome. Bypass Procedures can’t be addressed as an afterthought to the development of a computer system; they need to be considered part of the overall system development process, and must be tested and maintained with the same rigor applied to the computer programs.
Taking a rigorous approach is also important in order to avoid a common oversight with Bypass Procedures. It is not enough to have a workable bypass; the Users must have a bypass that is workable and that permits an orderly return to normal processing after the systems have been restored. In other words, the Users must be able to reconcile the restored systems with the manual, or PC-assisted, processing that was performed while the systems were unavailable. This may seem obvious, but it is often overlooked or left to the Users to figure out after a disaster, with the result that the return to normal may be much longer than necessary.
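The reconciliation requirement above amounts to keeping a log of everything done in bypass mode. The sketch below is purely illustrative, assuming hypothetical record and function names; the real bypass log might be a paper ledger or a PC spreadsheet, but the logic is the same.

```python
# Hypothetical bypass log kept while the systems are unavailable. Every
# manually processed item is recorded with enough detail to re-enter it.
bypass_log = []

def record_bypass_action(action, reference, details):
    """Log a manually processed item for later reconciliation."""
    entry = {"action": action, "reference": reference, "details": details}
    bypass_log.append(entry)
    return entry

def reconcile(restored_system_entries):
    """After restoration, return bypass actions not yet in the system;
    only these need to be entered, avoiding duplicates."""
    known = {e["reference"] for e in restored_system_entries}
    return [e for e in bypass_log if e["reference"] not in known]

record_bypass_action("ship_order", "ORD-551", {"qty": 10})
record_bypass_action("ship_order", "ORD-552", {"qty": 4})
# The restored system already shows ORD-551 (entered before the disaster):
pending = reconcile([{"reference": "ORD-551"}])
print([e["reference"] for e in pending])  # only ORD-552 still needs entry
```

The design choice worth noting is the unique reference on every bypass action: without it, the Users cannot tell completed work from pending work when the systems come back up.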
As we can see from Exhibit 3, the recovery timeline is starting to get somewhat complicated. It is also clear that the Users are going to be very busy during the recovery time period. Not only will they need to invoke their Bypass Procedures to ensure the continuity of critical business functions, they will also need to re-process their lost transactions once the systems have been recovered, and then reconcile the restored systems with the processing that has been taking place in bypass mode. Obviously, failure to involve the Users in planning and testing these processes before a disaster takes place can have disastrous consequences afterwards.
The fourth, and final, scenario added one more major wrinkle to the recovery process. In this scenario, the Users did not just lose their computer systems, they lost access to their place of business. As important as it is to recover these systems, it is even more important to find replacement office facilities for the Users so that they can begin to maintain critical business functions.
The window between occurrence of the disaster and replacement of the Users’ office facilities is now the most critical part of the recovery timeline. (See Exhibit 4). Until the Users are able to return to work, no business can be conducted, even in bypass mode. This highlights what should have been obvious all along: the most critical components in a Disaster Recovery Plan are not the computer systems, but the Users. Unless your business has been so highly automated that it does not require this ‘human element’ to function, simply recovering the technology will not return the organization to normal operation.
On one level, development of Facility Replacement plans is relatively straightforward. Alternate office space must be identified, an immediate source of office furnishings and supplies must be determined, a plan for installing replacement terminals and PC’s must be in place, and a strategy for re-instating voice communications must be established. While these are not trivial matters, they only involve logistical and financial considerations. Nevertheless, these plans must be carefully documented and maintained and the various procedures reviewed and rehearsed, by those responsible, on a regular basis.
Unfortunately, there is another, more complex level to Facility Replacement planning: the human level. People (i.e. Users) tend to be creatures of habit. They get up at the same times each morning, take the same routes to work, sit down at the same desks, reach into the same drawers for their work-in-progress, and begin following the same daily routines. In the event of a disaster, it is bad enough that their normal routines have been disrupted by the loss of their computer systems. When this is compounded by unfamiliar surroundings and the loss of their personal files, notes, reference material, etc., it can be positively traumatizing. This can be avoided, or at least minimized, however, by involving the Users in the planning and rehearsal processes, to give them an opportunity to prepare themselves psychologically for such an eventuality.
User involvement is also essential to ensure that one final aspect of Facility Replacement planning is properly addressed: Document Retrieval. In the worst case scenario of loss of both office facilities and computer systems, the Users may be expected to maintain critical business functions without access either to the information in their computer systems or to the hard copy documentation retained at their normal place of work. Unless they have extremely good memories, this can be an impossible task. Consequently, the Facility Replacement plans must include provision for retrieval, from an alternate source, of backup copies of all essential documentation.
The only people who can properly assess what documentation is essential are the Users. Hence, this is a task which must be assigned to them. It is an interesting task since it forces the Users to take a very close look at what they do, why they do it, and how they do it. The insights gained from this exercise can prove to be extremely beneficial, not just to Disaster Recovery, but to normal operations. In any event, identification of the critical documentation is the easy part. The hard part, which requires collaboration between the Users and Systems People, is development of strategies for backing up and retrieving this documentation. This will necessitate procedures for duplicating the documentation and storing it off site, either in hard copy format, or on microfiche, diskette, or other readily usable media.
PART III OF III
In Part II of this article, we discussed why User involvement in the Recovery Planning process is mandatory. The unanswered questions, however, are when to involve them, and how to involve them. In a perfect world, Recovery Planning is not something you do after a system has been developed, it is something you do while the system is being developed. The development and implementation of a new computer system is based on the successful collaboration of Users and Systems People, with the Users providing the business knowledge and the Systems People the technological knowledge. This, then, is the ideal time to involve the Users in the Recovery Planning process. It requires, however, that both parties recognize that the purpose of their collaboration is not just to deliver a functionally and technically sound system, but to deliver a system which can be quickly restored in the event of loss, and which can be temporarily bypassed in order to maintain critical business functions. To deliver such a system, additional discipline will have to be incorporated into each phase of the development cycle.
Before system design even begins, the Users will have to specify one simple parameter that will be critical to the entire process: the Disaster Window. The Disaster Window is the maximum length of time, between loss of a system and its return to normal operation, that can be tolerated without seriously jeopardizing critical business functions. It must be recognized that this window is not merely the length of time that it will take the Systems People to recover the system, since it must also include the lead time before the Recovery Plan is invoked, and the time it will take the Users to restore and reconcile the system. It will also have to allow for any lead time for Facility Replacement.
It is important that the Users, not the Systems People, determine the Disaster Window, since they are the ones qualified to assess the business impact of the system’s unavailability. They are also the ones who will have to perform the Bypass Procedures, and the System Restoration and Reconciliation procedures. It will be necessary for the Users to make trade-offs in setting this window: the shorter the window, the less disruption to normal operations; the longer the window, the greater the safety margin for the overall recovery effort. Obviously, the critical business functions which will need to be maintained during the Disaster Window must also be clearly specified.
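The arithmetic behind the Disaster Window can be laid out explicitly. The figures below are hypothetical, and the sketch assumes that Facility Replacement can proceed in parallel with systems recovery (so only the longer of the two extends the timeline); a given organization might have to treat these stages as sequential instead.

```python
def total_recovery_time(invocation_lead, facility_lead, systems_recovery,
                        user_restoration):
    """Estimate total elapsed time (hours) from disaster to normal operation.
    The window must cover every stage, not just the technical recovery;
    facility replacement is assumed to run alongside systems recovery."""
    return invocation_lead + max(facility_lead, systems_recovery) + user_restoration

disaster_window = 72          # maximum tolerable outage, set by the Users
estimate = total_recovery_time(invocation_lead=4,    # decide to invoke the plan
                               facility_lead=24,     # replacement offices ready
                               systems_recovery=48,  # systems up at hot site
                               user_restoration=12)  # reprocess and reconcile
print(estimate, estimate <= disaster_window)
```

Working the numbers this way makes the Users' trade-off concrete: shaving hours off the window means either faster technical recovery or leaner restoration procedures, and the Users are the only ones who can say which is feasible.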
With the Disaster Window and critical business functions specified up front, the system can be designed accordingly. The risk of making a bad design decision, from a Disaster Recovery perspective, is minimized. It also forces the Users to participate actively in the design process, something which does not always happen in the traditional approach to system development. Some of the key design issues to be resolved jointly by the Users and Systems People are reviewed below.
1. Backup Cycles
Systems People are very good at designing backup and retrieval mechanisms for system data. However, the focus is usually on the ability to recover from processing errors. With the Users’ involvement, backup and retrieval mechanisms can be designed which focus on the ability to recover from a disaster. This requires establishment of backup cycles based on the requirements of the business, not just the requirements of the system. It must also be recognized that establishment of the backup cycles is not necessarily limited to data contained in the computer system. Critical data contained in the Users’ desks, filing cabinets, PC’s, etc. will also need to be backed up offsite.
2. Transaction Processing
Traditionally, systems are designed to process transactions one way: the “normal” way. In a disaster scenario, transactions may have to be processed in an “abnormal” way, in order to reprocess lost transactions or reconcile the restored system with the processing that was occurring in Bypass mode. With the Users’ involvement, the ability to perform this abnormal processing can be built right into the system design, rather than attempting to develop “add on” routines at a later date.
3. Batch Processing
In addition to processing that occurs ‘on-line’, most systems also do routine ‘batch’ processing based on normal business cycles (e.g. daily, weekly, monthly, etc.). When recovering from a disaster, special batch processing may have to be performed to get the systems back ‘in sync’ with the business cycles. Provision for this should be built into a system’s design.
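The catch-up batch processing described here can be sketched simply. This is an illustrative fragment only, assuming a hypothetical daily batch cycle; weekly or monthly cycles would follow the same pattern.

```python
from datetime import date, timedelta

def missed_batch_dates(last_run, resumed):
    """List the daily batch cycles skipped during the outage. These must be
    run, in order, to bring the system back in sync with the business cycle."""
    d, missed = last_run + timedelta(days=1), []
    while d <= resumed:
        missed.append(d)
        d += timedelta(days=1)
    return missed

# Hypothetical outage: last batch ran May 3, service resumed May 6.
print(missed_batch_dates(date(2024, 5, 3), date(2024, 5, 6)))
```

Designing the batch jobs so they can be run for an arbitrary past date, rather than hard-wired to "today," is the provision being argued for above.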
4. System Dependence
Systems People have a tendency to “over-automate” even the most trivial of functions. As a result, Users are frequently locked into a total dependence on the system’s availability unnecessarily. With the Users’ involvement in design decisions, they can ensure that they will maintain a measure of control over the system, and retain the flexibility to perform routine (but often critical) functions without computer assistance. Often this is as simple as ensuring that the Users can look up essential system information via printed reports, microfiche, etc. in the event that on-line access to the information is unavailable.
5. Bypass Procedures
Traditionally, no one develops a means of bypassing a computer system until long after it has been developed. This is the hard way. Bypass procedures should be designed at the same time as the computer systems, since the design of the Bypass Procedures can affect the design of the computer system, and vice versa. Doing both at the same time allows trade-offs to be evaluated, and helps guard against the unnecessary system dependencies mentioned above. The principal responsibility for design of the Bypass Procedures belongs with the Users, since they are far better qualified to determine simple, but workable, means of maintaining critical business functions during the Disaster Window.
6. Distributed Processing
Increasingly, systems development is moving away from the traditional ‘mainframe’ towards processing that is distributed throughout ‘client-server’ networks. In many organizations, new systems tend to be a hybrid of centralized processing on the mainframe and localized processing on PC/LAN’s and/or midrange platforms. Designing a system which does part of its processing centrally, and part locally, can insulate the Users somewhat from the impact of a disaster affecting the central site. User involvement in determining what processing should be done locally, and what data should be maintained locally, is essential. Before pursuing this option, however, two things must be remembered: first, a Recovery Plan will have to be developed for the local system as well as the central system; and, second, if the local system and central system are in the same geographical area, they may both be taken out by the same disaster.
DEVELOPMENT AND TESTING
In the traditional cycle, Users have little involvement in the actual development of the system, but do get involved in testing out the various components as they are delivered to them by the Systems People. This is to ensure that the components function as expected. Of course, they rarely do, with the result that they must be sent back to the Systems People for rework.
In the development cycle being espoused here, an interesting symbiotic relationship occurs. While the Systems People are developing the computerized routines, the Users are developing the bypass routines. Each side has to know what the other is doing in order to keep their efforts and approaches “in sync”. The checkout of one set of routines has to be matched to the checkout of the other set of routines. Hence, the development and testing of the system becomes a much more collaborative effort, likely producing a better system.
Needless to say, the Users must also be actively involved in development and testing of the various backup and retrieval mechanisms, and any specialized routines required for the reprocessing of lost transactions and reconciliation of the system after a disaster. Since these routines will only be exercised during real or simulated disasters, it is important that the Users prepare detailed instructions on how to use these routines (these instructions must, of course, be backed up offsite).
IMPLEMENTATION AND SHAKEDOWN
Implementation of a new system can be a traumatic time for Users and Systems People alike. It marks a transition from one way of conducting business to another, and, no matter how thorough the testing, the transition rarely goes as smoothly as planned. However, a system developed via the approach outlined here can be implemented with less trauma, and less risk of disrupting the business during the implementation, than a system developed via the traditional approach. It is likely, first of all, that increased User involvement in the development cycle will have produced a better system, with fewer unanticipated “glitches”. Secondly, the Users will be well positioned to cope with any glitches that do arise, since they need only fall back on their Bypass Procedures while the Systems People make the necessary corrections to the system.
It is almost to be hoped that glitches do arise, since it is essentia