USER INVOLVEMENT IN RECOVERY PLANNING
By David A. Johnson
Part I of III
The impetus for recovery planning usually comes from the top down, with an organization's senior management mandating the development of a computer recovery plan by the Information Systems Department. All too often, the plan is considered complete after all the technological issues have been addressed, without considering the implications for the people who use the technology. Consequently, the plan may not ensure the organization's ability to continue critical business functions following a disaster.
Development and testing of recovery plans cannot be confined to the inner sanctum of the I.S. department, but needs to spread throughout the organization, actively involving the users of computer systems. This involvement can, and should, occur during the development of detailed recovery plans for existing systems, and during regularly scheduled testing of the overall recovery effort. Most importantly, however, a new discipline must be introduced into the system development cycle that ensures user involvement in recovery planning from the very beginning of the process.
THE CORPORATE CASTE STRUCTURE
The development, testing, and maintenance of disaster recovery plans are frequently treated as internal, and largely transparent, functions of the I.S. department. The people in the organization that would be most affected by an actual disaster may not even be aware that recovery plans exist.
The people that would be most affected by a disaster are those that are dependent upon computer technology to conduct the day-to-day business of the organization. In computer jargon, these people are labelled Users, a somewhat degrading term implying some form of helpless addiction. The term is not inappropriate, however, since any Users deprived of their computer systems can experience withdrawal symptoms of the worst kind.
Within most large organizations, Users occupy the bottom level of a kind of corporate caste structure. Despite the fact that Users are the people that best understand how the organization functions, they are often seen as mere appendages to the organization's computer systems. For this reason, they are ranked below the Systems People in the caste structure.
Systems People are those individuals who design, implement, operate, and support the organization's computer technology. They usually hold themselves in very high esteem, and look down upon the poor Users because of their lack of technological sophistication. Unfortunately, while Systems People may know all there is to know about the technology, they often don't know very much about the business. For this reason, they are often viewed with some suspicion by the people at the top of the caste structure - the Executives.
The Executives are those individuals in senior management positions who wield the overall decision-making authority in the organization. They look down upon both Users and Systems People as necessary evils; however, they are usually forced to form an uneasy alliance with the Systems People. The purpose of this alliance, theoretically at least, is to ensure that computer technology is exploited to the organization's strategic advantage. Sometimes, both the Executives and Systems People lose sight of the fact that this technology must serve the Users, not vice versa.
THE TRADITIONAL APPROACH TO DRP
In most organizations, the impetus for disaster planning comes from the top down, with the Executives mandating the development of recovery plans. This is usually not their own idea, however. Credit generally goes to a group of individuals that sit somewhat outside the corporate caste structure: the Auditors. These organizational pariahs are continual critics, always harping on about such things as the confidentiality, integrity, and availability of computer systems.
The persistence of the Auditors is the reason many organizations have a disaster recovery plan. Their constant nagging, and talk of gloom and doom, eventually forces the Executives to do something to get the Auditors off their backs. What the Executives do, of course, is tell the Systems People to develop a recovery plan. This is usually followed by a prolonged period of negotiations between the Systems People and the Executives over how much to spend on such a plan. Eventually, a budget is set and the Systems People go off to do their thing.
This top-down approach would be fine if the Systems People then went off to work with the Users on plan development. This rarely happens, however. Most Systems departments wouldn't dream of involving the Users in something as technically complex as recovery of computer hardware, communications networks, operating software, application systems, etc. In any event, since the Systems department is now on the hook to deliver a recovery plan as soon as possible, the last thing they want is the aggravation of dealing with Users who would do nothing except slow the process down.
This, unfortunately, is a very short-sighted approach. While the absence of User involvement may well hasten the delivery of a recovery plan, the end product may not be worth the paper it is written on.
IMPLICATIONS FOR RECOVERY
Why is User involvement in recovery planning critical? Well, suppose, heaven forbid, that the Systems People actually have to invoke the recovery plan. If this is the first time the Users discover that such a thing exists, it may come as quite a shock.
To illustrate, let's look at four hypothetical disaster scenarios: the transparent disaster, the almost transparent disaster, the visible disaster, and the highly visible disaster.
1. The Transparent Disaster--Let's suppose it's Friday evening. The Users have all gone home for a long holiday weekend. The I.S. department has shut down all the on-line systems, run all the overnight batch jobs, and taken full backups of all the software and data. Moments after shipping the backups off-site, disaster strikes the data center, putting it out of commission.
The recovery teams are notified immediately, and spring into action. Working around the clock, the teams recover all the applications at the alternate processing facility and reroute the data communications networks to the hot site. When the Users come back to work on Tuesday morning, refreshed from their long weekend, they sit down at their terminals and start working as if nothing had happened.
In this scenario, the fact that the Users know nothing about the I.S. department's disaster recovery plan is immaterial, since the entire exercise was transparent to them. Unfortunately, this scenario is a pipe dream. Disasters are not this neat. Building a recovery plan on the assumption that a disaster will strike at the best possible time is suicide.
2. The Almost Transparent Disaster--Let's look at what could happen if we change a few parameters of the disaster. First of all, let's assume that everything is the same as in the first scenario, with one exception: disaster strikes before the latest backups are shipped off-site. The recovery teams still spring into action, and recover all the applications, but using previous backups which are a day or a week out of date, depending on the backup cycle.
Now when the Users return to work Tuesday morning, it is soon apparent that there is something not quite right with their applications. The phones start ringing in the I.S. department, confusion erupts in the User community, processing errors begin to be made, and the nice, neat recovery effort starts falling apart. Without a clear understanding of what has happened, or what they are supposed to do about it, the Users make erratic and unpredictable attempts to fix up their systems. After a day or two of these uncoordinated activities, business operations are in a shambles: transactions have been lost or duplicated, files are out of whack, the systems are spewing out error messages, and normally routine procedures are becoming chaotic. The original disaster now seems trivial in comparison to the aftermath.
3. The Visible Disaster--Let's look at another scenario. It's the middle of the week. The Users are hammering away at their computer terminals, when they suddenly go dead (the terminals that is, not the Users). After banging on their terminals for a while, the Users start calling the Help Desk, only to get a recorded message that service has been temporarily disrupted. For the next hour or so, the Users sit around cursing the Systems department, until word starts coming down the grapevine that some serious calamity has struck the data center. The balance of the day is spent in confusion and frantic phone calls to the Director of MIS. Eventually, the official position of the MIS department gets out: "We have activated our disaster recovery plan. We'll get back to you in a couple of days."
In the MIS department, the recovery teams have swung into action and are going through their paces with carefully rehearsed precision. After 48 hours of around the clock activity, the systems are brought up at the alternate facility, and the exhausted recovery teams go home. Meanwhile, back at the ranch, it has been a totally different story. The Users had no carefully rehearsed plans to fall back on. For the most part, they had no plans at all for losing their systems for this length of time. Some areas simply sat around idly waiting for the systems to come back up; others improvised workarounds on the fly; still others tried to re-activate long dormant manual procedures. What no one did was figure out what they would do when the systems did come back up.
The successful recovery effort performed by the Systems People did not, unfortunately, result in a return to business as usual. The systems came back up alright, late in the week, but no one had explained to the Users that they would now have to: a) reprocess the transactions lost during the time period between the last backup and the point of failure; b) process the backlog of transactions that had built up during the recovery period; and c) reconcile the systems with any manual processing, or workarounds, that had been performed while the systems were down. Since the recovery teams had all gone home, there was no one around to help them sort out the mess. In despair, the Users simply went home for the weekend, hoping that Monday would never come. Unfortunately, it always does.
4. The Highly Visible Disaster--There is one more scenario which needs to be presented. This is the one that people responsible for computer recovery plans rarely address, or are rarely allowed to address: the disaster that directly affects the Users' working environment, as well as the data center. In this scenario, the data center and the Users are all in the same building. When the disaster strikes, the Users not only lose their systems, they lose access to their offices, their terminals, their PCs, their phones, their faxes, their files, their work-in-progress - everything! Sure, the computer recovery plan still works, but when the recovery teams have finished recovering the systems, there's no one there to use them! All the Users are at home, or wandering aimlessly around the streets.
This type of disaster is obviously highly visible to the User community. They may actually have watched their offices burn to the ground on the evening news. You can rest assured that their first reaction wasn't "Oh my God! We've lost our computer systems." No doubt their only thoughts were about when and where they would return to work, if indeed they ever would return to work. Since the Users had never been involved in any aspects of disaster planning, the thought that something like this could happen to them had probably never crossed their minds. Certainly no one had given any thought to how the Users' office facilities could be replaced in the event of a disaster.
The Systems People are no help in this scenario. They are too concerned with their precious computer systems to worry about mundane things like arranging for alternate office space and furnishings, replacing phones and office equipment, recovering lost documents and work-in-progress, installing new terminals and PCs, etc. Presumably, they felt that it wasn't their job to worry about these aspects of disaster planning. But if it wasn't their job, whose was it? The purpose of disaster planning is not merely to recover computer systems, but to ensure that the organization can survive a disaster. Without plans that address any eventuality, including this worst case scenario, this assurance cannot be given.
Part II of III
In Part I of this article, we looked at four hypothetical disaster scenarios. In three of these scenarios, something was missing from the Recovery Plan. In each case, the plan only addressed what the Systems People would be doing after a disaster, not what the Users would, or should, be doing. Since User involvement was the critical missing element in the Recovery Planning exercise, this should not be a great surprise to anyone. To identify the specific components of the Recovery Plan that were overlooked, and to highlight the criticality of User involvement, each of the four scenarios will be re-examined in detail.
In the first scenario, the disaster occurred on a Friday evening, after all the Users had gone home for a long weekend, and after full backups had been taken off site. In this scenario, nothing is missing from the Recovery Plan. The off-site backups were taken after the Users had finished their normal processing for the week, the disaster occurred while no processing was taking place, and the systems were restored before the Users needed to resume normal processing. The timeline for the recovery looks pretty simple. (See Exhibit 1).
In the second scenario, the disaster also occurred on a Friday evening before a long weekend, but before backups had been taken off site. In this scenario, and in most real-life disasters, a certain amount of User processing is lost when the systems are recovered. Unless the organization can afford to take continuous off-site backups, via electronic vaulting, the data that is recovered will not be completely up to date. This means that the Users must re-process all of their lost transactions in order to restore their systems to a usable state. (See Exhibit 2).
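How far out of date the recovered data can be follows directly from the backup cycle. The short Python sketch below is only an illustration: it assumes backups are taken at a fixed interval and reach the off-site location some fixed time later. The function name and the four-hour shipping delay are hypothetical, not figures from any particular installation.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         offsite_delay: timedelta) -> timedelta:
    """Longest stretch of User processing that might have to be redone.

    Assumption: the worst moment for a disaster is just before the newest
    backup reaches the off-site location, so the newest usable copy is one
    full cycle older than that backup.
    """
    return backup_interval + offsite_delay


# Hypothetical cycles: daily and weekly backups, shipped four hours later.
daily = worst_case_data_loss(timedelta(days=1), timedelta(hours=4))
weekly = worst_case_data_loss(timedelta(days=7), timedelta(hours=4))

print(daily)   # 1 day, 4:00:00
print(weekly)  # 7 days, 4:00:00
```

A figure like this gives the Users a concrete idea of how much work they may be asked to re-process.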
It is absolutely critical that detailed plans for this process of System Restoration be developed and tested. Otherwise, the recovery time period may be extended well beyond acceptable limits, and the integrity of the systems may be seriously jeopardized. For each system, there are essentially three requirements that must be addressed:
1. User Notification - Procedures must be in place to ensure that key User personnel are notified when the Recovery Plan is invoked. These Users would have to arrange for the required staff to be on site as soon as the system is recovered, and would then have to coordinate the System Restoration.
2. Transaction Identification - The User personnel must be able to identify those transactions that were processed after the last off-site backup. This requires that workflows in the User departments be examined, and possibly modified. It also requires that the Users fully understand the off-site backup cycles.
3. Transaction Re-processing - The Users must have routines for re-processing the lost transactions as expeditiously as possible. These are not normal system routines, and must be developed in conjunction with the Users to ensure that they can restore the system to its pre-disaster state without duplication of any completed processes (e.g. re-processing of an Accounts Payable transaction shouldn't generate another payment if one had already been made).
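The second and third requirements can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any real system: it supposes the Users keep their own log of entered transactions, that each transaction carries a unique identifier, and that a separate record exists of payments already issued. All names and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    txn_id: str
    entered_at: datetime
    amount: float


def lost_transactions(user_log, last_offsite_backup):
    """Transaction Identification: entries made after the last off-site backup."""
    return [t for t in user_log if t.entered_at > last_offsite_backup]


def reprocess(lost, restored_ledger, payments_already_issued):
    """Transaction Re-processing: re-enter lost items without paying twice.

    Returns the ids of transactions that still need a payment issued.
    """
    payments_to_issue = []
    for t in lost:
        if t.txn_id in restored_ledger:
            continue  # already present in the restored system
        restored_ledger[t.txn_id] = t
        if t.txn_id not in payments_already_issued:
            payments_to_issue.append(t.txn_id)
    return payments_to_issue


# Hypothetical Accounts Payable example.
backup_time = datetime(2024, 1, 5, 18, 0)
log = [
    Transaction("AP-1", datetime(2024, 1, 5, 9, 0), 100.0),
    Transaction("AP-2", datetime(2024, 1, 5, 19, 0), 250.0),
    Transaction("AP-3", datetime(2024, 1, 5, 20, 0), 75.0),
]
ledger = {"AP-1": log[0]}   # what the restored system contains
issued = {"AP-2"}           # payment went out before the disaster

lost = lost_transactions(log, backup_time)
to_pay = reprocess(lost, ledger, issued)
print([t.txn_id for t in lost])  # ['AP-2', 'AP-3']
print(to_pay)                    # ['AP-3']
```

Here AP-2 is put back into the restored ledger, but no second payment is generated for it.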
The process of System Restoration is not something that can be worked out on the fly after a disaster has occurred. It must be planned, documented, and tested well in advance. System interdependencies must be identified, and the overall effort must be carefully coordinated to ensure that all affected systems are brought back on stream as cleanly and quickly as possible. All of this can only be accomplished with active User involvement in the Recovery Planning process.
In the third scenario, the disaster occurred during an active business cycle. In most organizations, it is not acceptable to simply shut down shop for a few days until the systems are restored. Critical business functions must continue, in one fashion or another, during the disruption in service. Consequently, the Users must be able to bypass the computer systems and perform these functions an alternate way. These Bypass Procedures must be developed, on a system by system basis, as follows:
1. Identify Critical Functions - The Users must determine the business functions which need to continue during the outage.
2. Document Bypass Procedures - The Users must develop, and document in detail, the processes to be followed to perform these critical functions while the systems are unavailable. Depending on the circumstances, these may be manual processes, PC-based processes, etc.
3. Modify Current Systems - Systems People typically design systems to maximize the Users' dependence on those systems. The Users must work with the Systems People to ensure that sufficient flexibility is built into the systems to make the Bypass Procedures viable. For example, if the Bypass Procedures are manual, the information required to support those procedures must be available to the Users in hardcopy form, even if it is normally only available on-line. Similarly, if the procedures are PC-based, the information must be available on the PC, even if it is normally only available on the mainframe.
Development of Bypass Procedures is rarely given the attention it deserves, by either Users or Systems People. Users have a tendency to underestimate their dependence on computer systems, and hence overestimate their ability to function in the absence of those systems. Systems People, on the other hand, seem to have an abhorrence of any form of manual processing: if they can't program it, they don't want anything to do with it. Both of these tendencies must be overcome. Bypass Procedures can't be addressed as an afterthought to the development of a computer system; they need to be considered part of the overall system development process, and must be tested and maintained with the same rigor applied to the computer programs.
Taking a rigorous approach is also important in order to avoid a common oversight with Bypass Procedures. It is not enough to have a workable bypass; the Users must have a bypass that is workable and that permits an orderly return to normal processing after the systems have been restored. In other words, the Users must be able to reconcile the restored systems with the manual, or PC-assisted, processing that was performed while the systems were unavailable. This may seem obvious, but it is often overlooked or left to the Users to figure out after a disaster, with the result that the return to normal may be much longer than necessary.
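One way to picture the reconciliation step is as a comparison between the records kept in bypass mode and what has been entered into the restored system. The Python sketch below assumes, purely for illustration, that each bypass record carries a reference number and an amount; the field names and sample records are hypothetical.

```python
from decimal import Decimal


def reconcile(bypass_log, system_entries):
    """Compare bypass-mode records with entries keyed into the restored system."""
    bypass = {r["ref"]: Decimal(r["amount"]) for r in bypass_log}
    system = {r["ref"]: Decimal(r["amount"]) for r in system_entries}
    return {
        # recorded during the outage but not yet in the system
        "missing": sorted(bypass.keys() - system.keys()),
        # in the system with no matching bypass record
        "unexpected": sorted(system.keys() - bypass.keys()),
        # present in both, but the amounts disagree
        "mismatched": sorted(ref for ref in bypass.keys() & system.keys()
                             if bypass[ref] != system[ref]),
    }


# Hypothetical records from a manual bypass log and the restored system.
bypass_log = [
    {"ref": "M-001", "amount": "120.00"},
    {"ref": "M-002", "amount": "45.50"},
    {"ref": "M-003", "amount": "10.00"},
]
system_entries = [
    {"ref": "M-001", "amount": "120.00"},
    {"ref": "M-002", "amount": "54.50"},  # keyed in with transposed digits
]

report = reconcile(bypass_log, system_entries)
print(report)
# {'missing': ['M-003'], 'unexpected': [], 'mismatched': ['M-002']}
```

An empty report is the signal that bypass processing and the restored system agree and normal processing can resume.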
As we can see from Exhibit 3, the recovery timeline is starting to get somewhat complicated. It is also clear that the Users are going to be very busy during the recovery time period. Not only will they need to invoke their Bypass Procedures to ensure the continuity of critical business functions, they will also need to re-process their lost transactions once the systems have been recovered, and then reconcile the restored systems with the processing that has been taking place in bypass mode. Obviously, failure to involve the Users in planning and testing these processes before a disaster takes place can have disastrous consequences after.
The fourth and final scenario added one more major wrinkle to the recovery process. In this scenario, the Users did not just lose their computer systems, they lost access to their place of business. As important as it is to recover these systems, it is even more important to find replacement office facilities for the Users so that they can begin to maintain critical business functions.
The window between occurrence of the disaster and replacement of the Users' office facilities is now the most critical part of the recovery timeline. (See Exhibit 4). Until the Users are able to return to work, no business can be conducted, even in bypass mode. This highlights what should have been obvious all along: the most critical components in a Disaster Recovery Plan are not the computer systems, but the Users. Unless your business has been so highly automated that it does not require this human element to function, simply recovering the technology will not return the organization to normal operation.
On one level, development of Facility Replacement plans is relatively straightforward. Alternate office space must be identified, an immediate source of office furnishings and supplies must be determined, a plan for installing replacement terminals and PCs must be in place, and a strategy for re-instating voice communications must be established. While these are not trivial matters, they only involve logistical and financial considerations. Nevertheless, these plans must be carefully documented and maintained and the various procedures reviewed and rehearsed, by those responsible, on a regular basis.
Unfortunately, there is another, more complex level to Facility Replacement planning: the human level. People (i.e. Users) tend to be creatures of habit. They get up at the same times each morning, take the same routes to work, sit down at the same desks, reach into the same drawers for their work-in-progress, and begin following the same daily routines. In the event of a disaster, it is bad enough that their normal routines have been disrupted by the loss of their computer systems. When this is compounded by unfamiliar surroundings and the loss of their personal files, notes, reference material, etc., it can be positively traumatizing. This can be avoided, or at least minimized, however, by involving the Users in the planning and rehearsal processes, to give them an opportunity to prepare themselves psychologically for such an eventuality.
User involvement is also essential to ensure that one final aspect of Facility Replacement planning is properly addressed: Document Retrieval. In the worst case scenario of loss of both office facilities and computer systems, the Users may be expected to maintain critical business functions without access either to the information in their computer systems, or in the hard copy documentation retained at their normal place of work. Unless they have extremely good memories, this can be an impossible task. Consequently, the Facility Replacement plans must include provision for retrieval, from an alternate source, of backup copies of all essential documentation.
The only people that can properly assess what documentation is essential are the Users. Hence, this is a task which must be assigned to them. It is an interesting task since it forces the Users to take a very close look at what they do, why they do it, and how they do it. The insights gained from this exercise can prove to be extremely beneficial, not just to Disaster Recovery, but to normal operations. In any event, identification of the critical documentation is the easy part. The hard part, which requires collaboration between the Users and Systems People, is development of strategies for backing up and retrieving this documentation. This will necessitate procedures for duplicating the documentation and storing it off site, either in hard copy format, or on microfiche, diskette, or other readily useable media.
Part III of III
In Part II of this article, we discussed why user involvement in the Recovery Planning process is mandatory. The unanswered questions, however, are when to involve them, and how to involve them. In a perfect world, Recovery Planning is not something you do after a system has been developed, it is something you do while the system is being developed. The development and implementation of a new computer system is based on the successful collaboration of Users and Systems People, with the Users providing the business knowledge and the Systems People the technological knowledge. This, then, is the ideal time to involve the Users in the Recovery Planning process. It requires, however, that both parties recognize that the purpose of their collaboration is not just to deliver a functionally and technically sound system, but to deliver a system which can be quickly restored in the event of loss, and which can be temporarily bypassed in order to maintain critical business functions. To deliver such a system, additional discipline will have to be incorporated into each phase of the development cycle.
Before system design even begins, the Users will have to specify one simple parameter that will be critical to the entire process: the Disaster Window. The Disaster Window is the maximum length of time, between loss of a system and its return to normal operation, that can be tolerated without seriously jeopardizing critical business functions. It must be recognized that this window is not merely the length of time that it will take the Systems People to recover the system, since it must also include the leadtime before the Recovery Plan is invoked, and the time it will take the Users to restore and reconcile the system. It will also have to allow for any leadtime for Facility Replacement.
It is important that the Users, not the Systems People, determine the Disaster Window, since they are the ones qualified to assess the business impact of the system's unavailability. They are also the ones who will have to perform the Bypass Procedures, and the System Restoration and Reconciliation procedures. It will be necessary for the Users to make trade-offs in setting this window: the shorter the window, the less disruption to normal operations; the longer the window, the greater the safety margin for the overall recovery effort. Obviously, the critical business functions which will need to be maintained during the Disaster Window must also be clearly specified.
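The arithmetic behind the Disaster Window can be sketched as follows. This Python fragment is an illustration only. It makes the conservative assumption that the phases run one after another rather than overlapping, and every duration other than the 48-hour system recovery from the third scenario is a hypothetical figure.

```python
from datetime import timedelta


def total_outage(invocation_lead, system_recovery, user_restoration,
                 facility_lead=timedelta(0)):
    """Elapsed time from loss of a system to its return to normal operation.

    Assumption: the phases are sequential, which gives a worst-case total.
    """
    return invocation_lead + system_recovery + user_restoration + facility_lead


def fits_window(disaster_window, **phases):
    """True if the planned recovery fits inside the Users' Disaster Window."""
    return total_outage(**phases) <= disaster_window


window = timedelta(days=5)  # hypothetical window set by the Users

systems_only = dict(
    invocation_lead=timedelta(hours=4),
    system_recovery=timedelta(hours=48),
    user_restoration=timedelta(days=2),
)
with_facilities = dict(systems_only, facility_lead=timedelta(days=1))

print(total_outage(**systems_only))            # 4 days, 4:00:00
print(fits_window(window, **systems_only))     # True
print(total_outage(**with_facilities))         # 5 days, 4:00:00
print(fits_window(window, **with_facilities))  # False
```

In this example a plan that looks comfortable when only system recovery is counted falls outside the window once Facility Replacement is included.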
With the Disaster Window and critical business functions specified up front, the system can be designed accordingly. The risk of making a bad design decision, from a Disaster Recovery perspective, is minimized. It also forces the Users to participate actively in the design process, something which does not always happen in the traditional approach to system development. Some of the key design issues to be resolved jointly by the Users and Systems People are reviewed below.
1. Backup Cycles
Systems People are very good at designing backup and retrieval mechanisms for system data. However, the focus is usually on the ability to recover from processing errors. With the Users' involvement, backup and retrieval mechanisms can be designed which focus on the ability to recover from a disaster. This requires establishment of backup cycles based on the requirements of the business, not just the requirements of the system. It must also be recognized that establishment of the backup cycles is not necessarily limited to data contained in the computer system. Critical data contained in the Users' desks, filing cabinets, PCs, etc. will also need to be backed up offsite.
2. Transaction Processing
Traditionally, systems are designed to process transactions one way: the normal way. In a disaster scenario, transactions may have to be processed in an abnormal way, in order to reprocess lost transactions or reconcile the restored system with the processing that was occurring in Bypass mode. With the Users' involvement, the ability to perform this abnormal processing can be built right into the system design, rather than attempting to develop add-on routines at a later date.
3. Batch Processing
In addition to processing that occurs on-line, most systems also do routine batch processing based on normal business cycles (e.g. daily, weekly, monthly, etc.). When recovering from a disaster, special batch processing may have to be performed to get the systems back in sync with the business cycles. Provision for this should be built into a system's design.
4. System Dependence
Systems People have a tendency to over-automate even the most trivial of functions. As a result, Users are frequently locked unnecessarily into a total dependence on the system's availability. With the Users' involvement in design decisions, they can ensure that they will maintain a measure of control over the system, and retain the flexibility to perform routine (but often critical) functions without computer assistance. Often this is as simple as ensuring that the Users can look up essential system information via printed reports, microfiche, etc. in the event that on-line access to the information is unavailable.
5. Bypass Procedures
Traditionally, no one develops a means of bypassing a computer system until long after it has been developed. This is the hard way. Bypass procedures should be designed at the same time as the computer systems, since the design of the Bypass Procedures can affect the design of the computer system, and vice versa. Doing both at the same time allows trade-offs to be evaluated, and helps guard against the unnecessary system dependencies mentioned above. The principal responsibility for design of the Bypass Procedures belongs with the Users, since they are far better qualified to determine simple, but workable, means of maintaining critical business functions during the Disaster Window.
6. Distributed Processing
Increasingly, systems development is moving away from the traditional mainframe towards processing that is distributed throughout client-server networks. In many organizations, new systems tend to be a hybrid of centralized processing on the mainframe and localized processing on PC/LANs and/or midrange platforms. Designing a system which does part of its processing centrally, and part locally, can insulate the Users somewhat from the impact of a disaster affecting the central site. User involvement in determining what processing should be done locally, and what data should be maintained locally, is essential. Before pursuing this option, however, two things must be remembered: first, a Recovery Plan will have to be developed for the local system as well as the central system; and, second, if the local system and central system are in the same geographical area, they may both be taken out by the same disaster.
DEVELOPMENT AND TESTING
In the traditional cycle, Users have little involvement in the actual development of the system, but do get involved in testing out the various components as they are delivered to them by the Systems People. This is to ensure that the components function as expected. Of course, they rarely do, with the result that they must be sent back to the Systems People for rework.
In the development cycle being espoused here, an interesting symbiotic relationship occurs. While the Systems People are developing the computerized routines, the Users are developing the bypass routines. Each side has to know what the other is doing in order to keep their efforts and approaches in sync. The checkout of one set of routines has to be matched to the checkout of the other set of routines. Hence, the development and testing of the system becomes a much more collaborative effort, likely producing a better system.
Needless to say, the Users must also be actively involved in development and testing of the various backup and retrieval mechanisms, and any specialized routines required for the reprocessing of lost transactions and reconciliation of the system after a disaster. Since these routines will only be exercised during real or simulated disasters, it is important that the Users prepare detailed instructions on how to use these routines (these instructions must, of course, be backed up offsite).
Implementation of a new system can be a traumatic time for Users and Systems People alike. It marks a transition from one way of conducting business to another, and, no matter how thorough the testing, the transition rarely goes as smoothly as planned. However, a system developed via the approach outlined here can be implemented with less trauma, and less risk of disrupting the business during the implementation, than a system developed via the traditional approach. It is likely, first of all, that increased User involvement in the development cycle will have produced a better system, with fewer unanticipated glitches. Secondly, the Users will be well positioned to cope with any glitches that do arise, since they need only fall back on their Bypass Procedures while the Systems People make the necessary corrections to the system.
It is almost to be hoped that glitches do arise, since it is essential that the Users exercise their Bypass Procedures under real-life conditions. If they do not arise, then one or more tests of the Bypass Procedures will need to be scheduled. It is, in fact, an excellent idea to schedule a full-scale test of the Users' ability to bypass the system for the entire length of time specified for the Disaster Window. This should be done after the new system has been thoroughly shaken down, but before the system has become a thoroughly ingrained part of the Users' day-to-day routines.
There will likely be some resistance to shutting down a brand new computer system unnecessarily. However, there can be considerable benefits to such a move: the Systems People can test their ability to recover the system from offsite backups under simulated disaster conditions; the Users can prove that their Bypass Procedures work under real-life conditions; and both the Users and Systems People can confirm that the System Restoration and Reconciliation routines work as expected. Most importantly, however, a highly visible exercise of this nature raises everyone's awareness of the importance of Disaster Planning.
In Part II of this article, we looked at involving Users in Recovery Planning during the development cycle for new systems. Recovery Planning, of course, cannot be confined to new systems only. When an organization first begins development of a Disaster Recovery Plan, there are usually dozens of critical systems that must be addressed all at once. While it would be nice to involve the Users of these systems in the Recovery Planning process right from the start, this is rarely practical. Of necessity, the Systems People must operate somewhat unilaterally, at least initially, to develop plans for recovering these systems from offsite backup. Since the existing systems were unlikely to have been designed with Disaster Recovery in mind, this may be a rather messy process, and it may be that the less the Users see at this stage, the better.
REVIEW OF RECOVERY
However, there is no reason not to involve the Users once the basic recovery requirements have been satisfied. In fact, User involvement again becomes mandatory, since there is no way that the Systems People, operating in isolation, will be able to develop Bypass Procedures, and System Restoration and Reconciliation routines. The logical starting point for User involvement is a detailed review of the backup cycles established by the Systems People, and the sequence of events that would occur during a recovery effort.
What this review should highlight immediately are the logistics of System Restoration, that is, the reprocessing of transactions required to restore the system to its status as of the time of the disaster, along with any special batch processing required to get the system back in sync with the normal business cycles. Working together, the Users and Systems People must attempt to optimize this process. This may involve improvements in the backup cycles, refinements in the Users' normal workflows, and modifications to the system to facilitate reentry of lost transactions and re-synchronization of batch processing cycles. The result of this exercise should be a thorough understanding, by the Users, of the System Recovery and Restoration processes, and incorporation of these processes into the overall recovery timeframe.
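For readers who find it easier to see the logistics of System Restoration in concrete form, the following is a minimal sketch, in Python, of reprocessing lost transactions on top of an offsite backup. It is an illustration only, not a description of any particular system: the Transaction record, the restore_system function, and the simple account-balance model are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    """A hypothetical business transaction captured by the Users."""
    timestamp: datetime
    account: str
    amount: int  # in cents; positive = credit, negative = debit


def restore_system(backup_balances, backup_time, disaster_time, transactions):
    """Rebuild balances as of the disaster by replaying lost transactions.

    Assumes the offsite backup reflects everything up to and including
    backup_time, so only transactions after the backup and up to the
    disaster need to be reentered.
    """
    balances = dict(backup_balances)  # start from the recovered backup
    replayed = 0
    for tx in sorted(transactions, key=lambda t: t.timestamp):
        if backup_time < tx.timestamp <= disaster_time:
            balances[tx.account] = balances.get(tx.account, 0) + tx.amount
            replayed += 1
    return balances, replayed


# Illustrative data: one transaction predates the backup and is skipped.
backup_time = datetime(1999, 1, 1, 0, 0)
disaster_time = datetime(1999, 1, 1, 12, 0)
log = [
    Transaction(datetime(1998, 12, 31, 23, 0), "A", 5000),
    Transaction(datetime(1999, 1, 1, 9, 0), "A", -3000),
    Transaction(datetime(1999, 1, 1, 10, 0), "B", 2000),
]
restored, count = restore_system({"A": 10000}, backup_time, disaster_time, log)
print(restored, count)
```

The sketch makes one point from the review visible: the more frequent the backup cycle, the fewer transactions fall between backup_time and disaster_time, and the shorter the reentry effort the Users must plan for.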
DEVELOPMENT OF BYPASS PROCEDURES
With this timeframe established, the Users now have an appreciation for the length of time that their system may be out of commission. They now must turn their attention to the development of Bypass Procedures required to maintain critical business functions during the outage. However, this must be a collaborative effort, since it is likely that the Systems People will have to make modifications to the system to support these procedures. At the very least, the Systems People will have to ensure that the Users have the capability of reconciling the system with any processing that may occur in Bypass Mode.
As with new systems, it may be highly desirable, at this stage, to test the Users' ability to function in Bypass Mode for a period of time. Again, there may be resistance to this move, but it is truly the proof of the pudding. If the Users, or their management, want assurance that they can survive an extended system outage, there is really no alternative. It is far better to confirm the viability of the Bypass Procedures under controlled, pre-planned circumstances, with the computer system still available to fall back on, than under the do-or-die circumstances of a real-life disaster.
The mere fact that such a test is even being considered can have a tremendous psychological effect. Many people, Users and Systems People alike, have difficulty in taking Disaster Planning seriously, and simply go through the motions. Many a Bypass Procedure has been developed, to satisfy some silly Audit requirement, without giving any serious consideration to the procedure's viability. It is amazing how much more attention to detail will be paid if it is known that the procedure may actually have to be used!
This psychological effect can also get the Users thinking beyond just the issue of computer system recovery. Once they have accepted the possibility of a disaster putting their systems out of commission, it does not require a giant mental leap to consider the possibility of a disaster putting their office facilities out of commission. This may then prompt them to begin taking stock of all their resources that either need to be backed up offsite or replaced in the event of disaster. If enough User departments start thinking proactively about this issue, it should start creating pressure, from the bottom up, to develop formal Facility Replacement plans. This, of course, is really what User involvement in Recovery Planning is all about - to get the Users on the DRP bandwagon so that effective Recovery Plans are being demanded by them, not inflicted on them.
PARTICIPATION IN SCHEDULED TESTING
The last area of User involvement that will be discussed is, in all likelihood, the first area that would actually be addressed. This is User involvement in the regularly scheduled tests of the overall Recovery Plan conducted by the Systems People. In the traditional top-down approach to DRP, the I.S. department typically reaches the Scheduled Testing stage long before anyone considers involving the Users via the processes outlined in the preceding sections. This is not entirely unreasonable, since preliminary testing efforts are usually focused, not on recovery of the Users' systems, but on recovery of the hardware and software platforms, and the telecommunications networks.
Once testing has progressed beyond these high level components of the plan, however, there is no longer any justification for ignoring the Users. In fact, these Scheduled Tests provide an excellent opportunity to introduce the Users to the wonderful world of Disaster Recovery Planning. A full-scale recovery test is actually a very impressive event, given the huge number of tasks that must be performed with clockwork precision. Publicizing these tests, through formal presentations to the User community, cannot fail to have a significant impact. Even though the technical details may be lost on them, a step-by-step review of the sequence of events that must occur during a test is bound to impress the Users with the effort that goes into protecting their computer systems from loss. This can be reinforced if desired, by arranging for key Users to tour the alternate processing facility and offsite storage facility.
Having thus created an awareness of these tests in the User community, the next step is to involve them in the process. This should begin with participation in the checkout of System Recovery. Two of the major goals of Scheduled Testing are, firstly, to verify that the Users' systems can be properly recovered from the offsite backups, and, secondly, to verify that the systems function normally under the new environment. This requires execution of a specific test plan for each system, and verification of the results. The Users should be given the opportunity to participate in the development of these test plans before the test, and evaluation of the test results after the test.
As the Users become familiar with the testing process, they should then be given the opportunity to participate during the test. Working out of their normal offices (at least initially), they would perform the checkout on-line, providing a more timely verification, and one that more closely approximates real-life conditions. This would also provide them with the opportunity to test out their ability to reprocess lost transactions. In subsequent tests, they could do their checkout from different locations, giving them a chance to assess the implications of losing their normal office facilities.
As User involvement in Scheduled Testing grows, it is reasonable to expect that they will go beyond being mere participants. The Users will likely begin setting their own test objectives, focused on resolving business issues, not technological issues. The testing will also become more integrated, as Users begin working with Users to address the implications of recovering multiple interrelated systems at the same time. Eventually, the testing may evolve to a two-team structure, with one DRP team planning and coordinating the efforts of the people responsible for the technology (i.e. the Systems People), and another DRP team planning and coordinating the efforts of the people responsible for the business (i.e. the Users).
Scheduled Tests are major events, involving considerable expenditure of time and money. Involving Users in these tests will make them even bigger, and more expensive, events. However, it is an expense with real payback, not just in terms of disaster preparedness, but in terms of effective utilization of technology. Once the Users have been converted to the discipline of Disaster Planning, they will, inevitably, view their computer systems in a new light. No longer will they accept systems developed via the traditional approach, or tolerate recovery exposures in their existing systems. They will insist upon systems being designed, or re-engineered, to ensure recoverability and the continuity of critical business functions. The result will be better systems.
Disaster Recovery Planning is a discipline that needs to address far more than just an organization's computer technology. There are many types of disasters that can afflict an organization, and the ability to survive any such eventuality depends upon addressing DRP from the overall business perspective, not just the computer perspective. An increasing awareness of this fact can be seen in the gradual replacement of the term Disaster Recovery Planning with terms like Business Resumption Planning and Business Continuity Planning, labels which clearly place the emphasis on the business.
Nevertheless, recovery of computer technology remains a critical component of Disaster Planning, and is the area that most organizations address first. This is not illogical, provided that the planning efforts do not stop once the technological issues have been addressed. The way to ensure this is to bring the Users of the technology into the equation. The Users must become active players in the planning and execution of regularly scheduled tests, and must share accountability, with the Systems People, for the recoverability of their computer systems. They must also accept full accountability for the continuity of critical business functions, and must take a proactive role in tackling the non-technology issues associated with Disaster Planning.
Recovery Planning cannot afford to be the exclusive domain of the Systems People. If they choose to make it so, by failing to bring the actual Users of the systems into the process, the impact on the organization may be disastrous.
David A. Johnson is President of the Toronto Chapter of the Disaster Recovery Information Exchange, and a member of Disaster Recovery Journal's Editorial Advisory Board.
Disaster Recovery World © 1999 and Disaster Recovery Journal © 1999 are copyrighted by Systems Support, Inc. All rights reserved. Reproduction in whole or part is prohibited without the express written permission from Systems Support, Inc.