The Disaster Recovery Walkthrough: An Exercise in Reality
- Published on Sunday, 28 October 2007 22:39
On November 3rd a pipe burst in the computer room ceiling, sending water cascading into the controller and resulting in the loss of the controller and some DASD. This coincided with a breakdown of the telephone system . . . or so said the note handed to all Data Processing employees that morning at approximately 10:00 a.m. The note did not come as a complete surprise, as the Data Processing Director had given warning that a mock disaster drill would soon take place in order to test the company’s fledgling Disaster Recovery Plan.
During the drill, the team members met in an established “command center” to report on the progress of their areas. After the drill, each team member was asked to comment on his role, others’ roles, and the plan in general; I have summarized the results here.
As might be expected, the primary purpose of a walkthrough is to test the disaster plan. However, the walkthrough also produced some unexpected and beneficial results. Besides the plan, the walkthrough tested people and their perception of how they, their respective areas, and other areas functioned within the department.
As regards testing people, the DASD Manager proved to be the key individual in the drill. He stated the technical situation and then precisely delineated his step-by-step approach to the problem. His importance was commented on by virtually all the other team members, who were also concerned by the absence of anyone to back him up.
The walkthrough also challenged participants to be creative, for it would be impossible for any workable plan to be so codified as to meet every contingency. This was nowhere better demonstrated than with the operations shift supervisor, who showed some realistic imagination when he reported, “We attempted to quiesce the system but were prevented due to sparks flying in the Machine Room. Therefore it was necessary to hit the emergency power button. We disconnected the Halon System and covered all equipment possible with plastic.” He then sent runners (as the telephone system was inoperative) to notify Physical Plant to shut off the water and to request pumps to remove it. Showing initiative, he sent an Operator out to buy hair dryers, using the Operations Manager’s credit card, in order to dry the equipment.
The participants’ perceptions of how the department functioned and how other sections of the department should function proved most instructive. The communications member noted that the disaster emphasized the true interactive nature of the department. However, reflecting his area, he believed that Operations needed to be more aware of their environment (power circuits, water pipes) and that Applications needed education in hardware/communications terminology. Indeed, this “jargon” manifested itself throughout the team members’ reports. One discovered remedy for the jargon problem proved to be the log book. It acted as a central recording device for all technical events, since the individual designated as “scribe” could not be everywhere at once and suffered from the same jargon problem that plagued other non-specialists.
The walkthrough tested not only the plan itself but also the elements necessary for a successful walkthrough. The operations shift supervisor noted that the message implementing the disaster drill made no mention of what application was to be running at the time of the disaster, a point well taken since a critical payroll job stream was actually running at the time.
Generally, the team members considered the drill a success; indeed, even a disastrous walkthrough would have been successful, since it would have pointed out the plan’s flaws. The walkthrough was beneficial on several levels. The Assistant Director was impressed by the preparedness of the Systems staff, especially the fact that they had multiple copies of essential backup tapes. He also stressed the need for CICS forward recovery for VSAM files and off-site storage (currently in place). His only criticism was the lack of total communication between participants. The Systems Manager noted that with the loss of SYS1.LINKLIB, Systems personnel would have been unable to log on to TSO. A skeleton TSO procedure and ID were set up to address this problem. Most of the walkthrough’s results fell within two areas: expected and unexpected.
Expected results generally manifested themselves in the establishment of an offsite storage facility, CICS forward recovery for VSAM files, and adequate backup for key personnel. As with any data processing product, the plan had “bugs” that had to be corrected.
It was, however, the unexpected results that proved to be most instructive. These amounted to a newly discovered sense of confidence that a disaster could be met and that the demonstrated flexibility, imagination, and professionalism of the staff could overcome minor plan flaws. The drill brought out a degree of professionalism theretofore unnoticed in some individuals. The drill also demonstrated the highly diverse nature of the data processing department as a whole, and the segregated areas of specialization within each area. Recognizing that, the walkthrough was useful in that it gave the department a project in which its members could function as a single unit, adding to a greater understanding of how their counterparts functioned.
For the individual charged with the development and maintenance of his department’s disaster plan, the walkthrough can have all the timeliness of yesterday’s newspaper and can fall victim to its own success. Remembrance of a successful walkthrough can breed a false sense of confidence. Personnel turnover, rapidly changing hardware and software, and out-of-date documentation all point up the need for the walkthrough, to be truly useful, to be a scheduled, periodic event. Only in this way can the disaster plan act as an educational tool for management and new employees and give the organization the knowledge and tested skills needed to reduce the effects of a true disaster.
Robert D. Hargrove is a contingency planner at the University of Texas Health Science Center.
This article was adapted from Vol. 2, No. 4, p. 11.