With the relatively recent onslaught of natural disasters such as Hurricane Hugo and the Loma Prieta earthquake, many businesses are realizing just how crucial it is to develop and update their disaster recovery plans. While this is a good first step, a plan is not an adequate precaution unless it is tested before, during, and after implementation. Testing is what demonstrates a plan’s effectiveness, so as much care should be exercised in testing the plan as in developing it. Time has a way of eroding a plan’s effectiveness for the following reasons:
- Environmental changes occur as organizations evolve, new products are introduced, and new policies and procedures are developed. Such changes can render a plan incomplete or inadequate.
- Hardware, software and other critical equipment change.
- Personnel may lose interest or forget critical parts of the plan.
- The organization may experience personnel turnover.
Therefore, the recovery plan must be tested realistically and periodically; such testing is also required by regulatory agencies. Some benefits of testing include:
- Determining the feasibility of the recovery process
- Verifying the compatibility of backup facilities
- Ensuring the adequacy of procedures relating to the various teams working in the recovery process
- Identifying deficiencies in existing procedures
- Training the various team managers and members
- Demonstrating the ability of the organization to recover
- Providing a mechanism for maintaining and updating the recovery plan
Training in the special and critical skills that may be required in a disaster situation is an important part of the process. These skills include first aid; fire extinguishing; use of emergency breathing equipment; evacuation of personnel, assets, and sensitive resources; emergency communications methods; and shutdown procedures for equipment, electricity, water, and gas. Educating and training recovery personnel in special, critical, and multiple skills can significantly affect both the success of the plan and the time required to execute it.
How realistic a test can be will vary, depending on factors such as the following:
- Physical size of the installation
- Sensitivity of the organization to data processing services
- Level of service required by users
- Time deemed acceptable for contingency processing and recovery
- Number of locations involved
- Cost to perform the test
Several types of testing can be performed by the organization, including structured walk-through testing, checklist testing, simulation testing, parallel testing, and full interruption testing. Disasters or problems that occur during the normal course of business should also be documented and included in the plan.
Structured Walk-Through Testing
During a structured walk-through test, disaster recovery team members meet to verbally walk through the specific steps of each component of the disaster recovery process as documented in the disaster recovery plan. The purpose of the structured walk-through test is to confirm the effectiveness of the plan and to identify gaps, bottlenecks or other weaknesses in the plan.
Checklist Testing
A checklist test determines if sufficient supplies are stored at the backup site, telephone number listings are current, quantities of forms are adequate, and a copy of the recovery plan and necessary operational manuals are available. Under this testing technique, the recovery team reviews the plan and identifies key components that should be current and available. The checklist test ensures that the organization complies with the requirements of the disaster recovery plan.
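A checklist test of this kind lends itself to a simple automated review. The following Python sketch is one possible illustration, not a prescribed method: the `ChecklistItem` fields, the sample items, and the 180-day verification window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChecklistItem:
    """One item the plan requires at the backup site (hypothetical structure)."""
    name: str
    required_qty: int    # quantity the plan calls for
    on_hand_qty: int     # quantity counted at the backup site
    last_verified: date  # date someone last confirmed the item is current


def checklist_gaps(items, today, max_age_days=180):
    """Return (item name, reason) pairs for every item that fails the check.

    max_age_days is an assumed threshold; the plan itself should set it.
    """
    gaps = []
    for item in items:
        if item.on_hand_qty < item.required_qty:
            gaps.append((item.name, "insufficient quantity"))
        if (today - item.last_verified).days > max_age_days:
            gaps.append((item.name, "not verified recently"))
    return gaps


# Illustrative data only.
items = [
    ChecklistItem("recovery plan copy", 1, 1, date(2024, 1, 10)),
    ChecklistItem("check stock forms", 500, 200, date(2024, 1, 10)),
    ChecklistItem("telephone number listing", 1, 1, date(2023, 3, 1)),
]
gaps = checklist_gaps(items, today=date(2024, 2, 1))
for name, reason in gaps:
    print(f"{name}: {reason}")
```

Keeping the quantity check and the currency check separate lets a single item be reported for both problems, which matches the two concerns a checklist test covers: adequate supplies and up-to-date listings.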
A combination of the checklist test and the structured walk-through test is suggested for initial testing to determine modifications to the plan before attempting more extensive testing.
Simulation Testing
During a simulation test, the organization simulates a disaster in a way that does not interrupt normal operations. A disaster scenario should take into consideration the purpose of the test, objectives, type of test, timing, scheduling, duration, test participants, assignments, constraints, assumptions, and test steps. Testing can include the notification procedures, temporary operating procedures, and backup and recovery operations. During a simulation, the following elements should be thoroughly tested: hardware, software, personnel, data and voice communications, procedures, supplies and forms, documentation, transportation, utilities (power, air conditioning, heating, ventilation), and alternative site processing. It may not be practical or economically feasible to perform certain tasks during a simulated test (e.g., extensive travel, moving equipment, eliminating voice or data communication).
Parallel Testing
A parallel test can be performed in conjunction with the checklist test or simulation test. Under this scenario, historical transactions, such as yesterday’s transactions, are processed against the preceding day’s backup files at the contingency processing site or hot-site. All reports produced at the alternate site for the current business date should agree with those reports produced at the existing processing site.
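The comparison of reports from the two sites can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each site's reports are reduced to a dictionary of control totals, and the report names and amounts are invented for the example.

```python
from decimal import Decimal


def reconcile_reports(primary, alternate):
    """Compare report totals keyed by report name.

    Returns a list of (report, primary_total, alternate_total) tuples for
    every report whose totals differ or that is missing at one site
    (a missing total appears as None).
    """
    discrepancies = []
    for report in sorted(set(primary) | set(alternate)):
        primary_total = primary.get(report)
        alternate_total = alternate.get(report)
        if primary_total != alternate_total:
            discrepancies.append((report, primary_total, alternate_total))
    return discrepancies


# Hypothetical control totals for the same business date at each site.
primary_totals = {
    "general_ledger": Decimal("10482.17"),
    "deposits": Decimal("7350.00"),
}
alternate_totals = {
    "general_ledger": Decimal("10482.17"),
    "deposits": Decimal("7349.00"),
}

mismatches = reconcile_reports(primary_totals, alternate_totals)
for report, p, a in mismatches:
    print(f"{report}: primary={p} alternate={a}")
```

`Decimal` is used instead of floating point so that monetary totals compare exactly. An empty result means the alternate site reproduced the primary site's reports.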
Full-Interruption Testing
A full-interruption test activates the total disaster recovery plan. This test is costly and could disrupt normal operations. Therefore, it should be approached with caution.
Adequate time must be scheduled for the testing. Initial tests should not be scheduled at critical points in the normal processing cycle, such as the end of the month. The duration of the test should be set in advance so that response times can be measured against it.
Various test scenarios could be planned to identify the type of disaster, the extent of damage, recovery capability, staffing and equipment availability, backup resource availability, and time/duration of the test. The test plan should identify the persons responsible and the time they need to perform each activity. However, only part of the plan should be tested initially. This approach identifies the workability of each part before attempting a full test. Also, it may be best at first to test the plan after normal business hours or on weekends to minimize disruptions. Eventually, unannounced tests can be performed to emphasize preparedness.
For organizations with a relatively new plan, a quarterly or semiannual test may be prudent for the first year. After this initial period, semiannual or annual tests should be required as a matter of policy.
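The testing frequency described above can be turned into a simple calendar. The Python sketch below assumes quarterly tests during the first year and semiannual tests afterward, which is only one of the combinations the text allows; the function names and the adoption date are hypothetical.

```python
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day to 28 so it is always valid."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    return date(year, month, min(d.day, 28))


def build_test_schedule(plan_adopted, years=3, first_year_interval=3, later_interval=6):
    """List test dates, in months after adoption.

    first_year_interval applies through month 12; later_interval applies
    after that. Both defaults are assumptions chosen for illustration.
    """
    dates = []
    months = first_year_interval
    while months <= 12 * years:
        dates.append(add_months(plan_adopted, months))
        months += first_year_interval if months < 12 else later_interval
    return dates


schedule = build_test_schedule(date(2024, 1, 15), years=2)
for test_date in schedule:
    print(test_date.isoformat())
```

Passing `first_year_interval=6` and `later_interval=12` gives the less frequent alternative of semiannual tests followed by annual ones. Dates that land at critical points in the processing cycle, such as month-end, would still need to be moved by hand.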
EVALUATION OF TEST RESULTS
Personnel should log events during the test that will help evaluate the results.
The testing process should provide feedback to the disaster recovery team to ensure that the plan is adequate.
The recovery team, which normally consists of key management personnel, should assess test results and analyze recommendations from various team leaders regarding improvements or modifications for the plan.
It is essential to quantitatively measure the test results, including:
- Elapsed time to perform various activities
- Accuracy of each activity
- Amount of work completed
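These three measures can be computed directly from the event log kept during the test. The following Python sketch shows one way to do so; the `ActivityLog` fields and the sample figures are assumptions for illustration, and an organization would substitute whatever its own log records.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ActivityLog:
    """Logged facts about one recovery activity (hypothetical structure)."""
    name: str
    started: datetime
    finished: datetime
    items_planned: int    # units of work the activity was supposed to cover
    items_completed: int  # units actually finished during the test
    items_correct: int    # finished units that were verified as correct


def summarize(log):
    """Return elapsed time, share of work completed, and accuracy for one activity."""
    elapsed_minutes = (log.finished - log.started).total_seconds() / 60
    completion = log.items_completed / log.items_planned if log.items_planned else 0.0
    accuracy = log.items_correct / log.items_completed if log.items_completed else 0.0
    return {
        "activity": log.name,
        "elapsed_minutes": elapsed_minutes,
        "completion": completion,
        "accuracy": accuracy,
    }


# Illustrative figures only.
log = ActivityLog(
    name="restore customer files",
    started=datetime(2024, 3, 2, 8, 0),
    finished=datetime(2024, 3, 2, 9, 30),
    items_planned=40,
    items_completed=36,
    items_correct=33,
)
summary = summarize(log)
print(summary)
```

Accuracy is measured here against completed work rather than planned work, so that an activity that ran out of time is not also penalized as inaccurate. Recording the same measures at every test makes it possible to compare results from one test to the next.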
The results of the tests will most likely lead to changes in the plan.
These changes should enhance the plan and provide a more workable recovery process.
Testing the disaster recovery plan should be efficient and cost-effective. It provides a means of continually increasing the level of performance and quality of the plan and the people who execute it.
A carefully tested plan provides the organization with the confidence and experience necessary to respond to a real emergency.
Disaster recovery plan testing should consider scheduled and unscheduled tests for both partial and total disasters.
Geoffrey H. Wold is the National Director of Information Systems Consulting for the CPA/consulting firm of McGladrey & Pullen. He specializes in providing a wide range of planning, operational, and EDP-related services for financial institutions.
This article is adapted from Vol. 3 No. 3, p. 34.