Question:

I have always thought that the best IT disaster recovery test is
one that is tested to failure. Others in my organization believe we need to
make it a SAFE test so as not to waste test time. I prefer to test real-life
scenarios, for example by not duplicating all the tapes beforehand just to
have extra copies in case of media errors.

What is the best practice for IT disaster recovery testing? We are a large
iSeries shop and rely on our daily backup tapes for our recovery. I would
like to see us stop preparing our tests for success and instead use a
real-life situation.


Answer 1:

This is an excellent question. We are having a similar discussion at my
company.

With limited test time and resources, there is a strong argument for having
a stable test so that all the various applications and systems can have a
shot. If a central point of failure during the test prevents the other
applications or systems from being tested, you may have lost more than you
gained.

The other argument is just as strong: if we don't fully replicate a
disaster environment, then we are not truly testing our overall capability.

We are looking at creating a testing schedule that allows us to conduct
smaller, isolated tests in-house for those applications that require at
least one test per year. Having satisfied the regulatory requirement and
given those teams a full shot at testing their plans, we can then do the
larger "real recovery" test to ensure that the critical single points of
failure are recoverable.

This may allow us to have it both ways. Since we are still in
discussion mode, I would like to read others' responses as well.

Please contact me if you have questions.

Thanks,
Edward H. Pearce CBCP


Answer 2:

The answer to "testing for failure or testing for success?" is yes. Both are critical requirements, and both goals can be achieved. The key is to establish complete recovery test objectives and then make an honest assessment of the likelihood of success, identifying any gaps. Test time at a hot site is very expensive and difficult to schedule, so planning for success is critical. Test objectives should be ranked as primary, secondary, and tertiary.

Let's use the tapes as an example. The test objective might be to perform the recovery from the last backup tapes. The problem is that the organization might not send those tapes offsite for the test because it fears they will be needed for a production recovery, so duplicate tapes are created. Using the duplicates should be identified as a deviation from the test plan. The deviation should then be evaluated to determine whether it is acceptable because the primary tapes would actually be available in a disaster. If the primary tapes would not be available in a disaster, then a deficiency resolution plan needs to be created to make duplicate tapes a part of the normal operational process.
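
The evaluation described above can be written down as a small decision rule. The following is only a sketch in Python; the `Deviation` record, its field names, and the wording of the returned actions are hypothetical and are not part of any standard or tool.

```python
from dataclasses import dataclass


@dataclass
class Deviation:
    """A departure from the written test plan (hypothetical record shape)."""
    description: str
    # Would the primary resource really be available in a disaster?
    primary_available_in_disaster: bool


def evaluate_deviation(dev: Deviation) -> str:
    """Return the follow-up action for a recorded deviation."""
    if dev.primary_available_in_disaster:
        # The test differed from the plan, but a real recovery would not.
        return "acceptable: document the rationale"
    # The test only succeeded because of something a disaster would remove.
    return "deficiency: create a resolution plan"


dup_tapes = Deviation(
    description="Recovered from duplicate tapes, not the last backup tapes",
    primary_available_in_disaster=False,
)
print(evaluate_deviation(dup_tapes))
```

The point of recording the deviation as data is that it appears in the test report either way, with a rationale or with a resolution plan.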

"Rigging the test" for success is viewed negatively in the industry and by auditors, but identifying the deviations and developing a resolution plan or identifying why the deviation is acceptable is an accepted practice. It is important to progress through a test as far as possible to expose any unexpected problems. When the unexpected is encountered, you will have tested to failure.

Dave Ziev


Answer 3:

Testing recovery procedures is analogous to testing a new computer system,
and typically goes through 4 stages:

Standalone Testing: Individual procedures are tested to confirm they work
as planned. If 'bugs' are found, you may work around them to get to EOJ
(end of job) so that you can test as much of the procedures as possible.
After the bugs are fixed, you test again ... and again and again if
necessary.

Integrated Testing: Once you have confirmed that individual procedures work
as planned, you conduct integrated tests to confirm that the procedures
work together as planned. Again, if problems are found, you may work around
them to get to EOJ. After the bugs are fixed, you again perform standalone
tests, then repeat the integrated tests.

Stress Testing: Once everything seems to be working together as planned,
you conduct stress tests to try to make the procedures fail (e.g. high
volumes, abnormal conditions, missing links). This is the stage where you
are testing for failure rather than testing for success. Many different
types of stress test may be required to approximate the various situations
that may exist in an actual disaster. If problems are found, you fix them,
redo the standalone and/or integrated tests, then the stress tests.

Maintenance Testing: Once you have confirmed that everything works together
as planned, under both normal and abnormal conditions, your testing can be
set up on a more routine, repeatable basis to confirm that all procedures
continue to work as expected. Obviously, your Change Management procedures
are critical at this stage. If a significant change has been made to the
recovery procedures, or the environment that you are recovering, you may
have to go back to square one and redo the standalone and/or integrated
tests, and possibly even the stress tests.
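
The retest loop that runs through these four stages can be sketched in a few lines of Python. This is a conservative reading of the description above, in which every earlier stage is rerun after a fix; the text allows "standalone and/or integrated", so a real schedule may skip some of them. The stage names and the `retest_sequence` function are illustrative only.

```python
from typing import List

# The four stages, in the order they are normally performed.
STAGES = ["standalone", "integrated", "stress", "maintenance"]


def retest_sequence(failed_stage: str) -> List[str]:
    """Stages to rerun, in order, after fixing a problem found in failed_stage.

    Assumption: rerun every stage up to and including the one that failed.
    """
    if failed_stage not in STAGES:
        raise ValueError("unknown stage: " + failed_stage)
    return STAGES[: STAGES.index(failed_stage) + 1]


print(retest_sequence("integrated"))  # ['standalone', 'integrated']
print(retest_sequence("stress"))      # ['standalone', 'integrated', 'stress']
```

Passing "maintenance" models a significant change found during routine testing: the sequence goes back to square one and works forward through all four stages.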

Needless to say, this can add up to a lot of testing, so designing your
tests for maximum effectiveness and efficiency is paramount. It is also
essential that you try to select recovery strategies that simplify testing.
For example, it is a lot easier to test recovery of a critical database if
that database is being replicated remotely than if it has to be restored
from weekly and daily tape backups.

Hope this helps.

Dave Johnson


Answer 4:

There have already been some great answers to this question, but I have an additional suggestion that might be helpful.

It is frequently difficult for the planner to obtain buy-in from technical staff, and even from their managers, on a number of issues and levels, particularly the one you describe with backups and testing. Obviously, at least to me, the real concern regarding testing is not making it "safe" but avoiding embarrassing exposure of possible flaws in procedures and processes. One way to deal with your situation is to follow the recommendations in the previous posts about doing isolated or standalone tests. Explain to the staff that these are a means to get where you need to be, and present them as a way to work together to prepare for "integrated" and "stress" testing. I would recommend that you keep the initial steps and tests as far off the radar screen as possible. This will make staff more comfortable with the process and help them see you as a "team player" rather than a threat. Once the more basic problems are addressed and staff are comfortable with their procedures, you can move forward to the larger-scale tests, which by their nature will also be more visible to the organization and senior management.

Good Luck

Dean A. Izett (CBCP)


The responses reflect the views of the individual EAB member, and do not necessarily reflect the views of their employers, the DRJ, or the EAB as a whole.