Question:
I have always thought that the best IT Disaster Recovery Test is one that
is tested to failure. Others in my organization believe we need to make it
a SAFE test so as not to waste test time. I prefer to test real-life
scenarios, such as not duplicating all your tapes beforehand so that you
have extra copies in case of media errors.
What is the best practice for IT disaster recovery testing? We are a large
iSeries shop and rely on our daily backup tapes for our recovery. I would
like to see us not prepare our tests for success but use a real-life
situation.
Answer 1:
This is an excellent question. We are having a similar discussion at my
company.
With limited test times and resources there is a strong argument for having
a stable test so that all the various applications and systems can have a
shot. If you have a central point of failure during the test that prevents
the other applications or systems from being tested, you may have lost more
than you gained.
The other argument is just as strong: if we don't fully replicate a
disaster environment, then we are not truly testing our overall capability.
We are looking at creating a testing schedule that allows us to conduct
smaller isolated tests in-house for those applications that require at
least one test per year. Having satisfied a regulatory requirement and
given those teams a full shot at testing their plans, we can do the larger
"real recovery" test to ensure that those critical single points of failure
are recoverable.
This may allow us to have it both ways. Since we are still in the
discussion mode I would like to read others' responses as well.
Please contact me if you have questions.
Thanks,
Edward H. Pearce CBCP
Answer 2:
The answer to testing for failure or testing for
success is yes. Both are critical requirements and both goals can be achieved
successfully. The key is in establishing complete recovery test objectives
and then doing an honest assessment in terms of likelihood of success
and identifying any gaps. Test time at a hot site is very expensive and
difficult to schedule so planning for success is critical. Test objectives
should be established in terms of primary, secondary, and tertiary.
Let's use the tapes as an example. The test objective might be to perform
the recovery from the last backup tapes. The problem is that the organization
might not send the tapes offsite because they fear that the tapes will
be needed for a production recovery, so duplicate tapes are created. Using
the duplicates should be identified as a deviation from the test plan. The
deviation should then be evaluated to determine if this is acceptable
because the primary tapes would actually be available in a disaster. If
the primary tapes would not be available in a disaster, then a deficiency
resolution plan needs to be created to make duplicate tapes a part of
the normal operational process.
"Rigging the test" for success is viewed negatively in the industry
and by auditors, but identifying the deviations and developing a resolution
plan or identifying why the deviation is acceptable is an accepted practice.
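As a rough illustration of the deviation rule described above, here is a minimal Python sketch. The `Deviation` record, its field names, and the `evaluate` function are hypothetical and exist only for this example; a real test plan would track far more detail.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    ACCEPTED = "accepted"                       # deviation is tolerable as-is
    NEEDS_RESOLUTION = "needs resolution plan"  # a deficiency must be fixed


@dataclass
class Deviation:
    """One departure from the written test plan (hypothetical record)."""
    description: str
    # Would the resource the plan calls for really be there in a disaster?
    planned_resource_available_in_disaster: bool


def evaluate(deviation: Deviation) -> Disposition:
    """Apply the rule from the tape example to a single deviation."""
    if deviation.planned_resource_available_in_disaster:
        return Disposition.ACCEPTED
    return Disposition.NEEDS_RESOLUTION


# Two made-up deviations for the same test, differing only in where the
# primary tapes normally live.
deviations = [
    Deviation("Restored from duplicate tapes; primaries are stored offsite",
              planned_resource_available_in_disaster=True),
    Deviation("Restored from duplicate tapes; primaries stay in the data center",
              planned_resource_available_in_disaster=False),
]

for d in deviations:
    print(f"{d.description}: {evaluate(d).value}")
```

The point of recording every deviation, even an accepted one, is that the test report then shows what was rehearsed and what was assumed.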
It is important to progress through a test as far as possible to expose
any unexpected problems. When the unexpected is encountered, you will
have tested to failure.
Dave Ziev
Answer 3:
Testing recovery procedures is analogous to testing a new computer system,
and typically goes through four stages:
Standalone Testing: Individual procedures are tested to confirm they work
as planned. If 'bugs' are found, you may work around them to get to EOJ
(end of job) so that you can test as much of the procedures as possible.
After the bugs are fixed, you test again ... and again and again if
necessary.
Integrated Testing: Once you have confirmed that individual procedures work
as planned, you conduct integrated tests to confirm that the procedures
work together as planned. Again, if problems are found, you may work around
them to get to EOJ. After the bugs are fixed, you again perform standalone
tests, then repeat the integrated tests.
Stress Testing: Once everything seems to be working together as planned,
you conduct stress tests to try to make the procedures fail (e.g. high
volumes, abnormal conditions, missing links). This is the stage where you
are testing for failure rather than testing for success. Many different
types of stress test may be required to approximate the various situations
that may exist in an actual disaster. If problems are found, you fix them,
redo the standalone and/or integrated tests, then the stress tests.
Maintenance Testing: Once you have confirmed that everything works together
as planned, under both normal and abnormal conditions, your testing can be
set up on a more routine, repeatable basis to confirm that all procedures
continue to work as expected. Obviously, your Change Management procedures
are critical at this stage. If a significant change has been made to the
recovery procedures, or the environment that you are recovering, you may
have to go back to square one and redo the standalone and/or integrated
tests, and possibly even the stress tests.
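The way the four stages feed back into each other can be sketched in a few lines of Python. This is only an illustration: the `Stage` enum and `next_stage` function are hypothetical names, and the sketch assumes the most conservative reading of "standalone and/or integrated", namely that any failure or significant change sends you back to standalone testing.

```python
from enum import IntEnum


class Stage(IntEnum):
    STANDALONE = 1
    INTEGRATED = 2
    STRESS = 3
    MAINTENANCE = 4


def next_stage(current: Stage, passed: bool) -> Stage:
    """Return the stage to run next.

    Assumption: a failure (or a significant change, treated here as a
    failed maintenance test) restarts the cycle at standalone testing.
    """
    if not passed:
        return Stage.STANDALONE
    # Maintenance testing repeats routinely once it has been reached.
    return Stage(min(current + 1, Stage.MAINTENANCE))


# Hypothetical run: the first stress test exposes a problem.
results = [True, True, False, True, True, True]
stage = Stage.STANDALONE
history = [stage.name]
for passed in results:
    stage = next_stage(stage, passed)
    history.append(stage.name)

print(" -> ".join(history))
```

In practice you might return only to integrated testing after a stress failure; which stage to fall back to is a judgment call that depends on what the fix touched.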
Needless to say, this can add up to a lot of testing, so designing your
tests for maximum effectiveness and efficiency is paramount. It is also
essential that you try to select recovery strategies that simplify testing.
For example, it is a lot easier to test recovery of a critical database if
that database is being replicated remotely than if it has to be restored
from weekly and daily tape backups.
Hope this helps.
Dave Johnson
Answer 4:
There have already been some great answers to this question, but I have an
additional suggestion that might be helpful.
It is frequently difficult for the planner to obtain buy-in from technical
staff and even their managers on a number of issues and levels, particularly
the one you are expressing with the backups and testing. Obviously, at
least to me, the real concern regarding testing is not making it "safe"
but avoiding embarrassing exposures of possible flaws in procedures and
processes. I think one way to deal with your situation is to follow some
of the recommendations from the previous posts regarding doing isolated
or standalone tests, explaining to the staff that it is a means to get
where you need to be and pose it to them as a way to work together to
prepare for "integrated" and "stress" testing. I would
recommend that you keep the initial steps and tests as far off the radar
screen as possible which will make staff more comfortable with the process
and see you as a "team player" rather than a threat. Once the
more basic problems are addressed and staff are comfortable with their
procedures you can move forward to the larger scale tests which will also
by their nature be more visible to the organization and senior management.
Good Luck
Dean A. Izett (CBCP)
The responses reflect the views of the individual EAB member, and do not
necessarily reflect the views of their employers, the DRJ, or the EAB as a
whole.