12 Point Check-up
by Jan Persson, CDP
INTRODUCTION
My formal training in the area of disaster recovery began in 1980 following
a near miss in the computer center. A building under construction
next door to the data center (which housed a very large central mainframe
and disk system) dropped about 12 feet of 4x4 construction supports
from the 30th floor. It was a rainy evening in Chicago when the incident
occurred. The material came through the roof, pierced the ceiling,
and stopped about 2 feet above the CPU. Only by luck was no one injured.
Power of course was shut off immediately. The CPU was covered but still
sustained some water and debris contamination. Should we have been more
careful, you say? We were. Our lawyer issued a letter of concern, which
resulted in 2 additional layers of heavy-duty lumber covering the entire
roof. Our insurance was checked and found to be acceptable. And we routinely
kept an eye on the safety precautions next door. It didn't matter;
the unexpected happened.
The following day the VP of IS informed the company management of the
incident and stressed what the impact would have been had the CPU been
severely damaged, which certainly could have happened. The outage could well
have been 30 days. His prior mention of disaster recovery now received
more focused attention. The CEO, who was a major system user, understood
the implications, and within several weeks a Hot Site subscription
was in place and a first test scheduled.
The point to be made from this brief story is that a good DRP program
should plan for the unexpected. The potential sources of disaster
are far more numerous today than in 1980. It is simply prudent to review the
DRP in terms of all areas that might well be included. Some may seem
out of scope but in fact are not. In the end, a probing
review of the DRP may well surface some issues that, if addressed today,
might well prevent a disaster tomorrow.
Hopefully this article will reach outside the more traditional views
and identify a few points you may want to look at in an effort to strengthen
the plan and keep pace with change.
ASSUMPTIONS
This article is not intended to be a comprehensive critique of all plan
elements and requirements. Please keep in mind a few assumptions as
you go through the article:
- It is assumed that you have a DRP in place today.
- It does not attempt to delineate between large, medium and small computer
centers. That would make it too complex and lengthy. Use what applies
to you.
- It presents points that most DRPs must address but does not
try to apply weights to the different points. It is assumed that some
are more critical than others in various centers.
- This article is not meant to be a comprehensive DRP audit. It is simply
a number of key points (an even dozen) that can be used to demonstrate
a sense of "Readiness."
RATING SCALE
Just for fun, I decided to give you an opportunity to think about each
point and give yourself a rating. Typically, when I do training sessions,
I ask people the questions and then say, "How do you feel about
where you are?" For this exercise, use the following rating scale:
5 = It has been proven to me that we have this point covered, period!
4 = I have been told that this point is covered and I tend to agree.
3 = I feel we meet the minimum requirements for this point.
2 = I'm not comfortable that we have this covered as well as we should.
1 = I know this is an exposure and I'm very worried about it.
CONFIRMATION CHECKUP
Over the past 16 years, since I started my disaster recovery consulting
practice, I've been in hundreds of computer centers and have written
and audited over 200 Plans. More recently I've been doing a lot
of disaster recovery training seminars and workshops, which gives me
even more exposure to different sites and their overall readiness.
I have learned to ask a few key questions that indicate to me how
prepared a site really is. I call them the "Basics." In short,
if these are not done well, the plan has little chance of performing
as expected.
I call this first group of points the "Confirmation Checkup."
Too often I see people gloss over these with generalized comments
like:
- "Yes, we've tested our plan. We were able to bring up the
Operating System at the Hot Site."
- "Sure, we back up our data. The daily backups are on-site and the
weekly backups go offsite."
- "We know what's critical. Our I.S. staff did a list about
3 years ago."
- "I think our realistic recovery goal is 72 hours. Management
probably thinks it is all done in a few hours."
PROTECTION CHECKUP
It's just like the old adage, "An Ounce of Prevention." A successful
DR Planner always has an eye on prevention and on any other area that can
reduce the risk of a disaster. This is probably more true today than ever.
The computer center, while locked and staffed, is still vulnerable to
outside attacks from people you can't even see. Disaster avoidance
is a part of DRP. Let's face it, when the systems are unavailable for
any reason, the end users don't really care what caused the outage:
hackers, fire, power outage, virus attack, whatever. They just know
that, to them, it's a disaster when payroll is due out in 1 hour and
the system is down.
What we don't need to hear are comments like:
- "We keep the doors propped open; it makes it easier for everyone."
- "You know, changing passwords is a real pain in the neck. That's
why we only do it once a year."
- "What do you mean you found active IDs and passwords for
30 people who were terminated last year?"
- "I think we should just give everyone free access to the Internet.
Downloading neat stuff is sort of a company perk. Besides, it's
a lot faster on the high speed line at work."
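The third comment describes a check that is easy to automate. As a minimal sketch, assuming you can export the list of active login IDs and the list of terminated employees from your own systems (the function name and the sample IDs below are hypothetical), a set comparison shows which accounts should already have been disabled:

```python
def orphaned_accounts(active_ids, terminated_ids):
    """Return, sorted, the login IDs still active for terminated people."""
    # Normalize case and whitespace so "CLEE" and "clee " match.
    active = {i.strip().lower() for i in active_ids}
    terminated = {i.strip().lower() for i in terminated_ids}
    return sorted(active & terminated)


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    active = ["asmith", "bjones", "CLEE", "dkim"]
    terminated = ["clee", "evans", "bjones "]
    for login in orphaned_accounts(active, terminated):
        print("still active after termination:", login)
```

Run on a schedule, a check like this turns a once-a-year surprise into a routine report.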
EXPANSION CHECKUP
Do not under any circumstances be lulled into complacency with a DR
Plan that is "Finalized, Done and Tested." Good DR Planners
constantly have their antennae up and can pick up blips on the radar
screen. Good planners also know how and when to stick their nose into
any area that implements a change that would affect the
recovery capability. Once again, if the business impact (BIA) is known,
the job is easier.
Comments like these are just not acceptable anymore:
- "I know it's critical but it's not in my area. We'll
wait until they have a problem to get involved."
- "No one told me they added 45 more disk drives to support the critical
applications."
- "Now we've got some new critical equipment installed that
the Hot Site doesn't even offer."
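Surprises like the last two can be caught by comparing what production runs today with what the Hot Site subscription covers. The sketch below assumes both inventories are available as simple name-to-quantity mappings. The function name, the equipment names and the quantities are hypothetical; only the figure of 45 added disk drives echoes the comment above.

```python
def hot_site_gaps(production, hot_site):
    """Return {item: shortfall} for equipment the Hot Site cannot match."""
    gaps = {}
    for item, needed in production.items():
        available = hot_site.get(item, 0)  # 0 if not offered at all
        if available < needed:
            gaps[item] = needed - available
    return gaps


if __name__ == "__main__":
    # Hypothetical inventories for illustration only.
    production = {"disk drive": 145, "tape drive": 8, "optical jukebox": 1}
    hot_site = {"disk drive": 100, "tape drive": 8}
    for item, short in hot_site_gaps(production, hot_site).items():
        print(f"Hot Site is short {short} x {item}")
```

The comparison is only as good as the production inventory, which is one more reason for the DR Team to be in the loop on every change.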
Calculate your point score as follows:
1. Add up the score you applied for each point.
2. Apply the following assessment:
12 to 24 = A lot of work needs to be done. A Lot!
24 to 36 = Needs improvement. Not a good comfort level.
36 to 48 = Not bad. In fact pretty darn good.
48 to 60 = Excellent work! You are to be commended.
Caution: More important than the score is a good sense of what points
need to be strengthened to create a better plan.
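The two scoring steps can be sketched in a few lines of Python. This is only an illustration. The function name `assess` is hypothetical, and because the scale lists the boundary scores (24, 36 and 48) in two bands each, the sketch makes an assumption that is not part of the original scale: a boundary score is placed in the higher band.

```python
# Assessment bands from the scale above, as (lowest score, comment).
# Assumption: a boundary score (24, 36, 48) falls in the higher band.
BANDS = [
    (48, "Excellent work! You are to be commended."),
    (36, "Not bad. In fact pretty darn good."),
    (24, "Needs improvement. Not a good comfort level."),
    (12, "A lot of work needs to be done. A Lot!"),
]


def assess(ratings):
    """Sum twelve 1-to-5 ratings and return (total, assessment)."""
    if len(ratings) != 12:
        raise ValueError("expected one rating for each of the 12 points")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each rating must be a whole number from 1 to 5")
    total = sum(ratings)
    # The lowest possible total is 12, so one band always matches.
    for floor, comment in BANDS:
        if total >= floor:
            return total, comment


if __name__ == "__main__":
    # Hypothetical ratings for points 1 through 12.
    total, comment = assess([5, 4, 3, 3, 4, 2, 3, 3, 2, 3, 4, 3])
    print(total, "-", comment)
```

Keeping the individual ratings, not just the total, matters more: the low-rated points are the ones to strengthen first.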
FINAL COMMENTS
It's a question of survival. And let's face it, DRP plays a major
role in survival. It is my belief that DRP has moved well past the early
days (last 15 years) of the rather mundane tasks of: write a plan, backup
the data, and go to the Hot Site (often by ourselves) and test once
a year or every two years. We are in fact the "Sentry" for
business survival. We need to ask the difficult questions, stick our
nose in when we sniff a problem, and be a proactive, optimistic, and
positive component of the company.
I read a recent (January 2001) trade journal article that presented
some relevant statistics (I like statistics). It indicated that today
over 50% of corporations have an RTO of less than 24 hours. To me, that
means critical. Can your plan meet the goal of 8 or 10 hours to be operational?
Is data mirroring in your future?
Another recent (also January 2001) article strongly suggested that E-Commerce
will not only survive but will play a key role in many companies' survival.
The growth forecast is astounding. E-Business technology is beginning
to show great rewards in cost savings and extended sales growth. Is
your E-Business plan in place? Will it recover in one or two hours?
Yet another article (an October 2000 White Paper) presents a case for
Network Storage given the "Storage Explosion." Rates are projected
to drop from $.30/MB to $.01/MB by 2005. How do you balance these costs
against the cost to recover 1, 2 or even 3 full days of lost input?
Where do SANs, LANs and GANs fit in your plan?
As we move along in this "DRP Arena" we need to keep a diligent
eye on the "Basics," protect the assets, and expand to keep pace.
| Point | Confirmation Checkup | Your Rating (1 to 5) | Give yourself a "5" if: |
| 1 | Business Impact: Confirm that all Critical Business Functions are identified using a BIA approach with end user input and are covered in the Recovery Plan regardless of the platform they run on. A Recovery Time Objective (RTO) such as 24 hours is identified. $$$ Loss is clearly stated. | | You can produce a complete, current application list and a BIA done within the last 12 months. And your plan includes a specific RTO. |
| 2 | Data Backup: Confirm that computer data is backed up on a regular basis, and/or mirrored to an alternate site. All platforms are covered and the backup media (usually tape) is sent offsite immediately upon creation. Multiple copies or generations provide a fallback should any tapes be missing, damaged, or unreadable. | | You can produce a complete set of backups (or mirrored disk) that would be able to totally rebuild all platforms, synchronized to the proper point. |
| 3 | Recovery Window: Confirm that, based on the RTO (Recovery Time Objective) as stated from the BIA, the data backups can support the goal based solely on media kept offsite. RPO is the Recovery Point Objective, i.e. the ability to restore to a specific point. The goal is an RTO that is achievable given the RPO. If your goal is 24 hours, 5-day-old data is a big problem. | | This one is simple. Give yourself a 5 if you have tested this and it worked as expected. For all platforms! |
| 4 | Testing: Confirm that comprehensive testing of all critical platforms, critical applications and network components in an alternate location (usually a Hot Site) is complete and accurate based on actual documented test results. | | You earn a 5 if you test once or twice a year, include end users in duplicating critical application environments and meet the RTO. |
| 5 | Executive Concurrence: Confirm that the CIO and often Sr. Management have been a party to the decisions and financial commitment to provide a DRP and they agree with the recovery parameters. | | You earn a 5 here if your Sr. Management, CIO (and in some cases the Board) have signed off on the plan goals, specifically the RTO. |
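Point 3 (Recovery Window) above reduces to two comparisons, sketched below. The function and parameter names are hypothetical, and the RPO and restore-time values in the example are assumed for illustration; in practice the restore time is something you measure in a test. The example uses the 24-hour goal and the 5-day-old data mentioned under Point 3.

```python
from datetime import timedelta


def recovery_window_problems(rto, rpo, restore_estimate, offsite_backup_age):
    """List the reasons offsite backups cannot support the recovery goal.

    rto: maximum acceptable time to be operational again
    rpo: maximum acceptable age of the data once restored
    restore_estimate: measured time to restore from offsite media
    offsite_backup_age: age of the newest backup kept offsite
    """
    problems = []
    if restore_estimate > rto:
        problems.append("restore takes longer than the RTO")
    if offsite_backup_age > rpo:
        problems.append("offsite data is older than the RPO")
    return problems


if __name__ == "__main__":
    issues = recovery_window_problems(
        rto=timedelta(hours=24),
        rpo=timedelta(hours=24),               # assumed for illustration
        restore_estimate=timedelta(hours=18),  # assumed for illustration
        offsite_backup_age=timedelta(days=5),
    )
    print(issues or "recovery window is supportable")
```

An empty list only says the numbers line up on paper; the rating of 5 still requires that the restore was actually tested on all platforms.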
| Point | Protection Checkup | Your Rating (1 to 5) | Give yourself a "5" if: |
| 6 | Hardened Facility: All critical sites should be "Hardened," which includes limited access, badge or combination door locks, 24x7 guards or video surveillance, full UPS, fire protection, forced entry alarms, heat sensors, water sensors, etc. | | Security precautions are in place and followed by everyone. No exceptions. UPS is regularly tested along with all other alarm mechanisms. |
| 7 | Intrusion Protection: A complement of tools and controls is in place, including current virus software, IDS (Intrusion Detection System), firewalls, data transmission encryption and digital certificates, all of which must be operational at the Hot Site. | | Intrusion has been addressed at the primary facility and also tested at the alternate site. Virus software is updated very often (daily) and encryption is used. |
| 8 | Redundancy: The goal is to remove, or at the very least minimize, Single Points of Failure. This applies in all areas, for example: Network (dual paths), H/W (backup units or hot swappable components), S/W (prior versions of source code), Facility (Hot Site), Disk (mirrored or RAID), Power (UPS, alternate grid), etc. | | You have identified and documented all Single Points of Failure and implemented a solution or workaround. |
| Point | Expansion Checkup | Your Rating (1 to 5) | Give yourself a "5" if: |
| 9 | Web Site Recovery: A disaster recovery plan and alternate site solution are in place to restore Web Site service in what is usually a very short time period. This may well be a separate plan from the more traditional DRPs. It requires extremely fast response action since any outage of a web site is immediately known by all. | | Your web site recovery plan is actually documented, the alternate site contains redundant equipment, data is mirrored and a failover procedure can be implemented quickly. |
| 10 | Business Unit DRPs: Detailed disaster recovery plans are in place and have been tested for all critical business units. | | All critical departments have a Team Leader identified and a set of response steps that have been tested. |
| 11 | Change Management: The DR Team is in the loop when all infrastructure (HW, SW, Net, etc.) changes, upgrades, removals, etc. are planned. | | You are part of the planning process and are not "Surprised" when changes are made. |
| 12 | Awareness Training: A regular program is in place, corporate-wide, to conduct disaster recovery awareness and crisis management training workshops. | | Disaster Recovery and Crisis Management training are a part of the program just like training in technical skills, and drills have been conducted. |
Jan Persson, CDP, has worked
in the I.T. field since 1967. He began his disaster recovery involvement
in 1980 and in 1985 started his own disaster recovery consulting practice,
PERSSON ASSOCIATES. He has written and/or audited over 200 DR Plans,
worked for and with the 3 major disaster recovery firms, conducts DR
training seminars and workshops, and continues to take an active, hands-on
role in DR activities in shops and environments of all sizes.
© Copyright 2000 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of System Support Inc. is prohibited.