DISASTER RECOVERY JOURNAL
Spring 2001
Addressing Disaster Tolerance in an e-World
by Daniel S. Klein
On the subject of problem solving, the American writer H.L. Mencken noted,
"For every complex problem there is an answer that is clear, simple
and . . . wrong."
This point to ponder is particularly appropriate for those
whose responsibility it is to analyze, design, engineer, deliver, and
support disaster-tolerant Information Technology resources for businesses
or organizations. The point being that there are many possible solutions
that can be employed to achieve some form of disaster tolerance -- and
some of them are deceptively simple.
But the reality is that, in the pursuit of disaster tolerance, every
business has different needs. Your business models are different, your
economic models are different, and even different parts of your business
have differing needs. All of these elements need to be evaluated and
balanced to ensure that the result is not only the most appropriate
technological solution but also the most efficient economic solution.
And achieving this balance is not a trivial matter.
It goes without saying that, from an IT perspective, the 9-5 world we
once knew is no longer a reality. Today, we have an environment in which
customers, suppliers, and partners expect instantaneous transactions
and immediate responses to their orders or questions around the clock.
Once considered merely part of a business's cost and asset base, IT today
plays a much more strategic role. In many cases, IT is the business.
This article will focus on how you can begin to address the issues of
IT disaster tolerance in the quest for true business continuity in a
way that will best meet your business, economic, and technology needs.
Driving the need for disaster tolerance
The need for disaster-tolerant solutions is driven by at least three
factors: critical applications, decentralization/outsourcing, and around-the-clock
reliability.
Critical applications
As our operations and core business needs become more electronic and
dependent on information technology, a rapidly increasing number of
routine applications are becoming absolutely critical and must always
be up. Think of your supply chain, your Enterprise Resource Planning,
your Customer Relationship Management. Even e-mail is now critical because
real-time, global communication is an integral part of what we do.
Decentralization/outsourcing
The second driver is decentralization or, more precisely, the outsourcing
of operational activities. In previous years, businesses kept a tight
rein on all their internal operations such as sales, manufacturing,
and shipping -- no matter what industry they were in. And thus they
were in full control of the information technology that supported those
activities. This gave them more control and reinforced a direct relationship
with their customers.
Today, in an effort to streamline, many businesses want to divest themselves
of all operations not directly associated with their core purpose. Simple!
Therefore, they outsource to companies who have the expertise they do
not. The plus side, of course, is that companies can save money, increase
profits, and focus on their core competencies.
That is the upside. The downside is that, should there be
a problem in any one of these outsourced systems, the business is not
in business. In fact, by simplifying through outsourcing, the business
may have created a highly complex dilemma.
Around-the-clock reliability
The third driver of disaster-tolerant solutions is around-the-clock
reliability. With more applications considered critical and more potential
points of failure, companies must assess their need for around-the-clock
reliability. How acceptable is a moment -- or minutes -- of downtime?
What are the risks? Where can loss be tolerated? How much is acceptable?
And, of course, how much are they willing or able to spend to address
these matters?
A simple formula
Economically speaking, the decision to implement a disaster-tolerant
solution -- or, more precisely, the level of disaster tolerance to be
implemented -- is based on a simple formula: the cost of extended downtime
and the risk of a potential loss should outweigh the cost of the disaster-tolerant
solution and the supporting infrastructure. How one determines the elements
of this deceptively simple equation and, further, what to do about it
is another matter entirely!
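To make the formula concrete, here is a minimal sketch in Python. The article does not say how to quantify each element, so the probability-weighted model, the field names, and all of the figures below are illustrative assumptions rather than a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class DowntimeScenario:
    """One hypothetical failure scenario (all fields are assumed inputs)."""
    annual_probability: float     # estimated chance of the event in a year (0..1)
    expected_outage_hours: float  # how long the function would be down
    cost_per_hour: float          # revenue and productivity lost per hour down
    one_time_loss: float = 0.0    # other losses, e.g. penalties or lost data


def expected_annual_loss(s: DowntimeScenario) -> float:
    """Probability-weighted yearly cost of one scenario."""
    downtime_cost = s.expected_outage_hours * s.cost_per_hour
    return s.annual_probability * (downtime_cost + s.one_time_loss)


def solution_is_justified(scenarios, annual_solution_cost: float) -> bool:
    """True when the combined expected loss outweighs the solution's yearly cost."""
    total_loss = sum(expected_annual_loss(s) for s in scenarios)
    return total_loss > annual_solution_cost


# Illustrative numbers only.
scenarios = [
    DowntimeScenario(0.05, 48, 20_000, one_time_loss=250_000),  # site-wide outage
    DowntimeScenario(0.5, 2, 20_000),                           # short local failure
]
print(solution_is_justified(scenarios, annual_solution_cost=60_000))
print(solution_is_justified(scenarios, annual_solution_cost=100_000))
```

The hard part, as noted above, is not the arithmetic but arriving at defensible estimates for each input, and those estimates will differ for different parts of the business.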
And the matter mentioned earlier -- going for greater operational simplicity
and ending up with potentially intractable systems complexity -- is
a paradox that must be addressed to ensure appropriate disaster tolerance.
It really is like trying to put a square peg in a round hole.
A three-step process
There is, unfortunately, no simple solution. But there are ways to approach
the problem. Fundamental to success is the realization that a disaster-tolerant
environment should be designed from a systemic and holistic approach,
that is, with an understanding of the multi-dimensional nature of operational
reality.
The pursuit of a disaster-tolerant solution is a three-step process,
but what you do with these steps and when you do it are critical! And
it is also important to recognize that different parts of your organization
will require different responses to these three steps.
Step 1 -- Determine operational characteristics in the context of the business model
Determine the operational characteristics in the context of your business
model. This is a fancy way of saying: figure out the key factors
of the way you do business, how they are supported by your IT systems,
and where the priorities are.
How many of your applications are critical? Most or all? Do you outsource
most of your non-critical functions? How much do you want to streamline
your operations while leaving open the possibility of increasing points
of failure? How important is around-the-clock reliability? What is your
acceptable loss?
All of these questions factor into your disaster-tolerant approach.
On your first pass, the most important characteristics to consider are
transaction centricity and/or data centricity and recovery time and/or
recovery point.
Do you need fast recovery or recovery to the exact state prior to the
disaster -- or both? If you cannot resume processing within a second
will it be inconvenient, seriously damaging, or catastrophic? Conversely,
if you do not resume processing right where you left off, will it be
inconvenient, seriously damaging, or catastrophic?
If you are running a Web site and you cannot keep up with user demand,
business is lost as users grow impatient and click elsewhere. In fact,
research conducted by Oracle Corporation has shown that customers will
wait no more than seven seconds before moving on. Similarly, a transaction-centric
operation like a financial trading floor requires complete integrity
of transactions -- with no interruption -- or the result may be huge
losses. In these cases, recovery time is of the utmost importance.
A bank's back-office operation is a data-centric organization.
It may withstand a little disruption, but when it restores its data,
that data had better be accurate. Here, recovery point is the focus.
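The distinction between recovery time and recovery point can be sketched as a small classification in Python. The `Workload` fields, the 60-second threshold, and the tolerance figures below are hypothetical values chosen for illustration; they are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    max_downtime_seconds: float   # longest tolerable interruption (recovery time)
    max_data_loss_seconds: float  # how much recent work may be lost (recovery point)


def recovery_focus(w: Workload, tight_threshold_seconds: float = 60.0) -> str:
    """Say which requirement dominates, using an assumed cut-off for 'tight'."""
    time_critical = w.max_downtime_seconds <= tight_threshold_seconds
    point_critical = w.max_data_loss_seconds <= tight_threshold_seconds
    if time_critical and point_critical:
        return "both"
    if time_critical:
        return "recovery time"
    if point_critical:
        return "recovery point"
    return "neither"


# Illustrative tolerances only.
web_storefront = Workload("web storefront", max_downtime_seconds=7,
                          max_data_loss_seconds=3600)
back_office = Workload("bank back office", max_downtime_seconds=4 * 3600,
                       max_data_loss_seconds=0)

for w in (web_storefront, back_office):
    print(w.name, "->", recovery_focus(w))
```

In practice the thresholds would come from the business questions above (inconvenient, seriously damaging, or catastrophic) rather than from a single fixed cut-off.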
Step 2 -- Finding balance
Find the balance among three key aspects -- Technology, Services, and
Procedures and Discipline -- that impact disaster tolerance.
Technology: physical and logical components that make up the IT and
network environment, such as systems, equipment, software, network,
data, storage, and power.
Services: remedial and preventive services, service providers, third-party
evaluations and reviews, off-site personnel, and environmental concerns
(HVAC, fire prevention, and so on).
Procedures and Discipline: internal rules, policies, recovery plans,
practices, drills, cross training of personnel, succession planning,
and the discipline necessary to ensure implementation.
Having gotten this far in the approach we can now begin to build a model,
a one-picture summary, if you will, of what we are addressing. We recommend
that you use the chart as a touchstone or a reminder to ensure that
you consider all the appropriate factors and do all of the necessary
activities.
So far, we have three Aspects of concern: Technology, Services, and
Procedures and Discipline. All of these must be applied against your
business model.
This essentially sums it up. Figure out the operational requirements
and then determine the Technology, Services, and Procedures and Discipline
to apply in each case.
All this would be sufficient in a static world. But our environments
are not static, not a slice in time. In fact they are by definition
quite dynamic and require constant adjustment, updating, change, and
improvement. All of which can throw a monkey wrench into any model that
may work at a particular but singular stage.
Step 3 -- Addressing the dynamic nature of e-business environments
So we have the third step: addressing the dynamic nature of e-business
environments. For each area to which you are applying some level of
disaster tolerance you must constantly focus on planning, protecting,
and (reality being what it is) recovering your resources.
For each aspect of concern (Technology, Services, and Procedures and
Discipline), you need a process that begins with a plan, a way to protect
the plan for successful implementation, and a recovery activity when
an incident occurs.

Figure 1. The static Disaster Tolerance Planning Model showing Aspects and
Key Elements of the Business Model.

Figure 2. The entire Disaster Tolerance Planning Model showing Aspects, Activities
and Key Elements of the Business Model.
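One way to use the model as a working checklist is to treat it as a grid: for each area of the business, every Aspect crossed with every Activity is a cell that needs an answer. The Python sketch below assumes that reading of the figures; the function names and the sample business areas are hypothetical.

```python
from itertools import product

ASPECTS = ("Technology", "Services", "Procedures and Discipline")
ACTIVITIES = ("Plan", "Protect", "Recover")


def empty_model(business_areas):
    """Return {area: {(aspect, activity): [notes]}} with every cell present."""
    return {
        area: {cell: [] for cell in product(ASPECTS, ACTIVITIES)}
        for area in business_areas
    }


def open_cells(model):
    """List the (area, aspect, activity) cells that still have no notes."""
    return [
        (area, aspect, activity)
        for area, cells in model.items()
        for (aspect, activity), notes in cells.items()
        if not notes
    ]


# Two sample areas give 2 x 3 x 3 = 18 cells to consider.
model = empty_model(["order entry", "back office"])
model["order entry"][("Technology", "Protect")].append(
    "mirror the order database to a second site"
)
print(len(open_cells(model)))  # 17 cells still unaddressed
```

Listing the empty cells is a simple guard against considering, say, Technology for every activity while overlooking Procedures and Discipline for recovery.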
Plan
Planning is the quintessential piece of the puzzle. It requires you
to fully examine the model, goals, and needs of your organization along
all three aspects of concern.
You will need to ask the following questions: What is the business?
How do you generate revenue, deliver products and services? What are
your data needs, customer characteristics, supply chain?
What are the risks compared to the consequences? How likely is a flood
or hurricane?
What is the risk of malicious attacks? What are the consequences if
someone kicks the power cord out?
Not everyone needs a 24x7, year-round computing infrastructure. Similarly,
not every part of your organization requires the same level of disaster
tolerance. Do you need fast recovery or recovery to an exact state prior
to the failure?
What will your organization do? After you analyze the system environment,
you must architect the right disaster tolerance strategy. You need to
decide the level of protection and restoration and how to acquire technologies,
services, procedures and disciplines.
Once these things are understood you can then determine how you will
do things, when they need to be done, and who will do them. And it is
always helpful to frequently ask yourself, "Why?" This tends
to keep you on track and helps you avoid blind alleys and inefficient
paths.
Consider the financial, operational, technical, and personnel pieces
of this puzzle.
And of course, remember that Murphy's Law is pervasive!
Protect
During the Protect stage, you design, implement, and manage systems,
resources, and procedures to support your plan. It is the most active
stage and will require engagement on a daily basis.
On an ongoing basis, you must ask questions, adjust, document changes,
test, rehearse, try, adjust -- all while conducting your day-to-day
operations.
You will need to ask questions such as the following: What are the risks
to the technology and data? What technology should we deploy? How can
it address or minimize the risk?
How frequently should we test? How should we monitor internal and external
services?
What are the maintenance requirements, procedures, and discipline?
How will we manage change? How do we address contingencies?
Recover
Recovery is the final activity that ensures business continuity. You hope
you will never reach this stage. But if a disaster strikes, and
you have planned and protected your operations and data, recovery should
proceed exactly as you had expected. Well, maybe not exactly. However,
if you properly planned and protected, you can meet your recovery priorities,
adjust ad hoc, and have what at the time are the most important operations
up, running, and accurate.
For example, one company's recovery plan states that before anything
is done, they must assess the current state of the business and its
priorities and determine which operations support these priorities.
Only after doing this analysis will they begin recovery procedures.
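The assess-first approach in that example can be sketched in Python as follows. The data structure, the ranking rule, and the sample operations are assumptions made for illustration; the article does not describe how the company in question actually records its priorities.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    name: str
    supports: set = field(default_factory=set)  # business priorities it serves


def recovery_order(operations, current_priorities):
    """Rank operations by the most important current priority they support.

    current_priorities is ordered from most to least important, as assessed
    at the time of the incident. Operations that support none of them are
    left out; ties are broken alphabetically by name.
    """
    rank = {p: i for i, p in enumerate(current_priorities)}
    scored = []
    for op in operations:
        ranks = [rank[p] for p in op.supports if p in rank]
        if ranks:
            scored.append((min(ranks), op.name))
    return [name for _, name in sorted(scored)]


# Hypothetical operations and priorities.
operations = [
    Operation("e-mail", {"customer communication"}),
    Operation("order database", {"take orders", "ship orders"}),
    Operation("payroll", {"pay staff"}),
]
priorities_today = ["take orders", "customer communication"]
print(recovery_order(operations, priorities_today))
```

Because the priorities are passed in at recovery time rather than fixed in advance, the same plan yields a different order if the incident happens on, for example, a payroll day.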
We have now reached the point where we are clear on the business model
and priorities, the best balance of our three aspects (Technology, Services,
and Procedures and Discipline), and how to address the dynamic nature
of the business and its supportive technology infrastructure. Three
steps. Three aspects. Three balance points. Everything you need to do
to ensure the level of disaster tolerance and business continuity you
require. With a solid grounding in this approach you can recast Mr.
Mencken's statement: "For every complex problem there is an
answer that is clear, simple and . . . disaster tolerant."
Daniel S. Klein is the High
Availability and Disaster Tolerant Solutions Marketing Manager in Compaq
Computer Corporation's Custom Systems & Solutions Business
Unit. During his 14 years at Compaq, he has worked with a wide variety
of IT solutions in such markets as manufacturing, consumer packaged
goods, health care, education, and government.
The author is indebted to Jeffrey Schiebe and Ron LaPedis for their
contributions to this article.
© Copyright 2000 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of Systems Support Inc. is prohibited.