DISASTER RECOVERY JOURNAL
Winter 2002
INFORMATION TECHNOLOGY
Your Disaster Tolerant IT Solution: How Does It Measure Up?
By ROBERT LYONS
How well would your computer system be protected if a disaster were
to occur? This article explores three ways of configuring computer
systems to provide disaster tolerance: remote copy, remote computing,
and wide area clustering.
Improved technology in high-speed long distance interconnects such as
Fibre Channel and ATM has allowed many vendors to offer disaster tolerant
products and solutions. It is important to understand how well these
disaster tolerant solutions work.
One way to measure the quality of the solution is by how little time
it takes to recover from the outage and resume operations. The ability
to restore all of the data to the instant before a disaster occurs may
not be sufficient. There are many technical solutions that can be used
to capture data. Most of the solutions can even handle the multiple
I/O transactions common to database applications. But how does your
solution measure up in terms of the time needed to restore operations?
Getting computing back on-line sooner may significantly reduce the cost
of the outage, and some businesses can treat that speed as a competitive
advantage.
For example, in a high volume manufacturing plant we usually find a
process tracking system run by the IT organization. The system tracks
labor, raw materials, finished goods, work in progress, and many other
details which are vital to the company. Many of the common causes of
disaster, such as flood, weather, fire, and loss of power, that could stop
the plant's operation are also likely to force a shutdown of the computer
system.
The plant may be the company's only manufacturing facility, it
may make critical parts of a larger product, or it may be just one of
the many sites that turn out large volumes of the finished goods. In
all cases, when the plant is not running, it is costing the company
money each minute, hour and day that it is down. It is important to
estimate that cost since it varies for each business and is needed to
help justify the disaster tolerant effort. A production line may be
able to make up the lost volume by having employees work overtime,
but a funds transfer system in a bank cannot use the same strategy to
recover from delayed transactions. Patient medication tracking in a
hospital may be able to fall back to files and written charts hung on
the patient's bed, but an on-line securities exchange cannot use
paper and pencil to track the thousands of trades per hour that occur
each day. Each case is different, and there may be more than one case
within a company, especially when multiple businesses in a company share
a key computing system. In our example of the manufacturing plant, while
the clean up may take hours or days, the warehouse could continue operations
if it still had access to computers to manage raw material deliveries
and finished goods shipments. In most cases, the sooner that orders can
be shipped, the sooner the company can restore business operations.
Having the computers operational will reassure customers, since order
status can be checked, e-mail communications can be resumed and the
web site can be back on-line.
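The cost estimate described above comes down to simple arithmetic. The
following Python is only a minimal sketch: the function name, the hourly
cost figure, and the share of lost volume recovered through overtime are
all hypothetical values chosen for illustration, not figures from this
article.

```python
def outage_cost(hours_down, cost_per_hour, recoverable_fraction=0.0):
    """Estimate the net cost of an outage.

    hours_down: length of the outage in hours.
    cost_per_hour: assumed value of the business lost per hour of downtime.
    recoverable_fraction: share of the loss that can be made up afterward,
        for example through overtime. Use 0.0 for a system that cannot
        catch up, such as a funds transfer system.
    """
    if hours_down < 0 or cost_per_hour < 0:
        raise ValueError("hours_down and cost_per_hour must be non-negative")
    if not 0.0 <= recoverable_fraction <= 1.0:
        raise ValueError("recoverable_fraction must be between 0 and 1")
    gross = hours_down * cost_per_hour
    return gross * (1.0 - recoverable_fraction)


# Hypothetical plant: 8 hours down at 20,000 per hour, with a quarter of
# the lost volume made up later through overtime.
print(outage_cost(8, 20_000, recoverable_fraction=0.25))  # 120000.0
```

Running the same calculation for each business that shares a computing
system gives a rough figure to weigh against the cost of the disaster
tolerant configuration.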
Remote Copy Is The Lowest Form Of Disaster Tolerance
Making a continuous copy of the data is the first step in protecting
your critical IT infrastructure. When one copy is physically located
some distance from another copy, then we have our basic disaster tolerance
capability. The distance between the copies will determine what disasters
can be tolerated. As Figure 1 shows, we can have the computer storing
data on a local disk and have a disk farther away hold a duplicate
copy of the critical data.

There are many different methods and technologies that can be used to
create and maintain the remote copy. The replicated data can originate
within the application, the copy can be created by mirror/shadow software
in the operating system, or the copy can be created by functions built
into the storage controllers. The copy may be maintained synchronously
or asynchronously. With a synchronous copy, a data write operation is not
considered complete until both the local and the remote copy are on
the non-volatile surface of a disk. With an asynchronous copy, the write
is considered complete once the local copy is on disk, and the remote copy
is updated afterward. In its crudest form, remote copy
could be implemented by sending the daily backup off-site and journaling
transactions to a disk that is remote from the original computer system.
Whichever method is chosen, all of these implementations allow the company's
critical data to be held at a safe distance away from the original site
in case the data storage equipment is destroyed or inaccessible. To
restore operations with the remote copy strategy, the data
must be loaded onto suitable disks and the application programs
adjusted to use the new copy of the data.
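The difference between synchronous and asynchronous remote copy can be
shown with a toy model. The sketch below uses in-memory dictionaries in
place of real disks, and the class and method names are hypothetical. It
does not represent any vendor's product; it only shows when a write is
treated as complete and which data is exposed if the local site is lost.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a local disk with a remote copy (in-memory only)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = {}         # stands in for the local disk
        self.remote = {}        # stands in for the remote disk
        self.pending = deque()  # writes not yet applied to the remote copy

    def write(self, block, data):
        """Return once the write is considered complete."""
        self.local[block] = data
        if self.synchronous:
            # Synchronous: complete only when both copies hold the data.
            self.remote[block] = data
        else:
            # Asynchronous: complete now; the remote copy catches up later.
            self.pending.append((block, data))

    def drain(self):
        """Apply queued writes to the remote copy in their original order."""
        while self.pending:
            block, data = self.pending.popleft()
            self.remote[block] = data

    def blocks_at_risk(self):
        """Blocks whose latest write is lost if the local site fails now."""
        return {block for block, _ in self.pending}


sync_vol = ReplicatedVolume(synchronous=True)
sync_vol.write("order-1", "shipped")

async_vol = ReplicatedVolume(synchronous=False)
async_vol.write("order-1", "shipped")

print(sync_vol.blocks_at_risk())   # set()
print(async_vol.blocks_at_risk())  # {'order-1'}
```

The trade-off in this model is that the synchronous write waits for the
remote disk on every operation, while the asynchronous write returns
sooner but leaves a window in which recent data exists at only one site.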
If the remote copy is outside the computer room, but in another room
of the same building, the computer system is protected from disasters
that occur in the computer room only, such as a fire or an explosion.
On the other hand, a flood or an earthquake would likely affect both
rooms of the same building. Moving the data out of the computer
room but keeping it in the same facility therefore protects the business
only from small disasters.
But what if the original computers remain inaccessible or were destroyed
by the same forces that destroyed the original copy of the data? If
processing depends on some special or custom devices, then it will be
difficult to resume computing without them. For the data to be useable,
the computers, application programs, and user access network need to
be operational and accessible. Recovery time with remote copy can be
measured in as little as a few hours, but it will usually take one or
more days, since the original disaster can delay access to the original
computer room or to a surrogate installation.
Remote Computing Improves The Disaster Tolerant Solution
Improved disaster tolerance recognizes that both data and compute power
are needed at an alternate site. With alternate computing resources
in place, recovery of the IT infrastructure is typically
10 times faster than with remote copy.
With remote copy alone, the time that the original data center is
inaccessible or the time it takes to make an alternate site operational
can range from five hours to 50 hours (assuming custom equipment does
not need to be ordered).
With an alternate site, the recovery process is much more straightforward
and documented recovery procedures do not need to handle the complicated
process of locating a computer to use. This alternate computing site
does not need to be running the same application programs and, in fact,
the alternate systems can normally be used for wholly different functions,
such as running less critical applications, software development and
testing, or even as training systems. As long as the alternate site
has all of the equipment necessary to meet the minimum level of performance,
the company is operational and additional or replacement equipment can
be installed if the original site is not expected to be available in
the near future. It may also be possible to relocate equipment from
the original site to the alternate site if the equipment is functional
but inaccessible. Figure 2 shows an example configuration, which
has computing and a copy of the data at both sites. Note that although
the computer at a site is drawn to show that it is physically close to the
alternate site's second copy of the data, the two are not actively
connected.

To continue with the factory example, the amount of processing power
needed to run the warehouse should be significantly less than needed
to run the production floor and warehouse combined. This would allow
the computers at the second site to be smaller and less expensive.
On the other hand, recovery steps needed to restore operation typically
include a system reboot, database reload, application re-vectoring,
user access rerouting, and other steps necessary to adjust computing
operations to suit the alternate site. If the alternate site were running
less critical applications, then those programs must be shifted or shut
down in an orderly fashion. The time it takes to perform these common
failover steps can range from a few minutes to a few hours.
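The failover steps listed above can be laid out as a simple timed
checklist. The sketch below is an illustration only: the step names come
from the text, but their order and the minute estimates are assumptions,
and a real recovery procedure would replace them with measured values
from testing.

```python
# Failover steps named in the text. The ordering and the minute estimates
# are hypothetical placeholders, not measurements.
FAILOVER_STEPS = [
    ("shut down less critical applications", 10),
    ("system reboot", 15),
    ("database reload", 60),
    ("application re-vectoring", 20),
    ("user access rerouting", 15),
]


def failover_plan(steps):
    """Return (schedule, total_minutes) for steps run one after another.

    schedule is a list of (start_minute, step_name) pairs.
    """
    schedule = []
    elapsed = 0
    for name, minutes in steps:
        schedule.append((elapsed, name))
        elapsed += minutes
    return schedule, elapsed


schedule, total = failover_plan(FAILOVER_STEPS)
for start, name in schedule:
    print(f"T+{start:3d} min  {name}")
print(f"Estimated time to restore operations: {total} minutes")
```

With these placeholder numbers the total is 120 minutes, which falls
inside the range of a few minutes to a few hours given above.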
As long as the cost of being out of operation for up to a few hours
can be tolerated, compared to the infrequent occurrence of a disaster,
then the improved strategy meets the business needs. However, there
are some businesses that cannot accept hours of continuous outage. Real-time
systems, such as nuclear power plant control and air traffic control
systems are obvious cases that cannot tolerate long outages. Other cases
include a catalog order entry system, a police and fire dispatch system,
or a hospital management system, where significant operation outages
cannot be tolerated. These businesses should consider the next configuration.
Wide Area Clustering Provides The Ultimate In IT Disaster Tolerance
With data replicated at both sites and sufficient processing capacity
present, applications can remain on-line when using an actively clustered
configuration. Active clusters allow simultaneous application execution
on all of the computers in the clustered system. When configured as a wide
area cluster, the application can remain operational without regard
to the physical location of the hardware. In this configuration, if
a site becomes inoperable, the workload continues at the remaining
site(s). Failover effort typically ranges from no time at all to a few
minutes if you want to optimize parameters for the new workload. This is
because little or no manual intervention is needed to keep the application
up and running on the remaining systems. As shown in Figure 3, we allow
concurrent access to the full set of disks from the systems at both
sites.

Active clustering relies on two specific technologies. The first technology
provides a coordinating software function to manage access to the data
so that multiple copies of the application can read and write to the
same files without corrupting the contents. The coordination is usually
provided by facilities in the operating system or can be built with
special procedures that are written by the application developer. The
second technology is long-distance communication: recent advances in
high-speed wide-area communications channels such as FDDI,
Fibre Channel, and ATM or other high-speed packet switching services
provide the long-distance links needed to implement
this solution. The high availability of clustering, when added to the
disaster tolerance of wide area configurations, produces the ultimate
in IT solutions. The application will continue to function after any
outage, whether it is the loss of a single system or of a whole site.
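The coordinating function described above can be illustrated on a very
small scale. In the sketch below, threads stand in for the copies of the
application running on different cluster nodes, and a single in-process
lock stands in for the cluster-wide coordination service. Both
substitutions are assumptions made to keep the example runnable on one
machine, and the class and function names are hypothetical.

```python
import threading


class SharedRecord:
    """A record updated by several copies of the application."""

    def __init__(self):
        self.value = 0
        # Stands in for the cluster-wide coordination service.
        self._lock = threading.Lock()

    def add(self, amount):
        # The read-modify-write is done while holding the lock, so that
        # concurrent updates cannot overwrite each other.
        with self._lock:
            current = self.value
            self.value = current + amount


def run_nodes(node_count=2, updates_per_node=10_000):
    """Run one thread per simulated node and return the final value."""
    record = SharedRecord()

    def node_workload():
        for _ in range(updates_per_node):
            record.add(1)

    threads = [threading.Thread(target=node_workload) for _ in range(node_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return record.value


print(run_nodes())  # 20000
```

Because every update holds the lock, the final value always equals the
number of updates issued. A wide area cluster has to provide the same
guarantee across sites, which is why it depends on the fast long-distance
links discussed above.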
In conclusion, we see remote copy provides only limited protection and
is the worst solution in terms of recovery time. Remote computing offers
a superior solution in cases where businesses have sufficient capital
budget for the equipment, and the forethought to implement a faster
recovery. For those businesses that recognize the criticality of their
computing infrastructure, and have chosen to have the business ride
through a site outage, wide-area clustering offers an outstanding
solution with virtually the same equipment investment as in the remote
computing solution.
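One rough way to compare the three configurations is to multiply each
recovery-time range by an hourly cost of downtime. In the sketch below,
only the five to 50 hour range is taken from this article. The other two
ranges are assumed numeric readings of "a few minutes to a few hours" and
"no time at all" to minutes, and the hourly cost is hypothetical.

```python
# Recovery-time ranges in hours as (best case, worst case).
# Only the remote copy range is stated in the article; the other two are
# assumed interpretations of its qualitative descriptions.
RECOVERY_HOURS = {
    "remote copy": (5.0, 50.0),
    "remote computing": (0.1, 3.0),
    "wide area clustering": (0.0, 0.25),
}


def outage_cost_ranges(cost_per_hour, recovery_hours=RECOVERY_HOURS):
    """Map each configuration to its (best case, worst case) outage cost."""
    return {
        name: (low * cost_per_hour, high * cost_per_hour)
        for name, (low, high) in recovery_hours.items()
    }


# Hypothetical cost of 20,000 per hour of downtime.
for name, (best, worst) in outage_cost_ranges(20_000).items():
    print(f"{name:22s} {best:>12,.0f} to {worst:>12,.0f}")
```

Substituting a company's own hourly cost and tested recovery times
produces the figures needed to weigh each configuration against its
equipment investment.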
Robert Lyons is a systems consultant at Resilient Systems Inc. (www.resilientsys.com).
He has over 10 years' experience in designing and implementing disaster-tolerant
configurations worldwide.
To comment on this article, go
to 1501-16 at www.drj.com/feedback.
© Copyright 2002 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of System Support Inc. is prohibited.