DISASTER RECOVERY JOURNAL
Top 5 Causes of Replication Failure Every IT Team Needs to Know
By PAUL D’ARCY
With all the miles I log driving the vast Texas highways, sooner or
later I know the odds will catch up with me and my car will end up with
a flat tire. It may be an inconvenience, but I don’t worry because
I always have a spare that will adequately cover my needs until I can
replace the damaged tire. If only it were that easy traveling the information
highways in today’s business world, where downtime of a mission-critical
application can mean significant productivity and financial loss to
a company and even an inability to operate at all. Many companies assume
that they are covered with their own “spare” because they
maintain replicated copies of critical applications and data.
For today’s businesses, a replication system – a second
copy of corporate data that is stored in a remote datacenter to ensure
data continuity and application availability – is a solution that
must work 100 percent of the time. Period. Unfortunately, the reality
is that most replication solutions are inherently complex and prone
to failure. While replication serves an important function in the enterprise,
it is important for IT executives to understand the most common causes
of replication failure and to evaluate these against their own company’s
efforts.
Five Most Common Causes of Replication Failure
1) Secondary Environment Not Ready for Failover
Establishing a secondary environment that contains an operable copy
of a company’s most critical applications and data sounds like
the perfect solution. In reality, it is very difficult to maintain two
nearly identical environments that may be located hundreds or thousands
of miles apart. For a company to fail over from a primary application
to a replicated standby server, all software, patches, and configurations
need to be consistent, or failover will most likely not function properly.
As an extreme case, we have seen companies whose self-managed replication
process failed because the secondary server they depended on as a safety
net had been quietly repurposed for another task and was no longer
available for failover. In these cases, nobody at the company realized
the server was gone until they tried to fail over and it didn’t work!
In many more cases, the failover process is unsuccessful because the
primary and secondary environments fall out of sync as changes are
applied to one but not the other.
A company using this form of replication must take the time to develop
processes that control how changes and updates are introduced and
distributed to both environments. It is also important to monitor the
critical processes that affect the secondary environment, so that it is
ready whenever a situation requires its use.
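One way to catch environments drifting apart is to compare an inventory of each side on a schedule. The sketch below is only an illustration: the function name find_drift and the inventory contents are hypothetical, and it assumes each environment can export its software, patch, and configuration values as simple name/value pairs.

```python
def find_drift(primary, secondary):
    """Return {item: (primary_value, secondary_value)} for every mismatch.

    Both arguments are dicts mapping an item name (package, patch, or
    configuration key) to its version or value. An item that is missing
    on one side is reported as None on that side.
    """
    drift = {}
    for item in sorted(set(primary) | set(secondary)):
        p, s = primary.get(item), secondary.get(item)
        if p != s:
            drift[item] = (p, s)
    return drift


# Hypothetical inventories, for illustration only.
primary_inventory = {"os_patch_level": "patch-12", "mail_server": "v5", "antivirus": "v9"}
secondary_inventory = {"os_patch_level": "patch-11", "mail_server": "v5"}

for item, (p, s) in find_drift(primary_inventory, secondary_inventory).items():
    print(f"{item}: primary={p} secondary={s}")
```

An empty result means the two inventories match. Any entry in the result is a change that reached one environment but not the other.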
2) Manual Error in Failover Process
The reality is that we are only human, and, not surprisingly, people in
a crisis are often the weakest link in a failover process. It is not uncommon
for a manual error during a failover sequence to corrupt the entire
process. After the problem caused by the error is remedied, the entire
failover process must be restarted. When it takes 350 steps to fail over
a typical 10-server Microsoft Exchange environment, the likelihood of
at least one error is very high. According to Gartner Group, only 20
percent of system failures are due to hardware, operating system, or
environmental issues. The remaining 80 percent are due to either
application errors or human error.
It is important that companies recognize this potential for errors and
look for solutions that minimize or eliminate the need for lengthy manual
processes. Where processes can be automated, the possibility of errors
can be greatly reduced – especially in a time of crisis.
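The effect of step count on error risk can be estimated with simple arithmetic. The sketch below assumes each manual step fails independently with the same probability, which is a simplification. The 1 percent per-step error rate is a hypothetical figure chosen for illustration, not a number from Gartner or from this article.

```python
def probability_of_any_error(steps, per_step_error_rate):
    """Chance that at least one of `steps` independent manual steps goes wrong."""
    return 1.0 - (1.0 - per_step_error_rate) ** steps


# 350 steps comes from the Exchange example above; 0.01 is an assumed rate.
risk = probability_of_any_error(350, 0.01)
print(f"Chance of at least one error in 350 steps: {risk:.0%}")
```

Under these assumptions, even a careful operator who gets 99 of every 100 steps right has roughly a 97 percent chance of making at least one mistake over 350 steps.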
3) Experts Not Available During Crisis
It’s good to be needed, but too often failover processes depend
on expert staff who may not be available during a crisis. Because the
failover process touches the full range of technical disciplines, from
hardware and operating systems to applications and databases to networking
and security, highly specialized personnel in different areas may each
need to take responsibility for their portion of the failover
process. If even one of these trained individuals is not available,
the failover process can break down.
Companies need to list the skills the failover process requires and
the staff available to cover each one. Better yet, look for
new solutions that automate many or all of the failover steps or deliver
these steps as a remotely managed service – without requiring
an expert staff member to trigger the failover process. If your failover
process depends on a few key people to work successfully, you may not
be able to fail over during a severe crisis.
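A skills list like the one described above can be checked mechanically for single points of failure. The sketch below is illustrative only: the function name coverage_gaps, the staff names, and the skill labels are all hypothetical, and the minimum of two people per skill is an assumed policy.

```python
def coverage_gaps(required_skills, staff_skills, minimum=2):
    """Return {skill: [people]} for each skill covered by fewer than `minimum` people."""
    gaps = {}
    for skill in required_skills:
        people = sorted(name for name, skills in staff_skills.items() if skill in skills)
        if len(people) < minimum:
            gaps[skill] = people
    return gaps


# Hypothetical roster, for illustration only.
required = ["networking", "database", "mail"]
roster = {
    "ann": {"networking", "database"},
    "bo": {"networking"},
}

for skill, people in coverage_gaps(required, roster).items():
    print(f"{skill}: only {len(people)} person(s) available: {people}")
```

Any skill that appears in the result depends on too few people, and is a candidate for cross-training, automation, or a managed service.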
4) Failover Process Unable to Scale
The old “snowball” effect can hit large organizations
in the midst of a failover effort, when a limited technical staff must
manage large numbers of systems. It is not unusual for a critical
application failure or facility problem to leave dozens of
systems or more requiring failover – and multi-server failover is a
serial, manual process. One administrator can fail over only one server
at a time, and many mutually dependent systems will not work until the
entire environment has failed over. The delays are unavoidable and the
pressure to rapidly restore service is unrelenting – not a great
position for technical staff working in a crisis.
Complex failovers can take many, many hours.
There are few easy remedies for the scaling problem
in the failover process, but having as many trained staff on hand as
possible and minimizing system dependencies can help reduce the delays in
large failover situations. Likewise, automation can dramatically
shorten the failover process by parallelizing important steps and
executing complex failover scenarios more quickly and accurately.
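The gap between serial and parallel failover can be estimated from per-system failover times and their dependencies. The sketch below is a simplified model, not a description of any product. It assumes an automated process can start every system as soon as the systems it depends on are up. The system names, durations, and dependencies are hypothetical.

```python
def serial_minutes(durations):
    """Total time when one administrator fails systems over one at a time."""
    return sum(durations.values())


def parallel_minutes(durations, depends_on):
    """Total time when each system starts as soon as its dependencies are up."""
    finish = {}

    def finish_time(system, path=()):
        if system in path:
            raise ValueError(f"dependency cycle at {system}")
        if system not in finish:
            deps = depends_on.get(system, [])
            start = max((finish_time(d, path + (system,)) for d in deps), default=0)
            finish[system] = start + durations[system]
        return finish[system]

    return max((finish_time(s) for s in durations), default=0)


# Hypothetical failover times in minutes, and which systems must be up first.
durations = {"directory": 30, "database": 45, "mail": 60, "web": 20}
depends_on = {"mail": ["directory"], "web": ["database"]}

print("serial:", serial_minutes(durations), "minutes")
print("parallel:", parallel_minutes(durations, depends_on), "minutes")
```

In this model the parallel time is set by the longest dependency chain. That is why minimizing system dependencies shortens failover even when plenty of staff or automation is available.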
5) Untested Failover Assumptions Don’t Work
Optimism and faith are wonderful character traits, but not when the
health of the business depends on them. Unfortunately, many companies
have invested in complex, multi-server failover solutions that they
have found too sensitive to actually test. Without testing every permutation
of systems and failure causes, it is impossible to know exactly what
will happen during a real crisis. Large, complex environments can have
many failure types and scenarios, and multi-server failover involves
constantly changing conditions.
It is highly desirable to have a failover solution that allows the organization
to test its functionality and pinpoint the different failures that
different conditions cause. Options to consider include
“what-if” scenario planning sessions and “pre-mortems,”
a form of role playing that allows technical staff to identify untested
failover scenarios and potential bottlenecks.
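A planning session can start from an explicit list of failure scenarios and a record of which ones have been tested. The sketch below is one hypothetical way to build that list. It treats a scenario as a set of systems failing together, ignores the cause of failure, and caps the number of simultaneous failures, all of which are simplifying assumptions.

```python
from itertools import combinations


def scenarios(systems, max_simultaneous=2):
    """List every set of up to `max_simultaneous` systems failing together."""
    result = []
    for k in range(1, max_simultaneous + 1):
        result.extend(frozenset(c) for c in combinations(sorted(systems), k))
    return result


def untested(systems, tested, max_simultaneous=2):
    """Return the scenarios that do not appear in the `tested` record."""
    tested_sets = {frozenset(t) for t in tested}
    return [s for s in scenarios(systems, max_simultaneous) if s not in tested_sets]


# Hypothetical systems and test history, for illustration only.
systems = ["directory", "database", "mail"]
tested = [["mail"], ["directory", "mail"]]

for scenario in untested(systems, tested):
    print("never tested:", sorted(scenario))
```

The number of scenarios grows quickly with the number of systems, which is why exhaustive testing is rarely practical and the untested list is a useful agenda for a pre-mortem.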
The Impact of Replication Failure
Replication is a widely used solution for application availability,
and it is important for companies to recognize the potential for failure
and likely ramifications when self-managed replication systems inevitably
do fail. We have discussed the five most common causes of replication
failure, but the impact on the enterprise is also important to understand.
First, replication adds complexity: an enterprise will need to double
the amount of hardware and software of a stand-alone application and
will also require additional bandwidth to handle the traffic to the
secondary environment. The complex architectures and processes required
to support replication add significant new monitoring, management, and
crisis response demands.
Second, replication can lead to system problems that are very complex
to resolve. If not managed properly, replication problems can have
widespread impact and can lead to other technical issues such as
database corruption.
Finally, replication failures can be costly due to the time and effort
required to recover the data and restore it, and the application downtime
that clearly impacts the company’s bottom line and its business
operations.
Unreliable Replication Is More Risky Than No Replication at All
The key to successfully implementing a replication solution is to understand
and eliminate as many risks as possible. Knowing your own organization,
and its particular needs and technical resources, will help determine
which of these risks is most likely to impact your company. The good
news is that new solutions are now available that have been designed
to eliminate the technical complexities and risk for human error inherent
in many replication solutions. This new breed of application availability
solutions automates the failover process so that it is fast, easy, scalable,
and not prone to manual error.
Most importantly, companies are now delivering replication solutions
as “managed failover services” that provide not just automation,
but remote control and management of the full failover process. These
systems constantly monitor the primary and secondary environments and
the replication queues to provide a “green light” confirming
that your safety net is available and that failover will succeed if
initiated. Just as important, this monitoring is done from a third-party
location, away from your primary and secondary environments and off
your network, so availability to the outside world can be accurately
judged.
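The “green light” idea can be sketched as a function that turns a few monitored values into a single status. This is a hypothetical illustration, not how any particular managed service works. The field names and thresholds are assumptions, and monitoring of the primary environment is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class ReplicationStatus:
    secondary_reachable: bool        # as seen from the outside monitoring site
    queue_depth: int                 # changes waiting to be copied
    seconds_since_last_apply: float  # time since the secondary last applied a change


def readiness(status, max_queue_depth=1000, max_lag_seconds=300.0):
    """Return ("green", []) when no check fails, else ("red", [problems])."""
    problems = []
    if not status.secondary_reachable:
        problems.append("secondary not reachable from monitoring site")
    if status.queue_depth > max_queue_depth:
        problems.append(f"replication queue too deep: {status.queue_depth}")
    if status.seconds_since_last_apply > max_lag_seconds:
        problems.append("secondary has not applied changes recently")
    return ("green" if not problems else "red", problems)


# Hypothetical reading, for illustration only.
light, problems = readiness(ReplicationStatus(True, 5000, 12.0))
print(light, problems)
```

Returning the list of problems along with the light lets the people on call see why the safety net is not ready, not just that it is not.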
Finally, in evaluating managed failover services, it’s important
to look for solutions that include 24x7 automated and live monitoring,
and the ability to delegate the execution of the failover process at
a moment’s notice if your staff is unable to reach the network
to execute the failover. With a reliable service, a failover can be
executed remotely with a simple phone call. With these new application
availability solutions, a company can always count on its secondary
environment being ready to go in the event of an emergency – big
or small.
A New Era in Replication Solutions
While replication solutions represent a powerful way to protect a company’s
most critical data and applications, it is important to understand the
potential for replication failure and the risks associated with these
approaches. As new managed failover services are coming to market, we
are seeing new ways to improve on the benefits of replication with automated
processes that reduce risks, improve the time-to-failover speed, and
add a higher level of reliability. It’s good news for the enterprise:
new application availability solutions that can help keep businesses
driving smoothly – with no need to worry about potholes anymore.
Paul D’Arcy is vice president of marketing for MessageOne, a leading
provider of business continuity solutions. With more than 10 years of
technology marketing experience, he holds an MBA from Harvard University
and a bachelor’s degree from Wesleyan University. For more information
visit www.messageone.com or call (512) 652-4500.
© Copyright 2005 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of Systems Support Inc. is prohibited.