DISASTER RECOVERY JOURNAL
P. O. Box 510110
St. Louis, MO 63151
(314) 894-0276
Fax: (314) 894-7474
Internet
www.drj.com
E-mail drj@drj.com
PUBLISHER &
EDITOR-IN-CHIEF
Richard L. Arnold, CBCP
richard@drj.com
SENIOR EDITOR
Janette Ballman
janette@drj.com
MANAGING EDITOR
Jon Seals
jon@drj.com
COPY EDITORS
Richard Sandhofer
richards@drj.com
Pamela Clifton
pamelaclifton@hotmail.com
ADVERTISING
Robert Arnold
bob@drj.com
_____________
Corporate
President/CEO
Richard L. Arnold, CBCP
richard@drj.com
Vice President
Robert Arnold
bob@drj.com
CONFERENCE COORDINATOR
Patti Fitzgerald, CBCP
patti@drj.com
CONFERENCE REGISTRAR
Merce Knese
mercedes@drj.com
CIRCULATION
Laura Baugh
laurab@drj.com
INTERNATIONAL
CONTACTS
England: Thom Hetherington
Business Continuity
Phone: 0161-237-1007
thomh@tempus.demon.co.uk
Australia: Anthony J. Harvey
Journal of Business Continuity
Phone: 0011-613-953-0055-8
fax: 0011-613-953-0528
sector@notability.com.au
Japan: Shinji Hosotsubo
Quake Japan Co., Ltd.
Phone: 03-3215-2880
fax: 03-3215-2881
Brazil: Jose Carlos Ferreira
Disaster Recovery Mercosul
Phone: 55 11 3666-9506
conc2000@uol.com.br
www.drms.com.br
INDUSTRY
Re-Evaluating Our Business Continuity Strategies
By ANDRE NOENCHEN, CBCP
Recent terrorist attacks on the U.S. have forced us to look more
closely at our disaster recovery and business continuity strategies.
We are infinitely more aware of just how vulnerable and fragile our
economic and social infrastructures are. The need for truly capable
emergency response, business continuity and disaster recovery strategies
has been clearly demonstrated. Those continuity plans that are capable
of nothing more than passing an audit must be exposed and replaced with
strategies that are truly workable and meet the critical organizational
needs they are designed to support. For most organizations today, this
includes ensuring the availability of critical IT resources and services.
It is absolutely inconceivable to consider operating a large organization,
regardless of the industry, without the mission critical technical resources
utilized daily.
The availability of computer services is the life-blood of many organizations.
This dependency on critical applications and data is increasing at an
astonishing rate. Most large businesses would simply not be able to
function without the availability of the computer-based services that
enable the high levels of efficiency required in today's competitive
world. It is, in most cases, a foregone conclusion that in the event
of a disaster, restoration of the corporate computer and information
infrastructure must take place very early in the recovery effort. Availability
of computer applications is as fundamental to the on-going health of
a company as are its employees, electrical power or telephone services.
It is impossible to imagine doing "business as usual" without
the systems and applications we rely on.
Obviously, people are the most important element of any organization,
but without the supporting tools and technology, even the most talented
of people will not be able to meet the demands of today's competitive,
complex and fast-paced business climate.
The "mean time to recover" is a term well known to any business
continuity professional. Time is one of the key drivers that define
the recovery strategies needed to enable required functionality. The
"mean time to recover" duration has been shrinking steadily
for IT-enabled resources, to the point where some applications simply
cannot go down at all. This level of requirement exists mostly in
military, critical infrastructure (power, telephone, etc.) and financial
environments today, where outages of even one hour could be catastrophic.
More and more, though, availability requirements for businesses outside
these key areas are measured in hours, not days or weeks.
For far too many organizations, disaster recovery and business continuity
strategies have not kept up with the changes in business requirements and
technology. Recovering critical computer-based functionality from tape is
often unreliable, complex, and time consuming. This is especially true
when recovery strategies are required to support site-based outages,
where an entire infrastructure needs to be recreated. Rebuilding and
restoring from tape a dynamic IT infrastructure that has taken years to
evolve, all in a 24- to 48-hour time frame and under adverse conditions,
is simply unrealistic.
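The arithmetic behind that claim can be sketched in a few lines. The following is a rough estimate only: the function names, the 60 percent efficiency factor, and the data volume, drive count, and drive speed in the usage note are illustrative assumptions, not figures from any particular site or vendor.

```python
def restore_hours(data_gb: float, drives: int, mb_per_sec_per_drive: float,
                  efficiency: float = 0.6) -> float:
    """Estimate the hours needed to stream data_gb back from tape.

    efficiency discounts the rated drive speed for tape mounts, seeks and
    multiplexed streams. The default of 0.6 is an illustrative guess,
    not a measured value.
    """
    if drives < 1 or mb_per_sec_per_drive <= 0 or not 0 < efficiency <= 1:
        raise ValueError("drives, throughput and efficiency must be positive")
    effective_mb_per_sec = drives * mb_per_sec_per_drive * efficiency
    seconds = (data_gb * 1024) / effective_mb_per_sec
    return seconds / 3600


def fits_window(data_gb: float, drives: int, mb_per_sec_per_drive: float,
                window_hours: float, rebuild_hours: float = 0.0,
                efficiency: float = 0.6):
    """Return (fits, total_hours) for rebuild time plus tape restore time."""
    total = rebuild_hours + restore_hours(
        data_gb, drives, mb_per_sec_per_drive, efficiency)
    return total <= window_hours, total


if __name__ == "__main__":
    # Hypothetical site: 10 TB of data, 4 drives rated at 15 MB/s each,
    # 12 hours to rebuild servers before any data can be restored.
    fits, total = fits_window(10 * 1024, 4, 15, window_hours=48,
                              rebuild_hours=12)
    print(f"fits 48-hour window: {fits}, estimated total: {total:.1f} hours")
```

Under these assumed numbers the data restore alone takes roughly 81 hours, well beyond a 48-hour objective before any troubleshooting is counted. Substituting your own volumes and measured restore rates is the point of the exercise.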
How Did We Get Here?
The above paragraphs contain information that is common knowledge for
some in the business world, and should be understood intimately by anyone
in the disaster recovery or business continuity field. Yet the true
recovery capabilities and strategies of most organizations fall short
of the business objectives that originally drove them.
The sheer growth in the volume of data we are managing has made backup
and recovery from tape extremely difficult.
Tape backup is still a critical part of any data management process.
Tape backup and restore will continue to play a key role in data recovery,
data management, and archiving. But the role of tape backup and restore
in business continuity is changing, and in many cases, will be all but
eliminated.
The proliferation of open systems architectures led the
charge away from the proprietary mini- and mainframe-oriented data processing
floors to UNIX- and Windows-based environments. The business world very
quickly embraced the nimble and less expensive alternatives
to the proprietary mainframe-based data centers, moving to a client/server
approach to developing new applications, and porting old applications
to this new infrastructure.
The Internet, fueled by incredible advancements in networking, has changed
the way we do business, and even the way we live. The ease of doing
business on the Web coupled with the relatively low cost of deployment
has further fueled this growth. Many jobs of old have been replaced
by the functionality and speed offered by today's computer-based
solutions.
The technological advancements we have witnessed in the computer industry
in the past 10 years are truly staggering. These advancements have touched
so many aspects of the industry that nothing is as it was a decade ago.
Performance, capacity, bandwidth, scalability, redundancy, manageability,
interoperability, and reliability are just some of the key areas where
this remarkable industry has made exponential gains. We have gone from
measuring our disk storage space in megabytes to gigabytes, terabytes
and now petabytes. A single desktop computer can offer CPU performance
greater than that contained in an entire "glass house" data center only
15 years ago. Network bandwidth capabilities have seen similar growth
and performance gains, not to mention the impact business has experienced
via the proliferation of the Internet.
The sheer volume of data we are dealing with has grown exponentially.
The types of data that exist are diverse, often requiring different
backup and restore tools and methods. Full system restores from tape,
including the operating system, network, applications, and databases,
are a very complex and time-consuming task. In fact, to accomplish
a full system restore, your strategy for recovering from tape must be
considered when designing and building the system in the first place.
The way in which backups are done (full backup, differential, incremental,
database, etc.) will dictate and limit the way in which they can be
used in a recovery. The emphasis in many environments is to speed up
the backup process to fit within specific maintenance windows. Usually,
methods that speed up backup (spreading data over multiple tape drives,
multiplexing numerous backups on one tape, incremental backups, differential
backups, etc.) will slow down the restore process later on. Also, many
centralized backup products store data on tape in proprietary formats.
This requires the recovery of the tape backup/restore infrastructure
before it can be used to recover your actual systems and data, an
often overlooked issue.
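How the backup scheme dictates the restore can be illustrated with a short sketch. It assumes the usual definitions: a differential holds everything changed since the last full backup, and an incremental holds everything changed since the last backup of any kind. The function name, labels, and schedules below are hypothetical.

```python
from typing import List, Tuple

Backup = Tuple[str, str]  # (label, kind), where kind is "full", "diff" or "incr"


def restore_chain(history: List[Backup]) -> List[str]:
    """Return the backup labels to restore, in order, to reach the newest backup.

    history is ordered oldest to newest.
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup in history")
    last_full = full_positions[-1]
    chain = [history[last_full][0]]
    tail = history[last_full + 1:]

    # The newest differential supersedes every backup taken between the
    # full backup and itself.
    diff_positions = [i for i, (_, kind) in enumerate(tail) if kind == "diff"]
    if diff_positions:
        last_diff = diff_positions[-1]
        chain.append(tail[last_diff][0])
        tail = tail[last_diff + 1:]

    # Every remaining incremental must be applied, in order.
    chain.extend(label for label, kind in tail if kind == "incr")
    return chain


if __name__ == "__main__":
    incremental_week = [("sun", "full"), ("mon", "incr"),
                        ("tue", "incr"), ("wed", "incr")]
    differential_week = [("sun", "full"), ("mon", "diff"),
                         ("tue", "diff"), ("wed", "diff")]
    print(restore_chain(incremental_week))
    print(restore_chain(differential_week))
```

The incremental schedule writes the least data each night but needs every tape since the last full backup at restore time. The differential schedule writes more each night but needs only two restores. That is the backup-speed versus restore-speed trade-off in miniature.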
Enterprise IT environments evolve over time; they are not purchased
as a turnkey solution. In many large computer shops, brand new state-of-the-art
systems are sitting beside and inter-operating with 10-year-old legacy
systems. These systems, applications and services can evolve over a
long period of time, requiring the talents of many people with diverse
and unique skills. These environments seem to be in a constant state
of flux, growing, changing, and moving. Often, critical components within
an infrastructure were implemented years ago by someone no longer with
the organization and without proper documentation. This does not pose
a problem until something goes wrong. In most isolated cases, existing
staff can resolve these problems. But in a disaster scenario, where
many elements of a critical application must be recovered simultaneously,
there may be a number of these unknown entities that need to be rebuilt.
In a trouble-shooting scenario, the number of variables that exist can
exponentially affect the time required to restore things to an operational
state.
The tools and applications that an end user or customer sees today are
typically provided via a combination of servers, operating systems,
applications, databases, networks, protocols, and services. The complex
sequence of events that culminates in a client receiving an e-mail is
rarely seen or understood by most people, not to mention what happens
in more complex transaction-based environments containing multiple
databases, application servers, networks, and clients. The number of single
points of failure that exist in this sequence of events is staggering.
The precision with which the individual parts need to work together
to present a single application is complex and critical.
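A simple serial-availability calculation shows why a long chain of single points of failure matters. This is a sketch under a strong simplifying assumption (components fail independently and the application needs all of them); the component names and the 99.9 percent figures are illustrative, not measurements.

```python
from math import prod

HOURS_PER_YEAR = 8760


def chain_availability(component_availability: dict) -> float:
    """Availability of an application that needs every component to be up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    for name, value in component_availability.items():
        if not 0 <= value <= 1:
            raise ValueError(f"availability for {name} must be between 0 and 1")
    return prod(component_availability.values())


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of outage per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical e-mail delivery path, each element at 99.9 percent.
    mail_path = {
        "client": 0.999,
        "network": 0.999,
        "mail server": 0.999,
        "operating system": 0.999,
        "storage": 0.999,
        "directory service": 0.999,
    }
    overall = chain_availability(mail_path)
    print(f"overall availability: {overall:.4%}")
    print(f"expected downtime: {expected_downtime_hours(overall):.1f} hours/year")
```

Six components that each look very reliable in isolation combine to about 52 hours of expected outage a year, roughly six times the figure for any one of them.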
The move to open systems from proprietary systems
has further fueled this storm of change. New products and vendors are
appearing on and disappearing from the landscape on a
regular basis. Interoperability (the ability for one product to function
with another) was a declaration made to ease fears about integrating
a variety of disparate products. The question usually asked of vendors
was, "Can your product do X?" and the answer was usually a
resounding, "Absolutely!" Perhaps the question should have
been, "What does it take for your product to do X?" Implementing
and managing a diverse and highly interdependent open
IT infrastructure can often cost more in terms of quality, dollars,
staff, manageability and time than a similar solution did
on a proprietary infrastructure. That said, it is clear that this technology
has fueled the creative fires of the industry, injecting new ideas,
capabilities and vision. It is here to stay.
The philosophies around emergency preparedness in business have also
undergone changes, albeit more subtle ones. The term disaster recovery is
used less frequently, in favor of business continuity. The message here
is clear: recovery implies that something has been disrupted or
stopped, and has been brought back to a functional state. Continuity
implies no disruption, a much-preferred condition.
The primary model for vendor-based hot-site services is that of a subscription
to a pool of shared resources. That is, the vendors allow a number of
subscribers to contract against a shared pool of resources at a defined
rate. For example, there may be 30 subscribers to a single resource
(be it a mainframe, UNIX server, network, desktop, etc.). Should a disaster
occur affecting numerous customers, a first-come, first-served approach
to providing access to the subscribed-to resources is in effect. Most
large recovery vendors have resources to accommodate more than one customer
declaring a disaster at one time, but there is a limit. There are typically
no guarantees that you will have access to the equipment, space, and
resources you have subscribed to in the event of a disaster.
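The contention risk in a shared pool can be approximated with a binomial calculation. This is a sketch only: the 30 subscribers come from the example above, while the pool capacity of two, the 10 percent declaration probability, and the function name are assumptions made for illustration. It also assumes subscribers declare independently, which understates the risk in a regional event where declarations are correlated.

```python
from math import comb


def prob_demand_exceeds_capacity(subscribers: int, capacity: int,
                                 p_declare: float) -> float:
    """Probability that more subscribers declare than the pool can serve.

    Models each subscriber as declaring independently with probability
    p_declare during the same event (a simplifying assumption).
    """
    if not 0 <= p_declare <= 1:
        raise ValueError("p_declare must be between 0 and 1")
    if subscribers < 0 or capacity < 0:
        raise ValueError("subscribers and capacity must be non-negative")
    if capacity >= subscribers:
        return 0.0
    p_within = sum(
        comb(subscribers, k) * p_declare ** k
        * (1 - p_declare) ** (subscribers - k)
        for k in range(capacity + 1)
    )
    return 1.0 - p_within


if __name__ == "__main__":
    # 30 subscribers share a resource that can serve 2 at once (assumed),
    # and each has a 10 percent chance of declaring in a wide-area event.
    risk = prob_demand_exceeds_capacity(30, 2, 0.10)
    print(f"chance that someone is turned away: {risk:.0%}")
```

Asking a vendor for the actual subscriber-to-resource ratio and the number of simultaneous declarations it can support gives you the inputs to run this for your own contract.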
For subscription-based hot site and recovery centers, the ability to
test your recovery plans is also limited. Typically, a 24-hour annual
test window is provided, but this number is contract specific and can
be negotiated. More test time usually means more money. The time allotted
to testing is often insufficient to fully rebuild and properly test
your recovery strategies. Travel often comes into play for recovery
personnel, in either a disaster or a test scenario. The equipment you will
be using is usually shared between many customers, making customization
difficult. Revision levels, change, and currency of the systems can
also add to the challenge. Your recovery procedures can be further complicated
by the need to allow for variance in the target systems you will be
recovering. A key point here is a lack of control of the target recovery
environment. These are difficult problems to manage in open systems
architectures, where supported configuration issues are complex and
somewhat dynamic.
The following are just some of the issues and facts that contribute
to the challenges of creating and maintaining true disaster recovery
and business continuity capabilities:
- Increased reliance on computer systems, data, and infrastructure
  - 7 x 24: a daunting requirement!
  - Greatly decreased downtime and maintenance windows
  - Requirement for increased application performance in direct conflict
    with increase in volume of data
  - Impacts of outages can affect a number of areas including health
    and safety, security, financial, regulatory and legal requirements,
    customer satisfaction, employee satisfaction
  - Manual fallback procedures often no longer viable in disaster
    scenario
- Huge increase in amount and types of data, directly impacting:
  - IT budgets
  - Disk, tape, and physical space requirements
  - Network traffic
  - System requirements (CPU, memory, etc.)
  - Differing storage strategies (SAN, NAS, direct attached, etc.)
  - Management of data
  - Performance
  - Backup and restore times
  - Diverse data types, adding to complexity (platforms, versions,
    databases, applications, networks, etc.)
- Increase in complexity
  - Critical interoperability and compatibility issues
  - Large number of vendors and products
  - FUD (Fear, Uncertainty, Doubt) affects managers forced to select
    and justify decisions based on vendor marketing
  - Need for thorough, comprehensive, and current documentation greatly
    increased, yet documentation is typically low on the priority list:
    a significant risk
  - Recovery and functionality of single applications rely heavily
    on large number of disparate systems (servers, databases, clients,
    network, storage, supporting applications, etc.)
- Increase in number and types of threats to IT
  - Human error (the biggest threat to IT)
  - Increase in malicious acts (terrorism, hackers, viruses, disgruntled
    employees, etc.)
  - More "eggs in one basket" due to consolidation of data
    centers
- Rapid degree of change
  - Corporate growth (downsizing, acquisitions and mergers)
  - Technology changes (e.g. SAN, Internet, Voice over IP)
  - Potentially short product lifecycles (quick obsolescence)
  - Market changes require dynamic environments
  - Philosophical changes (e.g. outsourcing, e-business, centralization,
    de-centralization)
- Some other key elements affecting the IT environment
  - Huge growth in number of people employed in IT over a relatively
    short period of time (dilution of talent)
  - Eroding budget vs. increased expectations
  - Increase in technology vs. decrease in staff
  - Employee turnover
  - Training requirements (rapid obsolescence)
The combination of
a number of factors in the industry has given us a false sense of security
about our IT infrastructure and its continued well-being. Although it
is true that the frequency of system failures, data and application
outages has decreased, the difficulty of recovering via conventional
means has greatly increased, and the impact of an outage, should one occur,
can be catastrophic.
The biggest threat to the ability to recover from a disaster is the
lack of practicing and testing of the procedures required to accomplish
this most critical and difficult task. One of the key contributors to
this misunderstood problem is, oddly enough, the increase in reliability
of many of the products we deploy within our infrastructure today; for
example, resilient disk storage systems (RAID 5, mirroring, etc.), clustered
server systems, database journaling, etc. Although the likelihood of
a technical failure can be greatly decreased via high availability strategies,
the ability to recover from an incident that HA cannot guard against
has greatly diminished.
In all likelihood, the high availability environment within your organization
took a large amount of resources to implement, from people, to dollars,
to time. Expecting this type of environment to be recovered in hours
without significant resources being directed to this capability prior
to an incident is unrealistic.
So, What Now?
In a disaster scenario, you may not just be dealing with the need to
fully recover a server from scratch; you will likely be doing so on
hardware that differs from the production server you are trying to
recover. Further, there may be numerous peripheral and supporting servers
and services required to make critical applications functional, not
the least of which will be network related. Compatibility and configuration
issues can be overwhelming. The difficulty level goes through the roof
when doing this under extreme pressure, likely without access to your
existing environment, or perhaps even without some of the key personnel
that built and managed your production environment.
Leveraging the high availability products, services and techniques available
today is key to a viable business continuity strategy. A combination
of resilient storage subsystems, clustered system environments, redundant
networks, and a host of other HA products and techniques will pave the
road to 7x24-hour operation, even in the event of a disaster. The key
element in most of these solutions is that two identical and current
copies of production data exist in at least two separate locations.
This can be accomplished via a combination of redundant (mirrored) disk
storage hardware, software and processes. Some organizations have leveraged
their existing hot-site service providers to house and maintain infrastructure
dedicated to them. This is a viable option for those organizations that
do not have a second site where they can stage their own hot-site. These
types of strategies will, in most cases, eliminate the need to restore
production data from tape in a disaster scenario.
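Whatever mechanism keeps the two copies in step, it is worth verifying that they really are identical. The sketch below compares two directory trees by checksum. It is a file-level stand-in for illustration only: real mirrored storage is usually replicated at the block level by the array or by volume management software, and the function names here are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            # Reads each file fully into memory; fine for a sketch,
            # but large files would need chunked hashing.
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def mirror_differences(primary: Path, secondary: Path) -> dict:
    """Report files that are missing, extra, or different on the secondary copy."""
    a, b = tree_digests(primary), tree_digests(secondary)
    return {
        "missing_on_secondary": sorted(a.keys() - b.keys()),
        "extra_on_secondary": sorted(b.keys() - a.keys()),
        "content_mismatch": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


def is_current(primary: Path, secondary: Path) -> bool:
    """True when the two trees hold exactly the same files and contents."""
    return not any(mirror_differences(primary, secondary).values())
```

Running a check of this kind on a schedule, and treating any reported difference as an incident, turns "we have a mirror" into something that is actually tested.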
Keeping configurations standard and as simple as possible is an absolute
requirement for long-term recovery and availability capabilities. In
the complex and interdependent environments of today's IT, the
less customization the better. Further, with the frequent staff changes
and typically inadequate documentation, intuitive and standard configurations
are more easily supported and quickly recovered.
Change management processes must be implemented, strictly adhered to,
and enforced. This must be mandated by the most senior level of management
possible. ITIL (Information Technology Infrastructure Library) provides
a framework for better managing IT, and includes some best practices
in regards to change management.
New technologies such as SANs (storage area networks), fibre channel,
networking, high availability disk arrays, and various disk virtualization
and management products provide the means to bring customized hot-site
solutions to organizations that would not have been able to afford these
solutions in the past. In fact, many can't afford not to implement
these types of solutions.
The cost of implementing in-house hot-site solutions has decreased dramatically.
The cost per megabyte of storage, along with the relative cost of bandwidth
between two sites is at a level where this type of solution is becoming
more palatable financially every day. In fact, the cost of most computer
hardware products has been dropping steadily over the past few years.
"Throwing hardware" at a solution can often be more cost effective,
less time consuming, and less complex than having people develop, implement,
maintain, and manage complex recovery solutions.
Leveraging internal test and development environments for disaster recovery
purposes is also a viable alternative. The equipment typically used
for testing can also be used in a disaster scenario. But be very careful
when implementing a multi-function environment such as this. Very strict
rules of engagement and change management must be deployed to ensure
that one objective does not negatively affect the other. Again, strong
management is a key to successfully implementing this type of strategy.
The Bottom Line
The benefits of implementing mirrored site solutions go far beyond mere
recovery capabilities. Other benefits can include cost savings, reliability,
centralized management, scalability, decreased maintenance windows,
performance and security.
In these ever changing times, we as professionals must be vigilant in
protecting the organizations for which we work. Perhaps now more than
ever is a good time to review the processes, procedures and tools we
employ to guarantee the continued viability of our businesses. It is
difficult to keep up with technological changes and advancements, but
not investigating the possibilities, and making false assumptions about
the viability and affordability of new and better recovery and continuity
capabilities, is the biggest risk of all. "We have always done
it this way" is not a convincing argument for why we are doing things
in a specific way today.
Andre Noenchen currently
works for Infostream Technologies Inc. as a senior consultant, specializing
in the design and implementation of high availability SAN-based infrastructures.
To comment on this
article, go to 1502-05 at www.drj.com/feedback.