SERVICE LEVEL AGREEMENTS
Ensuring Application, SLA Performance In Disaster Recovery
By IVAN H. SHEFRIN

The primary objective
for IT disaster recovery planning is not only to guarantee application
availability but, more broadly, to ensure business continuity:
the ability of an organization's employees, customers and suppliers
to continue doing business with minimal disruption and maximum efficiency.
A significant portion of the cost associated with IT disaster recovery
(DR) is in fact driven by the degree to which an organization must avoid
service disruption to these constituent groups. Much of the focus in
DR planning justifiably goes toward the labyrinthine task of ensuring
basic availability and uptime, in order that the same applications and
data that end-users employ on a daily basis are available during disaster
recovery situations.

However, in planning
for disaster recovery, users should keep in mind that availability is
only the tip of a very dangerous iceberg, the most visible part
of a much larger challenge that mostly lies unseen. Below the obvious
surface of availability lies the problem of application performance
and brownouts, a treacherous problem that can potentially sink the
best-laid DR plans. The reason for concern is that performance degradations
(application brownouts) are much more difficult than downtime
to manage and contain. At the same time, their business impact can be
just as significant.

Brownouts can have
a significant impact on the ability of a DR plan to ensure business
continuity, particularly when compared with the performance experienced
using primary instead of DR systems. A brownout can actually do much
more damage to an end user's business than hard downtime, primarily
because brownouts occur more often and are harder to diagnose and correct.
In addition, brownouts usually affect only a subset of customers at
one time, and seldom represent an outright service outage. They are
more likely to appear as intermittently poor response times during
certain periods, in certain applications and from particular locations.
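
The pattern described above (slow responses limited to certain periods and locations, with no outright outage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample format `(hour, location, response_ms)`, the function names, the nearest-rank 95th percentile and the threshold value are all hypothetical choices, not features of any particular monitoring product.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def find_brownouts(samples, threshold_ms, min_samples=5):
    """Return {(hour, location): p95_ms} for windows that look like brownouts.

    samples is an iterable of (hour, location, response_ms) tuples, all for
    requests that succeeded; a window is flagged when its 95th-percentile
    response time exceeds threshold_ms (a hypothetical SLA limit).
    """
    windows = defaultdict(list)
    for hour, location, response_ms in samples:
        windows[(hour, location)].append(response_ms)

    slow = {}
    for key, times in windows.items():
        if len(times) < min_samples:
            continue  # too few samples to judge this window
        p95 = percentile(times, 95)
        if p95 > threshold_ms:
            slow[key] = p95
    return slow


# Every request below succeeded, so an availability report would show 100%,
# yet one location is badly degraded during one hour.
samples = (
    [(9, "NYC", 200)] * 10
    + [(17, "NYC", 200)] * 5
    + [(17, "NYC", 3000)] * 5
    + [(17, "LON", 250)] * 10
)
print(find_brownouts(samples, threshold_ms=1000))
```

Grouping by period and location is what makes the degradation visible; an average taken over the whole day and all sites would tend to hide it.
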
But if performance degrades during peak business hours (e.g., at the
time a business takes most of its orders, or a bank clears funds at
the end of the day), even while overall availability SLAs may be met,
the practical effect is that the application is unavailable.

Because of their potentially
significant impact on business continuity, monitoring and diagnosing
brownouts to maintain application performance service levels should
be an integral part of disaster recovery planning. Without such planning,
for example, the number of transactions an IT system can execute may
drop dramatically.

While basic system
and network availability may be restored, an organization may be effectively
out of business due to performance degradation, even while DR or service
providers have met their contractual SLA commitments. Service level
agreements (SLAs) that guarantee not only availability but also performance
are becoming more commonplace, and user expectations more demanding.

Within the next year
or so, it is also likely that end-to-end performance SLAs will cease
to be a differentiator and become a basic service offering that every
provider must deliver. While it may be stating the obvious, without
parallel efforts to introduce performance and SLA management into disaster
recovery plans, an organization's ability to continue doing business
is at significant risk. Unfortunately, many IT groups fail to plan for
managing DR performance because they are so focused on establishing
basic connectivity and availability.

What are some of the
key considerations to account for in addressing DR performance and SLA
management? For organizations using third-party DR vendors, managing
performance can be complex. First of all, most performance and SLA management
tools are far too complex to implement in a fast-moving DR situation,
and even then may be too expensive to justify because most companies
only run DR testing once a year. Some performance management solutions
can take years to implement, using an army of expensive consultants,
and require highly skilled personnel to operate successfully. The challenge
of using them in a short-term DR environment is daunting enough to deter
many senior management technicians.

Ensuring application
performance and reliability in a DR situation requires an out-of-the-box
toolset that is also scalable and extensible to enterprise requirements.
A tool that works well for day-to-day operations in your primary IT
environment may not be well-suited for monitoring DR performance. If
you plan on using the same toolset, make sure you have negotiated a
license agreement with the vendor that allows for deployment and use
both in an actual disaster recovery situation and for testing.
Depending on your environment, one option to consider may also be MSP
services, which your DR provider may be able to deploy quickly. Depending
on the applications involved in your DR plan, it may also make sense
to use a combination of inside-out and outside-in performance monitoring,
particularly if your customers or business partners access applications
via the Web.

In addition, if your
third-party DR provider has a tool they already use and you have access
to, make sure it can accommodate multiple users via a Web-based interface,
that you can review performance results in real-time, and that it can
generate and share reports among all relevant constituencies. A careful
review of performance and SLA monitoring solutions is necessary to find
the toolset that best fits your DR plan and environment.

In cases where an
IT organization uses its own facilities for disaster recovery and testing,
selecting and configuring a performance management tool to track SLA
compliance involves a slightly different set of requirements. One of
the most important criteria is to have a monitoring system that can
maintain a fail-over backup database capability, allowing performance
management to continue on a virtually uninterrupted basis without the
need for lengthy setup, implementation and configuration. The SLA and
performance management tool should be able to specify multiple backup
databases (primary, secondary and tertiary) and have the ability to
synchronize data across them. For example, your tool should be able
to take advantage of the publish-subscribe mechanisms in SQL or Oracle
to create distributed clusters of performance management data accessible
to backup monitoring systems. In addition, when creating real-time fail-over
backups for your management tool, make sure that you are synchronizing
not only the performance results data, but also the critical configuration
and provisioning data your toolset uses to do the actual monitoring,
alerting, diagnostics and SLA analysis.

An additional key
consideration is to make sure there is a parallel recovery plan for
the data collection mechanism itself, since a performance and SLA management
solution is only as good as the metrics it collects. The metrics can
include a variety of data sources for drill-down and diagnostics, but
the most critical data sources are end-to-end application response time
metrics, without which you cannot measure SLAs that are meaningful from
the perspective of business continuity. In order to maintain continuous
data collection, it is therefore important to deploy data collection
software agents to points of presence in the DR architecture. If using
your own backup facilities, this should be a fairly straightforward
process. Again, if using a third-party DR vendor, then a key success
factor will be your ability to deploy a solution out-of-the-box.
Either way, it is also desirable to use a product that does not charge
for the agent software itself, allowing you to deploy as many as you
need, but rather bases the license on what you monitor, generally a
more cost-effective methodology.

During the DR performance
and SLA management planning process, one of the stickiest problems to
solve, and a major obstacle in terms of bringing a toolset up quickly
into DR production, is how to provision the system. The first step here
is to use your existing DR assessment process to track the application,
server and network services that need monitoring, and then build a plan
to provision monitoring in advance. Provisioning a performance management
tool tells the system which network, system and application services
it needs to monitor, alert, analyze and report on. However, the major
configuration challenge for timely DR performance management rollout
is that DR systems and network components are by definition different
from the primary infrastructure they replace. Use of a third-party DR
vendor may compound the problem because of the different facilities
used, versus an in-house system under the IT department's control.
For monitoring services such as Web sites, provisioning is less of an
issue because DNS resolution allows the system to use identical URLs.
But for network services or anything based on IP addresses, the challenge
is more formidable. One approach is to maintain a translation table
of primary and DR resources that will need monitoring and SLA analysis.
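
As a rough sketch of what such a translation table might look like, the Python below maps primary targets to their DR equivalents and rewrites a list of monitor definitions. Everything here is an assumption made for illustration: the addresses, the database names, the `target` and `dns_resolved` fields and the function name are hypothetical and not taken from any specific tool.

```python
# Hypothetical translation table: primary resource -> DR equivalent.
DR_TRANSLATION = {
    "10.1.0.15": "172.16.8.15",           # application server
    "10.1.0.53": "172.16.8.53",           # DNS server
    "orders-db-primary": "orders-db-dr",  # database name
}


def to_dr_provisioning(monitors, translation):
    """Swap monitor targets to their DR equivalents.

    Returns (dr_monitors, unmapped). Monitors flagged dns_resolved (for
    example Web site URLs) keep their target, since DNS points the same
    name at the DR site. Targets with no table entry are reported in
    unmapped so the gap is found before a disaster, not during one.
    """
    dr_monitors, unmapped = [], []
    for monitor in monitors:
        target = monitor["target"]
        if target in translation:
            dr_monitors.append({**monitor, "target": translation[target]})
        elif monitor.get("dns_resolved"):
            dr_monitors.append(dict(monitor))
        else:
            unmapped.append(target)
    return dr_monitors, unmapped


primary_monitors = [
    {"name": "app ping", "target": "10.1.0.15"},
    {"name": "storefront", "target": "https://shop.example.com", "dns_resolved": True},
    {"name": "vpn gateway", "target": "10.1.0.1"},
]
dr_monitors, unmapped = to_dr_provisioning(primary_monitors, DR_TRANSLATION)
print(dr_monitors)
print("No DR mapping for:", unmapped)
```

Reporting unmapped targets separately is a deliberate choice in this sketch: a DR test is a good time to find out which entries the table is still missing.
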
The ability to maintain provisioning information that can be swapped
out with an up-to-date list of DR resources, such as application and
database names, IP addresses, servers and network components (e.g.,
DNS servers or VPN gateways) is critical to the success of a DR performance
monitoring plan.

Once you select the
appropriate tool and/or service offering, the next step is to begin
planning how to use it during DR testing. During DR testing itself,
best practices require that you build a performance baseline
against which to measure progress, set thresholds and build SLA policies.
Without an adequate performance baseline of the DR architecture, it
is impossible to guarantee that the DR architecture will provide performance
service levels equivalent to your primary network, systems and application
infrastructure, thereby ensuring a high level of business continuity.
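
A minimal sketch of the baseline idea follows, assuming response-time samples in milliseconds collected during a DR test and during normal primary operation. The function names, the nearest-rank 95th percentile and the 25 percent alert margin are illustrative assumptions, not recommendations from the article.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def build_baseline(response_ms):
    """Summarize response-time samples (milliseconds) into a baseline."""
    return {"mean_ms": mean(response_ms), "p95_ms": percentile(response_ms, 95)}


def alert_threshold(baseline, margin=1.25):
    """Hypothetical policy: alert when responses exceed p95 plus a margin."""
    return baseline["p95_ms"] * margin


def slowdown_vs_primary(dr_baseline, primary_baseline):
    """Ratio of DR to primary p95; 1.0 means equivalent performance."""
    return dr_baseline["p95_ms"] / primary_baseline["p95_ms"]


primary = build_baseline([100] * 19 + [200])  # normal operations
dr = build_baseline([150] * 19 + [400])       # measured during a DR test
print(dr, alert_threshold(dr), slowdown_vs_primary(dr, primary))
```

A ratio well above 1.0 is the kind of result to bring to a discussion with the DR team or vendor before thresholds and SLA policies are finalized.
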
The next step is to compare the DR performance baseline against the
normal performance and SLA compliance of your primary IT services, in
order to tune alert thresholds and SLA policies, set expectations and
facilitate discussions with your DR team and vendors. Getting an accurate
baseline within the short windows available during DR testing is an
important success factor for a performance-monitoring tool. Meet with
your DR vendor to make any necessary SLA adjustments, and modify performance
thresholds and SLA monitoring policies accordingly. Finally, review
your plans twice a year to keep them current.

The direct benefit of DR performance monitoring is not only to measure effectiveness in terms of business continuity, but also to assist in evaluating and diagnosing any end-to-end service disruption issues during DR testing. In an actual DR situation, you will need diagnostic drill-down features not only to analyze high-level business SLA compliance, but also to perform triage on root causes, in order to facilitate service restoration when problems occur. DR performance management will allow you to extend the service restoration concept from outages to brownouts, thereby increasing the IT contribution to overall business continuity. While basic system and network availability is always the first priority in any DR scenario, ensuring adequate performance is a key ingredient in keeping your business running and your customers happy.

Ivan H. Shefrin is founder and senior vice president of business development for Response Networks, Inc., a leading provider of proactive service level management solutions. He directs Response Networks' strategic alliances in the areas of technology integration, co-marketing and joint sales initiatives. In addition, he leads the development of corporate strategy and business planning. For more information, call 800-677-7638 or visit www.responsenetworks.com. To comment on this article, go to 1503-13 at www.drj.com/feedback.