While basic system and network availability may be restored, an organization may be effectively out of business due to performance degradation, even while DR or service providers have met their contractual service level agreement (SLA) commitments. SLAs that guarantee not only availability but also performance are becoming more commonplace, and user expectations more demanding.
Within the next year or so, it is also likely that end-to-end performance SLAs will cease to be a differentiator and instead become a basic service offering that every provider must deliver. While it may be stating the obvious, without parallel efforts to introduce performance and SLA management into disaster recovery plans, an organization’s ability to continue doing business is at significant risk. Unfortunately, many IT groups fail to plan for managing DR performance because they are so focused on establishing basic connectivity and availability.
What are some of the key considerations to account for in addressing DR performance and SLA management? For organizations using third-party DR vendors, managing performance can be complex. First of all, most performance and SLA management tools are far too complex to implement in a fast-moving DR situation, and even then may be too expensive to justify because most companies run DR testing only once a year. Some performance management solutions can take years to implement, using an army of expensive consultants, and require highly skilled personnel to operate successfully. The challenge of using them in a short-term DR environment is daunting enough to deter many senior management technicians.
Ensuring application performance and reliability in a DR situation requires an out-of-the-box toolset that is also scalable and extensible to enterprise requirements. A tool that works well for day-to-day operations in your primary IT environment may not be well-suited for monitoring DR performance. If you plan on using the same toolset, make sure you have negotiated a license agreement with the vendor that allows for deployment and use both in an actual disaster recovery situation and in testing. Depending on your environment, one option to consider may also be MSP services, which your DR provider may be able to deploy quickly. Depending on the applications involved in your DR plan, it may also make sense to use a combination of inside-out and outside-in performance monitoring, particularly if your customers or business partners access applications via the Web.
In addition, if your third-party DR provider has a tool they already use and you have access to, make sure it can accommodate multiple users via a Web-based interface, that you can review performance results in real-time, and that it can generate and share reports among all relevant constituencies. A careful review of performance and SLA monitoring solutions is necessary to find the toolset that best fits your DR plan and environment.
In cases where an IT organization uses its own facilities for disaster recovery and testing, selecting and configuring a performance management tool to track SLA compliance involves a slightly different set of requirements. One of the most important criteria is a monitoring system that can maintain a fail-over backup database capability, allowing performance management to continue on a virtually uninterrupted basis without the need for lengthy setup, implementation and configuration. The SLA and performance management tool should be able to specify multiple backup databases (primary, secondary and tertiary) and have the ability to synchronize data across them. For example, your tool should be able to take advantage of the publish-subscribe replication mechanisms in database products such as Microsoft SQL Server or Oracle to create distributed clusters of performance management data accessible to backup monitoring systems. In addition, when creating real-time fail-over backups for your management tool, make sure that you are synchronizing not only the performance results data, but also the critical configuration and provisioning data your toolset uses to do the actual monitoring, alerting, diagnostics and SLA analysis.
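The fail-over requirement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BackupDatabase` record, the dataset names and the health flag are hypothetical and do not correspond to any particular product. It shows two checks worth automating: choosing the first reachable database in priority order, and confirming that a backup replicates configuration and provisioning data as well as results.

```python
from dataclasses import dataclass

# Data categories that should be synchronized to every backup.
# The category names are illustrative assumptions.
REQUIRED_DATASETS = frozenset({"results", "configuration", "provisioning"})


@dataclass
class BackupDatabase:
    """Hypothetical record describing one monitoring database."""
    name: str
    healthy: bool                       # result of some external health check
    replicated: frozenset = frozenset() # datasets currently synchronized


def select_active(databases):
    """Return the first healthy database; list order is the priority order
    (primary, then secondary, then tertiary)."""
    for db in databases:
        if db.healthy:
            return db
    raise RuntimeError("no monitoring database is reachable")


def missing_datasets(db):
    """Return the required datasets this database does not replicate."""
    return REQUIRED_DATASETS - db.replicated


databases = [
    BackupDatabase("primary", healthy=False, replicated=REQUIRED_DATASETS),
    BackupDatabase("secondary", healthy=True, replicated=frozenset({"results"})),
    BackupDatabase("tertiary", healthy=True, replicated=REQUIRED_DATASETS),
]

active = select_active(databases)
gaps = missing_datasets(active)
print(active.name, sorted(gaps))
```

In this made-up example the secondary database takes over, but the check reveals that it holds only results data, which is exactly the gap the paragraph above warns about.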
An additional key consideration is to make sure there is a parallel recovery plan for the data collection mechanism itself, since a performance and SLA management solution is only as good as the metrics it collects. The metrics can include a variety of data sources for drill-down and diagnostics, but the most critical data sources are end-to-end application response time metrics, without which you cannot measure SLAs that are meaningful from the perspective of business continuity. In order to maintain continuous data collection, it is therefore important to deploy data collection software agents to points of presence in the DR architecture. If using your own backup facilities, this should be a fairly straightforward process. Again, if using a third-party DR vendor, then a key success factor will be your ability to deploy a solution “out-of-the-box.” Either way, it is also desirable to use a product that does not charge for the agent software itself, allowing you to deploy as many as you need, but rather bases the license on what you monitor, generally a more cost-effective methodology.
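To make the distinction between availability and performance concrete, here is a minimal sketch of how end-to-end response time samples might be turned into an SLA compliance figure. The sample values, the 500 ms target and the function names are assumptions chosen for illustration; a failed transaction is represented as `None`.

```python
def availability_pct(samples):
    """Percentage of transactions that completed at all."""
    if not samples:
        raise ValueError("no samples collected")
    completed = sum(1 for s in samples if s is not None)
    return 100.0 * completed / len(samples)


def sla_compliance_pct(samples, target_ms):
    """Percentage of transactions that completed within the response time target."""
    if not samples:
        raise ValueError("no samples collected")
    met = sum(1 for s in samples if s is not None and s <= target_ms)
    return 100.0 * met / len(samples)


# Hypothetical end-to-end response times in milliseconds; None marks a failure.
samples = [120, 300, None, 180, 2500]

availability = availability_pct(samples)          # counts the 2500 ms response as "up"
compliance = sla_compliance_pct(samples, 500)     # counts it as an SLA miss
print(availability, compliance)
```

The slow response counts toward availability but not toward compliance, which is why response time data, not just up/down status, has to keep flowing from the DR site.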
During the DR performance and SLA management planning process, one of the stickiest problems to solve, and a major obstacle in terms of bringing a toolset up quickly into DR production, is how to provision the system. The first step here is to use your existing DR assessment process to track the application, server and network services that need monitoring, and then build a plan to provision monitoring in advance. Provisioning a performance management tool tells the system which network, system and application services it needs to monitor, alert, analyze and report on. However, the major configuration challenge for timely DR performance management rollout is that DR systems and network components are by definition different from the primary infrastructure they replace. Use of a third-party DR vendor may compound the problem because the facilities differ from an in-house system under the IT department’s control. For monitoring services such as Web sites, provisioning is less of an issue because DNS resolution allows the system to use identical URLs. But for network services or anything based on IP addresses, the challenge is more formidable. One approach is to maintain a translation table of primary and DR resources that will need monitoring and SLA analysis. The ability to maintain provisioning information that can be swapped out with an up-to-date list of DR resources, such as application and database names, IP addresses, servers and network components (e.g., DNS servers or VPN gateways), is critical to the success of a DR performance monitoring plan.
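The translation-table approach might look something like the following sketch. The `MonitorTarget` record, the addresses and the host names are all invented for illustration; a real toolset would store this information in its own provisioning format. URL-based targets pass through unchanged because DNS handles the switch, IP-based targets are rewritten from the table, and anything without a DR equivalent is reported so it can be fixed before an actual disaster.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MonitorTarget:
    """Hypothetical provisioning entry for one monitored service."""
    name: str
    kind: str      # "url" or "ip"
    address: str


# Illustrative translation table: primary address -> DR address.
TRANSLATION = {
    "10.1.0.10": "172.16.0.10",   # database server
    "10.1.0.53": "172.16.0.53",   # DNS server
}


def to_dr_targets(primary_targets, translation):
    """Build the DR provisioning list and collect targets with no DR mapping."""
    dr_targets, unmapped = [], []
    for target in primary_targets:
        if target.kind == "url":
            dr_targets.append(target)          # same URL; DNS resolves to the DR site
        elif target.address in translation:
            dr_targets.append(replace(target, address=translation[target.address]))
        else:
            unmapped.append(target)            # needs attention before DR testing
    return dr_targets, unmapped


primary = [
    MonitorTarget("storefront", "url", "https://www.example.com/"),
    MonitorTarget("orders-db", "ip", "10.1.0.10"),
    MonitorTarget("vpn-gateway", "ip", "10.1.0.1"),
]

dr_targets, unmapped = to_dr_targets(primary, TRANSLATION)
print([t.address for t in dr_targets])
print([t.name for t in unmapped])
```

Keeping the table under the same change control as the DR plan itself is one way to make sure the swapped-in list stays up to date.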
Once you select the appropriate tool and/or service offering, the next step is to begin planning how to use it during DR testing. During DR testing itself, “best practices” require that you build a performance baseline against which to measure progress, set thresholds and build SLA policies. Without an adequate performance baseline of the DR architecture, it is impossible to guarantee that the DR architecture will provide performance service levels equivalent to your primary network, systems and application infrastructure, thereby ensuring a high level of business continuity. The next step is to compare the DR performance baseline against the normal performance and SLA compliance of your primary IT services, in order to tune alert thresholds and SLA policies, set expectations and facilitate discussions with your DR team and vendors. Getting an accurate baseline within the short windows available during DR testing is an important success factor for a performance-monitoring tool. Meet with your DR vendor to make any necessary SLA adjustments, and modify performance thresholds and SLA monitoring policies accordingly. Finally, review your plans twice a year to keep them current.
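As a rough illustration of comparing a DR baseline against the primary environment, the sketch below summarizes response time samples from each and flags whether the DR site falls within a chosen tolerance. The sample data, the 25 percent tolerance and the use of the median and 95th percentile are assumptions made for the example, not a prescribed method.

```python
import statistics


def baseline(samples_ms):
    """Summarize response times as median and 95th percentile (nearest rank)."""
    if not samples_ms:
        raise ValueError("no samples collected")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100     # integer ceiling of 0.95 * n
    return {"median": statistics.median(ordered), "p95": ordered[rank - 1]}


def compare_baselines(primary_ms, dr_ms, tolerance=1.25):
    """Compare DR against primary; tolerance is the allowed p95 ratio."""
    primary, dr = baseline(primary_ms), baseline(dr_ms)
    ratio = dr["p95"] / primary["p95"]
    return {
        "primary": primary,
        "dr": dr,
        "p95_ratio": ratio,
        "within_tolerance": ratio <= tolerance,
    }


# Hypothetical response times in milliseconds from normal operations and a DR test.
primary_ms = [100, 110, 120, 130, 140, 150, 160, 170, 180, 200]
dr_ms = [150, 160, 170, 180, 190, 200, 220, 240, 260, 300]

report = compare_baselines(primary_ms, dr_ms)
print(report["p95_ratio"], report["within_tolerance"])
```

A result outside the tolerance is a prompt for the discussion with the DR team and vendor described above: either the DR architecture is tuned, or thresholds and SLA policies for DR operation are adjusted to reflect what it can actually deliver.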
The direct benefit of DR performance monitoring is not only to measure effectiveness in terms of business continuity, but also to assist in evaluating and diagnosing any end-to-end service disruption issues during DR testing. In an actual DR situation, you will need diagnostic drill-down features not only to analyze high-level business SLA compliance, but also to perform triage on the root cause of any violations, in order to facilitate service restoration when problems occur. DR performance management will allow you to extend the service restoration concept from outages to brownouts, thereby increasing the IT contribution to overall business continuity. While basic system and network availability is always the first priority in any DR scenario, ensuring adequate performance is a key ingredient in keeping your business running and customers happy.
Ivan H. Shefrin is founder and senior vice president of business development for Response Networks, Inc., a leading provider of proactive service level management solutions. He directs Response Networks’ strategic alliances in the areas of technology integration, co-marketing and joint sales initiatives. In addition, he leads the development of corporate strategy and business planning. For more information, call 800-677-7638 or visit www.responsenetworks.com.