Effective IT disaster recovery (DR) and business continuity planning is essential for every business. All businesses depend on their IT services for moment-to-moment operations. So they must all take measures to ensure that those services are not disrupted due to a natural or man-made disaster.
However, because IT environments have become so complex – and because IT budgets are stretched so thinly – IT organizations face real challenges as they attempt to safeguard the business against such disasters. These challenges include:
1. Ensuring that planned DR measures will actually meet the needs of the business if and when they are called upon.
2. Determining exactly when such measures should, in fact, be activated.
3. Demonstrating to senior management, auditors, insurers, and/or regulators that best efforts have been made to protect the business from a full range of potentially disruptive eventualities.
All of these challenges can be addressed by pre-testing DR plans in a simulated network environment. By performing these simulations, DR planners can cost-effectively assess the end-to-end performance of critical applications under any projected condition. Equipped with this insight, DR planners can make more informed decisions about how to best ensure business continuity – and can fully justify those decisions to any appropriate third parties.
Why Is Disaster Recovery Planning So Hard?
While events such as the attacks of 9/11 and Hurricane Katrina have called greater attention to the possibility of catastrophic business interruption, businesses have always had to plan for the worst. IT’s role in this planning has grown over the years as IT services have become an increasingly critical component in day-to-day business operations. In fact, for many businesses, survival depends almost entirely on the ability to provide at least some minimal number of core IT services to at least some minimal number of end-users – regardless of where those end-users may be.
Several factors make it difficult to plan for the delivery of core IT services to end-users who may or may not be working at their usual locations:
• "Live" testing is often impractical. DR measures often include the use of "hot sites," "cold sites," alternative network service providers, and other resources that can’t be tested "live" with any frequency – if at all. This makes it difficult, if not impossible, to re-assess such measures when applications are added or other changes are made to the enterprise computing environment.
• Business continuity depends on application performance. It’s not enough to simply give end-users in some remote location networked access to a server somewhere. They also have to be able to actually use the application they’re accessing. If the application behaves too sluggishly – or network latency keeps it from functioning properly at all – the disaster recovery plan will fail.
• Application performance can be difficult to predict. Different applications respond in very different ways to bandwidth constraints, added latency, and intermittent connectivity. Without real insight into the idiosyncrasies of each critical application, DR planners may under- or over-provision contingency infrastructure.
• There are so many different contingencies to consider. IT has to be prepared for everything from an accidental cable cut to the complete destruction of the corporate data center. The sheer number of potential disaster scenarios can tax the ability of even the largest IT organization to assess them all and formulate appropriate recovery plans.
• DR planning resources are limited. Most IT organizations already have their hands full maintaining existing systems and supporting strategic, high-ROI technology initiatives.
• It has to work right the first time. There’s no room for error when it comes to DR. If IT discovers a problem with a typical system upgrade or new piece of hardware, it can quickly roll things back to the previous state. But there’s nothing to roll back to in the event of a disaster. So IT has to have a high level of confidence in its contingency plans.
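The second and third points above lend themselves to a rough back-of-envelope model. The sketch below – in which every function name and figure is an illustrative assumption, not a measurement from any real application – shows why a "chatty" application can suffer far more from added latency than a bulk transfer moving even more data:

```python
def response_time(round_trips, payload_bytes, rtt_s, bandwidth_bps):
    """First-order estimate of one transaction's response time:
    one full round trip per application turn, plus serialization
    time for the total payload on the link."""
    return round_trips * rtt_s + (payload_bytes * 8) / bandwidth_bps

# A "chatty" client/server screen (200 small turns) vs. a bulk file
# transfer, both over an assumed T1 (1.544 Mb/s) with 100 ms RTT.
chatty = response_time(round_trips=200, payload_bytes=50_000,
                       rtt_s=0.100, bandwidth_bps=1_544_000)
bulk = response_time(round_trips=3, payload_bytes=500_000,
                     rtt_s=0.100, bandwidth_bps=1_544_000)
print(f"chatty screen: ~{chatty:.1f} s, bulk transfer: ~{bulk:.1f} s")
```

Under these assumed numbers the chatty screen takes roughly seven times as long as the bulk transfer despite moving a tenth of the data, and extra bandwidth would barely help it – the kind of idiosyncrasy a simulated test makes visible before a disaster does.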
For these reasons and others, IT organizations often struggle to achieve confidence in their ability to fully protect the business from all possible contingencies. These factors can also make it difficult for IT to satisfy the demands of upper management and/or outside auditors for proof that adequate planning and testing have been performed. That’s why many IT organizations are actively seeking a better way to perform and manage their DR planning.
The Simulation Solution
One powerful way to improve DR planning is to pre-test disaster recovery scenarios using network simulation. Network simulation in a lab environment provides a safe and flexible means of observing end-to-end application performance under a full range of possible scenarios. It does this by accurately simulating conditions as they exist in the current production network environment – or as they might exist in a disruption or recovery scenario – so IT staff can observe, analyze, and experiment with applications and infrastructure.
Effective simulation must encompass all factors relevant to application performance across the network, including:
• All network links and their impairments (physical distance and associated latency, bandwidth, jitter, packet loss, committed information rate (CIR), QoS/MPLS classification schemes, etc.)
• The number and distribution of end-users at each remote location
• Application traffic loads
• Specific scenario events, such as the disruption of a particular network link
Ideally, it should be possible to directly import these attributes from the current production network and/or any provisioned DR resources – such as remote "hot" or "cold" site facilities and failover network circuits. It should also be possible to modify environmental attributes at will in the simulated environment in order to assess how changes in conditions – such as increased utilization or intermittent connectivity – will affect application performance.
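As a concrete, if simplified, illustration of how these attributes interact, the well-known Mathis et al. approximation estimates steady-state TCP throughput from just the maximum segment size (MSS), round-trip time, and packet-loss rate. The sketch below uses it to compare an assumed primary circuit with a longer-latency failover path; the figures are purely illustrative, not output from any simulation product:

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. back-of-envelope TCP throughput:
    rate ~ (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2)."""
    C = math.sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_s) * (C / math.sqrt(loss_rate))

# Same packet-loss rate on both paths, but the assumed failover
# circuit has 80 ms RTT instead of the primary's 10 ms.
primary = mathis_throughput_bps(1460, 0.010, 0.0001)
failover = mathis_throughput_bps(1460, 0.080, 0.0001)
print(f"primary ~{primary / 1e6:.0f} Mb/s, failover ~{failover / 1e6:.0f} Mb/s")
```

The eightfold drop in this toy comparison comes entirely from latency – exactly the sort of effect that is invisible on a bandwidth spec sheet but obvious when the failover path is exercised in a simulated environment.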
Testing in a simulated network environment doesn’t just enable IT staff to measure functional attributes such as the utilization of a particular link or the congestion at a certain switch. By connecting the "virtual" simulated network to production servers running live applications, IT staff can directly observe and accurately measure the end-user’s experience at the desktop for each and every remote location.
This ability to accurately assess end-to-end application performance and to create a limitless range of "what-if" scenarios makes simulation-based testing an exceptionally powerful technology for DR planners.
Ensuring the Adequacy of DR Plans in Simulated Test Environments
The applicability of network simulation to DR planning is fairly obvious. Instead of having to actually "go live" with any given recovery scenario, DR planners can readily simulate those scenarios in the lab. This is far less expensive and much more convenient than activation of DR sites or backup network circuits.
Using this kind of testing environment, DR planners can make a variety of determinations about projected contingency scenarios, including:
• Does the DR plan provide adequate application performance to the minimum acceptable number of end-users?
• Can the planned recovery solution meet the needs of all end-users at all remote locations?
• What is the maximum number of users the solution can support without pushing performance below acceptable levels?
• Are there any specific applications that the planned recovery solution can’t adequately support?
• Will additional bandwidth improve performance? Or are performance issues primarily related to other network characteristics, such as distance and latency? Does this mean the hot site or secondary data center should be relocated?
• Will an alternative architecture help performance? How many servers would be required to support a given number of end-users?
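The maximum-users question above is typically answered by sweeping load in the simulated environment until response time crosses the acceptable threshold. A toy sketch of that sweep follows; the `measure()` model, SLA figure, and circuit size are all invented for illustration, since in practice the measurement would come from the simulator itself:

```python
SLA_SECONDS = 3.0                  # assumed acceptable response time
LINK_BPS = 1_544_000               # assumed failover circuit (T1)
BYTES_PER_TRANSACTION = 60_000     # assumed per-user transaction size

def measure(users):
    """Stand-in for a real simulation run: each user's share of the
    link shrinks as users are added, so response time grows with load."""
    per_user_bps = LINK_BPS / users
    return (BYTES_PER_TRANSACTION * 8) / per_user_bps

def max_supported_users():
    """Add one simulated user at a time until the SLA would be broken."""
    users = 1
    while measure(users + 1) <= SLA_SECONDS:
        users += 1
    return users

print(max_supported_users())
```

The same sweep, re-run after each application rollout or headcount change, is what keeps the answer current as the business evolves.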
Of course, the answers to these questions are likely to change over time. As a business grows, for example, the minimum number of end-users it needs to support to get through a typical day is likely to increase. The new and/or modified applications that get rolled out may have different performance characteristics – which means they will have to be tested under a full range of DR conditions. With simulation technology, it is relatively easy to test and re-validate DR plans in light of these changes. Without it, IT faces the choice of either re-running costly production "fire drills" or waiting several months for the next scheduled one – potentially leaving the business vulnerable until then.
From a business perspective, it’s also important to note that this kind of testing eliminates the need to over-provision DR resources. With budgets tight, IT organizations can ill afford to spend more on DR infrastructure than they have to. By performing appropriate testing in a virtual environment, DR planners can effectively "right-size" such infrastructure and avoid inefficient resource allocation.
Determining When to ‘Pull the Trigger’
When a flood or regional power failure forces the abandonment of a primary facility, there’s no question that DR plans have to be activated immediately. But what if a less drastic problem occurs? What if a router starts to experience problems but doesn’t actually fail? At what point should the decision be made to cut over to an alternative network link? Should non-critical applications be turned off to preserve the performance of critical ones? And on what basis should such a decision be made?
These decisions can be difficult to make on the spur of the moment without adequate preparation. Many IT organizations don’t even know how these types of less-than-disastrous events will impact their critical IT services until they actually occur. The resulting confusion can wind up creating a business interruption that’s nearly as serious as an actual disaster.
Sometimes, such decisions have to be made on the basis of cost as well as application performance. The use of a secondary network provider’s infrastructure for failover, for example, may be very expensive. In these cases, a line-of-business manager or other executive may have to make the call – rather than someone from IT.
Simulation technology can be extremely useful for dealing with these kinds of "near-disaster" scenarios by showing exactly how infrastructure impairments will impact critical IT services – and how various alternative tactics can potentially remediate those impairments over the short term.
Simulation testing also helps IT organizations and the businesses they support prepare for such decisions by letting end-users and managers experience impaired application performance first-hand in the lab. It’s probably not very useful to ask a non-technical person, "Should we cut over to our alternative network service provider if latency on this circuit jumps from 5 milliseconds to 100 milliseconds?" A more reasonable approach would be to bring them to the lab and show them what the response of their primary applications will look like under the impaired condition – and to then ask, "Is it worth $20,000 to avoid this sluggishness on our end-users’ desktops for 48 hours?" Such a demonstration prepares managers for making appropriate decisions on short notice, so that the impact of such problems on the business’s bottom line can be minimized.
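The latency jump in that question translates into very tangible wait times. A rough, hypothetical calculation makes the point (the 40-round-trip figure is an assumption; real numbers vary widely per application):

```python
ROUND_TRIPS_PER_SCREEN = 40   # assumed client/server turns per screen

# Compare user-visible wait at the healthy vs. impaired latency.
for rtt_ms in (5, 100):
    wait_s = ROUND_TRIPS_PER_SCREEN * rtt_ms / 1000
    print(f"{rtt_ms:>3} ms RTT -> ~{wait_s:.1f} s per screen refresh")
```

Under these assumptions, every screen would take about four seconds instead of a fifth of one – the kind of concrete before-and-after that makes the $20,000 question answerable.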
Documenting DR Diligence
IT organizations may be quite diligent in their DR planning. But if they are not able to fully document that diligence to others, they may have difficulty responding to the inquiries of upper management and external auditors. In fact, if they can’t provide ready documentation validating their preparedness for multiple disaster scenarios, they may be asked to demonstrate the adequacy of their plans in ways that are both time-consuming and costly.
Documentation of DR diligence may become especially important in the event a disaster occurs and DR plans yield results that don’t fully meet the expectations of executives, board members, or stockholders. In both the pre- and post-disaster situations, DR planners must be able to show:
• Specific types of disaster scenarios have been considered
• DR plans for these various scenarios were appropriately tested
• To-the-desktop performance of all current critical applications was assessed
• Tests were performed at appropriate utilization levels with an appropriate number of end-users
• Outcomes were approved by authorized managers
This kind of documentation is much easier to generate and manage in the context of a controlled virtual testing environment than it is with conventional "live" production trials.
Simulation testing supports good DR planning in many ways. It helps ensure the accuracy and adequacy of DR implementations. It helps IT and business managers understand exactly when they may need to put DR plans into action. And it enables full documentation of DR planning efforts.
As IT environments become increasingly complex – and as businesses become increasingly dependent on the services delivered across those IT environments – DR planning will become an increasingly critical and challenging discipline. DR planners should therefore strongly consider taking advantage of today’s simulation testing solutions. By doing so, they will be able to better protect the business, while at the same time taking cost and hassle out of core DR processes.
Amichai Lesser is the director of product marketing at Shunra, the pioneer and market leader in predicting how business applications and network services will perform for remote end-users – before deployment. Amichai is responsible for product marketing, market analysis, and field marketing programs, and has extensive experience in real-time engineering, performance management, and security. Amichai can be contacted at amichai.lesser@shunra.com. For more on Shunra, see www.shunra.com.
"Appeared in DRJ's Fall 2007 Issue"




