In a commissioned February 2007 study of 504 technology decision makers and influencers of organizations’ business continuity and disaster recovery efforts, titled “The Impact of the WAN on Disaster Recovery Capabilities,” Forrester Consulting wrote:
“Due to heightened risk, fiduciary responsibility, increased competition, and regulation, upgrading disaster recovery capabilities is a top priority for enterprises.”
“The challenge for enterprises now is to determine how they can optimize their existing disaster recovery solutions to the point that their recovery time and recovery point capabilities are measured in minutes, not hours.”
A major component of DR plans is protecting business-critical data through backups and data replication. Such replication and backup processes may occur between data centers, branch and home offices, or primary and backup sites. Data replication will be covered in detail later in this article, but first let’s review overall “best practices” for data recovery and its related components.
Organizations need a DR solution that enables:
- Superior application availability and performance
- Reduced management overhead
- Improved operational efficiency
The ultimate solution gives organizations an intelligent way to manage their data centers and the applications they host. Moreover, an effective solution needs to provide visibility into applications and data centers, along with health checks for both, so that users have uninterrupted access to Web services. In the event of a problem, the solution needs to automatically and transparently make adjustments, including seamlessly rerouting users.
Business continuity (BC) is upheld in disaster scenarios by addressing the following considerations:
- Holistic Monitoring – It’s not enough to check if the application is up or down. The solution must take a holistic approach, checking the application and factoring in all dependencies. Automating the failover process eliminates management overhead, minimizes the cost of downtime, and removes the guesswork involved in tracking interdependencies.
- Client Continuity – The solution should be able to direct users to the appropriate data center based on the state of the data center, application, Web service dependencies, and user identity. Tracking the application state is essential to making sure that the users are delivered the right content without broken sessions or lost data. An intelligent solution should also be able to maintain the user’s session by resolving the user back to the same data center, tracking the user’s identity, transaction history, and the dependencies between services.
- Service Management and Maintenance – Following best practice management guidelines, the solution should be able to intelligently track and manage dependencies in a multi-site application infrastructure. The most helpful management tools facilitate the identification and monitoring of the application infrastructure dependencies from a single locale for at-a-glance operational efficiency.
- DNS Management – The best solution should make the job of managing DNS simple and error free, especially because one minor configuration error can bring down an entire application infrastructure. Fixes to this problematic scenario include an easy-to-use user interface, DNS error checking, and automatic reverse lookups.
- Security – Organizations need a holistic, integrated, and transparent (from an administrative perspective) approach to secure the network and applications against potential threats and attacks.
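The monitoring and client-continuity considerations above can be illustrated with a minimal Python sketch. It assumes a hypothetical model in which each service lists the services it depends on, and a site is usable only if its application and every dependency are up. The names `Service`, `is_healthy`, and `pick_site` are illustrative and do not come from any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Service:
    """Hypothetical model of an application or one of its dependencies."""
    name: str
    up: bool
    depends_on: List["Service"] = field(default_factory=list)


def is_healthy(service: Service, seen: Optional[Set[str]] = None) -> bool:
    """A service is healthy only if it and all of its dependencies are up."""
    if seen is None:
        seen = set()
    if service.name in seen:  # already visited; guards against dependency cycles
        return True
    seen.add(service.name)
    return service.up and all(is_healthy(dep, seen) for dep in service.depends_on)


def pick_site(sites: Dict[str, Service], last_site: Optional[str] = None) -> Optional[str]:
    """Keep a user on the previous site if it is healthy, else fail over.

    `sites` maps a site name to its application, listed in order of preference.
    """
    if last_site in sites and is_healthy(sites[last_site]):
        return last_site
    for name, app in sites.items():
        if is_healthy(app):
            return name
    return None  # no site can currently serve the user


# The primary application answers, but its database is down.
primary = Service("app@primary", up=True, depends_on=[Service("db@primary", up=False)])
backup = Service("app@backup", up=True, depends_on=[Service("db@backup", up=True)])
sites = {"primary": primary, "backup": backup}

print(pick_site(sites))                      # new user
print(pick_site(sites, last_site="backup"))  # returning user
```

A simple up/down probe of the primary application would report it as available here; checking the dependency as well is what sends users to the backup site.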
The Core of a Successful BC/DR Plan
A successful BC/DR plan has two key components at its core: a solid replication product to manage replication processes, and an effective and efficient WAN that enables those processes to be accomplished successfully.
Two of the critical metrics used in measuring the success of a disaster recovery plan are recovery point objectives (RPO) and recovery time objectives (RTO). RPO measures the amount of data lost during a disaster, and RTO measures the time required to restore normal operations. IT managers must balance the lowest possible RTO and RPO against factors such as:
- Increasing data storage requirements from increased usage and regulatory archival requirements
- Limited bandwidth between primary and backup locations
- The expense of adding additional bandwidth between the DR locations
- Variable factors that can affect the performance of the DR solution over the WAN (e.g., WAN latency and packet loss)
One of the most common barriers to the effective deployment of any high-performance data replication solution is the performance of the solution over the WAN between DR sites. Storage teams, when sizing the bandwidth requirements, often find that their initial sizing estimates are insufficient to meet the performance requirements of a DR solution. In practice, true WAN performance is rarely given much thought until the organization ramps up its production replication system and realizes that the WAN bandwidth it has does not provide the expected throughput. Suddenly, the RPOs and RTOs it expected to meet are no longer realistic.
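A back-of-the-envelope sizing check makes the problem concrete. The Python sketch below assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a hypothetical `efficiency` factor standing in for protocol overhead, latency, and loss. The figures are illustrative, not measurements.

```python
def transfer_minutes(data_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Minutes needed to move data_gb over a link that delivers
    link_mbps * efficiency of usable throughput."""
    bits = data_gb * 1e9 * 8
    usable_bps = link_mbps * 1e6 * efficiency
    return bits / usable_bps / 60


def required_mbps(data_gb: float, window_minutes: float, efficiency: float = 1.0) -> float:
    """Nominal link speed needed to move data_gb within window_minutes."""
    bits = data_gb * 1e9 * 8
    return bits / (window_minutes * 60) / efficiency / 1e6


# Illustrative: 6 GB of changed data, a 16 Mbps link, a 30-minute target.
print(round(transfer_minutes(6, 16), 1))       # at the nominal link rate
print(round(transfer_minutes(6, 16, 0.5), 1))  # if only half the rate is achieved
print(round(required_mbps(6, 30, 0.5), 1))     # link needed to hit the target
```

When the effective throughput is half the nominal rate, the same job takes twice as long, which is how an RPO that looked achievable on paper slips out of reach.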
In addition, WANs have several inherent characteristics that are the source of missed expectations or unrealized objectives within replication scenarios. One of the most obvious pitfalls of a DR plan is latency: The limits of the speed of light and the number of network hops between the DR sites can slow recovery efforts simply due to the physical distance information must travel. Existing network conditions can also hurt recovery objectives, with packet loss stemming from signal degradation, oversaturated network links, corrupted packets rejected in transit, or just faulty networking hardware. And depending on the reach of the disaster, network congestion can have a tremendous effect if an unexpectedly large number of users attempt to access data on the network.
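The effect of latency and loss on a single TCP connection can be estimated with two well-known rules of thumb. Throughput cannot exceed the window size divided by the round-trip time, and under random loss it is roughly bounded by the Mathis et al. approximation, (MSS / RTT) * (1.22 / sqrt(p)). The Python sketch below applies both. The window, RTT, and loss values are assumptions chosen for illustration, and replication products that use multiple streams or tuned network stacks may behave differently.

```python
import math


def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound from the window size: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


def loss_limited_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Mathis et al. approximation for steady-state TCP throughput
    under random packet loss (a rough estimate, not a guarantee)."""
    c = math.sqrt(3 / 2)  # about 1.22
    return (mss_bytes * 8 / (rtt_ms / 1000)) * c / math.sqrt(loss_rate) / 1e6


# Assumed conditions: 64 KB window, 1460-byte segments, 80 ms RTT, 0.1% loss.
print(round(window_limited_mbps(65535, 80), 2))
print(round(loss_limited_mbps(1460, 80, 0.001), 2))
```

Under these assumed conditions both bounds come out to only a few megabits per second, and neither depends on the capacity of the link, which is why leasing more bandwidth alone may not help.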
It is also important to realize that – regardless of how intuitive it sounds – a simulated disaster is not the same as an actual one. Testing in a closed environment does not always yield the same results as any real-life disaster, and actual bandwidth (and performance) levels may not measure up to original estimates or projections.
Unfortunately, the factors mentioned above can often cripple a good DR plan. When the DR application shares the WAN links with other application traffic, file transfers, and even other migration or recovery activities, the RPOs and RTOs that were met previously can be completely unobtainable. This could be due to a variety of factors including congestion caused by the added throughput from the other applications. In addition, latency constraints due to extended distance between the DR sites can prevent storage teams from achieving their RPOs and RTOs regardless of how much bandwidth is used.
The most common “solutions” storage teams use to address these issues are to (1) replicate only the most critical data, thereby reducing the amount of data replicated, and (2) increase the amount of bandwidth leased. Neither option is attractive, since neither solves the core issue, which is the performance of the application over the WAN.
Organizations need a DR plan based on real-time replication and automatic failover to provide cost-effective business continuity. The solution should provide continuous data protection by sending an up-to-the-minute copy of the data as it is being changed on the origin server to the target replication server. Adding an appliance that uses compression and acceleration technologies to dramatically improve the speed of application traffic over WANs is highly recommended to enhance disaster recovery efforts. The result is a solution that accelerates a wide variety of application traffic types including data replication, file transfer, e-mail, client-server applications, and many others. To maximize efforts, organizations should also evaluate how efficiently bandwidth is allocated across different applications to ensure that the most critical traffic receives priority access to valuable bandwidth.
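The idea of giving the most critical traffic first claim on bandwidth can be sketched as a strict-priority allocation. This Python example is a simplified, hypothetical model (the `allocate` function and the traffic classes are illustrative); real appliances typically use more sophisticated queuing and guarantee schemes.

```python
from typing import Dict, List, Tuple


def allocate(link_mbps: float, demands: List[Tuple[str, int, float]]) -> Dict[str, float]:
    """Grant bandwidth in priority order (lower number = more important).

    Each demand is (name, priority, demand_mbps). A class receives its full
    demand if capacity remains, otherwise whatever is left.
    """
    remaining = link_mbps
    grants: Dict[str, float] = {}
    for name, _priority, demand in sorted(demands, key=lambda d: d[1]):
        grant = min(demand, remaining)
        grants[name] = grant
        remaining -= grant
    return grants


# Illustrative traffic mix on a 10 Mbps link.
demands = [
    ("file transfer", 2, 5.0),
    ("replication", 0, 6.0),
    ("e-mail", 1, 3.0),
]
print(allocate(10.0, demands))
```

In this example replication and e-mail are served in full, and the file transfer receives only the 1 Mbps that is left over.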
Another key consideration is how well a disaster recovery solution fits in with an organization’s overall business objectives and business continuity plans. Acceleration and replication technologies can be used to streamline business processes in addition to safeguarding against disasters.
As the commissioned Forrester Consulting study mentioned above states:
“Often, the cost of deploying a WAN acceleration appliance at each end of the link is less expensive than the cost of increasing bandwidth.
“For enterprises that want to use replication or remote backup between remote sites with limited bandwidth to the corporate data center, WAN acceleration software is an appealing approach because the software can be installed on existing servers; the enterprise does not have to invest in a standalone appliance at each site.”
There are many advantages to using a WAN compression and acceleration appliance to augment the disaster recovery application:
- The combination helps meet RPOs and RTOs without upgrading bandwidth or replication infrastructure by:
― Accelerating the replication processes irrespective of the WAN conditions
― Enabling the network to adapt dynamically to network congestion levels
― Guaranteeing bandwidth for important and critical replication traffic
― Providing more control of WAN resources allocated to storage or DR needs
- It reduces the cost of meeting the RPOs and RTOs by:
― Using less bandwidth to replicate the same amount of data or more
― Reducing the tangible and intangible costs associated with troubleshooting
- And it secures the replication traffic by:
― Encrypting the replication traffic using SSL encryption
Factors that can affect an acceleration appliance’s performance over the WAN include:
- The amount of redundant data traversing the WAN
- “Compress-ability” of the data (e.g., text is easily compressible, images are typically not)
- Traffic mix over the WAN links, though it is worth noting that an acceleration appliance can typically enforce bandwidth guarantees that improve the performance of important traffic at the expense of less important traffic
- Traffic volume and link utilization. The level of congestion over WAN links is affected by the change in traffic volume over the course of a day. Bandwidth allocation can prevent a replication process from being brought to a halt during peak load times.
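The point about compressibility is easy to demonstrate. The Python sketch below uses the standard `zlib` module as a stand-in for an appliance's compression (an assumption; commercial appliances use their own techniques) and compares repetitive, text-like records with seeded pseudo-random bytes, which behave much like already-compressed images.

```python
import random
import zlib


def compression_ratio(payload: bytes) -> float:
    """Original size divided by zlib-compressed size (higher is better)."""
    return len(payload) / len(zlib.compress(payload))


# Repetitive, text-like records (hypothetical log lines).
text_like = b"2008-03-01 account=12345 status=OK balance=1000.00\n" * 2000

# Seeded pseudo-random bytes as a stand-in for already-compressed data.
rng = random.Random(0)
random_like = bytes(rng.getrandbits(8) for _ in range(len(text_like)))

print(f"text-like:   {compression_ratio(text_like):.1f}x")
print(f"random-like: {compression_ratio(random_like):.2f}x")
```

The text-like payload shrinks dramatically, while the random payload stays essentially the same size, so the gain an organization sees depends heavily on what it replicates.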
Putting it into Play: CitiStreet’s Application Acceleration Success Story
CitiStreet, one of the largest global benefits delivery firms, needed to maximize the performance of its network to back up 12 million participants’ data over the WAN on a daily basis. As its existing data lines began to reach capacity, CitiStreet knew it had only two options: retool its infrastructure, including adding expensive additional bandwidth, or deploy WAN optimization and application acceleration technology to make the best use of what it already had.
WAN optimization plays a key role in the technology backbone of CitiStreet’s business continuity preparedness plan, which guarantees CitiStreet customers the highest level of uninterrupted service and support. In addition to saving CitiStreet a substantial dollar amount per year in WAN service costs, WAN acceleration technology enables the company to reap full value from its critical applications delivered over the WAN by providing predictable, reliable response times.
After installing the acceleration technology at each data center to accelerate applications and maximize throughput over the WAN, CitiStreet reduced the time it took to replicate 6 GB of data from 55 minutes to nine minutes and cut bandwidth consumption from 16 Mbps to 4 Mbps. As a result of the unique compression technology, 6 GB of data now appears as less than 300 MB, reducing the amount of data sent by 20x while simultaneously improving application performance.
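The reported rates and durations can be cross-checked with simple arithmetic. The Python sketch below assumes decimal units and that each link ran at the stated rate for the whole replication run (an assumption; the source does not say so).

```python
def megabytes_sent(rate_mbps: float, minutes: float) -> float:
    """Data volume in MB carried at rate_mbps for the given duration."""
    return rate_mbps * 1e6 * minutes * 60 / 8 / 1e6


before_mb = megabytes_sent(16, 55)  # volume moved before acceleration
after_mb = megabytes_sent(4, 9)     # volume actually sent with acceleration

print(before_mb, after_mb)
print(round(55 / 9, 1))             # reduction in replication time
print(round(6000 / after_mb, 1))    # 6 GB relative to the data actually sent
```

Under these assumptions, 16 Mbps for 55 minutes carries about 6.6 GB, in line with the 6 GB data set, while 4 Mbps for nine minutes carries about 270 MB, a reduction on the order of 20x.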
The combination of acceleration appliances with data replication and best-practice recovery guidelines offers significant performance gains for organizations, including the ability to achieve recovery objectives.
As the aforementioned commissioned study from Forrester Consulting states:
“WAN acceleration not only helps to improve the RTO and RPO of individual applications or sites, but when taken together, it improves the overall disaster recovery preparedness of the entire enterprise.”
The end result is reduced risk and lower costs – two essentials for any BC/DR plan.
About The Author: Charlie Cano is a Solutions Architect at F5 Networks, the global leader in application delivery networking. F5 provides solutions that make applications secure, fast and available for everyone, helping organizations get the most out of their investment. More information can be found at www.f5.com.
"Appeared in DRJ's Spring 2008 Issue"