For years, many companies have faced challenges in providing backup and recovery for infrastructures, technologies, and networks in order to protect their critical business processing capabilities. Much of this effort has been driven by the need to provide a recovery design that was cost-effective and leveraged as much of the information technology (IT) investment as possible, while still providing adequate coverage to help protect the company from an adverse event that could compromise business results.
Years ago, when mainframes ruled, many companies opted for secondary data centers where they could offload key production workloads, often running test and development work on this secondary capacity while retaining adequate capability to recover and fulfill their primary processing requirements.
Over time, this approach became more difficult to manage and coordinate as greater capacity was required to run expanded workloads. Larger and more complex environments were developed to run the work, while ever-increasing amounts of data were generated daily to meet the demands of growing business requirements.
Additionally, numerous distributed technology platforms evolved, along with a multitude of systems software installations at varying operating levels and networking technologies that enabled a “connect everything to everywhere, all of the time” posture.
Many companies began to realize that the technology was becoming increasingly difficult to sustain and manage. The result was not only greater complexity in maintaining a recovery capability, but also an inability to manage a fully redundant secondary site, given the combination of financial, operational, and technological concerns involved.
To help resolve these issues, commercial vendors introduced the concept of a shared recovery facility to be used by multiple businesses. A comprehensive infrastructure would be enabled with the necessary technology. It could be scaled to virtually any size and configuration required. This “hot-site” concept provided a pooled resource that was driven by discrete customer requirements, all managed by a third-party vendor in a remote location, separate from a company’s primary processing location. This was early evidence that a virtualization strategy for disaster recovery could work.
Understanding the virtualization approach
At a high level, the main focus of a virtualization approach is the benefit that can be realized by consolidation. The vast numbers of servers, the amounts of storage, and the numerous networks can be combined into a managed pool of resources that is configured based upon need. From a disaster recovery perspective, when a disastrous event occurs, resources from the larger pool can be reconfigured to provide capacity and access for assuming the primary production environment. While on the surface this appears to be a very attractive strategy, there are many underlying factors that should be considered.
Virtualization is an approach that has been used for disaster recovery for many years.
On the high end are mainframe platforms. Based upon individual business recovery needs, vendors provide clients with a pool of resources that are available via contract. To realize the optimum benefit of the hardware, virtual machine (VM) operating systems are utilized to enable multiple production partitions to be run on one physical machine for disaster recovery. This can provide the ability to leverage a single contracted machine for the recovery of multiple environments simultaneously. It also allows a company to contract only the technology that is required from the shared pool of resources.
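To make the consolidation math behind this model concrete, the following is a minimal sketch in Python, using entirely hypothetical partition names and capacity figures, that checks whether several production partitions could be recovered onto a single contracted machine from the shared pool:

```python
# Minimal sketch: can several production partitions be recovered
# onto one contracted physical machine? All figures are hypothetical.

# Capacity each production environment needs at recovery time,
# expressed in processing units (e.g., MIPS) and memory (GB).
partitions = {
    "core-banking":    {"mips": 1200, "memory_gb": 64},
    "claims":          {"mips":  800, "memory_gb": 32},
    "batch-reporting": {"mips":  400, "memory_gb": 16},
}

# The single machine contracted from the shared recovery pool.
contracted_machine = {"mips": 3000, "memory_gb": 128}

def fits(machine, workloads):
    """Return True if the combined workloads fit on the machine."""
    total_mips = sum(w["mips"] for w in workloads.values())
    total_mem = sum(w["memory_gb"] for w in workloads.values())
    return total_mips <= machine["mips"] and total_mem <= machine["memory_gb"]

if fits(contracted_machine, partitions):
    print("All partitions can be recovered on the contracted machine.")
else:
    print("Contract additional capacity from the pool.")
```

In this sketch the three partitions need 2,400 MIPS and 112 GB combined, so a single 3,000-MIPS machine suffices, which is precisely the “contract only the technology that is required” benefit described above.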
Over time, greater requirements for distributed processing recovery became evident and warranted action. Companies began to see techniques deployed that utilized software to encapsulate virtual machines, which could be restored on device-independent hardware at the recovery site. These new techniques made it easier to identify and script a recovery process that depended on how the backups were defined rather than adhering to rigid hardware-specific requirements. Assuming capacity, storage, and interfaces were adequate to provide equal or greater throughput for the individual production workloads, the result was that numerous virtual machines could be recovered onto a single physical footprint.
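As an illustration of such a scripted, hardware-independent recovery pass, here is a minimal sketch with hypothetical image names, priorities, and a stubbed restore step; a real process would invoke the virtualization platform’s own restore tooling:

```python
# Minimal sketch of a priority-ordered VM restore onto one physical
# footprint. Image names, priorities, and the restore step are all
# hypothetical stand-ins for platform-specific tooling.

vm_backups = [
    {"image": "erp-app.vmdk",  "priority": 1, "cpu": 8,  "ram_gb": 32},
    {"image": "erp-db.vmdk",   "priority": 1, "cpu": 16, "ram_gb": 64},
    {"image": "intranet.vmdk", "priority": 3, "cpu": 2,  "ram_gb": 8},
]

recovery_host = {"cpu_free": 32, "ram_free_gb": 128}

def restore(image):
    """Stub for the platform-specific restore call."""
    print(f"Restoring {image} onto the recovery host...")

# Restore the highest-priority machines first, as long as the
# physical footprint still has capacity for them.
for vm in sorted(vm_backups, key=lambda v: v["priority"]):
    if (vm["cpu"] <= recovery_host["cpu_free"]
            and vm["ram_gb"] <= recovery_host["ram_free_gb"]):
        restore(vm["image"])
        recovery_host["cpu_free"] -= vm["cpu"]
        recovery_host["ram_free_gb"] -= vm["ram_gb"]
    else:
        print(f"Deferring {vm['image']}: insufficient capacity on this footprint.")
```

The key design point is that the loop is driven by how the backups are defined (images plus declared resource needs), not by the make and model of the original hardware.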
A properly constructed virtualization strategy can yield results
Many benefits can be realized using a properly constructed virtualization strategy for disaster recovery. Potential benefits include:
- The ability to create a virtual resource pool that provides multiple reuse scenarios, effectively producing a “pay once, use many times” scenario for asset utilization
- A cohesive technology platform that utilizes similar technologies for production, test and development, and recovery processing
- Standardization of processes and procedures for the entire resource pool using a unified approach to monitoring and management across all operations
- Consistencies in the technology investment as refresh cycles are propagated across all environments at the same time to help maintain the integrity of the pooled resources
- Ease of maintenance as the processes, schedules, and level of effort can be coordinated in a more timely and efficient manner, given the affinity across the installed technology base.
Key considerations for your virtualization approach for disaster recovery
When utilizing a virtualization approach for disaster recovery, there are several key considerations that should be incorporated into the design:
1. Capacity for recovery – It is critical to allow for adequate capacity when designing for recovery. Frequently, it is assumed that less than 100-percent capacity will be tolerable during a recovery event. In actuality, during the initial phases of the recovery, utilization is greater than normal production capacity as workloads push the limits of the systems to fully recover. In addition, considerable catch-up work must be run to bring the systems back to their pre-event status, all while handling the new workload that is part of the business resumption process (a simple worked example follows this list).
2. Resources for integration – While processing capacity represents a large portion of recovery consideration, attention should be given to the various other components required to support the production environment. These components include processor resources (storage, device interfaces, etc.), disk resources (storage arrays, storage area networks [SANs], clusters, etc.), peripherals (control units, terminals, blades, etc.), infrastructure (external switches) and network connectivity (switches, bandwidth, etc.).
3. Isolation, network redundancy and scalability – Key to avoiding single points of failure is helping to ensure that the design of the virtualized resources is isolated from the primary production environment. Network redundancy is crucial to providing access for internal users as well as all external parties – customers, business partners, supply chains, etc. The ability to scale is a requirement to handle peak workloads for both recovery and production processing.
4. Recovery plan execution – A major consideration in the design of a virtualized recovery strategy is the ability to actively test the plan. This includes the capability to fully test at a system level, effectively repurposing all workloads residing on the virtual resources for an extended period of time, which allows for integrated business and infrastructure validation. While function and component testing can be easier to schedule, it may never reveal true results and could effectively compromise the recovery effort.
5. Repurposed workload plan – A detailed plan should be put in place to manage the workloads that will be moved at the time of a recovery event – be that an exercise or an actual disaster. These plans should include a formal schedule for testing with senior executive commitment, an alternative work plan for the resources displaced at the time of the event, a daily backup process for the displaced workload, and a tested recovery plan for reestablishing that workload at an alternate site at the time of the event.
6. Disaster recovery posture retention – Consideration should be given to the risk profile of the business when implementing a virtualized recovery design. Geographic diversity should not be sacrificed in light of any technological distance limitations that may be inherent in a virtual design. Examples include the ability to enable a processor failover scenario and the requirement for synchronous data transfer that can help minimize latency concerns. In the end, the site of the recovery should be in line with the business’s tolerance for risk in accordance with its formal mitigation strategy, and should not be a result of satisfying a technical requirement.
7. Clearly identified workloads – Before identifying the specific resources that will constitute the virtualization pool, it is important to understand the workloads that will be recovered at the time of an event. Business prioritization and criticality should be identified, with a detailed mapping in place relative to process flows, application integration and dependencies, and underlying information technology components, to help enable recoverability within the virtualized environment (a simple mapping sketch follows this list).
8. Disciplines for maintaining integrity – Strict systems management disciplines that include problem, change, incident, configuration, and asset management are a prerequisite that should be in place prior to engaging any new strategy for virtualized recovery. These are vital in preserving the integrity of the recovery environment and are critical for effectiveness in the ultimate operation, monitoring, and maintenance of the virtualized resource pool.
9. Business and IT reporting – The ability to track progress and delivery status, and to report results, is an important output of any recovery program. Such reporting can help justify the significant capital investment being made in transforming the information technology function into a virtualized core utility for the business.
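Returning to the first consideration above, the capacity point can be made with simple arithmetic. The following worked sketch uses purely hypothetical figures to show why recovery-time demand can exceed 100 percent of normal production capacity:

```python
# Worked example for consideration 1 (all figures hypothetical):
# recovery-time demand = normal production load
#                      + catch-up backlog replay
#                      + business-resumption overhead.

normal_production_load = 1.00  # baseline, as a fraction of production capacity
catchup_backlog        = 0.30  # replaying work queued up during the outage
resumption_overhead    = 0.15  # validation, reconciliation, extra reporting

recovery_demand = normal_production_load + catchup_backlog + resumption_overhead
print(f"Peak recovery demand: {recovery_demand:.0%} of production capacity")
# -> Peak recovery demand: 145% of production capacity,
#    so sizing the recovery pool at "less than 100 percent" would fall short.
```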
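Consideration 7 can likewise be sketched as a data structure. Here is a minimal, hypothetical mapping of business workloads to their applications, dependencies, and recovery tiers, which a planner could sort to derive a restore order:

```python
# Minimal sketch for consideration 7 (all names and tiers hypothetical):
# map each business workload to its applications, dependencies, and
# criticality tier, then derive the order in which to recover them.

workload_map = {
    "order-entry": {"tier": 1, "apps": ["web-front", "order-db"],
                    "depends_on": ["payments"]},
    "payments":    {"tier": 1, "apps": ["pay-gateway"], "depends_on": []},
    "reporting":   {"tier": 3, "apps": ["warehouse"],
                    "depends_on": ["order-entry"]},
}

# Recover lower tiers first; within a tier, recover workloads with
# fewer dependencies first (a rough proxy for dependency order).
order = sorted(workload_map,
               key=lambda w: (workload_map[w]["tier"],
                              len(workload_map[w]["depends_on"])))
print("Recovery order:", order)
# -> Recovery order: ['payments', 'order-entry', 'reporting']
```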
Virtualization, while popular in today’s information technology discussions, has been around for many years and has been used for both production operations and disaster recovery scenarios. Developing a disaster recovery strategy using a virtualized approach is as intriguing as it is challenging to design and implement. Keys to developing a strategy include:
- Leveraging the IT investment for multiple purposes (production, development, or disaster recovery)
- Understanding the true production requirements to help ensure that adequate processing capabilities are in place for business protection
- Defining a separate and isolated infrastructure and network that can scale to meet production specifications
- Developing adequate test plans and schedules for comprehensive exercising and validation of the recovery capability
- Incorporating strict disciplines that not only manage the integrity of the two environments but also report the status of the efforts from a business and IT standpoint.
Joseph E. Starzyk, PMP, is a senior business development executive with IBM’s Business Continuity and Resiliency Services and has more than 28 years of overall IT experience. Comments about this article may be sent to jestarz@us.ibm.com.
"Appeared in DRJ's Spring 2009 Issue"