Shortly thereafter, ABC’s primary location was rendered inoperable by a fire and all systems became inaccessible. ABC’s secondary site was a small branch office a few miles away that contained enough reserve servers to restore the most critical applications and data. Unfortunately, the only backups available were on tape. It took ABC’s IT team more than two days to retrieve the tapes and restore the systems to an operational level. By this time, ABC had failed to meet its internal and external recovery time and recovery point objectives.
While this is only a hypothetical scenario, many business continuity professionals might consider ABC’s DR strategy quite solid in concept. Unfortunately, by forgetting some of the cornerstones of a sound and dependable DR strategy, the team failed to meet the requirements of both its employees and customers, exposing the organization to significant risk. It’s important that organizations not discard DR best practices with the introduction of new technology like virtualization. Many of the “old” or “traditional” rules still apply.
Identifying Virtualization’s Key Business Upsides and Risks
Migrating to a virtualized environment provides numerous advantages to an organization, particularly in the areas of facilities and management. Server consolidation results in a smaller footprint for an internal IT department: fewer physical servers, lower power consumption, quicker and more flexible server provisioning and deployment, and fewer physical assets to manage. What is often overlooked in a comprehensive virtualization strategy, however, is the set of risks associated with deploying such a solution and how it affects the organization’s overall information availability strategy.
There are risks inherent in the adoption of any new technology, and virtualization – often described as a panacea for data center sprawl – is no exception. Common risks of virtualization include the following:
- It Is New Technology – Virtualization is still somewhat new and immature at the x86 level. While there have been great advancements in the past several months, many companies are still “dabbling” in non-production environments. New technologies often cause growing pains for organizations that take on too much too quickly, so it’s important to take time to learn from others’ mistakes.
- Security – With multiple applications running inside one physical machine, security, availability and recoverability become even more critical than before – essentially companies may be placing all of their eggs in one basket and should approach their information availability planning with care.
- Management – Many of the management tools embedded in virtualization technologies are immature. Often they need to be supplemented with third-party management tools to provide a comprehensive solution for monitoring, management and change control.
- Consolidation – Although trumpeted as the key benefit, consolidation can create single points of failure. It is critical for companies not to over-consolidate, as this may adversely impact application performance and stability. In addition, for companies that are not only consolidating servers locally but also collapsing multiple data centers into one, it is equally important to ensure an appropriate level of supporting infrastructure such as network, power and cooling.
- Specialized Skill Set Required – To implement a virtualization solution, an organization must first have the necessary internal skills and expertise. Because these skill sets are in high demand, they are often expensive and difficult to retain.
Although the risks of implementing a virtualized environment do not outweigh the rewards, it is critical for companies to consider them and plan accordingly, with a special emphasis on how business continuity fits into the picture. A company’s existing information availability plan will require modification.
Companies considering migrating to virtualization should pay particular attention to the risks associated with deployment, specifically the challenges around management, required skill sets and business continuity.
Getting Back to the Basics – Focus on DR Best Practices
Some of the technology vendors selling virtualization solutions have created a perception that virtualization has built-in disaster recovery. The core functionality of virtualization solutions allows virtual machines to be started, stopped, copied and backed up quickly and with great flexibility. However, this functionality – essentially fault tolerance – works primarily within the local area network (LAN), not over a wide area network (WAN). While there are benefits to local availability, organizations must take a holistic approach to ensuring their virtual environments are recoverable in the shortest possible time. Local fault tolerance provides little value should the site go down completely.
This is when the basics and best practices of disaster recovery are critical and should be integrated within an overall virtualization solution. DR specialists recommend companies consider the following key elements of disaster recovery in any virtualization deployment to help ensure high levels of availability of virtualized systems:
- RTO/RPO – Clearly understand and identify what your RTO and RPO are for your virtualization systems. This will dictate the design of your overall disaster recovery solution.
- Process and Planning – Assess, plan and document your recovery strategy as it relates to your virtualized systems. Chances are your existing plan will not support virtualization 100 percent and will need to be modified. Don’t forget to think about how business processes may be changing in a virtual environment, too – it’s more than just about updates to the IT infrastructure.
- Geographic Diversity and Hardened Facility – IT managers should remember fail-over does not equal disaster recovery. It’s important that plans be made for a second data center or recovery site in the case the primary data center becomes completely unavailable. In this scenario, a second site is vital to recovering your business critical virtualized systems, which include all needed infrastructure, systems, redundant power and network, security and storage.
- Local redundancy will provide failover capability should a local system go down, with the workload redirected to another local system via capabilities native to most virtualization technologies, but this is not a comprehensive and true disaster recovery plan. If the entire site goes down, the local availability features embedded in virtual technologies provide no protection. True disaster recovery must include a geographically diverse and hardened facility as the ante to get into the game.
- A Place for Your People – In a total loss scenario – where a company’s offices are rendered unusable because of fire or flood, for example – the continuity of the business rests in the people. Employees need a place to field inbound customer care calls or to keep the sales pipeline moving. The payroll department needs to issue paychecks and accounts receivable will need a desk, computer and phone to stay on top of accounting needs. Companies too often forget the importance of making workgroup arrangements in advance of a disruption and should consider contracting with a third party to help ensure availability of office space and standard equipment like desktop computers and phones to keep the business up and running.
- Management Capability & Expertise – Supplement your existing IT staff with a trusted third party that has experience in disaster recovery and system recovery to provide enhanced management capability to ensure you reach your RTO/RPO.
- Test, Test & Test – Whether virtualized or not, an organization’s disaster recovery solution must be successful in meeting the established RTO and RPO objectives. Be sure to test your plan multiple times a year to ensure it meets your objectives.
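The RTO/RPO objectives in the list above can be sanity-checked during the recommended tests. The sketch below is purely illustrative – the function name, timestamps and objective values are hypothetical, and real plans track many systems and dependencies – but it shows the two comparisons that matter: how long recovery took versus the RTO, and how old the last recoverable copy was versus the RPO.

```python
from datetime import datetime, timedelta

def meets_objectives(outage_start, last_backup, service_restored,
                     rto: timedelta, rpo: timedelta):
    """Check one recovery test against RTO/RPO objectives.

    RTO: maximum tolerable time from outage to restored service.
    RPO: maximum tolerable age of the last recoverable data copy.
    """
    downtime = service_restored - outage_start   # actual recovery time
    data_loss = outage_start - last_backup       # data-loss window
    return downtime <= rto, data_loss <= rpo

# Illustrative numbers only, echoing the ABC scenario: a tape restore
# taking two-plus days against a nightly backup.
outage = datetime(2008, 6, 1, 9, 0)
restored = outage + timedelta(days=2, hours=4)
last_tape = outage - timedelta(hours=20)

rto_ok, rpo_ok = meets_objectives(outage, last_tape, restored,
                                  rto=timedelta(hours=8),
                                  rpo=timedelta(hours=4))
print(rto_ok, rpo_ok)  # both False: tape alone misses these objectives
```

Running the same check after each scheduled test turns “test, test and test” into a pass/fail record rather than an impression.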
Increasing Information Availability While Shortening Your RPOs/RTOs
Information availability has quickly become more critical for organizations across multiple industries and sectors, with the level of acceptable downtime decreasing substantially. Whether it’s industry regulations, supplier agreements or internal service level agreements (SLAs) driving the reduction, sub-eight-hour recovery windows have become a requirement, and the luxury of 24-48 hours to recover is quickly becoming a thing of the past.
To align virtualization with more aggressive availability levels and shorten RTOs and RPOs, organizations should consider integrating a replication-based solution into the recovery plan for their virtualized systems. By integrating advanced recovery solutions like server- and storage-based replication, organizations can recover their virtual infrastructure in a fraction of the time it would take to restore from tape – typically hours or even minutes versus days.
Putting virtualization and replication into practice can help companies reduce RTOs and RPOs today. For example, virtual server files can be pre-built and pre-tested prior to a disaster. There are two primary benefits:
- Taking this step will help eliminate costly server rebuild mistakes that happen during the rush to get the servers back on-line in the aftermath of an incident;
- Larger enterprises running many servers face a daunting task if the primary data center becomes unavailable or is a total loss – most do not have the personnel to build and install servers in time to meet their RTOs. Using virtualization, the IT team can pre-build and test servers in advance so that they are ready when a disruption occurs.
Organizations should also take advantage of hardware abstraction. With it, IT departments no longer must abide by the standard “like hardware” rules of traditional server technology – the brand and type of servers in the primary data center and the recovery site do not have to match. During disasters, recovery professionals often waste valuable time searching for the same or similar hardware to replace failed servers; hardware abstraction eliminates that search, helping to meet or beat RTOs.
Finally, virtualization provides the ability to run multiple virtual servers on one physical host. When disaster strikes – or even during an everyday disruption – instead of installing multiple servers and connecting them to the network, the IT team can greatly reduce the number of physical servers that must be brought back online. This many-applications-on-one-server ability not only reduces RTOs, but also reduces the costs and requirements for Fibre Channel and network ports, rack space and power.
Diving Deeper – Comparing Server- and Array-based Replication
There are many options available today to meet replication needs. The most commonly discussed technologies are server-based and array-based replication. The primary differences between them are the following:
- Server-based replication is more of a small-to-medium business solution focused on the x86 environment. Replication occurs at the server level and is executed from server to server. The primary production server acts as the source and replicates data to a secondary server, known as the target, which is typically located in a geographically diverse second data center. If the primary data center fails, the secondary can be promoted to an “active” state in a very short amount of time. Also worth noting: server-based replication is typically not distance sensitive and can be used over great distances, a valuable attribute for companies with advanced recovery strategies.
- Array-based replication is typically deployed by the largest organizations. The replication occurs at the storage-array level – specifically, the primary storage array replicates data to the secondary array. Much like server-based replication, if the primary fails, the secondary can be promoted to an active state in a very short amount of time. One drawback to note: some array-based replication solutions are distance sensitive, meaning a company may not be able to ensure geographic diversity using this solution.
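The distance sensitivity noted above typically comes from synchronous replication, where every write must be acknowledged by the remote array before the application proceeds. A rough back-of-envelope sketch (the function and the ~5 microseconds-per-km fiber figure are illustrative assumptions, not a vendor specification) shows why latency grows with distance:

```python
def sync_write_latency_ms(distance_km: float, overhead_ms: float = 0.5) -> float:
    """Estimated added latency per acknowledged write over a given distance.

    Assumes signals travel roughly 200,000 km/s in fiber (~5 microseconds
    per km, one way) plus a fixed per-write processing overhead.
    Illustrative only; real arrays, links and protocols vary.
    """
    round_trip_ms = 2 * distance_km * 0.005  # ~5 us/km, each way
    return round_trip_ms + overhead_ms

for km in (10, 100, 1000):
    print(f"{km:>5} km: +{sync_write_latency_ms(km):.1f} ms per write")
```

At campus distances the penalty is negligible, but at 1,000 km every synchronous write waits more than 10 ms, which is why geographically diverse deployments often rely on asynchronous approaches – such as typical server-based replication – that do not block writes on the remote acknowledgment.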
Incorporating either server- or storage-array-based replication into an advanced recovery solution within a virtualized environment will allow companies to meet the most aggressive RTOs and RPOs. Both methods have their advantages, and preference is more a function of the organization’s IT scale, budget and maturity. However, it is important to note that both solutions often require identical sets of hardware and software to be purchased at both the source and target locations, which is often an expensive proposition.
In addition, an organization will need in-house expertise, a secondary site and key management tools to effectively manage this solution. If an organization is averse to one or more of these requirements, it should consider a service provider that uses shared back-end infrastructure to contain costs, has experience and expertise with replication and virtualization technologies, and has a deep track record in disaster recovery.
Finding Virtualization and DR Nirvana
Virtualization holds great promise for helping companies achieve sound disaster preparedness, especially when combined with a replication strategy to ensure access to fresh and updated enterprise information in the time of a crisis.
DR specialists should be closely involved in the development of a company’s recovery strategy when virtualization comes into play – to be sure DR best practices aren’t abandoned in favor of a less sound and less effective fault-tolerance strategy.
As product development manager for SunGard Availability Services, Matt Carey is a key contributor in shaping the company’s strategic product roadmap and is responsible for the management and execution of new product development initiatives. As part of the virtualization task force, Carey is on the front lines evaluating, testing and implementing new virtualization technologies, helping tie together fundamental technology, people and process elements to help ensure SunGard is offering the high levels of availability customers are demanding.
Appeared in DRJ's Summer 2008 Issue