When developing a disaster recovery strategy, customers frequently ask me to clarify the difference between disaster recovery and high availability. And now with the rise of cloud computing, they also ask, “How will cloud make a difference in what I should consider?”
The last question first. There are two sides to a cloud: the service side, characterized by a self-service portal and ubiquitous access, and the architecture side, built on geographically separated locations supporting redundant data and processing. When developing your recovery strategy, consider cloud computing environments from the application architecture point of view rather than the service point of view. A mature cloud architecture is one where the technology, data, and applications reside in multiple locations, operating simultaneously. If any one of the locations becomes unavailable, the other locations continue to operate without it. Further, when that ‘missing’ location comes back online, it is assimilated back into the data and processing collective, providing processing and storage resources for the business. I consider cloud solutions to be a supporting tool and architecture to meet one or more of your recovery or resiliency objectives.
The objective of this article, therefore, is to clearly differentiate high availability (or resiliency) from disaster recovery and lay out the considerations so you have a better understanding when developing a recovery strategy. I’ll also discuss themes that come into play – like disruption of service and business impact analysis – that will help you better determine the level of time, resources, and budget needed for a strong plan.
Let’s begin by categorizing recovery. Then, we’ll explore the interrelationship between the different recovery approaches and how they best meet the business requirements. The goal is to take a programmatic approach to recovery and identify a strategy that takes into account an application’s resiliency and recovery requirements.
In my experience, there are several categories of recovery. To help my customers, I like to provide a consistent definition so that we’re all speaking a common language.
Business Continuity: First and foremost is business continuity, which is the practice and discipline of returning a business process to full execution after a disruption of service has occurred (I’ll get to defining disruption of service later). This includes practices for returning people, technology, and facilities to the service levels that the business needs to run optimally.
Disaster Recovery: A sub-discipline of business continuity is the practice of disaster recovery. This comprises the information technology architecture and processes to recover the technology and/or applications to service in a location other than your primary processing location.
Operational Recovery: Like disaster recovery, this practice focuses on the recovery of applications and technology; however, in this case, the recovery is within the primary processing location. Operational recovery contemplates the need to recover from a service disruption that does not require a disaster declaration. Operational recoveries may be daily occurrences responding to something that is not operating to plan and needs to be fixed, for example, failure of a file server.
Resiliency: In contrast to planning for recovery is planning for resiliency or uptime. Let’s face it, our data centers are filled with electromechanical gizmos that will eventually fail regardless of the vendors’ MTBF (mean time between failures) claim. Application clustering, hardware clustering, virtualization, and local replication technologies are all used to keep an application resilient. The basic idea is to build an architecture that can continue to operate without disruption even when one of its components fails. Resiliency is focused on “uptime” rather than recovery time.
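As a rough illustration of why redundancy buys uptime, the sketch below estimates the availability of a single component from its MTBF and a mean time to repair (MTTR), then shows what happens when two such components are clustered. The MTTR term, the function names, and the sample figures are assumptions for illustration only, and the calculation assumes failures are independent, which real data centers rarely guarantee.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability when any one of `copies` identical, independent components is enough."""
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical figures: a server that fails every 10,000 hours and takes 10 hours to repair.
single = availability(mtbf_hours=10_000, mttr_hours=10)
paired = redundant_availability(single, copies=2)
print(f"single server: {single:.6f}, clustered pair: {paired:.8f}")
```

The pair is unavailable only when both servers are down at once, which is why the clustered figure is so much closer to 1 than the single-server figure.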
The critical driver of resiliency and recovery is, of course, risk. The key question to be answered is: “How likely is it that I may incur a disruption of service and how heavily should I invest to protect against that outage?” This will lead us to the three “Rs” of establishing a strategy: risk, recovery, and resiliency.
Defining Disruption of Service
Now that we’ve established the relationship between resiliency and recovery, let’s briefly discuss and define disruption of service and the role this plays in your plan. A disruption of service exists whenever there is a resource that is unavailable to support the business. Consider that most organizations have three major categories of resources: people, facilities, and technology.
In business continuity planning, disruption of service is measured by the extent to which the resource is unavailable. The severity of the disruption is affected by the length of time that the resource will be unavailable and the severity of the event that affected the resource. Therefore, the longer the resource is unavailable and/or the greater the damage to that resource, the more severe the disruption of service or outage.
The two dimensions of the table illustrate this idea:
Duration of Time: Short Term Loss of the Resource, Long Term Loss of the Resource
Severity of the Event: Damage to the Resource, Destruction of the Resource
Developing a Balanced Strategy with Business Impact Analysis
Three important criteria need to be considered in order to define a good strategy:
- An understanding of the resources that you have to support the business.
- An understanding of the impact of an outage or a disruption of service.
- An understanding of the risk tolerance of your company’s culture.
A business impact analysis, whether it’s a formal study of the business process or an informal estimation, should define the impact a disruption of service has on the business and then quantify that outage in business terms. As an example: If you can’t process sales orders for a day, this may have a $1 million negative impact on your business. However, if you can’t process orders for four days, the negative impact could be as much as $7 million. And if the outage continues for more than two weeks, it could bankrupt you altogether. In other words, the longer the outage, the more severe the impact on the business.
This information establishes the baseline for making a business case regarding the type and level of investment that the organization should make to prevent or plan for the recovery after an outage. The more likely the occurrence and the more costly the disruption of service, the more justifiable an investment in recovery or resiliency is.
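To make the business case concrete, here is a minimal sketch that turns the sales-order figures above into a rough comparison of expected loss against the cost of protection. The zero-loss starting point, the linear interpolation between points, the annual probability, and all function names are assumptions of this sketch, not part of any formal business impact analysis method.

```python
from bisect import bisect_right

# (outage days, loss in $ millions). The 1-day and 4-day points come from the
# sales-order example; the (0, 0.0) starting point is an assumption.
IMPACT_POINTS = [(0, 0.0), (1, 1.0), (4, 7.0)]


def estimated_impact(outage_days: float) -> float:
    """Interpolate loss linearly between known points (real impact curves vary)."""
    days = [d for d, _ in IMPACT_POINTS]
    if outage_days <= days[0]:
        return IMPACT_POINTS[0][1]
    if outage_days >= days[-1]:
        # No data past the last point: hold the last value, which understates long outages.
        return IMPACT_POINTS[-1][1]
    i = bisect_right(days, outage_days)
    (d0, l0), (d1, l1) = IMPACT_POINTS[i - 1], IMPACT_POINTS[i]
    return l0 + (l1 - l0) * (outage_days - d0) / (d1 - d0)


def investment_is_justified(annual_probability: float, outage_days: float,
                            annual_cost_millions: float) -> bool:
    """Compare expected annual loss (likelihood x impact) with the yearly cost of protection."""
    expected_loss = annual_probability * estimated_impact(outage_days)
    return expected_loss > annual_cost_millions


# Hypothetical inputs: a 10% yearly chance of a four-day outage, $0.5M per year to protect.
print(investment_is_justified(0.10, 4, 0.5))
```

A real analysis would also weigh the company's risk tolerance, which a simple expected-loss comparison does not capture.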
People, Facilities, and Technology—Your Focus for Recovery
The employees in your organization are among your key resources. A disruption of this resource will result from a significant number of employees not being able to perform their job in their normal work location—for example, due to severe weather, an outbreak of the flu, or a pandemic. Planning for this type of disruption has several components:
- Ensuring that important tasks within your organization are understood and can be performed by multiple resources.
- Enabling important tasks to be performed from more than one location.
- Cross training employees to reduce reliance on any individual.
- Utilizing telecommuting and mobile computing technologies as part of your plan.
The loss of the facility or workplace resource is generally a binary consideration: you either can get into the facility or you cannot. If you can’t, you need to know how long it will be unavailable. A power outage in an office complex is a relatively common occurrence. If you know the situation is temporary, you may resolve to send your employees home for the day or to a nearby secondary office where they can work from available common areas such as conference rooms. On the other hand, if the outage will be prolonged, a more detailed plan must be completed, stating which business processes will be moved to an alternate work location.
The technology resource has both resiliency and recovery characteristics that must be considered in parallel. An application that requires constant access will need to be architected for both resiliency and recovery. Resiliency is the ability of the application to continue to operate in the event that one or more of its individual components fail, whereas recovery is the process and time it takes to restore the application after a disaster declaration that requires it to be restored in an alternate processing location.
Planning From the Bottom Up
I typically look at the data center as a set of strata. There is the facility itself, consisting of the physical building, mechanical, electrical, HVAC, piping and so on, which makes up the “container” for all the technology. Then there is an infrastructure layer that is the foundation for all other technologies in the data center. Within the infrastructure layer are the network, authentication, security, access points, and back-up systems. On top of the infrastructure layer are storage arrays, servers, applications and/or databases.
For the purposes of this article, I am excluding the context of the physical data center and focusing on the technology layers within the data center.
The infrastructure layer needs to be architected in the most resilient fashion possible, since all recovery and resiliency capabilities will be limited by the weakest link. This layer represents the part of the data center that must be built for 24 x 7 x forever uptime. Redundant components, secondary routes, and redundant power are all characteristics of a resilient infrastructure. When considering a secondary site, best practice is to develop a site that is an extension of your primary site, not a mirror or isolated copy. This is generally one of the more difficult architectures to achieve.
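The weakest-link point can be shown with a short calculation. Because an application depends on every layer beneath it, the availability of the whole stack can be approximated as the product of the layer availabilities, so it can never exceed that of the weakest layer. The layer names follow this article; the availability figures are hypothetical, and the product rule assumes the layers fail independently.

```python
from math import prod
from typing import Mapping


def stack_availability(layers: Mapping[str, float]) -> float:
    """An application needs every layer at once, so the layer availabilities multiply."""
    return prod(layers.values())


# Hypothetical availability figures for each technology layer.
layers = {
    "infrastructure": 0.999,
    "storage": 0.9999,
    "server": 0.9999,
    "application": 0.9999,
}

overall = stack_availability(layers)
weakest = min(layers, key=layers.get)
print(f"overall: {overall:.5f}, limited by: {weakest} ({layers[weakest]})")
```

However much is spent on the upper layers, the overall figure stays below the infrastructure layer's figure, which is the argument for building that layer to the highest standard.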
The storage layer contains the vital data of your business. Let’s face it: if a major disaster were to strike, you would eventually be able to replace your servers, applications, and network, but if your data is lost, it’s lost forever. In a resilient environment, there are copies of the data for local processing, copies for local restoration and local outages, and copies for disaster recovery. When considering how to architect the data storage for your business, plan for at least four aspects of data management:
- Disaster recovery – an offsite copy of the data that can be restored.
- Operational recovery – replacement of the local copy of the data in the event it gets erased or corrupted.
- Archive – the long-term storage of data based upon your company’s retention and regulatory policies and production.
- Replication – the need to have multiple copies or secondary copies of the data where processes such as data mining, reporting, or data warehousing can be performed without affecting the transaction performance of the application.
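One way to use the four aspects above is as a checklist against a data inventory. The sketch below is a hypothetical example: the dataset names and the inventory structure are invented for illustration, and it checks only that a copy exists for each purpose, not that the copy is usable.

```python
from enum import Enum
from typing import Dict, Set


class CopyPurpose(Enum):
    DISASTER_RECOVERY = "offsite copy that can be restored"
    OPERATIONAL_RECOVERY = "local copy to replace erased or corrupted data"
    ARCHIVE = "long-term copy kept per retention and regulatory policy"
    REPLICATION = "secondary copy for reporting, data mining, or warehousing"


def missing_purposes(copies: Set[CopyPurpose]) -> Set[CopyPurpose]:
    """Return the data management aspects that have no copy planned."""
    return set(CopyPurpose) - copies


# Hypothetical inventory: which kinds of copies exist for each dataset.
inventory: Dict[str, Set[CopyPurpose]] = {
    "sales_orders": set(CopyPurpose),
    "hr_records": {CopyPurpose.OPERATIONAL_RECOVERY, CopyPurpose.ARCHIVE},
}

for dataset, copies in inventory.items():
    gaps = sorted(p.name for p in missing_purposes(copies))
    print(dataset, "is missing:", gaps or "nothing")
```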
The server layer includes the file servers and the strategy for keeping them active and running, regardless of platform type (Mainframe, UNIX, Linux, Windows, etc.). Three common architectural approaches to resiliency and recovery are:
- Virtualizing the servers so the application sits on a virtual container or virtual machine on the server. The virtual environment can be built so there is enough capacity for an N+x relationship between the virtual environment and the virtual machine’s capacity needs.
- Clustering the servers so that two or more servers are working in unison supporting an application or process. Unlike virtualization, clustering typically focuses on a specific application or set of servers and is not necessarily intended to establish a ubiquitous processing container for the applications.
- Load balancing the servers to create two separate processing domains that can work in conjunction with each other and yet are maintained separately. To accomplish this, an appliance typically sits in front of the servers to direct where the transaction request will be processed.
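The N+x relationship mentioned for virtualization can be checked with simple arithmetic: after x hosts fail, the remaining hosts must still have room for every virtual machine. This is a minimal sketch assuming identical hosts and a single capacity measure (for example, GB of memory); the function name and the figures are hypothetical.

```python
def survives_host_failures(host_capacity: float, host_count: int,
                           vm_demand: float, spare_hosts: int) -> bool:
    """True if the virtual machines still fit after `spare_hosts` hosts fail (the x in N+x)."""
    remaining_hosts = host_count - spare_hosts
    return remaining_hosts > 0 and remaining_hosts * host_capacity >= vm_demand


# Hypothetical cluster: five hosts with 64 GB each, running VMs that need 240 GB in total.
print(survives_host_failures(64, 5, 240, spare_hosts=1))  # one host may fail
print(survives_host_failures(64, 5, 240, spare_hosts=2))  # two hosts may fail
```

In practice CPU, storage, and network capacity would need the same check, and the tightest of them determines the real value of x.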
In any of these architectures, constant monitoring should be maintained between the devices to determine the state of any one device. If something were to fail in one device, another would pick up the load. Stand-by servers that are in some state of readiness can then be quickly put into service, replacing the defective component and returning the architecture to its full capacity.
These three approaches apply equally to a resilient environment and to a secondary site for recovery.
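The monitoring-and-standby behavior described above can be sketched in a few lines. This is an illustrative model only, not how any particular load balancer or cluster product works: the `Server` and `Pool` classes are hypothetical, health is a simple flag rather than a real probe, and requests are spread round-robin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool = True


class Pool:
    """Active servers share the load; standby servers wait to replace failures."""

    def __init__(self, active: List[Server], standby: List[Server]) -> None:
        self.active = list(active)
        self.standby = list(standby)
        self._next = 0

    def run_health_check(self) -> None:
        """Remove failed servers and promote healthy standbys to restore capacity."""
        for server in [s for s in self.active if not s.healthy]:
            self.active.remove(server)
            for i, spare in enumerate(self.standby):
                if spare.healthy:
                    self.active.append(self.standby.pop(i))
                    break

    def route(self) -> str:
        """Send the next request to an active server, round-robin."""
        if not self.active:
            raise RuntimeError("no servers available")
        server = self.active[self._next % len(self.active)]
        self._next += 1
        return server.name


pool = Pool(active=[Server("web1"), Server("web2")], standby=[Server("spare1")])
pool.active[0].healthy = False   # simulate a component failure
pool.run_health_check()          # monitoring detects it and promotes the standby
print([s.name for s in pool.active])
```

Between the failure and the next health check, the surviving server carries the load alone, so the check interval is part of the resiliency design.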
At the application layer, application clusters can be implemented in such a manner that they share information and are aware of the components of the cluster. If one application component fails, the other one continues to provide the service required by the application.
The challenge in developing a disaster recovery strategy is to determine the best fit of hardware and/or software resiliency/recovery for your environment, recognizing that although a particular solution may work well for a specific purpose, it may also complicate the overall recovery/resiliency architecture. The combination of both hardware and software architectures should be evaluated based on your overall resiliency and recovery strategy and not on how well it satisfies the requirements of a single application component.
When developing your recovery strategy, remember that the approach to recovery is intended to support the business and should include aspects of people, technology, and facilities. The three Rs of planning (risk, recovery, and resiliency) should be included throughout your strategy and the business case for your program.
John Linse is a practice manager, business continuity and data protection, for EMC Consulting, EMC Corporation.