There is no shortage of information regarding various disaster recovery strategies. The sheer volume can overwhelm contingency planners as they attempt to sift through this information and develop a disaster recovery strategy to fit the needs of their business.
Companies must decide on the amount of resources they are willing to commit to disaster recovery preparedness. The level of disaster recovery a company requires is based on a number of factors.
These include the following:
- Perceived exposure to disaster situations that could render the data processing environment inoperable.
- How long an outage can the business tolerate before it is in jeopardy of going out of business?
- How much data can the business afford to lose?
- What type of government regulations affect how quickly critical business data must be restored?
- Exactly which applications are critical to the successful continuance of operations and which are not critical.
- How much is the company willing to spend on a contingency plan? This cost includes creating and maintaining the plan, testing the plan and negotiating an alternative data processing center.
- What are the corporate audit requirements to certify recovery of applications?
In addition to these tradeoffs, two layers of complexity also have been added:
- One is the recent twist in the growing number of mission- and business-critical systems that have been "rightsized," i.e., spread from a single point to many. Now, companies must factor into their plans a greater number of boxes, as well as an increased mix of many different operating systems and application support environments.
- The second new layer of complexity is the use of the internet. As organizations experiment with, adopt and become dependent on new applications with high visibility, the estimated cost of an outage goes beyond the immediate loss of revenue from system downtime. Organizations suffering the loss of their internet presence also suffer a loss of credibility, similar to the effect of a disconnected phone line on voice-dependent organizations.
Finally, how much will it cost NOT to implement a disaster recovery solution?
Companies must weigh the above factors when deciding on a suitable level of disaster recovery. The business must determine the amount of risk it is willing to accept or, in other words, how much data it can afford to lose in the event of a disaster. The level of risk a company is willing to assume will determine the amount of resources needed for contingency planning. Generally, the lower the risk, the higher the cost.
One strategy for reducing the cost of any level of disaster recovery is to consider a company's existing assets as well as its exposures. In disaster recovery planning, it is easy to focus too heavily on mitigating risk, since that criterion MUST be met. Integrating existing tools, such as those used for local incident recovery, into the disaster toolbox can dramatically lower acquisition, training, process development and maintenance costs for day-to-day operations as well as for contingency planning.
Companies need to choose a disaster recovery strategy that is appropriate to the required level of data availability after occurrence of disastrous events. There are essentially three levels of data availability:
Standard Availability: recovery of operations in more than 24 hours but less than three to five days.
High Availability: recovery of operations within 24 hours.
Continuous Availability: recovery of operations within seconds or minutes of the outage.
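As a rough illustration, the three levels above can be expressed as a simple classification of recovery time. This is only a sketch: the article says "seconds or minutes" and "three to five days" without exact cutoffs, so the 10-minute limit for Continuous Availability and the 5-day limit for Standard Availability are assumptions chosen for the example, and the function name is hypothetical.

```python
from datetime import timedelta

# Assumed cutoffs: the article gives only approximate boundaries.
CONTINUOUS_LIMIT = timedelta(minutes=10)  # "within seconds or minutes" (assumption)
HIGH_LIMIT = timedelta(hours=24)          # "within 24 hours" (from the article)
STANDARD_LIMIT = timedelta(days=5)        # upper end of "three to five days" (assumption)


def availability_level(recovery_time: timedelta) -> str:
    """Map a recovery time onto the three availability levels."""
    if recovery_time < timedelta(0):
        raise ValueError("recovery time cannot be negative")
    if recovery_time <= CONTINUOUS_LIMIT:
        return "Continuous Availability"
    if recovery_time <= HIGH_LIMIT:
        return "High Availability"
    if recovery_time <= STANDARD_LIMIT:
        return "Standard Availability"
    # Slower than any level the article defines.
    return "Below Standard Availability"


if __name__ == "__main__":
    for t in (timedelta(seconds=30), timedelta(hours=12), timedelta(days=2)):
        print(t, "->", availability_level(t))
```

A planner would substitute the cutoffs the business has actually agreed on.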
Disaster recovery solutions can be divided into a hierarchy that represents the following: backup methodology, data transport method, recovery process and level of data availability. This discussion applies to traditional mainframe environments as well as distributed computing environments. For our purposes, we will segregate the hierarchy of disaster recovery solutions into various "tiers." Each tier represents a different level of sophistication and performance of the disaster recovery solution. Companies may then use this hierarchical structure to decide which tier best suits their disaster recovery needs. In many cases, a company will have a mixture of levels, since some application data will be more critical than other data.
This article defines seven tiers of a disaster recovery hierarchy. In reality, each tier could be further refined into more discrete levels, or even combined into wider bands. The goal is to demonstrate that disaster recovery solutions can be divided in such a manner that a company can choose the level that best fits its cost structure, risk factor and data availability needs.
The tiers range from Tier 0 at the bottom of the pyramid to Tier 6 at the top. As one moves up the pyramid, the level of sophistication and the cost increase, while the elapsed time between the start of a disaster and the restoration of operations decreases. We call this time the "Recovery Time Objective," or RTO.
Another benefit of the pyramid of tiers is the ability to measure the currency of data at the time of recovery relative to the time of the event. We call this the "Recovery Point Objective," or RPO. In essence, the RPO defines how much recent data an organization can afford to lose in the event of a disaster.
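To make the two measures concrete, the sketch below computes the recovery time and recovery point actually achieved in a single incident, which can then be compared against the objectives. The function name, field names and timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_offsite_copy: datetime,
                     disaster_start: datetime,
                     operations_restored: datetime) -> dict:
    """Return the achieved recovery time and recovery point for one incident.

    recovery_time  = how long operations were down
    recovery_point = how old the recovered data was when the disaster began
    """
    if not (last_offsite_copy <= disaster_start <= operations_restored):
        raise ValueError("timestamps must be in chronological order")
    return {
        "recovery_time": operations_restored - disaster_start,
        "recovery_point": disaster_start - last_offsite_copy,
    }


if __name__ == "__main__":
    # Hypothetical incident: nightly backup at 02:00, disaster at 14:30,
    # operations restored at 08:30 the next morning.
    m = recovery_metrics(
        last_offsite_copy=datetime(2024, 1, 1, 2, 0),
        disaster_start=datetime(2024, 1, 1, 14, 30),
        operations_restored=datetime(2024, 1, 2, 8, 30),
    )
    print("Recovery time: ", m["recovery_time"])   # 18:00:00
    print("Recovery point:", m["recovery_point"])  # 12:30:00
```

In this hypothetical case the business was down for 18 hours and lost the 12.5 hours of updates made since the last offsite copy.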
Tier 0 - No Offsite Data
Tier 0 is defined by the lack of a disaster backup and recovery strategy. Data is not sent offsite, and no alternate site is identified. There is no disaster recovery capability, and any type of recovery will be accomplished by using local backups. Businesses at this level usually use some type of incremental backup or physical volume backup method. However, backup copies are stored locally rather than offsite, so they are not protected from destruction in the event of a disaster.
Tier 1 - Pickup Truck Access Method (PTAM)
Tier 1 represents those businesses with a contingency plan where data is backed up and sent to an offsite location for recovery. Tier 1 is characterized by the manual transporting of backup data offsite (informally known as the Pickup Truck Access Method or PTAM) and the use of a "cold site" as an alternate site. A cold site usually is a structure that may contain raised floor space as well as adequate power, cooling, heating and water facilities. If a disaster occurs, the business must then obtain the necessary data processing equipment and have it installed at the cold site.
The Tier 1 solution has a relatively low cost, but it is cumbersome to manage and the recovery time (RTO) is quite long. "Standard Availability," or more likely something slower, would be achieved at this level.
The data backup technique would possibly involve a physical volume dump process, a logical data set or application backup process. The backup medium would probably be tape, which can be removed from the backup site and manually transported to an alternate location for storage.
Tier 2 - Pickup Truck Access Method + "Hot Site"
Tier 2 is essentially Tier 1 with hot site capability. A hot site is an alternate site with data processing equipment that can adequately accommodate the installation's critical workloads. PTAM is used to transport the backup data to an intermediate location for storage until a disaster is declared. At that time, the backup data is transported to the hot site and restored.
Tier 2 level usually involves a point-in-time backup of critical data, again using a physical volume process, logical data set or application-level backup. Companies should expect to achieve recovery of operations within the 24-hour window of High Availability in most environments with a well-planned, continually maintained and thoroughly tested Tier 2 implementation. Standard Availability should be expected otherwise, including the few environments where the physical transport of data requires most of the 24-hour window for High Availability.
Tier 3 - Electronic Vaulting
Tier 3 replaces the Pickup Truck Access Method with Electronic Vaulting. The data is backed up, and the output is electronically transmitted to an intermediate location or to a hot site for storage. A number of vendors supply electronic vaulting. This may consist of stand-alone tape drives that receive and write data to removable tapes, which may be stored in racks or bins. Alternatively, the electronic vault may be an automated tape library, virtual tape library or direct access storage device.
Data backup usually is performed with a logical backup process and staged to direct access storage or to tape prior to transmission. Off-site data may be tracked manually or by software designed to manage off-site data.
The Tier 3 solution usually results in Standard or High Availability, depending on data transmission speeds and the amount of critical data that must be restored. Companies are more likely to achieve High Availability if the electronic vault is located at the alternate site or connected to the alternate site through channels capable of long-distance connectivity and high bandwidth.
The amount of data loss, the Recovery Point Objective (RPO) for Tier 3, can range from one day's worth to just a few minutes' worth of data. One day's worth of data is the maximum exposure usually associated with electronic journaling. The reason is that if more than one day's data is at risk, the most cost-effective method of backup is to physically ship the data on tape to the vault. With a more aggressive approach to electronic vaulting, the RPO can be reduced to a few minutes. This would require that data be sent offsite several times a day, immediately applied to disk and made available to the application.
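The relationship between transmission frequency and data exposure can be sketched with simple arithmetic. The model below is an assumption made for illustration, not a formula from this article: transmissions are evenly spaced through the day, and the worst-case exposure is one full interval plus the time needed to transmit and apply the data. The function names are hypothetical.

```python
import math
from datetime import timedelta

DAY = timedelta(days=1)


def worst_case_rpo(transmissions_per_day: int,
                   transfer_time: timedelta = timedelta(0)) -> timedelta:
    """Worst-case data exposure for evenly spaced transmissions (assumed model)."""
    if transmissions_per_day < 1:
        raise ValueError("at least one transmission per day is required")
    return DAY / transmissions_per_day + transfer_time


def transmissions_needed(target_rpo: timedelta,
                         transfer_time: timedelta = timedelta(0)) -> int:
    """Fewest evenly spaced daily transmissions that keep exposure within target_rpo."""
    usable = target_rpo - transfer_time
    if usable <= timedelta(0):
        raise ValueError("target RPO must exceed the transfer time")
    return math.ceil(DAY / usable)


if __name__ == "__main__":
    print(worst_case_rpo(1))                             # one transmission a day
    print(worst_case_rpo(4, timedelta(minutes=30)))      # four a day, 30-minute transfer
    print(transmissions_needed(timedelta(minutes=15)))   # daily count for a 15-minute target
```

Under this model a single daily transmission leaves up to a full day of data exposed, which matches the one-day maximum described above, while a target of a few minutes implies a near-continuous transmission schedule.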
Tier 4 - Active Recovery Site
A Tier 4 solution involves two active sites, each capable of taking over the other's workload in the event of a disaster. Both sites should have enough idle processing power to restore data from the other site and to accommodate the excess workload in the event of a disaster. The two sites should be physically removed from each other, and separated by more than campus-wide distances if they are to handle regional disasters such as floods or hurricanes.
If "High Availability" is the goal, data should be backed up on a regular basis, transmitted to the sister site and stored electronically or manually. If the goal is "Continuous Availability," there needs to be either continuous transmission of data between sites or some type of dual online storage. With a network switching capability, recovery time can be reduced to hours or even minutes.
Tier 5 - Two-Site, Two-Phase Commit
A Tier 5 implementation encompasses all aspects of a Tier 4 solution and, in addition, maintains selected data such that both copies (local and remote) are in sync. To achieve this, updates to data must be received at both the primary and secondary locations before the owning application is notified that the update is complete. Tier 5 also requires dedicated hardware at both Site A and Site B, with the capability to automatically transfer the workload between sites.
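The rule that an update must reach both sites before the application is told it is complete can be sketched as a toy two-phase commit. This is a minimal in-memory illustration, not a description of any vendor's product: the `Site` class, its methods and the `replicated_update` function are hypothetical names, and a real implementation would also need durable logging, timeouts and recovery of in-doubt updates.

```python
class Site:
    """A toy data site that can stage (prepare) and then commit an update."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.data = {}      # committed data
        self.pending = {}   # updates staged in phase one

    def prepare(self, key, value) -> bool:
        """Phase one: stage the update; refuse if the site is unreachable."""
        if not self.available:
            return False
        self.pending[key] = value
        return True

    def commit(self, key) -> None:
        """Phase two: make the staged update permanent."""
        self.data[key] = self.pending.pop(key)

    def rollback(self, key) -> None:
        """Discard a staged update."""
        self.pending.pop(key, None)


def replicated_update(primary: Site, secondary: Site, key, value) -> bool:
    """Report success to the application only if BOTH sites accept the update."""
    sites = (primary, secondary)
    prepared = []
    for site in sites:
        if not site.prepare(key, value):
            break
        prepared.append(site)
    if len(prepared) < len(sites):
        # One site refused: undo the staged copies and report failure.
        for site in prepared:
            site.rollback(key)
        return False
    for site in sites:
        site.commit(key)
    return True


if __name__ == "__main__":
    site_a, site_b = Site("Site A"), Site("Site B")
    print(replicated_update(site_a, site_b, "account", 100))  # True
    site_b.available = False
    print(replicated_update(site_a, site_b, "account", 200))  # False
    print(site_a.data["account"])                             # 100
```

Because the second update is rejected when Site B is unreachable, the two copies never diverge; only updates that were in flight at the moment of failure are lost.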
At the Tier 5 level, only in-flight data should be lost in the event of a disaster, with a minimal amount of data required to make the application current, thus providing for continuous availability. Achieving Tier 5 in the mainframe environment requires control units capable of creating shadow copies of data and synchronizing data, channel extenders, and channels with extended-distance connections and high bandwidth. Tier 5 with distributed systems is not usually accomplished with channel extension technologies. Instead, software-based solutions that send data over a shared network using a communication protocol such as TCP/IP are available for many distributed platforms.
Tier 6 - Zero Data Loss (often referred to as Utopia)
Tier 6 is the ultimate in providing Continuous Availability of data. This involves the immediate and automatic transfer of processing capability to the secondary location. The Tier 6 approach requires local and remote copies of all data, dual on-line storage and network switching capability, along with a level of hardware and software synergy that allows for near-instantaneous creation of remote copies of critical data and the capability to automatically switch access to data from the primary to the secondary platform. In the Tier 6 environment, if every application is treated as requiring zero data loss, the need for separate contingency planning is virtually eliminated because it becomes part of everyday data processing. More realistically, not all applications can justify Tier 6 resources, and having this advanced recovery option increases the complexity of the planning process.
Without a doubt, Tier 6 provides the highest form of availability, with near-elimination of potential disaster risks. Tier 6 obviously is the most costly solution, but it may be well worth the investment when measured against the risk of a multi-million dollar business going out of business as a result of a disaster.
One of the few risks not addressed by a Tier 6 solution is logical errors introduced by the application. The ability to recover to a point in time prior to the introduction of a faulty application is an important facility to include in every tier.
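The availability levels quoted for each tier above can be pulled together into a small lookup that suggests the lowest tier able to meet a required level. This is only a sketch: the table records the best case the article describes for each tier, so a real implementation may fall short (a poorly tested Tier 2, for example), and it ignores RPO and cost entirely. The names `Availability`, `BEST_CASE` and `lowest_tier_for` are hypothetical.

```python
from enum import IntEnum


class Availability(IntEnum):
    NONE = 0
    STANDARD = 1
    HIGH = 2
    CONTINUOUS = 3


# Best-case availability per tier, as characterized in this article.
BEST_CASE = {
    0: Availability.NONE,        # no offsite data
    1: Availability.STANDARD,    # PTAM + cold site
    2: Availability.HIGH,        # PTAM + hot site, well planned and tested
    3: Availability.HIGH,        # electronic vaulting
    4: Availability.CONTINUOUS,  # active recovery site with continuous transmission
    5: Availability.CONTINUOUS,  # two-site, two-phase commit
    6: Availability.CONTINUOUS,  # zero data loss
}


def lowest_tier_for(required: Availability) -> int:
    """Return the lowest tier whose best case meets the required availability."""
    return min(tier for tier, level in BEST_CASE.items() if level >= required)


if __name__ == "__main__":
    for level in (Availability.STANDARD, Availability.HIGH, Availability.CONTINUOUS):
        print(level.name, "-> Tier", lowest_tier_for(level))
```

Since most companies run a mixture of levels, the lookup would typically be applied per application rather than once for the whole business.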
After a company chooses a disaster recovery strategy to meet its current needs, it should, on a periodic basis, review its program. Companies need to review and stay abreast of new business requirements, as well as government regulations and new backup hardware and software solutions.
Companies should remember that this ongoing review is the only way to keep a disaster recovery strategy successful.
For new IT projects, integrating disaster solutions as part of the acquisition process can yield significant improvements where "re-use" of components and processes is possible.
In addition to re-evaluating the selected disaster protection tools, companies should re-examine the original "make or buy" decision for potential cost savings from outsourcing, or bringing back in-house, any or all elements of their disaster recovery capabilities. These elements include understanding the potential risks to the organization, creating strategies and plans to address those risks, implementing, testing and repeating the cycle by evaluating current capabilities versus current business requirements.
With an unprecedented and unrelenting wave of technological change sweeping through all types of organizations, an ever-widening array of choices is available to contingency planners.
For IT issues, we hope that this Tiered delineation of the performance hierarchy for disaster recovery solutions will provide companies with a useful approach in the complex and vital task of disaster recovery planning.
Ed Baker - IBM Storage Systems Division, Disaster Recovery Software. Greg Van Hise - IBM Storage Systems Division, ADSM Development. Steve Luko - IBM Business Recovery Services, High Availability Solutions