Disaster recovery is both a time-consuming and expensive initiative, but for companies that have experienced a real-world catastrophe, the time and money invested in designing, implementing, and maintaining a DR strategy proves invaluable. It is for this very reason that disaster recovery continues to rank among the top enterprise storage priorities today.
The process of rolling out a recovery plan from start to finish involves a significant amount of forethought, time, cooperation, and logistics. Support for DR and adherence to it across the enterprise is critical to its success, and ultimately the success and longevity of the enterprise itself.
There are many options to consider with regard to technologies and techniques, and a number of common denominators exist across the board for companies ranked best-in-class for disaster recovery, including testing frequency and procedure, situational performance optimization, regulatory compliance, and media reliability.
Test Often and Test Non-Disruptively
Only 5 percent of best-in-class enterprises perform disaster recovery testing on a monthly basis, according to a 2008 research study by Aberdeen Group. For most companies, the frequency of testing often depends on two factors: resources and technology.
Testing traditional tape-based systems involves shipping tapes, sometimes by the thousands, to the remote DR site, and sending teams of people representing different departments to ensure the needs of each are met. Assuming everything goes according to plan and a full recovery test is performed, the process can remove these employees from their day-to-day responsibilities for up to four days.
In a pure tape environment, the time it takes to complete the tests is unlikely to affect the organization's back-up procedure if data is being backed up in 72-hour intervals. Likewise, in a virtual tape scenario there is no impact from the testing: companies continue copying and simply test with the data presented.
Where companies run into problems with frequent recovery testing is in virtualized storage environments where replication must be broken during testing. Breaking the replication link between two sites effectively takes the remote replication process completely offline for the duration of the DR test, which, as discussed earlier, can take several days.
Furthermore, once the link is reestablished, it takes time for data to synchronize while the back-up site catches up with the primary site. During this time, businesses are highly vulnerable to either a disaster or a compliance event taking place that requires immediate access to company data.
Disaster recovery can sometimes become a tradeoff. On one hand, companies can optimize their system for fast backup at the cost of recovery time; on the other, they can optimize recovery by sacrificing back-up time. These are typically the options reviewed when building out a DR process.
It becomes a matter of answering, "What's important to me?" Is it the back-up window that must be met so other jobs can complete on time? The question becomes a significant topic of discussion when the difference between fast and slow backups is six hours or more.
In a slow backup-fast recovery scenario, the tapes are separated and categorized in a forward-thinking manner; tapes that will be needed initially are separated from restore tapes, which are separated from database building tapes, which are separated from application tapes, etc. The entire back-up operation is spread out and organized in a way that minimizes the amount of time it will take to recover.
Conversely, when fast backup is the goal, data is lumped en masse onto tape and then organized and dealt with only when it becomes necessary to do so. The data is all there, but there is no rhyme or reason to it and the recovery process becomes incredibly tedious and time consuming. Obviously this isn’t an ideal scenario during a crisis and presents significant risks to an organization, but for the immediate savings in man-hours and computer-hours, many companies have chosen this option.
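The difference between the two layouts can be sketched in a few lines of Python. The category names and sample catalog below are purely hypothetical, not drawn from any particular backup product; the point is simply that in the slow-backup/fast-recovery approach, items are grouped up front by the order in which they will be needed during a restore:

```python
# Illustrative sketch of "slow backup, fast recovery": tapes are grouped
# ahead of time by recovery priority. Categories and catalog entries are
# hypothetical examples, not real product terminology.

RECOVERY_ORDER = ["restore", "database", "application", "user_data"]

def plan_tape_sets(catalog):
    """Group backed-up items into tape sets ordered by recovery priority."""
    sets = {category: [] for category in RECOVERY_ORDER}
    for item, category in catalog:
        sets[category].append(item)
    return sets

catalog = [
    ("os_image.tar", "restore"),
    ("orders.db", "database"),
    ("erp_binaries.tar", "application"),
    ("home_dirs.tar", "user_data"),
    ("customers.db", "database"),
]

for category, items in plan_tape_sets(catalog).items():
    print(category, items)
```

The fast-backup approach is the degenerate case of this sketch: everything lands in one undifferentiated set, and the sorting work is deferred to recovery time, when it is most expensive.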
For enterprises employing traditional storage technologies, these options are generally the only ones available to managers. Additional technologies such as tape-to-disk VTLs can be implemented in mainframe environments as a compromise that delivers both quick backup and quick recovery. Because of the way these technologies structure data on disk, it is instantly available, so isolating and organizing the file system is no longer necessary.
More and more today, organizations are taking this approach to their data backup and recovery procedures, easily covering the cost of the hardware investment through savings in labor and processing time.
Data deduplication is a technological advancement that dramatically reduces the amount of disk storage needed to retain and protect enterprise data. It also greatly reduces the bandwidth requirements to electronically vault data from your primary site to a secondary or disaster recovery site. The question is, which data is suited for deduplication, and which is better suited for simple compression?
One characteristic of back-up data is that it typically has a small rate of change, making it a good candidate for deduplication and the advantages this technology offers. Other data types, such as database log files or fixed content data (typically consisting of unique data streams), would not deduplicate well and are probably better served by data compression.
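A rough illustration of why the two data types behave differently, assuming a simple fixed-block deduplication scheme (real products typically use more sophisticated variable-length chunking, so the numbers here are only directional):

```python
import hashlib
import zlib

def dedup_ratio(data, block=4096):
    """Fixed-block deduplication: how many blocks collapse to one stored copy."""
    blocks = [data[i:i + block] for i in range(0, len(data), block)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(unique)

# Back-up data with a small rate of change: the same blocks recur
# across runs, so deduplication pays off.
backup_runs = (b"A" * 4096) * 100            # 100 identical blocks
print(dedup_ratio(backup_runs))              # 100.0: store 1 block, not 100

# Database-log-style data: every block is unique, so deduplication
# gains almost nothing, yet the repetitive text still compresses well.
log = b"".join(b"2008-01-01 txn=%08d commit\n" % i for i in range(20000))
print(dedup_ratio(log))                      # 1.0: nothing to deduplicate
print(len(zlib.compress(log)) < len(log) // 4)  # True: compression still wins
```

The takeaway matches the guidance above: measure how your own data types behave before deciding which of them to route to a deduplicating store and which to simply compress.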
Ideally, the platform performing backups will be capable of supporting multiple storage types on the backend and the various needs of the organization. However, it should always be a best practice to examine one’s environment first and determine how the various data types are best handled.
Media choices also play an important role in optimizing disaster recovery processes. As previously discussed, backup and recovery speeds vary greatly between tape and disk; however, media errors are also a big concern. While tapes can be affected by transportation, humidity, handling, loss, and theft, disks are more fragile and rely on environmental factors such as power and cooling.
The tradeoff between these two technologies has been, and will continue to be, debated for many years. It's important to remember that each argument is circumstantially based on specific pain points in the enterprise. For instance, it isn't unheard of for a user to experience a 5 percent read error rate on tape. With percentages that high, an organization would likely create second and third copies of its data tapes, so that if copies one and two were no good, copy three remained. It may sound excessive, but for these companies it was a matter of operational survivability.
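A back-of-the-envelope calculation shows why those extra copies were worth making. This assumes copy failures are independent, which real-world tape failures (shared handling, shared environment) often are not, so treat it as optimistic:

```python
# Probability that every copy of a tape is unreadable, assuming a 5%
# per-copy read error rate and independent failures (an optimistic
# simplification; correlated failures would worsen these numbers).
error_rate = 0.05

for copies in (1, 2, 3):
    p_all_bad = error_rate ** copies
    print(f"{copies} copies: P(all unreadable) = {p_all_bad:.6f}")

# With three copies: 0.05 ** 3 = 0.000125, roughly 1 in 8,000.
```

Even under this simple model, going from one copy to three turns a 1-in-20 loss risk into roughly 1 in 8,000.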
No matter which side of the fence you or your company fall on, media error testing is a necessity in the overall disaster recovery testing process.
One of the most important things to remember when evaluating a current or future disaster recovery plan is that there is no magic formula that works for every company and in every situation. While it is prudent to evaluate and understand the methods employed by best-in-class DR leaders, the model should always be adapted to address the challenges and nuances of each different company. For some, the constraints may be budget or human resources; for others, the existing infrastructure may largely dictate the approach. Ultimately, the goal should be reliable recovery and effective backup in a sustainable disaster recovery plan.
Jim O’Connor has been working in the storage industry for 40 years, 19 of those years with Bus-Tech. As director of product marketing, he helps set the strategic direction for the company and the development of its leading product lines, as well as driving Bus-Tech’s overall visibility. O’Connor can be reached at jim.o’firstname.lastname@example.org.