For many years, disaster recovery (DR), and specifically data recovery, meant one thing and one thing only to the IT department: recovering data and applications from backup tapes stored offsite. For this purpose, organizations backed up all of their data to magnetic tapes nightly and then shipped the tapes offsite, where they would be safe from any disaster that struck the primary data center. Even today, many companies still rely exclusively on this process to protect their data. While tape has been the backup medium of choice primarily because it is affordable, it is far from ideal. In fact, tape suffers from a number of serious drawbacks.
The Trouble with Tape
Recovery time is the most obvious tape issue. Although tape-drive speeds have increased significantly over the years, tape is still a considerably slower medium than disk. At the same time, the volume of data that enterprises store has mushroomed, often at a rate that has far outpaced the improvement in tape-drive speeds. As a result, it can still take hours or even days to recover a complete data center from backup tapes, particularly if those tapes have to be retrieved from a remote site. Not being able to serve customers and transact business for that long may threaten the solvency of many companies.
The trouble with tape, however, does not start with a disaster, but rather with the creation of the backup tapes, which typically occurs nightly. The recurring problem is that, with traditional backup technologies, applications accessing the data being backed up usually have to be shut down while the backup jobs are running. Even with “save-while-active” technologies, backup jobs still place such a heavy load on the processors and disk drives that application response often slows to an unacceptable level. These challenges persist even in organizations that back up data to an intermediate disk pool before copying it to tape.
In the days when organizations conducted business for only eight hours per day, five days per week, the disadvantages of tape were not critical issues. Backups could be created at night and on weekends without disrupting business operations. Admittedly, for some organizations, such as multi-shift manufacturers, that backup window was always a myth. Thanks to Internet-based commerce, globalization, and competitive pressures to more fully leverage capital assets, the backup window is now a thing of the past for a large and growing number of enterprises across all industries.
The shortcomings of tape don’t end there. Backup tapes are usually created once every 24 hours, typically sometime in the middle of the night. Any data updates applied during the day are not represented on the tapes. Thus, if a disaster destroys the data center, those updates may not be recoverable. Although the lost data may be available in online journals, those journals are usually located in the same facility as the production databases, where they could be destroyed by the same disaster.
Companies try to manage these risks by shipping backup tapes offsite. However, shipment isn’t immediate, because a courier may not pick up the tapes until the morning. Furthermore, organizations often keep the most recent generation of tapes onsite for a day to deal with issues that are much more common than disasters, such as the accidental deletion of a file by an operator or user. While tapes remain onsite, they are vulnerable to any disaster that threatens the production databases. As a result, organizations may lose considerably more than 24 hours’ worth of data when a disaster strikes.
Tape Fails the Test
Another data recovery issue is rarely talked about because, until recently, there were no products on the market that could resolve the problem. The fact is that most data recovery requirements do not result from disasters. They arise due to human error, malicious actions, computer viruses, and other isolated events that corrupt or delete individual data items without necessarily bringing down the associated applications.
These events can happen at any time of the day. Consequently, the ideal solution would allow for the quick recovery of the data item to its state immediately before the corruption or deletion occurred. Tape-based backups fail this test.
In addition, backup tapes allow data to be recovered only to its state at a discrete point, typically sometime during the previous night. Thus, recovering a data item from tape may result in the loss of several updates that were applied to that item after the backup tape was created but before the corruption or deletion occurred.
The New Data Recovery Approach
An emerging approach to recovering data, which in some cases also addresses broader disaster recovery requirements, overcomes all of the drawbacks of tape. It employs several related technologies that can be installed as independent software products or through an integrated solution that includes them all.
The first technology is high availability (HA). One HA architecture amounts to a switched-disk configuration that makes it easy to switch storage between servers when necessary. A more robust HA solution, however, maintains a real-time or near real-time replica of the production server and its data on a backup server that stands ready to assume the production role whenever necessary. The HA software typically includes features that either assist with or fully automate the switchover to the backup server.
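The replicate-and-switch-over pattern described above can be sketched in a few lines of Python. This is a toy illustration only; all names are hypothetical, and real HA products replicate at the storage, journal, or transaction level rather than key by key:

```python
class Server:
    """Toy stand-in for a production or standby server's data store."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True

def replicate(primary, standby, update):
    """Apply an update on the primary and ship it to the standby
    (near real-time replication, greatly simplified)."""
    key, value = update
    primary.data[key] = value
    standby.data[key] = value  # forward the same update to the replica

def switchover(primary, standby):
    """Return the server users should be directed to."""
    return primary if primary.healthy else standby

primary = Server("prod")
standby = Server("dr")
replicate(primary, standby, ("order-1001", "shipped"))

primary.healthy = False             # simulate a disaster at the primary site
active = switchover(primary, standby)
# `active` is now the replica, which already holds current data,
# so no "recovery" in the traditional sense is needed
```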
Because HA software can replicate data over long distances, the backup server can be located far enough from the primary server that a single disaster is unlikely to affect both. Then, rather than “recovering” from a disaster in the traditional sense, users are simply switched to the remote replica server.
Some organizations locate the primary and secondary servers in the same building to eliminate the cost of maintaining a second facility. In this case, the HA solution provides a means for operations to continue while the primary server is undergoing maintenance. However, this is not a DR solution, because a disaster will likely destroy both co-located servers. Nonetheless, this HA configuration still has a role to play in eliminating one of the drawbacks of tape-based data recovery. Because the secondary server contains an up-to-date copy of all production data, backup tapes can be created on that server, thereby eliminating the need for any downtime while the backup jobs are running.
Regardless of whether the secondary server is co-located with the production server, an HA solution does more than just protect data and applications during a disaster. The same technology allows organizations to continue to function while the primary server or its operating system, applications, or databases are being upgraded or maintained.
Disk-based snapshots are also a part of the new data recovery approach. Snapshots allow an organization to create disk-based backups that, after a disaster, can be used to recover a data center much faster than is possible with tapes.
The snapshot function is generally implemented within an HA architecture, enabling snapshots to be created on the recovery server, so that snapshot processes will not impact production operations. Snapshots can then be generated at will to create clean recovery points as appropriate. The snapshot facility can do double duty by allowing IT departments to easily create “sandboxes” for developing and testing software using real-world data.
Continuous Data Protection
The most recent data recovery innovation goes by the name of Continuous Data Protection (CDP). This is an important advance. It addresses the data recovery requirement that organizations face most often but which has remained unfulfilled or underserved until recently.
CDP captures each production data update and stores it in what are essentially redo and undo logs. These logs can later be used to reconstruct an earlier state of the data. Contrast this with tape-based backups, which allow recovery only to a single point during the previous night. Although snapshots used in conjunction with tape-based backups can provide a few additional recovery points, the frequency of snapshots has to be predetermined. In addition, the database often has to be quiesced before a snapshot is taken, which is one reason snapshots cannot be taken at a fine granularity. Whereas the frequency of snapshots must be set beforehand, CDP allows the business to determine the desired recovery point after the fact. Further, CDP permits organizations to recover individual files or data items to any point in time, whereas HA allows recovery only to the point of failure.
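The redo-log idea behind CDP can be sketched as a timestamped journal that is replayed up to any chosen moment. This is a minimal illustration under assumed names; real CDP products log at the block or journal-entry level and keep undo records as well:

```python
class ContinuousLog:
    """Toy CDP journal: every update is captured with a timestamp,
    so the data set can be reconstructed as of any point in time."""
    def __init__(self):
        self.entries = []          # (timestamp, key, new_value), in time order

    def capture(self, ts, key, value):
        self.entries.append((ts, key, value))

    def state_at(self, ts):
        """Redo-style replay of all updates up to and including `ts`."""
        state = {}
        for t, key, value in self.entries:
            if t > ts:             # entries are appended in time order
                break
            state[key] = value
        return state

log = ContinuousLog()
log.capture(100, "invoice-7", "draft")
log.capture(200, "invoice-7", "posted")
log.capture(300, "invoice-7", None)      # accidental deletion at t=300

# Recover the item to its state immediately before the deletion --
# the recovery point is chosen after the fact, not in advance:
before = log.state_at(299)               # {'invoice-7': 'posted'}
```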
In addition to offering more frequent, granular, and flexible recovery point options than tape-based alternatives or even snapshot-based solutions, CDP offers another advantage. Because it is disk-based, it can fulfill much more stringent recovery time objectives than recovering from tape allows. Further, recovering individual data items from tape typically requires considerable work on the part of an operator; the task is therefore a drain on human resources and prone to human error. With CDP, on the other hand, the recovery of a data item is initiated through a graphical user interface.
There are two types of CDP: true CDP and near CDP. As the name implies, true CDP offers truly continuous data protection: data updates are captured and sent to the backup data store as they happen. Consequently, data can be restored to any point in time within the retention span of the CDP implementation. Retention span is an important element to consider. In theory, a CDP tool could store historical data forever, but because most data corruptions or deletions are spotted within hours or days, retaining the backup data for years would waste considerable space. As a result, most organizations purge CDP data after it reaches a certain age.
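The age-based purge described above amounts to dropping journal entries older than the retention span. A minimal sketch, with hypothetical names and a simple tuple-based log:

```python
def purge(entries, now, retention):
    """Keep only CDP log entries younger than the retention span.
    `entries` are (timestamp, key, value) tuples; names are illustrative."""
    cutoff = now - retention
    return [(t, k, v) for (t, k, v) in entries if t >= cutoff]

entries = [(100, "a", "v1"), (500, "a", "v2"), (900, "a", "v3")]
kept = purge(entries, now=1000, retention=200)
# With a 200-tick retention span, only the entry from t=900 survives,
# so recovery points older than the cutoff are no longer available.
```

The trade-off is exactly the one the article names: a longer retention span preserves more recovery points but consumes more disk space.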
In contrast with true CDP, near CDP does not continuously send updates to the backup data store. Instead, it transmits updates only at particular points in time. How those points are defined depends on the CDP tool, but a typical strategy is to send updates to the backup only when a file is saved or closed. While this ensures that the backup data store will contain only clean recovery points, in some cases, it may result in updates not being backed up for several hours. For organizations that process particularly high transaction volumes or for companies that operate under strict data protection regulations, this backup frequency may be inadequate.
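The save-or-close trigger that distinguishes near CDP can be sketched as a working buffer that is flushed to the backup store only when a file is closed. All names here are hypothetical:

```python
class NearCDPStore:
    """Toy near-CDP: updates accumulate in a working buffer and are
    shipped to the backup store only when the file is closed (one
    common trigger). An illustration only."""
    def __init__(self):
        self.backup = {}   # clean, closed versions only
        self.buffer = {}   # open files: changes not yet protected

    def write(self, name, content):
        self.buffer[name] = content

    def close(self, name):
        # Only a clean recovery point reaches the backup store.
        self.backup[name] = self.buffer.pop(name)

store = NearCDPStore()
store.write("report.doc", "v1")   # unprotected while the file stays open
store.close("report.doc")         # now captured as a clean recovery point
```

The sketch makes the limitation visible: a file that stays open for hours leaves all of its intervening updates unprotected, which is why high-volume or heavily regulated shops may find near CDP inadequate.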
A New Perspective
Today’s data recovery tactics should not be yesterday’s DR tactics. Technologies have advanced considerably over the years, making it possible for organizations to address problems that they simply had to live with in the past. Forward-thinking organizations now take a broader view of both disasters and recovery. The definition of “disaster” has expanded to include not only truly catastrophic events, such as natural disasters, but also incidents that threaten to disrupt only small parts of the business, such as the accidental deletion of a file or the data corruption caused by a computer virus or malicious activity.
The definition of recovery has also expanded to include not just the recovery of data after a disaster, but also the avoidance of the need for disaster recovery in the first place. This can be achieved by implementing HA solutions that incorporate remote backup servers. A broader definition of recovery also goes beyond the macro-level recovery of an entire data center, to include the recovery of individual files and data items when appropriate.
Today’s sophisticated recovery solutions must be designed to solve problems rather than just to provide a “good enough” alternative. They permit the rapid restoration of both individual data items and complete databases to any point in time, as required by the circumstances.
Ferenc Gyurcsan has more than a decade of experience in the clustering industry, with exposure to high availability and high performance compute clusters. During the past nine years he worked for Vision Solutions and its predecessors (CLAM Associates, Availant, Lakeview Technology) in the high availability, disaster recovery, and business continuity arena with a focus on AIX and Linux solutions. His roles included lead software engineer for IBM’s HACMP product suite, as well as sustaining development manager for IBM PowerHA and for Vision’s open systems products. Most recently he has been a senior solutions consultant and a member of the product management team for Vision’s EchoStream and EchoCluster product portfolio. Gyurcsan holds an MBA degree from the University of New Hampshire, an MSCE degree from the Budapest University of Technology and Economics, and a product management PMC certification.