Why Is It Different?
Currently, the most common methods of protecting data against disasters include:
• automated backups
• off-site media storage
• data mirroring
• remote data replication
• data snapshots
Automated backups ensure that files are backed up (commonly onto tape) on a routine basis, and a second copy is often stored off-site for safekeeping. Rebuilding “live” data from tape, however, is cumbersome and time-consuming, requiring an average of 17 to 25 hours to restore a one-terabyte data volume. If the tape must be retrieved from an off-line, off-site storage facility, recovery may be measured in days.
Data mirroring creates a duplicate (secondary) on-line copy that can replace the primary data in an emergency. Remote data replication simultaneously duplicates data to a secondary system, ensuring continuous access should the primary system fail. Both techniques eliminate the wait associated with loading and restoring back-up tapes after a data disaster, because the replicated data can be substituted quickly. However, they share an inherent limitation: each may replicate corrupted data as it enters the system, leaving a damaged copy as the basis for recovery. Snapshots address this weakness by adding a point-in-time data image to existing back-up and recovery procedures, reducing the chance of reintroducing corrupted data during the recovery process. Even so, data written between point-in-time images can still be lost, and the snapshot process itself imposes a load on the system and its applications.
Continuously capturing data makes possible a rollback to any point in time, eliminating the limitations of snapshots. This approach facilitates the recovery of vast amounts of data by backing out data corruption, rather than rebuilding from archives and snapshots. Applications can be restored accurately and verifiably with unprecedented speed.
With the ability to back out the corrupted data rather than rebuild the entire data image, full data restores following a major data disaster occur in minutes versus hours. For example, 1 TB or more of data can be fully restored in less than 20 minutes.
How Rollback Works
A history journal, always queued for immediate recovery, provides a constant running record of application activity, making it possible to “time slide” forward or backward to any point in time and recover data in its uncorrupted state. Writes are continuously intercepted and tagged in real time as they occur, without altering the actual data or affecting the application. A second process (typically on a remote system) maintains an active replica and journals a history of data activity. This recording and journaling has no impact on running applications or server operations, requires very little storage space and makes it possible to scroll the data to any point in time.
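As a rough illustration, the intercept-and-journal mechanism described above can be sketched in a few lines of Python. All names here are hypothetical (the article does not describe the product's internal data structures), and for clarity the sketch indexes journal entries by a monotonically increasing sequence number rather than wall-clock time:

```python
class JournaledVolume:
    """Minimal sketch of a block store with a write-history journal."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks  # the "live" block store
        self.journal = []                 # entries: (seq, block_no, prior contents)
        self.seq = 0                      # monotonically increasing write counter

    def write(self, block_no, data):
        # Intercept the write: record the prior contents first,
        # so the change can later be backed out.
        self.seq += 1
        self.journal.append((self.seq, block_no, self.blocks[block_no]))
        self.blocks[block_no] = data

    def rollback_to(self, restore_seq):
        # Walk the journal backwards, undoing every write made
        # after the chosen restore point.
        while self.journal and self.journal[-1][0] > restore_seq:
            _, block_no, before = self.journal.pop()
            self.blocks[block_no] = before
```

A corrupting write can then be backed out by rolling back to the sequence number recorded just before it occurred, rather than rebuilding the whole volume from an archive.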
Because it works at the block level, this process is not dependent on specific applications or storage solutions; it is compatible with, and provides near-real-time recovery for, any application, relational database or file system, and it is not constrained by protocols or formats. Because everything is tracked on disk, a rollback to any point in time is possible, enabling the user to back out the corruption rather than rebuild the data structure. Note that this process does not replace an existing data archive solution; it adds a layer of protection that provides almost immediate data recovery following any type of data disaster.
When a data restore is necessary, rolling back through the history journal establishes a precise restore point. Once a valid restore point is identified, the rollback sequence is committed to the production application data. Recovery methodologies include full restoration, partial restoration and disaster procedures.
Full recovery is accomplished by rolling back a copy on the back-up server to validate the restore point. The same rollback sequence is then applied to the production application to resynchronize the active data. Optionally, the rollback can be applied directly to the production application, bypassing validation on the back-up server.
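The validate-then-commit flow above can be sketched as follows (all names are hypothetical, and the rollback and consistency-check callables stand in for whatever the storage software actually provides): candidate restore points are tried against the back-up copy, newest first, and the first point whose rolled-back image passes a consistency check is the one to commit to production.

```python
def find_valid_restore_point(candidates, rollback_backup, is_consistent):
    """Try candidate restore points on the back-up copy, newest first;
    return the first one whose rolled-back image passes validation."""
    for point in sorted(candidates, reverse=True):
        image = rollback_backup(point)  # rolls back the back-up copy only
        if is_consistent(image):
            return point
    return None  # no valid restore point among the candidates
```

Once a valid point is found, the same rollback sequence would be applied to the production data to resynchronize it, as the article describes.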
Partial recovery is accomplished in the same manner to determine the restore point. Affected tables and records are extracted and then inserted into the production application. A tremendous benefit of this partial recovery process is that the running database remains operational for applications not accessing these specific records.
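A partial restore of this kind amounts to a targeted copy, sketched below with hypothetical names: only the affected records are extracted from the rolled-back image and written into the live store, so unrelated records are never touched and remain available to running applications.

```python
def partial_restore(production, restored_image, affected_keys):
    """Copy only the affected records from a rolled-back image into the
    live production store; all other records remain untouched."""
    for key in affected_keys:
        production[key] = restored_image[key]
    return production
```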
Disaster recovery is accomplished by pointing to the virtual copy of the application’s data on the back-up server. The application’s image is “rewound” on disk to any “live” transactional processing point in time.
To date, this rollback recovery process has been successfully tested with the Solaris (Sun) and AIX (IBM) operating systems and coupled with existing file back-up solutions, providing near-zero restoration time. It runs across two servers – the application (or production) server and a back-up server – and co-resides on the server holding the storage management software. Implementation requires a 300MHz+ processor with a minimum of 512MB of RAM and 2 percent extra disk space on the client side, and a 300MHz+ processor with 1GB of RAM and 120 percent of the client’s disk capacity on the server side.
This non-intrusive continuous data capture and immediate recovery process is raising the bar for data recovery performance and business continuity standards.
Administrators should frequently revisit their enterprises’ in-place business continuance plan to ensure that it is taking into account newly introduced disruptive advances and enhancements in recovery technology. One such important transformation is occurring now.
Jeff Iverson is vice president of strategic and technical alliances for Vyant Technologies, Inc. (www.vyanttech.com), a Fairfax, Va., software development company. Iverson has 22 years’ experience in storage management, systems integration and application development.