2.1 FIGURE 3 DEFINITIONS
2.1.1 RECOVERY POINT OBJECTIVE/RECOVERY TIME OBJECTIVE COMPONENTS
- Transactions - Transactions Not Captured: Those transactions permanently lost (in the production system's electronic format); i.e., the average interval of data not migrated off site at the time of the disaster.
- Declaration - Decision-making process, and the mechanics of getting the failover site and/or vendor notified. Includes determining whether to fail over and sometimes also where to do so.
- Data - Data Retrieval: Getting backup and logs out of storage and ready to ship to the recovery site. Most frequently in Traditional Recovery, the location at which backups and logs are stored is not the restore and recovery location.
- Transit - Transit Time: Time to move backup and logs from off-site storage vendor or off-line storage to the recovery site and system.
- System - System Restore: A cold restore. The OS and updates are loaded.
- Sys and Net Boot - Systems and network boot time.
- Database - Database Restore: Database is installed and booted. Tape backups and logs are loaded onto disk.
- Transaction - Transaction Recreation: Recovery in the broadest sense. Can include manual transactions, re-transmitting point of sale systems data, EDI and of course normal log-based transaction recovery.
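The components above can be combined with simple bookkeeping. The sketch below assumes the recovery point corresponds to Transactions Not Captured and that the remaining components run one after another, so that recovery time is their sum. The class and field names are hypothetical, and the hour values are illustrative placeholders chosen only so that the total matches the 72 hours quoted for Traditional Recovery; they are not the measurements from Figure 3.

```python
from dataclasses import dataclass, fields


@dataclass
class RecoveryProfile:
    """Hours spent in each component (hypothetical, illustrative values)."""
    transactions_not_captured: float  # data-loss window (recovery point side)
    declaration: float
    data_retrieval: float
    transit: float
    system_restore: float
    sys_and_net_boot: float
    database_restore: float
    transaction_recreation: float

    def recovery_point_hours(self) -> float:
        # The data-loss window is the interval of work never moved off site.
        return self.transactions_not_captured

    def recovery_time_hours(self) -> float:
        # Assumption: the components are sequential, so elapsed time is their sum.
        return sum(
            getattr(self, f.name)
            for f in fields(self)
            if f.name != "transactions_not_captured"
        )


# Placeholder values only; they add up to 72 hours of recovery time.
example = RecoveryProfile(24, 4, 6, 8, 10, 1, 12, 31)
print(example.recovery_point_hours(), example.recovery_time_hours())
```

Setting a component to zero in such a profile mirrors the way the approaches in the next section eliminate individual steps.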
2.1.2 RECOVERY APPROACHES
The typical procedures and technologies associated with each of the measured outcomes in Figure 3 follow. Procedures for individual sites vary.
- Traditional Recovery - Nightly backups are performed. Courier services pick up backups from each production site daily, i.e., the process virtually all companies have in place. Tapes usually are not kept in the location at which they will need to be applied.
- Electronic Vaulting - Bulk Data Transfer: Complete backups are physically shipped off site once weekly. Logs are batched electronically several times daily, and then loaded into a tape library located at the same facility as the planned recovery.
- Declaration Time is cut in half to about two hours. There is less time dithering about what to do because a clear contingency plan is available. In a traditional situation, the weight of the decision to declare is more ominous. Management knows if you go, you will not recover for 72 hours, and coming home is difficult at best.
- In general, as customers move to better Recovery Time Objective designs, the decision process gets more automated, and fewer criteria are used to make a decision. What was once a hit of 72 hours to recover is now less.
- Data Retrieval and Transit Time are eliminated, as the local tape library retrieval is not on the critical path.
- System Restore time is eliminated because a pool of warm systems is available in a large recovery data center.
- Database Restore does not decrease, as the pool systems do not have database software installed.
- Transaction Recreation is reduced from days to a few hours because there is far less to recreate via manual and electronic methods like EDI re-transmission, and batch processing apply (requiring processing before apply). Instead, database recovery using archivelogs comprises a much larger portion of the recreation process.
- Transactions Not Captured are reduced from about 24 to two hours, a reflection of the frequent log transmissions written into the remote vault.
- Transaction Protection - Automated Remote Journaling of Redo Logs. This is the same as Electronic Vaulting, except that instead of transmitting several transaction batches daily, the archivelogs are shipped as they are created. At the remote recovery center, full backups still are (physically) re-shipped weekly to minimize recovery time. As a result:
- Transactions Not Captured are reduced to about 25 minutes from about 2.5 hours (vs. Electronic Vaulting).
- Some data loss upon disaster remains because the online redo logs are lost. This is discussed more fully in the next section.
- Transaction recovery is roughly ten times shorter than Electronic Vaulting as a function of the similar reduction in Transactions Not Captured. This leaves little to be recreated by methods other than archivelog apply (e.g., EDI retransmissions).
Oracle8i Standby Database automated log shipping can be used to send the archivelogs as they are created. In pre-Oracle8i, archivelog transmission is via user-created scripts. Ongoing recovery (log apply) on the standby is not a part of this process.
- Standby Database - Continuous Application of Redo Logs: Same as Transaction Protection, except that logs are applied on a dedicated standby system. All recovery components times are roughly the same as for Transaction Protection, except for:
- Declaration: Declaration tends to be significantly shortened compared to Transaction Protection - Automated Remote Journaling of Redo Logs because with less transaction loss, there is less risk - and cost - in doing so.
In general, as customers move to better Recovery Time Objective designs, the decision process gets more automated, and the criteria to make a decision simpler. Comdisco has also seen that typically when customers start off small with their recovery solutions (Electronic Vaulting), they are still inexperienced with relying on it. As they move to more advanced designs, they have more experience, practice, and confidence. Some customers with more advanced designs actually increase their frequency of declarations. In July of 1999, when the Chicago Markets went down with power problems, one Comdisco customer declared a disaster as a pre-emptive strike because they were adjacent to the affected area. When things settled down, they came home. While they did not actually use their DR system, they were ready to do it. In July they had confidence in their DR System that they did not have a few years ago.
As DRCs (Disaster Recovery Coordinators) and IT staffs get more experienced, they tend to work on how to shorten the components of the RT. 'Decision Making' tends to be streamlined. In some instances, the declaration process has moved from senior executives down to senior line managers over time.
- Database Restore. Database Restore time is eliminated because the system is already loaded and continuously applying logs on the standby prior to the failure.
System and Network boot time stays about the same (vs. Transaction Protection), as the standby still must be reconfigured to become the primary. Transaction loss is the same as Transaction Protection because only archive logs are shipped. Some data loss upon disaster remains because the online redo logs are lost.
- Hot System - Using Oracle Advanced Replication: In terms of the recovery time components, this is the same as Standby Database Continuous Application of Redo Logs, except that system and network booting is virtually eliminated. As a result:
- System and network booting and reconfiguration time are reduced to nearly nothing compared to the Standby Continuous Log Apply method.
Two systems are kept online in different locations, though Advanced Replication can be configured for more than two systems for other purposes. The notion of primary and backup sites can be immaterial because Oracle Advanced Replication allows updates to both systems.
Copies of the database are maintained at both locations and transactions exchanged. Conflict detection and rules-based resolution keep the replicated database consistent. When updates are permitted on both secondary and primary systems, some data is lost as a result of a catastrophic outage. This is the configuration used in Figure 3.
Alternatively, to configure for zero data loss, updates can be performed on a designated primary, with unidirectional and synchronous transaction updating to the backup site. The backup system's resources are used for other purposes, including reporting against the production database. Section 3.3 ('Advanced Replication for Disaster Recovery') discusses when Oracle Advanced Replication is selected for disaster recovery.
MATURITY, LEGACY METHODS, AND CHANGED COST TRADE-OFFS
Some of the differences amongst the methods discussed above stem from business maturity differences between the firms likely to pick each of the different technologies. This effect is discussed above for Declaration Time.
Another trend is driven by both maturity and the fact that the economics of both Electronic Vaulting and Transaction Protection (older, 'legacy' techniques) have changed over the years. For both, the management costs of handling tapes propel administrative costs higher compared to Standby Database - Continuous Application of Redo Logs and Hot System - Advanced Replication. Change control and people efforts not only make cataloging tapes (vs. immediately applying logs) more expensive, but the process also has more manual content and is therefore a possible source of failure. The companies implementing these methods probably do not realize that they force relatively higher administrative costs until they implement another method.
Electronic Vaulting is used by some of Comdisco's larger customers with lots of data and lots of organizational maturity. Declining IT hardware costs along with higher labor costs are driving smaller and newer customers to bypass this scheme.
Oracle Standby Database performing Continuous Application of Redo is the disaster recovery solution most frequently used for Oracle mission critical applications. Figure 3 shows one reason for this. Oracle Advanced Replication implementing Hot Standby provides slightly improved failover time compared to using Oracle Standby Database for Continuous Log Apply; this is because of the elimination of reboot and reconfiguration time. These two methods are the only two providing very low failover time and lost data. Of these two, Oracle Standby Database is chosen most frequently because it is easier to implement and does not require application program modification.
3. DISASTER RECOVERY SOLUTIONS DESCRIBED AND CONTRASTED
3.1 STANDBY DATABASE DESCRIBED
To create a Standby Database, a full backup of the primary (production) database is made and restored on the standby. After it has been configured as a Standby system, the Standby Database is started up in nomount mode.
All subsequent archived redo logs generated by the primary system are shipped to the standby system. Some customers do not apply these logs at the Standby site. Instead, they use Standby Database as described above under Transaction Protection: Automated Remote Journaling of Redo Logs. However, in the typical case, the Standby Database is maintained in a perpetual recovery mode; logs are applied as they are received.
Should the original database fail, the standby database can be configured for production and opened.
The transactions lost upon failure include:
- All online redo logs not archived
- Possibly an archived redo log being transmitted at the time of the disaster. Partial archivelogs cannot be recovered.
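The loss boundary described above can be sketched in a few lines. The example below is a simplified model, not Oracle's recovery logic: it assumes archived logs are identified by a sequence number, must be applied in order, and are unusable if transmission was cut off mid-file. `ArchivedLog` and `last_recoverable_sequence` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArchivedLog:
    sequence: int   # log sequence number
    complete: bool  # False if the transfer was interrupted mid-file


def last_recoverable_sequence(received: List[ArchivedLog], last_applied: int) -> int:
    """Return the highest log sequence the standby can recover through."""
    by_seq = {log.sequence: log for log in received}
    seq = last_applied
    while True:
        nxt = by_seq.get(seq + 1)
        # Stop at the first missing or partial log: logs apply strictly in order.
        if nxt is None or not nxt.complete:
            return seq
        seq += 1


received = [ArchivedLog(101, True), ArchivedLog(102, True), ArchivedLog(103, False)]
print(last_recoverable_sequence(received, last_applied=100))
```

Everything after the returned sequence, including the partial log and any online redo never archived, falls into the lost-transaction window.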
3.1.1 STANDBY DATABASE ADVANTAGES FOR ALL RELEASES
For all the current Oracle releases (7.3.X, 8.X, 8i), the advantages of using Standby Database are:
- Easy to understand and implement compared to other advanced methods. Because it uses standard Oracle recovery, all DBAs who understand recovery already have much of the requisite knowledge to implement Standby.
- Requires the least communication bandwidth compared to other solutions because only log shipping is required after setup.
- Completely application-transparent. All Oracle databases can be backed up and recovered, so all can use Standby technology.
- The performance impact on the production systems is zero.
- Very large systems can use Standby Database. The shipping and application of logs via ongoing recovery is almost never slower than the primary's log production, except when data communications and/or standby system bandwidth is insufficient.
3.1.2 STANDBY DATABASE ISSUES FOR ALL RELEASES
When implementing Oracle Standby Database, best practices should be created with the awareness of the following issues:
- Once a standby database is moved from nomount to open read/write, it cannot be put back in standby mode. This is because the Standby ceases to be a block-for-block replica of the primary, a requirement for recovery to work. To recreate the standby, a full backup of the primary must once again be created and restored at the Standby before log apply can resume.
- If failover to the Standby occurs as a result of a disaster, failback to the original primary is in fact database recreation. To move processing back to the original primary after a disaster, the new primary (old Standby) must be completely backed up, and then the original primary system rebuilt, just as in the creation of any new Standby system.
- NOTE: There is one way to 'failover' and 'failback' using Oracle Standby Database. Paradoxically, this works only when no failure has occurred. If the primary can be shut down in a consistent state, and all logs applied to the Standby, the primary and Standby can switch roles. One way this is used is to reduce planned downtime due to hardware upgrades. This is the most difficult Standby procedure, to be performed only by well-versed DBAs. This is detailed in 'Graceful Switch Over & Switch Back using Oracle Standby Databases', a white paper by Oracle's Lawrence To.
- Any logs that have not been applied before the standby database is activated cannot be applied afterwards.
- The Standby system must be binary compatible with the primary system in order to read the redo logs. The same software versions for both the operating system and the Oracle database must be used, and on the same hardware architecture.
3.1.3 STANDBY DATABASE ENHANCEMENTS IN ORACLE8I
The Oracle Standby Database features are nearly identical in Oracle 7.3.X and 8.0.X. Major enhancements were made in Oracle8i, including:
- Read-Only Mode: Allows Standby Database to be opened read-only for queries.
- This mode is used primarily for data validation, as logs cannot be applied once the Standby has been opened read-write.
- Managed Standby Mode: Allows the Oracle software to ship, manage and apply archived logs automatically, greatly reducing the possibility of error over manual methods.
- Multiple archive log destinations: Up to four remote destinations and a total of five remote or local destinations can be specified. Each of these can be optional or mandatory. If a destination is mandatory, the (synchronous) log transmission must succeed in order to process transactions on the primary. The alternate destinations feature allows archive logs to be both automatically stored locally, as well as at multiple remote sites.
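The destination rules in the last item can be modeled as follows. This is a minimal Python sketch of the semantics as described above (at most five destinations, at most four of them remote, and archiving counted as successful only when every mandatory destination succeeds); it is not Oracle configuration syntax, and all names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_DESTINATIONS = 5  # total local or remote destinations, per the text above
MAX_REMOTE = 4        # remote destinations, per the text above


@dataclass(frozen=True)
class Destination:
    name: str
    remote: bool
    mandatory: bool


def validate(dests: List[Destination]) -> None:
    """Reject destination sets that exceed the stated limits."""
    if len(dests) > MAX_DESTINATIONS:
        raise ValueError("too many archive destinations")
    if sum(d.remote for d in dests) > MAX_REMOTE:
        raise ValueError("too many remote archive destinations")


def archive_succeeded(dests: List[Destination], results: Dict[str, bool]) -> bool:
    """Archiving succeeds only if every mandatory destination succeeded."""
    return all(results.get(d.name, False) for d in dests if d.mandatory)


dests = [
    Destination("local_disk", remote=False, mandatory=True),
    Destination("dr_site", remote=True, mandatory=False),
]
validate(dests)
# An optional remote failure does not block the primary in this model.
print(archive_succeeded(dests, {"local_disk": True, "dr_site": False}))
```

Marking the remote destination mandatory instead would make a failed transmission block archiving in this model, which is the trade-off between protection and primary availability that the item describes.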
3.2 STANDBY DATABASE WITH THIRD PARTY SOLUTIONS TO ACHIEVE ZERO DATA LOSS
If the online redo logs are mirrored to a remote site whilst also using Standby Database, data loss can be eliminated. After the last complete log sent by the primary is applied, the Standby is opened using a control file created on the Standby. The online redo logs created from mirroring the primary are then recognized when the database is opened.
3.2.1 GEOGRAPHIC DISK MIRRORING
Geographic Disk Mirroring takes a set of physically disparate disks and synchronously mirrors them over a high performance communications line. Any write to a disk on one side will result in a write to the other. The local write will not return until the acknowledgement of the remote write is successful. Thus there is a requirement for high performance communications to provide both high throughput and low latency.
To achieve the necessary communications bandwidth, ESCON channels over fiber are used for short distances (<60 km or 45 miles) and one or several T3 lines are used for greater distances; the telecommunications costs alone can be higher than the amortized cost of the Standby system. However, these costs are dramatically lower when used to supplement Oracle Standby Database, i.e., when solely log file changes are mirrored.
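The latency requirement of synchronous mirroring can be estimated with a rough calculation. The sketch below assumes a signal speed in optical fiber of about 200 km per millisecond (roughly two thirds of the speed of light), assumes the local and remote writes are issued in parallel, and ignores switching and protocol overhead. The function name and the disk service times are hypothetical.

```python
FIBER_KM_PER_MS = 200.0  # assumed propagation speed in fiber, about 2/3 of c


def sync_write_latency_ms(local_write_ms: float,
                          distance_km: float,
                          remote_write_ms: float) -> float:
    """Estimate when a synchronously mirrored write can return.

    The write returns only after both the local write and the remote
    acknowledgement have completed (assumed to proceed in parallel).
    """
    round_trip_ms = 2 * distance_km / FIBER_KM_PER_MS
    return max(local_write_ms, round_trip_ms + remote_write_ms)


# Hypothetical 5 ms disk service time on each side.
for km in (60, 1300):
    print(km, "km:", round(sync_write_latency_ms(5.0, km, 5.0), 2), "ms")
```

Under these assumptions, 60 km adds about 0.6 ms of propagation delay per write, while a site well over a thousand kilometers away adds more than 10 ms, which is why distance weighs so heavily on synchronous designs.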
There are a number of storage solutions currently available. EMC's Symmetrix Remote Data Facility (SRDF) is but one example. However, only EMC's SRDF has been submitted to - and validated as compatible by - the Oracle Storage Compatibility Program (OSCP). SRDF is required to mirror the online redo logs.
In addition to the remote mirroring software (SRDF), archived redo log transmission also requires a mirrored disk on the standby side to enable the application of the logs on the Standby. EMC's product that performs this function is TimeFinder. Periodically the mirror is broken and the logs applied to the Standby, but only after writes have been redirected to another device. Care must be taken when breaking the mirror to ensure that all archived redo logs are left complete, or whole, as fractional archived redo logs cannot be applied.
Alternatively, in Oracle8i, Standby Database's automated log shipping can be used to send and then apply the archived redo log stream. If this approach is chosen as part of a zero data loss approach, only the online redo log must be mirrored using SRDF or other means.
3.2.2 SOFTWARE DATA MIRRORING
The use of software for data mirroring presents an alternative to SRDF's hardware mirroring. While none of the software solutions have been validated by the Oracle Storage Compatibility Program (OSCP), their usage presents a potential alternative to SRDF.
Software data mirroring products could be utilized to mirror both the online and archived redo logs at remote sites. These products generate changes to the remote device by mirroring at the I/O device driver level. As disk write activity is identified at the source (production) site, the write is transported across a TCP/IP path to a server at a remote site. The server at the remote site then updates the remote disk with the I/O. While the software products are not synchronous between sites, they guarantee the order of disk writes and hence offer the potential of a reliable mirror mechanism.
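The value of guaranteed write ordering can be shown with a small model. The sketch below is not any vendor's implementation; it is a hypothetical in-memory illustration in which writes are queued first-in, first-out and shipped later, so the remote copy always reflects a prefix of the source's write history even when it lags behind.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class OrderedMirror:
    """Toy model of asynchronous, order-preserving disk mirroring."""

    def __init__(self) -> None:
        self.local: Dict[int, bytes] = {}
        self.remote: Dict[int, bytes] = {}
        self._pending: Deque[Tuple[int, bytes]] = deque()

    def write(self, block: int, data: bytes) -> None:
        # The local write returns immediately; the remote copy is queued.
        self.local[block] = data
        self._pending.append((block, data))

    def ship(self, max_writes: int) -> int:
        """Apply up to max_writes queued writes to the remote, oldest first."""
        shipped = 0
        while self._pending and shipped < max_writes:
            block, data = self._pending.popleft()
            self.remote[block] = data
            shipped += 1
        return shipped


mirror = OrderedMirror()
mirror.write(1, b"redo-A")
mirror.write(2, b"redo-B")
mirror.write(1, b"redo-C")  # later overwrite of block 1
mirror.ship(2)              # remote now holds the state after the first two writes
print(mirror.remote)
```

Because the remote never sees a later write without the earlier ones, it is stale but internally consistent after an outage. The writes still queued at the source are the data that would be lost, which is why such products are not a zero data loss mechanism on their own.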
Oracle's policy on any devices or software used for remote mirroring is that they must be Oracle-validated as a prerequisite to full support of Oracle software. Oracle software used in these solutions will be fully supported only if the product is validated by OSCP, where Oracle developers review product architecture, and Oracle-designed tests are run, to ensure that recoverability, and therefore data integrity, is maintained.
Oracle, Comdisco, and other vendors are committed to the OSCP program and will participate in the validation process before implementing solutions based on such products.
3.2.3 COMPARING GEOGRAPHIC DISK MIRRORING AND SOFTWARE DATA MIRRORING
The following is based on a recent Comdisco needs assessment for a significant SAP R/3 implementation:
- Recovery Time Objective = 8 hours
- Recovery Point Objective < 15 minutes
- Communication Lines Between Production Site and DR Site: 2 T1s, load balanced and compressed.
- Communications Line Backup Strategy: Diverse carriers, ISDN PRI
- Miles Between Production and DR Sites: 800
- Oracle Database Version: 7.3
- Database size: 800 GB
- Redo interval: 5 minutes
- Average redo archive size: 55 MB
- Redo Log Transport Mechanism: Home-grown FTP program
- Approximate Secondary Site System Cost: $900,000. Includes a customer-managed recovery solution, including HP K460 (used), EMC storage (mirrored with BCVs), and communications equipment.
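The figures above allow a quick check of whether the lines can keep up with redo generation. The sketch below is a back-of-envelope estimate that takes a T1 at its standard 1.544 Mbit/s line rate, treats 1 MB as 1,000,000 bytes, and ignores protocol overhead. The function names are hypothetical, and no compression ratio is assumed because the assessment does not state one.

```python
T1_BITS_PER_SEC = 1_544_000  # standard T1 line rate


def avg_redo_bits_per_sec(archive_mb: float, interval_min: float) -> float:
    """Average redo rate if one archive of archive_mb is produced per interval."""
    return archive_mb * 1_000_000 * 8 / (interval_min * 60)


def link_utilization(archive_mb: float, interval_min: float,
                     t1_lines: int, compression_ratio: float = 1.0) -> float:
    """Fraction of raw line capacity consumed by redo shipping."""
    capacity = t1_lines * T1_BITS_PER_SEC
    return avg_redo_bits_per_sec(archive_mb, interval_min) / compression_ratio / capacity


rate = avg_redo_bits_per_sec(55, 5)
util = link_utilization(55, 5, t1_lines=2)
print(f"{rate / 1e6:.2f} Mbit/s average redo, {util:.0%} of two T1s uncompressed")
```

On these assumptions, a 55 MB archive every 5 minutes averages about 1.47 Mbit/s, a little under half the raw capacity of two T1 lines before compression, which leaves headroom for peaks and for catching up after a line outage.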
3.3 ADVANCED REPLICATION FOR DISASTER RECOVERY
Oracle Advanced Replication is briefly described in Hot System - Using Oracle Advanced Replication, in section 2.1.2 ('Recovery Approaches').
3.3.1 WHEN ORACLE ADVANCED REPLICATION IS SELECTED FOR DISASTER RECOVERY
1. A no data loss, or nearly zero data loss solution is required, and a single-vendor software solution is preferred.
2. A secondary system that can be opened for both reporting and transactions is desired. Advanced Replication's transactional support for DR at the secondary system is limited, however, when configured for zero data loss. In this case, Advanced Replication is configured with a unidirectional and synchronous flow from the primary to the standby to ensure point-in-time data consistency. Any transactional input on the standby must therefore impact solely non-critical data. Alternatively, Oracle Advanced Replication can be used in other configurations, provided some data loss is acceptable. This approach provides support for both reporting and transactions on the secondary system.
3. Unlike Standby Database, a precise match of system architecture, operating system and Oracle software versions is not required.
4. Failover and failback are relatively easy compared to Oracle Standby Database. Not only is failback easier after failing over during a disaster, but software upgrades are facilitated by this heterogeneity.
3.3.2 FEATURE/FUNCTION COMPARISON: ORACLE ADVANCED REPLICATION AND STANDBY DATABASE
A wide range of disaster recovery solutions can be assembled using different technologies and implementation approaches. In order to determine the correct approach, the business risk of down time must be estimated. Only the line business management that is served by the application can provide an accurate and credible assessment. Oftentimes the biggest down time costs, like lost revenue, are not directly measurable.
The active participation of line management throughout the disaster planning cycle is vital for successful implementation. The brick and mortar 'failover' must work, and in combination with the IT pieces. Every part of the disaster plan must be tested periodically. Personnel turnover is inevitable, so all portions of the plan must be documented and available off-site. The documented plan can be outdated for reasons that are not discovered until someone tries to use it once again.
A zero data loss solution can be assembled today using Oracle technology coupled with third party offerings. Oracle Standby Database is most frequently used in these configurations. In fact, across all Oracle sites, except for traditional recovery methods, Oracle Standby Database is the most frequently used technology.*
David Edborg is a senior architect with over 20 years experience in the IT industry. He is currently a consultant in Comdisco's Advanced Technology Consulting Practice, specializing in database recoveries for open systems. He specializes in advanced recovery solutions for clients with recovery time objectives of less than twenty four hours.
Mark J. Smith of Oracle Corporation is a Product Manager for High Availability & Storage Management for the Oracle database. His 20 years of product management and implementation experience includes applications, large systems, database products, and systems management. He may be reached at firstname.lastname@example.org.
*Certain portions of Oracle Corporation copyrighted materials have been reproduced herein with the permission of Oracle Corporation.