1. DECISION-MAKING: HOW TO DETERMINE WHEN SOLUTIONS ARE REQUIRED
1.1 WHAT IS AT STAKE
Some examples follow to illustrate the importance of disaster recovery (DR) planning:
- A study of companies that suffered a catastrophic data loss found that 43% never reopened and 51% closed within two years. Only 6% survived. (University of Texas)
- Of those firms hit by the World Trade Center bombing, 50% of the businesses without a DR plan were out of business within 2 years. (University of Texas)
In the abstract, many managers outside of IT agree that 'something' should be done to minimize risk. Yet when faced with the high cost of providing protection, many naturally balk, and not only for the usual reasons. Risk management decisions are abstract by nature, and disaster recovery layers more complexity onto the everyday risk management decision. Which applications are mission-critical, and therefore pose a risk to the company's survival if they are lost? How much protection should - or can - be provided? What must be done to make sure that disaster plans reduce most, not just some, of the risk?
Even though disaster recovery (DR) solutions can be a tough sell, the awareness of their importance is surging. One indicator of this is the growth of DR-related spending. But first some definitions:
Business Continuity Planning: An all-inclusive plan, including IT and non-IT areas, to continue business functions when something goes wrong.
Business Continuity Services: Consulting, implementation and/or facilities used to implement the Business Continuity Plan.
Business Continuity Services (BCS) spending trends:
- $3 billion/year worldwide market for this outsourced portion alone
- 15-20% compound annual growth rate
Currently BCS spending comprises:
- 3-4% of non-financial IT spending
- 5-7% for financial services (Source: GIGA 1/15/98)
Those firms that already have Business Continuity Plans in place feel the pressure to do more:
- 'Enterprises that today tolerate two-day recovery-time objectives will see that horizon diminish to one day or less by year-end 1999.' (D. Scott, 'Disaster Recovery: Weighing Data Replication Alternatives', Gartner Group, 2/24/98)
1.2 COST OF DOWN TIME: WHAT ARE YOUR OPTIONS?
The cost of business continuity planning implementation varies greatly depending on the required level of protection. To define the methods and products that should be used, two objectives must be set:
Recovery Time Objective (RTO): The time required to recover critical systems to a functional state. In an HA (high availability) environment this might be called the Failover Time Objective.
Recovery Point Objective (RPO): The maximum time interval, measured backward from the point of data loss, for which transaction loss is tolerable. As an equation:
Recovery Point Objective (RPO) = Time of disaster (data loss) - Time of last recoverable transaction
Example: The U.S. Securities and Exchange Commission (SEC) requires brokerages to recover all transactions. A business in this environment may have an RPO of milliseconds, effectively making the Time of Failure equal to the required recovery point.
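As a simple worked illustration of the equation above (a minimal sketch; the timestamps are hypothetical), the data loss window experienced in an outage is just the difference between the time of the disaster and the time of the last transaction recoverable from the off-site copy:

```python
from datetime import datetime

# Hypothetical timestamps, for illustration only.
time_of_disaster = datetime(1999, 7, 14, 10, 30, 0)       # the moment the data was lost
last_recoverable_txn = datetime(1999, 7, 14, 10, 28, 45)  # newest transaction available off site

# RPO achieved = time of disaster - time of last recoverable transaction
data_loss_window = time_of_disaster - last_recoverable_txn
print(f"Data loss window: {data_loss_window}")  # 0:01:15 - must not exceed the stated RPO
```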
In order to determine Recovery Time Objective and Recovery Point Objective, the cost of downtime and lost data must be quantified. Involvement of all departments that are the application's 'customers' must start early because estimating down time costs requires applying business judgement only line functions can provide. Subjective assumptions must be made in order to estimate the cost of down time in the many cases when adequate quantitative data are too costly - or impossible - to create. For example, if a Web E-commerce site goes down for a day, how many customers will leave for a competitor, never to return? What is the resultant lost revenue? These 'soft' costs - impossible to measure by normal accounting practices - can now comprise a far greater portion of total costs than the operationally measurable costs. Deriving the Recovery Time Objective therefore requires credibility only the business groups can bring to these estimates.
This is just one of several reasons that involving all user departments throughout - from justification through implementation - is imperative. Figure 1 illustrates that the resumption of mission critical business processes depends as much on the 'brick and mortar' issues like physical relocation as on the system recovery issues. The lower portion of the time line (ending with 'Backlog') depicts the 'brick and mortar' recovery steps, including manual workarounds when the system is down. Because any recovery depends upon the synchronization of both non-IT as well as IT infrastructure recovery, a Business Continuity Plan proposal, and its ongoing implementation, must incorporate both.

The key factor driving up the cost of downtime is the reduced availability of viable alternatives - both for the firm, and its suppliers. On the other hand, the alternatives available to customers have increased markedly. The cost of changing vendors is falling; today they do not even have to walk away, but merely click on another browser bookmark.
For firms, alternatives to down systems have been dramatically reduced. In the old days of card readers, nearly every critical process was recoverable by other means, usually manual. The Backlogged Transactions time line in Figure 1 depicts the backlog accumulated during the system's down time. Orders could be written down for later entry. In the past, manufacturing worked with printed work orders that normally were updated first in writing, and then key-entered. These transactions are applied in the 'Backlog' box above. Today, if there is any way to capture these lost transactions, it is more likely to be via some electronic means such as re-sending transactions normally sent via electronic data interchange (EDI). If an order processing system is servicing Web users, none of the transactions are recoverable in the case of a disaster.
The latest application innovations all leave firms with few or no options when they fail. ERP/supply chain management, E-commerce, and financial applications that provide strategic advantage all are justified by increased responsiveness or decreased cycle time. The resulting environments that justified these systems in the first place create corporate high wire acts. For example, it is not uncommon for corporations that implement both ERP and supply chain integration to reduce inventory by 75%. But if the systems go down, the inventory buffer against a completely stopped enterprise is now only 25% of its former size.
1.3 CALCULATING COST OF DOWNTIME
Unfortunately, there is no single easy method for deriving the downtime cost per hour because of the breadth of issues and variation in business environments across sites. Furthermore, conversion of downtime cost/hour to Recovery Time Objective and Recovery Point Objective is partially subjective.
Contingency Planning Research and Dataquest (11/97) estimated the hourly cost of downtime for a number of applications. The results ranged from $14.5K/hour for the loss of ATM services to $6.5 million/hour for brokerage services. Brokerage services provide a telling example of the impact on costs when there are no options: the SEC requires brokerages to recover all completed transactions to the point of failure within 72 hours.
Complicating Business Continuity justification still more, the hourly cost of down time normally is not linear. For example, take the case of an actual manufacturer with 30 plants running SAP. Their staff estimated:
- Downtime cost - $22K/hour, from which they implied a linear relationship:
- 8 hour cost - $170,000
- 24 hour cost - $528,000
- 48 hour cost - $1,056,000
This may be true over two days, and reflects a scenario in which much of the production stays up. However, with just-in-time inventory, the point at which production lines stop altogether can come after just two or three days. Similarly, if a Web E-commerce site goes down, the cost of the first hour is certainly much lower than the cost of the 48th hour; at some point the firm fails. For critical applications like these, the cost per hour can rise exponentially after just a day or two. Figure 2 illustrates the acceleration of the firm's downtime cost (Outage Cost) as a function of time.
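The contrast between the linear estimate and the accelerating reality can be sketched in a few lines of code. In the sketch below, the $22K/hour rate comes from the manufacturer example above; the 48-hour inventory buffer and the 15%-per-hour escalation beyond it are purely illustrative assumptions:

```python
def linear_cost(hours, rate_per_hour=22_000):
    """Linear estimate: every hour of downtime is assumed to cost the same."""
    return hours * rate_per_hour

def accelerating_cost(hours, rate_per_hour=22_000, buffer_hours=48, escalation=1.15):
    """Illustrative curve: once the (assumed) just-in-time buffer is exhausted,
    each additional hour costs more than the one before it."""
    if hours <= buffer_hours:
        return hours * rate_per_hour
    cost = buffer_hours * rate_per_hour
    hourly = rate_per_hour
    for _ in range(int(hours) - buffer_hours):
        hourly *= escalation
        cost += hourly
    return cost

for h in (24, 48, 72, 96):
    print(f"{h:3d} hours: linear ${linear_cost(h):>12,.0f}   accelerating ${accelerating_cost(h):>12,.0f}")
```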

1.4 FACTORS TO CONSIDER WHEN CALCULATING COST
The next section discusses different disaster recovery technologies. It could be tempting to think that for each technology, a Recovery Cost curve, like that in Figure 2, can be derived in a fairly routine fashion. But determining the cost of an approach must take into account more factors than a normal procurement. Factors unique to each firm can create significant cost variability amongst firms applying identical technologies.
1.4.1 SITE-SPECIFIC FACTORS INFLUENCING IMPLEMENTATION COSTS
1. Geographic Hazards: The further the recovery location from the production site, the more expensive the implementation. If the data center is in an active earthquake zone, a good rule of thumb is that the off-site recovery location must be at least 200-250 miles away, in a direction at right angles to the direction of the major fault. A similar distance rule of thumb applies to those in areas with a history of hurricanes close by. Other issues to consider include the frequency of ice and snow storms and whether or not the computing center is on a flood plain. All of these risks must be aggregated in order to determine the zone beyond which risk can not be meaningfully reduced. There are a number of resources to assist with this, including:
A. In the U.S., the Federal Emergency Management Agency provides maps detailing the history of natural disasters. Some of this information, published under the title Emergency Preparedness USA, can be accessed at their Web site at www.fema.gov.
B. There is likely an emergency management coordinator serving your area. Most frequently, these are county, state and federal officials.
C. Frequently there are municipal emergency management risk maps available.
2. Utilities: The likelihood of both power spikes and disastrous outages varies, largely by location. For example, sites in the U.S. west of the continental divide are significantly more exposed. This must be factored into decisions on items like power conditioners and even backup power generators, for both primary and backup sites.
Less obvious is the impact of location within each of the utility delivery grids. Paradoxically, New Jersey is a good alternate site for New York City, and vice versa. This is true not only because the disasters with typically wide geographic impact (earthquakes, hurricanes, or debilitating ice storms) are unlikely, but because each is served by a separate power and phone grid. When disaster strikes, local governments act as 'utilities' in their own right through their ability to deal with road and infrastructure issues, and New York City and New Jersey also have separate 'utilities' in this sense for the delivery of government services.
3. The complete list is huge. Just a few other factors to consider include the location of nearby facilities likely to produce disasters and whether the impact of any disaster is likely amplified by the presence of nearby hazardous materials.
Many of these issues are clearly insurance issues or, for self-insured firms, their flip side: risk management issues. In recent years, the pace of insurance innovation has shifted into high gear. Banks have long recognized the benefits of combining many different - i.e., uncorrelated - assets, such as loans, shares, and bonds, to lower the aggregate risk of their portfolio. The Economist (9/4/99) notes that, 'Working on that principle, insurers have started bundling traditional and non-traditional risks - exchange-rate, business interruption, fire, and so on - and selling their clients protection against all of them with so-called 'multi-trigger' policies.' Whether or not your firm is self-insured, if you have risk management experts available to you, consider consulting them.
4. Application needs: Section 2 ('Disaster Recovery Solutions Contrasted Using Comdisco Customer Data') discusses the most frequently used DR technologies. Certain technologies that are nevertheless rational choices for some situations are not discussed there. They are used less frequently because they require application code modification; they include queuing and transaction processing monitors. If your organization plans to write and implement mission-critical systems using these or similar technologies, the incremental cost of building in DR awareness can sometimes be justified.
1.4.2 UNIVERSAL FACTORS IMPACTING IMPLEMENTATION COSTS
Figure 1 illustrated the need to integrate all user departments, as well as facilities and other groups, to ensure that the facilities and people using the application can 'fail over' along with the computer systems. For both the systems and the people, the success of a disaster recovery solution depends as much on good execution in the following generic areas as on technology. Because of cultural, organizational and operational factors, each organization's implementation cost will vary - sometimes significantly - even for firms of similar size and with similar operational issues. The following best practices are generic to any disaster recovery plan implementation:
1. All recovery processes are defined, and for all units in the organization, low-tech 'brick and mortar' included.
2. The processes are repeatable and documented. Example: Imagine that the primary system is inaccessible, the backup site to be activated must be chosen from two backup locations, and everyone who can make the decision and bring up the systems is away from all three sites at the time of the disaster. At minimum, all of these administrators should know how to access each of the systems from off site. Once signed onto any one site, they should also be able to retrieve the plan, decide which standby system to activate according to established decision rules (including the impact on user groups' plans - a sketch of such decision rules follows this list), and then make it happen, with all the needed passwords and permissions.
3. Ongoing testing of each piece is planned and executed. This is less painful than it sounds. Everyday work can be tailored to exercise pieces of the recovery plan. A good low-cost example is routine installation and testing of new software and configurations on the standby system. But somehow testing of all the pieces must take place periodically.
4. Metrics are established and used to evaluate readiness. Again, this can be integrated into other information technology best practices. The requisite applications and systems monitoring for performance against Service Level Agreements can also be applied, and initially tested, on the off-site systems.
5. The program is continuously reexamined to ensure checkpointing amongst the many and disparate groups that will have to work together if there is a disaster.*
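The decision rules referenced in item 2 can be made concrete enough to script. The sketch below is hypothetical: the site names, the reachability checks, and the thresholds are assumptions, and a real plan would encode whatever rules the organization has agreed with its user groups:

```python
# Hypothetical decision rules for choosing which standby site to activate.
SITES = ["standby_east", "standby_west"]   # two candidate backup locations (invented names)

def choose_site(estimated_outage_hours, rto_hours, site_reachable):
    """Return the standby site to activate, or None if the outage should be ridden out."""
    if estimated_outage_hours <= rto_hours / 2:
        return None                           # short outage: do not declare
    for site in SITES:                        # fixed preference order taken from the plan
        if site_reachable.get(site, False):
            return site
    return None                               # nothing reachable: escalate per the plan

# Example: a 12-hour estimated outage against an 8-hour RTO, with the east site unreachable.
print(choose_site(12, 8, {"standby_east": False, "standby_west": True}))   # -> standby_west
```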
2. DISASTER RECOVERY SOLUTIONS CONTRASTED USING COMDISCO CUSTOMER DATA
Figure 3: Average Time to Recover (Source: Comdisco Recovery Services, Inc.)
Fig. 3 shows recovery time lines for a number of recovery techniques decomposed into stages of recovery. This chart is based upon actual customer data accumulated by Comdisco Recovery Services.
The horizontal axis shows the time of failure as '0' hours, with Hours of Lost Transactions (Recovery Point Objective) to the left (negative) and Hours Required to Resume Business (Recovery Time Objective) in the positive range. Recall that RPO and RTO are planning objectives, and are noted in Figure 3 to show their corresponding term - and measured experience - for each recovery technique.

2.1 FIGURE 3 DEFINITIONS
2.1.1 RECOVERY POINT OBJECTIVE/RECOVERY TIME OBJECTIVE COMPONENTS
- Transactions - Transactions Not Captured: Those transactions permanently lost (in the production system's electronic format); i.e., the average interval of data not migrated off site at the time of the disaster.
- Declaration - Decision-making process, and the mechanics of getting the failover site and/or vendor notified. Includes determining whether to fail over and sometimes also where to do so.
- Data - Data Retrieval: Getting backup and logs out of storage and ready to ship to the recovery site. Most frequently in Traditional Recovery, the location at which backups and logs are stored is not the restore and recovery location.
- Transit - Transit Time: Time to move backup and logs from off-site storage vendor or off-line storage to the recovery site and system.
- System - System Restore: A cold restore. The OS and updates are loaded.
- Sys and Net Boot - Systems and network boot time.
- Database - Database Restore: Database is installed and booted. Tape backups and logs are loaded onto disk.
- Transaction - Transaction Recreation: Recovery in the broadest sense. Can include manual transactions, re-transmitting point of sale systems data, EDI and of course normal log-based transaction recovery.
2.1.2 RECOVERY APPROACHES
The typical procedures and technologies associated with each of the measured outcomes in Figure 3 follow. Procedures for individual sites vary.
- Traditional Recovery - Nightly backups are performed. Courier services pick up backups from each production site daily, i.e., the process virtually all companies have in place. Tapes usually are not kept in the location at which they will need to be applied.
- Electronic Vaulting - Bulk Data Transfer: Complete backups are physically shipped off site once weekly. Logs are batched electronically several times daily, and then loaded into a tape library located at the same facility as the planned recovery.
- Declaration Time is cut in half, to about two hours. Less time is spent deliberating because a clear contingency plan is available. In a traditional situation, the decision to declare carries more weight: management knows that once they commit, recovery will take roughly 72 hours, and returning to the home site is difficult at best.
- In general, as customers move to designs with better Recovery Time Objectives, the decision process becomes more automated and fewer criteria are needed to make the call. What was once a 72-hour recovery penalty is now much less.
- Data Retrieval and Transit Time are eliminated, as the local tape library retrieval is not on the critical path.
- System Restore time is eliminated because a pool of warm systems is available in a large recovery data center.
- Database Restore does not decrease, as the pool systems do not have database software installed.
- Transaction Recreation is reduced from days to a few hours because there is far less to recreate via manual methods or electronic methods such as EDI re-transmission and batch reprocessing (which requires processing before the data can be applied). Instead, database recovery using archivelogs comprises a much larger portion of the recreation process.
- Transactions Not Captured are reduced from about 24 hours to about two hours, a reflection of the frequent log transmissions written into the remote vault.
- Transaction Protection - Automated Remote Journaling of Redo Logs. This is the same as Electronic Vaulting, except that instead of transmitting several transaction batches daily, the archivelogs are shipped as they are created. At the remote recovery center, full backups still are (physically) re-shipped weekly to minimize recovery time. As a result:
- Transactions Not Captured are reduced to about 25 minutes from about 2.5 hours (vs. Electronic Vaulting).
- Some data loss upon disaster remains because the online redo logs are lost. This is discussed more fully in the next section.
- Transaction recovery is roughly ten times shorter than Electronic Vaulting as a function of the similar reduction in Transactions Not Captured. This leaves little to be recreated by methods other than archivelog apply (e.g., EDI retransmissions).
Oracle8i Standby Database automated log shipping can be used to send the archivelogs as they are created. In pre-Oracle8i, archivelog transmission is via user-created scripts. Ongoing recovery (log apply) on the standby is not a part of this process.
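For pre-Oracle8i sites, the shipping step is typically a small site-written job scheduled on the primary host. The sketch below is one hypothetical shape such a script might take; the directory names, the use of scp, and the five-minute polling interval are assumptions rather than part of any Oracle product:

```python
import subprocess
import time
from pathlib import Path

ARCHIVE_DIR = Path("/u01/oradata/PROD/arch")      # hypothetical archived redo log destination
SHIPPED_DIR = Path("/u01/oradata/PROD/shipped")   # local markers recording what has been sent
REMOTE = "oracle@standby:/u01/oradata/STBY/arch"  # hypothetical standby host and directory

def ship_new_archivelogs():
    """Copy every archived redo log that has not yet been shipped to the standby site."""
    SHIPPED_DIR.mkdir(parents=True, exist_ok=True)
    for log in sorted(ARCHIVE_DIR.glob("*.arc")):
        marker = SHIPPED_DIR / log.name
        if marker.exists():
            continue                                    # already shipped
        subprocess.run(["scp", str(log), REMOTE], check=True)
        marker.touch()                                  # record success only after the copy completes

if __name__ == "__main__":
    while True:
        ship_new_archivelogs()
        time.sleep(300)   # the polling interval largely determines Transactions Not Captured
```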
- Standby Database - Continuous Application of Redo Logs: Same as Transaction Protection, except that logs are applied on a dedicated standby system. All recovery component times are roughly the same as for Transaction Protection, except for:
- Declaration: Declaration tends to be significantly shortened compared to Transaction Protection - Automated Remote Journaling of Redo Logs because with less transaction loss, there is less risk - and cost - in doing so.
In general, as customers move to designs with better Recovery Time Objectives, the decision process gets more automated and the decision criteria get simpler. Comdisco has also seen that customers who start small with their recovery solutions (Electronic Vaulting) are still inexperienced at relying on them. As they move to more advanced designs, they gain experience, practice, and confidence. Some customers with more advanced designs actually increase their frequency of declarations. In July of 1999, when the Chicago markets went down with power problems, one Comdisco customer declared a disaster as a pre-emptive strike because they were adjacent to the affected area. When things settled down, they came home. While they did not actually use their DR system, they were ready to do so; in July they had a confidence in their DR system that they did not have a few years earlier.
As DRCs (Disaster Recovery Coordinators) and IT staffs gain experience, they tend to work on shortening the components of the recovery time; decision-making in particular tends to be streamlined.
In some instances, the declaration process has moved from senior executives down to senior line managers over time.
- Database Restore: Database Restore time is eliminated because the database is already loaded and continuously applying logs on the standby prior to the failure.
- System and Network Boot: Boot time stays about the same (vs. Transaction Protection), as the standby still must be reconfigured to become the primary.
- Transaction loss is the same as for Transaction Protection because only archivelogs are shipped. Some data loss upon disaster remains because the online redo logs are lost.
- Hot System - Using Oracle Advanced Replication: In terms of the recovery time components, this is the same as Standby Database - Continuous Application of Redo Logs, except that system and network booting is virtually eliminated. As a result:
- System and network booting and reconfiguration time are reduced to nearly nothing compared to the Standby Continuous Log Apply method.
Two systems are kept online in different locations, though Advanced Replication can be configured for more than two systems for other purposes. The notion of primary and backup sites can be immaterial because Oracle Advanced Replication allows updates to both systems. Copies of the database are maintained at both locations and transactions exchanged. Conflict detection and rules-based resolution keep the replicated database consistent. When updates are permitted on both secondary and primary systems, some data is lost as a result of a catastrophic outage. This is the configuration used in Figure 3.
Alternatively, to configure for zero data loss, updates can be performed on a designated primary, with unidirectional and synchronous transaction updating to the backup site. The backup system's resources are used for other purposes, including reporting against the production database. Section 3.3 ('Advanced Replication for Disaster Recovery') discusses when Oracle Advanced Replication is selected for disaster recovery.
2.1.3 BUSINESS MATURITY, LEGACY METHODS, AND CHANGED COST TRADE-OFFS
Some of the differences amongst the methods discussed above stem from business maturity differences between the firms likely to pick each of the different technologies. This effect is discussed above for Declaration Time.
Another trend is driven both by maturity and by the fact that the economics of Electronic Vaulting and Transaction Protection (older, 'legacy' techniques) have changed over the years. For both, the cost of handling tapes pushes administrative costs higher compared with Standby Database - Continuous Application of Redo Logs and Hot System - Advanced Replication. Change control and staffing demands not only make cataloging tapes (versus applying them immediately) more expensive; the process also has more manual content and is therefore a possible source of failure. Companies implementing these methods often do not realize they are carrying relatively high administrative costs until they implement another approach. Electronic Vaulting is used by some of Comdisco's larger customers with large data volumes and considerable organizational maturity. Declining IT hardware costs along with rising labor costs are driving smaller and newer customers to bypass this scheme.
2.1.4 CONCLUSIONS
Oracle Standby Database performing Continuous Application of Redo is the disaster recovery solution most frequently used for Oracle mission critical applications. Figure 3 shows one reason for this. Oracle Advanced Replication implementing Hot Standby provides slightly better failover time than Oracle Standby Database with Continuous Log Apply because it eliminates reboot and reconfiguration time. These are the only two methods that provide both very low failover times and very little lost data. Of the two, Oracle Standby Database is chosen most frequently because it is easier to implement and does not require application program modification.
3. DISASTER RECOVERY SOLUTIONS DESCRIBED AND CONTRASTED
3.1 STANDBY DATABASE DESCRIBED
Figure 4: Oracle Standby Database
To create a Standby Database, a full backup of the primary (production) database is made and restored on the standby. After it has been configured as a Standby system, the Standby Database is started up in nomount mode.
All subsequent archived redo logs generated by the primary system are shipped to the standby system. Some customers do not apply these logs at the Standby site. Instead, they use Standby Database as described above under Transaction Protection: Automated Remote Journaling of Redo Logs. However, in the typical case, the Standby Database is maintained in a perpetual recovery mode; logs are applied as they are received.
Should the original database fail, the standby database can be configured for production and opened.
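Activation is itself a scripted, rehearsed step in a mature plan. The fragment below is only a sketch of how a script might drive the standby's activation; the exact statements and administration tool (Server Manager vs. SQL*Plus) vary by Oracle release, so the sequence shown here is an assumption to be checked against the documentation for the version in use:

```python
import subprocess

# Illustrative activation sequence for a standby kept in recovery mode.
# The statements follow the general pattern (stop recovery, activate, restart);
# verify the exact syntax for your Oracle release before relying on this.
ACTIVATION_STEPS = """\
connect internal
alter database recover managed standby database cancel;
alter database activate standby database;
shutdown immediate
startup
exit
"""

def activate_standby():
    """Feed the activation steps to the Oracle administration tool on the standby host."""
    subprocess.run(["svrmgrl"], input=ACTIVATION_STEPS, text=True, check=True)

if __name__ == "__main__":
    activate_standby()
```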
The transactions lost upon failure include:
- All online redo logs not archived
- Possibly an archived redo log transmitted at the time of the disaster. Partial archivelogs can not be recovered.

3.1.1 STANDBY DATABASE ADVANTAGES FOR ALL RELEASES
For all current Oracle releases (7.3.X, 8.X, 8i), the advantages of using Standby Database include:
- Easy to understand and implement compared to other advanced methods. Because it uses standard Oracle recovery, all DBAs who understand recovery already have much of the requisite knowledge to implement Standby.
- Requires the least communication bandwidth compared to other solutions because only log shipping is required after setup.
- Completely application-transparent. All Oracle databases can be backed up and recovered, so all can use Standby technology.
- The performance impact on the production systems is zero.
- Very large systems can use Standby Database. The shipping and application of logs via ongoing recovery is almost never slower than the primary's log production, except when data communications and/or standby system bandwidth is insufficient.
3.1.2 STANDBY DATABASE ISSUES FOR ALL RELEASES
When implementing Oracle Standby Database, best practices should be created with the awareness of the following issues:
- Once a standby database is moved from nomount to open read/write, it cannot be put back in standby mode. This is because the Standby ceases to be a block-for-block replica of the primary, which recovery requires. To recreate the standby, a full backup of the primary must once again be taken and restored on the Standby before log apply can resume.
- If failover to the Standby occurs as a result of a disaster, failback to the original primary is in fact database recreation. To move processing back to the original primary after a disaster, the new primary (old Standby) must be completely backed up, and then the original primary system rebuilt, just as in the creation of any new Standby system.
- NOTE: There is one way to 'fail over' and 'fail back' using Oracle Standby Database. Paradoxically, this works only when no failure has occurred. If the primary can be shut down in a consistent state, and all logs applied to the Standby, the primary and Standby can switch roles. One use of this is to reduce planned downtime due to hardware upgrades. This is the most difficult Standby procedure, to be performed only by well-versed DBAs. It is detailed in 'Graceful Switch Over & Switch Back using Oracle Standby Databases', a white paper by Oracle's Lawrence To.
- Any logs that have not been applied before the standby database is activated cannot be applied afterwards.
- The Standby system must be binary compatible with the primary system in order to read the redo logs. The same software versions for both the operating system and the Oracle database must be used, and on the same hardware architecture.
3.1.3 STANDBY DATABASE ENHANCEMENTS IN ORACLE8I
The Oracle Standby Database features are nearly identical in Oracle 7.3.X and 8.0.X. Major enhancements were made in Oracle8i, including:
- Read-Only Mode: Allows Standby Database to be opened read-only for queries.
- This mode is used primarily for data validation, since logs can no longer be applied once the Standby has been opened read/write.
- Managed Standby Mode: Allows the Oracle software to ship, manage and apply archived logs automatically, greatly reducing the possibility of error over manual methods.
- Multiple archive log destinations: Up to four remote destinations and a total of five remote or local destinations can be specified. Each of these can be optional or mandatory. If a destination is mandatory, the (synchronous) log transmission must succeed in order to process transactions on the primary. The alternate destinations feature allows archive logs to be both automatically stored locally, as well as at multiple remote sites.
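To make the multiple-destination feature concrete, the sketch below renders a hypothetical set of destinations as init.ora-style parameter lines. The service names and paths are invented, and the exact attribute keywords should be confirmed against the Oracle8i documentation:

```python
# Hypothetical destination list: one local archive copy plus two remote standby sites.
destinations = [
    ("LOCATION=/u01/oradata/PROD/arch", "MANDATORY"),   # local copy of each archived log
    ("SERVICE=standby_east",            "MANDATORY"),   # primary waits for this transmission
    ("SERVICE=standby_west",            "OPTIONAL"),    # best-effort secondary site
]

# Render init.ora-style lines, one LOG_ARCHIVE_DEST_n entry per destination.
for n, (target, policy) in enumerate(destinations, start=1):
    print(f"log_archive_dest_{n} = '{target} {policy}'")
```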
3.2 STANDBY DATABASE WITH THIRD PARTY SOLUTIONS TO ACHIEVE ZERO DATA LOSS
If the online redo logs are mirrored to a remote site whilst also using Standby Database, data loss can be eliminated. After the last complete log sent by the primary is applied, the Standby is opened using a control file created on the Standby. The online redo logs created from mirroring the primary are then recognized when the database is opened.
3.2.1 GEOGRAPHIC DISK MIRRORING
Figure 5: Geographic Disk Mirroring
Geographic Disk Mirroring takes a set of physically disparate disks and synchronously mirrors them over a high performance communications line. Any write to a disk on one side will result in a write to the other. The local write will not return until the acknowledgement of the remote write is successful. Thus there is a requirement for high performance communications to provide both high throughput and low latency.
To achieve the necessary communications bandwidth ESCON channels over fiber are used for short distances (<60 km or 45 miles) and one or several T3 lines are used for greater distances; the telecommunications costs alone can be higher than the amortized cost of the Standby system. However, these costs are dramatically lower when used to supplement Oracle Standby Database, i.e., when solely log file changes are mirrored.
There are a number of storage solutions currently available; EMC's Symmetrix Remote Data Facility (SRDF) is but one example. However, only EMC's SRDF has been submitted to - and validated as compatible by - the Oracle Storage Compatibility Program (OSCP). SRDF is required to mirror the online redo logs.
In addition to the remote mirroring software (SRDF), archived redo log transmission also requires a mirrored disk on the standby side to enable the application of the logs on the Standby. EMC's product that performs this function is TimeFinder. Periodically the mirror is broken and the logs applied to the Standby, but only after writes have been redirected to another device. Care must be taken when breaking the mirror to ensure that all archived redo logs are left complete, or whole, as fractional archived redo logs can not be applied.
Alternatively, in Oracle8i, Standby Database's automated log shipping can be used to send and then apply the archived redo log stream. If this approach is chosen as part of a zero data loss approach, only the online redo log must be mirrored using SRDF or other means.

3.2.2 SOFTWARE DATA MIRRORING
The use of software for data mirroring presents an alternative to SRDF's hardware-based mirroring. While none of the software solutions has been validated by the Oracle Storage Compatibility Program (OSCP), they remain a potential alternative to SRDF.
Software data mirroring products could be utilized to mirror both the online and archived redo logs at remote sites. These products generate changes to the remote device by mirroring at the I/O device driver level. As disk write activity is identified at the source (production) site, the write is transported across a TCP/IP path to a server at a remote site. The server at the remote site then updates the remote disk with the I/O. While the software products are not synchronous between sites, they guarantee the order of disk writes and hence offer the potential of a reliable mirror mechanism.
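The essential property described above, asynchronous transmission that nevertheless preserves the order of writes, can be illustrated with a toy sketch. Real products implement this at the I/O device driver level; the queue-and-single-connection structure below simply shows why ordering is preserved:

```python
import queue
import socket
import struct

# A toy write-order-preserving mirror. Each intercepted write is queued in the order
# it occurred and replayed over a single TCP connection in that same order, so the
# remote copy lags the source but never sees writes out of sequence.

write_queue = queue.Queue()   # FIFO: preserves the order in which the driver saw the writes

def intercept_write(offset, data):
    """Called for each local disk write, in order; queues it for the remote site."""
    write_queue.put((offset, data))

def shipper(remote_host, remote_port):
    """Drain the queue over one connection, sending writes to the remote site in FIFO order."""
    with socket.create_connection((remote_host, remote_port)) as sock:
        while True:
            offset, data = write_queue.get()
            header = struct.pack("!QI", offset, len(data))   # byte offset plus length prefix
            sock.sendall(header + data)
```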
Oracle's policy on any device or software product used for remote mirroring is that it must be validated by Oracle as a prerequisite to full support of Oracle software. Oracle software used in these solutions will be fully supported only if the product is validated through the OSCP, in which Oracle developers review the product architecture and Oracle-designed tests are run to ensure that recoverability, and therefore data integrity, is maintained.
Oracle, Comdisco, and other vendors are committed to the OSCP program and will participate in the validation process before implementing solutions based on these products.
3.2.3 COMPARING GEOGRAPHIC DISK MIRRORING AND SOFTWARE DATA MIRRORING
The following is based on a recent Comdisco needs assessment for a significant SAP R/3 implementation:
- Recovery Time Objective = 8 hours
- Recovery Point Objective < 15 minutes
- Communication Lines Between Production Site and DR Site: 2 T1's, load balanced and compressed.
- Communications Line Backup Strategy: Diverse carriers, ISDN PRI
- Miles Between Production and DR Sites: 800
- Oracle Database Version: 7.3
- Database size: 800 GB
- Redo interval: 5 minutes
- Average redo archive size: 55 MB
- Redo Log Transport Mechanism: Home-grown FTP program
- Approximate Secondary Site System Cost: $900,000. Includes a customer-managed recovery solution, including HP K460 (used), EMC storage (mirrored with BCV's), and communications equipment.
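A quick feasibility check using the figures above shows why two load-balanced T1s can support the sub-15-minute Recovery Point Objective. The nominal T1 rate of 1.544 Mbit/s is a standard figure; the 70% usable-throughput factor is an illustrative assumption:

```python
# Figures from the assessment above.
archive_mb = 55          # average archived redo log size, in MB
interval_s = 5 * 60      # a new archived log roughly every 5 minutes

# Line capacity: two T1s; the usable fraction is an illustrative assumption.
t1_mbit_s = 1.544
lines = 2
usable_fraction = 0.70

required_mbit_s = archive_mb * 8 / interval_s            # sustained rate needed (~1.47 Mbit/s)
available_mbit_s = t1_mbit_s * lines * usable_fraction   # ~2.16 Mbit/s, before compression

print(f"Required : {required_mbit_s:.2f} Mbit/s")
print(f"Available: {available_mbit_s:.2f} Mbit/s (before compression)")
print("Feasible" if available_mbit_s > required_mbit_s else "Insufficient")
```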
Figure 6: Geographic Disk Mirroring Compared to Software Data Mirroring
3.3 ADVANCED REPLICATION FOR DISASTER RECOVERY
Oracle Advanced Replication is briefly described in Hot System - Using Oracle Advanced Replication, in section 2.1.2 ('Recovery Approaches').
3.3.1 WHEN ORACLE ADVANCED REPLICATION IS SELECTED FOR DISASTER RECOVERY
1. A no data loss, or nearly zero data loss solution is required, and a single-vendor software solution is preferred.
2. A secondary system that can be opened for both reporting and transactions is desired. Advanced Replication's transactional support for DR at the secondary system is limited, however, when configured for zero data loss. In this case, Advanced Replication is configured with a unidirectional and synchronous flow from the primary to the standby to ensure point-in-time data consistency. Any transactional input on the standby must therefore impact solely non-critical data. Alternatively, Oracle Advanced Replication can be used in other configurations, provided some data loss is acceptable. This approach provides support for both reporting and transactions on the secondary system.
3. Unlike Standby Database, a precise match of system architecture, operating system and Oracle software versions is not required.
4. Failover and failback are relatively easy compared to Oracle Standby Database. Not only is failback easier after failing over during a disaster, but software upgrades are also facilitated by the version heterogeneity permitted above.
3.3.2 FEATURE/FUNCTION COMPARISON: ORACLE ADVANCED REPLICATION AND STANDBY DATABASE
4. CONCLUSION
A wide range of disaster recovery solutions can be assembled using different technologies and implementation approaches. In order to determine the correct approach, the business risk of down time must be estimated. Only the line business management that is served by the application can provide an accurate and credible assessment. Oftentimes the biggest down time costs, like lost revenue, are not directly measurable.
The active participation of line management throughout the disaster planning cycle is vital for successful implementation. The 'brick and mortar' failover must work in combination with the IT pieces. Every part of the disaster plan must be tested periodically. Personnel turnover is inevitable, so all portions of the plan must be documented and available off-site. A documented plan can become outdated in ways that are not discovered until someone tries to use it again.
A zero data loss solution can be assembled today using Oracle technology coupled with third-party offerings. Oracle Standby Database is most frequently used in these configurations. In fact, across all Oracle sites, Oracle Standby Database is the most frequently used recovery technology apart from traditional recovery methods.
David Edborg is a senior architect with over 20 years of experience in the IT industry. He is currently a consultant in Comdisco's Advanced Technology Consulting Practice, specializing in database recoveries for open systems and in advanced recovery solutions for clients with recovery time objectives of less than twenty-four hours.
Mark J. Smith of Oracle Corporation is a Product Manager for High Availability & Storage Management for the Oracle database. His 20 years of product management and implementation experience includes applications, large systems, database products, and systems management. He may be reached at marsmith@us.oracle.com.
*The remainder of this article will be published in the Summer 2000 issue of the Disaster Recovery Journal.
Certain portions of Oracle Corporation copyrighted materials have been reproduced herein with the permission of Oracle Corporation.




