Any system providing HA services should provide continuous availability of data in all three scenarios.
While solutions exist to provide tolerance to component failure, the issue of site loss is often overlooked, with potentially dire consequences due to business interruption and loss of information. While off-site tape dumps of databases have traditionally satisfied the disaster recovery requirements of batch systems, they are typically inadequate for protecting the information in On-line Transaction Processing (OLTP) systems and e-business.
Asynchronous replication facilities can provide continuous duplication of critical OLTP and e-business application information to off-site backup facilities without the high latency inherent in tape backup strategies. Once established, such an environment can be automated to ensure that information is replicated in a timely manner and that the switch to backup systems is accomplished with minimal business interruption.
High Availability Solutions
Hardware redundancy is often thought of as the first line of defense in continuous systems. Cluster architecture provides failover to a backup server without losing any committed data or severing user connections. Clusters provide the quickest recovery after a hardware or database failure. If one database fails the other can take over immediately.
Hardware redundancy, or hardware mirroring, is a hardware solution for duplicating data on a disk to another disk. It comes in two flavors, RAID (redundant array of inexpensive disks) and disk mirroring.
Both of these solutions protect against disk media failure as long as the redundant storage contains a valid copy of the data. However, hardware redundancy cannot protect against more subtle failures that can cause corrupted data to be written to both the primary and the redundant disk. In addition, clusters and disk mirroring are not suited to a wide area network, and are usually located in the same data center. This works well for a hardware or database server failure, but does not work if the data center is lost temporarily in rolling electrical blackouts, or longer term in a disaster such as a flood, tornado, hurricane or fire. So the cluster technology must be combined with another technology that allows geographically dispersed data centers.
Replication software can provide an automatic server failover solution that reaches across both LAN and WAN, providing geographic replication of data. It can be used in combination with cluster technology to provide a more complete recovery solution. Replication provides high availability and disaster recovery services, affording greater protection against site failures through asynchronous, wide-area delivery of database transactions. Replication should also address the problem of potentially corrupted data by providing transaction integrity.
Replication is usually combined with redundant hardware. A replication solution should be able to provide a “warm standby” database that is operational; that is, it is online and available for immediate switchover. It should replicate any schema changes automatically to reduce the administrative overhead of managing a standby database, and it should be simple to configure.
The replication software should also be able to switch back to the primary database, without loss of data, once the primary database is available again.
Hardware and software solutions exist today to support high availability in a Local Area Network environment. Disk mirroring, RAID technology and high-availability cluster technologies allow multiple disks and CPUs to share resources and provide automatic fail-over in the event of a disk or CPU failure. These synchronous solutions generally have severe distance limitations that preclude separating the hardware components adequately to mitigate risks associated with geographic proximity.
Disk solutions are now emerging that attempt to resolve the site protection issues by providing asynchronous mirroring that extends the ability to physically separate devices for better site protection. While synchronous approaches satisfy the requirement for committed transactions to be preserved upon a disk or CPU failure, they do present various limitations. Asynchronous mirroring, while extending protection from many site-related failures, has additional risks associated with data integrity and consistency.
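The consistency risk of asynchronous mirroring can be illustrated with a minimal Python sketch. It assumes a deliberately simplified, hypothetical model: one logical transaction modifies two disk blocks, and the mirror ships dirty blocks one at a time. The names `primary`, `mirror`, `transfer` and `ship_blocks` are illustrative only and do not correspond to any product.

```python
# Hypothetical model: each "block" holds one account balance.
primary = {"block_a": 100, "block_b": 0}
mirror = dict(primary)  # the mirror starts in sync with the primary


def transfer(disk, amount):
    """One logical transaction that modifies two blocks."""
    disk["block_a"] -= amount
    disk["block_b"] += amount
    return ["block_a", "block_b"]  # blocks dirtied by the transaction


def ship_blocks(src, dst, dirty, fail_after=None):
    """Copy dirty blocks one by one; stop early to simulate site loss."""
    for i, name in enumerate(dirty):
        if fail_after is not None and i >= fail_after:
            break  # link or site lost in the middle of the transfer
        dst[name] = src[name]


dirty = transfer(primary, 40)
ship_blocks(primary, mirror, dirty, fail_after=1)  # only the first block arrives

primary_total = sum(primary.values())
mirror_total = sum(mirror.values())
print(primary_total, mirror_total)  # prints: 100 60
```

In this sketch the mirror received the debit but not the matching credit, so it reflects half of a transaction. Block-level copying has no notion of transaction boundaries, which is the kind of integrity and consistency risk described above.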
Software high availability solutions are typically characterized by the ability to physically separate hardware to provide protection against site loss. For example, a simple implementation of cold standby could be accomplished by periodically restoring backup databases and transaction dumps to the standby site from the active site. But there would be considerable latency of recovery using this method.
Applications could also be written to redundantly write to two systems (synchronous updates), using two-phase commit (2PC) protocols to guarantee both systems contain committed transactions. The risk of this type of solution is the impact on operations if there is a failure in a participating system, as 2PC operations cannot complete. Generally, 2PC applications are only appropriate for very high-value transactions where availability is not the primary objective.
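The availability trade-off of 2PC can be sketched in a few lines of Python. This is a simplified illustration, not a complete protocol: a real coordinator also logs its decisions and handles timeouts and recovery. The `Participant` class and `two_phase_commit` function are hypothetical names introduced for this example.

```python
class Participant:
    """Hypothetical stand-in for one of the two database systems."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.committed = []
        self.pending = None

    def prepare(self, txn):
        # Phase 1: vote yes only if this system is reachable.
        if not self.available:
            return False
        self.pending = txn
        return True

    def commit(self):
        # Phase 2: make the prepared transaction permanent.
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        # Discard any prepared but uncommitted work.
        self.pending = None


def two_phase_commit(txn, participants):
    """Return True only if every participant prepared and committed."""
    if all(p.prepare(txn) for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


primary = Participant("primary")
standby = Participant("standby")

ok_both_up = two_phase_commit("T1", [primary, standby])
standby.available = False  # simulate a failure of the second system
ok_standby_down = two_phase_commit("T2", [primary, standby])

print(ok_both_up, ok_standby_down, primary.committed)  # True False ['T1']
```

Once the second system is unavailable, the healthy primary can no longer commit either, which is why this approach trades availability for guaranteed consistency.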
Replication software can be used for copying data from the primary database to one or more secondary or standby databases. The databases can be in the same data center or they can be across the world. Transactions are replicated continuously or on a scheduled basis. Warm standby replication provides a standby database that is available to take over if the primary database fails or is taken offline for any reason.
There is some latency associated with moving data to the replicated database, depending on the volume of transactions replicated and the network bandwidth. However, this typically provides for a much faster database recovery than restoring a database from transaction dumps. Warm standby replication is intended to offer the protection afforded by site redundancy, without the constraints of synchronous updates or the time delays of batch-oriented backup methods. Because delivery is asynchronous and reliable, applications are not affected by the operation of the warm standby replication software or by the availability of the standby system. A replication solution provides the geographic coverage needed for disaster recovery, since it works across both LAN and WAN.
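The behavior described above can be modeled with a minimal Python sketch: commits complete at the primary immediately, and a queue holds the transactions that have not yet reached the standby. The `WarmStandbyPair` class and its methods are hypothetical and greatly simplified; in particular, the `max_txns` argument is only a stand-in for whatever the network bandwidth allows in a given interval.

```python
from collections import deque


class WarmStandbyPair:
    """Toy model of asynchronous, transaction-level replication."""

    def __init__(self):
        self.primary = []     # transactions committed at the active site
        self.standby = []     # transactions applied at the standby site
        self.queue = deque()  # committed but not yet delivered

    def commit(self, txn):
        # The commit returns as soon as the primary has the transaction;
        # delivery to the standby happens later.
        self.primary.append(txn)
        self.queue.append(txn)

    def deliver(self, max_txns):
        # Apply up to max_txns queued transactions, in commit order.
        for _ in range(min(max_txns, len(self.queue))):
            self.standby.append(self.queue.popleft())

    def lag(self):
        # Number of committed transactions the standby has not yet applied.
        return len(self.queue)


pair = WarmStandbyPair()
for n in range(1, 6):
    pair.commit(f"T{n}")

pair.deliver(max_txns=3)
print(pair.standby, pair.lag())  # ['T1', 'T2', 'T3'] 2
```

The standby always holds a prefix of the primary's transactions in commit order, and the lag is the amount of work it would still need in order to catch up at the moment of a failure.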
Replication will provide automatic failover of the database server, but client failover still needs to be dealt with. This can be done manually, which usually means client applications have to restart to connect to the new server, extending the latency of the failover. To shorten the failover time, a “switching” mechanism should be implemented that automates the client failover to the new server. This should be transparent to the client, who will see only a momentary delay in their transaction processing. Replication and switching should be combined to provide low-latency, seamless failover of both the server and the client. This also provides the geographic failover that is so important in disaster recovery.
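One way to picture such a switching mechanism is a client-side wrapper that redirects work to the next server when the current one fails. The Python sketch below is an assumption-laden illustration: `Server`, `ServerDown` and `SwitchingConnection` are hypothetical names, and a real mechanism would also have to deal with in-flight transactions and with statements that are unsafe to retry.

```python
class ServerDown(Exception):
    """Raised when a (simulated) database server cannot be reached."""


class Server:
    """Hypothetical database server that can be marked up or down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def execute(self, statement):
        if not self.up:
            raise ServerDown(self.name)
        return f"{self.name}: {statement}"


class SwitchingConnection:
    """Client-side wrapper that moves to the next server on failure."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.current = 0  # index of the server currently in use

    def execute(self, statement):
        # Try each server at most once, starting with the current one.
        for _ in range(len(self.servers)):
            server = self.servers[self.current]
            try:
                return server.execute(statement)
            except ServerDown:
                self.current = (self.current + 1) % len(self.servers)
        raise ServerDown("no server available")


primary = Server("primary")
standby = Server("standby")
conn = SwitchingConnection([primary, standby])

before = conn.execute("select 1")
primary.up = False  # simulate loss of the primary site
after = conn.execute("select 1")

print(before)  # primary: select 1
print(after)   # standby: select 1
```

The client code calls `conn.execute` in the same way before and after the failure; the only visible effect is the extra attempt needed to reach the standby.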
Warm Standby Replication solutions provide an added benefit in the form of an always-available standby database. The hardware and the database at the secondary site can be leveraged for decision support or read-only operations. By providing multiple usages of standby systems, replication provides a cost-effective alternative to redundant hardware that is only available for recovery operations.
Replication is a significant tool in providing distributed high availability services, affording greater protection against site failures through asynchronous, wide-area delivery of database transactions. Based on the needs of the system, the replication system can be augmented with switching mechanisms for continuous availability.
As a complement to traditional hardware high availability solutions, a replication system extends support for disaster recovery requirements beyond component failure recoverability. This system can distribute data services geographically, maintaining service in the event of a regional failure or disaster.
Benefits And Limitations Of The Solutions
Cluster Benefits
• Applications generally do not require awareness of physical resource changes, such as network addresses.
• Active-Active cluster HA solutions allow quick and seamless failover. Failover of data and applications is usually automatic and happens within a few minutes.
• Since clusters use the same physical copy of the data, there is no loss of data/transactions due to failover (zero-latency).
Cluster Limitations
• No protection against loss of an entire facility (electricity, network, building, etc.).
• The solution is tightly coupled with the hardware platform. Managing planned downtime becomes an issue – routine maintenance operations like index rebuilds or planned operations like software/machine upgrades still require downtime.
• Since only one copy of the data exists, during routine operations, other nodes of the cluster are under-utilized – low hardware utilization.
Disk Mirroring Benefits
• Asynchronous disk mirroring can provide better physical protection by supporting extended physical distances.
• No loss of committed transactions in synchronous storage (mirroring/RAID) on a CPU failure.
Disk Mirroring Limitations
• No protection from data corruption introduced by the hardware/software.
• Secondary site is not guaranteed to be transactionally consistent, because data is moved at the disk/track/sector or bit level (in the case of asynchronous mirroring).
• Client applications must be restarted after a failure and need to be aware of the failure.
• Synchronous mirroring and RAID devices can add overhead to application performance.
• Redundant/specialized high availability hardware/software can be expensive and restricted to use for backup purposes only.
• Secondary copy of data is not available for use – low hardware utilization.
• Need to replicate everything on disk, no selectivity of data replication.
Replication Benefits
• Warm standby systems can be configured over a Wide Area Network, providing protection from site failures.
• Ability to more quickly swap to the standby system in the event of failure, as backup database is already on-line.
• Data corruption is typically not replicated as transactions are logically reproduced rather than I/O blocks mirrored.
• Automatic switch over for clients using a switching mechanism, no client restart needed.
• Originating applications are minimally impacted as replication takes place asynchronously after commit of the originating transaction.
• The warm standby database is available for read-only operations, allowing better utilization of backup systems.
• Ability to resynchronize and easily switch back to primary system when it becomes available without loss of data.
Replication Limitations
• Warm standby system will be out-of-date by transactions committed at the active database that have not been applied to the standby.
• Protection is limited to components supporting Warm Standby (e.g. DBMS data sources may be protected but file systems may not be supported).
Naveen Puttagunta is a product manager at Sybase and focuses on Sybase’s database & replication product line and high availability solutions. He has worked at Sybase for more than six years in various positions in development and product management.