Using Geographical Clustering to Build Disaster Tolerant Computing Environments
- Published on October 26, 2007
Geographic clustering is not a radical new technology, but rather the logical extension of a proven technology, highly available clustering, that has, in the past several years, gained widespread acceptance in the open systems commercial marketplace. This marketplace acceptance has been driven by clustering's proven high availability, high performance, flexibility, scalability, and reasonable cost. Virtually all leading open systems vendors, including IBM, HP, Sun, DEC, and Data General, offer a highly available clustering product. A highly available cluster is a group of independent processors networked together, sharing critical resources, that cooperate to provide application services to clients. A cluster manager agent, which runs on each processor, is the central control mechanism for providing high availability. Typically, the cluster manager process monitors the state of local hardware and software components, including the availability of other processors in the cluster.
When a change in the state of a monitored component occurs, the cluster manager process runs a predetermined event to compensate for the failure. Most advanced highly available solutions provide customized events that allow end users to tailor high availability to their own environment. Highly available clustering prevents individual components, including processors, networks, and disks, from being 'single points of failure' within a cluster.
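The monitor-then-run-an-event behavior described above can be pictured with a small sketch. This is only an illustration, not any vendor's implementation: the class `ClusterManager`, its methods, the adapter name `en0`, and the handler are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical key for a state change: (component name, new state).
StateChange = Tuple[str, str]


class ClusterManager:
    """Toy cluster manager: maps state changes to predetermined events."""

    def __init__(self) -> None:
        self.handlers: Dict[StateChange, Callable[[str], str]] = {}
        self.state: Dict[str, str] = {}
        self.log: List[str] = []

    def on(self, component: str, new_state: str,
           handler: Callable[[str], str]) -> None:
        # Register a (possibly customized) event for one kind of state change.
        self.handlers[(component, new_state)] = handler

    def report(self, component: str, new_state: str) -> None:
        # Called by a monitor; an event runs only when the state changes.
        if self.state.get(component) == new_state:
            return
        self.state[component] = new_state
        handler = self.handlers.get((component, new_state))
        if handler is not None:
            self.log.append(handler(component))


def swap_to_standby_adapter(component: str) -> str:
    # Stand-in for a real recovery action.
    return f"{component}: address moved to standby adapter"


manager = ClusterManager()
manager.on("en0", "down", swap_to_standby_adapter)
manager.report("en0", "up")    # no event registered for this change
manager.report("en0", "down")  # runs the predetermined event
manager.report("en0", "down")  # repeated report, state unchanged, ignored
print(manager.log)
```

Keeping the events in a lookup table is one way to model the customized events mentioned above: an end user tailors behavior by registering a different handler, without touching the monitoring loop.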
Geographic clustering extends high availability from several systems in a single location to numerous systems spanning multiple sites spread across a wide geography. An example of a geographic cluster is shown in Figure 1.
The sample geographic cluster has two sites, separated by enough distance to prevent the same disaster from disabling both sites. The sites are connected by redundant wide area networks for both data and cluster traffic, most commonly FDDI, ATM, or T3. Each site maintains an updated copy of essential data and can run mission-critical applications. Each site is itself a highly available cluster, supplemented with additional geographic clustering software to form a distributed, or geographic, cluster.
The geographic cluster can be configured in a number of ways. For example, one site could be designated as the 'primary', the other as the 'secondary.' Transactions originate on the primary, and then are copied to the secondary. If the primary site is disabled, the secondary provides the mission-critical data and application services. Or both sites could be active at the same time, with transactions originating at either site and then updated to the other.
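As a rough sketch of the two configurations just described, the following models which sites may originate transactions in each mode. The names `Mode`, `Site`, and `writable_sites`, and the rule that the first listed site is the designated primary, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Mode(Enum):
    PRIMARY_SECONDARY = "primary-secondary"
    ACTIVE_ACTIVE = "active-active"


@dataclass
class Site:
    name: str
    up: bool = True


def writable_sites(mode: Mode, sites: List[Site]) -> List[str]:
    """Return the names of the sites where transactions may originate.

    Assumption: sites are listed in priority order, so sites[0] is the
    designated primary in primary-secondary mode.
    """
    live = [s for s in sites if s.up]
    if mode is Mode.ACTIVE_ACTIVE:
        return [s.name for s in live]
    # Primary-secondary: the highest-priority live site takes the work.
    return [live[0].name] if live else []


sites = [Site("primary-site"), Site("secondary-site")]
normal = writable_sites(Mode.PRIMARY_SECONDARY, sites)
both = writable_sites(Mode.ACTIVE_ACTIVE, sites)

sites[0].up = False  # the primary site is disabled
after_disaster = writable_sites(Mode.PRIMARY_SECONDARY, sites)
print(normal, both, after_disaster)
```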
Simply stated, highly available clustering prevents a failure in the computer room from disabling your business. Geographic clustering prevents a site failure from disabling your business.
A geographic cluster, by using redundant sites, prevents a disaster that disables a specific site from disabling an entire computing environment. To provide this protection, a geographic cluster must be able to:
- distribute data among sites (using data mirroring or data replication)
- monitor a local site, and recover from hardware and software failures at this site (using highly available clustering)
- monitor the remote sites, shut down a site that has failed, and reintegrate the site when it comes back online (using highly available geographic clustering).
Mirroring or Replicating Data Among Sites
To provide disaster tolerance, a system must be able to maintain updated copies of critical data at multiple sites. Data written at a local site is transmitted across a wide area network to one or more remote sites, so that all sites have an updated copy of the database. Then, if one site becomes inaccessible, the data can be accessed from one of the remaining sites, allowing a company to continue operating.
Two technologies exist for distributing data across a dispersed network of systems: mirroring and replication. CLAM Associates' Geographic High Availability and DEC's Business Recovery Server for Disaster Recovery use mirroring provided by the product itself. Data General's Global Availability uses the replication facilities provided by a third party such as Sybase or Oracle. IBM's High Availability Cluster Multi Processing solution also provides high availability across the geography using replication facilities provided by third parties such as Sybase or Oracle.
At the physical level, the two approaches are different: Mirroring distributes disk blocks, while replication distributes transactions. Still, conceptually the two approaches are similar, and for the purposes of our discussion we'll treat them as one. The key point to understand is that the mirroring or replication system does not simply copy data from point A to point B. The mirroring or replication system must also:
- maintain the integrity of the data, whether it is a disk block or a transaction
- track data changes made to the primary site that have not yet been duplicated to the remote sites
- deliver the data quickly and efficiently across the wide area network
- allow each site within the geographic cluster to modify the data.
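The first two requirements in the list above, integrity checking and tracking changes not yet duplicated, can be sketched as follows. This is a toy model under stated assumptions: the remote site is an in-memory dictionary, the "WAN transfer" is a plain assignment, CRC32 stands in for whatever integrity check a real product uses, and `MirroredVolume` is a hypothetical name.

```python
import zlib
from typing import Dict, Set


class MirroredVolume:
    """Toy block mirror; the remote site is just a dict (an assumption)."""

    def __init__(self) -> None:
        self.local: Dict[int, bytes] = {}
        self.remote: Dict[int, bytes] = {}
        self.pending: Set[int] = set()  # written locally, not yet mirrored
        self.link_up = True

    def write(self, block: int, data: bytes) -> None:
        self.local[block] = data
        self.pending.add(block)         # record the change before sending
        if self.link_up:
            self.flush()

    def flush(self) -> None:
        # sorted() copies the set, so discarding inside the loop is safe.
        for block in sorted(self.pending):
            data = self.local[block]
            self.remote[block] = data   # stand-in for a WAN transfer
            # Integrity check: compare checksums of what was sent and stored.
            if zlib.crc32(self.remote[block]) != zlib.crc32(data):
                raise IOError(f"block {block} corrupted in transit")
            self.pending.discard(block)  # clear only after a verified copy


vol = MirroredVolume()
vol.write(0, b"alpha")           # mirrored immediately
vol.link_up = False              # the wide area network goes away
vol.write(1, b"beta")            # tracked as pending, not mirrored
pending_during_outage = set(vol.pending)
vol.link_up = True
vol.flush()                      # only the tracked block is sent
print(pending_during_outage, vol.remote == vol.local)
```

Because only the pending set is sent when the link returns, the catch-up transfer is proportional to what changed rather than to the size of the whole volume.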
Responding to Component Failures
Your cluster is up and running. You're feeling good. Now something goes wrong. Let's take a look at what happens. First we'll look at what happens when a specific component fails. Following that, we'll look at what happens when a disaster cripples an entire site.
The failure of a specific component within a specific site within the geographic cluster is called a 'local' failure. The component could be a processor, a disk or disk adapter, or a local area network or local area network adapter. The basic facilities of highly available clustering handle local failures. Each system component that is a potential 'single point of failure' has an automatic replacement designated for it.
The cluster manager monitors the status of each component, detects when one of these components fails, and redistributes that component's workload to a backup component. In this way, the cluster is able to keep critical computing resources available. Take, for example, a network adapter. A network adapter is the primary connection between a processor and a network.
The cluster manager monitors a network adapter by sending keep alive packets over that adapter. If the network adapter fails, the cluster manager is not able to send packets through this adapter. The cluster manager then instructs a backup adapter to take over the network address of the failed adapter. Doing this allows component failures to be handled within site boundaries and remain transparent to the remote sites.
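The adapter example can be sketched in a few lines. The article does not say how many keep alive packets must fail before takeover, so the limit of three below is an assumption, as are the names `Adapter`, `check_adapter`, and `MISSED_LIMIT`, and the sample address.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MISSED_LIMIT = 3  # assumed threshold; real products make this tunable


@dataclass
class Adapter:
    name: str
    address: Optional[str] = None
    missed: int = 0


def check_adapter(primary: Adapter, standby: Adapter,
                  send_keepalive: Callable[[Adapter], bool]) -> bool:
    """Send one keep alive packet; return True if a takeover happened."""
    if send_keepalive(primary):
        primary.missed = 0
        return False
    primary.missed += 1
    if primary.missed < MISSED_LIMIT or primary.address is None:
        return False
    # The standby takes over the failed adapter's network address, so
    # clients and remote sites keep using the same address.
    standby.address, primary.address = primary.address, None
    return True


primary = Adapter("en0", "10.0.0.5")
standby = Adapter("en1")


def dead_link(adapter: Adapter) -> bool:
    return False  # simulate an adapter that can no longer send packets


results = [check_adapter(primary, standby, dead_link) for _ in range(4)]
print(results, standby.address)
```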
Responding to Disasters
You've planned for it, hoping it wouldn't happen. And then it does. It might be as catastrophic as an earthquake that devastates the west coast, or a hurricane that drenches the east coast. Then again, it might be as mundane as a water main break that washes out Main Street and your data center. It is when a disaster strikes that the 'geographic' aspect of geographic clustering becomes evident. Unlike the failure of a system component, which was contained within a specific site, a disaster requires that a viable site take over for the site that has been disabled. The viable site must be able to:
- detect the failure
- shut down the failed site to preserve data integrity
- continue to provide mission-critical data and application services
- reintegrate the failed site when it comes back online
These capabilities are what distinguish geographic clustering from remote mirroring or replication. Geographic clustering software is able to respond intelligently to the loss of a site to maintain critical data and application services. No operator intervention is required. Now, let's take a look at how this is done.
Earlier, we said that geographic clustering was an extension of highly available clustering. A geographic cluster has additional software that allows the cluster manager to extend its coverage to encompass remote sites.
The remote site becomes, conceptually, another component that has a designated backup. In highly available clustering, the cluster manager on one processor exchanges 'heartbeats' with the cluster managers on the other processors in the cluster. They are, in effect, taking each other's pulse so that they can detect a change in the health of a particular processor. In geographic clustering, the cluster managers at a site exchange heartbeats not only with the cluster managers on the other processors at the local site, but also with the cluster managers at the remote sites.
When the cluster managers at a local site cannot communicate over any of the wide area networks connecting them to a remote site, they assume the remote site has failed. The cluster managers at the viable sites then shut down the failed site and take over its resources to preserve data integrity among the different sites. The databases at the active sites continue to be updated.
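The detection rule described here, silence on every wide area network rather than on just one, can be sketched as below. The timeout value, the link names, and the class `RemoteSiteMonitor` are assumptions for illustration; times are passed in explicitly so the example does not depend on a real clock.

```python
from typing import Dict, Iterable


class RemoteSiteMonitor:
    """Tracks the last heartbeat heard from one remote site on each link."""

    def __init__(self, links: Iterable[str], timeout: float,
                 now: float = 0.0) -> None:
        self.timeout = timeout
        self.last_heard: Dict[str, float] = {link: now for link in links}

    def heartbeat(self, link: str, now: float) -> None:
        # Record that a heartbeat arrived over this link at time `now`.
        self.last_heard[link] = now

    def site_failed(self, now: float) -> bool:
        # The site is presumed failed only if ALL links have been silent
        # for longer than the timeout; one live link is enough to clear it.
        return all(now - heard > self.timeout
                   for heard in self.last_heard.values())


monitor = RemoteSiteMonitor(["wan-a", "wan-b"], timeout=10.0)
monitor.heartbeat("wan-a", now=5.0)         # wan-b has been silent since 0.0
one_link_silent = monitor.site_failed(now=12.0)
all_links_silent = monitor.site_failed(now=16.0)
print(one_link_silent, all_links_silent)
```

Requiring silence on all links is what makes the redundant networks valuable: the loss of a single wide area network is treated as a network failure, not as the loss of the remote site.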
Later, when the failed site rejoins the geographic cluster, the updates that occurred during the failure are copied to it. This process is handled by either the geographic mirroring or the data replication component of the geographic clustering software. The time necessary for updating the site that is rejoining depends on:
- the amount of data that has been updated since the site went offline
- network bandwidth available
- network traffic
During this time, the processors at the viable sites continue operations.
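The three factors above combine into a simple back-of-the-envelope estimate: time is roughly the changed data divided by the bandwidth actually left over for the resynchronization. The sketch below assumes a T3 link at its nominal 44.736 Mbit/s; the 10 GB of changed data and the 50 percent share of the link are made-up figures, and the function name `resync_seconds` is hypothetical. Protocol overhead is ignored.

```python
T3_BITS_PER_SECOND = 44_736_000  # nominal T3 line rate


def resync_seconds(changed_bytes: float, link_bits_per_second: float,
                   available_fraction: float) -> float:
    """Estimate catch-up time for a rejoining site.

    available_fraction is the share of the link not consumed by other
    network traffic (an input you would have to measure or guess).
    """
    if not 0 < available_fraction <= 1:
        raise ValueError("available_fraction must be in (0, 1]")
    usable_bits_per_second = link_bits_per_second * available_fraction
    return changed_bytes * 8 / usable_bits_per_second


# Illustrative figures only: 10 GB changed, half of a T3 available.
estimate = resync_seconds(10 * 10**9, T3_BITS_PER_SECOND, 0.5)
print(f"about {estimate / 60:.0f} minutes")
```

With these illustrative inputs the estimate comes to roughly an hour, which shows why the amount of change accumulated during an outage matters as much as the raw speed of the link.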
A Framework For Evaluating Geographic Clustering
Geographic clusters are now commercially available. Products available now include CLAM Associates' Geographic High Availability, Data General's Global Availability, and DEC's Business Recovery Server for Disaster Tolerance.
More are undoubtedly on the way. As the person responsible for safeguarding your company's computing resources, you may want to evaluate geographic clustering as a platform for building a disaster tolerant computing environment. As you evaluate the various geographic clustering products, ask yourself the following questions:
- Does the solution provide reliable data integrity? Does it prevent data from being accidentally corrupted or altered? The product's remote mirroring or data replication component must ensure that, if a site fails, the surviving sites' data is consistent with the failed site's data. When the failed site reintegrates into the cluster, the software must update that site with the current data from the operable sites, once again ensuring data consistency.
- Is performance degraded? What is the impact of data mirroring or replication on system and application performance? Synchronous mirroring keeps the databases at every site current with one another, but does not match the performance of asynchronous mirroring.
- Does it provide automatic notification of failure? The software should not only automatically detect a site failure, but also notify you that a site has failed.
- Does it provide automatic failover? The product should provide recovery scripts to transfer control of data and applications to the operable sites when a site fails. An automated procedure results in less costly downtime while eliminating the possibility of inducing system failure during recovery.
- How long does it take to restore service? Minutes? Hours? Days? Or does it provide continuous availability? The product should provide fast recovery of mission-critical data and applications at the operable sites. Depending on the solution, recovery times will typically range from a few minutes to an hour, although individual times will vary depending on the amount of resources that need to be shifted to the surviving sites and the amount of application recovery processing required.
- Is the product flexible? Scalable? The product should support a wide range of configurations, allowing you to configure the disaster recovery solution unique to your needs. Possible configurations can range from an online backup machine turned on nightly to receive an updated copy of a database to a concurrent access configuration where all sites have simultaneous access to the same database. You should be able to scale a geographic cluster simply by adding memory and I/O controllers to individual processors, by swapping in more powerful processors, or by adding more processors.
- Can the sites be separated by a significant distance? The software should impose no constraints on the distance between sites, so that you can separate the sites by enough distance to ensure that the same disaster does not disable all sites.
- Is the solution file system and database independent? Applications configured to run in a geographic cluster should not have to be modified in any way. In this sense the product should be a 'generic' solution that works with any database management system.
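The performance question in the list above, synchronous versus asynchronous mirroring, can be made concrete with a toy model. Everything here is an assumption for illustration: the class `Mirror`, the latency figures, and the simplification that a synchronous write costs one local write plus one wide area round trip.

```python
from collections import deque
from typing import Deque, List


class Mirror:
    """Toy model of one site mirroring writes to a remote site."""

    def __init__(self, synchronous: bool, local_ms: float,
                 wan_round_trip_ms: float) -> None:
        self.synchronous = synchronous
        self.local_ms = local_ms
        self.wan_round_trip_ms = wan_round_trip_ms
        self.queue: Deque[str] = deque()  # writes not yet at the remote site
        self.remote: List[str] = []

    def write(self, record: str) -> float:
        """Apply a write; return the latency the application sees, in ms."""
        if self.synchronous:
            # Wait for the remote copy before acknowledging the write.
            self.remote.append(record)
            return self.local_ms + self.wan_round_trip_ms
        # Acknowledge at once and ship the write later.
        self.queue.append(record)
        return self.local_ms

    def drain(self) -> None:
        # Background transfer of queued writes to the remote site.
        while self.queue:
            self.remote.append(self.queue.popleft())

    def at_risk(self) -> int:
        # Writes that would be missing remotely if this site failed now.
        return len(self.queue)


sync_mirror = Mirror(True, local_ms=1.0, wan_round_trip_ms=40.0)
async_mirror = Mirror(False, local_ms=1.0, wan_round_trip_ms=40.0)
sync_latency = sync_mirror.write("t1")
async_latency = async_mirror.write("t1")
exposed = async_mirror.at_risk()
async_mirror.drain()
print(sync_latency, async_latency, exposed)
```

The model shows the trade in miniature: the synchronous mirror pays the round trip on every write but never has writes at risk, while the asynchronous mirror answers faster and carries a queue of writes that a site failure could strand.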
Thomas Casey is a team leader of the Technical Publishing Group and Steven Kohler is a principal support engineer with CLAM Associates.
This article is adapted from Vol. 9#3.