Wishing You a Speedy Recovery
- Published on October 26, 2007
Many factors can affect the availability of a system. As clusters grow to hundreds of nodes, the mean time between failures (MTBF) for the system as a whole decreases dramatically. These large clusters typically host database applications that can, to varying degrees, cope with the loss of an individual node. However, the threats to site availability through loss of key computing resources are much broader than the computing center itself. Large commercial installations have multiple dependencies on the world outside the local enterprise, both in terms of supporting services and the communications links between enterprise sites.
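To put a rough number on that decline, here is a minimal sketch. It assumes, purely for illustration, that nodes fail independently and at a constant rate, in which case the failure rates add up and the cluster-wide MTBF is the per-node MTBF divided by the node count. The function name and the 50,000-hour figure are hypothetical, not measurements.

```python
def cluster_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures anywhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    individual failure rates simply add up.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    node_mtbf = 50_000.0  # hypothetical per-node MTBF, in hours
    for nodes in (1, 10, 100, 500):
        print(nodes, "nodes:", cluster_mtbf_hours(node_mtbf, nodes), "hours")
```

Under these assumptions, a node that fails once every several years becomes, at a few hundred nodes, a cluster that sees a failure somewhere every few days.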
Disasters that hinder access to data are not restricted to earthquakes, floods, hurricanes or bombings. Sometimes they can be as small as a car accident. Take, for example, the recent mishap experienced by a regional lumber company. Its data center was completely cut off from all of its retail outlets when a car went out of control and knocked down the telephone pole that carried the only fiber-optic trunk in the region. That pole was a single point of failure, and losing it forced 20 retail outlets out of operation.
As you can see, not all catastrophes are large in scale, but the repercussions are: lost revenue, missed opportunity, poor customer service and support, loss of infrastructure, decrease in market share - all result when you don't have access to the technology that feeds your business. And while it's impractical for you to address every point of failure in your business, it is possible to do the next best thing: put into place a system that will protect your information resource and get your business back into operation as quickly as possible.
This heightened focus on computer-based business solutions dictates better methods for coping with downtime. Traditional disaster recovery solutions for computer environments typically take days to put into action. Tape backup and hot site methodologies require some degree of recreation of the customer environment after the disaster. The amount of time required will vary but will always include the need to roll tape to restore the image present on the server prior to the disaster. Furthermore, some amount of information is typically lost during the switch to the hot site, because tapes are only as good as the most recent backup. For businesses that can't afford lengthy tape restoration procedures and must capture the state of their systems as of the last transaction, a need exists for a much more rapid failover with no data loss.
The remainder of this article focuses on ways to address this requirement. Dynamic replication of data is one of the best ways to achieve the desired site availability. Businesses are starting to look to dynamic replication as an alternative to traditional backup and restore. As open systems client-server architectures become more prevalent in the business environment, new avenues for disaster recovery emerge. High speed communications, clustering, and client-server architectures make it very practical to continue operation in the face of total site loss. Let's call this process real-time geographic failover (Geo Failover) to indicate that we are substituting a server complex for a currently unavailable one, possibly a long distance from the original site, and we are doing it without any significant loss of service.
The key concept behind Geo Failover is the ongoing replication of an entire site to another location. This is accomplished over the network and allows for very rapid transition to the replicate copy when required. A good example to illustrate Geo Failover is a site hosting a web server, though everything described below applies equally well to intranet applications and strictly server-side services. The problems to be solved are:
1. How to allow continued access to your services after a catastrophe at the business site
2. How to efficiently and simply guarantee data access, even after site failure
3. How to automatically keep operations on-line after a total site failure
Let's tackle client access first. The distributed nature of transactions on the internet makes the location of the server complex less relevant to the end-user. This means that in a disaster situation, an entire computer site can be relocated without the customer being affected. This is due to the way the internet functions, so take a look at the sequence of events to better understand how the environment lends itself to real-time wide-area failover.
1. The end-user selects the hostname www.vendor.com in their favorite web browser
2. The browser puts a query out to the internet Domain Name Service (DNS) to resolve the hostname to a specific network address through which the browser can connect. The details of DNS are beyond the scope of this article; let's just say it works!
3. The DNS returns the answer 184.108.40.206 which indicates the specific address of www.vendor.com on the internet.
4. The browser then proceeds to connect to www.vendor.com at 220.127.116.11 and the end-user does what they set out to do there: make a purchase, get product info, get a phone number, jump to another link, whatever.
The important point is that the web browser resolved the location of the site at the instant that the end-user tried to connect. It would certainly be nice to have a spare copy of www.vendor.com and switch to using it if, for instance, a metropolitan power failure disabled the original copy. The internet makes the routing part fairly easy to do by employing a technique known as DNS Spoofing. This entails dynamically changing the answer returned by the DNS to send the user to a new location. To use the above example, let's assume that hostname www.vendor.com is at internet address 18.104.22.168 and is physically located in Trenton, NJ. Should that site get knocked out, with DNS spoofing the next web browser that looks up that hostname gets the address of the backup site for www.vendor.com. This site has address 22.214.171.124 and is located in Los Gatos, California (no earthquake jokes, please). Transparently, the entire web site has appeared in a new place. I've described a somewhat generic DNS spoofing solution; different vendors add other functionality around this core.
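The idea of changing the answer at lookup time can be sketched in a few lines. This is a toy model, not a real name server: the class names, the health-check callback, and the addresses are all assumptions made for illustration. The addresses come from ranges reserved for documentation and stand in for the Trenton and Los Gatos sites.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SiteRecord:
    primary: str  # address of the main site
    backup: str   # address of the standby site


class FailoverResolver:
    """Toy name service that hands out the backup address whenever the
    primary site is reported down. Illustrative only."""

    def __init__(self, is_alive: Callable[[str], bool]) -> None:
        self._records: Dict[str, SiteRecord] = {}
        self._is_alive = is_alive  # hypothetical health check supplied by the caller

    def register(self, hostname: str, primary: str, backup: str) -> None:
        self._records[hostname] = SiteRecord(primary, backup)

    def resolve(self, hostname: str) -> str:
        record = self._records[hostname]
        # The choice is made at the instant of the lookup, so the next
        # client to ask is steered to whichever site is available.
        if self._is_alive(record.primary):
            return record.primary
        return record.backup


if __name__ == "__main__":
    down = set()  # addresses currently considered unreachable
    dns = FailoverResolver(lambda address: address not in down)
    dns.register("www.vendor.com", "192.0.2.10", "198.51.100.10")

    print(dns.resolve("www.vendor.com"))  # Trenton is up
    down.add("192.0.2.10")                # simulate the power failure
    print(dns.resolve("www.vendor.com"))  # Los Gatos takes over
```

One practical caveat: clients and intermediate name servers cache answers for the record's time-to-live, so a switch of this kind only takes effect quickly if the record is published with a short lifetime.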
Okay, so we know how to move a site around using DNS spoofing. How do we keep it up-to-date? Some installations are taking real orders over the network, and we don't want to lose those transactions. Other sites are providing up-to-the-minute information. How do we ensure that the copies are the same? Some sort of on-line replication is called for. Rather than take regular backup tapes for safekeeping, it is more efficient to continuously send a copy of critical data over the network. The form that such data replication takes depends on the needs of the business. Solutions exist from the very specific to the very general. The general idea of all the replication schemes is that data written on one side gets propagated to a backup machine. The type of data copied, the flexibility in choosing which data, the complexity of management and the ease of recovery all vary across different solutions. The common theme is that rather than creating a backup copy as a separate step, the system does so on an ongoing basis.
Database replication allows a business to create copies of the data that is considered critical. These solutions offer much flexibility around the choice of which data is replicated and how often the replication occurs. Replication usually occurs at the transaction level, meaning that the replication mechanism plays by the same rules as the database itself: either a database operation is copied to the backup site or it isn't, with no halfway copies. The other side of the flexibility coin is that most database replicators are very specific to a database vendor, so choices of replication technology might actually drive the database selection. Secondly, database replication environments are currently very complex to set up, requiring the administration of an RDBMS at the sending side and a separate DBMS on the receiving end, plus some configuration for the link itself. While database replication does help achieve the goal of providing a copy of data in another site to use in the case of disaster, moving back to the original location could be quite a chore, involving significant reconfiguration of the database replication logic.
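The "no halfway copies" rule can be illustrated with a small sketch. This is not how any particular database replicator works. It is a hypothetical in-memory replica, with made-up names, that stages all the writes of a transaction privately and makes them visible in a single step, so a transaction that arrives incomplete leaves the replica untouched.

```python
from typing import Dict, List, Optional, Tuple

Write = Tuple[str, Optional[str]]  # (key, value); None marks a damaged write


class Replica:
    """Hypothetical backup-site store that applies transactions atomically."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}

    def apply_transaction(self, writes: List[Write]) -> None:
        """Apply every write in the transaction, or none of them."""
        staged = dict(self.rows)  # work on a private copy
        for key, value in writes:
            if value is None:
                # Reject the whole transaction; self.rows was never touched.
                raise ValueError(f"incomplete write for {key!r}")
            staged[key] = value
        self.rows = staged  # single switch-over: all writes appear together


if __name__ == "__main__":
    replica = Replica()
    replica.apply_transaction([("order-1", "paid"), ("stock-7", "41")])
    try:
        replica.apply_transaction([("order-2", "paid"), ("stock-7", None)])
    except ValueError as error:
        print("rejected:", error)
    print(replica.rows)
```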
Moving to a more general solution, a number of products exist to help mirror files from one site to another. In this model, as files are modified, redundant copies are pushed to the backup site. File replication is much simpler to set up than database replication and might be just right for those businesses that only need to preserve file-based data. However, a fair number of applications on the market dictate the use of a raw disk drive, one that has no file system and hence no files. For these cases, file replication is not a viable solution.
This is where wide-area disk replication technology steps in. It is possible to maintain a network disk-mirror, an exact clone of your current data that resides a vast distance from the first copy. Network disk mirroring ensures that each write operation to a disk is copied in another location. Network disk mirroring is the broadest solution described. It works for any type of data.
Let's get back to our example, now including data replication. Every time the on-line catalog gets updated in Trenton, the copy in Los Gatos follows suit immediately. The vendor can update their disk with confidence that another copy of everything they do is safe and sound far from the local power grid, fault line, flood plain, and any other variable that is out of the direct control of the computing operations center.
It's well and good to have a copy of the data elsewhere, but if a business is to continue with minimal if any downtime, the switch to the other site must be rapid, ideally automatic. This requirement dictates that more than the minute-to-minute data be mirrored. The business must also have the capability to enable a second copy of whatever server-based computing resources allow the end-user to access that data. This is handled by wide-area clustering. Taking the modern high availability cluster, an architecture that has become increasingly prevalent in the '90s, and spreading the elements of the cluster across a wide geography yields a system where failure of an entire site can be detected and remedied.
How it all fits together
Now that we have the pieces of real-time, wide-area disaster recovery, let's take a look at how those pieces all fit. Let's say that vendor.com has a site in Trenton, a site in Los Gatos, a high-bandwidth corporate fiber network link between those two locations, a wide-area disk mirror, and a geo cluster linking the two sites.
1. The Trenton site (www.vendor.com 126.96.36.199) goes down (power failure).
2. The wide-area cluster detects the failure within seconds and initiates automatic procedures to switch to the Los Gatos site.
3. Network messages are sent to the DNS servers that are distributed around the internet, and these all start handing out the new location for www.vendor.com (188.8.131.52).
4. The wide-area cluster starts new copies of the server processes that will feed the web.
5. The business is back on-line. Its data is up-to-date, and its window of downtime is most likely in the range of minutes.
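The sequence above can be sketched as a small piece of cluster logic. This is a simplified model under stated assumptions: the class, its field names, and the idea of detecting failure through a missed heartbeat are illustrative choices, and the DNS update and process start-up are reduced to log entries rather than real actions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeoCluster:
    """Toy wide-area cluster: one primary site, one standby site."""

    primary: str
    standby: str
    heartbeat_timeout: float          # seconds of silence tolerated
    last_heartbeat: float = 0.0       # time the primary last reported in
    active: str = ""                  # site currently serving clients
    log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.active = self.primary

    def heartbeat(self, now: float) -> None:
        """Record that the primary site is still alive."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Fail over if the primary has been silent too long.

        Returns True only when a failover was performed by this call.
        """
        if self.active != self.primary:
            return False  # already running at the standby site
        if now - self.last_heartbeat <= self.heartbeat_timeout:
            return False  # primary is healthy
        self.log.append(f"detected loss of {self.primary}")
        self.log.append(f"DNS now answers with {self.standby}")
        self.log.append(f"started server processes at {self.standby}")
        self.active = self.standby
        return True


if __name__ == "__main__":
    cluster = GeoCluster("Trenton", "Los Gatos", heartbeat_timeout=5.0)
    cluster.heartbeat(now=100.0)
    cluster.check(now=103.0)   # still within the timeout, nothing happens
    cluster.check(now=106.0)   # silence exceeded, switch to the standby
    for entry in cluster.log:
        print(entry)
```

Passing the clock in as an argument keeps the sketch deterministic; a real cluster would read the time itself and would need to guard against both sites believing they are active at once.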
This combination of wide-area cluster and data replication offers broad flexibility. Consistency mechanisms within the best network data replicators track all pending operations to the remote machine should it be temporarily unavailable or should the intervening link drop out of service. This makes it a simple matter to restore operations to the original site when desired. It also makes practical the wholesale migration of a computer site for purposes of upgrade or maintenance.
Although both high availability clustering and data replication have been available for a number of years, the above scenario has only recently become an extremely viable one. A few important factors make it more practical now. The first is the proliferation of high speed, dedicated bandwidth network links within the enterprise. The second is the emergence of high availability clustering technology that can function across wide distances and use industry standard internet technology to link the two sides. Finally, a data replicator maintains the data at the alternate location.
Wide-area clustering combined with replication provides a responsiveness to disasters that is unmatched by existing solutions.
Businesses have a wider arsenal of technology to help stay on-line through a disaster than ever before. The possible solutions span a spectrum defined by both data currency and switchover time. At one end of the spectrum is tape backup. At regular intervals all data is recorded to tape and stored in a safe place. In the event of a catastrophe those tapes are brought forward and applied to a new installation to create a new incarnation of the original site. Obtaining a clean tape backup typically entails bringing the computer system to an idle state, not an attractive proposition to a site running 24x7.
A much better solution is data replication. This solution falls short in that while a copy of the data exists, no ready facilities are present to restart the applications using that data. It does greatly surpass tape solutions in that the system need not ever be idled.
Finally, wide-area clustering adds the automatic recovery logic missing from the network disk mirror solution. Businesses can feel confident that not only is their data safe in another location, but they can expect a rapid and successful fail-over to the standby site. Wide-area clustering can be combined with more traditional high availability clusters to yield very robust environments capable of surviving the loss of any component within a site or the loss of the entire site itself.
Businesses now have a new alternative to tape backup to stay on-line. Geo failover allows them to continue operation in the face of total site loss. Clients typically are unaware of the site transition. Restoration of the original site at a later time is painless, and the impact to ongoing computing activities is minimal.
Technology to the Rescue
The requirement for data integrity and quick failover is addressed through Geographic Clustering and Geographic Mirroring. Today's high speed WAN interconnects make it feasible to maintain an up-to-date copy of your business data far away from the original site for use in disaster recovery.
The combination of high availability clustering technology, network disk technologies and the Internet allows businesses to realize totally automatic failover to the hot site when needed. Companies can resume business without needing to roll tape or reconstruct the lost operating environment.
Benefits of Geographic high availability:
- Quick business recovery
- Zero loss of data
- Minimized disruption to customers
- Easy transition back to original site
- Advantage over non-similarly equipped competitors
What is Geographic Clustering?
Geographic clustering extends loosely coupled clustering technology to encompass two physically separate sites. Each site maintains an updated copy of essential data and runs key applications, ensuring that mission critical computing resources remain continuously available at all times, even if an entire site is disabled. Using two sites prevents an individual site from being a single point of failure within the cluster.
Geographic Clustering differs from local high availability solutions by enabling three distinct levels of configuration: hot standby, mutual takeover, and concurrent access.
Hot Standby: A Hot Standby configuration provides the ultimate in business recovery planning. This configuration has site-similar data processing equipment and applications installed and operational at a geographically separate site. In the event of a failure, the operations of the entire disabled site, both data and human resources, are relocated to the Hot Standby to continue business operations. Transfer of operations can take anywhere from a couple of hours down to minutes.
While more expensive than other configurations, Hot Standby offers business recovery capabilities for companies that rely completely on the availability of data.
Mutual Takeover: Sometimes a business recovery plan needs to encompass multiple geographic sites. A finance company with operations in Boston, Chicago and San Francisco can implement a Mutual Takeover configuration to ensure business recovery should any single site fail.
Mutual Takeover enables business operations to automatically transfer to one or more secondary sites should a failure occur. While an earthquake in San Francisco may disable local business operations, Mutual Takeover would enable the Boston and Chicago data processing centers to absorb the failed applications and data, maintaining seamless customer service and support.
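A minimal sketch of the takeover decision follows. It assumes a very simple policy, chosen only for illustration: each application from the failed site goes to whichever surviving site is currently running the fewest applications. The function name, the policy, and the application names are hypothetical.

```python
from typing import Dict, List


def take_over(sites: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new placement in which the failed site's applications are
    spread across the surviving sites, least-loaded site first."""
    survivors = {name: list(apps) for name, apps in sites.items() if name != failed}
    if not survivors:
        raise RuntimeError("no surviving site to take over")
    for app in sites[failed]:
        # Ties are broken alphabetically so the result is predictable.
        target = min(survivors, key=lambda name: (len(survivors[name]), name))
        survivors[target].append(app)
    return survivors


if __name__ == "__main__":
    placement = {
        "Boston": ["payroll"],
        "Chicago": ["ledger"],
        "San Francisco": ["loans", "cards"],
    }
    print(take_over(placement, failed="San Francisco"))
```

A production cluster would also weigh capacity, data locality and application dependencies; counting applications is only the simplest stand-in for load.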
Concurrent Access: The Concurrent Access configuration provides a single disk image shared by the nodes across the geographic cluster. This configuration offers inexpensive data availability for multiple geographic locations. Data is continuously written to the single disk from all nodes within the cluster. In the event of a failure, operating nodes can access this disk for an up-to-date copy of data and applications.
What is Geographic Mirroring?
Geographic Mirroring, often called GeoMirroring, ensures that all data written to disk by a business is also written to a remote site. Network disk mirroring technology typically keeps track of differences between the two sites involved in a mirror. Therefore, after a failover, it is frequently possible to later resume operation at the original site without needing to copy all data back. Usually only the changes made since the failover need be transferred.
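The difference tracking described above can be sketched with a set of "dirty" block numbers. This is an illustrative model, not any vendor's implementation: the class name, the in-memory block lists, and the `link_up` flag are assumptions made for the example.

```python
from typing import List, Set


class MirroredDisk:
    """Toy network disk mirror that remembers which blocks are out of sync."""

    def __init__(self, block_count: int) -> None:
        self.local: List[bytes] = [b""] * block_count
        self.remote: List[bytes] = [b""] * block_count
        self.link_up = True
        self.dirty: Set[int] = set()  # blocks written locally but not remotely

    def write(self, block: int, data: bytes) -> None:
        self.local[block] = data
        if self.link_up:
            self.remote[block] = data
            self.dirty.discard(block)
        else:
            self.dirty.add(block)  # remember the difference for later

    def resync(self) -> int:
        """Copy only the blocks that changed while the sites were apart.

        Returns the number of blocks transferred.
        """
        if not self.link_up:
            raise RuntimeError("link is still down")
        copied = 0
        for block in sorted(self.dirty):
            self.remote[block] = self.local[block]
            copied += 1
        self.dirty.clear()
        return copied


if __name__ == "__main__":
    disk = MirroredDisk(block_count=1000)
    disk.write(0, b"catalog v1")
    disk.link_up = False           # the sites lose contact
    disk.write(5, b"order 17")
    disk.write(7, b"order 18")
    disk.write(5, b"order 17 (amended)")
    disk.link_up = True
    print("blocks transferred:", disk.resync(), "of", len(disk.local))
```

Because only the touched blocks are recorded, the cost of bringing the two copies back together depends on how much changed, not on the size of the disk.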
As with Clustering, GeoMirroring provides both synchronous and asynchronous geographic mirroring to ensure that even if a site fails, an up-to-date copy of the data is available to keep the business in operation. With synchronous mirroring, data is simultaneously written to both sites before control is given back to the application to perform the next transaction and write. Synchronous mirroring ensures that the data between geographic sites is always identical. Since data is concurrently written between the two sites, there is a higher degree of information availability with less chance of data loss.
Asynchronous mirroring writes the data to both sites at the same time, but only the write to the local site has to be complete before control is given back to the application. The remote site is allowed to 'lag' behind the writes to the primary site. While enabling faster operations, there exists a small window of data-loss opportunity should a failure occur.
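The trade-off between the two modes can be sketched as follows. This is a simplified model with hypothetical names: the remote write is represented by a queue that a background task would drain, and the length of that queue stands for the window of data-loss opportunity.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class Mirror:
    """Toy mirror that can run in synchronous or asynchronous mode."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.local: Dict[int, bytes] = {}
        self.remote: Dict[int, bytes] = {}
        # Writes acknowledged to the application but not yet at the remote site.
        self.pending: Deque[Tuple[int, bytes]] = deque()

    def write(self, block: int, data: bytes) -> None:
        """Return once the application is allowed to continue."""
        self.local[block] = data
        if self.synchronous:
            self.remote[block] = data            # wait for the remote copy
        else:
            self.pending.append((block, data))   # remote catches up later

    def drain(self, max_writes: Optional[int] = None) -> int:
        """Ship queued writes to the remote site; returns how many were sent."""
        sent = 0
        while self.pending and (max_writes is None or sent < max_writes):
            block, data = self.pending.popleft()
            self.remote[block] = data
            sent += 1
        return sent

    def exposure(self) -> int:
        """Writes that would be lost if the primary site failed right now."""
        return len(self.pending)


if __name__ == "__main__":
    for mode in (True, False):
        mirror = Mirror(synchronous=mode)
        for block in range(3):
            mirror.write(block, b"data")
        label = "synchronous" if mode else "asynchronous"
        print(label, "writes at risk:", mirror.exposure())
```

In synchronous mode the exposure is always zero, at the price of every write waiting on the wide-area link. In asynchronous mode the application runs ahead and the exposure is whatever the remote site has not yet received.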
Cliff Spencer is the Chief Technology Officer for CLAM Associates. He can be reached at: (617) 621-2542 or by e-mail: firstname.lastname@example.org