For example, in a high-volume manufacturing plant we usually find a process tracking system run by the IT organization. The system tracks labor, raw materials, finished goods, work in progress, and many other details that are vital to the company. Many of the common causes of disaster that could stop the plant’s operation, such as flood, weather, fire, and loss of power, are also likely to force a shutdown of the computer system.
The plant may be the company’s only manufacturing facility, it may make critical parts of a larger product, or it may be just one of many sites that turn out large volumes of finished goods. In all cases, a plant that is not running costs the company money for each minute, hour, and day that it is down. It is important to estimate that cost, since it varies for each business and is needed to help justify the disaster-tolerance effort. A production line may be able to make up for lost production by having employees work overtime, but a funds transfer system in a bank cannot use the same strategy to recover from delayed transactions. Patient medication tracking in a hospital may be able to fall back to files and written charts hung on the patient’s bed, but an on-line securities exchange cannot use paper and pencil to track the thousands of trades per hour that occur each day. Each case is different, and there may be more than one case within a company, especially when multiple businesses in a company share a key computing system. In our example of the manufacturing plant, while the cleanup may take hours or days, the warehouse could continue operations if it still had access to computers to manage raw material deliveries and finished goods shipments. In most cases, the sooner that orders can be shipped, the sooner the company can restore business operations. Having the computers operational will also reassure customers, since order status can be checked, e-mail communications can be resumed, and the web site can be back on-line.
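The cost estimate described above can be sketched as simple arithmetic. The following is only an illustration: the function name, the idea of a "recoverable fraction" (the share of lost business that can be made up later, for example through overtime), and all of the figures are assumptions chosen for the example, not numbers from any real business.

```python
def downtime_cost(hours_down, cost_per_hour, recoverable_fraction=0.0):
    """Estimate the net cost of an outage.

    recoverable_fraction is the share of the lost business that can be
    made up afterwards (for example through overtime); 0.0 means none.
    """
    if hours_down < 0 or cost_per_hour < 0:
        raise ValueError("hours_down and cost_per_hour must be non-negative")
    if not 0.0 <= recoverable_fraction <= 1.0:
        raise ValueError("recoverable_fraction must be between 0 and 1")
    return hours_down * cost_per_hour * (1.0 - recoverable_fraction)


# Hypothetical figures, for illustration only.
plant = downtime_cost(hours_down=48, cost_per_hour=10_000, recoverable_fraction=0.5)
bank = downtime_cost(hours_down=48, cost_per_hour=10_000)
print(f"Plant (half recovered by overtime): {plant:,.0f}")
print(f"Funds transfer (nothing recovered): {bank:,.0f}")
```

With the same outage length and hourly cost, the business that cannot make up lost work afterwards carries twice the net loss in this example, which is why the estimate has to be made case by case.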
Remote Copy Is The Lowest Form Of Disaster Tolerance
Making a continuous copy of the data is the first step in protecting your critical IT infrastructure. When one copy is physically located some distance from another copy, we have our basic disaster tolerance capability. The distance between the copies determines which disasters can be tolerated. As Figure 1 shows, we can have the computer storing data on a local disk and have a disk farther away hold a duplicate copy of the critical data.
There are many different methods and technologies that can be used to create and maintain the remote copy. The replicated data can originate within the application, the copy can be created by mirror/shadow software in the operating system, or the copy can be created by functions built into the storage controllers. The copy may be maintained synchronously or asynchronously. With a synchronous copy, a data write operation is not considered complete until both the local and the remote copy are on the non-volatile surface of a disk. With an asynchronous copy, the write is considered complete once the local copy is on disk, and the remote copy is updated afterwards. In its crudest form, remote copy could be implemented using the off-site daily backup and using transaction journaling to a disk that is remote from the original computer system.
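The difference between a synchronous and an asynchronous copy can be illustrated with a small sketch. This is a toy model under stated assumptions: the class name `ReplicatedVolume` and its methods are hypothetical, the "disks" are in-memory lists, and no real mirroring software or storage controller works through this interface.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a local disk with one remote copy (hypothetical API)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = []         # records on the local disk
        self.remote = []        # records on the remote disk
        self.pending = deque()  # acknowledged writes not yet at the remote site

    def write(self, record):
        self.local.append(record)
        if self.synchronous:
            # Synchronous: the write is not acknowledged until the
            # remote copy also holds the record.
            self.remote.append(record)
        else:
            # Asynchronous: acknowledge now, ship to the remote site later.
            self.pending.append(record)
        return True  # write acknowledged to the application

    def drain(self):
        """Ship all queued records to the remote copy."""
        while self.pending:
            self.remote.append(self.pending.popleft())

    def lost_if_primary_fails(self):
        """Acknowledged records that exist only at the primary site."""
        return list(self.pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for rec in ("order-1", "order-2", "order-3"):
    sync_vol.write(rec)
    async_vol.write(rec)

print("sync at risk: ", sync_vol.lost_if_primary_fails())
print("async at risk:", async_vol.lost_if_primary_fails())
```

In this model the synchronous volume never has acknowledged data that exists at only one site, while the asynchronous volume can lose whatever is still queued when the primary site fails. A real synchronous write also waits on the round trip to the remote site, which the toy model does not simulate.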
Whichever method is chosen, all of these implementations allow the company’s critical data to be held at a safe distance from the original site in case the data storage equipment is destroyed or inaccessible. To restore operations with the remote copy strategy, the data must be loaded onto suitable disks and the application programs adjusted to use the new copy of the data.
If the remote copy is outside the computer room, but in another room of the same building, the computer system is protected only from disasters that are confined to the computer room, such as a fire or an explosion. On the other hand, a flood or an earthquake would likely affect both rooms of the same building. Moving the data out of the computer room but keeping it in the same facility therefore protects the business only from small disasters.
But what if the original computers remain inaccessible or were destroyed by the same forces that destroyed the original copy of the data? If processing depends on some special or custom devices, then it will be difficult to resume computing without them. For the data to be usable, the computers, application programs, and user access network need to be operational and accessible. Recovery time with remote copy can be as little as a few hours, but it will usually take one or more days, since access to the original computer room or to a surrogate installation can be delayed by the original disaster.
Remote Computing Improves The Disaster Tolerant Solution
Improved disaster tolerance recognizes that both data and compute power are needed at an alternate site. With alternate computing resources in place, recovery of the IT infrastructure is typically 10 times faster than with remote copy.
The time that the original data center is inaccessible or the time it takes to make an alternate site operational can range from five hours to 50 hours (assuming custom equipment does not need to be ordered). With an alternate site, the recovery process is much more straightforward, and documented recovery procedures do not need to handle the complicated process of locating a computer to use. This alternate computing site does not need to be running the same application programs and, in fact, the alternate systems can normally be used for wholly different functions, such as running less critical applications, software development and testing, or even training. As long as the alternate site has all of the equipment necessary to meet the minimum level of performance, the company is operational, and additional or replacement equipment can be installed if the original site is not expected to be available in the near future. It may also be possible to relocate equipment from the original site to the alternate site if the equipment is functional but inaccessible. Figure 2 shows an example configuration, which has computing and a copy of the data at both sites. Note that although the computer at the alternate site is drawn physically close to the second copy of the data, the two are not actively connected.
To continue with the factory example, the amount of processing power needed to run the warehouse should be significantly less than needed to run the production floor and warehouse combined. This would allow the computers at the second site to be smaller and less expensive.
On the other hand, recovery steps needed to restore operation typically include a system reboot, database reload, application re-vectoring, user access rerouting, and other steps necessary to adjust computing operations to suit the alternate site. If the alternate site were running less critical applications, then those programs must be shifted or shut down in an orderly fashion. The time it takes to perform these common “failover” steps can range from a few minutes to a few hours.
As long as the cost of being out of operation for up to a few hours can be tolerated, given the infrequent occurrence of a disaster, the improved strategy meets the business needs. However, some businesses cannot accept hours of continuous outage. Real-time systems, such as nuclear power plant control and air traffic control systems, are obvious cases that cannot tolerate long outages. Other cases include a catalog order entry system, a police and fire dispatch system, or a hospital management system, where significant operation outages cannot be tolerated. These businesses should consider the next configuration.
Wide Area Clustering Provides The Ultimate In IT Disaster Tolerance
With data replicated at both sites and sufficient processing capacity present, applications can remain on-line when using an actively clustered configuration. Active clusters allow simultaneous application execution on all of the computers in the clustered system. When configured as a wide area cluster, the application can remain operational without regard to the physical location of the hardware. In this configuration, if a site becomes inoperable, the workload continues at the remaining site(s). Failover time typically ranges from no time at all to a few minutes, if you want to optimize parameters for the new workload. This is because little or no manual intervention is needed to keep the application up and running on the remaining systems. As shown in Figure 3, we allow concurrent access to the full set of disks from the systems at both sites.
Active clustering relies on two specific technologies. The first technology provides a coordinating software function to manage access to the data so that multiple copies of the application can read and write to the same files without corrupting the contents. The coordination is usually provided by facilities in the operating system, or it can be built with special procedures written by the application developer. The second technology is long-distance communication: recent advances in high-speed wide-area communications channels such as FDDI, Fibre Channel, and ATM or other high-speed packet switching services provide the links needed to implement this solution. The high availability of clustering, when added to the disaster tolerance of wide area configurations, produces the ultimate in IT solutions. Whether the outage affects a single system or a whole site, the cluster continues to function after the loss of part of the overall system.
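The coordinating function described above can be illustrated in miniature. The sketch below is a single-process stand-in, not a cluster lock manager: two threads play the role of application copies at two sites, and Python’s standard `threading.Lock` plays the role of the coordination facility. The names `SharedRecord` and `run_node` are hypothetical.

```python
import threading


class SharedRecord:
    """A record that several application copies update concurrently."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()  # stand-in for the cluster's coordinator

    def add(self, amount):
        # Only one application copy at a time may read, modify, and write
        # the record; without the lock, concurrent updates could be lost.
        with self._lock:
            current = self.value
            self.value = current + amount


def run_node(record, updates):
    """Simulate one application copy applying a series of updates."""
    for _ in range(updates):
        record.add(1)


record = SharedRecord()
nodes = [threading.Thread(target=run_node, args=(record, 1000)) for _ in range(2)]
for node in nodes:
    node.start()
for node in nodes:
    node.join()

print("updates applied:", record.value)
```

A real wide area cluster has to provide the same guarantee across separate machines and across the inter-site link, which is why the coordination is normally left to the operating system’s cluster facilities rather than written by hand.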
In conclusion, we see that remote copy provides only limited protection and has the slowest recovery time of the three approaches. Remote computing offers a superior solution for businesses that have sufficient capital budget for the equipment and the forethought to implement a faster recovery. For those businesses that recognize the criticality of their computing infrastructure, and have chosen to have the business “ride through” a site outage, wide-area clustering offers an outstanding solution with virtually the same equipment investment as the remote computing solution.
Robert Lyons is a systems consultant at Resilient Systems Inc. (www.resilientsys.com). He has over 10 years of experience in designing and implementing disaster-tolerant configurations worldwide.