Adding Disaster Tolerance to Computing Systems
- Published on October 25, 2007
A Passively Redundant system provides access to alternative components which are not associated with the current task and must be either activated or modified in some way to pick up the load of the failed component. The consequent transition is noticeable and may even cause a significant interruption of service.
Subsequent system performance may also be degraded. Examples of passively redundant systems include stand-by servers and clustered systems. The mechanism for handling failures in passively redundant systems is to fail over to an alternative server. The current state of the failed application is lost and the application must be restarted on the other system. The fail-over and restart processes typically cause some interruption or delay in service to the users.
Thus, passively redundant systems such as stand-by servers and clusters provide 'high availability' but cannot deliver the continuous processing usually associated with 'fault tolerance.'
An Actively Redundant system provides at least one alternative processor that runs the same task concurrently and, in the presence of a failure, provides continuous service without a noticeable interruption. The mechanism for handling failures is to compute through a failure on the remaining system or systems. Because at least two processors are looking at and manipulating the same data at the same time, the failure of any single component is invisible to both the application and the user. In this article we discuss technology for providing fault tolerant computing during natural disasters.
Failures in systems can be managed in two different ways, each providing a different level of availability and a very different restoration process. The first is to recover from failures, as in passively redundant systems; the second is to mask failures so they are invisible to the user, as in actively redundant systems.
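The difference between recovering from a failure and masking it can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the `Counter` application and both `run_*` functions are hypothetical names, and the "state" is a single running total held in memory.

```python
class Counter:
    """Toy application whose in-memory state is a running total (hypothetical)."""

    def __init__(self):
        self.total = 0

    def apply(self, value):
        self.total += value
        return self.total


def run_with_recovery(requests, fail_at):
    """Passive redundancy: the standby starts from scratch after the failure,
    so state accumulated before the failure is lost."""
    active = Counter()
    for i, value in enumerate(requests):
        if i == fail_at:
            active = Counter()  # fail over: restart the application on the standby
        active.apply(value)
    return active.total


def run_with_masking(requests, fail_at):
    """Active redundancy: two replicas process every request, so the survivor
    already holds the full state when the other one fails."""
    replicas = [Counter(), Counter()]
    for i, value in enumerate(requests):
        if i == fail_at:
            replicas.pop(0)  # one replica fails; the other computes through
        for replica in replicas:
            replica.apply(value)
    return replicas[0].total
```

With the same four requests and a failure before the third, the recovering system ends with only the work done after the restart, while the masking system ends with the complete total.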
Systems that recover from failures employ a single system to run the user applications until a failure occurs. Once a user, a system operator or a second system that is monitoring the status of the first detects a failure (which takes several seconds to several minutes), the recovery process begins. In the simplest type of recovery system, the operator physically moves the disks from the failed system to another system and boots the second system. In more sophisticated systems, the second system, which has knowledge of the applications and users running on the failed system and a copy of or access to the users' data, automatically reboots the applications and logs on the users. In both cases the users see a pause in operation and, of course, lose all their work between the last save and the time of the failure. Additionally, they must restart their application. Applications that have been modified with knowledge of the system architecture can reboot automatically, providing a smoother recovery.
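One common way for a second system to monitor the first is a heartbeat with a timeout. The article does not say which detection mechanism any particular product uses, so the following is a minimal sketch under that assumption; `HeartbeatMonitor` and its methods are hypothetical names, and times are plain numbers of seconds supplied by the caller.

```python
class HeartbeatMonitor:
    """Declares the primary failed when no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None  # time of the most recent heartbeat, if any

    def beat(self, now):
        """Record a heartbeat received from the primary at time `now`."""
        self.last_seen = now

    def primary_failed(self, now):
        """True once the primary has been silent for longer than the timeout."""
        if self.last_seen is None:
            return False  # nothing received yet, so no basis for a verdict
        return (now - self.last_seen) > self.timeout_s
```

The timeout sets the trade-off: a short value detects failures sooner but risks treating a busy primary as dead, which is one reason detection in recovery-based systems can take seconds or longer.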
Systems that recover from failures include:
1. Automatic back-up, where selected files are copied periodically onto another system, which can be rebooted if the first system fails.
2. Stand-by servers, which not only copy files from one system to another but also keep track of applications and users.
3. Clusters, a term that in current industry usage can mean anything from a stand-by server to a performance-scaling array of computers with a fault tolerant storage server and a distributed lock manager.
Systems that mask failures employ the concept of parallel components. Here, two components, each capable of doing the job, are deployed doing the same job at the same time. If one should fail, the other continues, thereby improving overall system reliability. A simple and common example of the parallel technique places two power supplies in one system. If one power supply fails, the other keeps the system operating. More robust masking systems replicate everything in the system, making it transparent to all single failures. These truly fault tolerant systems detect most failures in less than a second and offer other features that facilitate 7 x 24 operation, such as on-line repair and upgrade capabilities.
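The gain from parallel components can be put in numbers with the standard formula for redundant parts. The sketch below assumes that the components fail independently of one another, which a shared site disaster can violate; the function name is hypothetical and the figures are illustrative, not measurements of any product.

```python
def parallel_availability(single, copies=2):
    """Availability of `copies` parallel components, any one of which can do
    the job, assuming their failures are independent (an assumption)."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies
```

For example, two power supplies that are each available 99% of the time give a combined availability of about 99.99% under this assumption, because both must be down together for the system to lose power.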
Systems that will provide continuous operation in the wake of a natural disaster, then, must be truly fault tolerant. This means they must have the following six characteristics:
1. No Single Point of Failure
2. No Single Point of Repair
3. Identification: System identifies any errors or failures before any data can be corrupted.
4. Isolation: System isolates errors or failures so the system can continue to operate in the presence of the error or failure.
5. Repair: The failed component can be repaired while the system is in operation running the users' applications.
6. Resynchronization: The failed subsystem is brought back into the system configuration thus restoring full functionality with minimum or no interruption of service.
Further, since the occurrence of a failure in these systems is not visible to the user, a method for notifying the operator and/or the service personnel must be provided so repair of a system that becomes vulnerable after a failure can be initiated. The users of fault tolerant systems are typically unaware that a failure has occurred and has been repaired.
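The sequence of identification, isolation, operator notification, repair and resynchronization can be sketched as a small state model. This is only an illustration: `RedundantPair`, its component names "A" and "B", and the `notify` callback are hypothetical, and real resynchronization involves copying state rather than flipping a flag.

```python
class RedundantPair:
    """Toy model of two redundant components, named "A" and "B"."""

    def __init__(self, notify):
        self.online = {"A": True, "B": True}
        self.notify = notify  # hypothetical callback that alerts the operator

    def is_serving(self):
        """Users are served as long as at least one component is online."""
        return any(self.online.values())

    def is_fully_redundant(self):
        return all(self.online.values())

    def report_fault(self, name):
        """Identification and isolation: take the faulty component out of the
        configuration and tell the operator the system is now vulnerable."""
        self.online[name] = False
        self.notify(f"component {name} isolated; repair needed")

    def repair_and_resync(self, name):
        """Repair and resynchronization: bring the component back while the
        survivor keeps running."""
        if not self.is_serving():
            raise RuntimeError("no surviving component to resynchronize from")
        self.online[name] = True
```

The notification step matters precisely because the failure is masked: without it, nobody would know that the system had dropped to a single component.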
Architecture for Site
All of the above solutions entail purchasing new server hardware. Marathon Technologies has developed a fault tolerant computing solution for the Windows NT environment that adds site disaster tolerance to any off-the-shelf Pentium Pro-based server. The Marathon system, which is transparent to both the operating system and the application, allows examination and comparison of the results of computations within the normal execution process.
First, consider the two basic operations that all computer systems perform:
1. Manipulating and transforming data.
2. Moving data to and from mass storage, networks and other I/O devices.
The Marathon configuration separates these functions both logically and physically onto two separate CPUs (Figure 5) and interconnects them with high-speed PCI interfaces and fiber optics. The interface card, the Marathon Interface Card or MIC, contains the drivers for sending data to and receiving data from two systems simultaneously, together with the comparison and test logic that assures the results from the two systems are identical. The pair of computers in Figure 5, called a tuple, constitutes a complete system in which the operating system running on the Compute Element is Windows NT Server and the operating system running on the I/O Processor is either Windows NT Server or Windows NT Workstation, depending on how the I/O Processor is used.
Figure 5. Marathon Tuple - Building Block for a Fault Tolerant System
All I/O task requests from the Compute Element are redirected to the I/O Processor for handling, a 'division of labor.' The I/O Processor runs the Marathon software as an application that performs all the fault handling, disk mirroring, system management and resynchronization tasks. Since Windows NT is a multi-tasking operating system, other non-fault tolerant applications can also be run on the I/O Processor.
A fault tolerant system can easily be configured by coupling two tuples together as shown in Figure 6.
The two Compute Elements run Marathon's patented synchronization technology and execute the operating system and the applications in lock step. Disk mirroring takes place by duplicating writes on the disks of each I/O Processor, thereby providing RAID 1 functionality without a special RAID controller. If one of the Compute Elements should fail, the other Compute Element keeps the system running with only a pause of a few milliseconds to remove the failed Compute Element from the configuration.
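The idea of comparing the output of two Compute Elements running in lock step, and continuing on one when the other fails, can be illustrated as follows. This is not Marathon's implementation, which the article places in hardware and proprietary software; the function and exception names are hypothetical, and `None` is used here as an assumed marker for an element that produced no result.

```python
class ResultMismatch(Exception):
    """Raised when two live Compute Elements disagree on a result."""


def checked_output(result_a, result_b):
    """Return the result to release to the I/O side for one lock-step operation.

    None stands for a Compute Element that has failed (an assumption of this
    sketch). With one element gone, the survivor's result is used unchecked.
    """
    if result_a is None and result_b is None:
        raise RuntimeError("both Compute Elements have failed")
    if result_a is None:
        return result_b
    if result_b is None:
        return result_a
    if result_a != result_b:
        # A two-way comparison shows that something is wrong but not which
        # side is wrong; deciding that needs further tests, omitted here.
        raise ResultMismatch(f"{result_a!r} != {result_b!r}")
    return result_a
```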
The failed Compute Element can then be physically removed, repaired, reconnected and turned on. The Compute Element is then automatically brought back into the configuration by transferring the state of the running Compute Element to the repaired system over the high speed links, and the two are resynchronized. The state of the operating system and applications is maintained through the few seconds it takes to resynchronize the two Compute Elements, thus minimizing the impact on the users. Note that this is very different from the cluster recovery described above. Further, if an I/O Processor fails, the other I/O Processor continues to keep the system running. The failed I/O Processor can then be physically removed, repaired and turned back on.
Since the I/O Processors are not running in lock step, the repaired system must go through a full operating system reboot. After the Marathon software starts running, the repaired I/O Processor automatically rejoins the configuration and the mirrored disks are re-mirrored in background mode over the private 100 Mbit Ethernet connected between the I/O Processors. A failure of one of the mirrored disks is handled through the same process.
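Mirroring by duplicated writes, and re-mirroring after a repair, can be sketched with two in-memory "disks". This is a simplified illustration, not the Marathon software: `MirroredVolume` is a hypothetical name, blocks are Python bytes objects, and the re-mirror step copies everything at once rather than in the background over a private network.

```python
class MirroredVolume:
    """Two copies of the same blocks, one per I/O Processor (toy model)."""

    def __init__(self, n_blocks):
        self.disks = [[b""] * n_blocks, [b""] * n_blocks]
        self.online = [True, True]

    def write(self, block, data):
        """Duplicate the write onto every disk that is currently online."""
        for i, disk in enumerate(self.disks):
            if self.online[i]:
                disk[block] = data

    def read(self, block):
        """Read from the first online disk."""
        for i, disk in enumerate(self.disks):
            if self.online[i]:
                return disk[block]
        raise RuntimeError("no disk online")

    def fail(self, index):
        self.online[index] = False

    def remirror(self, index):
        """Bring a repaired disk back by copying all blocks from the survivor."""
        other = 1 - index
        if not self.online[other]:
            raise RuntimeError("no surviving disk to copy from")
        self.disks[index] = list(self.disks[other])
        self.online[index] = True
```

Writes made while one disk is offline land only on the survivor, which is why the returning disk has to be brought up to date before the pair counts as mirrored again.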
The network connections are also fully redundant and work as follows. Network connections from each I/O Processor are booted with the same MAC address and only one is allowed to transmit messages while both receive messages. In this way each network connection monitors the other through the private Ethernet. Should either network connection fail, the failure will be detected by the I/O Processor and the remaining connection will carry the load. The system manager will also be notified of the failure so a repair can be initiated.
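The arrangement of two connections sharing one MAC address, with a single transmitter and two receivers, can be modeled in a few lines. The sketch below is illustrative only: the class, its methods and the example address are hypothetical, and actual failure detection over the private Ethernet is reduced to an explicit `mark_failed` call.

```python
class RedundantNetworkPair:
    """Two connections (index 0 and 1) that share one MAC address."""

    def __init__(self, mac):
        self.mac = mac
        self.healthy = [True, True]
        self.transmitter = 0  # only this connection is allowed to send
        self.alerts = []      # messages for the system manager

    def mark_failed(self, index):
        """Record a failed connection, alert the manager and, if the failed
        connection was the transmitter, hand transmission to the other one."""
        self.healthy[index] = False
        self.alerts.append(f"network connection {index} failed")
        other = 1 - index
        if index == self.transmitter and self.healthy[other]:
            self.transmitter = other

    def send(self, frame):
        """Send on the single transmitting connection."""
        if not self.healthy[self.transmitter]:
            raise RuntimeError("no healthy connection left")
        return (self.transmitter, self.mac, frame)

    def receivers(self):
        """Every healthy connection receives."""
        return [i for i, ok in enumerate(self.healthy) if ok]
```

Because both connections present the same address, the rest of the network sees no change when transmission moves from one connection to the other.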
Figure 6. Marathon Fault
Although Figure 6 shows both connections on a single network segment, this is not a requirement and each I/O Processor's network connection can be on a different segment of the same network. The system also accommodates multiple networks, each with its own redundant connections. Extending the Marathon system to disaster tolerance requires only that the interconnects between the tuples be fiber; the tuples can then be up to one mile apart. Since the Compute Elements are still synchronized over this distance, the failure of a component or a site will be transparent to the users.
Disaster tolerant systems must offer true fault tolerance rather than just 'high availability.' A high availability system uses Passive Redundancy to 'fail over' a user to an alternative system, a process that can take minutes and in which the user loses the state of the running application. A Fault Tolerant System uses Active Redundancy to 'compute through' a failure, a process that is totally invisible to the user. Such a system has no single point of failure or repair and can go through the Identification, Isolation, Repair and Resynchronization steps without losing the state of the user's running applications. A Disaster Tolerant system provides the same characteristics as a Fault Tolerant system in the presence of the failure of the site where half of the system is located.
Robert Glorioso is President of Marathon Technologies.