We have observed an increase in the number and variety of customer-facing applications and information sources. Customer-facing applications are those that a business customer interacts with online, over the phone, or at an ATM. Businesses implement these applications to increase efficiency, enhance productivity, and lower costs (both for the customer and for the business). Through customer-facing applications, customers interact more directly with the corporate information resource than at any other time in the history of modern business. Any service outage experienced by users of this type of application has a dramatic impact in terms of lost revenue, lost transactions, and customer defection to competitors.
The role of risk assessment and analysis in disaster recovery and in fault tolerant information systems is to minimize the risk associated with a disaster or a fault/system failure. Both approaches achieve the greatest benefit by avoiding the disaster or fault altogether. When a business cannot carry on its work because the information system, or part of it, has failed, the impact will be one or more of the following:
1. An inconvenience;
2. A temporary loss of productivity and revenue;
3. A severe impact to the business’s financial health;
4. A threat to public and personnel safety.
When situation two, three, or four occurs, the event is a disaster, and businesses should have a plan either to recover from it (disaster recovery) or to avoid it in the first place (disaster tolerance/fault tolerance). The choice between them must be based on a realistic assessment of the potential severity of the consequences of the disaster, weighed against the cost in resources and money to avoid or recover from it.
As organizations of all sizes increasingly depend on information technology, the value of the data, information, and services delivered by automated computer systems will continue to increase dramatically. Regularly backing up data and applications to tape, disk, or a remote site is one proven way to provide an avenue for recovery from a catastrophic loss or disaster. The preferred and most effective solution, however, is to avoid the risk altogether and ensure that the data and computer are always available.
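The backup procedure described above can be sketched in a few lines. This is a minimal illustration, not a production backup tool; the `backup` function name and the timestamped-directory layout are our own assumptions for the example.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup(source: Path, backup_root: Path) -> Path:
    """Copy the source tree into a timestamped directory under backup_root,
    preserving each run as a separate recovery point."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = backup_root / f"{source.name}-{stamp}"
    shutil.copytree(source, dest)
    return dest
```

In practice the destination would be a tape library, a second disk array, or a remote site rather than a local directory, and the schedule would be driven by an operations calendar rather than run ad hoc.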
Fault tolerant information systems, when properly implemented, allow applications to continue processing in the event of a failure without impacting the user, application, network, or operating system, regardless of the nature of the failure. All fault tolerant information systems use redundant components running simultaneously to check for errors and provide continuous processing when a component fails. However, to truly meet the needs of mission critical applications such as data servers, network servers, and web servers, fault tolerant information systems must satisfy the following requirements:
1. The system must uniquely identify any single error or failure;
2. The system must be able to isolate the failure and operate without the failed component;
3. The failed system must be repairable while it continues to perform its intended function;
4. The system must be able to be restored to its original level of redundancy and operational configuration.
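Requirements one and two — uniquely identifying a failure and operating without the failed component — are commonly met by comparing the outputs of redundant components and voting. The sketch below is our own illustration of a simple majority voter over named components, not a description of any particular vendor's mechanism.

```python
from collections import Counter

def vote(outputs: dict[str, int]) -> tuple[int, list[str]]:
    """Majority-vote over the outputs of redundant components.

    Returns the agreed result plus the names of any components whose
    output disagreed with the majority: the dissenters are uniquely
    identified (requirement 1) and can be isolated while the system
    continues with the agreeing components (requirement 2).
    """
    majority, count = Counter(outputs.values()).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: multiple simultaneous failures")
    suspects = [name for name, value in outputs.items() if value != majority]
    return majority, suspects
```

With three redundant components, `vote({"a": 7, "b": 7, "c": 9})` returns `(7, ["c"])`: processing continues with the majority result while component `c` is flagged for repair.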
We have developed a series of best practices guidelines for implementing true information system fault tolerance. These practices are based on years of research, user interviews, and widely accepted concepts of systems and information theory.
Basic computer theory tells us that system reliability can be improved by appropriately employing multiple components (redundancy) to perform the same function. Redundancy can be applied, and therefore should be considered, in terms of both time and space. For example, to improve communications over a noisy phone line, one can repeat the message several times until it gets through. The message takes longer than it would over a clear line, but it arrives. Alternatively, two phones, each carrying the same conversation, provide better reliability than one: if one phone fails, the other can still carry the conversation. The downside of the redundant approach is that it takes either extra time to get the message through or an extra (redundant) communication device. No matter how you look at it, reliability requires redundancy, and redundancy expends either time or resources, neither of which is free. Furthermore, redundancy is only the starting point. It provides the basis on which one can build a reliable or continuously available information system. To provide the most complete protection, eight additional and critical steps must be taken.
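The noisy-phone analogy maps directly onto two code patterns. The sketch below is an illustration of the general idea under our own naming; a `channel` here is any hypothetical function that returns True when the message gets through.

```python
def send_with_retry(channel, message, attempts=5):
    """Time redundancy: repeat the message over one channel
    until it gets through (or the attempt budget is spent)."""
    for _ in range(attempts):
        if channel(message):
            return True
    return False

def send_duplicated(channels, message):
    """Space redundancy: send over every channel at once;
    a single success is enough."""
    return any(channel(message) for channel in channels)
```

The first function spends time (repeated attempts); the second spends resources (extra channels). Either way, reliability is bought with redundancy.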
Minimizing single points of failure provides the basis for ensuring fault tolerant information systems. To minimize single points of failure in any system, redundancy must be applied, as appropriate, in all aspects of the computing infrastructure. The authors have heard the war stories of system managers who were careful to run dual power cords to their computer systems but unfortunately ran them through the same wire channel, creating the opportunity for an unaware service person to accidentally dislodge both cords while servicing the system. Options for avoiding single points of failure include using alternative power sources and RAID disk subsystems to protect the system from being brought down by the failure of either a disk drive or a power supply. The ultimate application of this principle is to duplicate the complete physical facility at a different geographic location to provide a disaster recovery site.
The trade-off between availability and cost should be analyzed during the planning and implementation phases of an information system. For example, it costs more to run a system from multiple power sources or to double the amount of disk used for data storage. A primary consideration is the cost of a highly available system as compared to a conventional system. While some highly available information systems can cost as much as twice, and some fault tolerant systems as much as six times, that of a standalone system, the cost of these systems is small in comparison to the opportunity cost associated with a service outage. In general, the direct and indirect cost of system downtime should determine the amount of investment to be made in system availability, along with the nature of the application and the end user's needs. (For additional information concerning availability see the Availability Classification and Availability Example sidebars at the end of this article.)
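The trade-off can be made concrete with simple arithmetic: compare acquisition cost plus expected annual downtime cost across availability levels. The dollar figures below are hypothetical, chosen only to illustrate the calculation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def expected_downtime_cost(availability: float, cost_per_minute: float) -> float:
    """Expected annual downtime cost for a system with the given availability
    (e.g., 0.99 for 99%) and a given cost per minute of outage."""
    downtime_minutes = (1.0 - availability) * MINUTES_PER_YEAR
    return downtime_minutes * cost_per_minute

# Hypothetical figures: a $50,000 standalone server at 99% availability
# versus a $150,000 highly available system at 99.99%, each at $1,000
# of lost business per minute of downtime.
standalone = 50_000 + expected_downtime_cost(0.99, 1_000)
highly_available = 150_000 + expected_downtime_cost(0.9999, 1_000)
```

Under these assumptions the more expensive system wins decisively: the extra acquisition cost is dwarfed by the downtime it avoids, which is precisely the opportunity-cost argument made above.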
An often overlooked factor that is important to all parts of the system is capacity planning: the analysis of the performance of the various system components to ensure the necessary performance is delivered to the users. A number of questions need to be addressed during this process, such as network loading, peak and average bandwidth required, disk size, memory size, and the speed and number of CPUs required. Care must be taken to address the interactions of all system hardware and application software under the expected system load. This is particularly important when considering high availability failover configurations, where the interrelationship of all applications and middleware must be fully understood. Otherwise, the system could fail over the specific user application but fail to bring over all the necessary supporting middleware, such as the database.
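A first-pass capacity check can be as simple as comparing peak demand against rated capacity for each component, with a planning headroom so that failover load does not saturate the survivor. The 70% headroom rule and component names below are illustrative assumptions, not a standard.

```python
def capacity_check(peak_demand: dict[str, float],
                   capacity: dict[str, float],
                   headroom: float = 0.7) -> dict[str, bool]:
    """For each component, report True when peak utilization stays at or
    below the planning threshold (hypothetical 70% headroom rule), so
    the component can absorb extra load after a failover."""
    return {name: peak_demand[name] / capacity[name] <= headroom
            for name in peak_demand}
```

A component that passes at average load but fails at peak (for example, a network link running at 90% of its bandwidth during the morning rush) is exactly the kind of interaction this analysis is meant to surface before deployment.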
A serial path comprises multiple steps in which the failure of any single step causes a complete system failure. Serial paths exist in operational procedures as well as in the physical system implementation. Application software is often the most critical serial element, because an application software bug cannot be fixed while in operation: the application can be restarted or rebooted, but it cannot be repaired. A well-written application can minimize the opportunity to lose data by employing techniques such as checkpoint and restart, which stores intermediate results when passing data from one process to another so that processing can resume from the last checkpoint, rather than from the beginning, after a failure.
Selecting and managing the software used for critical applications are important steps that must not be overlooked. First, utilities and applications must be stable, as determined by careful selection and testing. Many IT organizations test new versions of critical applications in an offline simulated environment, or in a non-critical part of the organization, for several months before full deployment to minimize the probability of crashing a critical application. The software industry has promoted software upgrades as the pathway to computing heaven; however, the rate of release and complexity of upgrades often exceed the ability of IT managers to fully qualify one upgrade before the next is out. The pressure to upgrade should be resisted; the installation of an unstable application can be more devastating than a physical server meltdown. Likewise, even though management may be pushing to consolidate distributed applications onto fewer servers, consolidation should be approached with great caution, because it can jeopardize the availability of the critical applications. For example, the server can become unstable, or the network may not support the new traffic load on the server. Either of these events can be disastrous.
The physical aspects of the computing environment must be considered when establishing a reliable and safe information system environment. The primary components of the computers and network must be addressed initially. Then consider the physical environment of space, temperature, humidity, and power sources. Most of the time, these factors get attention only when building a new facility and are totally overlooked when making small system changes, installing new systems, or upgrading current ones.
Another key and yet often ignored element in the management of the physical environment is the actual physical security and access. It is a basic element of protection of the business’s information assets. Allowing casual access to critical information systems can result in inadvertent or even intentional system outages.
The processes and procedures used in managing the information infrastructure should provide maximum system availability with minimum interruption in service in the event of a failure. This includes access control, backup policies, virus screening / recovery, staffing, training, and disaster recovery. These processes and procedures should be documented and updated regularly. They should also be exercised and revised, if necessary, at least once a year. Exception procedures are elements of last resort and must be complemented by proper day-to-day operational processes which ensure the proper allocation of system resources via application and operating system tuning. In too many cases processes and procedures are ignored until a crisis. Then it may be too late to avoid a system outage. Finally, remember that even a well-documented process has little value if the operators and system managers have not been trained and updated on a regular basis.
The overall architecture of the system, including the major functions of each subsystem and component, must behave as an integrated whole to accomplish the business goals of the enterprise. The design of any system requires the application of trade-offs and design decisions to implement the architecture. The architecture and design decisions should be documented and managed on an ongoing basis to maintain the system’s architectural and design integrity and also to provide a means for transferring knowledge to new personnel.
Commercially available fault tolerant computers have been around since the 1980s. Historically, they have been characterized as expensive to buy, proprietary in nature, and complex to manage. Today, fault tolerant systems are not necessarily proprietary, but they still tend to be the most expensive. For example, fault tolerant systems based upon the Unix operating environment are more open and somewhat easier to manage, but they can cost four times as much as a standalone solution. Recently, with the advent of commodity PC servers, the NT operating system, and new hardware and software technologies for high availability clustering and fault tolerance, the paradigm is shifting. It is now possible to purchase an NT-based fault tolerant system that costs only two to three times as much as a conventional computer and offers significant savings by avoiding downtime costs.
In this new environment, where fault tolerant solutions are relatively inexpensive and easy to use, there will no longer be barriers to implementing the most appropriate availability-level solution.
The key to deploying a disaster or fault tolerant information system is to assess all the risks and then take the most appropriate action(s). In the case of making your computer applications and data fault tolerant, we recommend IT managers consider the following:
1. Begin with redundancy in hardware and software,
2. Minimize all single points of failure,
3. Choose the right server availability for the job,
4. Employ thorough capacity planning,
5. Eliminate hardware and software serial paths,
6. Carefully select and manage software,
7. Consider all the physical issues,
8. Apply good processes and procedures,
9. Maintain consistent architecture and design control.
The fundamental guideline is not to be distracted by the cost of implementing the proper solution, but rather to look at all the cost factors, including the cost of lost business and customer goodwill. These guidelines provide a basis for IT managers to determine the most appropriate allocation of resources for the highest level of availability consistent with the mission of the enterprise and the cost of downtime.
Harvard Research Group has defined computing environments in terms of the ultimate impact on the activity of the business and the consumer of the service. The six Availability Environment Classifications (AEC) described below define availability in terms of the impact on both the business and the user:
The authors have provided some typical user examples of the various levels of availability requirements:
Disaster Tolerant Computing: A US financial institution's computer-based applications can never be allowed to fail. The institution has indicated that a minute of downtime costs it over $100K. The cost of the system being down far exceeds what it would cost to create and maintain a remote mirrored site.
Fault Tolerant Computing: A cosmetics manufacturer has its entire inventory continually online and accessible via the Internet. Because it is an international company, the inventory needs to be available 24 hours a day, seven days a week. A minute of downtime not only costs thousands of dollars but could also result in lost customers and business opportunities.
Fault Resilient Computing: A large food service company operates its inventory system for 1200 outlets throughout the United States. Inventory changes must be entered and checked 24 hours a day, 7 days a week. The centralized database of food items and supplies must be maintained even as store clerks are entering new transactions. The company can sustain short periods of system unavailability, but nothing longer than five minutes, or inventory control and accuracy will be compromised.
High Availability Computing: A trucking company uses a computer system to record, dispatch, and track the activities of its fleet of 125 trucks. During the early morning hours, all 125 truck drivers have to get their deliveries for the day. It is absolutely essential that this activity occur between the hours of 5 a.m. and 8 a.m., or the business will grind to a halt. Should the computer system fail during this period, the deliveries will not be dispatched in time to meet the day's delivery schedule, and the company's customers, and their customers in turn, will lose money and goodwill. If the system fails at other times during the day, it is not a critical or catastrophic issue.
Reliable Computing: A small consulting firm maintains a database of market history and forecast data. The data is updated on a continuous basis, and its safety and security are important to sustaining and growing the business. The firm has put in place specific processes and procedures for weekly backups of the data, including RAID disk storage subsystems. The consultants and their staff typically use their computers during regular business hours (8 a.m. to 5 p.m.). While they never want their systems to fail, system outages during non-work hours that do not affect the integrity of the data are not a serious issue.
Conventional Computing: A small insurance agency maintains client data online for up to three years. The safety and security of their client data on disk is important, and a process is in place to perform backups. The insurance agents and the staff typically use the computers 9 a.m. to 5 p.m., Monday through Friday. While the agency president does not want the system to fail, he feels it is possible to work around any outages that make the system unavailable or affect the integrity of the online data.
Robert M. Glorioso, Director, President and CEO of Marathon Technologies, has 30 years of experience in systems and computer technologies education, development, marketing, and management. He has authored or co-authored four books, several papers, and three patents.
Robert E. Desautels, the founder of the Harvard Research Group, is a senior industry consultant with extensive experience in both tactical and strategic sales and marketing. Mr. Desautels has consulted and written on topics including the world market potential for new and emerging technologies, highly available servers, system software and utilities, and storage devices.
If you would like additional information on High Availability or Fault Tolerance, here are some places to look: Harvard Research Group at www.hrgresearch.com and also Marathon Technologies at www.marathontechnologies.com.