The term 'assured availability', coined by Marathon Technologies Corporation to describe its architecture for delivering full fault tolerance and disaster tolerance, best describes the full measure of this yardstick. How this translates into technology, and how that technology compares with technology delivering other levels of availability, depends on the types of faults or disruptions you want to tolerate. This is particularly true for 9-1-1 systems, which by definition are most likely to be called into service under the most extreme circumstances, such as a disaster or catastrophic event near the deployment locale. In other words, defining levels of availability and fault tolerance for 9-1-1 systems, and consequently for all other systems in this global, Internet-connected world, depends on defining the breadth and depth of requirements.
Several missteps typically accompany the analysis of fault tolerance requirements and of the level of fault tolerance any particular technology delivers.
First and foremost, most people don't assess their risk or plan adequately for disruptions. According to the available literature, roughly 80% of computer-dependent entities that suffer major system disruptions have not recovered fully many months after the incident. It would be nice to assume that organizations running 9-1-1 systems, people who deal with disasters and disruptions on a day-to-day basis, plan for such eventualities. Though it suffered the unfortunate incident described above, New York City has an exemplary operation and plan for dealing with disasters and disruptions, including geographically separated and redundant computer centers. That may not be the case for all emergency dispatch operations. Budgetary constraints, very common for state and local governments, may preclude such elaborate preparations. In these circumstances, even a cursory analysis and a plan for continuing the most critical operations will prove useful. Moreover, 9-1-1 system operations must be able to accommodate disruptions that last for long periods of time. A survey conducted by ICR Survey Research Group regarding the vulnerability of computers found that 66% had suffered some sort of disruption lasting over an hour. A closer examination of this group revealed that one in five of these disruptions had lasted longer than 24 hours.
Second on the list of missteps is the myopic scope of most analyses. Any review and plan ought to take into account all possible sources of faults and disasters, including hardware, software (both the operating system and the application), people, operations, environment, and communications. Any 9-1-1 system worth the label of assured availability must tolerate all manner of possible interruptions, ranging from a cup of coffee spilled on a server to floods, fires, hacks, and riots. Though not a 9-1-1 service in the conventional sense, a Department of Defense agency (which will remain nameless), from which we all would want a quick response, lost message communication capabilities for nearly three days as the result of a simple hard disk crash complicated by a black comedy of errors, missteps, and faulty or missing procedures.
Unfortunately, most organizations underestimate the impact of a significant computer disaster, not to mention the interrelationships and interdependencies in computer-centric operations. In addition, even individuals armed with this knowledge often perceive disasters as something that happens to other people. Therein lies a paradox. The purpose of disaster-tolerant and fault-tolerant computer systems, as well as of disaster recovery plans in general, is that if the unthinkable does happen, the system, in the broadest sense of the word, will continue to support critical activities, minimize loss, and aid a speedy recovery. And nothing is more critical than emergency dispatch.
Third, popular wisdom posits that the ability to tolerate a computing disruption depends on the purchase of reliable hardware. Although reliable hardware is important, evidence suggests that hardware failures make up only a minority of disasters. One study indicated that only 22 percent of disasters were related to computer hardware malfunctions. Power surges, earthquakes, floods, fires, explosions, riots, and the like, all situations that will demand continuous operation of 9-1-1 systems, cause the vast majority of disruptions. This indicates that any technology purporting to deliver assured availability ought to tolerate disruptions emanating from outside the four walls of the computer room.
Finally, most people think of fault tolerance in terms of 'recovery' rather than 'continuity.' Moreover, organizations seem to consider 'recovery' less expensive and easier to accomplish than 'continuity.' There is a need for new thinking here. Exploding dependence on networking and computing increases the costs and consequences of disruptions and, as a result, our sensitivity and vulnerability to them. To draw an analogy that conjures visions of the need for a 9-1-1 system, managing a mission-critical computing system is like driving an 18-wheeler at 75 miles an hour in bumper-to-bumper traffic. If you have a blowout, would you rather recover or continue rolling down the road? It seems that the trucking industry has the right idea. Redundant, assured availability tires on redundant, assured availability axles are far less expensive, tidier, and easier to manage and maintain than recovery from a high-speed truck wreck.
Harvard Research Group (HRG), an analyst firm that has built its reputation on taking the user perspective when assessing the levels of availability provided by technology, concurs with the notion that organizations must adopt systems that deliver the degree of availability appropriate for each task. To that end, HRG believes that vendors and customers need a clear, shared understanding of how each component contributes to reducing system outages. Unfortunately, the dialog has been muddled because every vendor describes its products using its own definitions for such labels as 'fault tolerance,' 'fault resilience,' and 'continuous availability.' Therefore, to help alleviate the confusion, HRG has defined availability in terms of the impact that a system being unavailable to perform work has on the activity of the business and on the consumer (end user) of the service, rather than in terms of the technologies used to achieve it.
- Fault Tolerant (AE-4) - Business functions that demand continuous computing and where any failure is transparent to the user. This means no interruption of work, no transactions lost, no degradation in performance, and continuous 24x7 operation.
- Fault Resilient (AE-3) - Business functions that require uninterrupted computing services, either during essential time periods, or during most hours of the day and most days of the week throughout the year. This means that the user stays on-line. However, the current transaction may need restarting and users may experience performance degradation.
- High Availability (AE-2) - Business functions that allow minimally interrupted computing services, either during essential time periods, or during most hours of the day and most days of the week throughout the year. This means users will be interrupted but can quickly log on again. However, they may have to rerun some transactions from a journal file and they may experience performance degradation.
- Highly Reliable (AE-1) - Business functions that can be interrupted as long as the integrity of the data is ensured. To the user, work stops and an uncontrolled shutdown occurs. However, data integrity is ensured.
- Conventional (AE-0) - Business functions that can be interrupted and where the integrity of the data is not essential. To the user, work stops and an uncontrolled shutdown occurs. Data may be lost or corrupted.
Disaster Recovery (DR) is a horizontal availability feature that is applicable to any of the Availability Environments (AEs). It provides for remote backup of the information system and makes it safe from disasters such as an earthquake, fire, flood, hurricane, power failure, vandalism, or an act of terrorism.
(Implicit in Harvard Research Group's definition of Fault-Tolerance (AE-4) is geographic or site disaster tolerance.)
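One way to make the classification concrete is to treat the AE levels as an ordered scale and pick the lowest level whose description meets a given set of needs. The Python sketch below is illustrative only and is not part of HRG's material: the `Requirements` fields, the function names, and the idea of a yes/no questionnaire are assumptions introduced here, loosely paraphrasing the bullet definitions above.

```python
from dataclasses import dataclass
from enum import IntEnum


class AE(IntEnum):
    """HRG Availability Environments, ordered from weakest to strongest."""
    CONVENTIONAL = 0
    HIGHLY_RELIABLE = 1
    HIGH_AVAILABILITY = 2
    FAULT_RESILIENT = 3
    FAULT_TOLERANT = 4


@dataclass(frozen=True)
class Requirements:
    """Hypothetical questionnaire about what an outage may do to the user."""
    data_integrity_required: bool       # data must survive an interruption
    interruption_must_be_minimal: bool  # users may be logged off, but only briefly
    user_must_stay_online: bool         # at worst, a transaction is restarted
    failure_must_be_transparent: bool   # no lost work, no degradation, 24x7


def required_level(req: Requirements) -> AE:
    """Return the lowest AE level that satisfies the stated requirements."""
    if req.failure_must_be_transparent:
        return AE.FAULT_TOLERANT
    if req.user_must_stay_online:
        return AE.FAULT_RESILIENT
    if req.interruption_must_be_minimal:
        return AE.HIGH_AVAILABILITY
    if req.data_integrity_required:
        return AE.HIGHLY_RELIABLE
    return AE.CONVENTIONAL


def is_adequate(offered: AE, offered_dr: bool,
                req: Requirements, needs_dr: bool) -> bool:
    """Check an offering against the requirements.

    DR is treated as a separate yes/no feature because it applies
    horizontally to any AE level.
    """
    return offered >= required_level(req) and (offered_dr or not needs_dr)


# Example: an emergency dispatch operation that cannot tolerate any visible failure.
dispatch = Requirements(
    data_integrity_required=True,
    interruption_must_be_minimal=True,
    user_must_stay_online=True,
    failure_must_be_transparent=True,
)
print(required_level(dispatch).name)
print(is_adequate(AE.FAULT_RESILIENT, True, dispatch, needs_dr=True))
```

Under these assumptions, the dispatch example maps to the Fault Tolerant level, and an AE-3 offering is rejected even when it includes DR. Keeping DR as a separate flag rather than a sixth level mirrors its description as a feature that cuts across all of the environments.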
The descriptions of prospective operational environments above and the costs and consequences of downtime dictate that 9-1-1 systems need technologies that satisfy HRG's definitions for Fault Tolerance (AE-4), which implies geographic or site Disaster Tolerance, and Disaster Recovery (DR).
This is what is meant by assured availability.
Assured availability solutions provide protection against lost revenue, threats to life and limb, liability exposure, lost productivity, customer dissatisfaction, regulatory violations, damage to assets, and career damage. They provide this protection and continuity at every moment, without interruption. They include a full complement of the best hardware, software, operations, environment, and communications, designed to deliver continuity of service for computer-dependent operations through all manner of disruptions, large and small, from any source. Finally, from the perspective of cost and ease of use, it is preferable to base assured availability solutions on industry-standard hardware, unmodified off-the-shelf operating systems, and shrink-wrapped application software.
In short, assured availability is both the platform and the solution that supports and enables continuity rather than recovery for 9-1-1 systems, something that you can bet your life on.
Craig Jon Anderson, Director of New Business Development for Marathon, is responsible for the marketing and new business efforts of the company. With over 20 years of experience in the industry, his background includes high-tech business development with Java, CORBA, object-oriented and RPC client-server development tools, Internet services, fault tolerance, and transaction processing. Craig's industry experience ranges from publishing and telecommunications to healthcare and banking.