Minimizing the Cost of Computer/Network Downtime
Computers and networks have become an integral part of daily operations for business, education and government organizations alike. For those increasingly dependent on computers and networks for routine operations, downtime or data loss can be devastating, impacting earnings and even market valuation. This paper explores the costs and causes of downtime as well as ways to minimize such costs through downtime prevention, early problem detection and effective recovery capabilities.

The Cost of Computer / Network Downtime

The impact of downtime on an organization ranges from a minor inconvenience to an inability to perform necessary business tasks with resulting loss of productivity, revenue, and even customers and market share. One survey on downtime reported industry average numbers of $80,000 per hour, four hours average downtime and nine occurrences per year for a loss of nearly $3 million per organization per year. Another survey reported annual losses of $350 thousand to $11 million per organization with an average annual loss of $5 million. These surveys covered a variety of organizations with different sizes, in different industries and with different problems, but are useful to demonstrate how significant downtime costs have become. An organization trying to reduce its downtime costs needs to focus not on such averages but on its own particular cost structure.

Estimating downtime costs deserves some care. Lost sales opportunities mean the loss of not just the profit on those sales, but of the entire gross margin which also helped pay marketing, administrative and development costs. The direct labor and benefit costs for idled workers will add to the cost unless that work can be deferred during the downtime and caught up afterwards without any overtime or extra cost. The loss of customers and market share has implications for the future as well.
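The arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration: the function names and the per-organization inputs in the usage note are hypothetical, and only the survey averages ($80,000 per hour, four hours, nine occurrences) come from the text above.

```python
def annual_downtime_cost(cost_per_hour, hours_per_incident, incidents_per_year):
    """Annual loss = hourly cost x average outage length x outages per year."""
    return cost_per_hour * hours_per_incident * incidents_per_year


def hourly_downtime_cost(lost_sales_per_hour, gross_margin_rate,
                         idle_workers, loaded_wage_per_hour,
                         work_recoverable=False):
    """Rough hourly cost: lost gross margin plus idle labor (hypothetical model).

    Idle labor is counted only when the work cannot be caught up later
    without overtime or other extra cost.
    """
    lost_margin = lost_sales_per_hour * gross_margin_rate
    if work_recoverable:
        idle_labor = 0.0
    else:
        idle_labor = idle_workers * loaded_wage_per_hour
    return lost_margin + idle_labor


# Survey averages quoted in the text: $80,000/hour, 4 hours, 9 times a year.
survey_loss = annual_downtime_cost(80_000, 4, 9)
print(f"Survey-average annual loss: ${survey_loss:,.0f}")
```

An organization would replace the survey averages with its own figures, for example by passing the result of hourly_downtime_cost(...) as the first argument of annual_downtime_cost(...). Losses of customers and market share are not captured by this simple model.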
Generally the cost of downtime depends on the frequency and duration of downtime periods as well as the degree to which computers and networks have become "mission critical" to the organization. Below is an outline of some of the factors which affect the cost of downtime.

Timing of downtime:
    During a fully manned shift
    Outside normal working hours
    During times of the day when orders are taken, products shipped, etc.
    During month-end closing or seasonal peaks
Duration of downtime:
    Seconds, minutes, hours or days
Operations affected (Are they mission critical applications or just staff tools?):
    Order entry (mail order, point-of-sale)
    Sales support
    Manufacturing operation support
    On-line shipping
    Corporate Web-site
    Accounting, Payroll, A/P, A/R
    Word processing
    Engineering calculations
    Training
Speed of Response:
    Instant alert and response
    Delayed alert and/or response
    Third party vendor response time
Ease of Recovery:
    Hardware repair time
    Fully automated recovery
    Semi-automated data recovery
    Manual restoration

Causes of Computer / Network Downtime

With improved hardware designs and major gains in component reliability, mainframe hardware failures have declined significantly. The larger portion of downtime is now related to network failures, database and applications software problems and human errors. In moving from central computing to distributed data processing and networks, new potential failure points have been created in small unmanned communications rooms or converted closets.

Another source of downtime is the support equipment for the rooms these systems reside in. Surveys report that between 10 percent and 30 percent of downtime results from faults in the facility environment and support equipment. This category is unique in that, unlike hardware, software and human errors, there is often a window of minutes to hours in which the fault can be dealt with before computer system function is actually lost.
Minimizing Losses

Clearly it is preferable to avoid downtime altogether through careful system design, robust and well-tested software, and good operating practices. Since computer installations and operators are not perfect, it is also prudent to structure the system to tolerate faults wherever possible and to automate, or at least facilitate, recovery when faults do occur. "Fail-proof" systems help, although they can be more difficult to resuscitate when they do crash.

Similar attention needs to be given to the support equipment which provides the environment necessary for reliable operation of the computer system(s). Even fail-proof systems cannot function without power, in excessive heat or when inundated by rising water. Full-time monitoring for faults in the environment is important wherever early recognition of the problem would allow correction before the computer system's operation is compromised. Even when downtime cannot be prevented, early recognition is important so that fault correction and recovery are not delayed.

Full-time monitoring of the computer systems and networks themselves is also important. Some fault or error messages on the operator's console deserve to be dealt with promptly even if they occur during hours when the computer facility is unattended. For most computer systems in use today, it is possible to periodically confirm that the computer is responsive and functioning. Telephone equipment (and associated UPS units) that is essential to the functioning of the computer system and the organization also warrants monitoring. Many of the important considerations are listed below in outline form.
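Before turning to the outline, the periodic responsiveness check mentioned above can be illustrated with a minimal Python sketch. It assumes that a successful TCP connection to some service port is an acceptable sign that a system is responsive, which is only one possible definition. The function names, the polling interval and the three-failure alert rule are illustrative assumptions, not part of the original paper.

```python
import socket
import time


def is_responsive(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def watch(probe, interval=60.0, failures_before_alert=3,
          max_checks=None, alert=print):
    """Call probe() periodically; alert once after N consecutive failures.

    probe is any zero-argument callable returning True when the system
    responds. max_checks=None means run until interrupted.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if probe():
            failures = 0
        else:
            failures += 1
            if failures == failures_before_alert:
                alert(f"No response for {failures} consecutive checks")
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval)


# Example (hypothetical host and port), checking once a minute:
# watch(lambda: is_responsive("server.example.com", 22), interval=60.0)
```

Requiring several consecutive failures before alerting is a design choice that trades a slightly later alarm for fewer false alarms from a single dropped connection. In practice the alert callable would page or telephone someone rather than print, so that failures during unattended hours are not left until the next shift.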
Preventing Downtime

Good System Design - Computer Hardware and Software
    Reliable and/or fault tolerant hardware
    Minimal number of critical points where one failure crashes entire system
    Thoroughly tested and debugged O/S and applications software
    Mirrored or RAID storage
    On-line backup
    Integral journals or audit trails for crash recovery
    Utilities available for system health and data structure checks
Good Design Practices - Facility & Support Equipment
    Based on systems needs, have adequate control of:
        1. Temperature & temperature rate of change
        2. Humidity
        3. Airborne particles (dust or smoke)
    Anti-static conductive floor surface
    Minimal need for personnel traffic in and out of main computer room
    UPS with sufficient capacity to permit orderly shutdown or transition to alternate power
Operator and Administrator Training
    Zero tolerance for "cockpit errors"
    Prompt and correct response to problem situations
Good Discipline
    Routine attention to storage and resource allocation
    Routine diagnostic checks of system and database structures
    Routine virus checks of system and all imported files
    Routine maintenance of system and support equipment:
        1. Air conditioning system inspected and repaired; clean air filters
        2. Generator fuel, oil, coolant & battery condition
    Thorough testing of new hardware and repaired hardware
    Rigorous testing of software patches, bug fixes, upgrades and revisions
    Routine backup of files, on-line if possible
    On-site spares (known to be in working order) for all critical equipment

Detecting Failures
    Major system / network failures - instantly obvious in a manned facility
    Partial failures affecting only a few system functions - may go undetected for hours or even days
    Failures during unmanned periods - may go undetected for hours or days (Many weekend failures are discovered on Monday morning.)
    Faults in support equipment - may take minutes to hours to cause a system failure
    Early detection can reduce or eliminate downtime

Recovery from Failures
    Applications designed or selected to minimize or eliminate the need for manual intervention for recovery
    Recovery procedures automated whenever possible
    Fault notification organized to minimize response time
    Procedures for recovery documented and tested, including:
        1. Who is in charge?
        2. What personnel are required?
        3. Who needs to be notified of the problem?
        4. Who needs estimates on when system will return to service?
        5. What outside vendor support is available and how to call for it?

Monitoring Practices

Factors to Consider

These factors should be taken into consideration when deciding what to monitor:

1. Organizational Risk
    How is the organization affected by downtime?
    The cost per hour of downtime
    When the organization is most vulnerable: Days? Nights? Weekends?
2. Past Downtime Problems
    Power unreliable
    Air conditioning system failures
    UPS units that come on-line unnoticed until battery is exhausted
    Peripherals going off-line
    Intruders
    Water leaks
3. Budget Constraints
    What is the standard for return on investment or cost effectiveness?
    How does the return on investment for downtime cost control compare to competing investment opportunities?
    Is there a hard limit on available budget?
    Will additional budget be available in the future for upgrades?
4. Last Major Downtime Disaster
    Organizations tend to be more motivated to deal with the problem while the memory of a major disaster is still fresh. If you can't get action now, wait until just after the next disaster. The goal, of course, is to act beforehand and help avoid the next disaster.

Parameters to Consider Monitoring

This list is for critical rooms with a concentration of hardware and support equipment which may be manned part or all of the time. A subset would be appropriate for unmanned communications or network rooms.
Temperature
    Within limits in all areas, including under floors
    Rate of change not exceeded
    Hot spot detection and balancing between A/C zones
Humidity
    Within limits in all areas
Air Conditioning/Chiller Systems
    Each unit functioning
    Fan failure
    Discharge temperature
Water Detection
    Under computer floor
    From floor above computer room
    Sprinkler system/heads
    Under pipe work or drainage
    Around chillers
    From roof
Smoke
    Computer Room
    Under floor
    Support areas
    Interface to:
        1. Building fire detection system
        2. Auto fire suppression system
        3. VESDA system
Main Power
    Voltage each phase (& neutral if required)
    Current balance between phases
UPS Status (For all units including telephone system)
    Standby/Ready
    UPS fault
    On bypass
    On backup power
    Battery charge status
Room Access and Motion Sensing
    Access/Denied Log
    Multiple Denied-Access Alert
    Staff or ghosts in off limits area
    Excessive or unplanned foot traffic in computer room
    Room unsafe - Halon or CO2 system armed and/or discharged
Remote Video Monitoring
    View unattended sites
    Visual verification of alarms
Failed Computer Peripheral
    Printer
    Tape drives
    Disk drives
Computer Systems
    Failed process - Abort or freeze of a process
    Failed Computer System - System not responding
    Operator System/Process warning messages (System resource problems, etc.)
    Enterprise management system warnings (Unicenter, Openview, etc.)
Sound Level
    Fire, Smoke, other alarms
    Head Crash
Telephone System
    Impaired or out of service
Network Faults
    Switches
    Routers
    Servers
Backup Generator
    Fuel level
    Standing by/warming up/delivering power
    Coolant failure
    Battery charge status
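As an illustration of the first two temperature items in the list above ("within limits" and "rate of change not exceeded"), the following Python sketch checks a series of timestamped readings against both kinds of limit. The Limits class, the function name and the numeric limits in the example are hypothetical; actual limits would come from the equipment manufacturers' specifications.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    low: float       # lowest acceptable reading
    high: float      # highest acceptable reading
    max_rate: float  # largest acceptable change per hour


def check_readings(readings, limits):
    """Return alarm messages for (hours, value) samples given in time order."""
    alarms = []
    previous = None
    for t, value in readings:
        # Absolute limit check.
        if not limits.low <= value <= limits.high:
            alarms.append(
                f"t={t}h: reading {value} outside {limits.low}-{limits.high}")
        # Rate-of-change check against the previous sample.
        if previous is not None:
            prev_t, prev_value = previous
            elapsed = t - prev_t
            if elapsed > 0:
                rate = abs(value - prev_value) / elapsed
                if rate > limits.max_rate:
                    alarms.append(
                        f"t={t}h: rate of change {rate:.1f}/h exceeds "
                        f"{limits.max_rate}/h")
        previous = (t, value)
    return alarms


# Hypothetical limits: 18-27 degrees C, at most 5 degrees per hour.
room = Limits(low=18.0, high=27.0, max_rate=5.0)
samples = [(0.0, 21.0), (0.5, 22.0), (1.0, 26.0), (1.5, 29.0)]
for message in check_readings(samples, room):
    print(message)
```

In this example the rate-of-change alarm fires at the third sample, while the temperature is still within limits. This is the kind of early warning that makes use of the minutes-to-hours window available for support equipment faults. The same pattern applies to humidity, phase voltage or generator fuel level.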