
Minimizing the Cost of Computer/Network Downtime

by: Tom Poulter

Introduction

Computers and networks have become an integral part of daily operations for business, education and government organizations alike. For those increasingly dependent on computers and networks for routine operations, downtime or data loss can be devastating, impacting earnings and even market valuation. This paper explores the costs and causes of downtime as well as ways to minimize such costs through downtime prevention, early problem detection and effective recovery capabilities.

The Cost of Computer / Network Downtime

The impact of downtime on an organization ranges from a minor inconvenience to an inability to perform necessary business tasks, with resulting loss of productivity, revenue, and even customers and market share. One survey on downtime reported industry averages of $80,000 per hour of downtime, four hours per outage and nine outages per year, for a loss of nearly $3 million per organization per year. Another survey reported annual losses of $350,000 to $11 million per organization, with an average annual loss of $5 million.
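The first survey's annual figure follows directly from its three averages. The short Python sketch below reproduces that arithmetic; the function name is illustrative only and does not come from the survey.

```python
def annual_downtime_loss(cost_per_hour, hours_per_outage, outages_per_year):
    """Annual loss = hourly cost x average outage length x outages per year."""
    return cost_per_hour * hours_per_outage * outages_per_year


# Industry averages quoted from the first survey.
loss = annual_downtime_loss(80_000, 4, 9)
print(f"${loss:,.0f} per year")  # prints "$2,880,000 per year"
```

The result, $2.88 million, is the "nearly $3 million" cited. Substituting an organization's own hourly cost, outage length and outage frequency gives a more useful figure than the industry average.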

These surveys covered a variety of organizations with different sizes, in different industries and with different problems, but are useful to demonstrate how significant downtime costs have become. An organization trying to reduce its downtime costs needs to focus not on such averages but on its own particular cost structure.

Estimating downtime costs deserves some care. Lost sales opportunities mean the loss of not just the profit on those sales, but of the entire gross margin, which also helps pay marketing, administrative and development costs. The direct labor and benefit costs for idled workers add to the cost unless that work can be deferred during the downtime and caught up afterwards without any overtime or extra cost. The loss of customers and market share has implications for the future as well.
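As a rough illustration of this reasoning, the sketch below estimates an hourly downtime cost from lost gross margin plus idle labor. All parameter names and the sample figures are hypothetical, and the estimate deliberately leaves out the harder-to-quantify loss of customers and market share.

```python
def hourly_downtime_cost(lost_sales_per_hour, gross_margin_rate,
                         idled_workers, loaded_hourly_wage,
                         work_can_be_caught_up=False):
    """Estimate the direct cost of one hour of downtime.

    lost_sales_per_hour   -- sales revenue that cannot be booked during the outage
    gross_margin_rate     -- gross margin as a fraction of sales (0.25 = 25%)
    idled_workers         -- number of employees unable to work
    loaded_hourly_wage    -- wage plus benefits per worker per hour
    work_can_be_caught_up -- True if idle work is recovered later at no extra cost
    """
    # The whole gross margin is lost, not just the net profit on the sales.
    lost_margin = lost_sales_per_hour * gross_margin_rate
    # Idle labor counts only when the work cannot be made up for free.
    idle_labor = 0 if work_can_be_caught_up else idled_workers * loaded_hourly_wage
    return lost_margin + idle_labor


# Hypothetical organization: $20,000/hour in sales at a 25% gross margin,
# 50 idled workers at a loaded cost of $30/hour.
print(hourly_downtime_cost(20_000, 0.25, 50, 30))        # prints 6500.0
print(hourly_downtime_cost(20_000, 0.25, 50, 30, True))  # prints 5000.0
```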

Generally the cost of downtime depends on the frequency and duration of downtime periods as well as the degree to which computers and networks have become "mission critical" to the organization. Below is an outline of some of the factors that affect the cost of downtime.

Timing of downtime:

During a fully manned shift

Outside normal working hours

During times of the day when orders are taken, products shipped, etc.

During month-end closing or seasonal peaks

Duration of downtime:

Seconds, minutes, hours or days

Operations affected (Are they mission-critical applications or just staff tools?):

Order entry (mail order, point-of-sale)

Sales support

Manufacturing operation support

On-line shipping

Corporate Web-site

Accounting, Payroll, A/P, A/R

e-mail

Word processing

Engineering calculations

Training

Speed of Response:

Instant alert and response

Delayed alert and/or response

Third party vendor response time

Ease of Recovery:

Hardware repair time

Fully automated recovery

Semi-automated data recovery

Manual restoration

Causes of Computer / Network Downtime

With improved hardware designs and major gains in component reliability, mainframe hardware failures have declined significantly. The larger portion of downtime is now related to network failures, database and applications software problems and human errors. In moving from central computing to distributed data processing and networks, new potential failure points have been created in small unmanned communications rooms or converted closets.

Another source of downtime is the support equipment for the rooms these systems reside in. Surveys report that between 10 and 30 percent of downtime results from faults in the facility environment and support equipment. This category is unique in that, unlike hardware, software and human errors, there is often a window of minutes to hours in which the fault can be dealt with before computer system function is actually lost.

Minimizing Losses

Clearly it is preferable to try to avoid downtime altogether through careful system design, robust and well tested software, and good operating practices. Since computer installations and operators are not perfect, it is also prudent to structure the system to tolerate faults wherever possible and to automate, or at least facilitate, recovery when faults do occur. "Fail-proof" systems are a help although they can be more difficult to resuscitate when they do crash.

Similar attention needs to be given to the support equipment which provides the environment necessary for reliable operation of the computer system(s). Even fail-proof systems cannot function without power, with excessive heat or when inundated with rising water. Full time monitoring for faults in the environment is important wherever early recognition of the problem would allow correction before the computer system's operation is compromised. Even when downtime cannot be prevented, early recognition is important so that fault correction and recovery are not delayed.

Full time monitoring of the computer systems and networks themselves is also important. Some fault or error messages on the operator’s console deserve to be dealt with promptly even if they occur during hours when the computer facility is unattended. For most computer systems in use today, it is possible to periodically confirm that the computer is responsive and functioning. Telephone equipment (and associated UPS units) that is essential to the functioning of the computer system and the organization also warrants monitoring.
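One simple way to confirm periodically that a computer is responsive is to attempt a network connection to a service it is expected to offer. The Python sketch below is a minimal example under that assumption; the function names are hypothetical, and a successful connection shows only that the port is accepting connections, not that the application behind it is healthy.

```python
import socket


def is_responsive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def unresponsive_targets(targets, timeout=3.0):
    """Return the (host, port) pairs from targets that failed the check."""
    return [(host, port) for host, port in targets
            if not is_responsive(host, port, timeout)]
```

A scheduler could call `unresponsive_targets` every few minutes with the list of critical systems and pass any non-empty result to whatever alerting mechanism is in place, so that a failure during unattended hours is noticed promptly.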

Many of the important considerations are listed below in outline form.

Preventing Downtime

Good System Design - Computer Hardware and Software

Reliable and/or fault tolerant hardware

Minimal number of critical points where one failure crashes entire system

Thoroughly tested and debugged O/S and applications software

Mirrored or RAID storage

On-line backup

Integral journals or audit trails for crash recovery

Utilities available for system health and data structure checks

Good Design Practices - Facility & Support Equipment

Based on system’s needs, have adequate control of:

1. Temperature & temperature rate of change

2. Humidity

3. Airborne particles (dust or smoke)

Anti-static conductive floor surface

Minimal need for personnel traffic in and out of main computer room

UPS with sufficient capacity to permit orderly shutdown or transition to alternate power

Operator and Administrator Training

Zero tolerance for "cockpit errors"

Prompt and correct response to problem situations

Good Discipline

Routine attention to storage and resource allocation

Routine diagnostic checks of system and database structures

Routine virus checks of system and all imported files

Routine maintenance of system and support equipment:

1. Air conditioning system inspected and repaired; clean air filters

2. Generator fuel, oil, coolant & battery condition

Thorough testing of new hardware and repaired hardware

Rigorous testing of software patches, bug fixes, upgrades and revisions

Routine backup of files, on-line if possible

On-site spares (known to be in working order) for all critical equipment

Detecting Failures

Major system / network failures - instantly obvious in a manned facility

Partial failures affecting only a few system functions - may go undetected for hours or even days

Failures during unmanned periods - may go undetected for hours or days

(Many weekend failures are discovered on Monday morning.)

Faults in support equipment - may take minutes to hours to cause a system failure

Early detection can reduce or eliminate downtime

Recovery from Failures

Applications designed or selected to minimize or eliminate the need for manual intervention for recovery

Recovery procedures automated whenever possible

Fault notification organized to minimize response time

Procedures for recovery documented and tested, including:

1. Who is in charge?

2. What personnel are required?

3. Who needs to be notified of the problem?

4. Who needs estimates on when system will return to service?

5. What outside vendor support is available and how to call for it?

Monitoring Practices

Factors to Consider

These factors should be taken into consideration when deciding what to monitor:

1. Organizational Risk

• How is the organization affected by downtime?

• The cost per hour of downtime

• When the organization is most vulnerable: Days? Nights? Weekends?

2. Past Downtime Problems

• Power unreliable

• Air conditioning system failures

• UPS units that come on-line unnoticed until battery is exhausted

• Peripherals going off-line

• Intruders

• Water leaks

3. Budget constraints

• What is the standard for return on investment or cost effectiveness?

• How does the return on investment for downtime cost control compare to competing investment opportunities?

• Is there a hard limit on available budget?

• Will additional budget be available in the future for upgrades?

Last Major Downtime Disaster

Organizations tend to be more motivated to deal with the problem while the memory of a major disaster is still fresh. If you can’t get action now, wait until just after the next disaster. The goal, of course, is to act beforehand to help avoid the next disaster.

Parameters to Consider Monitoring

This list is for critical rooms with a concentration of hardware and support equipment which may be manned part or all of the time. A subset would be appropriate for unmanned communications or network rooms.

Temperature

Within limits in all areas, including under floors

Rate of change not exceeded

Hot spot detection and balancing between A/C zones

Humidity

Within limits in all areas

Air Conditioning/Chiller Systems

Each unit functioning

Fan failure

Discharge temperature

Water Detection

Under computer floor

From floor above computer room

Sprinkler system/heads

Under pipe work or drainage

Around chillers

From roof

Smoke

Computer Room

Under floor

Support areas

Interface to:

1. Building fire detection system

2. Auto fire suppression system

3. VESDA system

Main Power

Voltage each phase (& neutral if required)

Current balance between phases

UPS Status (For all units including telephone system)

Standby/Ready

UPS fault

On bypass

On backup power

Battery charge status

Room Access and Motion Sensing

Access/Denied Log

Multiple Denied-Access Alert

Staff or ghosts in off-limits area

Excessive or unplanned foot traffic in computer room

Room unsafe - Halon or CO2 system armed and/or discharged

Remote Video Monitoring

View unattended sites

Visual verification of alarms

Failed Computer Peripheral

Printer

Tape drives

Disk drives

Computer Systems

Failed process - Abort or freeze of a process

Failed Computer System - System not responding

Operator System/Process warning messages (System resource problems, etc.)

Enterprise management system warnings (Unicenter, Openview, etc.)

Sound Level

Fire, Smoke, other alarms

Head Crash

Telephone System

Impaired or out of service

Network Faults

Switches

Routers

Servers

Backup Generator

Fuel level

Standing by/warming up/delivering power

Coolant failure

Battery charge status
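Several of the parameters above, temperature in particular, call for two kinds of check: that a reading stays within limits and that it does not change too quickly. The Python sketch below shows one way such checks might be expressed. The function name, the alarm format and the sample thresholds are all hypothetical; real limits should come from the equipment manufacturer's specifications.

```python
def temperature_alarms(readings, low, high, max_rate_per_hour):
    """Check time-ordered (hours, temperature) readings against limits.

    Returns a list of (hours, kind, value) tuples, where kind is
    "limit" for an out-of-range reading or "rate" for a change
    faster than max_rate_per_hour since the previous reading.
    """
    alarms = []
    for i, (hours, temp) in enumerate(readings):
        if not low <= temp <= high:
            alarms.append((hours, "limit", temp))
        if i > 0:
            prev_hours, prev_temp = readings[i - 1]
            elapsed = hours - prev_hours
            if elapsed > 0:
                rate = abs(temp - prev_temp) / elapsed
                if rate > max_rate_per_hour:
                    alarms.append((hours, "rate", rate))
    return alarms


# Hypothetical readings taken every half hour, in degrees Celsius.
samples = [(0.0, 21.0), (0.5, 22.0), (1.0, 26.0)]
print(temperature_alarms(samples, low=18.0, high=27.0, max_rate_per_hour=5.0))
# prints [(1.0, 'rate', 8.0)]
```

In this example no reading is out of range, but the jump from 22 to 26 degrees in half an hour is flagged. A rate alarm of this kind can give warning of a failing air conditioner well before the high limit itself is reached.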

Tom Poulter graduated from Stanford University with a BS degree in Physics and has been BTI’s CEO for 30 years. Prior to co-founding BTI Computer Systems, Tom was employed at Hewlett Packard, designing a signal conditioning product line and serving as Systems Product Manager for HP’s bundled computer systems, including HP’s first timesharing system.

Return to Spring 1999's Index | Return to DRJ's Homepage | Email Us