Introduction

Computers and networks have become an integral part of daily operations for business, education and government organizations alike. For those increasingly dependent on computers and networks for routine operations, downtime or data loss can be devastating, impacting earnings and even market valuation. This paper explores the costs and causes of downtime as well as ways to minimize such costs through downtime prevention, early problem detection and effective recovery capabilities.

The Cost of Computer / Network Downtime

The impact of downtime on an organization ranges from a minor inconvenience to an inability to perform necessary business tasks, with a resulting loss of productivity, revenue, and even customers and market share. One survey on downtime reported industry averages of $80,000 per hour, four hours per outage and nine occurrences per year, for a loss of nearly $3 million per organization per year. Another survey reported annual losses of $350,000 to $11 million per organization, with an average annual loss of $5 million.
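The first survey's annual figure is simply the product of its three averages. A minimal Python sketch of that arithmetic follows; the function name is illustrative and does not come from the survey.

```python
def annual_downtime_loss(cost_per_hour, hours_per_outage, outages_per_year):
    """Annual loss = hourly cost x average outage length x outages per year."""
    return cost_per_hour * hours_per_outage * outages_per_year

# Survey averages quoted above: $80,000 per hour, 4 hours, 9 occurrences per year.
survey_loss = annual_downtime_loss(80_000, 4, 9)
print(f"${survey_loss:,}")  # $2,880,000, i.e. "nearly $3 million"
```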

These surveys covered a variety of organizations with different sizes, in different industries and with different problems, but are useful to demonstrate how significant downtime costs have become. An organization trying to reduce its downtime costs needs to focus not on such averages but on its own particular cost structure.

Estimating downtime costs deserves some care. Lost sales opportunities mean the loss of not just the profit on those sales, but of the entire gross margin, which also helps pay marketing, administrative and development costs. The direct labor and benefit costs for idled workers add to the cost unless the work can be deferred during the downtime and caught up afterwards without overtime or other extra cost. The loss of customers and market share has implications for the future as well. Generally, the cost of downtime depends on the frequency and duration of downtime periods as well as the degree to which computers and networks have become "mission critical" to the organization. Below is an outline of some of the factors which affect the cost of downtime.

Timing of downtime:

During a fully manned shift
Outside normal working hours
During times of the day when orders are taken, products shipped, etc.
During month-end closing or seasonal peaks

Duration of downtime:

Seconds, minutes, hours or days

Operations affected (are they mission-critical applications or just staff tools?):

Order entry (mail order, point-of-sale)
Sales support
Manufacturing operation support
On-line shipping
Corporate Web-site
Accounting, Payroll, A/P, A/R
e-mail
Word processing
Engineering calculations
Training

Speed of Response:

Instant alert and response
Delayed alert and/or response
Third party vendor response time

Ease of Recovery:

Hardware repair time
Fully automated recovery
Semi-automated data recovery
Manual restoration
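The estimating guidance above (count the whole gross margin on lost sales, and count idle labor unless the work can be caught up at no extra cost) can be sketched as a small calculation. This is a rough illustration only: the function, its parameters and the sample inputs are hypothetical, and a real estimate would also weigh the timing, recovery and customer-loss factors in the outline.

```python
def downtime_cost(hours, lost_sales_per_hour, gross_margin_rate,
                  idle_workers, loaded_wage_per_hour, work_deferrable=False):
    """Rough cost of one outage: lost gross margin plus idle labor."""
    # Count the entire gross margin on lost sales, not just the net profit.
    lost_margin = hours * lost_sales_per_hour * gross_margin_rate
    # Idle labor counts unless the work can be caught up later at no extra cost.
    if work_deferrable:
        idle_labor = 0.0
    else:
        idle_labor = hours * idle_workers * loaded_wage_per_hour
    return lost_margin + idle_labor

# Hypothetical inputs, for illustration only.
example = downtime_cost(hours=4, lost_sales_per_hour=10_000,
                        gross_margin_rate=0.5, idle_workers=50,
                        loaded_wage_per_hour=40)
print(example)
```

Passing work_deferrable=True drops the labor term, which mirrors the case where idled staff can catch up later without overtime.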

Causes of Computer / Network Downtime

With improved hardware designs and major gains in component reliability, mainframe hardware failures have declined significantly. The larger portion of downtime is now related to network failures, database and applications software problems and human errors. In moving from central computing to distributed data processing and networks, new potential failure points have been created in small unmanned communications rooms or converted closets.

Another source of downtime is the support equipment for the rooms these systems reside in. Surveys report that between 10 percent and 30 percent of downtime results from faults in the facility environment and support equipment. This category is unique in that, unlike hardware, software and human errors, there is often a window of minutes to hours in which the fault can be dealt with before computer system function is actually lost.

Minimizing Losses

Clearly it is preferable to try to avoid downtime altogether through careful system design, robust and well tested software, and good operating practices. Since computer installations and operators are not perfect, it is also prudent to structure the system to tolerate faults wherever possible and to automate, or at least facilitate, recovery when faults do occur. "Fail-proof" systems are a help although they can be more difficult to resuscitate when they do crash.

Similar attention needs to be given to the support equipment which provides the environment necessary for reliable operation of the computer system(s). Even fail-proof systems cannot function without power, with excessive heat or when inundated with rising water. Full-time monitoring for faults in the environment is important wherever early recognition of the problem would allow correction before the computer system's operation is compromised. Even when downtime cannot be prevented, early recognition is important so that fault correction and recovery are not delayed.

Full-time monitoring of the computer systems and networks themselves is also important. Some fault or error messages on the operator’s console deserve to be dealt with promptly even if they occur during hours when the computer facility is unattended. For most computer systems in use today, it is possible to periodically confirm that the computer is responsive and functioning. Telephone equipment (and associated UPS units) that is essential to the functioning of the computer system and the organization also warrants monitoring.
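One simple way to confirm periodically that a system is responsive is to attempt a TCP connection to a service it is expected to offer. The sketch below uses Python's standard socket module; the function names are illustrative, and the hosts and ports to probe are left to the reader. A successful connection shows only that the port is accepting connections, not that the application behind it is healthy.

```python
import socket

def is_responsive(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

def poll_once(targets, timeout=3.0):
    """Check each (host, port) target once and return the ones that failed."""
    return [t for t in targets if not is_responsive(t[0], t[1], timeout)]
```

Running poll_once from a scheduler every few minutes, and notifying someone whenever the returned list is non-empty, covers the unattended hours mentioned above.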

Many of the important considerations are listed below in outline form.

Preventing Downtime


Good System Design - Computer Hardware and Software

Reliable and/or fault tolerant hardware
Minimal number of critical points where one failure crashes entire system
Thoroughly tested and debugged O/S and applications software
Mirrored or RAID storage
On-line backup
Integral journals or audit trails for crash recovery
Utilities available for system health and data structure checks

Good Design Practices - Facility & Support Equipment

Based on system’s needs, have adequate control of:
1. Temperature & temperature rate of change
2. Humidity
3. Airborne particles (dust or smoke)
Anti-static conductive floor surface
Minimal need for personnel traffic in and out of main computer room
UPS with sufficient capacity to permit orderly shutdown or transition to alternate power

Operator and Administrator Training

Zero tolerance for "cockpit errors"
Prompt and correct response to problem situations

Good Discipline

Routine attention to storage and resource allocation
Routine diagnostic checks of system and database structures
Routine virus checks of system and all imported files
Routine maintenance of system and support equipment:
1. Air conditioning system inspected and repaired; air filters cleaned
2. Generator fuel, oil, coolant & battery condition
Thorough testing of new hardware and repaired hardware
Rigorous testing of software patches, bug fixes, upgrades and revisions
Routine backup of files, on-line if possible
On-site spares (known to be in working order) for all critical equipment

Detecting Failures

Major system / network failures - instantly obvious in a manned facility
Partial failures affecting only a few system functions - may go undetected for hours or even days
Failures during unmanned periods - may go undetected for hours or days
(Many weekend failures are discovered on Monday morning.)
Faults in support equipment - may take minutes to hours to cause a system failure
Early detection can reduce or eliminate downtime

Recovery from Failures

Applications designed or selected to minimize or eliminate the need for manual intervention for recovery
Recovery procedures automated whenever possible
Fault notification organized to minimize response time
Procedures for recovery documented and tested, including:
1. Who is in charge?
2. What personnel are required?
3. Who needs to be notified of the problem?
4. Who needs estimates on when system will return to service?
5. What outside vendor support is available and how to call for it?

Monitoring Practices

Factors to Consider

These factors should be taken into consideration when deciding what to monitor:

1. Organizational Risk

  • How is the organization affected by downtime?
  • The cost per hour of downtime
  • When the organization is most vulnerable: Days? Nights? Weekends?

2. Past Downtime Problems

  • Power unreliable
  • Air conditioning system failures
  • UPS units that come on-line unnoticed until battery is exhausted
  • Peripherals going off-line
  • Intruders
  • Water leaks

3. Budget constraints

  • What is the standard for return on investment or cost effectiveness?
  • How does the return on investment for downtime cost control compare to competing investment opportunities?
  • Is there a hard limit on available budget?
  • Will additional budget be available in the future for upgrades?

4. Last Major Downtime Disaster

Organizations tend to be more motivated to deal with the problem while the memory of a major disaster is still fresh. If you can’t get action now, wait until just after the next disaster. The goal, of course, is to act beforehand and avoid the next disaster.

Parameters to Consider Monitoring

This list is for critical rooms with a concentration of hardware and support equipment which may be manned part or all of the time. A subset would be appropriate for unmanned communications or network rooms.

Temperature

Within limits in all areas, including under floors
Rate of change not exceeded
Hot spot detection and balancing between A/C zones
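The limit and rate-of-change checks listed above can be sketched as follows. This is an illustration under stated assumptions: readings arrive as (hours elapsed, temperature) pairs in time order, and the limits and sample values are hypothetical rather than recommendations for any particular equipment.

```python
def temperature_alerts(readings, low, high, max_rate_per_hour):
    """Check time-ordered (hours_elapsed, temperature) readings.

    Returns a list of (hours_elapsed, reason) tuples.
    """
    alerts = []
    prev = None
    for t, temp in readings:
        if temp < low or temp > high:
            alerts.append((t, "out of limits"))
        if prev is not None and t > prev[0]:
            # Rate of change between consecutive readings, in degrees per hour.
            rate = abs(temp - prev[1]) / (t - prev[0])
            if rate > max_rate_per_hour:
                alerts.append((t, "rate of change exceeded"))
        prev = (t, temp)
    return alerts

# Hypothetical limits and readings, for illustration only.
sample = [(0.0, 21.0), (0.5, 21.5), (1.0, 25.0), (1.5, 29.0)]
alerts = temperature_alerts(sample, low=18.0, high=27.0, max_rate_per_hour=5.0)
print(alerts)
```

In this sample the rate-of-change alert fires half an hour before the temperature actually leaves its limits, which is the kind of early warning the window of minutes to hours makes useful.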

Humidity

Within limits in all areas

Air Conditioning/Chiller Systems

Each unit functioning
Fan failure
Discharge temperature

Water Detection

Under computer floor
From floor above computer room
Sprinkler system/heads
Under pipe work or drainage
Around chillers
From roof

Smoke

Computer Room
Under floor
Support areas

Interface to:

1. Building fire detection system
2. Auto fire suppression system
3. VESDA system

Main Power

Voltage each phase (& neutral if required)
Current balance between phases

UPS Status (For all units including telephone system)

Standby/Ready
UPS fault
On bypass
On backup power
Battery charge status
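The UPS states above can be mapped to alert levels so that a unit running on battery is never left unnoticed until the battery is exhausted. The sketch below is one possible mapping, not a vendor interface: the status strings, the three alert levels and the 30 percent threshold are all assumptions made for illustration.

```python
def ups_alert_level(status, battery_percent, low_battery_percent=30):
    """Map a UPS status to 'ok', 'warning' or 'critical' (hypothetical scheme)."""
    battery_low = battery_percent <= low_battery_percent
    if status == "fault":
        return "critical"
    if status == "on_backup":
        # Running on battery: always notify, and escalate as the charge runs low.
        return "critical" if battery_low else "warning"
    if status == "on_bypass":
        # On bypass the load has no battery protection if main power fails.
        return "warning"
    if status == "standby":
        return "warning" if battery_low else "ok"
    raise ValueError(f"unknown UPS status: {status!r}")

print(ups_alert_level("on_backup", battery_percent=80))
```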

Room Access and Motion Sensing

Access/Denied Log
Multiple Denied-Access Alert
Staff or ghosts in off limits area
Excessive or unplanned foot traffic in computer room
Room unsafe - Halon or CO2 system armed and/or discharged
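A multiple denied-access alert can be implemented as a count of denials inside a sliding time window. The class below is a minimal sketch: its name, the default of three denials in 300 seconds, and the assumption that events arrive in time order with timestamps in seconds are all illustrative choices.

```python
from collections import deque

class DeniedAccessMonitor:
    """Alert when too many denied-access events fall inside a time window."""

    def __init__(self, max_denied=3, window_seconds=300):
        self.max_denied = max_denied
        self.window_seconds = window_seconds
        self._denied_times = deque()

    def record(self, timestamp, granted):
        """Log one access attempt; return True if an alert should be raised."""
        if granted:
            return False
        self._denied_times.append(timestamp)
        # Drop denials that have aged out of the window.
        while timestamp - self._denied_times[0] > self.window_seconds:
            self._denied_times.popleft()
        return len(self._denied_times) >= self.max_denied

monitor = DeniedAccessMonitor()
for ts in (0, 120, 250):
    alert = monitor.record(ts, granted=False)
print(alert)
```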

Remote Video Monitoring

View unattended sites
Visual verification of alarms

Failed Computer Peripheral

Printer
Tape drives
Disk drives

Computer Systems

Failed process - Abort or freeze of a process
Failed computer system - System not responding
Operator System/Process warning messages (System resource problems, etc.)
Enterprise management system warnings (Unicenter, Openview, etc.)

Sound Level

Fire, Smoke, other alarms
Head Crash

Telephone System

Impaired or out of service

Network Faults

Switches
Routers
Servers

Backup Generator

Fuel level
Standing by/warming up/delivering power
Coolant failure
Battery charge status



Tom Poulter graduated from Stanford University with a BS degree in Physics and has been BTI’s CEO for 30 years. Prior to co-founding BTI Computer Systems, Tom was employed at Hewlett Packard designing a signal conditioning product line & as Systems Product Manager for HP’s bundled computer systems including HP’s first timesharing system.