During a fully manned shift
Outside normal working hours
During times of the day when orders are taken, products shipped, etc.
During month-end closing or seasonal peaks
Seconds, minutes, hours or days
(Are they mission-critical applications or just staff tools?):
Order entry (mail order, point-of-sale)
Sales support
Manufacturing operation support
On-line shipping
Corporate Web-site
Accounting, Payroll, A/P, A/R
Word processing
Engineering calculations
Training
Instant alert and response
Delayed alert and/or response
Third party vendor response time
Hardware repair time
Fully automated recovery
Semi-automated data recovery
Manual restoration
With improved hardware designs and major gains in component reliability, mainframe hardware failures have declined significantly. The larger share of downtime now stems from network failures, database and applications software problems, and human error. The move from central computing to distributed data processing and networks has also created new potential failure points in small, unmanned communications rooms and converted closets.
Another source of downtime is the support equipment for the rooms these systems occupy. Surveys report that between 10 and 30 percent of downtime results from faults in the facility environment and support equipment. This category is unique: unlike hardware, software, and human errors, there is often a window of minutes to hours in which the fault can be dealt with before computer system function is actually lost.
Clearly it is preferable to avoid downtime altogether through careful system design, robust and well-tested software, and good operating practices. Since computer installations and operators are not perfect, it is also prudent to structure the system to tolerate faults wherever possible and to automate, or at least facilitate, recovery when faults do occur. "Fail-proof" systems help, although they can be more difficult to resuscitate when they do crash.
Similar attention needs to be given to the support equipment that provides the environment necessary for reliable operation of the computer system(s). Even fail-proof systems cannot function without power, in excessive heat, or when inundated with rising water. Full-time monitoring for faults in the environment is important wherever early recognition of a problem would allow correction before the computer system's operation is compromised. Even when downtime cannot be prevented, early recognition matters so that fault correction and recovery are not delayed.
Full-time monitoring of the computer systems and networks themselves is also important. Some fault or error messages on the operator's console deserve prompt attention even when they occur during hours when the computer facility is unattended. For most computer systems in use today, it is possible to periodically confirm that the computer is responsive and functioning. Telephone equipment (and associated UPS units) essential to the functioning of the computer system and the organization also warrants monitoring.
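As one illustration, a periodic responsiveness check can be as simple as confirming that each critical host still accepts connections on a known service port. The sketch below (in Python) assumes hypothetical host names, ports, and a polling interval; a real installation would substitute its own list and route alerts to a pager rather than printing them.

    # Periodic responsiveness check - a minimal sketch.
    # Host names, ports, and the polling interval are assumptions.
    import socket
    import time

    HOSTS = [("orders-db.example.com", 1521),  # hypothetical hosts/ports
             ("web.example.com", 80)]
    INTERVAL = 60  # seconds between polls

    def responsive(host, port, timeout=5.0):
        """Return True if the host accepts a TCP connection on the port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:
        for host, port in HOSTS:
            if not responsive(host, port):
                print(f"ALERT: {host}:{port} not responding")
        time.sleep(INTERVAL)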
Many of the important considerations are listed below in outline form.
Good System Design - Computer Hardware and Software
Reliable and/or fault tolerant hardware
Minimal number of critical points where one failure crashes entire system
Thoroughly tested and debugged O/S and applications software
Mirrored or RAID storage
On-line backup
Integral journals or audit trails for crash recovery
Utilities available for system health and data structure checks
Good Design Practices - Facility & Support Equipment
Based on the system's needs, adequate control of:
1. Temperature & temperature rate of change
2. Humidity
3. Airborne particles (dust or smoke)
Anti-static conductive floor surface
Minimal need for personnel traffic in and out of main computer room
UPS with sufficient capacity to permit orderly shutdown or transition to alternate power
Operator and Administrator Training
Zero tolerance for "cockpit errors"
Prompt and correct response to problem situations
Good Discipline
Routine attention to storage and resource allocation (see the sketch after this list)
Routine diagnostic checks of system and database structures
Routine virus checks of system and all imported files
Routine maintenance of system and support equipment:
1. Air conditioning system inspected and repaired; clean air filters
2. Generator fuel, oil, coolant & battery condition
Thorough testing of new hardware and repaired hardware
Rigorous testing of software patches, bug fixes, upgrades and revisions
Routine backup of files, on-line if possible
On-site spares (known to be in working order) for all critical equipment
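Some of this routine discipline can be scripted and run from a scheduler. The following sketch checks free space on watched volumes and the age of the newest backup; the volumes, thresholds, and backup directory are assumptions to be replaced with site policy.

    # Routine storage and backup-age check - a minimal sketch.
    # Volumes, thresholds, and the backup directory are assumptions.
    import shutil
    import time
    from pathlib import Path

    VOLUMES = ["/", "/var"]            # volumes to watch
    FREE_LIMIT = 0.15                  # alert when less than 15% free
    BACKUP_DIR = Path("/backups")      # where routine backups land
    MAX_BACKUP_AGE = 26 * 3600         # alert if newest backup > 26 hours old

    for vol in VOLUMES:
        total, used, free = shutil.disk_usage(vol)
        if free / total < FREE_LIMIT:
            print(f"ALERT: {vol} has only {free / total:.0%} free")

    backups = list(BACKUP_DIR.glob("*"))
    if not backups:
        print(f"ALERT: no backups found in {BACKUP_DIR}")
    elif time.time() - max(f.stat().st_mtime for f in backups) > MAX_BACKUP_AGE:
        print("ALERT: newest backup is more than 26 hours old")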
Detecting Failures
Major system / network failures - instantly obvious in a manned facility
Partial failures affecting only a few system functions - may go undetected for hours or even days
Failures during unmanned periods - may go undetected for hours or days
(Many weekend failures are discovered on Monday morning.)
Faults in support equipment - may take minutes to hours to cause a system failure
Early detection can reduce or eliminate downtime
Recovery from Failures
Applications designed or selected to minimize or eliminate the need for manual intervention for recovery
Recovery procedures automated whenever possible
Fault notification organized to minimize response time (see the sketch after this list)
Procedures for recovery documented and tested, including:
1. Who is in charge?
2. What personnel are required?
3. Who needs to be notified of the problem?
4. Who needs estimates on when the system will return to service?
5. What outside vendor support is available and how to call for it?
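Fault notification can be organized as a simple escalation chain: page the first contact, wait a bounded time for acknowledgement, then move to the next. The sketch below illustrates the logic only; notify() and acknowledged() are placeholders for whatever pager, e-mail, or dial-out interface is actually in use.

    # Fault-notification escalation - a minimal sketch.
    # notify() and acknowledged() are placeholders for the real interface.
    import time

    ON_CALL = ["operator on duty", "system administrator", "IT manager"]
    ACK_WINDOW = 10 * 60   # seconds to wait for an acknowledgement

    def notify(contact, fault):
        print(f"Paging {contact}: {fault}")   # placeholder

    def acknowledged():
        return False                          # placeholder: poll the ack system

    def escalate(fault):
        """Walk the on-call list until someone acknowledges the fault."""
        for contact in ON_CALL:
            notify(contact, fault)
            deadline = time.time() + ACK_WINDOW
            while time.time() < deadline:
                if acknowledged():
                    return contact
                time.sleep(30)
        return None   # nobody answered; keep alarming by other means

    escalate("UPS on backup power in communications room")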
Factors to Consider
These factors should be taken into consideration when deciding what to monitor:
1. Organizational Risk
- How is the organization affected by downtime?
- The cost per hour of downtime
- When the organization is most vulnerable: Days? Nights? Weekends?
2. Past Downtime Problems
- Power unreliable
- Air conditioning system failures
- UPS units that come on-line unnoticed until the battery is exhausted
- Peripherals going off-line
- Intruders
- Water leaks
3. Budget constraints
- What is the standard for return on investment or cost effectiveness?
- How does the return on investment for downtime cost control compare to competing investment opportunities?
- Is there a hard limit on available budget?
- Will additional budget be available in the future for upgrades?
4. Last Major Downtime Disaster
Organizations tend to be more motivated to deal with the problem while the memory of a major disaster is still fresh. If you can’t get action now, wait until just after the next disaster. The goal, of course, is to act beforehand and avoid the next disaster altogether.
Parameters to Consider Monitoring
This list is for critical rooms with a concentration of hardware and support equipment which may be manned part or all of the time. A subset would be appropriate for unmanned communications or network rooms.
Temperature
Within limits in all areas, including under floors
Rate of change not exceeded (see the sketch below)
Hot spot detection and balancing between A/C zones
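For temperature, both the absolute limits and the rate of change can be checked from the same sample loop, as in this sketch; the sensor read and the limits are placeholders, not recommended values.

    # Temperature limit and rate-of-change alarm - a minimal sketch.
    # read_temperature() and the limits are placeholders.
    import time

    LOW, HIGH = 15.0, 27.0   # acceptable range, deg C (illustrative)
    MAX_RATE = 5.0           # max change per hour, deg C (illustrative)
    INTERVAL = 300           # sample every 5 minutes

    def read_temperature():
        return 21.0          # placeholder: read the real sensor here

    last = None
    while True:
        temp = read_temperature()
        if not LOW <= temp <= HIGH:
            print(f"ALERT: temperature {temp:.1f} C out of range")
        if last is not None:
            rate = abs(temp - last) / (INTERVAL / 3600.0)
            if rate > MAX_RATE:
                print(f"ALERT: temperature changing {rate:.1f} C/hour")
        last = temp
        time.sleep(INTERVAL)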
Humidity
Within limits in all areas
Air Conditioning/Chiller Systems
Each unit functioning
Fan failure
Discharge temperature
Water Detection
Under computer floor
From floor above computer room
Sprinkler system/heads
Under pipe work or drainage
Around chillers
From roof
Smoke
Computer Room
Under floor
Support areas
Interface to:
1. Building fire detection system
2. Auto fire suppression system
3. VESDA system
Main Power
Voltage each phase (& neutral if required)
Current balance between phases
UPS Status (For all units including telephone system)
Standby/Ready
UPS fault
On bypass
On backup power
Battery charge status
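A watch on UPS status addresses the classic failure mode of a UPS carrying the load unnoticed until its battery is exhausted. In this sketch get_ups_status() stands in for the real interface; sites running Network UPS Tools might shell out to upsc, and serial or SNMP interfaces vary by vendor.

    # UPS status watch - a minimal sketch.
    # get_ups_status() stands in for the real UPS interface.
    import time

    LOW_CHARGE = 20              # percent; begin orderly shutdown below this

    def get_ups_status():
        return ("online", 100)   # placeholder: query the real UPS here

    while True:
        state, charge = get_ups_status()
        if state == "on_battery":
            print(f"ALERT: UPS on battery, {charge}% charge remaining")
            if charge < LOW_CHARGE:
                print("ALERT: battery low - begin orderly shutdown")
        time.sleep(30)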
Room Access and Motion Sensing
Access/Denied Log
Multiple Denied-Access Alert
Staff or ghosts in off-limits areas
Excessive or unplanned foot traffic in computer room
Room unsafe - Halon or CO2 system armed and/or discharged
Remote Video Monitoring
View unattended sites
Visual verification of alarms
Failed Computer Peripheral
Printer
Tape drives
Disk drives
Computer Systems
Failed process - Abort or freeze of a process
Failed Computer System - System not responding
Operator System/Process warning messages (System resource problems, etc.)
Enterprise management system warnings (Unicenter, Openview, etc.)
Sound Level
Fire, Smoke, other alarms
Head Crash
Telephone System
Impaired or out of service
Network Faults
Switches
Routers
Servers
Backup Generator
Fuel level
Standing by / warming up / delivering power
Coolant failure
Battery charge status
Tom Poulter graduated from Stanford University with a BS in Physics and has been BTI’s CEO for 30 years. Before co-founding BTI Computer Systems, Tom worked at Hewlett-Packard, designing a signal-conditioning product line and serving as Systems Product Manager for HP’s bundled computer systems, including HP’s first timesharing system.




