At a large Midwestern hospital that has a variety of LAN servers and minicomputers located throughout the hospital complex, the ability to recover from an event such as a small fire was in serious doubt.
- In some instances no backups were made.
- In most other instances, the backups were done, but the backup tape was kept within 2-3 feet of the processor, sometimes even in a place of honor sitting right on top of the server. Most of the backup tapes were loosely stacked (tossed in a pile) on an adjacent desktop or thrown into a desk drawer. Only rarely were tapes stored outside of the room in which the server or processor was located. One tape was stored across the hall, and another was stored in a person's home. There were no policies or procedures governing any part of this process.
- There were serious doubts about the reliability of some of the tapes because, in some instances, tapes were used over and over again. There was no planned replacement for worn tapes. There was also no testing or verification that a given tape could actually be used to restore the system. It had never occurred to staff outside data processing that tapes wear out or that the backups may not have been done correctly.
- A wide variety of tape drives were used by this hospital, to the extent that, in a few cases, there were serious doubts whether a compatible tape drive could be found if the primary unit were inoperable (destroyed in a fire or simply not working). The hospital had no standards on tape form or format and had no centralized review over the procurement of such items. Managers did their own thing, and innumerable varieties of tape formats and manufacturers resulted.
- The frequency of backups, or rather the infrequency, was such that a significant amount of data would have been lost if the failure occurred in a worst-case scenario - a Friday morning system failure before the once-a-week Friday afternoon backup. Again, no policies existed concerning frequency of backup.
The lessons learned are obvious: management needs to exercise control over the backup and storage process and develop policies and standards governing these issues. Failure to do so will lead to costly losses. In the wrong situation at the wrong time, such as in the case of this hospital, the inability to quickly recover critical data could lead to the death of a patient.
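The controls the hospital lacked (backup frequency, off-site storage, tape replacement, and restore verification) lend themselves to a simple automated check. The following is only a minimal sketch in Python, not a description of any system the hospital used: the `BackupRecord` fields, the `audit` and `data_at_risk` functions, the one-day backup interval, and the 50-use tape limit are all hypothetical and would need to come from an organization's own policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupRecord:
    """One system's backup practice as an auditor might record it (hypothetical fields)."""
    system: str
    last_backup: Optional[datetime]  # None means no backup has ever been made
    stored_offsite: bool             # tape kept outside the room housing the server
    tape_reuse_count: int            # times the current tape has been overwritten
    restore_tested: bool             # a trial restore from this tape has succeeded


def audit(record: BackupRecord, now: datetime,
          max_age: timedelta = timedelta(days=1),  # assumed policy value
          max_reuse: int = 50) -> List[str]:       # assumed policy value
    """Return the policy violations for one system (an empty list means compliant)."""
    if record.last_backup is None:
        return ["no backup exists"]
    findings = []
    if now - record.last_backup > max_age:
        findings.append("backup older than allowed interval")
    if not record.stored_offsite:
        findings.append("tape stored in same room as server")
    if record.tape_reuse_count > max_reuse:
        findings.append("tape past replacement threshold")
    if not record.restore_tested:
        findings.append("restore never verified")
    return findings


def data_at_risk(last_backup: datetime, failure: datetime) -> timedelta:
    """Work that would be lost if the system failed at the given moment."""
    return failure - last_backup


# Arbitrary example dates, both Fridays: backup at 5 p.m., failure at 9 a.m. a week later.
last_friday_backup = datetime(2024, 3, 1, 17, 0)
friday_morning_failure = datetime(2024, 3, 8, 9, 0)

record = BackupRecord("lab-server", last_friday_backup,
                      stored_offsite=False, tape_reuse_count=120,
                      restore_tested=False)
for finding in audit(record, friday_morning_failure):
    print(f"{record.system}: {finding}")
print("data at risk:", data_at_risk(last_friday_backup, friday_morning_failure))
```

With a weekly Friday-afternoon backup, a Friday-morning failure puts nearly a full week of work at risk, which is the worst case described above. Shortening the interval is the only way to shrink that exposure.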
Modems and Dial-Up Communications
The once-rare modem has proliferated in recent years. As transmission speed has increased from 300 bits per second (bps) to 28,800 bps and above, the cost of modems has dropped dramatically. However, several years back, when 1200 bps modems were the norm and compatibility was an issue - at least for some organizations - we encountered a situation at a large food processing and distribution company that illustrated the need for standards and testing of backup communications. This company operated numerous food processing plants and warehouses located throughout the Midwest. These facilities were connected to the main data center by dial-up access. About 35 different makes and models of modems were in use at its facilities and in its data center. Only some of these modem pairs were compatible with each other. Fewer still were compatible with the modems at the hot site used by the company for its disaster recovery plan. The company had to replace many of the modems so that, in case of a disaster, the modems in use at its widespread facilities would be compatible with the modems at the hot site. This was the only way to reasonably ensure restoration of communications at the time of a disaster.
In a similar vein, another client had numerous branch offices connected to the primary data center by leased lines. The communications equipment (modems) on the leased lines could not be used with dial-up lines. Therefore, there was no capability to use a dial-up line in case of an emergency. A construction project took out the leased line for a few days, but other voice circuits still functioned. The company was not able to use a dial-up line to restore data communications, and was not able to procure the necessary dial-up modems (this was in the early 1990s) quickly enough to keep its operations going. It resorted to physically transporting data between locations. No thought had been given to designing the network to permit dial-up communications if the leased lines failed. Management tried to save a few dollars up front by buying equipment with limited functionality and ended up spending many times that amount in lost productivity.
This situation, or ones very similar to it, was observed at three other companies.
The lesson learned is to design the network and the system with redundancy and recoverability in mind. Poor planning and design up front will not only make disaster recovery planning difficult; it may also require an expensive redesign of the entire system. The disaster recovery professional should be consulted when systems are being developed so that recoverability is designed into the system.
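The modem problem at the food company comes down to a compatibility check that can be run long before a disaster: for each facility, can its modem connect to at least one modem at the hot site? The Python sketch below illustrates the idea under stated assumptions. The function name, the site names, and the "Model-A" style labels are hypothetical, the table of compatible pairs would have to come from actual testing, and the sketch assumes a modem is always compatible with another of the same model.

```python
from typing import Dict, FrozenSet, List, Set


def incompatible_sites(site_modems: Dict[str, str],
                       hot_site_modems: Set[str],
                       compatible_pairs: Set[FrozenSet[str]]) -> List[str]:
    """Return the sites whose modem cannot connect to any hot-site modem."""
    failing = []
    for site, model in sorted(site_modems.items()):
        can_connect = any(
            # Assumption: identical models interoperate; other pairs must be tested.
            model == hot_model or frozenset((model, hot_model)) in compatible_pairs
            for hot_model in hot_site_modems
        )
        if not can_connect:
            failing.append(site)
    return failing


# Hypothetical inventory: one modem model per facility.
site_modems = {
    "Plant 1": "Model-A",
    "Plant 2": "Model-B",
    "Warehouse 1": "Model-C",
}
hot_site_modems = {"Model-A"}
# Pairs of different models confirmed to work together (order does not matter).
compatible_pairs = {frozenset(("Model-A", "Model-B"))}

print(incompatible_sites(site_modems, hot_site_modems, compatible_pairs))
```

In this made-up inventory only Warehouse 1 would need a replacement modem. Running the same check whenever equipment is purchased is one way to enforce the standards the company lacked.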
One of the most common recovery strategies embraced by senior management is the 'we will do it manually' approach. In some extremely rare situations, a manual recovery strategy might work for a short time, but in most situations, management is only fooling itself if it thinks the organization can survive a catastrophic event and continue to operate using manual procedures.
In one hospital for which we performed a business impact analysis, 15,000 laboratory tests were performed each day. The results produced by various instruments and devices located in several buildings were sent to a Digital Equipment processor, which in turn fed the results to an IBM mainframe. The output of this process was then distributed to the doctors and nurses throughout the hospital complex. A few selected personnel could get lab results from the DEC processor if the IBM processor were down; however, the DEC and IBM processors were located in the same data center, exposing them both to the same potential disaster.
Analysis of the criticality of the laboratory requirement, along with over 20 other hospital functions, clearly indicated that recovery needed to be accomplished within 48 hours. Our recommended strategy was a hot site solution. Management rejected that solution in favor of a 'we'll do it manually' approach. Doing it manually referred not only to distributing the results of 15,000 daily lab tests, but to all functions supported by the data center.
The hospital had spent about $15 million to acquire facilities, hardware and software; install new systems; and automate numerous hospital functions. Dozens of positions had been eliminated as automation was implemented. Manual processes were also discarded; the forms and the procedures for using them were either scrapped or fell into disuse. In spite of this, management decided that, if a disaster occurred, the hospital would operate manually. The decision ignored the results of more than 60 interviews and was made because a hot site was considered too costly.
The lesson learned is directed at management: a viable, realistic recovery strategy is a cost of doing business. Doing it manually after spending millions to automate and downsize is simply not realistic. The experienced staff members who used to do it manually are gone. The forms and procedures haven't been used in years, and nobody knows how to fill out the forms anyway. Denying this business reality may prove to be a very costly mistake.
Gerald A. Sands is a senior consultant with The Netplex Group, Inc.