The Test That Wasn't a Test
- Published on October 28, 2007
On the evening of Wednesday, January 15, 1992, Bluebonnet Savings Bank (BSB) in Dallas, Texas, got to demonstrate first-hand a key DR maxim: a “disaster” should not be thought of only as an external event that strikes computer operations. Rather, a disaster is anything that interrupts the continuity of business operations. And when disaster struck, Bluebonnet was ready.
The 3725 is Down!
That evening, at this multi-billion-dollar bank (with 34 branch offices spread around Texas and a mortgage servicing company in Atlanta), MIS operations came to a halt. An attempt to re-IPL the bank’s IBM mainframe failed when the 3725 communications controller would not load. In addition, operations was experiencing problems with bad tracks on the disk drive.
As at most financial institutions, communication with branches and customers is key to continuing effective business operations at Bluebonnet. Anything that removes that communications link is disastrous. “We have to be able to allow customers to withdraw money, get information on account balances, and the like. You just can’t tell people that they can’t withdraw money because you don’t know how much they have in their accounts,” says Chuck Littleton, Disaster Recovery Planner for the bank. “So it is standard policy for us to declare a disaster on anything that will knock us out for 24 hours or more.”
Therefore, when it became obvious that the problem was not going to be fixed immediately, that is exactly what the bank did. Bluebonnet Savings Bank declared a disaster with their IBM hot site in Tampa, Florida, and activated their business contingency plan, automated with Strohl Systems’ LDRPS software, at 4:15 p.m. on January 16.
By 8:00 that evening, key bank employees were on a plane to Tampa, and by 12:15 a.m. they had begun recovery operations. At 6:00 a.m. the alternate-site system in Tampa was up and running successfully with all databases loaded.
Back in Dallas, recovery was in progress. By 3:00 a.m. the same morning, the communications controller had been brought back up. “After testing it and solving some communication problems with a few of the branches, we were able to determine that we could switch operations back to Dallas, and we did so at 9:00 a.m. In fact, we were only running live at the hot site for about three hours,” says Littleton. “But if the problem in Dallas hadn’t been solved, we were ready that Friday morning to be in full operation in a way that would have been transparent to our branches and customers and in a way that would have preserved the continuity of business operations.”
The Role of the Plan
Having the hot-site agreement in place was key to Bluebonnet’s ability to react and recover quickly. But just as important, noted BSB’s Disaster Recovery Coordinator Patti Smith, was having an automated business continuity plan, which the bank had developed the previous September.
“We realized that in the event of a disaster, there was a lot of information that we would need to assess quickly,” says Smith. “Things like the names and phone numbers of people we needed to contact, organizational plans, task plans, equipment inventories, and the like. That kind of information is critical to have at your fingertips if you are going to keep doing business and servicing customers.”
So that fall, using plan development software, Smith and the unit managers automated the bank’s recovery plans. They analyzed the needs and functions of their business units and collected the information necessary to ensure the continuity of each key business function in the event of a disaster. “It was the availability of this data from the database that allowed us to react so quickly and efficiently,” says Smith.
The Real-World Test
“Because we were actually up and running again by 9:00 a.m. Friday in Dallas,” says Littleton, “this experience served as a thorough test of our disaster recovery and business continuity plan.” And there are several key lessons that both Littleton and Smith point to as a result of the experience.
“First,” says Smith, “you absolutely have to have an automated planning tool in order to maintain the data that is needed to effect the recovery process efficiently. There is simply no way, realistically, that anyone could control and update that much information in a simple written plan.”
Littleton adds, “The second lesson we learned is that it is so critical that the data in the database be current and valid that we will now update our continuity plan on a daily rather than a weekly or monthly basis. All staff changes, CPU or other equipment configuration changes, etc. will be input to the database immediately. It has to be current.”
Finally, both agree that the position of Disaster Recovery Coordinator, Smith’s function, must be made clear and the lines of communication kept open for all who are in any way involved in the recovery. “It is really important in order to minimize confusion,” says Littleton. “We had far too many people calling all over the place to ask questions when they should have been dealing directly with Patti. But we’ve cleared that up now. If anything like this ever happens again, everyone knows that Patti is ‘central control’ for all information regarding recovery operations.” In fact, Bluebonnet Savings Bank now regards the position as so important that Smith has been assigned an assistant.
Although Dallas was back up and running on Friday morning, the disaster recovery team that had flown to Tampa stayed on over the weekend to troubleshoot the problems with the modems and communications lines. They returned on Sunday night, tired, but justifiably proud of a job well done.
This time, the disaster was short-lived. But the experience was an important one. It allowed Bluebonnet Savings Bank to test and refine, under fire, the value and quality of their contingency plan. If there ever is a “next time,” they will be prepared.
Mary Lou Roberts is a free-lance writer and industry consultant with more than 25 years of experience in information systems.
This article adapted from Vol. 5 #2.