Disaster planning is acknowledged to be essential for corporate survival. But unless disaster plans are tested thoroughly and periodically, they can lull companies into a state of inadequate semi-preparedness.

Fortunately, the Mead Corporation realized the importance of testing its data recovery process at a hot-site before it was too late. In preparing for the company’s first hot-site test, Al Tokarsky, Senior Systems Programmer, realized that with Mead’s existing data recovery system, at least three days would be required to restore business-critical applications, such as Electronic Data Interchange (EDI), spreadsheet applications, financial analysis packages, and an internal communications application, in the event of a disaster.

“We had been using the same approach to data recovery for a number of years,” he says. “But as we started getting more concerned with disaster recovery, I looked more closely at how our backup and restore product worked. It soon became evident that we’d be in real trouble if we had to rely on that product in an actual outage.”

Tokarsky’s first step in remedying this situation was to identify recovery standards. “My original goal,” he recalls, “was to fully recover the entire VM system in a hot-site test in under five hours.”

That original target has been slashed as a result of a recent hot-site test when Mead finished a complete base restoration of critical business data in just two hours and thirty-five minutes. With the knowledge gained through the hot-site test, however, Tokarsky now believes the recovery period can be cut even further.

“We’re still in the process of streamlining recovery procedures,” he says. “We found that by running a stand-alone module of our restoration system directly we can simplify the environment so that we won’t have to depend on any other tape management products in the recovery process.”

The base tapes used in recovery operations are created weekly and shipped offsite along with a listing of all the tapes that would be required for recovery, including NSS (named saved system) tapes, and the key restoration system tapes.

“Every week we take a complete base backup, essentially a snapshot of all of our data as it exists at the time,” Tokarsky explains. “This physical, cylinder-for-cylinder representation of the DASD can be restored faster than incremental backups because it is not dependent on the CMS file structure, and verification of each file is not required.”

In addition, Mead makes two incremental backups daily, sending the first copy offsite for secure storage, and keeping the second copy onsite for ad hoc file restores. The daily incremental tapes are cumulative, and include all data changed since the previous base backup was made. Each incremental backup typically incorporates 5000-6000 user IDs, while the full base generally has over 7700.
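
Because the daily incrementals are cumulative, a restore needs only two things: the most recent base and the newest incremental taken after that base. The following is a minimal Python sketch of that selection logic, not a description of how SYBACK itself works. The `Backup` record, the `tapes_needed` function, and the dates are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Backup:
    kind: str        # "base" or "incremental" (hypothetical labels)
    taken_on: date   # the day the backup was made


def tapes_needed(backups: List[Backup], as_of: date) -> List[Backup]:
    """Return the backups to restore, in order, to recover the state as of `as_of`."""
    usable = [b for b in backups if b.taken_on <= as_of]
    bases = [b for b in usable if b.kind == "base"]
    if not bases:
        raise ValueError("no base backup exists on or before the target date")
    base = max(bases, key=lambda b: b.taken_on)

    # A cumulative incremental holds every change made since the base,
    # so only the newest one taken after that base is required.
    later = [b for b in usable
             if b.kind == "incremental" and b.taken_on > base.taken_on]
    plan = [base]
    if later:
        plan.append(max(later, key=lambda b: b.taken_on))
    return plan


if __name__ == "__main__":
    # Hypothetical week: one base, then three daily cumulative incrementals.
    history = [
        Backup("base", date(1990, 1, 6)),
        Backup("incremental", date(1990, 1, 7)),
        Backup("incremental", date(1990, 1, 8)),
        Backup("incremental", date(1990, 1, 9)),
    ]
    for backup in tapes_needed(history, date(1990, 1, 9)):
        print(backup.kind, backup.taken_on)
```

With non-cumulative incrementals, every daily tape since the base would have to be applied in order. The cumulative scheme trades larger daily backups for a restore that never needs more than two generations of tapes.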

The problem with Mead’s previous backup and restore system was that the data was compressed when the base backups were made and had to be decompressed before the backup tapes could be used in a recovery operation.

“This presented a Catch 22 situation,” Tokarsky says, “where we had to have a base system up in order to decompress the files, but we needed those files in order to get the base system up. If our hot-site was down for any reason and we were forced to migrate to a cold site, restoration would be virtually impossible.”

Mead’s new system, called SYBACK, solves this problem because it can operate as a stand-alone module and does not require uncompressed files. “All we do,” Tokarsky says, “is enter the hotsite, verify the tapes are there, and load the key tape containing the two catalogue files and the stand-alone module. Since the Vol Sers needed to run the job are all in these files, no full file catalogue product is needed. The system then uses one file as input for the base restore, and with all DASD virtually attached, the job proceeds automatically. All we do is mount tapes as prompted by the system. Less than three hours later we’re done.”

Once the base is fully restored, Mead then restores the incrementals, a process which in their most recent hotsite test took just one hour and 40 minutes. But again, Tokarsky stresses, with additional hot-site tests, that number is expected to be reduced to as little as one hour.
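
The timings quoted above can be added up as a rough check against the original five-hour goal. The short Python sketch below does that arithmetic. It assumes the base and incremental restores run one after the other, that nothing else is on the clock, and that the five-hour goal was meant to cover both steps; the article does not state any of these outright.

```python
from datetime import timedelta

# Figures taken from the article.
target = timedelta(hours=5)                           # original recovery goal
base_restore = timedelta(hours=2, minutes=35)         # base restore in the hot-site test
incremental_restore = timedelta(hours=1, minutes=40)  # incrementals in the same test
incremental_goal = timedelta(hours=1)                 # expected after further tests

# Assumption: the two restores are sequential, so their durations simply add.
measured_total = base_restore + incremental_restore
projected_total = base_restore + incremental_goal


def fmt(duration: timedelta) -> str:
    """Format a duration as, for example, 4h15m."""
    minutes = int(duration.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


if __name__ == "__main__":
    print("measured total: ", fmt(measured_total))           # 4h15m
    print("margin vs goal: ", fmt(target - measured_total))  # 0h45m
    print("projected total:", fmt(projected_total))          # 3h35m
```

On these assumptions the tested recovery comes in about 45 minutes under the five-hour goal, and trimming the incremental restore to one hour would bring the total to roughly three and a half hours.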

“Hot-site tests not only demonstrate that our disaster plans work, but also provide us the opportunity to improve the process and trim valuable minutes,” he says. “When dealing with business critical applications, absolutely minimizing restore time is critical because every minute cut from the restoration process can translate into thousands of dollars saved.”

Ira Goodman is Software Services Manager at Syncsort, Inc., Woodcliff Lake, New Jersey, developers of SYBACK, Mead’s data backup and restoration system.

This article adapted from Vol. 4, No. 3, p. 21.

When I first took up skiing, the instructor drilled us again and again on the proper way to fall. By the end of my first season, I had the theory and plan of action down pat.

The problem was that every time I got into trouble, I rarely fell according to “the plan.”

This type of reality versus theory conflict is even more frustrating when it comes to a data center’s MVS system going down. Although few companies will ever experience a major disaster, it’s like skiing--you need a recovery plan just in case.


A novice who grabs a pair of skis and heads for the expert slopes is considered a fool. Most people get advice and assistance so they have the proper equipment. Then, they take lessons and continually practice to hone their skills and their disaster (spill) recovery.

Not every spill that a skier encounters will be according to plan, but having a plan can prevent injuries. The same is true of data center disasters. Any crash you can walk away from is a good recovery.

The Right Tools for the Job

The first thing you need for a sound DRP is the proper equipment, beginning with a good team. Since DRPs are time-consuming and cumbersome to develop, organizations usually assign the task to staff members who are available for extended periods of time...whether they are qualified for the task or not. But it pays to wait, because the right people will more quickly produce a better plan that will generally be less dependent on the “critical” staff.

One thing you don’t need is excessive documentation. Reams and reams of written documentation won’t guarantee that you will have the information you need when you need it most. DRPs should be designed to handle an unpredictable event. But by their very nature, unpredicted events cannot be prepared for, no matter how much documentation you have. A short, concise DRP that can be readily understood is more effective than one that covers “every” possibility.

Another problem with DRP documentation is that by necessity, it’s written ahead of time. It is a projection from historical records of what you think the situation will be when disaster strikes. But what if things do not take place exactly as predicted? You need a way to look at things as they exist in realtime--i.e., a one-pack backup system. Although slow and subject to the same problems as the primary system, it’s better than spending the time and money traveling to your hot-site, hoping that the system will work in sync with your primary system.

Get in Sync

Systems can get out of sync for a variety of reasons, some intentional and some unintentional.

Although the technical staff will generally make all of the changes perfectly on the primary systems, they may not remember (or even be aware) that the DRP system also needs updating.

Or, perhaps each week, you alternate between updating the multiple systems...actually forcing your systems out of sync.

The fact is, since most out-of-sync situations are oversights, you can be sure that they won’t be documented, and there is no easy way to prevent these problems.

Your one-pack or starter system is a reasonable alternative, but what’s really needed is a simple tool that allows you to quickly determine what the status is, and one that allows you to make minor corrections with ease.
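
As a rough illustration of what such a status check might look like, the Python sketch below compares two snapshots of system definitions and reports where the recovery copy has drifted from the primary. It does not describe any particular vendor’s product. The function names, the report keys, and the sample member names are all hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest used to compare two copies of the same item."""
    return hashlib.sha256(content).hexdigest()


def sync_report(primary: Dict[str, bytes],
                recovery: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare two name -> content snapshots and list the differences."""
    report: Dict[str, List[str]] = {
        "missing_on_recovery": [],  # present on primary only
        "extra_on_recovery": [],    # present on recovery only
        "different": [],            # present on both, contents differ
    }
    for name in sorted(primary.keys() | recovery.keys()):
        if name not in recovery:
            report["missing_on_recovery"].append(name)
        elif name not in primary:
            report["extra_on_recovery"].append(name)
        elif fingerprint(primary[name]) != fingerprint(recovery[name]):
            report["different"].append(name)
    return report


if __name__ == "__main__":
    # Hypothetical member names and contents, for illustration only.
    primary = {"PARM01": b"A=1", "PARM02": b"B=2", "PARM03": b"C=3"}
    recovery = {"PARM01": b"A=1", "PARM02": b"B=9", "PARM04": b"D=4"}
    for category, names in sync_report(primary, recovery).items():
        print(category, names)
```

A report like this only shows that the systems differ and where. Deciding which side is correct, and making the minor correction, still falls to the staff.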

Under Lock and Key

The security of your system and your data is vital--it cannot be compromised. But when a disaster strikes, you need to be certain that your security system won’t keep you from getting the system up.

The ideal solution gives you a contingency that you can use in a dire situation, but one that can be kept under “lock and key” so it cannot be misused.

While dire situations rarely occur, any downtime is costly, so plan for it and have tools available for your people to bring the system back up quickly.

Then, develop a DRP that’s simple, include the right tools, and hold tight to your poles.

Paul Robichaux is Chairman of the Board for NewEra Software, Inc.

This article adapted from Vol. 4 No. 1, p. 18.

On November 3rd, a pipe burst in the computer room ceiling, sending water cascading into the controller and resulting in the loss of the controller and some DASD. This coincided with a breakdown of the telephone system . . . or so said the note handed to all Data Processing employees that morning at approximately 10:00 a.m. The note did not come as a complete surprise, as the Data Processing Director had given warning that a mock disaster drill would soon take place in order to test the company’s fledgling Disaster Recovery Plan.

During the drill, the team members met in an established “command center” to report on the progress of their areas. After the drill, each team member was asked to comment on his role, others’ roles, and the plan in general. I have summarized the results here.

As might be expected, the primary purpose of a walkthrough is to test the disaster plan. However, the walkthrough also produced some unexpected and beneficial results. Besides the plan, the walkthrough tested people and their perception of how they, their respective areas, and other areas functioned within the department.

As regards testing people, the DASD Manager proved to be the key individual in the drill. He stated the technical situation and then precisely delineated his step-by-step approach to the problem. Virtually all the other team members commented on his importance, and they were also concerned by the absence of anyone to back him up.

The walkthrough also challenged participants to be creative, for it would be impossible for any workable plan to be so codified as to meet every contingency. This was nowhere better demonstrated than with the operations shift supervisor, who showed some realistic imagination when he reported, “We attempted to quiesce the system but were prevented due to sparks flying in the Machine Room. Therefore it was necessary to hit the emergency power button. We disconnected the Halon System and covered all equipment possible with plastic.” He then sent runners (as the telephone system was inoperative) to notify Physical Plant to shut off the water and requested pumps to remove it. Showing initiative, he sent an Operator out to buy hair dryers, using the Operations Manager’s credit card, in order to dry the equipment.

The participants’ perceptions of how the department functioned and how other sections of the department should function proved most instructive. The communications member noted that the disaster emphasized the true interactive nature of the department. However, reflecting his area, he believed that Operations needed to be more aware of their environment (power circuits, water pipes) and that Applications needed education in hardware/communications terminology. Indeed, this “jargon” manifested itself throughout the team members’ reports. One remedy discovered for the jargon problem proved to be the log book. It acted as a central recording device for all technical events, since the individual designated as “scribe” could not be everywhere at once and suffered from the same jargon problem that plagued other non-specialists.

The walkthrough tested not only the plan itself but also the elements necessary for a successful walkthrough. The operations shift supervisor noted that the message implementing the disaster drill made no mention of what application was to be running at the time of the disaster, a point well taken since a critical payroll job stream was actually running at the time.

Generally, the team members thought the drill to be a success; indeed, even a disastrous walkthrough would have been successful since it would have pointed out the plan’s flaws. The walkthrough was beneficial on several levels. The Assistant Director was impressed by the preparedness of the Systems staff, especially the fact that they had multiple copies of essential backup tapes. He also stressed the need for CICS forward recovery for VSAM files and off-site storage (currently in place). His only criticism was the lack of total communication between participants. The Systems Manager noted that with the loss of SYS1.LINKLIB Systems personnel would have been unable to log on to TSO. A skeleton TSO procedure and ID were set up to address this problem. Most of the walkthrough’s results fell within two areas: expected and unexpected.

Expected results generally manifested themselves in the establishment of an offsite storage facility, CICS forward recovery for VSAM files, and adequate backup for key personnel. Like any data processing product, there were “bugs” in the plan that had to be corrected.

It was, however, the unexpected results that proved to be most instructive. This basically amounted to a newly discovered sense of confidence that a disaster could be met and that the demonstrated flexibility, imagination, and professionalism of the staff could overcome minor plan flaws. The drill brought out a degree of professionalism theretofore unnoticed in some individuals. The drill also demonstrated the highly diverse nature of the data processing department as a whole, and the segregated areas of specialization within each area. Recognizing that, the walkthrough was useful in that it gave the department a project in which they could function as a single unit, adding to a greater understanding of how their counterparts functioned.

For the individual charged with the development and maintenance of his department’s disaster plan, the walkthrough can have all the timeliness of yesterday’s newspaper and can fall victim to its own success. Remembrance of a successful walkthrough can breed a false sense of confidence. Personnel turnover, rapidly changing hardware and software, and out-of-date documentation all point up the need for the walkthrough, to be truly useful, to be a scheduled, periodic event. Only in this way can the disaster plan act as an educational tool for management and new employees and give the organization the knowledge and tested skills needed to reduce the effects of a true disaster.

Robert D. Hargrove is a contingency planner at the University of Texas Health Science Center.

This article adapted from Vol. 2, No. 4, p. 11.

With the relatively recent onslaught of such natural disasters as Hurricane Hugo and the Loma Prieta earthquake, many businesses are realizing just how crucial it is to develop and update their disaster recovery plans. While this is a good first step, it is by no means an adequate precaution if the plan is not tested before, during, and after it is implemented. Testing is what indicates the effectiveness of a plan. Therefore, it is important that as much care be exercised in testing the plan as in developing it. Time has a way of eroding a plan’s effectiveness for the following reasons:

  • Environmental changes occur as organizations evolve, new products are introduced, and new policies and procedures are developed. Such changes can render a plan incomplete or inadequate.
  • Hardware, software and other critical equipment change.
  • Personnel may lose interest or forget critical parts of the plan.
  • The organization may experience personnel turnover.

The destruction caused by natural disasters across the country, such as earthquakes in California, hurricanes on the Texas coast, and tornadoes in the Midwest, has made disaster recovery planning a major topic of discussion within many companies.

Dependency on computerized operations means that large or small banks have a major need for disaster recovery planning. Add this to the specific time frame for check processing and the need to complete fund transfers as quickly as possible, and disaster recovery planning becomes even more necessary. This also raises an interesting question: if a bank does not have an effective, tested plan and as a result cannot continue operations after a disaster occurs, will this have a ripple effect on other banks or on the economy? A study by the California Bankers Association concluded that the loss of check processing in the Los Angeles area would affect California’s economy in 3 days, the nation’s economy in 5 days, and the world’s economy in 7 days. Could a disaster affecting your bank or several banks in your area result in a ripple effect on a local, national or international basis?

The Comptroller of the Currency recognized the need for disaster recovery planning with the issue of Banking Circular BC177 in 1983. A subsequent revision in 1987 moved disaster recovery planning from the computer operations to the major operational areas. Circular BC226 also recognizes the need for such planning with PCs and other end-user systems. We understand that the Comptroller of the Currency has other areas of computer operations under review (e.g., banks that use outside service bureaus).

Disaster recovery planning goes by many names, but regardless of what it is called, it must detail the actions to be taken and the resources to be used to maintain critical company business functions in an emergency situation.

How is this achieved? A plan involves:

  • Preplanning
  • Making an alternate processing site agreement (hotsite, coldsite or reciprocal agreement)
  • Documenting the responsibilities of personnel in an emergency
  • Testing, testing, and more testing!

In other words, it involves being prepared for the worst!

Many articles have been written about disaster recovery planning. While many refer to the use of alternate sites and testing, they do not necessarily emphasize the need to choose the correct alternate site and then to test using that site.

What is the correct alternate site? The initial assessment to define the company’s needs in an emergency will provide information on the requirements for the alternate site. This assessment is usually referred to as a risk analysis or business impact analysis. Completion of such an analysis is an important function in the disaster recovery planning process and provides a firm foundation for the development of an effective plan.

Which site is best for your operations will depend on the findings of this analysis. Once a site is chosen, testing at the site will confirm the accuracy of that decision. The plan document provides the road map and instructions on how to react and use the site. Testing is like a football team practicing before the Super Bowl. It enables you to practice the moves that will allow you to reach the goal of maintaining services in a disaster with minimum inconvenience to your customers.

Testing allows you to resolve problems that could affect the chances of achieving that goal. It also enables you to practice resolving problems. Keeping problems to a minimum and knowing how to resolve them are key points in a successful recovery. It must be remembered that pressures on personnel will be immense during the recovery period. As a result, solving problems can take longer than in a normal work mode. The time it takes to resolve different problems could be 5 to 10 times greater because the pressure to get them resolved may cause other problems.

We interviewed three banks for their views on disaster recovery planning and alternative site processing in emergency situations.

The First Bank of Troy, North Carolina uses Burroughs equipment. John C. Wallace, President and CEO, stated that they realized the importance of disaster recovery for the reason of self preservation. Wallace has been in banking for 50 years and was used to doing business manually for many years. They had to use proof machines and update the general ledger by hand. It took many employees many hours to update the journal, and customers were understanding if their account was not updated until the next day. He stated that no matter how much money and how many resources were available, it would be impossible to go back to manual functions, and customers would not stand for any delay in bank services.

“You could set up a table and a teller anywhere to serve your customers, but without their account information, you wouldn’t know whose check was good, which ones to clear, or their current balances. We would have to close our doors in one or two days,” Wallace said. He became sufficiently committed to disaster recovery planning to build a separate building to house an exact duplicate of his equipment for his own backup. This was very costly, so he decided to market the backup site and called it First Recovery.

Peoples Bank of Newton, North Carolina, was using Automatic Data Processing (ADP), an outside vendor, for the bulk of their processing. The check capture and processing is done on site. ADP uses Burroughs equipment. They also process checks for another smaller bank in North Carolina. ADP had an alternate site at a bank in Virginia that Peoples Bank used as a backup. The Virginia bank subsequently left ADP, which left Peoples Bank without a backup. David A. Hunsucker, President of Peoples Bank, was particularly concerned about the 24-hour window for processing checks.

One incident emphasized their vulnerability. A backhoe dug up some AT&T phone lines and Peoples Bank lost their line to ADP. This was at a time when they needed to process and update their files for the captured checks. Time became so critical that they were making arrangements for James Saunders, Assistant Vice President of Data Processing, to fly to ADP headquarters in Cherry Hill, New Jersey to process their checks. Luckily the lines came back up just in time for them to process.

Following that incident, they decided to sign with First Recovery for backup and to run a mock disaster to test their plan. They went to Troy, N.C., with a day’s worth of checks. They installed the system, sorted and captured checks, and transmitted to ADP successfully. However, the test run was not without its problems. ADP had changed the password for the dial-up modems, and they couldn’t connect to ADP for approximately 45 minutes. The other problem was that when they ran the sort for the checks, the check sorter had a bad card and was sorting the checks into the wrong pocket. This took about an hour and a half to correct. How long would it have taken in an emergency situation? It might have taken 5 to 10 hours. This is why it is important to test your plan. The problems are easier to resolve when there is less pressure.

Commerce Bank of Chesapeake, VA, has changed from using a service bureau to in-house processing. This brought great concern about being able to continue their services in the event of a disaster. Keith Horton, Data Processing Manager, indicated that the Kirschman Group provided their software and initially the bank was being backed up by them. However, another bank had a disaster and had to use the backup site. It was then that the Kirschman Group found out how difficult it was to provide this service. The Kirschman Group therefore contacted HOTSITE[tm], a division of CompuSource, to help them provide backup for their clients. HOTSITE[tm] has IBM mainframe hardware. Commerce Bank subsequently signed with HOTSITE[tm].

Peyton Bowden, Executive Vice President of Commerce Bank, had committed to senior management to complete the bank’s disaster recovery plan within two months. They scheduled a test at the backup site, and the outcome is outlined in the table [at the end]. Although they had problems, overall the test was a success. Their plan includes emergency contact numbers for the resources that they may need in an emergency. In addition, their procedures are organized such that with the help of their backup site, they can recover without too much interaction from the bank. Consequently, the bank can recover even if a key person is missing.

Bowden also indicated that among his reasons for committing to disaster recovery planning was that the FDIC, during their examination, had asked other banks to show them their completed Disaster Recovery Plan. This and the commitment to maintain services to their customers were two good reasons for developing and testing the plan.

We have seen that disaster recovery planning continues to be on the upswing. Take a moment and think what would happen to your bank if you had a major fire or other damage at your main location and/or your computer facility. Would you survive?

This article was published in the April issue of Bankers Monthly. Richard Arnold is the publisher of the Disaster Recovery Journal. Melvyn Musson is an Assistant Vice President with M&M Protection Consultants.

This article adapted from Vol. 2 No. 2, p. 6.