Data Processing Recovery (46)
Many companies have a significant investment in large mainframe systems and communications networks. Over the past five to ten years, major efforts in contingency planning have provided reliable, secure and automated disaster recovery plans for the corporate data stored on the mainframe platform.
However, just as we see the emergence of mature, well-tested plans for mainframe recovery, much of the critical data in large organizations is moving to the personal computer. Will this require years of effort to develop totally new processes for the backup and recovery of PC data? Or can we benefit from past efforts and use the mainframe for disaster recovery? This article will explore the need for effective backup of all the important data across the organization and a new solution to enhance existing disaster recovery plans to cover valuable data stored on microcomputers.
Survey Results from the First International Disaster Recovery Symposium & Exhibition, Atlanta - September, 1989
Approximately 450 surveys were distributed to the attendees at the conference. The following is the outcome of the survey analysis.
Corporate Crisis Management is based on a solid program of information protection, strategic business continuity planning, and identification, organization, and protection of communications links.
Each element of corporate crisis management must be addressed as a component of an overall strategy to ensure business survival. Protection of critical information includes the backup of critical functions and supporting data, as well as the securing of sensitive information. Business planning recognizes and addresses all critical business components. This includes traditional data center recovery and also the end user office environment, particularly information communications centers. Such centers provide voice and data communications both within a company and to customers. For example, a clerk who answers incoming customer phone inquiries and provides responses to the status of a client account via on-line data base queries uses both voice and data communications. And the third component, communications links, ensures both internal communications channels for subject matter experts, public relations, and executive decision makers to coordinate and assist with potential crises, as well as links to the public to ensure an intentional, accurate flow of information. Of course, the inclusion of the actual voice and data communications links is the key to this component.
During the September Disaster Recovery Conference in Atlanta, AT&T Data Security Services conducted a survey of more than one hundred and forty attendees focusing on data systems security and contingency planning. Organizations participating in the survey represented manufacturing, banking, financial, service, various levels of government, telecommunications, medical, education, research and a variety of other fields. The largest group of responses came from the first five categories. The statistics and resulting analysis that follow will allow organizations to compare their data security and contingency planning programs to the cross section of organizations that responded to this survey.
The situation was tense. Twenty-four tape drives - all wanting to be fed while each of us stood with 10 tape cartridges under each arm unable to answer the call...
This describes the situation in which we found ourselves during a recent recovery exercise. Previously, participants in a disaster recovery exercise would notify the Data Storage Technicians which tapes to send to the Recovery Center for the exercise. This notification would normally take place two days before the starting date of the exercise. We had just started a disaster recovery tape rotation program in which our production systems sent their recovery tapes to our remote vault every day. This was our first opportunity to use this new rotation process. We anticipated that this would allow any of our customers to recover to within 24 to 36 hours of the system failure. While the number of tapes sent to the recovery center increased four-fold (with the increase directly attributed to this new rotation pattern), we had it all planned out -- we thought. No problems!
The process began in orderly enough fashion. Each day the Data Storage Technician would receive pick-lists for the several vaulting patterns to be sent to the remote vault. These lists were systematically generated by our tape management system. The tapes would be picked and packed for shipment in numerical sequence, then scanned and double-checked to verify the shipment was correct.
The tapes were then sent to the remote vault where they were to remain in the cartons, as packed, for a fixed number of days. At the end of the rotation period, the tapes would be returned to the data center as scratch for reuse.
For a recovery exercise, or in the event of an actual disaster, one call to the remote vault would send all our recovery tapes to the recovery center. The tapes were already picked, packed, awaiting the call and one hundred percent correct. Everything anyone would need to restore the system to "yesterday" would be in the remote vault.
When we arrived to conduct our recovery exercise, we had approximately 35 of these containers, each containing about 100 tape cartridges, waiting for us. Each of these 35 containers was unpacked and its tapes separately placed in the tape racks. The restore began, tape drives began calling for tapes, and about 30 minutes later we knew we had big problems.
It was somewhat inconvenient to look through 35 different little stacks to find a tape. We did some grumbling. However, when a drive finished with a tape and unloaded it to be put back in the rack, the glaring question came out: "Where did the tape go?" Since we had not made any kind of notation on the cartridge as to which little stack it came out of, we had no idea where to re-rack the tape. As used cartridges began to accumulate here, there and everywhere, a couple of suggestions were made.
We knew we had to put the tapes in some kind of order to avoid being buried in used tapes. We began an ordeal that extended some 14 hours where two and three members of our recovery team hand-sorted the tapes into one numerical sequence while attempting to keep up with more than 20 hungry tape drives. We had our hands full of tapes picking, racking, and stacking.
At the end of the 14 hours, we had successfully sorted the tapes into one ascending sequence. We were an exhausted but happy little group. The remainder of the exercise was somewhat uneventful at the recovery center. Oh, we had a couple of missing tapes due to mis-entries into the tape management system (a learning point for our TMS folks); however, we were quite pleased with our progress. Then the second shoe fell.
When the time came to pack up and go home, we realized that we were not prepared to undo the 14-hour hand-sort routine. We began picking through all the lists that had been cast aside as not being needed later in the exercise. Not all of the lists were marked as to which container they belonged to, and not all containers had contained pick lists. We found ourselves trying to guess which tapes went with which container. Six hours later, we felt pretty good. After all, we only had about 75 tapes left over.
We had no idea what the tapes contained or into which containers they should be packed.
We did the only desperate thing: we placed the leftover tapes in the leftover container and shipped it back to the remote vault, keeping our fingers crossed that we would not need them before the rotation period expired.
Somehow we survived what turned out to be a valuable recovery exercise. Vowing never to let this happen to us again, we returned home with a strong personal resolve to fix the problem. After asking a few questions we learned that the file used by the Data Storage Technicians to pull the tapes could be downloaded onto a diskette. We could create diskettes for each day's tapes going to the vault. We would be sure that each container held not only tapes, but also a diskette and a pick list of those tapes.
Learning the record layout of the file, we used spread-sheet software to write a system of macros to read in the data from the diskettes, build a collective list of all tapes, sort the tape cartridges into numerical sequence (keeping track of the container from which they came) and assign slot numbers in the tape racks. We also had to allow for those little things that spring up such as, "Oh, by the way, where did that container come from? It wasn't there before." It worked! We reduced our time to rack the tapes from 14 hours to a little over three hours. When it was time to pack and go home, we had two options: either do a reverse sort or use the original packing list.
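The racking procedure described above can be sketched in a few lines of Python. This is only an illustration and not the spread-sheet macros the team actually wrote: the names `build_rack_plan` and `repack_plan`, the container labels, and the assumption that each pick list is simply a list of tape serial numbers are all hypothetical.

```python
from typing import NamedTuple


class RackedTape(NamedTuple):
    volser: str     # tape serial number as it appears on the pick list
    container: str  # container the tape was shipped in
    slot: int       # assigned rack slot, numbered from 1


def build_rack_plan(pick_lists: dict[str, list[str]]) -> list[RackedTape]:
    """Merge per-container pick lists into one ascending sequence with slots."""
    merged = [
        (volser, container)
        for container, volsers in pick_lists.items()
        for volser in volsers
    ]
    # Assumption: serials are numeric strings; any non-numeric serial
    # is placed after the numeric ones, in alphabetical order.
    merged.sort(
        key=lambda pair: (int(pair[0]) if pair[0].isdigit() else float("inf"), pair[0])
    )
    return [RackedTape(v, c, slot) for slot, (v, c) in enumerate(merged, start=1)]


def repack_plan(plan: list[RackedTape]) -> dict[str, list[str]]:
    """The 'reverse sort': group racked tapes back into their containers."""
    containers: dict[str, list[str]] = {}
    for tape in plan:
        containers.setdefault(tape.container, []).append(tape.volser)
    return containers
```

Because every racked tape carries its container of origin, a container that turns up late only needs its pick list added and the plan rebuilt, and packing to go home becomes a lookup rather than guesswork.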
Over the past couple of exercises, we have refined and improved our process into a smooth procedure. With multiple hands helping it became easier, faster and much more peaceful to manage our daily contingency tapes. Looking back, we can smile, or rather laugh at ourselves. Now, we have a sense of achievement.
All the tapes are racked and ready to go before the restore process reaches that point of call. One more learning point on the recovery path had been mastered.
We feel good that we met, we recognized and we conquered this rude and startling issue in an exercise rather than first meeting it in a real disaster.
Gary G. Wyne, CDRP, is a Business Continuity Planning Coordinator with Eli Lilly and Company.
When companies plan to deal with events that can destroy a business -- fire, flood, hurricanes, etc. -- the plan often puts information processing among the top priorities. And with good reason.
Many businesses simply can’t run without their computer systems. For them, information processing is mission critical -- a vital part of keeping the doors open. Consider a mail-order house, insurance claims processor or any management information service. In each case, providing service depends on computer power; without it, the company is helpless.
That’s increasingly true of most businesses, even those where information processing appears less critical. What would the impact be if you lost your computer for four hours? A day? A week? How much business would you lose over that period? Even more important, how much future business would you lose if customers turned to another supplier while you were out of commission, and never returned? Consider hidden costs, too, such as customer dissatisfaction and loss of confidence in your ability to deliver at critical times.
It’s not surprising that the life expectancy of a business without its information processing system is measured in days. Almost half the enterprises that are struck by a business-stopping disaster never reopen, and an additional 29 percent go out of business within two years.
How can you beat the odds? By putting information processing at the top of your disaster recovery plan. But as with any business service, it pays to do some research. Not every company that offers disaster recovery service for your computer system is prepared to deliver what you need. Consider these points:
- Configuration. Ensure that the service provider will supply a computer configured to match your existing system. It must function identically when running your mission-critical applications in order to get you up and running again. And it should arrive with the most recent operating system pre-loaded and ready to run.
- Response time. How fast will the computer reach you? Without a disaster recovery plan, you face two options: locating a used computer that will meet your needs, or asking the manufacturer of your computer system to send you the next new one off the assembly line. In most cases, count on either option taking weeks -- weeks during which your company may be withering on the vine.
- Time to resume processing. Keep in mind you don’t just need a box; you need it up and running so your business can continue. That’s your goal. Some providers will send a machine to you with lightning speed, but that only solves half the problem. Then you need to load your software and data and make it run. Focus not just on replacing hardware; focus on ensuring continuity of your business.
- A turnkey solution. Obviously, your information processing needs are just a small part of what it will take to resume business operations in the face of a disaster. You will face 1,000 other questions as well: personnel, facilities, insurance claims, etc. At such a time, it pays to be able to make a single phone call to arrange for the exact equipment you need, with a field engineer who will install the system on site. Let your disaster recovery vendor handle that burden, freeing you for other concerns.
- Cost. Traditionally, the cost of a disaster recovery service for information processing has run high: up to ten percent of a company’s annual data processing budget to set up a plan, and one to two percent per year to maintain it. But for many smaller companies, the figure can be brought down significantly. The key is to find a provider -- often a hardware manufacturer -- who’s able to leverage existing investments. A company that already operates service response centers, and one that specializes in your type of equipment and operating system, can probably offer disaster recovery services at minimal extra cost. Unlike a third-party service provider, it doesn’t have to maintain 20,000-square-foot data centers across the country with machines in dozens of different configurations.
- Insurance considerations. A disaster recovery plan is strongly considered when an underwriter develops an overall business insurance policy. How prepared a business is to recover from a disaster will clearly be reflected in the premium. Check with your insurance provider for more information. Don’t make the mistake of assuming business interruption insurance takes the place of good disaster recovery planning. Insurance can replace some immediate lost income, and reimburse you for physical damage. But it can’t keep your business whole. Replacing lost income for a week or two won’t ensure the future health of your company.
- Testing. To ensure the plan will work when it’s needed, your provider should offer regular tests of the hardware and software that would be rushed to your doorstep in the event of an emergency. Arrange to take backup tapes of your data and load them into the system, then attempt some routine processing. This is particularly important as the company upgrades the hardware it maintains on your behalf, and new releases of the operating system hit the market.
The first step, of course, is designing an overall disaster recovery plan. As part of that plan, identify what parts of your business must be running at any cost. That usually includes recovery of your information processing systems, and your system provider is a good place to start planning what disaster recovery will entail for you.
Today, only one-third of Fortune 500 firms have disaster recovery plans. Your company should definitely be one of them.
Bob Simon is Manager, Professional Services, Applied Digital Data Systems, Inc. in Carlsbad, Calif.
Information is increasingly recognized as a critical business asset. The protection of information used in today’s corporate environment, either through the use of access control processes or secure storage solutions, has increased in visibility and is becoming a real challenge to information systems managers.
In the past, critical business applications lived in the glass house data center. Sophisticated protection schemes evolved to ensure the viability of this important information. Today’s fast-paced world of client server computing is driving ever-increasing amounts of data processing outside of the glass house; however, the information on distributed systems is no less critical to the organization than that in the data center.
Preparation for the recovery of client server systems should receive the same amount of attention that the recovery of core mainframe business applications has traditionally received - perhaps more, since the task of protecting a distributed computing environment is much more difficult. The geographic dispersion of end users and the anarchistic tendencies of LAN administrators make a disaster recovery plan you design for a distributed environment a nightmare to implement and test.
In order to protect the viability of any computing environment you must successfully back up and store the information in use in the environment. At the very least, every organization should have a well thought out, documented plan for the backup and storage of all information critical to the health of the company.
Will the recovery really work? The issues related to answering that question are indeed mind-boggling. Providing fast, complete backup and recovery in today’s 24X7 processing environments for both local and disaster outage scenarios across multiple applications and data types is of paramount importance. Downtime windows of 8-12 hours each weekend to create full-volume dumps are becoming unworkable. Incremental backup failures each night are becoming more common. The use of Aggregate Backup and Recovery Support (ABARS) will provide the necessary function required to address these concerns.
Many kinds of backup tools are currently used in today’s environment, each intended to address different recovery situations. Volume dumps are intended to protect against HDA failure, incremental backups protect against single data set loss, IMAGECOPIES are needed for online databases, etc.
Although most installations focus on the backup process, the real issue is RECOVERY. Recovery must be cost-effective, streamlined, complete and all-encompassing. Enormous overhead is spent backing up data redundantly with multiple tools and still many recovery requirements can’t be met. Problems traditionally include missing data, incompatible device geometries, data/catalog synchronization issues, huge manual effort and unacceptably long recovery times.
Additionally, implementing DFSMS and its related strategies requires that old, ‘tried and true’ backup processes be re-examined, especially for disaster recovery. Previously, ‘critical’ data was hand placed on certain DASD and dumped for backup purposes. DFSMS, if fully implemented and exploited correctly, completely removes physical device dependencies, with data now existing anywhere in a hierarchy. SMS Managed Tape further complicates this issue. Aggressive migration policies cause volume dumps to miss critical data. Multi-volume data sets cause additional complications. Volume dumps may get a nice return code zero (0) at backup time, but have inherent problems during the recovery.
Restoring lost packs, minidisks, and data files is a critical component of any disaster recovery plan. The occurrences that precipitate the need to restore data are many and diverse, and contingency planners need to be aware that a number of backup and restore approaches are available. The exact approach taken in any given situation will depend on both the extent of the loss as well as the extent to which lost data needs to be restored.
Even so, some companies have not completely explored the various backup and restore techniques available that will let them minimize the time to recover their business operations after a disaster--or even after the smaller data losses that can occur on a daily basis. This is particularly the case in the VM operating environment.
VM has become increasingly popular in recent years because it offers end-users a flexible operating environment, allowing them a high degree of control over file creation and manipulation. However, what IBM did not provide with this system is an efficient way for operations to back up and restore inadvertently destroyed minidisks or corrupted files. The IBM restore utility called DASD Dump Restore (DDR) included with VM is not only slow--requiring about 30 minutes to back up or restore a single 3380/D pack--but also requires technical support staff involvement, particularly if trying to restore a user minidisk or file.
A number of products have evolved to fill the operations need to easily back up and restore lost packs, minidisks and files. But before making a product selection, contingency planners need to understand the various conditions which could lead to a restore requirement. Only then will they be able to know how best to respond to each of these situations and select the backup and restore product that meets their specific requirements.
There are two different ways to back up and restore data--logically or physically. Logical restores assume a “logical” view of the storage media which reflects the existence of the CMS file system in minidisks on that media. Physical restores, in contrast, make no such assumptions and view the media as unformatted with no internal data structure. In practice, logical restores are done on a minidisk by minidisk basis, while physical restores are done cylinder by cylinder.
Logical restores are the method of choice when, through human error, a file or minidisk is corrupted or erroneously deleted. With a backup and restore product, these situations are easily rectified by an interactive full-screen search for the name of the lost file or minidisk in an online catalog, and then by a function key request to direct a software restore.
A logical restore should also be favored if a head crashes. This may happen, for example, in conjunction with a head disk assembly failure--a hardware error that renders a volume on the disk unreadable. Since that volume may contain a number of end-user minidisks, restoring the user pack swiftly and efficiently is crucial to business operations. Again, with a logical restore, the name of the pack is entered from a front-end screen, a function key is pressed, and the volume is automatically restored to its most current status.
Logical restores are recommended in these three situations (file, minidisk, and user packs) because in each case the operating system and backup product capabilities are available. However, if the system pack fails, placing the operation in a standalone mode, a more complex restoration is required. This situation may also occur in the event of a disaster which renders the entire system inoperative.
In either of these cases, since the backup product and operating system are unavailable, an initial program load (IPL) must be executed from a standalone tape. A physical restore can then be completed after mounting the backup tape and entering a single command. By comparison, a logical restore would require the systematic restoration of each minidisk individually by manually entering separate restore commands for each minidisk from a hard copy file listing.
Recovering from a disaster with logical restores requires significantly more time and steps to complete than physical restores. Pack names must be input before the restore can begin, and all volumes have to be labelled and formatted--a process that can take anywhere from three to 12 additional hours, depending on the expertise of the support staff. Additional time and human involvement may be required to format special system areas.
Under the stresses of a disaster recovery operation, it may not be wise to depend on human efficiency for business critical operations. For this reason, when operating from a hot-site after a disaster in a standalone mode, it is better to conduct physical restores.
Only when DASD capacity is limited at the hot-site backup facility is it better to include logical restores in an overall data recovery operation. In this case, complete a physical restore for business critical packs, and then do logical restores for individual minidisks and/or files on an as-needed basis.
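The guidance above amounts to a small decision rule, which can be sketched as follows. This is a simplified illustration under stated assumptions, not part of any backup product: the `Loss` categories, the `restore_plan` function and the `hot_site_dasd_limited` flag are hypothetical names chosen for this example.

```python
from enum import Enum


class Loss(Enum):
    FILE = "file"
    MINIDISK = "minidisk"
    USER_PACK = "user pack"
    SYSTEM_PACK = "system pack"
    SITE_DISASTER = "site disaster"


def restore_plan(loss: Loss, hot_site_dasd_limited: bool = False) -> list[str]:
    """Return the restore steps suggested by the article for a given loss."""
    # The operating system and backup product are unavailable only when
    # the system pack fails or the whole system is lost in a disaster.
    standalone = loss in (Loss.SYSTEM_PACK, Loss.SITE_DISASTER)
    if not standalone:
        return [f"logical restore of the {loss.value}"]

    steps = ["IPL from standalone tape"]
    if hot_site_dasd_limited:
        # Limited DASD: restore critical packs physically, the rest on demand.
        steps.append("physical restore of business-critical packs")
        steps.append("logical restore of individual minidisks/files as needed")
    else:
        steps.append("physical restore of all packs")
    return steps
```

For example, `restore_plan(Loss.MINIDISK)` yields a single logical restore, while `restore_plan(Loss.SITE_DISASTER, hot_site_dasd_limited=True)` yields the IPL followed by the mixed physical and logical sequence.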
Product performance factors are vitally important to contingency planners--but only in the context of their individual recovery requirements. If recovery windows are large, time may not be critical. But in most cases, disaster recovery plans exist for one reason: to ensure swift resumption of business operations.
To ensure that business recovery can be completed as fast as possible, planners should develop a data restoration plan that reflects their specific needs for logical or physical restores. Only then should they systematically review available backup and restore products and select the one that meets these specified requirements.
Walter Horodyski is the Software Development Manager with Syncsort, Inc.
This article adapted from Vol. 3 No. 4, p. 56.
When you think of "lost data," you might immediately think of a natural disaster such as a fire, flood or earthquake that has completely destroyed your office and has left your computer in ruins.
More than 80% of the damaged drives that have been recovered had suffered damage from simple, common problems that could have been prevented if the company had had a good, simple, foolproof data backup system. Every drive has a normal life span, and it is always a question of "when" it will fail, not a question of "if".
When a disgruntled employee left a large company, he decided to leave behind more than just a messy desk. A virus was left on the company's computer that slowly and methodically latched itself to all the company programs. Another employee encrypted all of the company data files and left with the only known password. Human error also plays a major role in causing the loss of computer data.
One company was confident that they had a very fail-safe backup system in place, only to discover, when their computer failed and they reinstalled data from the tape backups, that the only files saved were the program files and no data files had been backed up.
Another company employee always backed up each night to a cartridge tape and very carefully used the same tape over and over for every backup. Then tragedy hit; the drive failed just as she was backing up the data to the only tape cartridge and all data was lost. You see, the first thing a tape does during the backup process is to erase all the data on the tape from the last backup, and as it was reading data, the computer's drive failed.
A major corporation kept nine sets of backup disk packs with duplicate information on them so as to prevent any kind of data loss. The tenth was kept in a locked vault.
When the computer's drive failed, the technician very carefully got a backup pack and placed it in the damaged drive, found the backup didn't work, got another, and another, then another, still another. Failing all nine packs, he very cleverly got the combination to the vault, removed the final pack, put that one on the drive and actually ruined all ten backup packs due to the fact that the damaged head from the bad drive continued to crash every backup in the house. Almost unbelievable, isn't it?
A firm's secretary was anxious to use a friend's program and was installing it as a "surprise" for her boss. The "friend" told her how to install the unauthorized copy of the software, and she proceeded to overwrite all the data on the computer's drive and destroy all the company's actual data.
A company's drive failed and an employee decided to recover the data himself and proceeded to use one commercial recovery program after another leaving the drive overwritten with computer solutions instead of recovering the actual "lost" data.
When a computer repair company received a customer's drive, the customer asked them to upgrade the computer and transfer the data from the old drive to the new one. The technician placed the old drive on the corner of the table, and it fell to the floor knocking all the drive's platters out of alignment. There were no backups.
When the external disk drive sat happily atop an air conditioner, it happily vibrated itself to the edge of the unit and at the end of each week, the employee would resituate it in the center, until he was out sick for a few days and it happily vibrated itself onto the floor.
All of the above disasters could have been prevented if a good, reliable backup system had been in place and operating.
There are just a few backup tips you need to know about when instituting a backup plan:
1. Use a simple, reliable method. One that verifies its backups easily and is easy to use. An inexpensive hard disk backup utility such as Norton Backup Version 1.2 or Central Point Backup Version VI is fast and easy to use. It will back up your data to floppy disks or tapes and verify the integrity of the backup.
2. Whatever backup method you use, always have more than one set of backups. Don't back up to the same tape each time, or even to the same set of floppy disks. Realize that when you back up, you are overwriting that media at the same time.
3. Always have a second method on hand as a secondary backup. If you are using a tape backup system, try backing up important data files to floppy about once a week as a secondary backup just in case the tape fails.
4. If your computer is connected to a large network computer, take a few minutes each week to make simple backups of the data files that are important to you.
5. Always include an off-site backup copy in your backup procedure. Once a month make a backup of your computer and take the floppies or tapes away from the office. When the backup is stored at the same location as the computer, it is equally at risk when a disaster strikes.
6. Have a firm company policy about bringing in software from outside sources into the office. You can run anti-viral software on a computer in just a few minutes that can isolate and destroy many computer viruses. The Norton Anti-Virus is a good investment in preventing a whole lot of computer troubles.
7. Never make changes to a computer without backing it up first. That includes both software and hardware changes. Don't depend on a computer service center to save your data when upgrading your system. When taking your computer into a service center, back up your data first.
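Tips 2, 3 and 5 can be combined into a simple schedule, sketched below. It is only an illustration: the function name `backup_tasks`, the default of three tape sets, and the choice of Friday and the first of the month are assumptions made for this example, not recommendations from the article.

```python
from datetime import date


def backup_tasks(day: date, tape_sets: int = 3) -> list[str]:
    """List the backup tasks for a given day under a simple rotation."""
    if tape_sets < 2:
        # Tip 2: a single set means every backup overwrites the last good copy.
        raise ValueError("use more than one set of backups")

    # Rotate through the sets so consecutive days never reuse the same media.
    set_number = day.toordinal() % tape_sets + 1
    tasks = [f"back up to tape set {set_number} and verify"]

    if day.weekday() == 4:  # Friday (assumed day for the weekly copy, tip 3)
        tasks.append("copy important data files to floppy as a secondary backup")
    if day.day == 1:  # first of the month (assumed day for tip 5)
        tasks.append("make a full backup and take it off-site")
    return tasks
```

Calling `backup_tasks(date.today())` each morning gives the operator a short checklist, and the rotation ensures that a drive failure in the middle of a backup destroys at most one of the sets.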
Carleen Bridgeman is employed at Data Retrieval Svcs., Inc., Clearwater, Fla.
“Most organizations today are so dependent on the operation of their computer facilities that loss of processing for any period of time is intolerable,” stated a Datapro Research report. This is especially true of enterprises running “mission critical,” online transaction processing (OLTP) applications that provide a constantly current record of their businesses.
OLTP applications provide authorized users with the ability to immediately read, change, or delete information from any location in a network. Examples of critical applications in finance include automated teller machines, electronic funds transfer, point-of-sale, and securities trading. In retail, critical applications include online credit authorization, warehousing, and distribution. Critical telecommunications applications include telemarketing call centers, 800 numbers, and emergency 911 services. Critical manufacturing applications include work-in-progress tracking and just-in-time materials delivery. These are just a few examples of how OLTP is increasing its role in a wide variety of industries.
The trend toward using online information to run an enterprise is rapidly spreading because it provides current data on which to make management decisions, enables provision of better service to customers, and improves intra- and inter-company communications, helping an enterprise gain and maintain a competitive advantage.
From a study of several industries (public utilities, finance/banking, insurance, manufacturing, and the services), the financial and functional impacts of loss of computer service were reported by the Center for Research on Information Systems at the University of Texas at Arlington, shown in Figure 1. In this study of 160 firms, the typical company can expect to lose almost 25% of its average daily revenue by the sixth day of an outage. The estimated loss rises to 40% of the average daily revenue by the 25th day. In another study by Datapro Research, of the enterprises that sustain a major disaster, 43% never reopen and 29% close within two years.
In a disaster, lost data is one of the irrecoverable elements. For online applications, lost or corrupted data can eliminate chances of a complete recovery. OLTP users face the greatest risk from a disaster because their critical business functions depend on up-to-the-minute data. Thus, if computer service becomes unavailable, enterprises running mission-critical applications would be unable to conduct their businesses. This risk, along with pressures from legal departments, auditors, and sometimes government regulations, forms the driving motivation for comprehensive business continuity planning.
THE CHALLENGE - BEFORE THE LOSS
The previous two decades have marked the beginning of a new business age. Information technologies and business operations have become very dependent upon electronic data processing, magnetic media, telecommunications and supporting documentation. The development of these technologies presents a new challenge to EDP users. When disaster strikes in the form of fire and flood, a company’s electronic-based information assets may be placed in serious jeopardy. A timely and coordinated physical recovery plan can make the difference between a manageable, short-term suspension of operations and a devastating business failure. The timely application of innovative, state-of-the-art electronic equipment restoration services will often limit and mitigate both property damage and business interruption losses. Case histories have proven that over 80% of smoke and water exposed electronic equipment can be successfully restored to a pre-loss condition, typically at a cost less than 25% of the comparable replacement costs. A restoration process carried out by dedicated specialists with a well-defined sense of urgency can be completed in several days, while replacement of smoke and water exposed equipment can take several months and involve expensive, time-consuming reengineering.