Disaster Recovery Planning in a
VM Operating Environment
By Gary R. McClain
VM, or Virtual Machine, is an IBM operating system that runs on large-scale IBM computers. Even though this article is directed to
those who have VM, the methods and the thought process are the same for all systems.
DEFINING DISASTER RECOVERY
The need for a disaster recovery plan, in the context of organizational computing, is generally understood in terms of natural disasters. The term tends to conjure up images of fires, floods, and, beyond the realm of Mother Nature, nuclear holocaust. Disaster recovery is really a rather nebulous term, and is probably about as close to having a universal definition as "user friendly." The definition depends on one's perspective, with a developer looking at the problem much differently than a user.
It does not take an especially lengthy amount of experience in the VM environment, particularly where there are many users involved, to develop a broad view of disaster recovery. Mother Nature becomes a very mild threat when compared to the disaster potentials of inexperienced users, combined with a technical staff that is also inexperienced in providing the safeguards needed to protect users from hurting themselves. Increasingly the function of disaster recovery is not so much a concern for the event of a hardware problem as it is a protection against user error. If a user erases a file inadvertently, or deletes a large portion of a database, there must be a means of bringing this back with minimal disruption. Disaster recovery might be better described as end-user disaster recovery.
The terms related to disaster recovery, except in the most extreme cases, are very site-dependent. An extreme example of a disaster is one in which the Data Center has been bombed. Whether the site is large or small, this qualifies as a disaster. The gray areas are situations such as when a string of DASD is lost. At a large installation, this string may not contain critical information and may be a nuisance to the users involved, but not a disaster for the organization. At a smaller installation, or in a department, where that string might constitute all of the DASD on the system, the loss of that string is a disaster.
The means of protecting the organization against the event of disaster, however this is defined, is through system backup. System backup can be defined as an installation-driven transfer of data to a different medium, from DASD to tape, for example. Backup is designed to provide the installation with complete coverage in the event of a disaster, such as a hardware failure. A schedule for performing backups is determined by the organization. In addition, backup implies protection for the entire organization, rather than one user backing up his own data. At the same time, backup also protects each individual user, ensuring that if data is lost there is a means of restoring it. A provision for backup is a consideration among all operating systems, not only VM, though operating systems vary in terms of their provisions for this need. System backup implies making a copy of everything that is on the system so that if there is a loss of data the copy is available for restoration. VM has a built-in facility for system backup which, depending on the complexities of the environment, may or may not be adequate.
System backup has generally been thought of as a tape management issue, with DASD being backed up to tape and then placed in storage. While this has not disappeared, backup to DASD is also becoming more common for many reasons. With advances in technology, DASD is becoming much less subject to disaster and thus much more reliable as a storage medium. DASD is also becoming less expensive, and does not require the extensive manual intervention and physical storage associated with handling massive numbers of tapes. Of course, DASD cannot be stored offsite, and in the case of a smoke and rubble disaster, transporting DASD may not be possible or desirable.
The disaster recovery plan can be a very complex plan involving the storage of twin tapes, backup to DASD, and an alternate site in case of a major disaster. Or it can simply involve a periodic backup to tape using the capabilities of native VM. The question involved in making this decision is based on how important the data is to the organization as a whole. If corporate financials are stored under VM, for example, then the potential cost of losing this data will outweigh the costs involved in protecting it.
Disasters, natural or otherwise, occur when they are least expected. Yet the person in the organization responsible for making decisions about preparing for the possibility of disasters all too often performs this function with an "It won't happen to me" attitude. Finding the time and resources, particularly for organizations with competing user demands and a two-year backlog of applications, does not seem as critical as getting through an average week. The short-term demands often receive the attention even when you are faced with the probability of a future disaster. People don't generally take the time to buy their own gravesites either.
With the increasing number of organizations moving towards computerization of all vital information, preparing for a disaster has never been more important. Disaster recovery is a necessary business expense. In many industries the organization is responsible for information lost during a service outage, banks being an example. Insurance companies, realizing the importance of recovering lost information management capabilities as soon as possible, may give premium reductions to organizations that can prove that they have a workable disaster recovery plan. Disaster recovery has moved from being a means of "getting the auditors off our backs" to a key issue for the entire corporation.
Planning for a disaster seems like an unnecessary academic exercise until considered, not so much in relation to the possibility of hardware problems caused by a natural disaster, but instead the probability of users committing errors such as erasing an important file.
Events such as this must also be planned for by ensuring that adequate data recovery is available. Recovery implies the need for system backup, an important element of the disaster recovery plan. System backup can be a nuisance, and it is easy to forget, but it is a crucial piece in providing comprehensive protection for the VM environment.
Achieving adequate disaster recovery preparation begins with a plan that is designed with both the needs of the organization and the capabilities of VM in mind. Development of the disaster recovery plan can be thought of as a five-phase process beginning with an assessment of the organization's needs and ending with testing the plan after implementation. The phases are:
* Identify the Issues
* Hardware Requirements
* Software Requirements
* Personnel Requirements
* Testing the Plan
Phase I - Identify the Issues
The first phase in the disaster recovery plan can be summarized as: identify all systems, define the terms, know the exposures.
Protecting the VM environment against the occurrence of a disaster must begin with a consideration of which systems must be protected, questioning end users as to which systems are critical. An inventory of all of the systems that the users are currently using must be performed, and this may in fact be the only time that a comprehensive overview of all organizational capabilities has occurred. This is necessary to narrow the users' focus, to limit them to perhaps a choice of three out of five systems. These systems might include groups of specific applications or a database management system. It is only human nature to assume that every system is critical if given the opportunity to make this choice. Obviously, there must be a provision for backup for all systems, but in the event of a true disaster, some systems are more critical than others. In addition, depending on departmental functions, some departments are also more critical than others.
The process of performing an inventory of all organizational systems often leads to a need to perform some type of documentation. It is not uncommon to discover that much of the day-to-day functioning, for example, is based on the knowledge of one or two key individuals. What should be documented is instead stored mentally. The data center operates status quo until there is a breakdown. To adequately protect the functions performed by an application it may be necessary to develop written procedures which can then be stored offsite, in case a key individual is unavailable at the time of recovery.
A second result of taking a comprehensive look at systems may be finding that as many as 75% of the applications currently being used are not really critical to the survival of the organization during a disaster recovery period. Though relied upon by users, heavily used applications may still be placed on the back burner until the most necessary concerns are met. Alternatives may be available which can be used until the system is ready to support less critical applications, even if this means developing manual procedures. In the event of a severe disaster that involves extensive damage to the corporate data, it may be necessary to temporarily hire added clerical help until the system is ready to bear the burden of meeting all demands of the corporation. This interim processing procedure can be an important outcome of the overall systems inventory. It does not need to be a detailed document, but rather a set of guidelines.
The initial disaster recovery planning needs to be task-oriented, outlining what is involved in recovering the major functions of the organization as soon as possible. The technical aspects of the VM system need to be a specific focus of this initial planning: what must be done to get the system up and running and the critical data available. The other benefits of this phase of planning, including guidelines for manual procedures, are secondary to the focus on tasks involved in recovering the most critical VM capabilities.
In any case, THE DISASTER RECOVERY PLAN NEEDS TO BEGIN WITH A DEFINITION OF WHAT CONSTITUTES A DISASTER FOR THE ORGANIZATION. This may lead to the development of hierarchies of disaster and systems levels. Examples of disaster levels could be: Level 1 - The computer room is leveled. Level 3 - The nationwide network is down but local operations continue. Level 9 - A user has destroyed a departmental database.
Associated with these disaster levels are groups of procedures based on the severity of the disaster, as well as how important the specific lost functions are to the organization. For example, a lost database in the finance department might be more immediately disastrous than a lost mailing list. Because of this, system priorities should be associated with each system in the organization, based on the initial inventory. Examples of system priorities are: Priority 1 - Requires immediate and full recovery. Priority 3 - Requires full recovery, but may be deferred over a 30-day period with the assistance of temporary clerical help.
THE KEY TO MAKING DISASTER RECOVERY LEVELS AND SYSTEM PRIORITIES WORK IS TO KEEP THEM AS SIMPLE AS POSSIBLE. Defining 40 disaster levels and 60 system priorities, and identifying them with complex codes, will undermine any potential benefits of the system by destroying its simplicity.
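As an illustration of how little machinery a simple priority scheme needs, the following is a minimal Python sketch, not a prescribed implementation. It assumes each inventoried system carries a single numeric priority, with lower numbers recovered first, as in the Priority 1 and Priority 3 examples above. The class name, function name, and system names are invented for this sketch.

```python
from dataclasses import dataclass


# Hypothetical record for one inventoried system. Priority follows the
# article's examples: 1 = immediate and full recovery, 3 = may be deferred.
@dataclass(frozen=True)
class SystemEntry:
    name: str
    priority: int  # lower number = recover sooner


def recovery_order(systems):
    """Return system names sorted by priority, then by name for stable output."""
    return [s.name for s in sorted(systems, key=lambda s: (s.priority, s.name))]


# Illustrative inventory; these system names are assumptions, not from the article.
INVENTORY = [
    SystemEntry("mailing-list", 3),
    SystemEntry("corporate-financials", 1),
    SystemEntry("payroll", 1),
]

if __name__ == "__main__":
    for name in recovery_order(INVENTORY):
        print(name)
```

A handful of priority values is enough to produce an unambiguous recovery order, which is the point of keeping the scheme simple.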
Potential exposures also need to be defined in Phase I of the planning process. It is important to get a feel for where the major exposures are located, i.e., which areas are most critical and must therefore receive the most protection. If the corporate financial system is a top priority, then this will be an area that will have to be backed up often. If this information is located on a departmental system, then this department will need to institute extensive backup procedures. Those systems that present the most exposure will also need to have guidelines for alternative procedures associated with them to assure that in the event that the system is damaged, the function will continue to be performed. If the system in the finance department experiences difficulty caused by users or nature, employees will still need to receive paychecks.
Phase II - Hardware Requirements
Hardware requirements are the next part of the VM disaster recovery plan. These requirements can be summarized as choosing a professional alternate site or a reciprocal agreement, and examining device type dependencies.
Hardware considerations are the foundation of most disaster recovery plans, because a natural disaster, causing a power outage, will affect the hardware first. Once the most critical systems are defined in Phase I, it is important that a means of restoring these systems as quickly as possible be developed. The alternate site is a widely-used means of accomplishing this.
The alternate site, also called a hotsite, is a fully-equipped computing facility that provides organizations with the capability of taking what is essentially a copy of the VM system, and dropping it in. The alternate site is then ready to begin processing the essential functions if for some reason the organization experiences a loss of computing power. There are many companies offering alternate sites, some more equipped than others. Also, some timesharing companies offer this service on the side. With a well-organized backup system, so that the alternate site is kept up to date with the essential data, the alternate site can potentially be a means of achieving very fast recovery in case of a major disaster. It is important to consider the location of the alternate site when choosing one. If it is located in the same geographical area as the organization contracting its services, and thus experiences the same disaster, there is the question of which group receives priority during the recovery process. It is also important that the alternate site have provisions for stringent security, as most likely very sensitive data is being stored. Adequate testing of the recovery procedure, also a consideration, is discussed in Phase V.
An alternative to the hotsite is a cold site. A cold site is a facility that provides an environment suitable for the installation of a computer and associated hardware. This environment generally includes a raised floor, air conditioning, and power supplies. The subscriber is then responsible for contacting the hardware vendor for acquisition and installation of needed hardware to resume processing activities.
A severe disaster might also result in a loss of documentation. This must also be stored at the alternate site. On a daily basis, only a few pages of a technical manual might be used with any regularity. But these few pages may be crucial, and in the event of disaster, attempting to construct them from memory will be impossible.
In addition to the alternate site, the "buddy system" is an option for the disaster recovery plan. This involves signing a reciprocal site agreement with an organization that has a similar configuration. Each group agrees to be a buddy to the other, providing alternate processing in the event that one of them experiences a disaster. This can be an inexpensive and viable option provided that both organizations are compatible, and agree to be available to the degree necessary. Organizations with departmental processors in remote locations have the added option of designing an intra-organization buddy system, with a processor in one part of the company prepared to serve as a backup to a processor in another location. This achieves the same level of protection as a buddy system, with added commitment and security. Departmental disaster recovery is discussed later in this article.
There is also the "I am my own best buddy" system for disaster protection. This is a complete redundancy system. Some installations are so specialized or large that commercial recovery sites are of no use. Therefore, they have their own disaster recovery centers. This is expensive but may be necessary.
As implied, without adequate backup, the alternate or buddy site is really nothing more than a collection of blinking lights. If data is not backed up and verified for usability on a regular basis, then the disaster recovery plan truly is an academic exercise. Generating twin backup tapes, either through the use of system software that accomplishes this simultaneously, or by manually copying tapes, is a means of offering an added level of protection. One copy of the backup tape can be stored within the walls of the organization to assure restoration of information in case of user error, while the other can be stored at the alternate site. It is important not to neglect the onsite backup requirements while focusing on the alternate site. If all backup tapes are sent offsite, timely restoration may be dependent on how quickly tapes can be expressed from the alternate site.
In choosing either an alternate or a buddy site, device dependencies should not be ignored. For example, hardware compatibility, including not only the CPU but other devices such as tape drives and DASD, is critical. The alternate site must support the same density of tape drives, or provisions must have been made to have tapes recopied to the correct density before sending them to the alternate site. The alternate or buddy site does not necessarily need to be a mirror image of the organization that is contracting for disaster recovery assistance. Again, if the essential systems have been designated as having first priority, then the alternate site need not have as large a CPU as the organization it is backing up. In fact, rarely is an exact mirror image even possible. What is most important is that the operating system and performance requirements can be met by the backup site if disaster occurs.
Personal computer requirements should not be ignored in disaster recovery considerations. When the personal computer wave started, organizations purchased many of these machines without regard to compatibility. Because of extensive use in end-user computing, personal computers also need to be available, and important data backed up periodically and possibly sent offsite. Actually, the average personal computer user is most likely much more aware of the need for backup than is the average mainframe user because they are much more involved with the processes of the machine. The ideal situation for the inclusion of the personal computer is to connect these machines to minicomputers, upload the data and then back it up from the host. Corporate standards for backup will facilitate this process.
Phase III - Software Requirements
The software considerations of the VM disaster recovery plan include the following:
* System synchronization
* System backups
* Technical support
Coordination of release levels is an important element in planning for the use of a backup system, whether this backup is a processor located in another department or remote location of the same organization, or whether it is at an alternate site. Software products have maintenance levels and some older releases are no longer supported. Applications running under a new release may not be usable, or restorable, under an older release. For example, database systems can be very sensitive to differences among releases. Data loaded under a new release, with applications designed around the enhancements of the most recent release, may not be usable if restored under an older release. In the case of logical databases scattered over various departmental processors, if the database management systems loaded on these processors are not of an identical release level, data moved from one machine to another may not be restorable.
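A release-level comparison of this kind is easy to automate. The Python sketch below is only an illustration under stated assumptions: release levels are taken to be dotted numeric strings, and data is assumed restorable when the backup site runs the same or a newer release than the primary site. Real products may impose stricter rules, such as requiring identical releases, and the function and product names here are hypothetical.

```python
def parse_release(text):
    """Turn a dotted release string such as '5.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def restorable(primary_release, backup_release):
    """Assumption: restorable only if the backup site is at the same or a newer release."""
    return parse_release(backup_release) >= parse_release(primary_release)


def mismatched_products(primary, backup):
    """List products whose backup-site release is missing or older than the primary's.

    Both arguments map a product name to its release string.
    """
    problems = []
    for product, release in sorted(primary.items()):
        other = backup.get(product)
        if other is None or not restorable(release, other):
            problems.append(product)
    return problems


if __name__ == "__main__":
    # Hypothetical product names and release numbers.
    primary_site = {"dbms": "3.1", "operating-system": "5.0"}
    backup_site = {"dbms": "3.0", "operating-system": "5.0"}
    print(mismatched_products(primary_site, backup_site))
```

Comparing releases as tuples of integers rather than as strings avoids ranking release 5.10 below release 5.2.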
System synchronization is another important planning consideration. This concerns systems or applications that have multiple components that must be backed up at the same point in time. Database management systems are also a good example of this need. A relational database management system may have data that is stored on five minidisks. If a backup is performed while the database system is actually running, implying that users have update capability, and the five minidisks are backed up at 10-minute intervals, a personnel record which spans two minidisks and was updated during the backup may not completely exist. When restoring these five minidisks, the database management system might not run, or it might fail when encountering the personnel record that was updated. User applications that involve multiple files may have the same synchronization problems.
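The synchronization problem can be stated as a simple check: a multi-component backup is only trustworthy if no update occurred between the start of the first component backup and the end of the last. The Python sketch below illustrates that check under simplifying assumptions. Times are plain numbers (minutes), update times are assumed to be known from some log, and all names are invented for the example.

```python
def backup_window(component_times):
    """Return (earliest start, latest end) over all components of one application.

    `component_times` maps a component name to its (start, end) backup times.
    """
    starts = [start for start, _ in component_times.values()]
    ends = [end for _, end in component_times.values()]
    return min(starts), max(ends)


def is_synchronized(component_times, update_times):
    """True if no update fell inside the window spanned by the component backups."""
    start, end = backup_window(component_times)
    return not any(start <= t <= end for t in update_times)


if __name__ == "__main__":
    # Five minidisks backed up at 10-minute intervals, each taking 5 minutes.
    minidisks = {f"disk{i}": (10 * (i - 1), 10 * (i - 1) + 5) for i in range(1, 6)}
    print(is_synchronized(minidisks, update_times=[12]))  # update during the backup
    print(is_synchronized(minidisks, update_times=[50]))  # update after the backup
```

In practice the simplest way to guarantee the condition is to stop updates for the duration of the backup, so that the window contains no changes at all.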
The backup system should be organized around the needs of the organization. Some data does not need to be backed up often, and it may be adequate to back it up every few days. In the departmental environment, backups can be performed on a departmental schedule, with some departments backing up more often than others. The key considerations that underlie whatever schedule is chosen should have the goal of achieving the recovery of whatever the organization has decided is a minimal configuration on VM in the event of a disaster. The needs of individual user groups and departments are really secondary to this consideration. The systems deemed most critical during Phase I need to be the first considerations in the backup plan. If this information can change daily, then it must be backed up daily if the ongoing efficiency of the organization is dependent on this timeliness.
The issue of full versus incremental backups becomes a major consideration when deciding how organizational data must be backed up. The larger the organization, the greater the impact of this decision. A full backup, or dump, backs up everything in the system. This can be a very time-consuming procedure as well as requiring large amounts of physical storage space for tapes. An incremental backup, on the other hand, backs up everything that has changed since the last backup. Data is backed up in increments, based on either the last system backup, or the last incremental backup.
System software is necessary to facilitate this level of specification, particularly if the incremental backups are to be based on dates. An incremental backup requires less time and resources than a full backup, because only what has changed is being backed up. However, incremental backups lengthen the recovery process because data is spread over more tapes. Also, performing more tape mounts increases the potential of operator error.
An effective ongoing backup procedure will most likely require a combination of both full and incremental backups. A smaller installation may find a nightly full backup sufficient, but a larger system may not be able or may not find it necessary to do a ten to twelve hour backup nightly, so this procedure might occur one or two times per month. During the intervening time, only what has changed since the system backup can be captured through a nightly incremental backup.
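The effect of mixing full and incremental backups on recovery can be sketched in a few lines of Python. This is an illustration only, and it assumes that each incremental captures changes since the previous backup of either kind, so a restore needs the most recent full backup plus every incremental taken after it. If incrementals were instead based on the last full backup, only the newest one would be needed. The function name and the tuple layout are assumptions of the sketch.

```python
def restore_chain(backups):
    """Return the backups needed for a restore, in the order they should be applied.

    `backups` is a list of (day, kind) tuples, where kind is "full" or "incr".
    Assumes each incremental holds changes since the previous backup of any kind.
    """
    ordered = sorted(backups)
    last_full = None
    for index, (_, kind) in enumerate(ordered):
        if kind == "full":
            last_full = index
    if last_full is None:
        raise ValueError("no full backup available")
    # Everything after the latest full backup is, by construction, incremental.
    return ordered[last_full:]


if __name__ == "__main__":
    # Full backups on days 1 and 15, nightly incrementals in between.
    history = [(1, "full"), (2, "incr"), (3, "incr"),
               (15, "full"), (16, "incr"), (17, "incr")]
    for day, kind in restore_chain(history):
        print(day, kind)
```

The length of the returned chain is a rough measure of the number of tape sets, and therefore tape mounts, a recovery will require, which is one way to weigh how often a full backup should be scheduled.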
Organizations relying on native VM for system backup, without the aid of system software, can develop a system of both incremental and full backup. This would most likely involve the use of CMS EXECs in combination with TAPE DUMP and DDR.
The window of time available for full dumps at the average VM installation is most likely diminishing. VM is probably accommodating many users and applications, with large and growing amounts of DASD. It is unlikely that there is time available on a regular basis to shut the system down for long periods. Incremental backups certainly help in this process, but these must also be scheduled prudently. Incremental backups still require more tapes, and more tape mounts. Thus there are personnel requirements, storage requirements, and by implication, the possibility of human error in tape handling and labeling. Careful record keeping in tape storage helps with this process, but there are other issues. It may be necessary to reconstruct a minidisk as of a certain date, or a user may wish to restore a specific file, but not everything on a minidisk. Human error is also introduced in these processes.
Getting back to the definition of the critical systems for the organization: at the core of these systems, and therefore the most important data to be backed up, is an IPLable VM system (a basic VM system that may be brought up through the Initial Program Load). A disaster dump tape must always be available that contains enough portions of the system to make VM available once that portion of the system is restored. These areas include the following:
* System residence volume
* CP nucleus
* Directory areas
This will provide a minimal configuration on VM, which will in turn be a basis for restoring the critical applications.
Phase IV - Personnel Requirements
The fourth phase of the disaster recovery plan involves personnel considerations, including:
* Administrative logistics
* Technical coordination
Administrative responsibilities for the disaster recovery plan are really site dependent. They include issues that may easily slip through the cracks in the planning, such as transportation of backup tapes to the alternate site, and coordination of staffing requirements locally. It is not uncommon for the organization to be in a panic when a disaster hits, and attention to administrative details before the fact can assure that the necessary details and responsibilities are carried out. It may be helpful to categorize responsibilities locally, in transit, and at the remote site. The same level of attention should be given to the administrative details necessary in less catastrophic disasters, such as who is responsible for recovering data inadvertently destroyed by a user.
Technical coordination implies the mobilization of personnel who have been trained in the procedures necessary for carrying out the disaster recovery plan, and assuring that the minimal configuration of VM is available as soon as possible. Individuals involved in technical coordination should have the expertise to recover all hardware and software aspects of the systems designated as being priority. Technical persons involved in operating systems may not understand what is involved in restoring a database, or a network. A guru in operating systems may not be helpful with other aspects of recovery.
To facilitate this coordination, a single disaster recovery coordinator may be appointed, with team leaders reporting to this person in case of a disaster. This person might be located at the central site, with each department having a designated team leader. People are the key element in whether disaster recovery is a success or a failure, and selection of this team, whatever the basis for its composition, is important. Responsibilities for each team member need to be clearly defined. At many installations a single team member may have a variety of responsibilities. Responsibilities continue after the disaster, such as notifying vendors that the disaster occurred, and placing customer service teams on alert to assure that customers experience a minimal reduction in service.
The critical point in designating responsibilities is that recovery occurs as soon as possible without excessive concern for standard organizational chains of command.
Phase V - Testing the Plan
During this phase, the disaster recovery plan is tested with goals that include evaluating the overall effectiveness and adjusting the plan to changing requirements and personnel.
This phase is really an ongoing process, because provisions for disaster recovery, particularly as regards disasters, must be tested regularly. Testing is most effective if it is unannounced. This is not to say that overzealousness, pulling the plug on the total system, is in order. In a network situation, with remote sites, this would truly be a disaster. It is better to choose one system, prepare ahead, and test it in such a way that the whole organization is not crippled. An alternative to a shutdown is a structured walk-through in which key individuals get together in a room and walk through the scenario and their respective responsibilities. This is similar to a role play, and runs the risk of being no more than a well-conducted meeting.
Scheduling the disaster recovery test, quarterly or semi-annually for example, provides a means of testing reliability. A disaster recovery team can be sent to the alternate site to conduct the test at the same level, based on system priorities. This indicates not only how the internal staff is prepared, but the alternate site as well.
The disaster plan, and backup schedules in general, should be reevaluated on a regular basis. Application and system configurations change often, and so do organizational priorities. If a new system becomes a priority, such as a new financial application, it needs to be backed up as often as necessary and built into the recovery plan. New releases of critical software need to be installed at the backup site.
A disaster recovery plan can take from six months to two years to develop, yet at the end of the development period, much of it may be outdated. Software, for example, changes so rapidly that it may be virtually unrecognizable after two years. Data centers also change drastically over a two-year period, with requirements and personnel evolving rapidly. Flexibility to handle these changes should be built into any plan for anticipating disasters, however disaster is defined in the organization.
Written by Gary R. McClain, VM Software
This article adapted from Vol. 1 No. 4, p. 18.
Disaster Recovery World © 1999, and Disaster Recovery Journal ©
1999, are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.