User files, comprising most of the data volume for IBM mainframe systems, can be further categorized as production or test files. The DASD space associated with these two categories contains data, programs, work areas, test files and temporary files.
Three areas represent the data classifications within most systems (see Figure 1). Systems data is related to either the operating system or program products. Although these “components” are often very stable and free of frequent change, they must always be backed up.
User-libs is another low-volume classification. It is also relatively stable. This classification includes source code, control cards, executable libraries, JCL, parameter decks, reference files, etc. This data is critical and must be backed up frequently. Note that some of this data can be recreated from other data. The executable library, for example, can be recreated from source code, but the time to do so is not available during an emergency.
The user-libs classification also contains testing data. This data is not critical to daily business processing. The test executable libraries need not be backed up as frequently because they can be rebuilt from the production executable library and source code. On the other hand, the source code libraries must be treated as “semi-critical” because they represent work which cannot afford to be lost.
The final classification is User-data. User-data represents the largest volume of data in your shop (see Figure 2) and is extremely heterogeneous, causing the biggest problem when preparing for disaster recovery. Two trends of data center growth have been predominant in the past ten years. First, more on-line processing is required. Second, data storage volumes have grown significantly. As on-line systems become more predominant, the need to access data approaches 24 hours a day. As the volume of data grows, more time is required for backups and the window for batch or on-line production shrinks. These two growth trends are in direct competition for the same resources.
The question is where to draw the line between data that must be copied and saved offsite, and that which is derivable, non-essential or of no value. (The of-no-value category holds a lot more data than anyone would tend to believe. Duplicate data, out-of-date copies, empty space, etc. all fall into this category.)
THE DBMS (USER DATA)
Within the user-data category is the data under control of a DBMS (Data Base Management System). This data is different because the DBMS supplies utilities and procedures to restore the data base in case of a failure or catastrophe. A DBMS can be restored to a current point (i.e. the last checkpoint, syncpoint, etc.) by using a data base backup and the log file(s) written since that backup. The last log file(s) represent the exposure because they are onsite due to their recently created status. Some installations actually ship a copy of these log files offsite during the processing day to decrease their exposure.
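The recovery sequence described above (reload the last backup, then reapply the log written since that backup) can be sketched in a few lines. This is only an illustration of the idea, not the recovery utility of any particular DBMS: the `LogRecord` layout, the `restore` function and the sequence numbers are all hypothetical, and the data base is modelled as a simple key/value table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    """One logged change (hypothetical layout, for illustration only)."""
    sequence: int          # position in the log; higher means later
    key: str               # record that was changed
    value: Optional[str]   # new contents, or None if the record was deleted


def restore(backup: Dict[str, str], backup_sequence: int,
            log: List[LogRecord]) -> Dict[str, str]:
    """Rebuild the data base from a backup plus the log written after it."""
    database = dict(backup)  # start from the backup copy
    for record in sorted(log, key=lambda r: r.sequence):
        if record.sequence <= backup_sequence:
            continue  # change is already reflected in the backup
        if record.value is None:
            database.pop(record.key, None)
        else:
            database[record.key] = record.value
    return database


if __name__ == "__main__":
    backup = {"A": "1", "B": "2"}  # backup taken at log sequence 10
    log = [
        LogRecord(11, "A", "5"),
        LogRecord(12, "B", None),
        LogRecord(13, "C", "9"),   # written after the last offsite shipment
    ]
    offsite_log = [r for r in log if r.sequence <= 12]
    print(restore(backup, 10, log))          # everything that was logged
    print(restore(backup, 10, offsite_log))  # what survives a site loss
```

The difference between the two results is the exposure: any change held only in a log file that was still onsite is lost. Shipping log copies offsite more often shrinks that difference.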
The fact that a DBMS provides this recovery processing is one of the major reasons that such systems have been implemented. Some shops merely use the DBMS as a backup system and structure the data base to be nothing more than flat files under the DBMS umbrella.
There are actually two major concerns with the backup of data: frequency and content. How often do you back up data and which data do you include? Ideally, all data is backed up all the time, but this is only practical when your resources are limitless.
Thus, the concept of selective backup becomes very inviting. This concept has several variations. One is to take a full backup at some point and follow it with incremental backups until the next full backup. The advantage of this option is that an incremental backup will save only datasets that have been added or changed since the last backup. This is much faster and less resource-consuming than a full backup. These datasets are usually selected either by creation date or by the change bit in the VTOC.
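The selection rule for an incremental backup can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of any particular backup product: the `Dataset` record is a hypothetical stand-in for a VTOC entry, its `changed` flag plays the role of the change bit, and resetting that flag after a successful backup is an assumption of the sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class Dataset:
    """Hypothetical stand-in for the VTOC information about one dataset."""
    name: str
    created: date
    changed: bool  # models the change bit: set when the dataset is updated


def select_incremental(catalog: List[Dataset], last_backup: date) -> List[str]:
    """Names of datasets added or changed since the last backup."""
    return [
        d.name for d in catalog
        # >= is deliberately conservative: a dataset created on the day of
        # the last backup may have been created after that backup ran.
        if d.changed or d.created >= last_backup
    ]


def mark_backed_up(catalog: List[Dataset], selected: Iterable[str]) -> None:
    """Reset the change flag once a dataset has been copied (assumption)."""
    names = set(selected)
    for d in catalog:
        if d.name in names:
            d.changed = False


if __name__ == "__main__":
    catalog = [
        Dataset("PROD.MASTER", date(1990, 1, 1), changed=False),
        Dataset("PROD.TRANS", date(1990, 1, 1), changed=True),
        Dataset("PROD.NEWFILE", date(1990, 3, 5), changed=False),
    ]
    chosen = select_incremental(catalog, last_backup=date(1990, 3, 1))
    print(chosen)
    mark_backed_up(catalog, chosen)
```

In this example only the changed dataset and the newly created one are selected; the unchanged master file is left for the next full backup.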
A second variation is the critical file backup, which will only back up those datasets that are vital to production systems. Other datasets are backed up less frequently. Determining which datasets are vital is the hardest part of this system. Ideally, the process would be automated so it would not have to rely upon human intervention.
Both methods attempt to minimize the risk involved in recovering a data center while maximizing the up-time for production processing. If all data is backed up every day, then the maximum outage is one day. With no backups, the risk is total catastrophe (see Figure 3). In fact, risk can be extremely high even when the quantity of data that is not backed up is low. The risk varies with the “value” or “vital nature” of the data that might be lost.
Every shop faces the dilemma of deciding what data to back up. Some choose to back up everything; others back up piece-by-piece from system specifications; still others play it by ear and back up what they feel is critical data by dataset-name or volume. The full backup has been discussed above. The partial backups pose the problem of determining whether all of the critical data has been included in the backup process. Loss of a single key file can be fatal in a recovery effort.
Several years ago, S/SE, a Systems Programming and Management Consulting firm in Wayne, PA, became involved in a project to automate this process. Together with several customers, they automated the manual process to separate user-data into two piles: data which was required for production processing and data which was not.
It seems simple and, in theory, it is. The concept is to look at all production processing cycles (daily, monthly, quarterly, yearly, etc.) and select those files that are not created within the application system or within another application system (see Figure 4). Data transient in nature--e.g. constantly recreated or created and discarded--is ignored. If data is discarded or not accessed after processing, why create and save a copy of that data?
The mechanics are a bit more complicated. Those datasets that must exist prior to the beginning of the application cycle are located from the data logged by the operating system itself (SMF records). The key here is that there is no guesswork in the process. Data from the operating system (SMF records) are used to analyze processing that actually occurred.
Consider the first open of a file. If it was for INPUT, then the file had to exist prior to executing the production cycle. This file must be included in the vital file list. Files opened for OUTPUT with the disposition of NEW would be data created within the processing and would not have to be backed up. Remember that this is the first usage only. If a file was opened first as OUTPUT and in a later step as INPUT, the data to be read was created in the prior step and would be available even though the file was not backed up from a prior cycle (see Figure 4).
The resultant list of datasets consists of only the prerequisite data required for the production cycle to be processed. The list is not based upon human opinion, but on the processing that occurs while the real system is running.
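The first-open rule can be sketched as a short routine. This is an illustration under stated assumptions, not the S/SE implementation: the `OpenEvent` record is a hypothetical, already-extracted view of the dataset activity logged in SMF (no real SMF record layout is parsed here), and the dataset names are invented. The article describes only two cases, first open for INPUT and first open for OUTPUT with a disposition of NEW; any other first open (for example, OUTPUT to an existing dataset) is treated conservatively as vital, which is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class OpenEvent:
    """One dataset open during a production cycle (hypothetical layout)."""
    sequence: int      # order in which the opens occurred
    dataset: str
    mode: str          # "INPUT" or "OUTPUT"
    disposition: str   # "NEW", "OLD", "MOD" or "SHR"


def vital_files(events: Iterable[OpenEvent]) -> List[str]:
    """Datasets that had to exist before the cycle started."""
    seen: Set[str] = set()
    vital: List[str] = []
    for event in sorted(events, key=lambda e: e.sequence):
        if event.dataset in seen:
            continue  # only the first usage of a dataset matters
        seen.add(event.dataset)
        created_in_cycle = (event.mode == "OUTPUT"
                            and event.disposition == "NEW")
        if not created_in_cycle:
            # First open for INPUT, or (assumption) any other open of a
            # dataset that already existed: it is a prerequisite.
            vital.append(event.dataset)
    return vital


if __name__ == "__main__":
    cycle = [
        OpenEvent(1, "PAY.MASTER", "INPUT", "OLD"),
        OpenEvent(2, "PAY.WORK", "OUTPUT", "NEW"),
        OpenEvent(3, "PAY.WORK", "INPUT", "OLD"),   # created in step 2
        OpenEvent(4, "PAY.MASTER", "OUTPUT", "OLD"),
    ]
    print(vital_files(cycle))
```

Here the work file is excluded because its first use created it, while the master file is listed because the cycle read it before anything else touched it. Running the same routine over each processing cycle (daily, monthly, quarterly, yearly) and merging the results would give the full vital file list.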
Once the basic method is developed, several interfaces can be added to make the system more reliable. The most obvious are reports to various departments which reflect the results of the automated selection process.
For DASD datasets, the software used for the physical data transfer to tape is your current DASD archival package. An interface to that package is built into the vital file identification process for your environment. The tape management interface is likewise built for your current tape management system. This layered implementation allows most of your current technology, investment and knowledge base to be retained. Therefore, implementing this management layer is relatively inexpensive and non-disruptive (see Figure 5).
The selected subset of data fulfills the goals of the Disaster Recovery philosophy: back up the critical data, eliminate risk of data loss, and optimize the resources for production processing. Backing up more data is a waste of system resources; backing up less is exposing the business to unnecessary risk.
An important question to consider is what portion of the user-data area comprises the active-set of datasets required for recovery. Each installation is different, plus the profile of a data center will change over time. Systems mature, new systems are installed, peripheral development is undertaken. All of these affect the makeup of the user-area.
Does this mean that the technique used to provide recovery must be changed? It might, unless a philosophy like the one described above is implemented. With older methodologies, the balance between incremental backups and full volume backups is constantly changing and must be tuned daily. Since the tuning process is prone to errors, it is often replaced with the philosophy of totally backing up the DASD farm. The vital file backup method tunes itself using system data, so it is reliable and reacts to changes in the data center.
The ideal disaster recovery plan provides constant backup for all data. Unfortunately, this is far from practical in data processing installations. Therefore, the plan must optimize the amount of data which is recoverable at an acceptable level of risk while keeping within cost constraints. This is a policy decision, not a technical one.
The major technical decision is how to optimize the process of backing up data using minimal resources. The fundamental requirement in this task is to understand how and when each data file is changed. With this understanding, a sensible backup plan can be implemented.
Automating the process of data analysis is a giant step toward a safer and more reliable process. Only with such systems can data center management feel secure. If a disaster should occur, they can react to restore the business data processing functions of their organizations.
Fred Schuff is a Systems Programming Consultant in Wayne, PA. His background includes over 20 years of supporting large scale IBM Operating Systems and Data Base Management Systems.
This article was adapted from Vol. 3 No. 3, p. 6.