October 26, 2007

Electronic Vaulting Alternatives

Written by Tom Flesher

As businesses and other organizations become increasingly reliant upon computerized systems--particularly OLTP (online transaction processing) systems--MIS executives are required to develop backup plans that ensure the survival of critical data. In a number of industries, the notion of relying on last night’s backup of a critical data base has become obsolete. Users cannot be expected to re-enter lost transactions after recovering from a disaster; instead, a technology-based solution can address the requirement of remote recoverability of critical data bases.

In the IBM mainframe world, a number of possible configurations exist for support of online transaction processing. Generally speaking, mainframe sites use a full-function DBMS such as IMS/VS or IDMS/R, or possibly a teleprocessing monitor system such as CICS/VS, to manage the corporate data base. DB2 is emerging as the relational DBMS for future production applications; online transactions are processed using either CICS or IMS as a front-end to DB2. In many large installations, today’s reality is frequently some combination of the systems listed here.

Online and batch processing of data bases requires full integrity and recoverability. This is ensured through the use of log or journal files. As transactions are processed, the DBMS writes log or journal records to record the changes in a consistent format. In the event of a hardware failure, a forward recovery utility reapplies the logged changes to a backup copy of the entire data base. Users are already familiar with these techniques, so it is desirable to extend the concept to remote recovery situations. Thus, remote logging (or remote journaling) in conjunction with off-site backups of complete data bases would seem the most logical and straightforward approach to solving the problem.
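The forward-recovery idea can be sketched in a few lines of Python; the record layout and field names here are illustrative, not those of any particular DBMS:

```python
# Minimal sketch of forward recovery: start from a backup copy of the
# data base, then reapply the logged after-images in sequence.
# Record layout and field names are hypothetical.

def forward_recover(backup_copy, log_records):
    """Rebuild the data base from a backup plus the log written since."""
    database = dict(backup_copy)                  # start from the backup image
    for record in sorted(log_records, key=lambda r: r["seq"]):
        database[record["key"]] = record["after_image"]
    return database

# Example: a backup taken at sequence 0, followed by two logged updates.
backup = {"acct-100": 500, "acct-200": 75}
log = [
    {"seq": 1, "key": "acct-100", "after_image": 450},
    {"seq": 2, "key": "acct-200", "after_image": 125},
]
print(forward_recover(backup, log))   # {'acct-100': 450, 'acct-200': 125}
```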

This article describes a number of currently available approaches and analyzes each alternative from several perspectives. Some of these alternatives use log or journal data and some do not. The important point to keep in mind is the tradeoff between cost and risk; spending more money should lower the risk of data loss. How you assess this tradeoff will affect your decision on which alternative to pursue.

ALTERNATIVE ONE: BATCH TRANSMISSION OF LOG OR JOURNAL DATA

Some installations are sending log/journal data to a remote location using this technique today. As online DASD log/journal files fill up, they are typically archived to tape for possible future recovery requirements. The archive tapes are usually retained for several days, or at least until the full data base can be backed up. An additional step in the archiving process requires that the archived data be transmitted electronically to a remote location, using either a channel-extended tape drive or host-to-host communications employing file transfer or bulk data transmission software.

Depending on the size of the online DASD log/journal files and transaction rates, the log/journal data may not be transmitted for some time. In many installations, it takes several minutes, perhaps hours, for an online log/journal to fill. Thus, the transactions represented in the most recent online log or journal would not be available at the remote site if a serious disaster were to occur. The risk exposure represented by these lost transactions must be estimated. In some industries, the loss of even 15 minutes’ worth of committed transactions could be fatal. The only ways to reduce the risk are to use very small online DASD logs or journals, or to force a switch from one online log to another at frequent intervals. Either approach carries some operational impact, as many batch jobs must be initiated to process the small batches of log/journal data.

Since the data is transmitted on a delayed basis, the communications facilities must be sized to accommodate the peak sustained logging/journaling rate. If the communications bandwidth is not adequate, batches of journal/log data will “back up” at the originating site and increase the risk exposure.
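A back-of-the-envelope calculation, using purely hypothetical figures, shows how this sizing question plays out:

```python
# Hypothetical link-sizing check: the line must sustain at least the peak
# logging rate, or batches of log data will queue up at the primary site.
peak_log_rate_kb_s = 40.0                  # assumed peak sustained journaling rate
line_speed_kbit_s = 256.0                  # assumed leased-line speed
usable_kb_s = line_speed_kbit_s / 8 * 0.7  # ~70% usable after protocol overhead
print(f"usable: {usable_kb_s:.1f} KB/s vs. peak: {peak_log_rate_kb_s} KB/s")
# 22.4 KB/s < 40 KB/s: this line falls behind at peak and must be upgraded.
```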

This alternative requires a control mechanism to detect missing or out-of-sequence batches of log/journal data and to deal with other error conditions. For instance, if there is a problem transmitting batch #1234, do you hold up the transmission of batch #1235? What happens if there is a communications or a systems failure?
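One plausible shape for such a control mechanism, sketched in Python on the receiving side; the batch numbering and interfaces are illustrative:

```python
# Receiving-side control sketch: accept batches of log data, detect gaps
# and duplicates, and apply batches only in strict sequence.

class BatchReceiver:
    def __init__(self, next_expected):
        self.next_expected = next_expected
        self.held = {}                 # out-of-sequence batches awaiting a gap fill

    def receive(self, batch_no, data):
        if batch_no < self.next_expected:
            return                     # duplicate retransmission; discard
        self.held[batch_no] = data
        # a missing batch (e.g. #1234) holds up everything behind it (#1235...)
        while self.next_expected in self.held:
            self.apply(self.held.pop(self.next_expected))
            self.next_expected += 1

    def apply(self, data):
        """Write the batch to the remote vault (details omitted)."""
```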

In general, host-to-host communications are preferable to channel extenders since the hosts can maintain a catalog of information about the log/journal data being transmitted. It is possible, however, for the user of channel extenders to devise the necessary control mechanism at the remote site to ensure continuity and completeness of the logs or journals received.

ALTERNATIVE TWO: APPLICATION-SPECIFIC SOLUTIONS USING QUEUING TECHNIQUES

This alternative requires the application developer to embed special features in the applications software to deal with the business requirement. This approach has been implemented in a number of large organizations and provides a solution that reduces the data loss exposure associated with Alternative #1.

One technique is to queue an “audit trail” record for delivery to the remote system as part of any update transaction processed at the local site. This can be accomplished through standard queuing and communications facilities provided by DBMS systems such as IMS or IDMS. In IMS, for example, it is possible for a transaction program to place a message containing the audit trail data in the message queue for delivery to a remote IMS via an ISC (Intersystem Communication) link. There is I/O overhead associated with the queuing activity. The audit trail records must contain all the information necessary to recreate the transaction if a recovery is required. Special utility programs must be developed to process the audit trail records and recreate the transactions at recovery time.

Depending on the queuing technique employed, an audit trail record should be transmitted to the remote site within a few seconds of the original transaction that created it. Exceptional situations, such as extended communications outages, must be addressed. If the communications link is down for an extended period of time, the queues may overflow, causing loss of critical transaction data.
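A sketch of the technique, with a bounded in-memory queue standing in for the DBMS message queue; all names and limits are hypothetical:

```python
# Each update transaction also queues a self-describing audit trail record
# for delivery to the remote system. The bounded queue models the overflow
# exposure during an extended communications outage.
import queue

audit_queue = queue.Queue(maxsize=10_000)   # bounded, like a real message queue

def process_update(txn_id, key, new_value, database):
    database[key] = new_value               # the local update itself
    record = {"txn": txn_id, "key": key, "after": new_value}
    try:
        audit_queue.put_nowait(record)      # queued for the remote site
    except queue.Full:
        # the link has been down too long: committed transactions would be
        # missing from the remote copy -- the exposure described above
        raise RuntimeError("audit queue overflow; remote copy incomplete")
```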

ALTERNATIVE THREE: REAL-TIME LOG/JOURNAL/DATA BASE DUPLICATION USING CHANNEL EXTENDERS OR “LONG CHANNELS”

This alternative is based on a hardware solution employing channel extenders or the new IBM long channels operating over a fiber-optic communications link. The mainframe host believes the disk and tape data sets are locally attached, when in fact they may be some distance away. The physical separation of the hardware helps to assure data survival in many, but not all, disaster situations.

The mainframe software must be capable of simultaneously managing duplex data base and/or log/journal files; not all DBMS systems can do this. In the case of IMS, for example, the user may define duplexed log data sets, but not duplexed data base data sets. In the case of CICS, no duplexing software is provided. In the case of IDMS, the user must code a special exit routine to accomplish duplexing of journal or data base data sets.

DASD devices, which permit random access, are inherently more dependent on timing considerations than are tape devices, which are accessed sequentially. Due to purely physical limitations, the remote site cannot be more than a few kilometers away (if using the IBM long channels) or, at most, 20 kilometers (using channel extenders). Even at these distances, severe performance degradation can result, because channel protocols require complete synchronization between the mainframe and the DASD hardware: no buffering is possible without loss of integrity. The time required for data and confirmation signals to travel at the speed of light over distances measured in kilometers can cause significant degradation of the DBMS environment. Furthermore, if a communications failure were to occur, there would be no easy method to recover. Either the remote site is taken offline until a complete refresh of the remote files can take place, or else production activity at the primary site must be halted. Special procedures must be developed to ensure that the DBMS regions can be successfully restarted at the remote site at recovery time.
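A rough calculation makes the distance penalty concrete (the propagation speed in fiber, roughly 200,000 km/s, is approximate):

```python
# Why synchronous channel protocols degrade with distance: every channel
# operation waits for a full round trip before the next can proceed.
SPEED_IN_FIBER_KM_S = 200_000   # approximate speed of light in fiber

for distance_km in (2, 20):
    round_trip_ms = 2 * distance_km / SPEED_IN_FIBER_KM_S * 1000
    print(f"{distance_km:>2} km: {round_trip_ms:.2f} ms per round trip")
# 2 km adds 0.02 ms per round trip; 20 km adds 0.20 ms. With several
# synchronous handshakes per I/O and thousands of I/Os per second, the
# propagation delay quickly dominates DBMS response time.
```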

ALTERNATIVE FOUR: REAL-TIME BUFFERED TRANSMISSION OF LOG/JOURNAL DATA USING HOST-TO-HOST COMMUNICATIONS FACILITIES

Using standardized exits during real-time DBMS processing, this solution buffers the data in a separate address space for transmission to the remote site. Log/journal data is available for transmission the instant it is created. In normal operation, the data is transmitted within a fraction of a second, guaranteeing the recoverability of transactions right up to the second of a failure. Log or journal data that has been buffered but not yet transmitted would be lost. This data, representing an extremely small number of committed transactions, is the only data that would be lost in the event of a disaster.

This alternative’s primary advantage is its relatively low cost. By using a buffered approach, the data can be transmitted asynchronously without affecting the performance of the DBMS regions that are generating it. The communications link must be able to handle the maximum sustained logging rate but does not need to be as fast as a local DASD channel. The distance limitations inherent in the channel extender approach (Alternative #3) do not apply here--with good use of “pipelining” transmission techniques, it is possible to route the log/journal data to a site thousands of miles away without performance degradation. When possible, the data is buffered in storage rather than being committed to a DASD queue or spool file. With the advent of IBM’s Enterprise Systems Architecture, it is clear that performance-oriented applications and systems should do as little real I/O as possible. I/O operations are slow compared to the internal speed of a processor complex, and OLTP systems are frequently I/O intensive. The creative use of address spaces, data spaces or hyperspaces makes it possible to transmit the data without additional I/O activity at the production site. Further, MVS Cross Memory Services are used to pass the data from the originating DBMS region to the buffering region.

Specialized recovery techniques have been developed to validate the continuity of data being transmitted. If the communications link were to fail for an extended period of time, the buffers in storage would overflow. This is handled through the use of spill files that hold the data until the link is once again available. In addition, a recovery method exists to handle cases of system or power failures; data in the buffers at the time of the failure must be recovered and transmitted upon restart. This can be done by extracting the relevant data from the log or journal files created by the DBMS regions. A sophisticated control mechanism ensures full recoverability in all of these situations. Since the log or journal data is transmitted immediately, no batch jobs are required in normal operation, limiting the degree of operator intervention. The process is significantly more continuous and “online” than the batched approach described in Alternative #1.
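The buffering-and-spill discipline might look like the following sketch; the class, its parameters, and the send call are illustrative, and a production implementation would add the restart recovery just described:

```python
# Buffered transmission sketch: log records are captured the instant they
# are written, held in storage, and drained asynchronously; when the
# buffers fill (e.g. during a link outage) records spill to a file.
from collections import deque

class BufferedShipper:
    def __init__(self, max_buffered=50_000):
        self.buffer = deque()           # in-storage buffers
        self.spill = deque()            # stands in for a DASD spill file
        self.max_buffered = max_buffered

    def capture(self, log_record):
        """Called from the logging exit; must never block the DBMS region."""
        if self.spill or len(self.buffer) >= self.max_buffered:
            self.spill.append(log_record)   # once behind, keep spilling in order
        else:
            self.buffer.append(log_record)

    def drain(self, send):
        """Runs asynchronously; 'send' is the hypothetical link call."""
        while self.buffer:
            send(self.buffer.popleft())     # oldest in-storage records first
        while self.spill:
            send(self.spill.popleft())      # then the spilled backlog
```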

ALTERNATIVE FIVE: SYNCHRONIZED SOLUTION USING TWO-PHASE COMMITS OR REDUNDANT TRANSACTION PROCESSING

This alternative relies upon either DBMS-provided or applications-specific facilities to apply an update transaction at two (or more) sites in a fully synchronized manner. The application will not consider a transaction complete until both sites have confirmed that the update has been made. Using this technique, it is possible to maintain “mirror” data bases at the two sites that are always in agreement with each other.
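In outline, the protocol looks like the following sketch; the participant interface is illustrative, and a real implementation would also need logging, timeouts, and resolution of in-doubt transactions:

```python
# Two-phase commit across a primary and a backup site: no site makes the
# update permanent until every site has voted yes.

def two_phase_commit(txn, participants):
    # Phase 1: ask every site to prepare; any "no" vote aborts everywhere
    if not all(p.prepare(txn) for p in participants):
        for p in participants:
            p.rollback(txn)
        return False
    # Phase 2: all sites voted yes; now the update is applied everywhere
    for p in participants:
        p.commit(txn)
    return True

class Site:
    def __init__(self, name):
        self.name, self.data, self.pending = name, {}, {}
    def prepare(self, txn):
        self.pending[txn["id"]] = txn        # hold the update, vote yes
        return True
    def commit(self, txn):
        t = self.pending.pop(txn["id"])
        self.data[t["key"]] = t["value"]     # make the change permanent
    def rollback(self, txn):
        self.pending.pop(txn["id"], None)

# The "mirror" data bases change only when both sites confirm:
sites = [Site("primary"), Site("backup")]
two_phase_commit({"id": 1, "key": "acct-100", "value": 450}, sites)
```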

This solution is practical for only the most critical applications, since the cost of implementation is extremely high. Response times will suffer since the transaction must be routed and processed at multiple physical locations. Recovery and integrity problems abound; for example, what would you do if there is a communications failure between the primary and backup sites? Do you halt all processing until the link is available again, or can you allow processing to continue with only one data base available? If you continue to process, how do you resynchronize with the remote site after the link returns?

This alternative depends upon sophisticated Distributed Data Base technology that is just now beginning to appear in commercial offerings. As a practical matter, it may be a number of years before this approach is truly workable.

 


Tom Flesher is Executive Vice-President and co-founder of E-Net Corporation.

This article adapted from Vol. 3 No. 2, p. 32.
