
Electronic Vaulting Alternatives
By Tom Flesher
As businesses and other organizations become increasingly reliant upon computerized systems--particularly OLTP (online
transaction processing) systems--MIS executives are required to develop backup plans that ensure the survival of critical data. In a
number of industries, the notion of relying on last night's backup of a critical data base has become obsolete. Users cannot be
expected to re-enter lost transactions after recovering from a disaster; instead, a technology-based solution can address the
requirement of remote recoverability of critical data bases.
In the IBM mainframe world, a number of possible configurations exist for support of online transaction processing. Generally
speaking, mainframe sites use a full-function DBMS such as IMS/VS or IDMS/R, or possibly a teleprocessing monitor system such
as CICS/VS, to manage the corporate data base. DB2 is emerging as the relational DBMS for future production applications; online
transactions are processed using either CICS or IMS as a front-end to DB2. In many large installations, the reality of today's
requirement is frequently some combination of the systems listed here.
Online and batch processing of data bases requires full integrity and recoverability. This is ensured through the use of log or journal
files. As transactions are processed, the DBMS writes log or journal records to record the changes in a consistent format. In the event
of a hardware failure, a forward recovery utility is used to recover from a starting backup copy of the entire data base. Users are
already familiar with these techniques, so it would be desirable to extend the concept for remote recovery situations. Thus, remote
logging (or remote journaling) in conjunction with off-site backups of complete data bases would seem the most logical and
straightforward approach to solving the problem.
This article will describe a number of approaches currently available and analyze each alternative from a number of perspectives.
Some of these alternatives use log or journal data and some do not. The important point to keep in mind is the tradeoff between
cost and risk; spending more money should lower the risk of data loss. How you assess this tradeoff will affect your decision on
which alternative to pursue.
ALTERNATIVE ONE: BATCH TRANSMISSION OF LOG OR JOURNAL DATA

Some installations are sending log/journal
data to a remote location using this
technique today. As online DASD
log/journal files fill up, they are typically
archived to tape for possible future recovery
requirements. The archive tapes are usually
retained for several days, or at least until the
full data base can be backed up. An
additional step in the archiving process
requires that the archived data be transmitted
electronically to a remote location, using a
channel extender tape drive or host-to-host
communications that employs file transfer or
bulk data transmission software facilities.
Depending on the size of the online DASD
log/journal files and transaction rates, the
log/journal data may not be transmitted for
some time. In many installations, it takes
several minutes, or even hours, for an online
log/journal to fill. Thus, the transactions
represented in the most recent online log or
journal would not be available at the remote
site if a serious disaster were to occur. The risk exposure represented by these lost transactions must be estimated. In some
industries, the loss of even 15 minutes' worth of committed transactions could be fatal. The only ways to reduce the risk would be to
use very small online DASD logs or journals, or to force a switch from one online log to another at frequent intervals. Either choice
carries an operational cost, as many batch jobs must be initiated to process the small batches of log/journal data.
Since the data is transmitted on a delayed basis, the communications facilities must be sized to accommodate the peak sustained
logging/journaling rate. If the communications bandwidth is not adequate, batches of journal/log data will back up at the originating
site and increase the risk exposure.
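The tradeoff described above can be put into rough numbers. The following is a minimal sketch, not a sizing tool: the function names and all figures in the usage note are hypothetical, and it assumes a steady logging rate.

```python
def exposure_window_seconds(log_size_bytes: float, log_rate_bytes_per_sec: float) -> float:
    """Time for one online log to fill, i.e. the worst-case age of data not yet archived."""
    return log_size_bytes / log_rate_bytes_per_sec


def transactions_at_risk(window_seconds: float, tx_per_sec: float) -> float:
    """Committed transactions that could be lost if the site fails just before a log switch."""
    return window_seconds * tx_per_sec


def backlog_bytes(log_rate_bytes_per_sec: float, link_bytes_per_sec: float, seconds: float) -> float:
    """Untransmitted data that piles up when the link is slower than the logging rate."""
    return max(0.0, log_rate_bytes_per_sec - link_bytes_per_sec) * seconds
```

With these assumed figures, a 90,000,000-byte online log filling at 50,000 bytes per second switches every 1,800 seconds (30 minutes); at 20 transactions per second, up to 36,000 committed transactions are exposed. If the link carries only 40,000 bytes per second, an hour at that logging rate leaves 36,000,000 bytes waiting at the originating site.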
This alternative requires a control mechanism to detect missing or out-of-sequence batches of log/journal data and to deal with other
error conditions. For instance, if there is a problem transmitting batch #1234, do you hold up the transmission of batch #1235?
What happens if there is a communications or a systems failure?
In general, host-to-host communications are preferable to channel extenders since the hosts can maintain a catalog of information
about the log/journal data being transmitted. It is possible, however, for the user of channel extenders to devise the necessary
control mechanism at the remote site to ensure continuity and completeness of the logs or journals received.
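One possible answer to the batch #1234 question is to let transmission continue but have the remote site hold later batches until the gap is filled. The sketch below illustrates that policy only; the class and method names are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class BatchReceiver:
    """Remote-site bookkeeping: apply batches strictly in sequence, report gaps."""
    next_expected: int
    held: dict = field(default_factory=dict)     # batches that arrived ahead of a gap
    applied: list = field(default_factory=list)  # sequence numbers applied so far

    def receive(self, seq: int, payload: bytes) -> list:
        """Accept one batch and return the sequence numbers still missing."""
        if seq < self.next_expected or seq in self.held:
            return self.missing()  # duplicate delivery: ignore it
        self.held[seq] = payload
        # Apply every batch that is now contiguous with what was already applied.
        while self.next_expected in self.held:
            del self.held[self.next_expected]
            self.applied.append(self.next_expected)
            self.next_expected += 1
        return self.missing()

    def missing(self) -> list:
        """Sequence numbers between the last applied batch and the newest held batch."""
        if not self.held:
            return []
        return [s for s in range(self.next_expected, max(self.held)) if s not in self.held]
```

If batch 1235 arrives before 1234, `receive(1235, ...)` reports `[1234]` as missing and applies nothing; once 1234 arrives, both are applied in order. The list of missing numbers is what the originating site would use to decide what to retransmit.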
ALTERNATIVE TWO: APPLICATION-SPECIFIC SOLUTIONS USING QUEUING TECHNIQUES

This alternative requires the application
developer to embed special features in the
applications software to deal with the
business requirement. This approach has
been implemented in a number of large
organizations and provides a solution that
reduces the data loss exposure associated
with Alternative #1.
One technique is to queue an audit trail
record for delivery to the remote system as
part of any update transaction processed at
the local site. This can be accomplished
through standard queuing and
communications facilities provided by
DBMS systems such as IMS or IDMS. In
IMS, for example, it is possible for a
transaction program to place a message
containing the audit trail data in the message
queue for delivery to a remote IMS via an
ISC (Intersystem Communication) link.
There is I/O overhead associated with the
queuing activity. The audit trail records must contain all the necessary information to recreate the transaction if a recovery is
necessary. Special utility programs must be developed to process the audit trail records to effectively recreate the transaction at
recovery time.
Depending on the queuing technique employed, an audit trail record should be transmitted to the remote site within a few seconds of
the original transaction that created it. Exceptional situations, such as extended communications outages, must be addressed. If the
communications link is down for an extended period of time, the queues may overflow, causing loss of critical transaction data.
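The overflow risk can be made visible to the application rather than left as silent data loss. The following sketch uses a bounded in-memory queue as a stand-in for a DBMS message queue; it is an illustration under that assumption, and the class name and record layout are hypothetical.

```python
import queue


class AuditTrailQueue:
    """Bounded queue of audit trail records awaiting delivery to the remote site."""

    def __init__(self, capacity: int):
        self._q = queue.Queue(maxsize=capacity)
        self.rejected = 0  # records refused because the queue was full

    def enqueue(self, record: dict) -> bool:
        """Queue a record; return False instead of dropping it silently when full."""
        try:
            self._q.put_nowait(record)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, send) -> int:
        """Deliver queued records in order once the link is up; return the count sent."""
        sent = 0
        while True:
            try:
                record = self._q.get_nowait()
            except queue.Empty:
                return sent
            send(record)
            sent += 1
```

A transaction program that receives `False` from `enqueue` can then apply whatever policy the business requires, such as rejecting the update or writing the record to a local fallback file.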
ALTERNATIVE THREE: REAL-TIME LOG/JOURNAL/DATA BASE DUPLICATION USING CHANNEL EXTENDERS OR
LONG CHANNELS

This alternative is based on a hardware
solution employing channel extenders or the
new IBM long channels operating over a
fiber-optic communications link. The
mainframe host thinks that the disk and tape
datasets are locally attached, but they
actually may be some distance away. The
physical separation of the hardware helps to
assure data survival in many, but not all,
disaster situations.
The mainframe software must be capable of
simultaneously managing duplex data base
and/or log/journal files; not all DBMS
systems can do this. In the case of IMS, for
example, the user may define duplexed log
data sets, but not duplexed data base data
sets. In the case of CICS, no duplexing
software is provided. In the case of IDMS,
the user must code a special exit routine to
accomplish duplexing of journal or data
base data sets.
DASD devices, which permit random
access, are inherently more dependent on timing considerations than are tape devices, which are sequentially accessed. Due to
purely physical limitations, the remote site cannot be more than a few kilometers away (if using the IBM long channels) or, at most,
20 kilometers (using channel extenders). Even at these distances, severe performance degradation can result, because
channel protocols require complete synchronization between the mainframe and the DASD hardware: no buffering is possible
without loss of integrity. The time required for data and confirmation signals to travel at the speed of light over distances measured
in kilometers can cause significant degradation of the DBMS environment. Furthermore, if a communications failure were to occur,
there would be no easy method to recover. Either the remote site is turned off until a complete refresh of the remote files can take
place, or else the production activity at the primary site must be halted. Special procedures must be developed to ensure that the
DBMS regions can be successfully restarted at the remote site at recovery time.
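The distance penalty can be estimated with simple arithmetic. This sketch assumes signals travel through optical fiber at roughly 200,000 km per second (about two thirds of the speed of light in a vacuum) and that each I/O needs some number of synchronous exchanges; that count is a hypothetical parameter, not a figure from any channel specification.

```python
FIBER_KM_PER_SEC = 200_000.0  # assumption: approximate signal speed in optical fiber


def round_trip_microseconds(distance_km: float) -> float:
    """Propagation time for a signal to reach the remote device and come back."""
    return 2 * distance_km / FIBER_KM_PER_SEC * 1_000_000


def added_delay_per_io_microseconds(distance_km: float, exchanges_per_io: int) -> float:
    """Extra latency per I/O when every exchange must wait for a full round trip."""
    return round_trip_microseconds(distance_km) * exchanges_per_io
```

Under these assumptions, a 20-kilometer link adds about 200 microseconds per round trip, and an I/O that needs five synchronous exchanges would wait about 1 millisecond on propagation alone, before any device or protocol overhead.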
ALTERNATIVE FOUR: REAL-TIME BUFFERED TRANSMISSION OF LOG/JOURNAL DATA USING HOST-TO-HOST
COMMUNICATIONS FACILITIES

Using standardized exits during real-time
DBMS processing, this solution buffers the
data in a separate address space for
transmission to the remote site. Log/journal
data is available for transmission the instant
it is created. In normal operation, the data is
transmitted within a fraction of a second,
so that transactions are recoverable almost
up to the moment of a
failure. Log or journal data that has been
buffered but not yet transmitted would be
lost. This data, representing an extremely
small number of committed transactions, is
the only data that would be lost in the event of a
disaster.
This alternative's primary advantage is its
relatively low cost. By using a buffered
approach, the data can be transmitted
asynchronously without affecting the
performance of the DBMS regions that are
generating the data. The communications
link must be able to handle the maximum
sustained logging rate but does not need to
be as fast as a local DASD channel. The distance limitations inherent in the channel extender approach (Alternative #3) do not apply
here--with good use of pipelined transmission techniques, it is possible to route the log/journal data to a site thousands of miles
away without performance degradation. When possible, the data is buffered in storage rather than being committed to a DASD
queue or spool file. With the advent of IBM's Enterprise Systems Architecture, it is clear that performance-oriented applications
and systems do as little real I/O as possible. I/O operations are slow compared to the internal speed of a processor complex, and
OLTP systems are frequently I/O intensive. The creative use of address spaces, data spaces or hyperspaces makes it possible to
transmit the data without additional I/O activity at the production site. Further, MVS Cross Memory Services are used to pass the
data from the originating DBMS region to the buffering region.
Specialized recovery techniques have been developed to validate the continuity of data being transmitted. If the communications link
were to fail for an extended period of time, the buffers in storage would overflow. This is handled through the use of spill files that
hold the data until the link is once again available. In addition, a recovery method exists to handle cases of system or power failures;
data in the buffers at the time of the failure must be recovered and transmitted upon restart. This can be done by extracting the
relevant data from the log or journal files created by the DBMS regions. A sophisticated control mechanism ensures full
recoverability in all of these situations. Since the log or journal data is transmitted immediately, no batch jobs are required in normal
operation, limiting the degree of operator intervention. The process is significantly more continuous and online than the batched
approach described in Alternative #1.
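The buffer-then-spill behavior can be sketched as follows. This is an illustration of the idea only, with hypothetical names and a simple length-prefixed file format chosen for the example; it does not model the restart recovery from DBMS logs mentioned above.

```python
import os
import struct
from collections import deque


class SpillingBuffer:
    """Hold log records in memory; overflow to a spill file while the link is down."""

    def __init__(self, capacity: int, spill_path: str):
        self.capacity = capacity
        self.spill_path = spill_path
        self.memory = deque()
        self.spilled = 0  # records currently in the spill file

    def append(self, record: bytes) -> None:
        # Once spilling has started, later records must also spill to keep order.
        if self.spilled == 0 and len(self.memory) < self.capacity:
            self.memory.append(record)
            return
        with open(self.spill_path, "ab") as f:
            f.write(struct.pack(">I", len(record)) + record)
        self.spilled += 1

    def drain(self, send) -> int:
        """Send memory records first (oldest), then spilled records, in order."""
        sent = 0
        while self.memory:
            send(self.memory.popleft())
            sent += 1
        if self.spilled:
            with open(self.spill_path, "rb") as f:
                while True:
                    header = f.read(4)
                    if not header:
                        break
                    (length,) = struct.unpack(">I", header)
                    send(f.read(length))
                    sent += 1
            os.remove(self.spill_path)
            self.spilled = 0
        return sent
```

Sending the in-memory records before the spilled ones preserves the original log order, which the remote site needs for forward recovery. In normal operation the spill file is never touched, so no additional I/O occurs at the production site.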
ALTERNATIVE FIVE: SYNCHRONIZED SOLUTION USING TWO-PHASE COMMITS OR REDUNDANT
TRANSACTION PROCESSING

This alternative relies upon either
DBMS-provided or application-specific
facilities to apply an update transaction at
two (or more) sites in a fully synchronized
manner. The application will not consider a
transaction complete until both sites have
confirmed that the update has been made.
Using this technique, it is possible to
maintain mirror data bases at the two sites
that are always in agreement with each other.
This solution is practical for only the most
critical applications, since the cost of
implementation is extremely high. Response
times will suffer since the transaction must be routed and processed at multiple physical locations. Recovery and integrity problems
abound; for example, what would you do if there is a communications failure between the primary and backup sites? Do you halt all
processing until the link is available again, or can you allow processing to continue with only one data base available? If you
continue to process, how do you resynchronize with the remote site after the link returns?
This alternative depends upon sophisticated Distributed Data Base technology that is just now beginning to appear in commercial
offerings. As a practical matter, it may be a number of years before this approach is truly workable.
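The synchronized update can be illustrated with a toy two-phase commit. This is a minimal in-process sketch, assuming hypothetical `Site` objects and a simple "halt the update" policy when a site cannot be reached; it omits the logging and timeout handling a real distributed DBMS needs.

```python
class Site:
    """One copy of the data base, at the primary or the backup location."""

    def __init__(self, name: str, reachable: bool = True):
        self.name = name
        self.reachable = reachable
        self.data = {}
        self.pending = None

    def prepare(self, key, value) -> bool:
        """Phase one: vote yes only if this site can accept the update."""
        if not self.reachable:
            return False
        self.pending = (key, value)
        return True

    def commit(self) -> None:
        """Phase two: make the prepared update permanent."""
        key, value = self.pending
        self.data[key] = value
        self.pending = None

    def abort(self) -> None:
        self.pending = None


def two_phase_update(sites, key, value) -> bool:
    """Apply the update at every site, or at none of them."""
    prepared = []
    for site in sites:
        if site.prepare(key, value):
            prepared.append(site)
        else:
            for p in prepared:
                p.abort()
            return False
    for site in sites:
        site.commit()
    return True
```

With both sites reachable, the update lands in both copies. If the backup is unreachable, `two_phase_update` returns `False` and the primary is left unchanged, which corresponds to the "halt all processing" choice in the questions above; continuing with one data base would require a separate resynchronization procedure.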
Tom Flesher is Executive Vice-President and co-founder of E-Net Corporation.
This article was adapted from Vol. 3 No. 2, p. 32.
Disaster Recovery World © 1999 and Disaster Recovery Journal ©
1999 are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.