October 26, 2007

Remote Database Logging For Disaster Recovery

Written by  Daniel Kaberon

Hewitt Associates is an international firm of consultants and actuaries specializing in the design, financing, communication, and administration of employee benefits and compensation programs. One professional group administers defined contribution plans (such as employee savings and 401k), flexible benefits systems and pension administration systems. Hewitt Associates uses Voice Response Units to let the employees of our clients perform such activities as enrolling in flexible benefits, and interrogating and transferring account balances between available investment choices. These facilities are very popular because they make up-to-date information instantly available to employees.

Recently several of us reviewed Hewitt Associates’ requirements for database recovery in the event of a sudden disaster. Our approach to disaster recovery addresses the question: “How will we recover in the event that, without warning, our data center and its contents become completely inaccessible?”

Formerly it was understood and practiced that nightly backups would provide a sufficient base to accomplish an adequate recovery. In the event of a disaster, databases would be restored to their status of the prior midnight and we would advise users to reenter and reprocess the prior day’s on-line and batch updates.

Our new review indicated that our processing world had recently changed radically. Previously, users were Hewitt associates (our employees) processing client work through our well-defined network of 3270 terminals. The theoretical universe of users was constrained to the several thousand 3270-type terminals configured to our network, located mostly in our offices. Nearly all these terminals were in the charge of someone employed by the firm. Databases were available on a scheduled basis from about 06:00 to 22:00, six days each week.

Today services are provided through Voice Response Units (VRUs). VRUs emulate 3270 terminals from touch-tone telephones. The VRU replaces the screen with spoken phrases, and data entry is accomplished via the telephone’s touch-tone keypad.

The impact of this technological shift is immense. At year-end 1992, nearly one million of our clients’ employees had access to their data through our systems. By year-end 1993, this number is likely to approach two million. Because many of these people work through the night or reside on other continents, we accepted the undeniable requirement for 24-hour database access.

What’s wrong with traditional backup? Traditionally, databases are brought down to accomplish backups, with one copy going offsite. Let’s say a database is brought down each day at 22:00 for backup and brought back up at 04:00. Every hour of database availability supports updates, which leave the last night’s backup further out of sync. One would think that at worst, a daily backup misses 24 hours of updates. When we examined our backup practices, we found that a backup tape destined for offsite frequently did not leave the shop until well after noon. Thus, a disaster between 22:00 and noon (14 hours) would cause the loss not only of the disaster day’s updates but of the entire prior day’s as well!

Thus, if a disaster struck at 11:00 one morning, the most recent offsite copy was from two days ago! Because contacting a million or more possible users to ask them to reenter their updates would be impossible, a comprehensive enhancement had to be found and implemented.
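The exposure described above can be checked with a short sketch. It assumes a backup taken at 22:00 and a tape that leaves the building at 13:00 the following day ("well after noon"; the exact hour is an illustrative assumption). The function names and the sample date are hypothetical.

```python
from datetime import datetime, timedelta

def latest_offsite_backup(disaster, backup_hour=22, ship_hour=13):
    """Return the time of the newest backup already offsite at `disaster`.

    A backup taken at `backup_hour` on day D is assumed to leave the
    building at `ship_hour` on day D+1 (illustrative values).
    """
    # Start with the backup taken the evening before the disaster day.
    backup = disaster.replace(hour=backup_hour, minute=0, second=0,
                              microsecond=0) - timedelta(days=1)
    while True:
        shipped = (backup + timedelta(days=1)).replace(hour=ship_hour)
        if shipped <= disaster:
            return backup
        # That tape was still in the shop; fall back one more day.
        backup -= timedelta(days=1)

def exposure_hours(disaster, backup_hour=22, ship_hour=13):
    """Hours between the newest offsite backup and the disaster."""
    lost = disaster - latest_offsite_backup(disaster, backup_hour, ship_hour)
    return lost.total_seconds() / 3600

# Disaster at 11:00 on an arbitrary day: the newest offsite copy
# is from 22:00 two days earlier.
print(exposure_hours(datetime(1993, 3, 3, 11, 0)))  # 37.0
```

Under these assumptions a disaster at 11:00 reaches back 37 hours, while one at 14:00 (after the tape has left) reaches back 16 hours.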

Remote Tape Configuration

Hewitt Associates has two campuses in Lincolnshire, Illinois which are about 3.5 miles (6 kilometers) apart. To support data and voice communications between campuses we employ a T-3 communications span broken into twenty-eight T-1 links.

We selected the recently introduced Memorex Telex 5100 ATL. The 5100 has one robotic accessor, one IDRC-capable storage-director with 2-channel switch, four magnetic tape units (drives) and capacity for 315 cartridges. This unit is one square meter in size.

The ATL is attached through a pair of Dataswitch 9200 channel extenders. The 9200 is configured to extend two concurrent channels via a single T-1 link utilizing data compression within the channel extenders.

Memorex Telex’s Library Management Software (LMS) provided very easy controls to direct the needed tape work to the correct tape-drives. No JCL or application changes were made nor were any exits coded or modified.

The 5100 was attached to two MVS systems, one running as an LMS server and the second as an LMS client.

Besides the tape gear, a remote printer was installed and located adjacent to the remote 5100. Every tape dataset which is written to a tape within this ATL is logged to this printer. In the event of a disaster, this listing is the best source to determine which volumes contain the most current data for each database.

Datacom/DB

Datacom/DB provides support for logging completed transactions to an Active Recovery File and then to a dedicated tape drive allocated at initialization of the Datacom Multi-User Facility (MUF). Records are written to the active log recovery file whenever the regular disk log areas are more than x percent full. By specifying a very small value of x (e.g., 1), active log data is adequate to restore databases to very nearly the point in time of failure.
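The effect of the threshold x can be illustrated with a toy model. This is a sketch only and not Datacom's interface: the class, its fields, and the capacity figure are hypothetical, and the model assumes that every pending record is spilled as soon as the disk log area is more than x percent full.

```python
class LogArea:
    """Toy model of a disk log area that spills to a recovery file."""

    def __init__(self, capacity, spill_percent):
        self.capacity = capacity            # records the disk log area can hold
        self.spill_percent = spill_percent  # the "x" threshold from the text
        self.pending = []                   # records not yet spilled
        self.recovery_file = []             # records already on the remote tape

    def write(self, record):
        self.pending.append(record)
        # Spill once the area is more than x percent full.
        if 100 * len(self.pending) / self.capacity > self.spill_percent:
            self.recovery_file.extend(self.pending)
            self.pending.clear()

def worst_case_at_risk(capacity, spill_percent, writes):
    """Largest number of unspilled records observed during `writes` writes."""
    area = LogArea(capacity, spill_percent)
    worst = 0
    for n in range(writes):
        area.write(n)
        worst = max(worst, len(area.pending))
    return worst

print(worst_case_at_risk(1000, 1, 5000))   # 10
print(worst_case_at_risk(1000, 50, 5000))  # 500
```

In this model a threshold of 1 percent leaves at most 10 records exposed at the moment of failure, against 500 for a threshold of 50 percent.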

DB2

DB2 does not have a facility equivalent to Datacom’s active logging. DB2 refers to its disk logs as Active Logs, which it spills to an Archive Log that can reside on tape. In addition to the various logs, DB2 maintains a Bootstrap dataset (BSDS). The BSDS contains an inventory of all of this DB2’s active and archive log datasets and information describing the range of data contained within each one. To assure recoverability, DB2 writes two copies of the archive log tape datasets (ARCHLOG1 and ARCHLOG2). When DB2 writes the Archive Log to tape, it writes the current BSDS as file one and the archive log data as file two. If the installation enables the Catalog Archive parameter, the location of the archive logs will be resolved by the MVS catalog. This is important because it affords us the opportunity to re-stack the archive logs onto a smaller number of tapes, so long as the relocated datasets are recataloged.
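The role of the BSDS inventory in a forward recovery can be sketched as a simple range lookup. This is an illustration under assumptions, not DB2's internal format: log ranges are modelled as plain integer log positions, and the dataset names are hypothetical.

```python
from typing import NamedTuple

class LogEntry(NamedTuple):
    dataset: str    # hypothetical archive log dataset name
    start_pos: int  # first log position held in the dataset
    end_pos: int    # last log position held in the dataset

def logs_needed(inventory, from_pos):
    """Return datasets holding log data at or after `from_pos`, oldest first."""
    needed = [e for e in inventory if e.end_pos >= from_pos]
    return [e.dataset for e in sorted(needed, key=lambda e: e.start_pos)]

inventory = [
    LogEntry("ARCH.C", 2000, 2999),
    LogEntry("ARCH.A", 0, 999),
    LogEntry("ARCH.B", 1000, 1999),
]

# A backup taken at log position 1500 needs ARCH.B and everything after it.
print(logs_needed(inventory, 1500))  # ['ARCH.B', 'ARCH.C']
```

Because the current BSDS travels as file one on each archive tape, the newest tape written to the remote ATL carries the inventory needed for this kind of lookup.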

Using the LMS Allocate-Domain table, Hewitt Associates implemented the second copy of the Archive Log (ARCHLOG2) in the remote ATL. To assure reasonable currency of the archive log data, log switch commands are issued every hour. Thus, archives are written at least every hour, and more frequently when there is more update activity.
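The "at least every hour, more often under load" behavior can be expressed as a small calculation. This is a simplified sketch: it assumes an archive is written each time the active log fills and once more when the hourly switch flushes a partly filled log. The function name and the sizes are hypothetical.

```python
def archives_in_hour(update_mb, active_log_mb):
    """Archives written in one hour for a given volume of log data.

    Each full active log produces one archive; the hourly switch
    guarantees at least one archive even when little was logged.
    """
    full_or_partial = -(-update_mb // active_log_mb)  # ceiling division
    return max(1, full_or_partial)

print(archives_in_hour(0, 100))    # 1
print(archives_in_hour(250, 100))  # 3
```

A quiet hour still yields one archive, so the remote copy is never more than about an hour behind.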

DB2 Archive Log datasets are controlled by retention period. Because a good deal of our critical work revolves around monthly cycles, we needed the capability to recover databases for more than a month. Thus, our retention period of 45 days allows us to span from the beginning of one monthly cycle to the end of the next.

We accomplish stacking through a commercial product (On-line Software’s CARTS). After one day’s log datasets, spread over many tape volumes, are consolidated, the new datasets are recataloged on the new volumes and the original volumes are scratched. We took advantage of the stacking product to make an additional copy of the stacked tape, using our regular pool of ATL drives. This extra copy of ARCHLOG2 is sent offsite and retained for the full 45-day retention period. This allows us to scratch our remote log’s tapes after a few days but still have comprehensive recovery capability from offsite tapes. We maintain the primary copy of ARCHLOG2 in the remote ATL for a ten-day retention, but this could be further reduced to optimize the ATL’s capacity.
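The stacking and recataloging step can be sketched as an operation on a catalog that maps dataset names to volume serials. This is not the interface of the stacking product; the function, dataset names, and volume serials are hypothetical.

```python
def stack(catalog, day_datasets, stacked_volume):
    """Consolidate `day_datasets` onto `stacked_volume`.

    `catalog` maps dataset name -> volume serial and is not modified.
    Returns (new_catalog, scratched_volumes).
    """
    new_catalog = dict(catalog)
    old_volumes = set()
    for name in day_datasets:
        old_volumes.add(catalog[name])
        new_catalog[name] = stacked_volume  # recatalog on the new volume
    # A volume may be scratched only if no cataloged dataset still uses it.
    still_used = set(new_catalog.values())
    scratched = sorted(old_volumes - still_used)
    return new_catalog, scratched

catalog = {
    "LOG.H01": "V001",
    "LOG.H02": "V002",
    "LOG.H03": "V002",
    "LOG.NEXT": "V003",
}
new_catalog, scratched = stack(catalog, ["LOG.H01", "LOG.H02", "LOG.H03"], "S001")
print(scratched)  # ['V001', 'V002']
```

The check against `still_used` reflects the constraint in the text: an original volume is released only once every dataset on it has been relocated and recataloged.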

It is essential to route the information describing the newly stacked and recataloged datasets to the remote printer to enable a timely recovery.

Note that it is valuable to see where the tape population is absorbed, in addition to the bottom line:

2 (days) x 24 (hour intervals per day) x 1.5 (tapes per interval) = 72
10 (days retention) x 1.5 (tapes of stacked datasets) = 15
Number of tapes needed for static DB2 active log: 72 + 15 = 87
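The tape estimate can be parameterized so that the effect of a shorter retention or a heavier logging rate is easy to see. The function and parameter names below are hypothetical; the default values are the figures given above, and the 315-cartridge capacity is the one quoted for the 5100.

```python
def tapes_needed(unstacked_days=2, intervals_per_day=24,
                 tapes_per_interval=1.5, retention_days=10,
                 stacked_tapes_per_day=1.5):
    """Estimate the cartridges held in the remote ATL for DB2 logging."""
    unstacked = unstacked_days * intervals_per_day * tapes_per_interval
    stacked = retention_days * stacked_tapes_per_day
    return unstacked + stacked

ATL_CAPACITY = 315  # cartridge capacity of the 5100

total = tapes_needed()
print(total)                  # 87.0
print(total <= ATL_CAPACITY)  # True
```

With these figures most of the population (72 of 87 tapes) sits in the two days of unstacked hourly archives, not in the ten days of stacked tapes.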

Alternatives

We looked at several options for accomplishing offsite logging. One was our disaster recovery facility’s service, which accomplishes DB2 logging through special software associated with a dedicated link to the service. We found this service lacked the capability to support Datacom/DB. We further found that as volume increased, the cost of this service would increase. Acquiring the 5100, channel extenders and related equipment, and paying maintenance for several years, was substantially cheaper than the minimum service to handle a single DB2.

We also looked at using stand-alone drives with autoloaders extended to the West campus. We determined that, in addition to the obvious responsibilities of removing used tapes and loading new scratch tapes, occasionally one of these tapes might be needed to effect database forward recovery. Since we had been very successful automating our normal tape processing using three Memorex Telex 5400 ATLs as a Dynamic library, we saw no reason to introduce new manual tape needs. The costs of the 5100 ATL simplified this decision.

A third option uses an IBM program product to maintain the offsite DB2 log but this requires an offsite MVS complex! It didn’t take long to scuttle that option.

Disaster Recovery

In the event of a disaster, all tapes would be removed from the remote 5100 ATL, along with the listing showing the datasets on each tape. After the MVS system in the disaster recovery facility is operational (with the normal backup data from one or two days ago), all the retrieved datasets would be manually cataloged from the data in the remote-printer listing. Normal database forward recovery would then proceed from the tapes.
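The recataloging step amounts to turning the printed listing back into a dataset-to-volume map. The sketch below assumes a listing with one line per dataset in the form "dataset volume file-sequence"; that layout, the dataset names, and the function are hypothetical, since the listing format is not described here.

```python
def rebuild_catalog(listing_text):
    """Build {dataset: (volume, file_sequence)} from a printed listing.

    Lines are processed in print order, so a later line for the same
    dataset (for example after stacking) replaces the earlier one.
    """
    catalog = {}
    for line in listing_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip lines that do not have exactly three fields
        dataset, volume, file_seq = fields
        catalog[dataset] = (volume, int(file_seq))
    return catalog

listing = """
LOG.H01 V001 2
LOG.H02 V002 2
LOG.H01 S001 1
"""
print(rebuild_catalog(listing))
# {'LOG.H01': ('S001', 1), 'LOG.H02': ('V002', 2)}
```

Letting later lines win is what makes it important that the stacking results also reach the remote printer: otherwise the map would point at volumes that have since been scratched.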

If desired, actually moving the 5100 to the disaster recovery facility is relatively simple because the unit is very small and requires minimal environmental support. This would make robotics available for recovery. In practice, however, our plan is to resume remote logging to the 5100 as soon as feasible after a disaster.


Daniel Kaberon is a consultant and manager of computer performance and accounting for Hewitt Associates.

Last modified on October 11, 2012