
Mirror, Mirror on the Fault...
What's the fastest way to the vault?
By Doug Anderson
Pacific Gas and Electric Company (PG&E) understands the impact that being so close to a major fault
line can have.
With terrifying lessons learned from the historic earthquake of 1989, PG&E is implementing a
sophisticated disaster recovery plan based on mirroring data centers. By combining a sophisticated data
communications network and automated tape vaults, PG&E's approach to mirroring allows its two
primary data centers to back up each other's critical applications.
In the event of a disaster, either data center can begin restoring the other's critical applications and
begin processing on remarkably short notice.
It seems like only yesterday. October 17, 1989. The infamous '89 earthquake hit the San Francisco Bay
area claiming 62 lives and causing damage estimated at as much as $10 billion.
At 77 Beale Street, the Pacific Gas and Electric Company (PG&E) data center was in the process of
changing to the evening operations shift. Fortunately, many employees had left work early to catch the
World Series game between two Bay Area teams: the San Francisco Giants and the Oakland Athletics.
In the aftermath of the earthquake, PG&E would face a utility company's nightmare of repairing both
its natural gas and electrical distribution systems while in the midst of a disaster situation. PG&E
president George Maneatis would comment, "The earthquake was the worst crisis this company has
faced in my 36 years here."
In addition to restoring power to the Bay area, PG&E was facing another task: restoring their damaged
data center. These systems perform a number of key business functions that PG&E needs for financial
survival, but more importantly at the moment, these systems perform tasks that would be critical to
restoring power to the Bay area.
Without minimizing the magnitude of the disaster, some benefit came from the fact that PG&E got a
unique, first-hand view of their disaster recovery plan in action.
As a result, PG&E accelerated a plan to build a sophisticated data communications network that they
believe will give them the ability to recover data processing operations from a similar disaster in an hour
or less.
This article reviews PG&E's experience with the earthquake and describes their communications-based
plan for coping with similar disasters in the future.
New Meaning to "Lights-Out"
Even before the dust began to settle from the earthquake, employees converged on the data center to
begin the process of restoring service. San Francisco's power was out and PG&E's emergency power
for the data center complex also failed, leaving the data center completely dark.
Armed with flashlights, the employees took to the task of assessing damage and invoking the disaster
recovery plans.
The first step was to locate and remove magnetic tape back-ups from their tape vaults. These tapes
were to be sent via truck to PG&E's data center in Fairfield, California, some 50 to 60 miles from the
San Francisco area.
Under most circumstances, the trip would be routine. In fact, PG&E's back-up procedure involved two
such trips each day.
However, with the Bay Bridge closed due to a structural failure and the area freeways clogged with
panicked commuters, the trip this time would be dangerous at best.
As the crew at PG&E began the task of locating and preparing the tapes for shipping, the Fairfield data
center began preparing to run the critical applications that were on their way. Since the Fairfield data
center was primarily used for engineering applications, configuring it for business applications meant
starting from scratch by installing MVS, telecommunications software and applications software. The
entire process of reconfiguring Fairfield was estimated to take up to four days.
The Goal: Recovery In One Hour
As it turned out, power to the data center was restored within 12 hours after the quake hit. By the
following morning, the San Francisco data center was able to resume processing their critical
applications. But the experience served to underscore the need to find a faster way to resume critical
business operations at an alternate site.
PG&E set a short-term goal of establishing an ability to restore these critical applications within 12
hours of an outage. In the longer term, PG&E set a goal of restoring service within one hour.
To meet these goals, PG&E accelerated a data communications-based program of disaster recovery
which had been on the drawing board prior to the earthquake. The plan involved creating the ability
within each PG&E data center to act as a hot-site back-up of the other.
When a disaster removes one of the data centers from operation, the other would already have all
hardware, software and data necessary to immediately begin the process of recovering PG&E's critical
applications.
Communicating Volumes Safely
The first step PG&E took was to establish a way to get current back-up data to each site.
Every business day, each data center generates a tremendous amount of data that would need to be sent
in some usable format to the other data center.
Prior to the earthquake, PG&E had been keeping back-up data on magnetic tapes that were either stored
off-site or shipped via truck between the sites.
As the earthquake showed, this approach proved to be impractical and dangerous when a genuine
disaster struck. In addition, manually handling the hundreds of tapes required for recovery left too many
opportunities for lost or mislabeled data.
To solve this problem, PG&E planned to connect the two data centers using a series of high-volume
data communications links.
How Large A Pipe?
With the volume of data PG&E was dealing with (around 80 gigabytes each way each day) and the
requirement that each back-up be completed in a matter of hours rather than days, PG&E's decision to
use DS-3 links for transferring data between sites was fairly easy.
DS-3 links are wide-area communications services that operate at 44.736 megabits per second (Mbps).
Although DS-3 offers the equivalent throughput of around 28 T1 links (1.544 Mbps), economies of
scale and technological advances typically allow the common carriers to offer DS-3 services for the
cost of only between 7 and 10 T1 links. In fact, PG&E elected to go with a total of 4 DS-3 links to
allow for future growth and to provide for redundancy.
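
The sizing argument above can be checked with a rough calculation. The sketch below is illustrative only: it uses the article's figures for daily volume and line rates, and it assumes (beyond what the article states) that a gigabyte is 10^9 bytes and that each link runs at its full line rate with no protocol or framing overhead. Real transfers would take longer.

```python
# Figures from the article: ~80 gigabytes each way per day,
# DS-3 at 44.736 Mbps, T1 at 1.544 Mbps.
# Assumptions (not from the article): 1 gigabyte = 10**9 bytes, and links
# run at full line rate with no protocol or framing overhead.

DS3_MBPS = 44.736
T1_MBPS = 1.544
DAILY_GIGABYTES = 80


def transfer_hours(gigabytes: float, link_mbps: float, links: int = 1) -> float:
    """Hours needed to move `gigabytes` over `links` identical parallel links."""
    bits = gigabytes * 1e9 * 8
    seconds = bits / (link_mbps * 1e6 * links)
    return seconds / 3600


if __name__ == "__main__":
    cases = [("1 x T1", T1_MBPS, 1), ("1 x DS-3", DS3_MBPS, 1), ("4 x DS-3", DS3_MBPS, 4)]
    for label, mbps, count in cases:
        hours = transfer_hours(DAILY_GIGABYTES, mbps, count)
        print(f"{label}: {hours:.1f} hours")
```

Under these idealized assumptions a single T1 would need well over 100 hours for one day's data, a single DS-3 roughly four hours, and four DS-3 links roughly one hour, which is consistent with the "hours rather than days" requirement.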
Channel-Connections
Next, PG&E connected the DS-3 links to the devices they planned to use to recover from a disaster:
their Amdahl and IBM mainframes and the 4400 Automated Cartridge System (ACS) libraries from
Storage Technology Corporation (StorageTek).
These devices communicate over the bus/tag channel, a cabled interface that connects the input/output
ports of mainframes and peripherals. Able to transfer data at speeds ranging from 1.5
to 4.5 megabytes per second (MBps), the channel is the fastest way to get data into or out of a
mainframe.
However, the channel architecture on IBM 370 class mainframes limits the length of the channel to 400
feet, a distance that is not sufficient for disaster recovery (and many other) applications.
This limitation led to the development of channel extension devices that connect channels on
mainframes or peripherals over either wide areas (as is the case with PG&E) or local areas (within
buildings or campuses).
When PG&E began the process of evaluating channel extension products to connect their data centers,
one key issue was support for the hosts and devices in their disaster recovery plan.
Many channel extenders operate by mimicking the presence of devices that are actually located at
remote sites. Since each kind of device behaves differently, channel extension vendors typically
need to develop support on a device-by-device basis. In PG&E's situation, the key devices they
needed to support included their hosts and their StorageTek libraries.
Another important consideration was the capability of the channel extenders to transfer the vast amounts
of data involved in PG&E's disaster recovery plan.
PG&E found that channel extenders tend to specialize in certain types of applications.
Those that focus on lower-speed connections, like 256 kbps or T1, would be unable to support
PG&E's throughput requirements. This meant that PG&E needed the devices to support connections
to their DS-3 links, a relatively new communications medium that tends to be used only in the most
throughput-intensive applications.
However, the issue of throughput goes beyond support for high-speed links. In any wide-area
connection, communications devices introduce protocol and propagation delays that take a toll on the
link's throughput.
These delays increase as the distance between devices on the network increases, leading to lower
throughput as distance increases.
Since channel extenders deal with these delays in different ways, PG&E looked for a product that
minimized these delays and maintained throughput over the 50 - 60 miles between data centers.
In addition to throughput, PG&E required that the channel extension devices be designed to be resilient
in the event of a disaster. Should one of their DS-3 links be interrupted by a disaster, PG&E wanted the
channel extension devices to automatically reroute traffic over other links without operator intervention.
Although PG&E's plan called for each data center to have the ability to support critical business
functions completely independently of the other, alternate path routing would give them a measure of
resilience that could prove to be important in a disaster.
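
The rerouting requirement can be pictured with a small model. The sketch below is hypothetical and does not describe how any channel extension product works internally. The `Link` and `LinkGroup` names, the link labels and the round-robin policy are all assumptions made for illustration. It shows only the general idea: traffic is spread over the healthy links, and a failed link is skipped without anyone having to intervene.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    name: str        # hypothetical label, e.g. "DS3-A"
    up: bool = True  # health as reported by some monitoring mechanism (assumed)


class LinkGroup:
    """Round-robins traffic over healthy links and skips failed ones."""

    def __init__(self, links: List[Link]) -> None:
        self.links = links
        self._next = 0  # index of the link to try first on the next call

    def pick(self) -> Optional[Link]:
        """Return the next healthy link, or None if every link is down."""
        count = len(self.links)
        for offset in range(count):
            index = (self._next + offset) % count
            if self.links[index].up:
                self._next = (index + 1) % count
                return self.links[index]
        return None


if __name__ == "__main__":
    group = LinkGroup([Link(f"DS3-{letter}") for letter in "ABCD"])
    group.links[1].up = False  # simulate losing one of the four links
    print([group.pick().name for _ in range(6)])
```

With four links and one marked down, the remaining three carry the traffic in turn. If all four fail, `pick` returns `None`, which corresponds to the case where each data center must run on its own.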
After PG&E evaluated the field of channel extension vendors, they chose the CHANNELink line of
products from Computer Network Technology Corporation (CNT). Described by CNT as a
channel-attached network processor, CHANNELink lets users build complex, high-speed networks of
mainframes, peripherals and local-area networks over extended distances.
In addition to providing support for each of the devices PG&E required, CHANNELink supports a
variety of local area and wide area links, including DS-3.
To maintain the high throughput required by PG&E, CHANNELink uses an approach to data transfer
called pipelining, which minimizes the protocol and propagation delays involved in bulk data transfer
applications and allows PG&E to make efficient use of their DS-3 links.
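
The effect of keeping several blocks in flight can be illustrated with a generic windowed-transfer model. This is a sketch only, not a description of CHANNELink's actual mechanism, which the article does not detail. The block size, the signal speed of about 200,000 km per second and the neglect of all delays other than serialization and propagation are assumptions.

```python
# From the article: DS-3 line rate and the 50 - 60 mile separation.
# Assumptions (not from the article): ~200,000 km/s signal speed, 32 KiB
# blocks, and no delays other than serialization and propagation.

LINK_BPS = 44.736e6
DISTANCE_KM = 60 * 1.609344
PROPAGATION_KM_PER_S = 200_000.0
BLOCK_BYTES = 32 * 1024


def round_trip_seconds(distance_km: float) -> float:
    """Time for a signal to reach the far end and for a reply to return."""
    return 2 * distance_km / PROPAGATION_KM_PER_S


def stop_and_wait_bps(block_bytes: int, link_bps: float, rtt: float) -> float:
    """Send one block, then idle until its acknowledgement arrives."""
    send_time = block_bytes * 8 / link_bps
    return block_bytes * 8 / (send_time + rtt)


def pipelined_bps(block_bytes: int, link_bps: float, rtt: float, window: int) -> float:
    """Allow up to `window` unacknowledged blocks in flight at once."""
    send_time = block_bytes * 8 / link_bps
    # A cycle lasts until the window is sent or the first ack returns,
    # whichever takes longer.
    cycle = max(window * send_time, send_time + rtt)
    return window * block_bytes * 8 / cycle


if __name__ == "__main__":
    for km in (DISTANCE_KM, 1_000.0, 4_000.0):
        rtt = round_trip_seconds(km)
        slow = stop_and_wait_bps(BLOCK_BYTES, LINK_BPS, rtt) / 1e6
        fast = pipelined_bps(BLOCK_BYTES, LINK_BPS, rtt, window=8) / 1e6
        print(f"{km:7.0f} km: stop-and-wait {slow:5.1f} Mbps, pipelined {fast:5.1f} Mbps")
```

In this model the stop-and-wait rate falls as distance grows because the link sits idle during each round trip, while a large enough window keeps the link busy. Real protocol overheads would widen the gap further.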
Finally, CHANNELink's ability to automatically reroute traffic around failed links provides the disaster
resilience PG&E was looking for.
Configuring For Recovery
Once the data centers were connected, the next task PG&E addressed was to ensure that the right
hardware was present at each center to support the critical applications.
Although both data centers were equipped with large IBM and compatible mainframes, the nature of the
applications at each site differed greatly. San Francisco ran primarily batch applications with little or no
interactive traffic.
Fairfield, on the other hand, ran engineering applications that were heavily oriented toward interactive
TSO work. Configuring each center to run the other's applications meant attaching new devices to
each host.
The most important of these devices were their tape vaults. These vaults would be used at each site to
store the data being transmitted over the CHANNELink/DS-3 network from the sister site.
Key considerations here focused on combining the ability to store large amounts of data with very rapid
access. For this task, PG&E decided to use the 4400 Automated Cartridge System (ACS) library from
StorageTek.
The StorageTek 4400 ACS is often referred to as "Nearline" because it combines the speed of on-line
storage with the cost-effectiveness and ability to handle large volumes of data of off-line storage.
This combination makes the 4400 ACS an ideal candidate for disaster recovery applications. Each
library holds up to 6,000 200-megabyte tape cartridges (more with compression and compaction).
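
A quick calculation puts the library capacity in context. The sketch below uses only the article's figures (6,000 cartridges of 200 megabytes each, and about 80 gigabytes of back-up data per day in each direction). It assumes decimal units, no compression, and that every day's data is kept as a full copy. The article does not describe PG&E's actual retention policy.

```python
CARTRIDGES_PER_LIBRARY = 6_000  # from the article
MEGABYTES_PER_CARTRIDGE = 200   # from the article, uncompressed
DAILY_BACKUP_GIGABYTES = 80     # from the article, one direction


def library_capacity_gigabytes(
    cartridges: int = CARTRIDGES_PER_LIBRARY,
    megabytes_each: int = MEGABYTES_PER_CARTRIDGE,
) -> float:
    """Uncompressed capacity of one library, using 1 GB = 1,000 MB."""
    return cartridges * megabytes_each / 1000


def days_of_backups(libraries: int, daily_gigabytes: float = DAILY_BACKUP_GIGABYTES) -> float:
    """Days of full daily copies that fit (assumed retention model)."""
    return libraries * library_capacity_gigabytes() / daily_gigabytes


if __name__ == "__main__":
    print(f"One library: {library_capacity_gigabytes():.0f} GB uncompressed")
    print(f"Two libraries: {days_of_backups(2):.0f} days of daily back-ups")
```

On these assumptions one library holds 1,200 gigabytes, or about 15 days of one site's daily back-up volume.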
In addition to providing very rapid access, the 4400 automates the process of managing the tape
libraries, eliminating the problem of lost or mishandled tapes. The San Francisco data center already had
eight libraries, two of which could be used to back up the Fairfield data center. Since no libraries were
installed in Fairfield, PG&E planned to install libraries that would be used to back up both its Fairfield
and San Francisco data.
Redirecting Communications
Once the alternate data center begins operations, recovery requires that PG&E's SNA network have
access to the back-up data center for applications it expects to find at the damaged data center.
This is one reason PG&E embarked on a project they call their SNA Improvement Project (SNIP).
Roger Ramey, Project Manager over SNIP, said, "The goal of SNIP is to make reconfiguration of our
SNA network quick and automatic."
SNIP is a program of establishing a digital, private mesh network that connects 59 remote end-user sites
with the San Francisco and Fairfield data centers. Using remote bridges, a token ring network and T1
communications links, SNIP lets PG&E redirect network traffic from one data center to the other.
A key capability of the token ring network is what Ramey refers to as "source routing." "Our IBM 3174
communications controllers locate applications based on Token Ring Interface Coupler (TIC)
addresses that are only active at one data center but are present at both data centers. When the primary
data center goes down, the 3174s lose their connection and begin polling the network to find an
alternate path to the primary host. Meanwhile, the alternate data center restores the applications from
their local back-up and runs a CLIST that turns up the addresses the 3174s are looking for. As these
addresses get passed through the network, the 3174s connect to the alternate data center as if it were
the primary data center. At this point, the users can continue processing without further intervention."
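
The failover sequence Ramey describes can be modeled as a toy simulation. This sketch is an illustration, not PG&E's configuration or real 3174 behavior. The class names, the `poll` method and the address string "TIC-1" are hypothetical. It captures only the logic in the text: the same address is defined at both sites but active at one, and a controller that loses its host keeps polling until that address answers somewhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DataCenter:
    name: str
    online: bool = True
    active_addresses: Set[str] = field(default_factory=set)

    def answers(self, address: str) -> bool:
        """True if this site is up and the address is currently active here."""
        return self.online and address in self.active_addresses


@dataclass
class Controller:
    """Stands in for a 3174 that knows only the address it wants to reach."""

    target_address: str
    connected_to: Optional[str] = None

    def poll(self, centers: List[DataCenter]) -> None:
        """Connect to whichever site currently answers the target address."""
        self.connected_to = None
        for center in centers:
            if center.answers(self.target_address):
                self.connected_to = center.name
                return


if __name__ == "__main__":
    san_francisco = DataCenter("San Francisco", active_addresses={"TIC-1"})
    fairfield = DataCenter("Fairfield")  # address present but not yet active
    sites = [san_francisco, fairfield]
    controller = Controller("TIC-1")

    controller.poll(sites)
    print(controller.connected_to)  # normal operation

    san_francisco.online = False    # the primary site goes down
    controller.poll(sites)
    print(controller.connected_to)  # nothing answers yet

    fairfield.active_addresses.add("TIC-1")  # the step the article assigns to a CLIST
    controller.poll(sites)
    print(controller.connected_to)  # reconnected to the alternate site
```

The controller never needs to be told which site it is talking to, which is why users can continue without reconfiguring anything.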
Benefits For The Future
Another key part of a disaster recovery plan is regular testing, and PG&E's network lends itself to
running these tests. In fact, Senior Computer Operations Analyst Dana McKibbin said PG&E regularly
runs a "mini-test" of their disaster preparedness by bringing up some of the key applications using data
from the previous day.
In the future, PG&E plans to run a full-scale test of their ability to recover by completely switching all
processing to one data center. Said McKibbin, "Testing at PG&E is an ongoing process that
continually refines our ability to recover. In fact, one of the benefits of our disaster recovery plan is that
it makes testing relatively easy and inexpensive, and provides for minimal interruption of service."
In addition to disaster recovery, PG&E's network is designed to handle other applications as well. For
example, PG&E also uses their CHANNELink network for remote printing, another application that can
be rather data intensive. Said McKibbin, "Considering the way our network is designed and the capacity
we built in, we were able to add remote printing at a pretty small cost."
But the biggest benefit in McKibbin's view is the ability to recover from the next earthquake. Said
McKibbin, "The earthquake in 1989 highlighted PG&E's need to recover our critical business
applications quickly. But more than that, it highlighted the fact that people's lives can depend on it. If and
when the next earthquake hits, I'm confident in our ability to handle it."
Doug Anderson is a Communications Manager with Computer Network Technology Corporation.
This article adapted from Vol. 5 #1.
Disaster Recovery World© 1999, and Disaster Recovery Journal©
1999, are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.