Mirror, Mirror on the Fault... What Is the Fastest Way to the Vault?
- Published on October 26, 2007
In addition to restoring power to the Bay Area, PG&E faced another task: restoring their damaged data center. The data center’s systems perform a number of key business functions that PG&E needs for financial survival but, more importantly at that moment, they performed tasks that were critical to restoring power to the Bay Area.
Without minimizing the magnitude of the disaster, some benefit came from the fact that PG&E got a unique, first-hand view of their disaster recovery plan in action.
As a result, PG&E accelerated a plan to build a sophisticated data communications network that they believe will give them the ability to recover data processing operations from a similar disaster in an hour or less.
This article reviews PG&E’s experience with the earthquake and describes their communications-based plan for coping with similar disasters in the future.
New Meaning to
Even before the dust began to settle from the earthquake, employees converged on the data center to begin the process of restoring service. San Francisco’s power was out and PG&E’s emergency power for the data center complex also failed, leaving the data center completely dark.
Armed with flashlights, the employees took to the task of assessing damage and invoking the disaster recovery plans.
The first step was to locate and remove magnetic tape back-ups from their tape vaults. These tapes were to be sent via truck to PG&E’s data center in Fairfield, California, some 50 - 60 miles from the San Francisco area.
Under most circumstances, the trip would be routine. In fact, PG&E’s back-up procedure involved two such trips each day.
However, with the Bay Bridge closed due to a structural failure and the area freeways clogged with panicked commuters, the trip this time would be dangerous at best.
As the crew at PG&E began the task of locating and preparing the tapes for shipping, the Fairfield data center began the process of preparing to run the critical applications to come. Since the Fairfield data center was primarily used for engineering applications, configuring for business applications meant starting from scratch with installing MVS, telecommunications software and applications software. The entire process of reconfiguring Fairfield was estimated to take up to four days.
The Goal: Recovery In One Hour
As it turned out, power to the data center was restored within 12 hours after the quake hit. By the following morning, the San Francisco data center was able to resume processing their critical applications. But the experience served to underscore the need to find a faster way to resume critical business operations at an alternate site.
PG&E set a short term goal of establishing an ability to restore these critical applications within 12 hours of an outage. In the longer term, PG&E set a goal of restoring service within one hour.
To meet these goals, PG&E accelerated a data communications-based program of disaster recovery which had been on the drawing board prior to the earthquake. The plan involved creating the ability within each PG&E data center to act as a hot-site back-up of the other.
When a disaster removes one of the data centers from operation, the other would already have all hardware, software and data necessary to immediately begin the process of recovering PG&E’s critical applications.
The first step PG&E took was to establish a way to get current back-up data to each site.
Every business day, each data center generates a tremendous amount of data that would need to be sent in some usable format to the other data center.
Prior to the earthquake, PG&E had been keeping back-up data on magnetic tapes that were either stored off-site or shipped via truck between the sites.
As the earthquake showed, this approach proved to be impractical and dangerous when a genuine disaster struck. In addition, manually handling the hundreds of tapes required for recovery left too many opportunities for lost or mislabeled data.
To solve this problem, PG&E planned to connect the two data centers using a series of high-volume data communications links.
How Large A Pipe?
With the volume of data PG&E was dealing with (around 80 gigabytes each way each day) and the requirement that each back-up be completed in a matter of hours rather than days, PG&E’s decision to use DS-3 links for transferring data between sites was fairly easy.
DS-3 links are wide-area communications services that operate at 44.736 megabits per second (Mbps). Although DS-3 offers the equivalent throughput of around 28 T1 links (1.544 Mbps), economies of scale and technological advances typically allow the common carriers to offer DS-3 services for the cost of only between 7 and 10 T1 links. In fact, PG&E elected to go with a total of 4 DS-3 links to allow for future growth and to provide for redundancy.
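The sizing argument above can be checked with a little arithmetic. The sketch below is illustrative only: it takes the article's figures (about 80 gigabytes each way per day, 44.736 Mbps for DS-3, 1.544 Mbps for T1) and assumes decimal gigabytes and zero protocol overhead, so real transfers would take longer. The function and constant names are made up for this example.

```python
# Back-of-the-envelope transfer times for the daily back-up volume.
# Figures from the article: ~80 GB each way per day, DS-3 and T1 line rates.
# Assumptions: 1 GB = 10**9 bytes, no protocol or framing overhead.

DAILY_BYTES = 80 * 10**9   # daily volume in one direction
DS3_BPS = 44.736e6         # DS-3 line rate, bits per second
T1_BPS = 1.544e6           # T1 line rate, bits per second


def transfer_hours(num_bytes: float, link_bps: float, links: int = 1) -> float:
    """Hours needed to move num_bytes over `links` parallel links at line rate."""
    seconds = num_bytes * 8 / (link_bps * links)
    return seconds / 3600


if __name__ == "__main__":
    print(f"T1 line rates per DS-3: {DS3_BPS / T1_BPS:.1f}")
    print(f"One T1:   {transfer_hours(DAILY_BYTES, T1_BPS):.1f} hours")
    print(f"One DS-3: {transfer_hours(DAILY_BYTES, DS3_BPS):.1f} hours")
    print(f"Two DS-3: {transfer_hours(DAILY_BYTES, DS3_BPS, links=2):.1f} hours")
```

Under these assumptions a single T1 would need roughly 115 hours (nearly five days) to move one day's data, while a single DS-3 would need about four hours, which is consistent with the requirement that each back-up finish in hours rather than days.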
Next, PG&E connected the DS-3 links to the devices they planned to use to recover from a disaster: their Amdahl and IBM mainframes and the 4400 Automated Cartridge System (ACS) libraries from Storage Technology Corporation (StorageTek).
These devices communicate over the bus/tag channel, a cable interface that connects the input/output ports of mainframes and peripherals. Able to transfer data at speeds ranging from 1.5 to 4.5 megabytes per second (MBps), the channel is the fastest way to get data into or out of a mainframe.
However, the channel architecture on IBM 370-class mainframes limits the length of the channel to 400 feet, a distance that is not sufficient for disaster recovery (and many other) applications.
This limitation led to the development of “channel extension” devices that connect channels on mainframes or peripherals over either wide areas (as is the case with PG&E) or local areas (within buildings or campuses).
When PG&E began the process of evaluating channel extension products to connect their data centers, one key issue was support for the hosts and devices in their disaster recovery plan.
Many channel extenders operate by mimicking the presence of devices that are actually located in remote sites. Since each kind of device behaves in different ways, channel extension vendors typically need to develop support on a device-by-device basis. In PG&E’s situation, the key devices they needed to support included their hosts and their StorageTek libraries.
Another important consideration was the capability of the channel extenders to transfer the vast amounts of data involved in PG&E’s disaster recovery plan.
PG&E found that channel extenders tend to specialize in certain types of applications.
Those that focus on lower-speed connections, like 256 kbps or T1, would be unable to support PG&E’s throughput requirements. This meant that PG&E needed the devices to support connections to their DS-3 links, a relatively new communications medium that tends to be used only in the most throughput-intensive applications.
However, the issue of throughput goes beyond support for high-speed links. In any wide-area connection, communications devices introduce protocol and propagation delays that take a toll on the link’s throughput.
These delays increase as the distance between devices on the network increases, leading to lower throughput as distance increases.
Since channel extenders deal with these delays in different ways, PG&E looked for a product that minimized these delays and maintained throughput over the 50 - 60 miles between data centers.
In addition to throughput, PG&E required that the channel extension devices be designed to be resilient in the event of a disaster. Should one of their DS-3 links be interrupted by a disaster, PG&E wanted the channel extension devices to automatically reroute traffic over other links without operator intervention.
Although PG&E’s plan called for each data center to have the ability to support critical business functions completely independent of the other, alternate path routing would give them a measure of resilience that could prove to be important in a disaster.
After PG&E evaluated the field of channel extension vendors, they chose the CHANNELink line of products from Computer Network Technology Corporation (CNT). Described by CNT as a channel-attached network processor, CHANNELink lets users build complex, high-speed networks of mainframes, peripherals and local-area networks over extended distances.
In addition to providing support for each of the devices PG&E required, CHANNELink supports a variety of local area and wide area links, including DS-3.
To maintain the high throughput required by PG&E, CHANNELink uses an approach to data transfer called “pipelining,” which minimizes the protocol and propagation delays involved in bulk data transfer applications and allows PG&E to make efficient use of their DS-3 links.
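The article does not describe how CHANNELink implements pipelining, so the following is only a generic model of the idea, not CNT's algorithm. It compares a sender that waits for an acknowledgment after every block with one that keeps several blocks in flight. The block size, round-trip times, and window size are assumptions chosen for illustration.

```python
# Generic model of why pipelining helps on a long link (not CNT's algorithm).
# Assumptions: fixed block size, fixed round-trip delay, error-free link.

DS3_BPS = 44.736e6  # DS-3 line rate from the article, bits per second


def stop_and_wait_bps(link_bps: float, block_bytes: int, rtt_s: float) -> float:
    """Throughput when each block must be acknowledged before the next is sent."""
    bits = block_bytes * 8
    return bits / (bits / link_bps + rtt_s)


def pipelined_bps(link_bps: float, block_bytes: int, rtt_s: float, window: int) -> float:
    """Throughput when up to `window` blocks may be unacknowledged at once."""
    return min(link_bps, window * stop_and_wait_bps(link_bps, block_bytes, rtt_s))


if __name__ == "__main__":
    BLOCK = 4096  # assumed block size in bytes
    for rtt_ms in (1, 2, 5):  # assumed round-trip delays
        rtt = rtt_ms / 1000
        sw = stop_and_wait_bps(DS3_BPS, BLOCK, rtt) / 1e6
        pl = pipelined_bps(DS3_BPS, BLOCK, rtt, window=16) / 1e6
        print(f"RTT {rtt_ms} ms: stop-and-wait {sw:.1f} Mbps, pipelined {pl:.1f} Mbps")
```

In this model the stop-and-wait rate falls as the round-trip delay grows, which is the distance effect described earlier, while a large enough window keeps the link running at its full line rate.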
Finally, CHANNELink’s ability to automatically reroute traffic around failed links provides the disaster resilience PG&E was looking for.
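Alternate path routing can be pictured with a small sketch. This is a hypothetical illustration, not CHANNELink's actual logic: the `Link` and `Router` classes and the link names are invented for the example, and the only fact taken from the article is that there are four DS-3 links and that traffic should move to surviving links without operator intervention.

```python
# Hypothetical sketch of automatic rerouting across redundant links.
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    up: bool = True


class Router:
    """Round-robin over the configured links, skipping any that are down."""

    def __init__(self, links):
        self.links = list(links)
        self._next = 0  # index of the link to try first on the next send

    def pick(self) -> Link:
        n = len(self.links)
        for offset in range(n):
            link = self.links[(self._next + offset) % n]
            if link.up:
                self._next = (self._next + offset + 1) % n
                return link
        raise RuntimeError("no link available")


if __name__ == "__main__":
    # Four links, matching the number of DS-3s in the article; names are made up.
    links = [Link("ds3-a"), Link("ds3-b"), Link("ds3-c"), Link("ds3-d")]
    router = Router(links)
    links[1].up = False  # simulate one link lost in a disaster
    print([router.pick().name for _ in range(6)])
```

Because a failed link is simply skipped on each send, traffic keeps flowing over the remaining links with no change to the sender.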
Configuring For Recovery
Once the data centers were connected, the next task PG&E addressed was to ensure that the right hardware was present at each center to support the critical applications.
Although both data centers were equipped with large IBM and compatible mainframes, the nature of the applications at each site differed greatly. San Francisco ran primarily batch applications with little or no interactive traffic.
Fairfield, on the other hand, ran engineering applications heavily oriented to interactive TSO applications. Configuring each center to run the other’s applications meant attaching new devices to each host.
The most important of these devices were their tape vaults. These vaults would be used at each site to store the data being transmitted over the CHANNELink/DS-3 network from the sister site.
Key considerations here focused on combining the ability to store large amounts of data with very rapid access. For this task, PG&E decided to use the 4400 Automated Cartridge System (ACS) library from StorageTek.
The StorageTek 4400 ACS is often referred to as Nearline because it combines the speed of on-line storage with the cost-effectiveness and large-volume capacity of off-line storage.
This combination makes the 4400 ACS an ideal candidate for disaster recovery applications. Each library holds up to 6,000 200-megabyte tape cartridges (more with compression and compaction).
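To put the library capacity in perspective, the sketch below relates it to the daily data volume quoted earlier in the article. It assumes decimal units, no compression or compaction, and that a full day's 80 gigabytes is written to the library each day; the function names are invented for this example.

```python
# Rough capacity check for one 4400 ACS library, using the article's figures.
# Assumptions: decimal units, no compression, one full daily volume stored per day.

CARTRIDGES_PER_LIBRARY = 6_000
CARTRIDGE_MB = 200
DAILY_GB = 80  # daily volume in one direction, from the article


def library_capacity_gb(cartridges: int = CARTRIDGES_PER_LIBRARY,
                        cartridge_mb: int = CARTRIDGE_MB) -> float:
    """Uncompressed capacity of one library in gigabytes."""
    return cartridges * cartridge_mb / 1000


def days_of_backups(libraries: int, daily_gb: float = DAILY_GB) -> float:
    """How many days of daily volume fit in the given number of libraries."""
    return libraries * library_capacity_gb() / daily_gb


if __name__ == "__main__":
    print(f"Capacity per library: {library_capacity_gb():.0f} GB")
    print(f"Days of back-ups in one library:   {days_of_backups(1):.0f}")
    print(f"Days of back-ups in two libraries: {days_of_backups(2):.0f}")
```

Under these assumptions one library holds 1,200 gigabytes, or about 15 days of one site's daily volume.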
In addition to providing very rapid access, the 4400 automates the process of managing the tape libraries, eliminating the problem of lost or mishandled tapes. The San Francisco data center already had eight libraries, two of which could be used to back up the Fairfield data center. Since no libraries were installed in Fairfield, PG&E planned to install libraries that would be used to back up both its Fairfield and San Francisco data.
Once the alternate data center begins operations, recovery requires that PG&E’s SNA network have access to the back-up data center for applications it expects to find at the damaged data center.
This is one reason PG&E embarked on a project they call their SNA Improvement Project (SNIP). Roger Ramey, Project Manager over SNIP, said, “The goal of SNIP is to make reconfiguration of our SNA network quick and automatic.”
SNIP is a program of establishing a digital, private mesh network that connects 59 remote end-user sites with the San Francisco and Fairfield data centers. Using remote bridges, a token ring network and T1 communications links, SNIP lets PG&E redirect network traffic from one data center to the other.
A key capability of the token ring network is what Ramey refers to as “source routing.” “Our IBM 3174 communications controllers locate applications based on Token Ring Interface Coupler (TIC) addresses that are only active at one data center but are present at both data centers. When the primary data center goes down, the 3174s lose their connection and begin polling the network to find an alternate path to the primary host. Meanwhile, the alternate data center restores the applications from their local back-up and runs a CLIST that turns up the addresses the 3174s are looking for. As these addresses get passed through the network, the 3174s connect to the alternate data center as if it were the primary data center. At this point, the users can continue processing without further intervention.”
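The failover sequence Ramey describes can be walked through in a toy simulation. This is a simplified sketch under stated assumptions, not PG&E's configuration: the `DataCenter` and `Controller` classes, the address value, and the polling logic are all hypothetical stand-ins for the 3174 controllers, the TIC addresses, and the CLIST mentioned in the quote.

```python
# Toy simulation of the address-based failover described above.
# All class names and the address value are hypothetical.


class DataCenter:
    def __init__(self, name, tic_addresses):
        self.name = name
        self.tic_addresses = set(tic_addresses)  # configured at both sites
        self.active = set()                      # addresses currently answering

    def activate_all(self):
        """Stands in for the CLIST that turns up the addresses."""
        self.active = set(self.tic_addresses)

    def fail(self):
        """Simulate the site going down: no address answers any more."""
        self.active.clear()


class Controller:
    """Stands in for a 3174 that looks for one address on the network."""

    def __init__(self, name, target_address):
        self.name = name
        self.target_address = target_address

    def locate(self, data_centers):
        """Poll every site and return the name of the one that answers, if any."""
        for dc in data_centers:
            if self.target_address in dc.active:
                return dc.name
        return None


if __name__ == "__main__":
    addr = "4000-0000-0001"  # made-up address
    primary = DataCenter("San Francisco", [addr])
    alternate = DataCenter("Fairfield", [addr])
    sites = [primary, alternate]
    ctrl = Controller("3174-01", addr)

    primary.activate_all()
    print(ctrl.locate(sites))   # primary answers

    primary.fail()
    print(ctrl.locate(sites))   # nobody answers while the controller keeps polling

    alternate.activate_all()
    print(ctrl.locate(sites))   # alternate now answers for the same address
```

The point of the design is that the controller never changes what it is looking for. Only the site answering for the address changes, so end users need no reconfiguration.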
Another key part of a disaster recovery plan is regular testing, and PG&E’s network lends itself to running these tests. In fact, Senior Computer Operations Analyst Dana McKibbin said PG&E regularly runs a mini-test of their disaster preparedness by bringing up some of the key applications using data from the previous day.
In the future, PG&E plans to run a full-scale test of their ability to recover by completely switching all processing to one data center. Said McKibbin, “Testing at PG&E is an ongoing process that continually refines our ability to recover. In fact, one of the benefits of our disaster recovery plan is that it makes testing relatively easy and inexpensive, and provides for minimal interruption of service.”
In addition to disaster recovery, PG&E’s network is designed to handle other applications as well. For example, PG&E also uses their CHANNELink network for remote printing, another application that can be rather data intensive. Said McKibbin, “Considering the way our network is designed and the capacity we built in, we were able to add remote printing at a pretty small cost.”
But the biggest benefit in McKibbin’s view is the ability to recover from the next earthquake. Said McKibbin, “The earthquake in 1989 highlighted PG&E’s need to recover our critical business applications quickly. But more than that, it highlighted the fact people’s lives can depend on it. If and when the next earthquake hits, I’m confident in our ability to handle it.”
Doug Anderson is a Communications Manager with Computer Network Technology Corporation.
This article was adapted from Vol. 5 #1.