
Data Protection Techniques for Midrange Systems
By E. Robert Kleckner
When asked to define a disaster to the data center, most MIS professionals list fire, flood or
earthquake. But something as relatively harmless as a disk failure can cause just as much damage as the
most ferocious tornado.
The key is to make certain that disk failures don't bring your business to a standstill or cost
you an inordinate amount of time and money to recover lost data, if it is recoverable at all. Today, the IBM
midrange user has a number of data storage techniques for protecting against data loss due to disk
failures. To determine which method is most effective for their businesses, companies must
weigh the value of the data they are protecting, the performance they need and the price they are
willing to pay to gain data protection and performance.
There are several traditional methods for data storage protection, as well as a newly emerging method
known as RAID (Redundant Array of Independent Disks). Each technique has its own benefits and
drawbacks, including costs, as well as varying levels of data protection capability.
Traditional Data Storage Protection Techniques
Journaling is a system function that provides a good audit trail of activity on physical files. It captures
changes on the system without allowing them to be tampered with. Journaling allows users to save files
faster by saving only changed records and provides the ability to recover individual files to the point of
failure.
The advantages of journaling are numerous. It is easy to implement, it provides both forward and
backward recovery, and it allows complete recovery up to the last backup. While journaling is one of
the least expensive forms of data protection, its disadvantages are the CPU cycles it consumes (3-5
percent) and the significant disk space it requires.
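Forward and backward recovery can be illustrated with a small sketch. This is not the AS/400 journal format: the `JournalEntry` record and the `roll_forward` and `roll_back` helpers are hypothetical names, and a physical file is modelled simply as a dictionary keyed by record number.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class JournalEntry:
    """One journaled change: the record's image before and after the update."""
    key: int
    before: Optional[str]  # None means the record did not exist
    after: Optional[str]   # None means the record was deleted


def apply_image(records: Dict[int, str], key: int, image: Optional[str]) -> None:
    """Set a record to the given image, or remove it if the image is None."""
    if image is None:
        records.pop(key, None)
    else:
        records[key] = image


def roll_forward(backup: Dict[int, str], journal: List[JournalEntry]) -> Dict[int, str]:
    """Forward recovery: replay after-images, oldest first, onto a restored backup."""
    records = dict(backup)
    for entry in journal:
        apply_image(records, entry.key, entry.after)
    return records


def roll_back(current: Dict[int, str], journal: List[JournalEntry]) -> Dict[int, str]:
    """Backward recovery: undo changes by applying before-images, newest first."""
    records = dict(current)
    for entry in reversed(journal):
        apply_image(records, entry.key, entry.before)
    return records


backup = {1: "ACME,100", 2: "BOLT,250"}
journal = [
    JournalEntry(key=1, before="ACME,100", after="ACME,175"),  # update
    JournalEntry(key=3, before=None, after="CORE,40"),         # insert
    JournalEntry(key=2, before="BOLT,250", after=None),        # delete
]

recovered = roll_forward(backup, journal)
print(recovered)
print(roll_back(recovered, journal) == backup)
```

Because only the changed records are journaled, the journal stays far smaller than a full copy of the file, which is the property that makes saves faster.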
Checksum protection is a system function that can help avoid a system reload even if a disk drive must
be replaced. The system is able to reconstruct the data on any failing drive in the checksum set after a
single drive failure.
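The article does not say how the system computes its checksum data. The sketch below assumes a simple bytewise XOR across the drives in the checksum set, the same idea that parity-based RAID levels use; the block contents and the `xor_blocks` helper are illustrative only.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical checksum set: three data blocks plus one redundancy block.
data = [b"INVOICE1", b"PAYROLL2", b"LEDGER03"]
checksum = xor_blocks(data)

# Simulate losing the second drive, then rebuild it from the survivors.
survivors = [data[0], data[2], checksum]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

The same arithmetic shows why only a single failure is survivable: with two blocks missing, the XOR of the survivors yields the XOR of the two lost blocks, not either block on its own.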
Checksum protection gives users the obvious advantage of being able to recover from a disk drive
failure without reloading the system, which can take up to a full week on larger systems.
The principal advantage of checksum is that the system becomes available significantly faster after a
disk drive failure, because no system reload is necessary. Its disadvantages are that it requires
significant CPU cycles (10-15 percent), significant main storage and additional disk space.
Usually, a larger and faster CPU and at least one additional disk unit are required.
Although these additional components make checksum more costly than journaling, it provides a
higher level of protection.
AS/400 Mirroring is a technique that provides automatic duplication of data on separate devices. When
in place, mirroring ensures that, should a disk fail for whatever reason, the system will continue to
operate and the data will remain completely intact.
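As a rough illustration of the idea, and not of how the AS/400 implements it, the sketch below models a mirrored pair as two in-memory dictionaries. The `MirroredPair` class and its methods are hypothetical names.

```python
from typing import Dict, List, Optional


class MirroredPair:
    """Two hypothetical disk units that hold identical copies of every block."""

    def __init__(self) -> None:
        self.units: List[Optional[Dict[int, bytes]]] = [{}, {}]

    def write(self, block: int, data: bytes) -> None:
        """Write the block to every unit that is still working."""
        for unit in self.units:
            if unit is not None:
                unit[block] = data

    def fail(self, index: int) -> None:
        """Simulate the loss of one unit."""
        self.units[index] = None

    def read(self, block: int) -> bytes:
        """Read from the first surviving unit."""
        for unit in self.units:
            if unit is not None:
                return unit[block]
        raise IOError("both units of the mirrored pair have failed")


pair = MirroredPair()
pair.write(7, b"order 1001")
pair.fail(0)                    # first unit fails
pair.write(8, b"order 1002")    # the system keeps accepting writes
print(pair.read(7), pair.read(8))
```

Every write lands on both units, which is why the data survives the loss of either one and also why the approach needs double the storage hardware.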
This approach allows users to keep the system available during and after a DASD (direct access storage device) failure, and can
reduce or eliminate downtime for repair or recovery after a DASD or storage hardware failure. AS/400
mirroring allows the user to take advantage of duplicate hardware components in the path between the
processor and the data.
Mirroring does not, however, eliminate the need for backup, does not guarantee 24-hour-a-day
operation, does not make journaling obsolete and does not keep the system up after a DASD failure in
an unmirrored ASP (auxiliary storage pool). Mirroring uses twice as much DASD and other storage
hardware. Because of this duplication, mirroring is more costly than the previously
mentioned storage techniques. However, it also provides more protection.
Hot-spares are one of the newer recovery techniques available to IBM midrange users today. A
hot-spare is an additional HDA (Head Disk Assembly), that is, an extra disk drive, that gives AS/400
users added protection from data loss.
There are two different ways a hot-spare can be used. One is predictive analysis, where the DASD
device predicts when a drive will fail by tracking soft errors (such as those caused by a flawed disk surface). However,
the system cannot always predict failure: a drive can crash without any preceding soft errors,
possibly resulting in data loss.
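The article does not describe how a DASD device tracks soft errors. One simple possibility, sketched here with made-up numbers, is to count errors over a sliding window of recent intervals and raise an alert when the total crosses a threshold. The `SoftErrorMonitor` class, the window size and the threshold are all assumptions.

```python
from collections import deque
from typing import Deque


class SoftErrorMonitor:
    """Flags a drive as likely to fail when recent soft errors reach a threshold."""

    def __init__(self, window: int, threshold: int) -> None:
        self.recent: Deque[int] = deque(maxlen=window)  # errors per interval
        self.threshold = threshold

    def record_interval(self, soft_errors: int) -> bool:
        """Record one interval's count; return True if failure is predicted."""
        self.recent.append(soft_errors)
        return sum(self.recent) >= self.threshold


monitor = SoftErrorMonitor(window=3, threshold=5)
readings = [0, 1, 0, 2, 3, 4]   # hypothetical soft-error counts per interval
alerts = [monitor.record_interval(count) for count in readings]
print(alerts)
```

A drive that dies with a run of zero readings never triggers the alert, which is exactly the limitation of predictive analysis noted above.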
The alternative is to implement the hot-spare in conjunction with some method of data
protection, such as RAID, so that if an HDA problem becomes evident, the hot-spare is automatically
invoked. The data from the failed HDA is then reconstructed on the hot-spare while the system
keeps running, all without data loss.
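A minimal sketch of this second approach follows, assuming the data protection method is XOR parity and reducing each unit to a single block for brevity. The `ProtectedArray` class is hypothetical and does not represent any vendor's controller logic.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


class ProtectedArray:
    """Hypothetical array: data units, one parity unit and an idle hot-spare."""

    def __init__(self, data: List[bytes]) -> None:
        self.units: List[Optional[bytes]] = list(data) + [xor_blocks(data)]
        self.spare_available = True

    def fail_unit(self, index: int) -> None:
        """Simulate an HDA failure and invoke the hot-spare automatically."""
        self.units[index] = None
        if self.spare_available:
            survivors = [u for u in self.units if u is not None]
            self.units[index] = xor_blocks(survivors)  # rebuild onto the spare
            self.spare_available = False


array = ProtectedArray([b"AAAA", b"BBBB", b"CCCC"])
array.fail_unit(1)
print(array.units[1], array.spare_available)
```

Once the spare has been consumed, the array is protected against one further failure only by its parity, so the failed HDA still needs to be replaced to restore the spare.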
RAID Serves as an Alternate Approach
RAID is an alternate approach to preventing data loss. It is a set of definitions that ensure that, even if
a DASD unit fails, the system will continue to run and provide data to the user. While RAID is just now
becoming available for the AS/400, it represents the future for data storage and protection.
Currently, there are five RAID definitions developed by researchers at the University of California at
Berkeley. These methods use redundancy to reconstruct data should a data unit fail. A sixth definition,
RAID 0, operates without data protection.
RAID 1 is similar to AS/400 mirroring. The DASD subsystem maintains two copies of data. However,
the host software and hardware are not involved in implementing or running it. Even though there isn't
much host overhead with AS/400 mirroring, DASD subsystem-implemented mirroring eliminates the
overhead completely.
RAID 2 and 3 are suitable only for systems that make few requests for huge amounts of data, such as engineering
or scientific applications.
RAID 4 and 5 can be suitable for on-line processing of the kind the AS/400 is designed to do. Currently, RAID
4 has a severe performance penalty because it uses a single parity drive that every write must update, but RAID 5 overcomes this by
spreading the parity data over all drives.
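The difference in parity placement can be made concrete by counting which drive holds the parity for each stripe. The rotation used for RAID 5 below is one possible layout chosen for illustration (real controllers vary), and the drive and stripe counts are arbitrary.

```python
from collections import Counter


def parity_drive_raid4(stripe: int, drives: int) -> int:
    """RAID 4: parity always lives on the last drive."""
    return drives - 1


def parity_drive_raid5(stripe: int, drives: int) -> int:
    """RAID 5: parity rotates across drives (one possible rotation)."""
    return (drives - 1 - stripe) % drives


DRIVES = 5
STRIPES = 10  # assume one write per stripe, each updating that stripe's parity

raid4_load = Counter(parity_drive_raid4(s, DRIVES) for s in range(STRIPES))
raid5_load = Counter(parity_drive_raid5(s, DRIVES) for s in range(STRIPES))

print(dict(raid4_load))                  # parity updates per drive, RAID 4
print(dict(sorted(raid5_load.items())))  # parity updates per drive, RAID 5
```

With RAID 4 all ten parity updates queue on one drive, while the rotation spreads them evenly at two per drive.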
The first AS/400 data storage management system based on RAID 5 and hot-spare technology is now
available from XL/Datacomp. Known as Alpine, this data storage device allows reconstruction of data
even though a disk failure has occurred. This not only protects users against loss of data due to disk
failure, but because the system remains operational, users don't lose valuable system time during
recovery.
Comparing Data Storage Techniques
RAID 5 is much like checksum. In RAID 5, however, the DASD subsystem (the controller) handles all
logic. This significantly reduces the burden placed on the CPU, which improves performance.
All the CPU has to do is write the record and the DASD subsystem handles the rest. If users experience
performance problems while using checksum, this could be a better solution than a CPU upgrade.
The current implementation of checksum eliminates the need to reload data, but if a drive fails, the system
fails too.
In RAID 5, however, the DASD subsystem reconstructs data while the data is still available to the
AS/400. In an extension to RAID 5, a hot-spare unit can be used to replace the failed unit. This allows
the system to stay up and available to users, even if the second unit of a mirrored pair should fail.
Each RAID level has its benefits and drawbacks. Because of the various capabilities, each definition
addresses a different operating environment; one is not inherently better than another.
However, the key to RAID is in the way it adds value. For example, fault tolerance or hot-spare can be
combined with RAID to provide not only data protection, but true continuous availability of data to end
users.
Determining the Best Solution
When considering which technique to employ (journaling, checksum, mirroring, hot-spares or RAID),
users must weigh the benefits and drawbacks of each.
In evaluating the costs, companies need to thoroughly investigate the cost of data protection versus the
cost of lost personnel productivity and lost or damaged business functions.
Imagine not being able to recover valuable data, not being able to bill customers or not being able to
ship or receive goods. Some protection techniques may seem costly, but compared with the costs
associated with system downtime, they could easily pay for themselves at the first disk failure.
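The comparison comes down to simple arithmetic. The figures below are placeholders invented for illustration, not data from any real installation; substitute your own estimates for the cost of the protection, the hourly cost of downtime and the length of an outage.

```python
def break_even_hours(protection_cost: float, downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime at which the protection has paid for itself."""
    return protection_cost / downtime_cost_per_hour


# Placeholder figures for illustration only; substitute your own estimates.
PROTECTION_COST = 60_000.0         # hypothetical price of the added hardware
DOWNTIME_COST_PER_HOUR = 2_500.0   # hypothetical lost productivity and business
RELOAD_HOURS = 40.0                # hypothetical length of a system reload

hours = break_even_hours(PROTECTION_COST, DOWNTIME_COST_PER_HOUR)
print(hours, hours <= RELOAD_HOURS)
```

If the break-even point is shorter than a single expected outage, as it is with these placeholder numbers, the protection pays for itself at the first disk failure.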
Managers who identify the level of protection and performance they need and research the various
alternative storage techniques available will be able to provide their companies with optimal
performance and cost-effective data storage, retrieval and protection.
E. Robert Kleckner became XL/Datacomp Vice President, Marketing Support in 1990, responsible for
technical information on XL/Datacomp products. Prior to that, he was Vice President, Technical
Support. Previously, he was an IBM Senior Systems Engineer and Regional Designated Specialist for
the System/38.
This article is adapted from Vol. 5 #4.
Disaster Recovery World © 1999, and Disaster Recovery Journal ©
1999, are copyrighted by Systems Support, Inc. All rights reserved. Reproduction
in whole or part is prohibited without the express written permission from
Systems Support, Inc.