When asked to define a disaster in the data center, most MIS professionals list fire, flood or earthquake. But something as seemingly harmless as a disk failure can cause just as much damage as the most ferocious tornado.
The key is to make certain that a disk failure does not bring your business to a standstill or cost you an inordinate amount of time and money to recover lost data, if the data is recoverable at all. Today, the IBM midrange user has a number of data storage techniques for protecting against data loss due to disk failures. To decide which method is most effective for their businesses, companies must weigh the value of the data they are protecting, the performance they need and the price they are willing to pay for data protection and performance.
There are several traditional methods for data storage protection, as well as a newly emerging method known as RAID (Redundant Array of Independent Disks). Each technique has its own benefits and drawbacks, including costs, as well as varying levels of data protection capability.
Traditional Data Storage Protection Techniques
Journaling is a system function that provides a good audit trail of activity on physical files. It captures changes on the system without allowing them to be tampered with. Journaling allows users to save files faster by saving only changed records and provides the ability to recover individual files to the point of failure.
The advantages of journaling are numerous. It is easy to implement, it provides both forward and backward recovery and it allows complete recovery up to the last backup. While journaling is one of the least expensive forms of data protection, its disadvantages include the CPU cycles it consumes (3-5 percent) and the significant disk space it requires.
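The idea of forward and backward recovery can be illustrated with a minimal sketch. It assumes a journal that stores a before-image and an after-image for every changed record; the class and function names are hypothetical and do not correspond to any AS/400 interface.

```python
# Hypothetical in-memory model of a journaled file; not an AS/400 API.
class JournaledFile:
    def __init__(self, records):
        self.records = dict(records)  # key -> current value
        self.journal = []             # ordered (key, before, after) entries

    def update(self, key, value):
        # Log the before- and after-image, then apply the change.
        # A before-image of None means the record did not exist yet.
        self.journal.append((key, self.records.get(key), value))
        self.records[key] = value


def forward_recover(backup, journal):
    """Replay after-images onto a backup copy to reach the point of failure."""
    restored = dict(backup)
    for key, _before, after in journal:
        restored[key] = after
    return restored


def backward_recover(records, journal, keep):
    """Undo every change made after the first `keep` journal entries."""
    restored = dict(records)
    for key, before, _after in reversed(journal[keep:]):
        if before is None:
            restored.pop(key, None)  # the record was created by this change
        else:
            restored[key] = before
    return restored
```

Because only changed records are logged, the journal stays small relative to the file, which is the reason saving changed records is faster than saving whole files.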
Checksum protection is a system function that can help avoid a system reload even if a disk drive must be replaced. The system is able to reconstruct the data on any failing drive in the checksum set after a single drive failure.
The obvious advantage of checksum protection is that users can recover from a disk drive failure without reloading the system, a process that can take up to a full week on larger systems.
Because no reload is necessary, the system is available again significantly sooner after a disk drive failure. The disadvantages are that checksum requires significant CPU cycles (10-15 percent), significant main storage and additional disk space. Usually, a larger and faster CPU and at least one additional disk unit are required. Although these additional components make checksum more costly than journaling, it provides a higher level of protection.
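How the data on a failing drive can be rebuilt from the rest of the checksum set is easiest to see with a small example. The sketch below assumes the redundant data is a simple byte-wise XOR of the blocks in the set; the article does not specify the actual checksum algorithm, and the function names are illustrative only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (assumed checksum scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks):
    """Rebuild the one missing block from all survivors, checksum included."""
    return xor_blocks(surviving_blocks)


# Three data units plus one checksum unit.
data = [b"ACCT", b"BILL", b"SHIP"]
checksum = xor_blocks(data)

# Unit 1 fails; its contents are recomputed from the others.
rebuilt = reconstruct([data[0], data[2], checksum])
```

The same arithmetic shows why only a single drive failure per set can be tolerated: with two blocks missing, the XOR of the survivors no longer isolates either one.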
AS/400 Mirroring is a technique that provides automatic duplication of data on separate devices. When in place, mirroring ensures that, should a disk fail for whatever reason, the system will continue to operate and the data will remain completely intact.
This approach allows users to keep the system available during and after a DASD failure, and can reduce or eliminate downtime for repair or recovery after a DASD or storage hardware failure. AS/400 mirroring allows the user to take advantage of duplicate hardware components in the path between the processor and the data.
Mirroring does not, however, eliminate the need for backup, does not guarantee 24 hours/day operation, does not make journaling obsolete and does not keep the system up after a DASD failure in an unmirrored ASP (auxiliary storage pool). Mirroring uses twice as much DASD and other storage hardware. Because of the duplication of hardware, mirroring is more costly than the previously mentioned storage techniques. However, it also provides more protection.
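The behavior of a mirrored pair can be sketched in a few lines. This is a simplified model under stated assumptions, with a hypothetical class name: every write goes to both units, and a read is served by whichever unit is still working.

```python
class MirroredPair:
    """Toy model of two disk units holding identical data (hypothetical)."""

    def __init__(self):
        self.units = [{}, {}]          # address -> data, one dict per unit
        self.failed = [False, False]

    def write(self, address, data):
        # Duplicate every write on each unit that is still operating.
        for i, unit in enumerate(self.units):
            if not self.failed[i]:
                unit[address] = data

    def read(self, address):
        # Serve the read from the first surviving unit.
        for i, unit in enumerate(self.units):
            if not self.failed[i]:
                return unit[address]
        raise IOError("both units of the mirrored pair have failed")

    def fail(self, index):
        self.failed[index] = True
        self.units[index] = {}         # contents of the failed unit are gone
```

The model also makes the limits visible: a second failure in the same pair still loses the data, which is why mirroring does not replace backup.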
Hot-spares are among the newer recovery techniques available to IBM midrange users today. A hot-spare is an additional HDA (Head Disk Assembly), a standby disk drive that gives AS/400 users added protection from data loss.
A hot-spare can be used in two ways. One is predictive analysis, in which the DASD device predicts when a drive will fail by tracking soft errors (such as those caused by a flawed disk surface). In some situations, however, failure cannot be predicted: a drive can fail without first producing soft errors, possibly resulting in data loss.
The alternative is to implement the hot-spare in conjunction with some method of data protection, such as RAID, so that the hot-spare is invoked automatically when an HDA problem becomes evident. The data from the failed HDA is then reconstructed on the hot-spare, and the system keeps running during the failure, all without data loss.
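Predictive analysis can be pictured as a counter with a trip point. The following sketch is only an illustration: the threshold value, the `Drive` class and the `record_soft_error` function are assumptions made for the example, not a description of any vendor's firmware.

```python
SOFT_ERROR_THRESHOLD = 3  # hypothetical trip point; real devices differ


class Drive:
    def __init__(self, name, blocks=None):
        self.name = name
        self.blocks = dict(blocks or {})
        self.soft_errors = 0


def record_soft_error(drive, spare):
    """Count a recoverable error and migrate to the spare at the threshold.

    Returns the drive that should serve data from now on.
    """
    drive.soft_errors += 1
    if drive.soft_errors >= SOFT_ERROR_THRESHOLD:
        # The drive is still readable, so its data can simply be copied.
        spare.blocks = dict(drive.blocks)
        return spare
    return drive
```

The weakness noted above is visible here as well: a drive that fails outright never reaches the copy step, so the spare has nothing to copy from unless a separate redundancy scheme can reconstruct the data.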
RAID Serves as an Alternate Approach
RAID is an alternate approach to preventing data loss. It is a set of definitions for ensuring that, even if a DASD unit fails, the system will continue to run and provide data to the user. While RAID is just now becoming available for the AS/400, it represents the future for data storage and protection.
Currently, there are five RAID definitions, developed by researchers at the University of California at Berkeley. These methods use redundancy to reconstruct data should a data unit fail. A sixth definition, RAID 0, describes operation without data protection.
RAID 1 is similar to AS/400 mirroring. The DASD subsystem maintains two copies of data. However, it requires no host software or hardware to implement or run. Even though there isn’t much host overhead with AS/400 mirroring, DASD subsystem-implemented mirroring eliminates the overhead completely.
RAID 2 and 3 are suitable only for systems that make few requests for huge amounts of data, such as engineering or scientific applications.
RAID 4 and 5 can be suitable for on-line processing, the kind of work the AS/400 is designed to do. Currently, RAID 4 has a severe performance penalty because it uses a single parity drive, but RAID 5 overcomes this by spreading the parity data over all drives.
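The difference between a single parity drive and distributed parity can be shown by counting where parity lands for each stripe. The sketch below uses one simple rotation for RAID 5; actual layouts vary by implementation, and the function names are illustrative.

```python
from collections import Counter


def raid4_parity_drive(stripe, drive_count):
    """RAID 4: parity always lives on the same dedicated drive."""
    return drive_count - 1


def raid5_parity_drive(stripe, drive_count):
    """RAID 5: parity rotates across drives (one possible rotation)."""
    return (drive_count - 1 - stripe) % drive_count


def parity_load(placement, stripes, drive_count):
    """Count how many stripes place their parity on each drive."""
    return Counter(placement(s, drive_count) for s in range(stripes))


raid4_load = parity_load(raid4_parity_drive, 8, 4)
raid5_load = parity_load(raid5_parity_drive, 8, 4)
```

Since every write must also update parity, the drive that holds parity for all eight stripes under RAID 4 becomes the bottleneck, while under RAID 5 each of the four drives holds parity for two stripes.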
The first AS/400 data storage management system based on RAID 5 and hot-spare technology is now available from XL/Datacomp. Known as Alpine, this data storage device allows reconstruction of data even though a disk failure has occurred. This not only protects users against loss of data due to disk failure, but because the system remains operational, users don’t lose valuable system time during recovery.
Comparing Data Storage Techniques
RAID 5 is much like checksum. In RAID 5, however, the DASD subsystem (the controller) handles all logic. This significantly reduces the burden placed on the CPU, which can boost performance.
All the CPU has to do is write the record and the DASD subsystem handles the rest. If users experience performance problems while using checksum, this could be a better solution than a CPU upgrade.
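What "the rest" involves can be sketched as the parity update a controller performs on each write. The example assumes XOR parity and a read-modify-write update, which is a common way to describe RAID 5 small writes; the function names and the in-memory stripe are hypothetical.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(data_blocks):
    """Parity over all data blocks in a stripe."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = xor(parity, block)
    return parity


def controller_write(stripe, parity_index, data_index, new_data):
    """Update one data block and its parity without reading the other blocks.

    new parity = old parity XOR old data XOR new data
    """
    old_data = stripe[data_index]
    old_parity = stripe[parity_index]
    stripe[data_index] = new_data
    stripe[parity_index] = xor(xor(old_parity, old_data), new_data)


# Three data blocks followed by one parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", None]
stripe[3] = full_parity(stripe[:3])
controller_write(stripe, 3, 1, b"\xaa\xbb")
```

In this picture the host issues a single write, and the extra reads, XOR work and parity write all happen inside the subsystem rather than on the CPU.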
The current implementation of checksum eliminates the need to reload data, but if a drive fails, the system still goes down.
In RAID 5, however, the DASD subsystem reconstructs data while the data is still available to the AS/400. In an extension to RAID 5, a hot-spare unit can be used to replace the failed unit. This allows the system to stay up and available to users, even if the second unit of a mirrored pair should fail.
Each RAID level has its benefits and drawbacks. Because of the various capabilities, each definition addresses a different operating environment; one is not inherently better than another.
However, the key to RAID is in the way it adds value. For example, fault tolerance or hot-spare can be combined with RAID to provide not only data protection, but true continuous availability of data to end users.
Determining the Best Solution
When considering which technique to employ (journaling, checksum, mirroring, hot-spares or RAID), users must weigh the benefits and drawbacks of each.
In evaluating the costs, companies need to thoroughly investigate the cost of data protection versus the cost of lost personnel productivity and lost or damaged business functions.
Imagine not being able to recover valuable data, not being able to bill customers or not being able to ship or receive goods. While some protection techniques may seem costly, compared to the costs associated with system downtime, they could easily pay for themselves at the first disk failure.
Managers who identify the level of protection and performance they need and research the various alternative storage techniques available will be able to provide their companies with optimal performance and cost-effective data storage, retrieval and protection.
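The comparison can be reduced to simple arithmetic. The sketch below is only a framing aid: the function names are hypothetical, and every input is something each company must supply from its own estimates.

```python
def downtime_cost(hours_down, cost_per_hour, recovery_labor=0.0):
    """Estimated cost of one disk failure without protection."""
    return hours_down * cost_per_hour + recovery_labor


def pays_for_itself(protection_cost, hours_down, cost_per_hour,
                    recovery_labor=0.0):
    """True if a single avoided failure covers the cost of protection."""
    return downtime_cost(hours_down, cost_per_hour, recovery_labor) >= protection_cost
```

With purely illustrative inputs, `pays_for_itself(50000, 40, 2000)` compares a 50,000 outlay against 40 hours of downtime at 2,000 per hour, or 80,000, so the protection would be recovered at the first failure.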
E. Robert Kleckner became XL/Datacomp Vice President, Marketing Support in 1990, responsible for technical information on XL/Datacomp products. Prior to that, he was Vice President, Technical Support. Previously, he was an IBM Senior Systems Engineer and Regional Designated Specialist for the System/38.
This article is adapted from Vol. 5 #4.