In mid-1990, the 11 banks of the New York Clearing House Association passed a resolution requiring member banks that process a daily average of 20 billion dollars or more in wire transfers through their funds transfer systems to comply with a new set of standards for second-level contingency. The clearing house banks must be in compliance with the new standards by June 1991.
The self-imposed regulations are the most comprehensive set of rules adopted by any group of companies in the world. They go substantially beyond the CHIPS rules and procedures for primary sites, which include:
- Fully redundant on-site CPU backup, including discs, tapes, and terminals.
- Back-up generator or alternate power source for its data center.
- The ability to bring its computer equipment down “softly”...
- Telephone backup lines.
- Adequate security (data and physical).
The new stipulations for a secondary facility extend beyond the institution’s primary processing facility and are the most comprehensive and specific to date among a group of competing businesses that depend on one another for certain functions. The secondary regulations address throughput, location, time frames, control/security, communications, and testing. Some of the new requirements are:
“To ensure that the depository institution can be operational and recover its ability to send and receive wire transfers over CHIPS, Fedwire, and any other network of which it is a participant within six hours of the declared disaster...
The secondary facility must be capable of processing the depository institution’s normal daily average of both number of transactions and value of transactions by approximately midnight of the day of the disaster...
The secondary site should be located at a place sufficiently removed from the primary site so that a disaster would not likely impair both sites at the same time...
The secondary site must receive its electrical power from a power grid different from that which supplies power to the primary site...
The secondary site must be served by a telephone company central office different from that which serves the primary facility...
A secondary data center must have emergency power sufficient to continue the institution’s normal data center operations. Emergency power is also recommended for the back-office site...
The secondary site must be staffed...by both data center and back-office personnel in time to meet the throughput criteria specified above...
The secondary site must be operational within six hours of the declaration of a disaster...
The secondary site is expected to be able to recover the institution’s processing from the point of failure and complete all processing prior to midnight....”
The reasons for such controls are obvious enough to those of us who have been around the data processing business for some time. “There is no question that financial institutions are totally dependent on the use of computers in order to maintain their competitive edge in the marketplace,” says the contingency planner of a large brokerage firm. “This necessitates having a comprehensive contingency plan because, if the computer is lost, the institution cannot conduct business. Even the loss of a few hours of transactions can be extremely costly. For this reason, we have implemented an electronic vaulting capability that captures changed data records as the changes occur and writes them offsite to a tape library where they can be available--if needed--for point of failure data recovery.”
Currently, the New York Clearing House member banks are well on their way to having systems fully operational by June that will satisfy the new regulations. The group has also recommended to the Federal Reserve Bank that any bank in the nation that averages over 20 billion dollars daily in funds transfer meet the same requirement. “Because of the interdependencies among the nation’s banks, the New York Clearing House and the Federal Reserve, it seems prudent to extend these requirements beyond the New York banks,” says George Thomas of the New York Clearing House Association. “The CHIPS system averaged 148,801 payments daily during 1990, for an average sum of over 885 billion dollars. With an average CHIPS transaction being almost six million dollars, you can see why we can’t afford to lose even a single transaction.”
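The averages Thomas cites are simple arithmetic, and they check out; a quick sketch using only the figures quoted above:

```python
# CHIPS daily averages cited for 1990 (figures from the article)
payments_per_day = 148_801
dollars_per_day = 885e9  # over 885 billion dollars

avg_transaction = dollars_per_day / payments_per_day
print(f"Average CHIPS transaction: ${avg_transaction:,.0f}")
# roughly $5.9 million -- "almost six million dollars," as quoted
```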
The New York Clearing House maintains a hot backup facility in New Jersey that provides a real-time mirror of their DASD resident files. This gives them the ability to recover processing within five minutes of a failure. It takes 12 minutes to actually re-route all 150 of their communications lines, but they can have a sufficient number of lines re-routed within five minutes to start processing. The real key here is that they lose no data and have full integrity when processing resumes.
One of the larger member banks has found that having data available for recovery to the moment of failure would not be adequate for full compliance with the new regulations.
“We don’t have enough time in the six-hour recovery window to download the necessary files from a tape library,” says their disaster recovery planner. “In order to meet the timing requirements, we must have a duplicate data set, DASD resident, at a second data center location. More than that, from a business standpoint, it can be very costly to the bank if the funds transfer system is not operating; therefore, we must have the application software ready to go immediately. Our bank handles more than 350 billion dollars in the funds transfer system on an average day.” (The interest on 350 billion dollars, at ten percent simple interest, amounts to over 66 thousand dollars per minute, 24 hours per day, 365 days per year.)
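The parenthetical figure is easy to verify; this sketch works it out from the numbers in the quote:

```python
# Interest carry on a day's funds transfer volume, per the parenthetical.
daily_volume = 350e9               # dollars handled on an average day
annual_rate = 0.10                 # ten percent simple interest
minutes_per_year = 365 * 24 * 60   # 525,600 minutes

interest_per_minute = daily_volume * annual_rate / minutes_per_year
print(f"${interest_per_minute:,.0f} per minute")
# just over $66,000 for every minute the system is down
```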
“It makes good business sense for banks to take what appears to be extreme steps for contingency planning,” says Phil Campbell of Chase Manhattan Bank.
“Funds transfer is a high profile application, but it isn’t the only banking function that needs the type of second-level recovery regulations that will be in effect in June; and banking isn’t the only industry that should be enforcing such regulations.
“The brokerage industry deals in large sums of money that belong to the public or to investing institutions; and it, too, should look carefully at its level of protection.”
The facts are obvious: all of industry has grown steadily more dependent on the computer. More than that, recent disasters have affected all forms of business in many geographical areas of the country. Recognition of the need for extensive business recovery plans has developed substantially in recent years, but there is still progress to be made in most industries.
On a positive note, more comprehensive recovery capabilities are now available from business recovery vendors. For those companies that do not have multiple facilities or the ability to continue critical functions in the event of a failure, hot site providers have risen to the occasion to offer either electronic vaulting or shadow file processing capabilities. The hardware, software, and communications capabilities are all in place to enable protection of all critical applications. With respect to data volumes, tests have been run using T3 lines at distances as far as three thousand miles, and results indicate that full local channel transfer speeds can be attained.
Perhaps the most interesting aspect of this regulation is that it was self-imposed. It certainly represents a significant step in the recognition of contingency planning as an essential part of any business today, but also interesting is the requirement that a computer disaster recovery facility must have electrical power and communications capabilities totally independent from the primary processing facility.
While most business recovery planners would consider these regulations to be “common sense” for critical financial systems, very few corporations, institutions or industries have taken the necessary steps to assure recovery to the extent required by these regulations.
It is good to see that the larger banks are taking the lead in this area, and, hopefully, the rest of industry will follow that lead in a reasonable time period.
In Part 1, I outlined the resolution self-imposed by the banks of the New York Clearing House Association requiring them to have the ability to recover their funds transfer systems within six hours of the declaration of a disaster.
They will also be required to complete the full day's processing by midnight of the same day. It seems appropriate to review the progress the member banks are making toward this objective and to further review some of the regulatory and common sense approaches to contingency planning currently underway in some other industries.
There are 11 banks which comprise the association. Each of the institutions operates its own electronic funds transfer application and must comply with both New York Clearing House and Federal Reserve Bank requirements. As stated in the last article, there is a tremendous amount of money processed through those systems on a daily basis. There is also a great deal of interdependency among the banks, and if any one of them should be unable to complete a day’s transactions, the others could be affected.
The member banks operate funds transfer on a variety of computer platforms, including IBM, DEC, Tandem, and Unisys. Some of the members have contacted a disaster recovery service vendor to provide shadow file processing service, such that duplicate files are maintained at the vendor’s locations. Transaction activity is transferred to the vendor’s site as the records are being written to the journal file on the host computer. The “after image” journal records can then be applied to the “shadow” files or data bases in regular time increments, such as every half hour. The software to drive this system is available on both IBM and Tandem platforms. As currently implemented, it requires about 15 minutes to recover the application and resume processing. “New hardware, software, and communications technology have facilitated this substantial improvement over the last several years,” says Leo Bressman of Manufacturers Hanover Trust in New York.
With the Tandem software (RDF), recovery can be even quicker since it has the ability to update the shadow files as the journal records are received, eliminating the time required to do a forward file recovery prior to bringing up the backup system. “We have customers using this facility for more than disaster recovery,” says Anita Danielson, RDF product manager for Tandem Computers. “Some are planning to use RDF to reduce the time required for planned system outages, such as physical moves, system changes or preventative maintenance. Others are using the remote shadow file to satisfy reporting requirements, since the files are available for browse-read access.”
The remote data files are called shadow files because they are not quite as current as the primary files.
The remote copy can lag the primary by milliseconds to minutes, depending upon such factors as transmission speeds and blocking factors.
Generally speaking, integrity of the data on the remote files can be restored by identifying “in-flight” transactions and enabling the manual re-entry of these few transactions.
The shadow file process is generally transparent to the host computer, operating in its own address space and transferring journal records cross memory from each designated application to its own address space. The records are then written off-site to the vendor’s machine.
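The journal-apply step described above can be sketched in miniature. This is illustrative only: the record layout and function names are assumptions, and real products operate on DASD data sets rather than an in-memory table, but the batching idea is the same.

```python
# Illustrative sketch: applying "after image" journal records to a shadow
# file, here modeled as a dict keyed by record ID.
from typing import Dict, List, Tuple

# (sequence number, record key, after image) -- a hypothetical layout
JournalRecord = Tuple[int, str, dict]

def apply_journal_batch(shadow: Dict[str, dict],
                        batch: List[JournalRecord]) -> int:
    """Apply a batch of after-image records in sequence order.

    Returns the highest sequence number applied; the remote site keeps
    this "high-water mark" so it knows where recovery must pick up.
    """
    last_seq = 0
    for seq, key, after_image in sorted(batch):
        shadow[key] = after_image   # the after image replaces the record
        last_seq = seq
    return last_seq

# Example: one periodic batch of journal records (e.g., a half hour's worth)
shadow_file: Dict[str, dict] = {}
batch = [
    (1, "ACCT-100", {"balance": 500}),
    (2, "ACCT-200", {"balance": 750}),
    (3, "ACCT-100", {"balance": 425}),  # later update to the same record
]
high_water = apply_journal_batch(shadow_file, batch)
print(shadow_file["ACCT-100"], high_water)   # {'balance': 425} 3
```

Applying records in sequence order is what makes the shadow copy converge on the primary: a later after image for the same key simply overwrites the earlier one.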
Queuing of journal records must be accommodated so that the process is not interrupted if a transmission link is lost. Some of the advantages of shadow file processing include the following:
- Recovery is nearly immediate
- Data is quickly available with few or no lost transactions
- All transfer of data and software is done electronically, with no manual intervention or ground transportation requirements
- The implementation is transparent to systems and applications software
- It is not necessary to transmit or transport backups to the remote site for recovery or for recovery testing
- It is not necessary to have off-site storage for backups for those critical systems that are shadowed
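The queuing requirement mentioned before the list can be sketched with a simple in-memory buffer. Everything here is illustrative; a production shipper would spool the queue to disk so a host outage does not lose it.

```python
# Illustrative sketch: journal records queue locally while the link is
# down, then drain when it returns, so host processing never stops.
from collections import deque

class JournalShipper:
    def __init__(self, send):
        self.send = send        # callable that transmits one record
        self.link_up = True
        self.queue = deque()    # records awaiting transmission

    def write(self, record):
        """Called as each journal record is written on the host."""
        self.queue.append(record)
        if self.link_up:
            self.drain()

    def drain(self):
        while self.queue:
            try:
                self.send(self.queue[0])
            except ConnectionError:
                self.link_up = False   # keep queuing; retry later
                return
            self.queue.popleft()       # remove only after a good send

    def link_restored(self):
        self.link_up = True
        self.drain()

# Example: the link drops mid-stream and no records are lost.
sent = []
link_ok = [True]
def send(rec):
    if not link_ok[0]:
        raise ConnectionError
    sent.append(rec)

shipper = JournalShipper(send)
shipper.write("J1")
link_ok[0] = False
shipper.write("J2")          # queued, not lost
shipper.write("J3")
link_ok[0] = True
shipper.link_restored()
print(sent)                  # ['J1', 'J2', 'J3']
```

Note that a record leaves the queue only after it has been sent successfully, which is what preserves the no-lost-data property across a link outage.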
The following are some disadvantages of shadow file processing:
- There can be a slight delay in recovery while journals are applied and in-flight transactions are identified
- It is possible to lose transactions, although this risk can be minimized by reconciling in-flight transactions at the remote site
- It is necessary to implement a change management system that assures that all systems and applications software modifications installed at the primary site are also installed at the backup site
- It is necessary to maintain (or subscribe to) a computing capacity at a remote location that is adequate to handle the continuous updating of the shadow files and has the power to support the critical application(s) if it should be required
- Adequate DASD must be maintained at the remote site to store software and shadow files
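The in-flight reconciliation mentioned in the first two points amounts to comparing sequence numbers. The sketch below is an assumption about how such a check might look, not any vendor's actual procedure:

```python
# Illustrative sketch: after a failure, "in-flight" transactions are those
# journaled at the primary but never applied at the shadow site -- found by
# comparing sequence numbers against the shadow's high-water mark.
def in_flight(primary_journal, shadow_high_water):
    """Return transactions written at the primary but not yet applied
    remotely; these are the candidates for manual re-entry."""
    return [rec for seq, rec in primary_journal if seq > shadow_high_water]

journal = [(101, "PAY-A"), (102, "PAY-B"), (103, "PAY-C")]
missing = in_flight(journal, shadow_high_water=101)
print(missing)   # ['PAY-B', 'PAY-C']
```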
Shadow file processing is probably the most economical method of providing for immediate recovery of critical systems available today. Some of its shortcomings can be overcome by implementing a full mirror image system, such as was done for the Clearing House Inter-Bank Payment System (CHIPS) discussed in the last article. This requires changes to all application code, inserting dual write logic in all programs that open a file in update mode. The cost in terms of both programmer and computer time is prohibitive in many organizations.
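The dual-write logic that a mirror-image system inserts into each updating program might look like the wrapper below. This is a sketch under stated assumptions: in a real system the second write goes over a channel-extended link to remote DASD, and the class and field names are illustrative.

```python
# Illustrative sketch: dual-write logic for a mirror-image system.
# Every update is written to both the local copy and the remote mirror
# before the update is considered complete.
class MirroredFile:
    def __init__(self, local, remote):
        self.local = local     # stands in for the primary DASD data set
        self.remote = remote   # stands in for the remote mirror

    def write(self, key, record):
        self.local[key] = record
        self.remote[key] = record   # both writes must succeed
        # A production system would also handle a failed remote write,
        # e.g., by suspending mirroring and logging the gap.

local, remote = {}, {}
f = MirroredFile(local, remote)
f.write("ACCT-100", {"balance": 425})
print(local == remote)   # True
```

Retrofitting a wrapper like this into every program that opens a file in update mode is precisely the programmer-time cost the paragraph above describes.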
As part of the regulatory controls of the funds transfer system, banks are subject to penalties if they do not comply with the transaction processing and reporting requirements. One Chicago-based bank had a recent experience when the combination of a software bug and heavy volume due to a holiday caused them to miss payments on over 400 transactions. The cost to the bank in penalties and interest for this single incident amounted to approximately $80,000. That kind of expense could go a long way toward justifying additional budget money for business recovery.
Another aspect of this compliance is the need to recover the business side--i.e., the people who interface with the critical system and upon whom the system is dependent. Business recovery facilities are becoming more common, both from vendors and within corporations for their exclusive use. Recently, when a transformer fire caused a large Manhattan building to be closed to people (due to fear of possible PCB contamination), a major New York bank moved its funds transfer department to a vendor’s business recovery facility (BRF) in New Jersey. From an operations console at the BRF, they actually operated their computer, which was located in the closed building. All work was processed as usual, and all reporting deadlines were met on schedule.
Outside of the banking industry, recovery planning continues to grow in importance. In hospitals, grocery retailing and manufacturing, for example, more and more operations require 24-hour-per-day operation and depend upon the availability of their information systems to meet this objective. One large midwestern heavy equipment manufacturer provides information processing for 12 plants. With “just-in-time” inventory control systems, it is imperative that information systems continue to operate, or the manufacturing plant could be shut down. The expense of shutting down production and sending workers home is very high, from the standpoint of both lost production and wages paid for work not performed.
In hospitals, patient care and pharmacy applications are truly round-the-clock. With increasing frequency, hospitals are becoming dependent upon computer information to perform their everyday tasks. It is essential that the data be available for every patient at the moment the doctor or nurse needs that information.
Insurance carriers, particularly those involved in underwriting business interruption policies, are taking a very close look at a company’s ability to recover from interruptions with a minimum of lost time and expense. According to one large underwriter in New Jersey, a company that demonstrates the ability to recover quickly can benefit from reduced premiums. This is due to two factors. First, the management of the company, knowing that it can recover in a short period of time, is often inclined to lower the amount of insurance it carries, recognizing that the risk of a prolonged outage has been substantially reduced. Second, insurance underwriters, understanding the customer’s ability to recover quickly, might be more inclined to give the company a lower premium rate, recognizing that it is a better than average risk.
Whether the motivation in your company is avoiding penalties, compliance with government or industry-enforced regulations, or simply trying to maintain a steady work flow and a competitive edge in your industry, it still makes good sense to have a well-developed, thoroughly tested business recovery plan.
Jim Grimm, CDRP, has nearly 25 years of experience in the computer industry and is the Chairman of Business Recovery Consultants, Inc. in Littleton, Colorado.
This article adapted from Vol. 4 No. 1, p. 46; No. 2, p. 54.