Approaches to Business Continuity Planning
A plan for business continuity must describe the actions to be taken in the event of a serious disruption of normal business activities. It should address criteria for execution of the plan, define responsibilities and authorities, and give guidance to those who will be executing the plan. It must be a “living document” that is kept up to date as changes are made in the organization and the data processing system.
There are many ways to approach the creation of a business continuity plan. A “center-level” approach requires the data processing department to write a plan to back up either all applications running on a system or all systems in the center. In the event of a disaster, all applications are recovered at the same time whether or not they are critical, adding time and complexity to the process.
An “application-level” approach to planning often better meets the needs of enterprises running critical applications. Planners in individual business functions determine their critical business processes and the applications that support them, and then develop contingency plans for each of these applications. The benefits of this approach are that 1) non-critical applications do not consume valuable recovery time, 2) multiple applications can be recovered in parallel, and 3) the priorities of end users are considered, putting them back to work faster.
Whichever planning approach is selected, the more people who are actively involved in creating the plan, the greater the chance that the plan will work and computer service will be available.
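To make the contrast between the center-level and application-level approaches concrete, the following is a minimal sketch rather than a planning tool. The application names and restore times are hypothetical. It also assumes that a center-level recovery restores applications one after another with shared resources, so their times add up, while an application-level plan restores only the critical applications, in parallel.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    restore_hours: float  # hypothetical time to restore this application
    critical: bool        # does a vital business function depend on it?


def center_level_recovery_hours(apps):
    """Assumed model: every application is restored, one after another."""
    return sum(app.restore_hours for app in apps)


def application_level_recovery_hours(apps):
    """Assumed model: only critical applications are restored, in parallel,
    so the elapsed time is that of the slowest critical restore."""
    critical = [app.restore_hours for app in apps if app.critical]
    return max(critical, default=0.0)


# Hypothetical application inventory for illustration only.
apps = [
    Application("order_entry", 3.0, True),
    Application("funds_transfer", 2.0, True),
    Application("payroll_reports", 4.0, False),
    Application("archive_query", 5.0, False),
]

center_hours = center_level_recovery_hours(apps)            # 14.0
application_hours = application_level_recovery_hours(apps)  # 3.0
```

Under these assumptions the critical applications are back in service after 3 hours instead of 14, which is the effect the three benefits listed above describe.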
The Need for High Availability Systems
Before a computer outage occurs, an enterprise can protect applications supporting vital business functions by using a computer architecture that provides high availability through hardware and software fault tolerance. Such an architecture requires multiple processors with separate copies of the operating system, and pairs of essential components (such as disks, buses, and controllers). If one component fails, the other takes over without loss of data or service. Likewise, in a system with software fault tolerance, if one software component fails, the other takes over immediately to keep the application running.
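The takeover behavior of a paired component can be sketched in a few lines. This is an illustrative model only, not how any particular fault-tolerant system is implemented. The `Component` and `RedundantPair` classes and the component names are hypothetical.

```python
class Component:
    """A hypothetical essential component, such as a disk or controller."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return f"{self.name} handled {request}"


class RedundantPair:
    """Routes requests to the active component and switches to the
    standby if the active one fails, so the caller sees no outage."""

    def __init__(self, primary, backup):
        self.active = primary
        self.standby = backup

    def handle(self, request):
        try:
            return self.active.handle(request)
        except RuntimeError:
            # Swap roles and retry the same request on the survivor.
            # If both components have failed, the error propagates.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


disk_a = Component("disk_a")
disk_b = Component("disk_b")
pair = RedundantPair(disk_a, disk_b)

first = pair.handle("write 1")   # served by disk_a
disk_a.healthy = False           # simulate a component failure
second = pair.handle("write 2")  # served by disk_b, request not lost
```

The point of the sketch is that the request that hit the failed component is retried on its partner rather than dropped, so the application keeps running.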
In addition to high availability, such an architecture is also well-suited to OLTP because it provides the following:
- Data Integrity--System software ensures that a transaction is completed as a whole or not at all, even in the event of a power or other system failure.
- Security--Database management is integrated into the operating system. This prevents subversion by a user who opens and writes to files, going through the operating system and bypassing the database.
- Linear Performance Growth--Each processor added to the system contributes almost 100% of its capacity as additional performance. This performance growth can continue almost indefinitely.
- Modular Expandability--More processors, disks, workstations, etc. can be added to the system without taking it down or changing application or system code.
- Connectivity--The ability to connect to other vendors’ systems and networks protects an enterprise’s investment and improves productivity of users and equipment.
- Distributed Processing--One logical database is spread across any number of geographically remote systems. Users at local and remote systems perceive the entire database as if it were stored locally.
- Price/Performance--Excellent price/performance is gained by using a high-performance parallel-processing architecture coupled with an operating system that is optimized for OLTP. Added to this is a database that is tailored for OLTP and closely matched to the architecture.
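The data integrity property in the list above, that a transaction completes as a whole or not at all, can be illustrated with a small sketch. It uses Python's standard `sqlite3` module purely as a stand-in. It is not the system software the article describes, and the `accounts` table, account names, and `fail_midway` flag are hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move funds between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                # Simulate a failure after the debit but before the credit.
                raise RuntimeError("simulated failure")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except RuntimeError:
        return False
    return True


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("checking", 100), ("savings", 50)],
)
conn.commit()

failed = transfer(conn, "checking", "savings", 30, fail_midway=True)
after_failure = balances(conn)   # the debit was rolled back
succeeded = transfer(conn, "checking", "savings", 30)
after_success = balances(conn)   # both debit and credit were applied
```

After the simulated failure the balances are unchanged. The half-finished debit never becomes visible, which is the behavior an OLTP application relies on.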
A robust, parallel architecture can play an important role as part of a business continuity plan. With some systems there is a requirement to bring the system down when new applications or hardware are added. With a hardware and software fault-tolerant architecture, new applications and new hardware can be added without taking the system down and without the need to change code. In addition, most system maintenance can be performed while the system is online.
Coupled with the use of an “online ready site” (a method of shadow vaulting), an enterprise can obtain close to continuous availability for OLTP applications, even in the event of faults (computer, telecommunications, or human) or adverse environmental conditions (power outages, fires, floods, etc.).
Disaster Recovery Solutions
To support a business continuity plan, an enterprise must select a recovery method. These methods include hot-sites, cold-sites, mobile sites, service bureaus, reciprocal contingency agreements, and an online ready site.
A hot-site requires from 12 to 48 hours to take over service after a disaster. This includes the time spent retrieving the database tapes from archival storage, transporting the tapes and DP staff to the hot-site, restoring the data to disk, and restarting the application. Archived data used with a hot-site is out-of-date by the amount of time since the last magnetic tape copy of the database was made and physically transported to the site. This lost data can represent one or more days of activity, depending on the backup schedule. This level of protection is not adequate for critical OLTP applications.
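The two exposures described for a hot-site, how stale the recovered data is and how long takeover lasts, come down to simple arithmetic. The sketch below shows that arithmetic. The backup interval, transport time, and per-step durations are hypothetical values chosen for illustration, and only the 12 to 48 hour overall range comes from the text.

```python
def worst_case_data_loss_hours(backup_interval_hours, offsite_transport_hours):
    """Age of the newest off-site tape in the worst case: the disaster
    strikes just before the next tape would have arrived off-site."""
    return backup_interval_hours + offsite_transport_hours


# Hypothetical durations for the takeover steps named in the text.
takeover_steps_hours = {
    "retrieve tapes from archival storage": 2.0,
    "transport tapes and DP staff to the hot-site": 6.0,
    "restore data to disk": 8.0,
    "restart the application": 2.0,
}

# Assumption: a daily tape backup that takes 4 hours to reach storage.
data_loss = worst_case_data_loss_hours(24.0, 4.0)       # 28.0 hours
takeover_time = sum(takeover_steps_hours.values())      # 18.0 hours
```

With these assumed figures, more than a day of transactions would have to be re-entered, on top of an 18 hour outage, which shows why this level of protection falls short for critical OLTP applications.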
A cold-site agreement provides a computer-ready room reserved for the subscriber’s system. It usually contains power distribution systems, phone wiring, a raised floor, and temperature control. A minimum effort should be necessary to deliver and assemble the computer system at the site, and arrangements for quick delivery should be made with a hardware vendor so that operations can be restored before losses become unacceptable. Cold-site recovery can be time-consuming and costly in terms of lost business.
Mobile hot-sites and cold-sites are a relatively new service. Computer-ready trailers can be set up in a subscriber’s parking lot and linked by a trailer sleeve to create a space to suit the subscriber’s recovery needs. This minimizes the travel arrangements for DP employees who may be reluctant to leave their homes and families after a disaster. The service allows a decentralized organization to engage one vendor to service its entire organization.
Service bureaus provide immediate access to timesharing services at a cost that is usually less than other backup options. However, service is usually available for short-term use only, and there is little database security. In any shared service agreement, the promises made to other subscribers can interfere with an enterprise’s urgent needs, and service conditions and capabilities are subject to change. In the event of a regional disaster, there is the potential that the supplier will not be able to provide the required service within the necessary timeframe.
A reciprocal contingency agreement with another company with similar computer systems and applications is an inexpensive alternative, but it can have many drawbacks. The agreements are not always enforceable. The site owner has first priority for its use, and access time for testing may be difficult to obtain. Programming changes are usually required to run the recovering site’s applications on another equipment configuration. If the two sites are located in the same area, they both could be impacted by a disaster.
An online ready site is a complete computing environment with computer systems, applications, telecommunications facilities, staff, and a continuously updated copy of the database. When the backup application is already running, it provides recovery within minutes of a disaster, instead of the hours or days required with other types of sites. It allows the enterprise to maintain a current, online copy of a database on a remote network node. The site can be located on a system next door for convenience, or across the nation to minimize the effects of wide area or regional disasters. Because the system immediately updates the copy of the database after the original database is updated, data loss from a disaster can be limited to as little as one second of processing.
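The reason data loss stays so small with a continuously updated copy can be seen in a toy model. The sketch below is an assumption-laden illustration, not a description of any vendor's replication mechanism. The `ShadowedDatabase` class is hypothetical and models each database as a dictionary, with a queue holding updates that have been applied locally but not yet at the remote site.

```python
from collections import deque


class ShadowedDatabase:
    """Toy model of a primary database with a remote, continuously
    updated copy. Only queued, unshipped updates are exposed to loss."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.in_flight = deque()  # applied locally, not yet applied remotely

    def update(self, key, value):
        self.primary[key] = value
        self.in_flight.append((key, value))

    def ship(self):
        """Apply every queued update to the replica, in original order."""
        while self.in_flight:
            key, value = self.in_flight.popleft()
            self.replica[key] = value

    def lost_if_primary_fails_now(self):
        """Updates that the remote copy would be missing after a disaster."""
        return list(self.in_flight)


db = ShadowedDatabase()
db.update("account_1", 100)
db.ship()                                # the remote copy is now current
db.update("account_1", 70)               # disaster strikes before shipping
exposed = db.lost_if_primary_fails_now()
replica_at_failure = dict(db.replica)
```

Only the single update still in the queue is at risk, and the remote copy holds everything shipped before it. The shorter the interval between an update and its shipment, the smaller that exposure, in contrast to a tape copy that is hours or days old.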
Planning for Prevention
While no one likes to think about potential disasters and how they impact business and personnel, the best way to reduce the risk of catastrophic data loss is by maintaining a business continuity plan. To support the plan, a disaster recovery solution that provides cost-effective, time-efficient protection and a fault-tolerant system architecture can serve to prevent unavailability of critical business applications. This support can assure enterprise management of the continuance of their business and give them the confidence to increase their reliance on OLTP as a means to advance enterprise competitiveness.
Michael Katz is the Product Marketing Manager with Tandem Computers, Inc. in California. He is responsible for developing Tandem’s corporate programs and strategies that support manageability, operability, security, and support. Prior to this position, he was the Manager of Systems Software Product Management.
This article was adapted from Vol. 4 No. 1, p. 20.