A Model For The Critical Paths
A return to normal processing after a limited event or a catastrophic event follows a logical chain of actions (see above graphic). Some paths lead to a very quick return to normal, while others lead down longer roads. The delineated boxes of the model are not arbitrary: a transition from one box to the next occurs along a determined binary path. You may or may not successfully resolve the challenges of a given box, and your success or failure at each box determines the path, and ultimately whether you return to normal or experience loss.
A crisis is the response to a limited event in which some aspect of normal processing is halted and there is no immediately known solution. For example, at one data center during routine systems maintenance over a holiday weekend, a systems programmer renamed a critical data set (SYS1.PROCLIB) in order to re-size it. Absentmindedly, this person forgot to rename the data set back to its true name and then logged off. At the time, only one operator was logged onto the system, and that operator did not have the requisite security to rename the data set. An IPL would have left the system in a disabled state, and it would have had to be recovered or, if there had been no backup, rebuilt. Hence a crisis ensued: there was no existing damage, but there was the potential for a lengthy outage. Fortunately, another systems programmer figured out a way to hack the security from the operator’s I.D. and rename the data set, thus averting a disaster.
A disaster is an event in which normal processing is halted and remediation is required. A disaster has two possible entry points. A catastrophic event can lead directly to a disaster, or a crisis can degrade into a disaster.
An illustrative example occurred at another data center. During a snowstorm that shut down the state of Connecticut, a large vent was blown off the roof of the data center. Snow blew in and melted into the ceiling tiles, which fell to the floor under the weight of the water. Snow also fell on computer equipment, melting into the equipment and down into the raised floor.
When the situation was discovered, facilities were contacted and they improvised a funnel out of plastic sheeting and duct tape to direct the melting snow into a plastic drum. (They couldn’t get onto the roof with the wind.) A systems programmer powered down the affected computer equipment and facilities mopped up the water from the floor. After a few hours the remaining water evaporated and the equipment was powered up. Facilities fixed the roof when the wind died down. Damage occurred, but no recovery was required so the situation returned to normal.
In some situations a disaster cannot be remediated outright, and recovery is required. For example, in yet another data center, smoke poured out of a UPS during a routine power failover test. The data center manager had the wisdom and temerity to hit the EPO button. This avoided extensive damage to the data center and lessened the extent of the recovery, but recovery of data was still necessary, as was remediation of the damage to the UPS.
If no recovery infrastructure exists, you cannot recover, and loss occurs. Even if you fully recover your data center, such as at a hot site, you still need to remediate the damage at your home site or find a new one. If for some reason you cannot remediate the situation at your home site, then you will also experience loss, even with a fully recovered infrastructure at the hot site. This can happen when a disaster destroys some unique aspect of your home site that your organization requires, such as its location. While this is a very rare occurrence, for such a disaster there is no remedy.
When a catastrophic event occurs, there is immediate damage and/or loss of data, which necessarily requires remediation. In the case of the snow in the data center, the roof was remediated, but there was no recovery of systems. Because a solution was improvised, the status of the data center returned to normal. However, if facilities had not been able to fix the roof, or if the fix could not have been sustained, the data center would have had to be recovered at a hot site. Hence, a failed remediation may lead to a situation where recovery is required.
Intermediate Summary: Base Recovery Probability
In any given crisis, there are three chances to return to normal and two chances to experience loss. In a disaster, there are two chances to return to normal and two chances to experience loss. These are the base probabilities of recovery. However, they are not accurate indicators of the true recovery probability, since they do not consider our ability to react to the situation or what processes and plans we may have in effect. In the base scenario, the dice are “unweighted,” and there is an equal probability of getting through each box with success or failure.
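The base scenario can be formalized as a small path enumeration. This is a minimal sketch, assuming each box is an independent fifty-fifty coin flip; the box structure (crisis resolves or degrades to disaster; disaster remediates directly or requires recovery, whose success still requires remediating the home site) follows the model described above, but the function itself is illustrative and not from the article.

```python
# Minimal sketch of the "unweighted dice" base scenario. The 50/50 weights
# and box structure are assumptions drawn from the model in this section.
from fractions import Fraction

HALF = Fraction(1, 2)  # each box is a binary pass/fail with even odds

def outcome_probabilities(start):
    """Enumerate the critical paths from `start` ('crisis' or 'disaster')
    and return (P(return to normal), P(loss))."""
    # Disaster box: remediate directly, or fall through to recovery,
    # whose success still requires remediating the home site.
    p_disaster_normal = HALF + HALF * HALF * HALF       # remediate, or recover then remediate
    p_disaster_loss = HALF * HALF + HALF * HALF * HALF  # recovery fails, or final remediation fails
    if start == "disaster":
        return p_disaster_normal, p_disaster_loss
    # Crisis box: resolve outright, or degrade into a disaster.
    return HALF + HALF * p_disaster_normal, HALF * p_disaster_loss

print(outcome_probabilities("crisis"))    # P(normal) = 13/16, P(loss) = 3/16
print(outcome_probabilities("disaster"))  # P(normal) = 5/8,  P(loss) = 3/8
```

Note how the crisis path, with its extra chance to resolve before any damage occurs, fares better than starting from a disaster, matching the counts of chances above.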
Assigning Probabilities Based On Mitigation Tactics
Each box has its own depth depending upon the size and complexity of the organization and the scarcity or uniqueness of its resources. Some organizations have unique attributes (such as location or people) that need to be assessed. However, the overall probability of a successful outcome for a given box is only as strong as its highest risk, since that risk is the most probable determinant of the box’s outcome. This is the “a chain is only as strong as its weakest link” principle.
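The weakest-link principle can be sketched as taking the minimum success probability across a box’s links. The risk names and scores below are invented for illustration; only the scoring rule comes from the principle stated above.

```python
# Hedged sketch of weakest-link scoring: the risk names and values are
# illustrative assumptions, not figures from the article.
def box_success_probability(link_scores):
    """A box's chance of success is capped by its weakest link."""
    return min(link_scores.values())

crisis_box = {
    "automation": 0.95,
    "change control": 0.90,
    "escalation procedures": 0.60,  # the highest risk dominates the outcome
}
print(box_success_probability(crisis_box))  # 0.6
```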
A base value can be assigned to each box based upon the tactics the company chooses.
• Automation of basic system crisis handling (availability)
• Local high-availability clustering
• Local disk RAID
• Change control with back-out procedures
• Crisis management team
• Crisis leadership training
• Escalation procedures
• Crisis response testing
The potential for crisis exists as long as people are around to run a data center. In my experience, most crises are caused by human error. Data center automation and automation of business processes reduce risk and lower the probability of a crisis. Advanced automation can even handle a crisis on its own, such as a WTO buffer shortage on an MVS mainframe. Change management and clear escalation procedures mitigate and lower the overall risk associated with standard maintenance and other changes to a data center.
Finally, seasoned leadership in the form of a crisis team, with escalation procedures to ensure the team is invoked, will often prevent a crisis from turning into a disaster. Crisis management, however, is not a trivial task. People in a position of responsibility are likely to react to a crisis as Al Haig did during the attempted assassination of Ronald Reagan: it is almost instinctive to try to “take command.” But such a reaction is usually counterproductive. Crisis teams require training to elicit creativity and flexibility, and crisis leadership is critical.
• Remote high-availability clustering
• Data vaulting
• Emergency response teams
• Emergency response procedures
• Emergency response training
• Emergency response testing
• Succession planning
Many organizations test data recovery procedures without ever testing or invoking the emergency procedures that would be used to manage and declare a disaster. Who makes the decision to declare a disaster? Under what circumstances?
Technical teams or applications teams may be well prepared and may have conducted unannounced testing. Senior management, responsible for the actual decision, may not have been trained or tested and may not react appropriately. Experience shows that initial decisions have a lasting impact on how well an organization reacts to an actual disaster.
Moreover, even if high-availability solutions are in place, the management structure will still have to react to the actual disaster. Triage of people and systems may be required at the home site, so these tactics cannot be skipped or taken lightly. Off-site high-availability infrastructure may be up and running while the organization is in chaos.
• Disaster recovery teams
• Business continuity teams
• Disaster recovery plans
• Business continuity plans
• Disaster recovery training
• Business continuity training
• Disaster recovery testing
• Business continuity testing
• Systems complexity
• Systems RTO
• Systems RPO
• Systems availability – High-availability solutions with testing increase recovery probability greatly
• Logistics – Must have the means to take care of people and operations
For the purposes of this exercise, “disaster recovery” refers to the recovery of systems, applications, data, and infrastructure. “Business continuity” refers to business departments, manual work-arounds, and work area recovery.
Systems complexity is measured as a function of the core application systems affected. Depending on the configuration, this is two to the power of N, where N equals the number of unique systems or departments with unique functions (not applications) with clearly defined input and output; the two accounts for inputs and outputs. If a department performs more than one unique function, the power is increased by one for each additional function. Each logical grouping of hardware or people in a business process with clearly defined inputs and outputs represents a system that requires recovery, as long as it remains in the critical path of the primary business function. It is easy to see that the complexity grows geometrically, and hence degrades recovery probability.
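The counting rule above can be sketched directly. This is a minimal illustration of the 2^N measure; the department names and function counts are assumptions invented for the example.

```python
# Hedged sketch of the 2^N complexity measure described above. N counts
# unique functions: a department with two unique functions contributes two.
# The shop below is a made-up example, not data from the article.
def systems_complexity(functions_per_unit):
    """functions_per_unit maps each system/department to its count of
    unique functions with clearly defined inputs and outputs."""
    n = sum(functions_per_unit.values())
    return 2 ** n  # "two to the power of N": inputs and outputs per function

shop = {"claims": 1, "billing": 2, "mainframe ops": 1}  # N = 4
print(systems_complexity(shop))  # 16
```

Adding a single department with one function doubles the measure, which is why complexity so quickly degrades recovery probability.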
Systems complexity also brings in the need for the synchronization of backups and/or availability solutions. The more platforms a shop has, the greater the needs and the greater the difficulty of implementing a data synchronization solution.
• Damage Assessment – Must be accurate (inventory checklist).
• Relief – Neither over- nor under-insured; FEMA or other state relief if necessary. Obtain “pauses” or “continuances” for accounts payable.
• Rebuild – Make sure resources can be obtained quickly for a reasonable price; rebuild in a way that avoids future disasters.
• Relocate – Move the recovered data center from the hot site to a transition site or back to the home site; move recovered business operations to a transition or home site.
Remediation is rarely considered when assessing recovery, yet an organization that depends upon unique technology or scarce resources has particular vulnerabilities here. Remediation is covered by insurance, so the probability of remediation is fairly constant, provided a business follows its legal and fiduciary responsibilities to shareholders. Nevertheless, some businesses have a unique aspect, such as a tourist location, for whose loss there is no remedy.
• What losses are acceptable?
• Who will set priorities for the distribution of assets?
• Who will handle personnel issues?
• Who will address bankruptcy and other legal problems?
A business impact assessment (BIA) tells a business which systems support the business processes with the greatest impact on the business (loss exposure) and therefore require recovery coverage. But some losses, particularly in an extremely complex business, are almost inevitable, or are not worth the expense of coverage. What are they? It is important to know not only what you need to protect, but also what you do not. The typical calculation, which considers the total aggregate loss of a site, is not necessarily helpful, because there may be acceptable losses within it. The BIA is therefore central to assessing loss exposure, and all departments or systems should be categorized as acceptable or unacceptable losses.
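The acceptable/unacceptable split can be sketched as a simple threshold on loss exposure. The system names, dollar figures, and threshold below are assumptions for illustration; a real BIA would weigh far more than a single number.

```python
# Hedged sketch: categorizing BIA results into acceptable and unacceptable
# losses. All figures and the threshold are invented for this example.
def categorize_losses(loss_exposure, acceptable_threshold):
    """Split systems by whether their loss exposure exceeds the threshold."""
    unacceptable = {s: v for s, v in loss_exposure.items() if v > acceptable_threshold}
    acceptable = {s: v for s, v in loss_exposure.items() if v <= acceptable_threshold}
    return unacceptable, acceptable

bia = {"claims processing": 2_500_000, "cafeteria menu app": 5_000}
must_cover, may_skip = categorize_losses(bia, acceptable_threshold=50_000)
print(sorted(must_cover))  # ['claims processing']
print(sorted(may_skip))    # ['cafeteria menu app']
```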
Assessing the recovery probability of a company’s business continuity/disaster recovery program gives a far more complete picture of the effectiveness of those programs than testing alone, and a clear picture of the structure of disasters helps to integrate all the pieces. While it is relatively straightforward to build a business continuity plan and a disaster recovery plan, doing so is expensive, and the time and planning efforts have a direct impact on worker productivity. The trick is to build just what is needed according to the risks involved and the potential losses.
By breaking the process out into key components and assessing each one, a business can clearly assess its probability of recovery, and hence its risk exposure with respect to disaster recovery and business continuity.
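One way to combine the component assessments is to treat the boxes as a chain of hurdles, each of which must be cleared. This is a minimal sketch under the simplifying assumption that the boxes are independent; the box names and weighted probabilities are invented for illustration.

```python
# Hedged sketch: combining per-box success probabilities (weighted by the
# mitigation tactics in place) into one overall recovery probability.
# Assumes independent boxes; all values here are illustrative.
def overall_recovery_probability(box_probs):
    """Multiply the per-box probabilities along the critical path."""
    p = 1.0
    for prob in box_probs.values():
        p *= prob
    return p

weighted = {
    "crisis handling": 0.90,
    "emergency response": 0.80,
    "disaster recovery": 0.85,
    "remediation": 0.95,
}
print(round(overall_recovery_probability(weighted), 3))  # 0.581
```

Note that even with each box individually strong, the product falls well below any single box, echoing both the weakest-link principle and the complexity penalty discussed earlier.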
Henry Kalt is the director of business continuity and disaster recovery for Oxford Health Plans Inc. He has published articles in a variety of areas including hermeneutics and psychoneuroimmunology. He can be reached at email@example.com.