BC Analysis Using Computer Simulation
- Published on November 19, 2007
Computer-based simulation can leverage the speed and efficiency of today’s computing technology to iterate through numerous modeled scenarios that simulate recovery operations, policies and strategies. It provides a low-cost way to dynamically simulate business recovery scenarios before incurring the expense of implementing a physical recovery site operation. In fact, the results can even help identify the type of recovery site or operation required. Simulation can also be used to gain valuable insight into such questions as:
- What are the continuity requirements of a new recovery operation?
- Are proposed business recovery requirements such as recovery time objective (RTO) and recovery point objective (RPO) achievable given the current operation?
- What are the implications of a new recovery technology (e.g. rapid spanning tree, mirroring) on a current recovery operation?
- What degree of operational flexibility can be tolerated within a current recovery operation?
Proactive business interruption simulation is a modeling technique that can be used to characterize failure and recovery scenarios in response to different types of adverse conditions or events. It can be applied to almost any level of an IT infrastructure – from an entire enterprise down to a single system or component level. When it comes to recovery, time is of the essence. For this reason, this analysis involves the creation of an event graph that characterizes the different states of recovery operation and computes the time to transition between states, based on observed or assumed probability distributions. The graph is then evaluated as a numerical model.
This type of simulation model is often referred to as a discrete-event simulation model, since the state of the recovery operation can change instantaneously at distinct points in time. These changes are described by probability distributions that can be either discrete or continuous in nature. Although the probability distributions of different events within a recovery operation can be estimated from historical data, such data is most often unavailable. Instead, well-known probability distributions are assumed for different types of operations. Users can also specify their own distributions based on experience, educated guesswork or historical data, if available, in lieu of the default distributions.
Two general types of simulation models are involved: micro-simulation and macro-simulation. In micro-simulation, recovery operations are modeled as a single-server queuing system. An outage event is repeated many times and the recovery operations in response are recorded. By repeating trial outages and observing key recovery statistics, inferences can be made regarding the efficacy of a recovery operation. This type of simulation is known as Monte Carlo simulation, whereby samples are repeatedly drawn from known probability distributions for the parameters in the recovery model and the resulting output measures are recomputed with each trial.
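The Monte Carlo idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's tool: the three recovery tasks, their exponential distributions and the mean times in minutes are all assumptions chosen for the example.

```python
import random
import statistics

def simulate_outage(rng, mean_detect=2.0, mean_failover=5.0, mean_restore=60.0):
    """Return the total recovery time (minutes) for one simulated outage.

    Each task time is sampled from an exponential distribution. The task
    list and the mean values are illustrative assumptions only.
    """
    detect = rng.expovariate(1.0 / mean_detect)
    failover = rng.expovariate(1.0 / mean_failover)
    restore = rng.expovariate(1.0 / mean_restore)
    return detect + failover + restore

def run_trials(n_trials=100, seed=1):
    """Repeat the trial outage n_trials times and collect the recovery times."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    return [simulate_outage(rng) for _ in range(n_trials)]

if __name__ == "__main__":
    times = run_trials()
    print(f"mean recovery time: {statistics.mean(times):.1f} min")
    print(f"longest recovery observed: {max(times):.1f} min")
```

Each call to `run_trials` plays the role of one batch of trial outages. The spread of the returned values, rather than any single value, is what supports inferences about the recovery operation.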
Such simulation can be used to analyze disaster recovery (DR) test scenarios. DR tests are often executed using actual IT systems and personnel to assess whether recovery objectives stipulated in a DR plan can be satisfied. However, exercising such tests with actual systems and personnel several hundred times is rarely feasible and can be quite expensive, let alone disruptive to operations. On the other hand, simulation analysis uses a mathematical representation of the recovery such that simulated outages are replicated on the computer, several hundred times if necessary, and pertinent recovery statistics are produced.
Macro-simulation incorporates many of the same features of micro-simulation, but with additional information regarding outage probability. Unlike the single-queue model used in micro-simulation, macro-simulation models an IT environment as a multi-server queuing system. In this case, many parallel recovery operations in response to multiple simultaneous outages can be characterized. Such simulations can be used to determine the effect of improved recovery operations in a large IT environment based on the likelihood of outages. Recovery in response to widespread rolling outages can thus be simulated and analyzed. In this case, simulated outage events are generated based on historically observed or assumed probability distributions.
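As a rough sketch of the multi-server view, the following Python models outages that arrive at random and are handled first-come, first-served by a fixed number of parallel recovery crews. The function name `simulate_macro`, the exponential inter-arrival and repair times, and all default values are assumptions made for illustration.

```python
import heapq
import random

def simulate_macro(n_outages=200, n_crews=2, mean_gap=48.0, mean_repair=36.0, seed=7):
    """Simulate outages served by n_crews parallel recovery crews.

    Returns a list of (wait, downtime) pairs in hours, where wait is the time
    an outage queues for a free crew and downtime is wait plus repair time.
    Distributions and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    crew_free_at = [0.0] * n_crews   # min-heap of times at which each crew is next free
    heapq.heapify(crew_free_at)
    clock = 0.0
    results = []
    for _ in range(n_outages):
        clock += rng.expovariate(1.0 / mean_gap)    # time of the next outage
        earliest = heapq.heappop(crew_free_at)      # crew that frees up first
        start = max(clock, earliest)                # recovery cannot start before the outage
        repair = rng.expovariate(1.0 / mean_repair)
        heapq.heappush(crew_free_at, start + repair)
        results.append((start - clock, start + repair - clock))
    return results

if __name__ == "__main__":
    for crews in (1, 2, 4):
        res = simulate_macro(n_crews=crews)
        mean_downtime = sum(d for _, d in res) / len(res)
        print(f"{crews} crew(s): mean downtime {mean_downtime:.1f} h")
```

Re-running with a different `n_crews` or `mean_repair` is one way to explore the effect of an improved recovery operation under the same assumed outage likelihood.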
Observed data characterizing mean time between service outages can be a valuable input to identifying outage probability distributions. However, such data may only capture threats that have been known to historically plague an IT operation. This is why a detailed business impact analysis (BIA) can be an important precursor to both micro-simulation and macro-simulation since it can unveil those threats, either known or unknown, that are likely to expose an enterprise to the highest degree of risk. With this knowledge in hand, such threats can then be incorporated into the proactive business interruption simulation framework to test resulting recovery impacts. Figure 1 illustrates how simulation methodology fits into the overall business continuity planning process.
Let’s demonstrate the power of using computer simulation for evaluating business continuity recovery operations through a simple case study. In this scenario, a micro-simulation model was created to evaluate recovery site operations for an e-commerce data center which processes transactions from users over the Internet. The data center uses a recovery service provider to provide a backup data center in the event service is disrupted inside the main data center. Figure 2 illustrates the arrangement.
Service requests and responses are generated between the primary data center and users over the Internet. The recovery site is positioned as a warm site, implying that data transfers are mirrored to the site asynchronously over the wide area network (WAN) at periodic snapshot intervals (SIs). The warm site is kept abreast of the primary data center’s operational state through the use of heartbeat links provisioned over the WAN. Keep-alive packets are periodically issued from the primary data center to the warm site over these links. Servers at both the primary data center and the warm site are arranged as a wide area cluster, such that failover would occur in the event loss of heartbeat is detected. Concurrent with failover, global load balancers situated at the Internet service provider locations would gradually redirect Web queries to the warm site.
Because of the warm site arrangement, detection and failover are not instantaneous and involve a delayed reaction depending on the status of application processing and data mirroring. Although the detection and failover time windows can vary somewhat, the time thresholds for these tasks can be reasonably identified by virtue of the technologies being utilized. However, the wild card in this arrangement is the mean time to restore (MTTR), which can vary widely depending on the precise nature of the outage. Hardware failures, application hangs, viruses, intrusion detections and environmental disasters (e.g. fires, floods) each imply varying degrees of recovery operation. Here, historical data or knowledge on repair and restoration times could provide valuable insight into the variability of these tasks.
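One simple way to represent this variability in restoration time is a mixture: first sample the kind of outage, then sample a restore time from a distribution specific to that kind. The sketch below does this in Python. The outage mix, the relative likelihoods and the mean restore times are hypothetical values, not data from the case study.

```python
import random

# Hypothetical outage mix: type -> (relative likelihood, mean restore time in minutes).
OUTAGE_TYPES = {
    "hardware_failure": (0.40, 90.0),
    "application_hang": (0.35, 20.0),
    "virus_or_intrusion": (0.20, 240.0),
    "environmental": (0.05, 1440.0),
}

def sample_restore_time(rng, outage_types=OUTAGE_TYPES):
    """Pick an outage type by its likelihood, then sample a restore time for it."""
    names = list(outage_types)
    weights = [outage_types[name][0] for name in names]
    kind = rng.choices(names, weights=weights, k=1)[0]
    mean_restore = outage_types[kind][1]
    # Exponential restore times are an assumption; historical data could replace them.
    return kind, rng.expovariate(1.0 / mean_restore)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        kind, minutes = sample_restore_time(rng)
        print(f"{kind}: restored after {minutes:.0f} min")
```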
The above scenario is characterized using an event graph model that encompasses each of the recovery tasks and states. Inputs to the model represent mean times rather than absolute times for each of the recovery tasks. Probability distributions for these can be specified as well. The simulation then evaluates the behavioral response of the recovery operation to an outage, taking into account the degree of uncertainty in each recovery task and, consequently, the variation in the overall operation. A run of 100 recovery scenarios can easily be repeated if desired to evaluate whether recovery objectives can be met, given real-world variation and uncertainties in recovery.
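An event graph of this kind can be represented as a small table of states and timed transitions. The Python sketch below is one possible encoding under stated assumptions: the state names, the strictly sequential ordering of tasks, and the triangular and exponential parameters (in minutes) are invented for illustration and are not taken from the case study model.

```python
import random

# Hypothetical event graph: state -> (next state, sampler for the transition time in minutes).
# rng.triangular takes (low, high, mode).
EVENT_GRAPH = {
    "outage": ("detected", lambda rng: rng.triangular(0.5, 3.0, 1.0)),
    "detected": ("failed_over", lambda rng: rng.triangular(2.0, 10.0, 4.0)),
    "failed_over": ("service_resumed", lambda rng: rng.triangular(1.0, 8.0, 3.0)),
    "service_resumed": ("primary_restored", lambda rng: rng.expovariate(1.0 / 120.0)),
}

def run_trial(rng, graph=EVENT_GRAPH, start="outage"):
    """Walk the graph once; return {state: elapsed minutes when that state is reached}."""
    reached = {start: 0.0}
    state, clock = start, 0.0
    while state in graph:
        next_state, sample = graph[state]
        clock += sample(rng)
        reached[next_state] = clock
        state = next_state
    return reached

def event_log(n_trials=100, seed=42):
    """Produce one log entry per simulated trial outage."""
    rng = random.Random(seed)
    return [run_trial(rng) for _ in range(n_trials)]

if __name__ == "__main__":
    for entry in event_log()[:3]:
        print({state: round(minutes, 1) for state, minutes in entry.items()})
```

Swapping a sampler in `EVENT_GRAPH` is how a user-specified distribution would replace a default one in this sketch.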
Figure 3 shows an example of the type of output that can be produced by the simulation. Simulation runs of 100 trial outages can be itemized using a detailed event log from which summary distributions can be derived. Each entry in the log summarizes statistics for one simulated trial outage and the corresponding recovery operation. To truly characterize the variability in recovery, many simulated trial outages are required. Applying the Monte Carlo principle, a random value is drawn from each specified probability distribution in every trial, yielding a time statistic for each of the events in the recovery graph model. Repeating this a hundred times or so produces the log entries and, consequently, the recovery distributions (note that Figure 3 shows only the first 15 entries).
The simulation then summarizes the data into frequency distributions. Two distributions can be developed showing the expected recovery time (ERT) that could be realized by the simulated recovery environment: one for service ERT (SERT) and one for operational ERT (OERT). SERT is the observed recovery time for service continuation, which in this case means the resumption of Web transactions. SERT captures variability in service interruption from a user perspective and can be compared with a service RTO, whereas OERT characterizes recovery of the overall operation and can be compared with a target operational RTO. This distinction is made for an important reason. Although warm site failover mechanisms might enable reasonable continuation of service, full recovery is not achieved until primary site operation is fully restored. Thus, it is imperative to distinguish between service and operational recovery to better characterize and analyze recovery viability.
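Summarizing simulated recovery times into a frequency distribution and comparing them with an RTO takes very little code. The helper names `frequency_distribution` and `fraction_exceeding`, the sample SERT values and the 10-minute service RTO below are all hypothetical and serve only to show the mechanics.

```python
from collections import Counter

def frequency_distribution(times, bin_width):
    """Count observations per bin; each key is the lower edge of a bin."""
    counts = Counter(int(t // bin_width) * bin_width for t in times)
    return dict(sorted(counts.items()))

def fraction_exceeding(times, objective):
    """Share of trials whose recovery time is strictly greater than the objective."""
    return sum(t > objective for t in times) / len(times)

if __name__ == "__main__":
    # Made-up SERT observations in minutes, standing in for simulation output.
    sert = [6.2, 7.9, 8.4, 9.1, 9.8, 10.6, 11.3, 7.2, 8.8, 9.5]
    service_rto = 10.0  # hypothetical service RTO in minutes
    print(frequency_distribution(sert, bin_width=2))
    print(f"share of trials over RTO: {fraction_exceeding(sert, service_rto):.0%}")
```

The same two helpers would apply unchanged to OERT values and an operational RTO.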
Case Study Results
On the surface, it seemed that recovery operations for the modeled scenario were well within the stipulated recovery objectives. Based on the maximum and minimum time inputs for various model parameters such as the MTTR, time to data, and times to detect, fail over and resume from an adverse event, one would believe that the service and operational RTO objectives can be reasonably met. Furthermore, at first glance it also seemed that a four-hour RPO was reasonable given the SI of the mirroring technology in use.
Yet, dynamic simulation of this recovery operation reveals the potential for violating these performance envelope thresholds. A significant number of runs show the observed SERT exceeding the service RTO by one to two minutes. Although this may not seem significant, it could have implications with respect to satisfying service criteria within customer service level agreements (SLAs). On the other hand, the OERT stays well within the operational RTO, indicating that the overall recovery operation as modeled is satisfactory within stated recovery objectives.
However, one area of concern is that the observed data loss expectancy (DLE) can exceed the RPO, and consequently the SI, up to and even slightly beyond 10 percent of the time. Although snapshot intervals are violated only a small percentage of the time, this could indicate the potential for mirroring corrupted data unless mirroring is halted and data is frozen following the outage. These results thus signal the need to revisit the mirroring technology in use, and perhaps to review the stipulated RPO.
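To see how data loss can exceed the snapshot interval at all, consider a deliberately simplified model, which is an assumption of this sketch and not the case study's model. An outage strikes at a uniformly random moment after the latest snapshot begins. If that snapshot has not finished transferring to the warm site, recovery falls back to the previous snapshot, one full interval older. The transfer-time distribution and all numeric values are illustrative.

```python
import random

def simulate_dle(n_trials=1000, snapshot_interval=240.0, mean_transfer=20.0, seed=3):
    """Return simulated data loss (minutes) for n_trials outages.

    Simplifying assumption: at most one snapshot is in flight, so the loss is
    either the time since the latest snapshot or that time plus one interval.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_trials):
        since_snapshot = rng.uniform(0.0, snapshot_interval)
        transfer = rng.expovariate(1.0 / mean_transfer)
        if since_snapshot < transfer:
            # Latest snapshot was still in transit; fall back to the previous one.
            losses.append(since_snapshot + snapshot_interval)
        else:
            losses.append(since_snapshot)
    return losses

def fraction_over_rpo(losses, rpo):
    """Share of trials in which data loss is strictly greater than the RPO."""
    return sum(x > rpo for x in losses) / len(losses)

if __name__ == "__main__":
    losses = simulate_dle()
    print(f"share of trials over a 240-minute RPO: {fraction_over_rpo(losses, 240.0):.1%}")
```

In this sketch the violation rate is driven by how long a snapshot takes to reach the warm site relative to the interval, which is consistent with the text's suggestion to revisit the mirroring technology.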
Although exercising mock outages with live physical systems can help smooth out the wrinkles in how recovery procedures are executed by both systems and staff, traditionally such demonstrations can only be executed several times annually since they can be both expensive and disruptive to normal operations. Using computer-based simulation in conjunction with live testing can help fill in the gaps when providing answers regarding the adequacy of a recovery operation.
It can also be used to perform what-if analysis, such as evaluating the impacts of a technological or procedural improvement in recovery operations. In this fashion, it can be used to test the efficacy of proposed remedial measures and recovery procedures for both BIA and business continuity plans. This capability also makes computer simulation an excellent tool for business continuity training. Furthermore, extending the methodology to a Web-based platform makes the analysis accessible to many other business continuity stakeholders within the enterprise.
Matthew Liotine, Ph.D., is vice president of BLR Research, a firm specializing in business continuity analytics and software modeling. He is the author of Mission-Critical Network Planning (Artech House Publishing) on business continuity. Liotine is also an adjunct faculty member of the University of Illinois Graduate School of Business. He can be reached at firstname.lastname@example.org.