BC Analysis Using Computer Simulation
By MATTHEW LIOTINE, Ph.D.

Simulation modeling and analysis have long been used to analyze complex systems and processes. Once a staple for analyzing processes that require fail-safe operation, such as those found in military, aerospace, telecom and nuclear applications, computer simulation has found a new home in analyzing business operations and, in particular, business continuity operations. Those familiar with business continuity will recognize that recovery operations pertaining to information technology (IT) infrastructure can be quite complex. Computer simulation is therefore well suited to analyzing such operations before they are physically implemented.

Because of the random nature of outages, the analysis of recovery operations lends itself more to stochastic techniques than to deterministic ones. Although deterministic analytical techniques have been widely used to estimate recovery operations, they typically present a snapshot view of a perfect world. Since we live in an imperfect world, however, these techniques fail to capture variation in how recovery operations are executed, and that variation can leave recovery objectives unmet. Even the analytical solution of complex systems and processes using state-based techniques such as Markov analysis can often be mathematically complex.
Proactive business interruption simulation is a modeling technique that can be used to characterize failure and recovery scenarios in response to different types of adverse conditions or events. It can be applied to almost any level of an IT infrastructure, from an entire enterprise down to a single system or component.

When it comes to recovery, time is of the essence. For this reason, the analysis involves creating an event graph that characterizes the different states of the recovery operation and computes the time to transition between states, based on observed or assumed probability distributions. The graph is then evaluated as a numerical model. This type of simulation model is often referred to as a discrete simulation model, since the states of the recovery operation can change instantaneously at different points in time. These changes are described by probability distributions that can be either discrete or continuous in nature.

Although knowledge of the probability distribution of different events within a recovery operation can be obtained from historical data, most often such data is unavailable. Instead, some well-known probability distributions are assumed for different types of operations. Users can also specify their own distributions, based on experience, educated guesses or historical data where available, in lieu of the default distributions.

Two general types of simulation models are involved: micro-simulation and macro-simulation. In micro-simulation, recovery operations are modeled as a single-server queuing system. An outage event is repeated many times and the recovery operations in response are recorded. By repeating trial outages and observing key recovery statistics, inferences can be made regarding the efficacy of a recovery operation.
This type of simulation is known as Monte Carlo simulation, whereby samples are repeatedly drawn from known probability distributions for the parameters in the recovery model and the resulting output measures are recomputed with each trial. Such simulation can be used to analyze disaster recovery (DR) test scenarios. DR tests are often executed using actual IT systems and personnel to assess whether the recovery objectives stipulated in a DR plan can be satisfied. However, exercising such tests with actual systems and personnel several hundred times is rarely feasible and can be quite expensive, not to mention disruptive to operations. Simulation analysis, by contrast, uses a mathematical representation of the recovery operation, so that simulated outages are replicated on the computer, several hundred times if necessary, and pertinent recovery statistics are produced.

Macro-simulation incorporates many of the same features as micro-simulation, but with additional information regarding outage probability. Unlike the single-queue model used in micro-simulation, macro-simulation models an IT environment as a multi-server queuing system. In this case, many parallel recovery operations in response to multiple simultaneous outages can be characterized. Such simulations can be used to determine the effect of improved recovery operations in a large IT environment based on the likelihood of outages. Recovery in response to widespread rolling outages can thus be simulated and analyzed. In this case, simulated outage events are generated based on historically observed or assumed probability distributions. Observed data characterizing mean time between service outages can be a valuable input for identifying outage probability distributions. However, such data may only capture threats that have historically been known to plague an IT operation.
This is why a detailed business impact analysis (BIA) can be an important precursor to both micro-simulation and macro-simulation since it can unveil those threats, either known or unknown, that are likely to expose an enterprise to the highest degree of risk. With this knowledge in hand, such threats can then be incorporated into the proactive business interruption simulation framework to test resulting recovery impacts. Figure 1 illustrates how simulation methodology fits into the overall business continuity planning process.
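
The macro-simulation idea described above, outages arriving at random and being handled by several parallel recovery operations, can be sketched as a small multi-server queue. This is only an illustration under stated assumptions, not the author's model: the function name `simulate_macro`, the exponential distributions for time between outages and for restoration time, and all numeric values are hypothetical.

```python
import heapq
import random

def simulate_macro(num_teams, mean_time_between_outages, mean_restore_time,
                   num_outages, seed=0):
    """Simulate outages handled first-come, first-served by parallel recovery teams.

    Returns one (wait_time, total_downtime) tuple per simulated outage.
    """
    rng = random.Random(seed)
    # Min-heap of the times at which each recovery team next becomes free.
    team_free_at = [0.0] * num_teams
    heapq.heapify(team_free_at)

    clock = 0.0
    results = []
    for _ in range(num_outages):
        # Assumption: time between outages is exponentially distributed.
        clock += rng.expovariate(1.0 / mean_time_between_outages)
        # The earliest-available team takes the outage.
        free_at = heapq.heappop(team_free_at)
        start = max(clock, free_at)
        # Assumption: restoration time is exponentially distributed as well.
        finish = start + rng.expovariate(1.0 / mean_restore_time)
        heapq.heappush(team_free_at, finish)
        results.append((start - clock, finish - clock))
    return results

# Hypothetical comparison: one recovery team versus three, same outage stream.
for teams in (1, 3):
    runs = simulate_macro(teams, mean_time_between_outages=10.0,
                          mean_restore_time=8.0, num_outages=500)
    mean_downtime = sum(downtime for _, downtime in runs) / len(runs)
    print(f"{teams} team(s): mean downtime per outage = {mean_downtime:.1f}")
```

Because both runs use the same seed, they see the same outages and restoration times, so any difference in downtime comes only from the number of parallel recovery operations.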
Case Study
Service requests and responses are generated between the primary data center and users over the Internet. The recovery site is positioned as a warm site, implying that data transfers are mirrored to the site asynchronously over the wide area network (WAN) at periodic snapshot intervals (SIs). The warm site is kept abreast of the primary data center's operational state through the use of heartbeat links provisioned over the WAN. Keep-alive packets are periodically issued from the primary data center to the warm site over these links. Servers at both the primary data center and the warm site are arranged as a wide area cluster, such that failover would occur in the event that loss of heartbeat is detected. Concurrent with failover, global load balancers situated at the Internet service provider locations would gradually redirect Web queries to the warm site.

Because of the warm site arrangement, detection and failover are not instantaneous and involve a delayed reaction depending on the status of application processing and data mirroring. Although the detection and failover time windows can vary somewhat, the time thresholds for these tasks can be reasonably identified by virtue of the technologies being utilized. However, the wild card in this arrangement is the mean time to restore (MTTR), which can vary widely depending on the precise nature of the outage. Hardware failures, application hangs, viruses, intrusion detections and environmental disasters (e.g., fires and floods) each imply varying degrees of recovery operation. Here, historical data or knowledge of repair and restoration times could provide valuable insight into the variability of these tasks.

The above scenario is characterized using an event graph model that encompasses each of the recovery tasks and states. Inputs to the model represent mean times rather than absolute times for each of the recovery tasks. Probability distributions for these can be specified as well.
The simulation then evaluates the behavioral response of the recovery operation to an outage, taking into account the degree of uncertainty in each recovery task and, consequently, the variation in the overall operation. A run of 100 recovery scenarios can easily be repeated if desired to evaluate whether recovery objectives can be met, given real-world variation and uncertainties in recovery.

Figure 3 shows an example of the type of output that can be produced by the simulation. Simulation runs of 100 trial outages can be itemized using a detailed event log from which summary distributions can be derived. Each entry in the log summarizes statistics for one simulated trial outage and the corresponding recovery operation. To truly characterize the variability in recovery, many simulated trial outages are required. Applying the Monte Carlo principle, random probabilities are entered into the specified probability distributions for each trial, and a corresponding time statistic is obtained for each of the events in the recovery graph model. Repeating this a hundred times or so produces the log entries and, consequently, the recovery distributions (Note: Figure 3 shows only the first 15 entries).
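
The trial procedure described above, in which a random probability is fed into each task's distribution to obtain a time for each event, can be sketched with inverse-CDF sampling. This is a minimal illustration, not the simulation tool used in the case study: the three-state chain, the choice of uniform and exponential distributions, the helper names and every numeric value (in minutes) are assumptions made for the example.

```python
import math
import random

# Hypothetical inverse CDFs: each maps a probability u in [0, 1) to a time in minutes.
def uniform_inv(low, high):
    return lambda u: low + u * (high - low)

def exponential_inv(mean):
    return lambda u: -mean * math.log(1.0 - u)

# Assumed event graph: a simple chain of recovery states, each transition
# carrying its own probability distribution.
TRANSITIONS = [
    ("outage", "detected", uniform_inv(0.5, 1.5)),
    ("detected", "failed_over", uniform_inv(2.0, 4.0)),
    ("failed_over", "primary_restored", exponential_inv(45.0)),
]

def run_trials(num_trials=100, seed=1):
    """Return an event log with one entry per simulated trial outage."""
    rng = random.Random(seed)
    log = []
    for trial in range(1, num_trials + 1):
        elapsed = 0.0
        entry = {"trial": trial}
        for _source, target, inverse_cdf in TRANSITIONS:
            u = rng.random()  # random probability in [0, 1)
            elapsed += inverse_cdf(u)
            entry[target] = elapsed  # minutes after the outage when this state is reached
        log.append(entry)
    return log

# Print the first three log entries of a 100-trial run.
for entry in run_trials()[:3]:
    print(entry)
```

In this sketch the time at which `failed_over` is reached plays the role of a service recovery time and the time at which `primary_restored` is reached plays the role of an operational recovery time; that mapping is an assumption for illustration.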
The simulation then summarizes the data into frequency distributions. Two distributions can be developed showing the expected recovery time (ERT) that could be realized by the simulated recovery environment: one for service ERT (SERT) and one for operational ERT (OERT). SERT is the observed recovery time for service continuation, which in this case means the resumption of Web transactions. SERT captures variability in service interruption from a user perspective and can be compared with a service recovery time objective (RTO), whereas OERT characterizes recovery of the overall operation and can be compared with a target operational RTO. This distinction is made for an important reason. Although warm site failover mechanisms might enable reasonable continuation of service, full recovery is not achieved until primary site operation is fully restored. Thus, it is imperative to distinguish between service and operational recovery to better characterize and analyze recovery viability.

Case Study Results
Yet, dynamic simulation of this recovery operation reveals the potential for violating these performance envelope thresholds. A significant number of runs show the observed SERT exceeding the service RTO by one to two minutes. Although this may not seem significant, it could have implications for satisfying service criteria within customer service level agreements (SLAs). On the other hand, the OERT stays well within the operational RTO, indicating that the overall recovery operation as modeled is satisfactory within stated recovery objectives. However, one area of concern is that the observed data loss expectancy (DLE) can exceed the recovery point objective (RPO), and consequently the SI, up to and even slightly beyond 10 percent of the time. Although the SI is violated only a small percentage of the time, this could indicate the potential for mirroring corrupted data unless mirroring is halted and data is frozen following the outage.
These results thus signal the need for revisiting the mirroring technology in use, and perhaps reviewing the stipulated RPO.

Conclusion
Computer simulation can also be used to perform what-if analysis, such as evaluating the impacts of a technological or procedural improvement in recovery operations. In this fashion, it can be used to test the efficacy of proposed remedial measures and recovery procedures for both BIA and business continuity plans. This capability also makes computer simulation an excellent tool for business continuity training. Furthermore, extending the methodology to a Web-based platform makes the analysis accessible to many other business continuity stakeholders within the enterprise.
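
The what-if analysis mentioned above can be sketched by running the same simulated outages under a baseline and an improved recovery procedure and comparing how often the service RTO is exceeded. This is an illustrative sketch only: the function name `sert_exceedance`, the exponential distributions, and the mean times and RTO value (in minutes) are all hypothetical.

```python
import random

def sert_exceedance(mean_detect, mean_failover, service_rto,
                    num_trials=1000, seed=7):
    """Estimate the fraction of simulated outages whose SERT exceeds the service RTO."""
    rng = random.Random(seed)
    exceeded = 0
    for _ in range(num_trials):
        # Assumption: detection and failover times are exponentially distributed,
        # and service recovery time is their sum.
        sert = (rng.expovariate(1.0 / mean_detect)
                + rng.expovariate(1.0 / mean_failover))
        if sert > service_rto:
            exceeded += 1
    return exceeded / num_trials

# Hypothetical what-if: a procedural improvement cuts mean failover time
# from 3 minutes to 2 minutes.
baseline = sert_exceedance(mean_detect=1.0, mean_failover=3.0, service_rto=8.0)
improved = sert_exceedance(mean_detect=1.0, mean_failover=2.0, service_rto=8.0)
print(f"baseline: {baseline:.1%} of trials exceed the service RTO")
print(f"improved: {improved:.1%} of trials exceed the service RTO")
```

Using the same seed for both scenarios means each trial draws the same random probabilities, so the comparison reflects the change in the recovery procedure rather than sampling noise.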
©Copyright 2005 Systems Support Inc. All rights reserved. Reproduction in whole or in part in any form or medium without the express written permission of System Support Inc. is prohibited.