A lot of money and time is spent testing emergency management and disaster recovery (DR) plans. But how well does the test data indicate the ability to recover from an actual event? Is the test designed to determine which procedures do not work as planned and to expose gaps that can slow or stop the recovery and resumption process? Is test data analyzed to explain the practical realities of responding to a real event? It should be, and doing so is not difficult. This article describes one simple method to predict recovery and business resumption response based on historical test data.
Predicting disaster response can never be an exact science. But by including extrapolation and correlation analysis, test results can provide visibility into what is likely to happen in a real event. Collecting the right data during a test makes extrapolation possible and makes the test results more valuable.
To project recovery times onto a real event, calculations are performed on historical production and test data that represents each sequential step of the business resumption process (critical path). Precise prediction down to the minute or even hour is not practical. It’s better to use the extrapolation tools to get a sense of how events are likely to play out, so that decision makers can follow a consistent road map when a real event occurs.
First, assume that data transmission speed is constant. Then, select a process thread to model, and if it is not already mapped, use a simple flow diagram to show the recovery sequence at a high level. It is advisable to avoid getting into too much technical detail at this point; the correlation models will deal with technical detail and interrelationships. However, it is important to note any large databases, since they dominate backup and restore time.
Next, refer to recent historical backup performance. It is important to emphasize "recent." Data is continually changing, so old legacy data may not be relevant.
Recovery time has a strong relationship to backup time. Magnetic tape is physical media, and there is variation in the mechanical tape storage process. The tape itself is contained within a plastic cartridge assembly that is inserted into the tape drive and mechanically locked in place for the writing process. Data is written from production storage to tape through a write head in the tape drive. Backup time is limited by the speed of the tape-writing process, which is correlated to mechanical drive speed and to disk and network activity at the time the backup is taken. Some noise is present in every component, and it is difficult to capture the exact effect of drive, disk, and network activity. To simplify, assume that together these factors reduce effective throughput by 10 percent. For modeling purposes this is called the noise factor. In many cases backup throughput is increased by using multiple tape drives in an automated tape library. To model multi-channel streaming, throughput is multiplied by the number of drives, and a drive ratio is factored in to account for the overhead required to manage multiple write heads. The model offered for automated backup time becomes:
Backup Time (Bt) = database size / (factory throughput x Ndrives x (drive ratio - noise factor))
NOTE: For an LTO-2 model drive, the factory throughput is listed at 108 GB/hour.
The backup time for a 4,000 GB OLTP database stored on 10 tapes using five tape drives would be calculated as Bt = 4,000 GB / (108 GB/hr x 5 drives x 0.80), or Bt = about 9 1/4 hours. Backup timings can be captured and averaged to provide better accuracy, but this rule of thumb should suffice for planning.
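The worked example above can be sketched in a few lines of code. The split of the 0.80 multiplier into a 0.90 drive ratio and the article's 0.10 noise factor is an assumption made for illustration; the article only gives the combined figure.

```python
# Backup-time model: Bt = database size / (throughput x Ndrives x (drive ratio - noise))
# Drive ratio of 0.90 is an assumed value; combined with the 0.10 noise
# factor it reproduces the 0.80 multiplier used in the article's example.

def backup_time_hours(db_size_gb, throughput_gb_hr, n_drives,
                      drive_ratio=0.90, noise_factor=0.10):
    effective_throughput = throughput_gb_hr * n_drives * (drive_ratio - noise_factor)
    return db_size_gb / effective_throughput

# Article's example: 4,000 GB database, five LTO-2 drives at 108 GB/hr
bt = backup_time_hours(4000, 108, 5)
print(f"Estimated backup time: {bt:.2f} hours")  # about 9.26 hours
```

Averaged timings from real backup logs can replace the factory throughput figure to tighten the estimate.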
Restore and recovery time is more complex. Tape-based recovery requires time to recover the operating system, restore and configure the media server, and establish reliable communication with the tape equipment. This setup time has to be accounted for to accurately predict recovery time. If actual setup time (St) was collected from the last DR test, the equation for predicting recovery time becomes:
Recovery Time (RT) = St + sum of Bti (where i = 1 to n, the number of systems recovered)
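The recovery-time equation can be sketched as follows. The setup time and per-system restore times are hypothetical figures, not values from the article.

```python
# RT = St + sum(Bt_i): media-server setup time plus the restore time of
# each of the n systems recovered. All input figures are illustrative.

def recovery_time_hours(setup_time_hr, restore_times_hr):
    return setup_time_hr + sum(restore_times_hr)

# Hypothetical: 3 hours to rebuild the media server, then three systems
# whose tape restores take 9.26, 2.5, and 4.0 hours respectively.
rt = recovery_time_hours(3.0, [9.26, 2.5, 4.0])
print(f"Predicted recovery time: {rt:.1f} hours")  # 18.8 hours
```

Feeding in Bt values measured during past tests, rather than modeled ones, makes the prediction a direct extrapolation of test history.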
It is important to note that the tape media server is a linchpin in the recovery process. There is a positive correlation between the tape media server setup time and total recovery time, shown on the chart below:
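The chart is not reproduced here, but the correlation it shows can be measured directly from historical test logs. The sample figures below are hypothetical; the Pearson coefficient is a standard way to quantify the relationship the article describes.

```python
# Pearson correlation between media-server setup time (St) and total
# recovery time (RT) across past DR tests. Sample data is hypothetical.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

setup_hours    = [2.0, 2.5, 3.0, 4.0, 5.0]       # St from five past tests
recovery_hours = [11.0, 12.0, 13.5, 15.0, 17.0]  # corresponding total RT

print(f"Correlation r = {pearson_r(setup_hours, recovery_hours):.2f}")
```

An r value near +1 confirms that trimming media-server setup time is one of the most direct ways to shorten total recovery.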
RT can now be used as the basis for extrapolating test results to a real disaster event. The first step is to list the simplifying assumptions:
Predicting what is likely to occur provides insight into risk implications and serves as a useful decision-making tool when a real event strikes. With this type of roadmap the emergency management team can provide guidance and updates to interested parties such as customers, suppliers, employees, and media on when the systems will be available again.
"Appeared in DRJ's Fall 2007 Issue"