Companies in the insurance industry employ actuaries to do sophisticated analysis of risk, focusing on the probabilities of certain events and calculating the proper ratios of likelihood and costs of those events. However, in the disaster recovery and business continuity fields, the same level of rigor and analysis isn’t often present. A simple explanation for this is that individual organizations don’t have their own large sample sets of data, such as probabilities of a particular system failing. And, in many cases, mitigation is perceived to be less expensive than precise actuarial analysis. In other words, precision doesn’t pay the same rewards to the field of business continuity that it does for the insurance business.
So why is diversity such a cost-effective way to mitigate risk? The simple answer is that diversity protects an organization from a single root problem that impacts both a primary and a backup system simultaneously. A great example of this is in the area of operating systems or platforms. A virus that infects a primary system can often infect its backup system as well. If the backup system ran on a different platform, however, the likelihood of this occurrence would be dramatically reduced, if not eliminated.
Most DR strategies already take this concept into account in important ways, such as by using multiple sites in diverse geographies to reduce the likelihood of a regional event impacting primary and secondary sites. But this concept may have broader implications than what companies are currently considering.
An interesting case study in this can be found in messaging systems like Microsoft Exchange, especially in the context of high availability solutions. Most approaches to providing high availability for Exchange systems utilize many of the same components as the primary system, such as running a secondary Exchange system on the same platform. And although Exchange running on a Microsoft platform may be a very reliable system, in the case of a corruption-type event, a secondary system built from the same components is much more likely to be hit by the same corruption that caused the failure on the primary system.
Because of this, some of the solutions in this space take a diverse approach to the secondary system, limiting the commonality of major components, including the operating system and the application hosting the messaging system. What’s especially interesting in this example is that even if the secondary system were dramatically less reliable on its own, it may still be much more reliable as a backup.
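To make this point concrete, here is a minimal sketch in Python of a simple common-cause failure model. The model itself and every number in it are illustrative assumptions, not measurements of Exchange or of any real product: `p_common` is the hypothetical chance that one root cause (such as a corruption event) takes down both systems, and the other two parameters are hypothetical chances of unrelated, independent failures.

```python
def joint_failure_probability(p_common, p_primary_indep, p_backup_indep):
    """Chance that primary and backup are both down in the same period.

    Either a common cause hits both systems at once, or there is no
    common cause and each system fails independently for its own reasons.
    """
    return p_common + (1 - p_common) * p_primary_indep * p_backup_indep


# Hypothetical same-platform backup: as reliable as the primary on its own,
# but exposed to the same root cause 1 percent of the time.
same_platform_pair = joint_failure_probability(
    p_common=0.01, p_primary_indep=0.01, p_backup_indep=0.01
)

# Hypothetical diverse backup: ten times less reliable on its own,
# but far less exposed to the primary's root cause.
diverse_pair = joint_failure_probability(
    p_common=0.001, p_primary_indep=0.01, p_backup_indep=0.10
)

print(f"same-platform pair fails together: {same_platform_pair:.4%}")
print(f"diverse pair fails together:       {diverse_pair:.4%}")
```

Under these assumed numbers, the pair with the weaker but diverse backup fails together roughly five times less often (about 0.2 percent versus about 1.0 percent), because the common-cause term dominates the result.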
Another great example is tape backup. Industry studies have consistently shown that a given tape backup is much less likely to be 100 percent successful than other IT applications are to run reliably. But because tape is almost completely diverse, it is rarely impacted by the same problems that cause a primary failure. So, again, despite the fact that it might be less reliable on the surface, as a backup it could be much more reliable because of its diversity. Tape backup offers diversity across many dimensions: it is a different media format, it is off-line versus on-line, it is usually stored in a different format and it can be stored in a different location.
An example where the lack of diversity reduces overall reliability is in replication. Most replication solutions replicate data from sources to targets that are running on the same platforms with the same applications. As a result, issues that affect the source device can also impact the target device at the same time. And, although replication technology has matured to the point where it is very reliable, any issue with the replication technology itself can impact both primary and backup systems. As a result, replication technology needs to be that much more reliable. Thought of another way, primary and backup systems that share components require those components to be dramatically more reliable to achieve the same result as the same systems using diverse components, because a failure in a shared component takes down both systems at once.
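The gap between shared and diverse components can be illustrated with a small calculation. The sketch below is a deliberately simplified model with assumed numbers: it ignores every failure mode except the component in question, treats diverse components as fully independent, and uses hypothetical function names. It asks how unreliable a component is allowed to be if the primary and backup together must stay under a target probability of failing at the same time.

```python
import math


def max_failure_prob_shared(target_joint):
    """A shared component takes down primary and backup together,
    so on its own it must already meet the joint target."""
    return target_joint


def max_failure_prob_diverse(target_joint):
    """With independent components, both must fail at once:
    joint = q * q, so each component may fail with q = sqrt(target)."""
    return math.sqrt(target_joint)


# Hypothetical targets for "both systems down at the same time".
for target in (1e-2, 1e-4, 1e-6):
    shared = max_failure_prob_shared(target)
    diverse = max_failure_prob_diverse(target)
    print(
        f"target {target:.0e}: shared component <= {shared:.0e}, "
        f"each diverse component <= {diverse:.0e}"
    )
```

In this simplified model, a one-in-a-million joint target requires a shared component that fails no more than one time in a million, while two independent components can each fail one time in a thousand. The stricter the target, the wider that gap becomes.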
From an economic perspective, an effective understanding of the concept of diversity and how it applies to redundancy can save companies money. Instead of using the same ultra-reliable system for backup, a company may be able to get a statistically more reliable backup using a secondary system that itself might be less reliable (think of Mean Time Between Failure ratings), but that doesn’t share the same dependencies and couldn’t be impacted by the same cause as the primary system failure.
With the benefits of diversity come a number of risks and costs to consider. Any time you introduce a new technology, process or person(s), there are cost and complexity considerations. These risks need to be weighed against the statistical advantages of diversity. In most cases, diversity is most useful in areas with especially high risk and/or probability of failure.
The diversity concept doesn’t extend just to technology or location. It also extends to people and processes. It is a good filter to use when making decisions around business continuity, whether considering systems, processes, or people. It is often the most cost effective way to achieve an acceptable level of recoverability across any type of failure or disaster.
Outsourcing of DR services is especially effective, precisely because of the diversity associated with outsourcing. In any number of circumstances where a recovery might be necessary, an outsourced solution dramatically increases the likelihood of success.
"Appeared in DRJ's Winter 2008 Issue"