When looking at the fault resilience of your data, it really is reasonable to segment those efforts into “data protection” and “data availability.” This builds on the earlier premise that without the data, no other efforts will bear fruit.
The first goal should always be to ensure that the data is protected. You must be able to guarantee the integrity of the data at a minimum. Any technology that cannot satisfy that should be immediately discarded. Past that, you need to try to appreciate the value of the data – either in terms of dollars per minute or dollars per megabyte. Another way to say this is to ask, “If one hour of my data was lost, how much would it cost me to rebuild it (if that is even possible)?”
A similar question regarding a lost GB is equally appropriate. This simple query can tell us the recovery point objective (RPO) or maximum latency that is required for your data protection strategy. For a few people, this number might be “one day lost is acceptable” – meaning tape backup is an acceptable solution. For a few others, this number might be “one million dollars per minute” (such as a stock trading application) – which would call for a synchronous hardware mirrored solution. Unfortunately, most businesses fall in between – where one day lost is not acceptable, while synchronous hardware is not cost-viable. We will come back to this problem in a moment.
A related consideration is “data availability.” The metric for this is the recovery time objective (RTO) or the maximum downtime that is acceptable. Unfortunately, the two solutions discussed earlier offer the same dichotomy. Tape-based solutions typically offer recovery within hours, while synchronous hardware might provide instant access to the data (as long as the server infrastructure is properly architected). But similar to the earlier discussion, most of us find ourselves in between these options.
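To make the RPO and RTO comparison concrete, here is a minimal Python sketch that picks the cheapest protection option meeting both objectives. The option names come from this discussion, but the figures and the `cost_rank` ordering are illustrative assumptions, not measurements of any product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    name: str
    rpo_seconds: float  # worst-case window of lost data
    rto_seconds: float  # worst-case time until data is accessible again
    cost_rank: int      # 1 = cheapest; a relative ordering only


# Hypothetical figures chosen only to illustrate the comparison.
OPTIONS = [
    Option("tape backup", 24 * 3600, 4 * 3600, 1),
    Option("software replication", 30, 300, 2),
    Option("synchronous mirrored hardware", 0, 1, 3),
]


def cheapest_option(max_rpo_seconds: float, max_rto_seconds: float) -> Optional[Option]:
    """Return the lowest-cost option that satisfies both objectives, or None."""
    candidates = [
        o for o in OPTIONS
        if o.rpo_seconds <= max_rpo_seconds and o.rto_seconds <= max_rto_seconds
    ]
    return min(candidates, key=lambda o: o.cost_rank) if candidates else None
```

With these assumed figures, a business that can tolerate an hour of lost data and an hour of downtime lands on neither tape nor synchronous hardware, which is the "in between" case described above.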
The Replication Alternative
This brings me to the second of my three earlier points: believe that there is at least one technology that you may be unaware of but that satisfies your goals in a cost-effective manner, perhaps with superior results. In many cases, one will need to look at the problem slightly differently or broaden one’s vision in order to see the answer.
In the scope of data protection and data availability, software-based replication tools are often the best answer. The most mature replication tools capture changes to files as they occur; specifically, they capture the actual bytes that change. The changed bytes are then transmitted to a target server as soon as bandwidth is available.
With managed latency and throughput control, the result is an asynchronous to near-synchronous solution, one that keeps the redundant data set somewhere between “current” and “seconds behind.” Because these tools are software-only, they do not require the same expensive hardware and typically cost a very small fraction of most synchronous solutions. This fills the gap discussed in our earlier RPO discussion: the data protection is far superior to “last night’s backup” and yet far more cost-effective than “mirrored hardware.”
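The byte-level idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it finds changed bytes by comparing two in-memory versions, whereas real replication products intercept writes as they happen, and it does not handle a file that shrinks. The function names and the sample data are hypothetical.

```python
from collections import deque


def changed_ranges(old: bytes, new: bytes):
    """Return (offset, data) pairs for each run of bytes in `new` that differs from `old`."""
    changes = []
    start = None
    for i in range(len(new)):
        differs = i >= len(old) or old[i] != new[i]
        if differs and start is None:
            start = i                      # a run of changed bytes begins
        elif not differs and start is not None:
            changes.append((start, new[start:i]))
            start = None                   # the run has ended
    if start is not None:
        changes.append((start, new[start:]))
    return changes


def apply_changes(target: bytearray, changes) -> None:
    """Write each changed run into the target copy, growing it if needed."""
    for offset, data in changes:
        end = offset + len(data)
        if end > len(target):
            target.extend(b"\x00" * (end - len(target)))
        target[offset:end] = data


# The queue stands in for changes waiting on available bandwidth.
source_v1 = b"balance=100;owner=alice"
source_v2 = b"balance=250;owner=alice"
target = bytearray(source_v1)
pending = deque(changed_ranges(source_v1, source_v2))
while pending:
    apply_changes(target, [pending.popleft()])
```

Only the two bytes that differ cross the wire here, not the whole record. Until the queue drains, the target is the "seconds behind" copy described above.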
Some of the replication tools available today also include failover technology. This allows a replication target server to stand in for a failed production resource. Failover typically occurs within seconds to minutes of the server outage, and is often configurable. Again, this fills the RTO gap between “hours of tape restore” and “millisecond redirect in the hardware.”
What About Clustering?
One more technology should be discussed at this point – clustering. As many people already know, clustering is based on two or more nodes with shared storage. A cluster software layer abstracts the services/functionality of the applications from the actual hardware. The result is that the application may run on or move between any of the nodes, with a great deal of transparency to the users. There are a few notable caveats with this approach (notwithstanding the simple fact that not all applications can be clustered).
One reality is that failover within a cluster is not instantaneous. The time an application takes to start and mount its data is the same, regardless of whether that application is on a stand-alone server, a replicated target or a clustered node. So the difference in RTO (or failover time) is actually only the amount of time necessary for the failover node (replicated or clustered) to discern that the active node has failed.
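The arithmetic behind that point is simple enough to write down. The sketch below assumes failure is detected by counting missed heartbeats, which is one common approach but not necessarily how any particular product works. All figures are hypothetical.

```python
def detection_seconds(heartbeat_interval: float, missed_beats_allowed: int) -> float:
    """Time for the standby node to conclude that the active node has failed."""
    return heartbeat_interval * missed_beats_allowed


def failover_seconds(heartbeat_interval: float, missed_beats_allowed: int,
                     app_start_seconds: float) -> float:
    """Total failover time: detection plus application start and data mount."""
    return detection_seconds(heartbeat_interval, missed_beats_allowed) + app_start_seconds


# Hypothetical figures: the application takes 90 seconds to start in both cases.
replicated = failover_seconds(heartbeat_interval=5, missed_beats_allowed=3, app_start_seconds=90)
clustered = failover_seconds(heartbeat_interval=1, missed_beats_allowed=3, app_start_seconds=90)
difference = replicated - clustered
```

Because the application start time is identical in both calls, the difference between the two totals is purely the difference in detection time.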
The other major clustering caveat is that a traditional cluster has only one copy of the data, since it uses shared storage. Using replication technology, both the production and redundant platforms have their own independent copy of the data, albeit with the target copy perhaps being seconds old. If those seconds are acceptable in one’s RPO measurement, then the single-point-of-failure of the cluster’s shared storage can be eliminated by using replication. Admittedly, one can also eliminate the shared point of failure by putting synchronous hardware behind the cluster, as shared disk. But this brings us back to the high cost factor.
As an interesting twist, some replication technologies can be leveraged within clusters. The result may be to geographically split the cluster by providing independent copies of the data or to treat the entire cluster as a source and send the data to a disaster recovery target location. But that is a longer discussion.
Please understand there is not one single technology or approach that meets every need. There are legitimate business environments where a one-day outage and a restore to the previous night’s backup is acceptable, and no other means is cost-effective, given the low value of the data. That is why you truly need to understand the value of the data, per minute or megabyte. Similarly, there are environments where every literal transaction has financial value and the loss of even one cannot be tolerated. In those cases, synchronous hardware coupled with a clustered front end is reasonable, because of the high value of the data or the cost of lost productivity. My suggestion for you to consider is that there may be other technologies in between.
From the hundreds of clients I deal with, I find very few who are able to cost-justify synchronous hardware. Those same environments will run some clustered applications, but they have already decided (as an emotional preconception, not as a technology or business assessment) that the majority of their servers cannot be made fault-tolerant in a cost-effective manner. This is where replication technology is often the only alternative, because multiple sources can replicate to a single target. This provides those environments with a scalable and cost-effective means of fault resilience.
This difference in attitude between traditional protection/availability approaches and the newer technologies is based on the third of the three truths we discussed. No one has as much at stake in your business continuance plan as you do. While most vendors and integrators really do take a sincere partnering approach toward making you successful, it is ultimately your success or failure when a server or facility goes down.
Jason Buffington has been working in the networking industry since 1989, with a majority of that time being focused on data protection. He is a business continuity planner, a Microsoft MCT and MCSE, and a Novell Master CNE. He currently serves as the director of business continuity for NSI Software, enabling high availability and disaster recovery solutions. He can be reached at email@example.com.