Driving the need for disaster tolerance
The need for disaster-tolerant solutions is driven by at least three factors: critical applications, decentralization/outsourcing, and around-the-clock reliability.
As our operations and core business needs become more electronic and more dependent on information technology, a rapidly increasing number of routine applications are becoming absolutely critical and must always be up. Think of your supply chain, your Enterprise Resource Planning, your Customer Relationship Management. Even e-mail is now critical because real-time, global communication is an integral part of what we do.
The second driver is decentralization or, more precisely, the outsourcing of operational activities. In previous years, businesses kept a tight rein on all their internal operations such as sales, manufacturing, and shipping -- no matter what industry they were in. And thus they were in full control of the information technology that supported those activities. This gave them more control and reinforced a direct relationship with their customers.
Today, in an effort to streamline, many businesses want to divest themselves of all operations not directly associated with their core purpose. Simple! Therefore, they outsource to companies who have the expertise they do not. The plus side, of course, is that companies can save money, increase profits, and focus on their core competencies.
We mentioned the up side. But the down side is that, should there be a problem in any one of these outsourced systems, the business is not in business. In fact, by simplifying through outsourcing the business may have created a highly complex dilemma.
The third driver of disaster-tolerant solutions is around-the-clock reliability. With more applications considered critical and more potential points of failure, companies must assess their need for around-the-clock reliability. How acceptable is a moment -- or minutes -- of downtime? What are the risks? Where can loss be tolerated? How much is acceptable? And, of course, how much are they willing or able to spend to address these matters?
A simple formula
Economically speaking, the decision to implement a disaster-tolerant solution -- or, more precisely, the level of disaster tolerance to be implemented -- is based on a simple formula: the cost of extended downtime and the risk of a potential loss should outweigh the cost of the disaster-tolerant solution and the supporting infrastructure. How one determines the elements of this deceptively simple equation and, further, what to do about it is another matter entirely!
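One way to make the comparison concrete is the short Python sketch below. It is an illustration under stated assumptions, not a method prescribed here: it assumes risk can be expressed as a yearly probability of an incident, that downtime costs a flat amount per hour, and that the solution cost is annualized. The function names, parameters, and figures are all hypothetical.

```python
def expected_annual_loss(annual_probability, outage_hours, cost_per_hour):
    """Expected yearly loss from one class of incident (assumed model:
    probability of the incident times the cost of the outage it causes)."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return annual_probability * outage_hours * cost_per_hour


def solution_is_justified(expected_loss, annual_solution_cost):
    """True when the expected loss outweighs the yearly cost of the solution."""
    return expected_loss > annual_solution_cost


# Illustrative numbers only: a 5% yearly chance of a 48-hour outage
# costing 100,000 per hour, weighed against a solution costing
# 150,000 per year including its supporting infrastructure.
loss = expected_annual_loss(0.05, 48, 100_000)
print(round(loss), solution_is_justified(loss, 150_000))
```

The hard part, as noted above, is not the arithmetic but arriving at defensible values for the probability and the hourly cost.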
And the matter mentioned earlier -- going for greater operational simplicity and ending up with potentially intractable systems complexity -- is a paradox that must be addressed to ensure appropriate disaster tolerance. It really is like trying to put a square peg in a round hole.
A three-step process
There is, unfortunately, no simple solution. But there are ways to approach the problem. Fundamental to success is the realization that a disaster-tolerant environment should be designed from a systemic and holistic approach, that is, with an understanding of the multi-dimensional nature of operational reality.
The pursuit of a disaster-tolerant solution is a three-step process, but what you do with these steps and when you do it are critical! It is also important to recognize that different parts of your organization will require different responses to these three steps.
Step 1 -- Determine operational characteristics in the context of the business model
Determine the operational characteristics in the context of your business model. This is a fancy way of saying: “figure out the key factors of the way you do business, how they are supported by your IT systems, and where the priorities are.”
How many of your applications are critical? Most or all? Do you outsource most of your non-critical functions? How much do you want to streamline your operations while leaving open the possibility of increasing points of failure? How important is around-the-clock reliability? What is your acceptable loss?
All of these questions factor into your disaster-tolerant approach. At your first pass, the most important characteristics to consider are transaction centricity and/or data centricity, and recovery time and/or recovery point.
Do you need fast recovery or recovery to the exact state prior to the disaster -- or both? If you cannot resume processing within a second, will it be inconvenient, seriously damaging, or catastrophic? Conversely, if you do not resume processing right where you left off, will it be inconvenient, seriously damaging, or catastrophic?
If you are running a Web site and you cannot keep up with user demand, business is lost as users grow impatient and click elsewhere. In fact, research conducted by Oracle Corporation has shown that customers will wait no more than seven seconds before moving on. Similarly, a transaction-centric operation like a financial trading floor requires complete integrity of transactions -- with no interruption -- or the result may be huge losses. In these cases, recovery time is of the utmost importance.
A bank’s back office operation is a data-centric organization. It may withstand a little disruption, but when it restores its data, that data had better be accurate. Here, recovery point is the focus.
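The distinction between recovery time and recovery point can be sketched as a small classification in Python. This is a hedged illustration only: the `Workload` fields, the 60-second threshold, and the sample tolerances are assumptions made for the example, not figures from any real operation.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    max_downtime_seconds: float   # how long processing may stay interrupted
    max_data_loss_seconds: float  # how much recent data may be lost on restore


def recovery_focus(workload, tight_seconds=60.0):
    """Label a workload by which recovery objective is tight.

    The threshold is a hypothetical rule of thumb for this sketch.
    """
    time_critical = workload.max_downtime_seconds <= tight_seconds
    point_critical = workload.max_data_loss_seconds <= tight_seconds
    if time_critical and point_critical:
        return "both"
    if time_critical:
        return "recovery time"
    if point_critical:
        return "recovery point"
    return "neither"


# Made-up tolerances loosely modeled on the examples in the text.
workloads = [
    Workload("Web storefront", max_downtime_seconds=7, max_data_loss_seconds=3600),
    Workload("Bank back office", max_downtime_seconds=4 * 3600, max_data_loss_seconds=0),
    Workload("Trading floor", max_downtime_seconds=1, max_data_loss_seconds=0),
]

for w in workloads:
    print(f"{w.name}: {recovery_focus(w)}")
```

Writing the two tolerances down per workload, even roughly, makes it easier to see where the organization needs different responses.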
Step 2 -- Finding balance
Find the balance among three key aspects -- Technology, Services, and Procedures and Discipline -- that impact disaster tolerance.
Technology: physical and logical components that make up the IT and network environment, such as systems, equipment, software, network, data, storage, and power.
Services: remedial and preventive services, service providers, third-party evaluations and reviews, off-site personnel, environmental concerns (HVAC, fire prevention, and so on).
Procedures and Discipline: internal rules, policies, recovery plans, practices, drills, cross training of personnel, succession planning, and the discipline necessary to ensure implementation.
Having gotten this far in the approach we can now begin to build a model, a one-picture summary, if you will, of what we are addressing. We recommend that you use the chart as a touchstone or a reminder to ensure that you consider all the appropriate factors and do all of the necessary activities.
So far, we have three Aspects of concern: Technology, Services, and Procedures and Discipline. All of these must be applied against your business model.
This essentially sums it up. Figure out the operational requirements and then determine the Technology, Services, and Procedures and Discipline to apply in each case.
All this would be sufficient in a static world. But our environments are not static, not a slice in time. In fact they are by definition quite dynamic and require constant adjustment, updating, change, and improvement. All of which can throw a monkey wrench into any model that may work at a particular but singular stage.
Step 3 -- Addressing the dynamic nature of e-business environments
So we have the third step: addressing the dynamic nature of e-business environments. For each area to which you are applying some level of disaster tolerance you must constantly focus on planning, protecting, and (reality being what it is) recovering your resources.
For each aspect of concern (Technology, Services, and Procedures and Discipline), you need a process that begins with a plan, a way to protect the plan for successful implementation, and a recovery activity when an incident occurs.
Figure 1. The static Disaster Tolerance Planning Model showing Aspects and Key Elements of the Business Model.
Figure 2. The entire Disaster Tolerance Planning Model showing Aspects, Activities and Key Elements of the Business Model.
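As a rough companion to the model, the combinations of aspects and activities can be laid out as a simple checklist in Python. This is only a sketch of one possible bookkeeping structure; the function names and the idea of tracking open questions per cell are assumptions for illustration, not part of the model itself.

```python
from itertools import product

ASPECTS = ("Technology", "Services", "Procedures and Discipline")
ACTIVITIES = ("Plan", "Protect", "Recover")


def empty_model():
    """Return one empty list of notes per (aspect, activity) cell."""
    return {cell: [] for cell in product(ASPECTS, ACTIVITIES)}


def uncovered_cells(model):
    """Cells with no notes yet, i.e. combinations not yet considered."""
    return [cell for cell, notes in model.items() if not notes]


model = empty_model()
model[("Technology", "Protect")].append("How frequently should we test?")

for aspect, activity in uncovered_cells(model):
    print(f"Still to consider: {aspect} / {activity}")
```

A separate checklist of this kind could be kept for each part of the organization, since not every part needs the same level of disaster tolerance.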
Planning is the quintessential piece of the puzzle. It requires you to fully examine the model, goals, and needs of your organization along all three aspects of concern.
You will need to ask the following questions: What is the business? How do you generate revenue, deliver products and services? What are your data needs, customer characteristics, supply chain?
What are the risks compared to the consequences? How likely is a flood or hurricane?
What is the risk of malicious attacks? What are the consequences if someone kicks the power cord out?
Not everyone needs a 24x7, year-round computing infrastructure. Similarly, not every part of your organization requires the same level of disaster tolerance. Do you need fast recovery or recovery to an exact state prior to the failure?
What will your organization do? After you analyze the system environment, you must architect the right disaster tolerance strategy. You need to decide the level of protection and restoration and how to acquire technologies, services, procedures and disciplines.
Once these things are understood you can then determine how you will do things, when they need to be done, and who will do them. And it is always helpful to frequently ask yourself, “Why?” This tends to keep you on track and helps you avoid going down any blind alleys or inefficient paths.
Consider the financial, operational, technical, and personnel pieces of this puzzle.
And of course, remember that Murphy’s Law is pervasive!
During the Protect stage, you design, implement, and manage systems, resources, and procedures to support your plan. It is the most active stage and will require engagement on a daily basis.
On an ongoing basis you must ask questions, adjust, document changes, test, rehearse, try, and adjust again -- all while conducting your day-to-day operations.
You will need to ask questions such as the following: What are the risks to the technology and data? What technology should we deploy? How can it address or minimize the risk?
How frequently should we test? How should we monitor internal and external services?
What are the maintenance requirements, procedures, and discipline?
How will we manage change? How do we address contingencies?
Recovery is the final activity that ensures business continuity. You hope you will never have to reach this stage. But if a disaster strikes, and you have planned and protected your operations and data, recovery should perform exactly as you had expected. Well, maybe not exactly. However, if you properly planned and protected, you can meet your recovery priorities, adjust ad hoc, and have what at the time are the most important operations up, running, and accurate.
For example, one company’s recovery plan states that before anything is done, they must assess the current state of the business and its priorities and determine which operations support these priorities. Only after doing this analysis will they begin recovery procedures.
We have now reached the point where we are clear on the business model and priorities, the best balance of our three aspects (Technology, Services, and Procedures and Discipline), and how to address the dynamic nature of the business and its supportive technology infrastructure.
Three steps. Three aspects. Three balance points. Everything you need to do to ensure the level of disaster tolerance and business continuity you require. With a solid grounding in this approach you can recast Mr. Mencken’s statement: “For every complex problem there is an answer that is clear, simple and . . . disaster tolerant.”
Daniel S. Klein is the High Availability and Disaster Tolerant Solutions Marketing Manager in Compaq Computer Corporation’s Custom Systems & Solutions Business Unit. During his 14 years at Compaq, he has worked with a wide variety of IT solutions in such markets as manufacturing, consumer packaged goods, health care, education, and government.
The author is indebted to Jeffrey Schiebe and Ron LaPedis for their contributions to this article.