
Aesop's Take On Strategy Development

Written by  GREGG JACOBSEN, CBCP Tuesday, 06 January 2009 14:05

The Man and His Two Wives

In a country where men could have more than one wife, a certain Man, whose hair was fast becoming white, had two: one older than himself and one much younger. The young wife, being of sparkling and lively spirit, did not want people to think she had an old Man for a husband, so used to pull out as many white hairs as she could. The old wife, on the other hand, did not wish to seem older than her husband, so she used to pull out the black hairs. This went on until, between them both, they made the poor Man quite bald.

I’m willing to bet all the change in my pocket that our ancient Greek slave and fabulist never set foot in a raised floor environment. Yet he tells a compelling story that has applicability to the denizens of the data center. It begins with our poor, though some (men) would think he was lucky, husband, presumably the master of these women, as was the custom in those days.

However, the Wives have self-serving motives with unhappy results for all. And what, one might ask, does this have to do with strategy development? The answer is that, as in our fable, there are two distinct interests involved in system architecture: availability strategy and data protection strategy. Like the Wives, they represent differing needs to be satisfied.

IT Services Situation Report

This should not be news to anyone who works in IT services: unless one is in the business of selling IT services, IT service is a cost center, or, in bean-counter-speak, overhead. Overhead is bad, a necessary evil on the balance sheet.

As evidence, look at data center staffing: only Captain Jack Sparrow’s Black Pearl ran a more skeletal crew. Among the negative pressures of cost control is the tendency to “simplify” internal IT service offerings using some simple scheme, such as numbered “tier levels” or metal-based service levels, e.g., platinum, gold, silver, etc. They might look like this:

Tier or Metal        RTO             RPO
Tier 1 or Platinum   0 to 24 hours   4 to 24 hours
Tier 2 or Gold       24 to 48 hours  4 to 24 hours
Tier 3 or Silver     >24 hours       >24 hours



Your mileage may vary, but this approach simplifies life for the data center folks. However, it also presents the business users with choices of dubious value, since the business operations’ requirements may not fit the tier levels.


For example, some processes can be continued by employing manual workarounds for a few days while the system is down but have zero tolerance for data loss. Using the table above, to protect the data as required, they would have to pay for platinum service, even though they don’t need the high-availability (and high-cost) solution. But IT organizations frequently offer this on a take-it-or-leave-it basis, since tailoring every application/system to meet users’ actual requirements is too “difficult to manage.” Of course, given these kinds of choices, it is easy to expect questions like: “How much for the copper service level? Is the tin level cheaper?”
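The mismatch can be made concrete with a short sketch. The tier bounds come from the sample table above; representing each tier by its best deliverable RTO/RPO (the lower bound of each range) is an illustrative simplification, not a real service catalog.

```python
# Hypothetical tier menu based on the sample table; names and hour
# values are illustrative.
TIERS = [
    # (name, best deliverable RTO in hours, best deliverable RPO in hours),
    # ordered most to least expensive.
    ("Tier 1 / Platinum", 0, 4),
    ("Tier 2 / Gold", 24, 4),
    ("Tier 3 / Silver", 24, 24),
]

def cheapest_fitting_tier(rto_hours, rpo_hours):
    """Pick the cheapest tier that can meet both requirements; if none
    fits, the business is pushed into the top tier anyway."""
    for name, tier_rto, tier_rpo in reversed(TIERS):
        if tier_rto <= rto_hours and tier_rpo <= rpo_hours:
            return name
    return TIERS[0][0]

# A process that tolerates three days of downtime but zero data loss
# is forced onto Platinum, paying for availability it does not need
# (and still not getting a true zero-hour RPO).
print(cheapest_fitting_tier(72, 0))   # Tier 1 / Platinum
```

Note that no menu entry actually satisfies the zero-RPO requirement; the “fit” is simply the least-bad over-provisioned option.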

It’s the Terminology, Stupid!
Before lurching headlong into the travails of our trio, it is worth pausing over terminology. In our profession, terminology is vital to establishing and maintaining our clients’/employers’ understanding of why certain availability and data protection strategies are what they need. The obstacle is getting them to do a business impact analysis (BIA) to determine the risk of loss of IT service availability and data; most IT service providers and in-house practitioners meet enormous resistance to doing one. Clients/employers would rather eat broken glass than go through a BIA. Facing this prospective struggle, an alternate approach is used, and this is where “word games” come into play.

RTO is fairly straightforward: how quickly does the application/service need to be back up? This term is seldom a subject of challenge, mainly because it is easy to understand: availability is “architected” so the service is as available as the enterprise needs it to be. However, RPO gets some interesting, though often misleading and, for the IT service provider, self-serving twists. The most notable (and recently observed) is this definition of RPO: “The point in time to which data is to be recovered.” In reality, this definition is a paraphrase of: “RPO is the last piece of data that was safely stored outside the data center.” In other words, it’s just a target the IT service provider knows they can always hit, as opposed to what the enterprise needs to have protected from loss. The difference could be many days’ worth of transactions or vital records, with substantial downside risk exposures, including financial and regulatory, to name a few.

And What of Man and His Wives?
To be sure, they are not forgotten, but the above “back story” needed to be covered to make the analogy clear. For purposes of this discussion, only IT services (the Wives) are addressed here, since the preponderance of practitioners seldom get involved in operational (business process) continuity. It’s not that these principles don’t apply to the other resources supporting critical organizational operations, but nowhere else is there so much confusion about their application as in IT services.

After the risk evaluation is done and mitigating control measures are put in place, and after the business impact analysis (BIA) is done, the data is ready to analyze. By probability-weighting each impact, a risk-weighted recovery time objective (RTO) can be established. This becomes the system’s availability requirement, whether in nanoseconds, minutes, days, or weeks. This is Wife No. 1.
 



"Appeared in DRJ's Winter 2009 Issue"

But then there is Wife No. 2: tolerance for loss of data, i.e., recovery point objective (RPO). It has been observed that some operations, e.g., banking, have virtually zero tolerance for outage and data loss, which is certainly understandable. Transactions often involve very large amounts of money, and mere milliseconds can mean a lot of it.

However, there are cases wherein the business isn’t concerned with having a system up and running very quickly, because paper trails or other process elements enable productivity to be sustained. But the tolerance for data loss may be zero because of legal and/or regulatory reasons, e.g., pharmaceutical manufacturing, where losing clinical trial data can delay FDA approval of new products for months to years. Yet practitioners still find themselves torn between such seemingly disparate demands. They needn’t be.

Availability Strategy Choices
Here is one way of phrasing an IT service continuity policy: “If an application is worth developing and putting into production, it is worth recovering ... sooner or later.” Availability requirements can range from zero seconds to 10 o’clock next summer. When the RTO is zero, the architecture will (or should) be active-active or hot failover. That is, two instances of the system/application are running in parallel and, for regional threat/risk mitigation reasons, in locations separated by a suitable distance.

Between these extremes are RTOs that are somewhat “stepped” in nature, reflecting the varied capabilities of the architecture. Below active-active, there is “warm back-up,” where the failover is manual to at least some extent; this can meet RTOs from 10 or 15 minutes to several hours. Then there is “cold back-up,” where the recovery server must be configured and restored, and the application source code and back-up data loaded, before it is put into production; this can meet RTOs in the 12- to 48-hour range.
Once the RTO reaches 48 to 72 hours or longer, manual restoration on cold servers, whether at another in-house data center or at a recovery vendor location (hotsite), becomes feasible, though it may add cost issues related to travel expense to support exercising the plans.
At the extreme, where there is high tolerance for outage and data loss, a “best effort” recovery plan will do. For the uninitiated, a “best effort” plan is one in which no equipment is purchased or subscribed from a recovery vendor prior to a disaster event. Rather, a system architect should document the required resources that must be securely stored offsite:

  • Hardware configuration
      - Manufacturer
      - Model number
      - Operating system, version level, service packs, etc.
      - Number, type, and speed of CPUs
      - Memory and attached disk space
      - Licenses, keys, and any other items the application may require to activate the system, and
  • Copies of
      - Application source code,
      - Network diagram, and
      - Data back-up media.

Some practitioners may object to writing a plan with no strategy to be implemented until a disaster strikes, but during Y2K preparations it was realized that the items above are essential. Back-up media alone may enable an application to run, but without the source code, the restored instance cannot be patched, fixed, or updated. Thus, the documentation and secured items listed above are the “plan implementation.”
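The stepped RTO-to-strategy mapping described in this section can be sketched as a simple selection function. The hour thresholds are the article’s approximations, not fixed industry rules, and the strategy labels are illustrative.

```python
def availability_strategy(rto_hours):
    """Map a risk-weighted RTO (in hours) to an availability strategy,
    following the stepped ranges described above."""
    if rto_hours == 0:
        # Two parallel instances at suitably separated locations.
        return "active-active / hot failover"
    if rto_hours <= 12:
        # Failover is manual to at least some extent.
        return "warm back-up"
    if rto_hours <= 48:
        # Server configured/restored, source code and data loaded first.
        return "cold back-up"
    if rto_hours <= 72:
        # In-house data center or recovery vendor hotsite.
        return "manual restoration on cold servers"
    # High tolerance for outage: documented resources stored offsite.
    return "best effort"
```

Note the ranges in the text overlap (warm back-up runs “to several hours,” cold back-up starts at 12); the cutoffs here simply pick one consistent boundary for illustration.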

Data Protection Strategy Choices
Data center operations and their operating costs play a significant role in data protection strategy selection. Today’s technology offers far greater storage capacity per square foot of floor space than was imagined even 10 years ago, and data transmission speeds have vaulted at least as dramatically. These factors weigh against labor costs for people who handle back-up media, most notably tape cartridges.

When the tolerance for data loss (RPO) is from zero to 24 hours, replicating to a remote site is the strategy: cutting the achievable RPO to 12 hours with tape would require the offsite storage vendor to make more than one pick-up per day, at least doubling labor costs and vendor fees. By jumping to replication, tape-handling labor drops to zero. If back-up tapes are desired, or required by regulatory mandate, the storage architecture can simply add a tape silo; terabytes of data can then be automatically copied and stored, with handling labor limited to archiving the oldest required tapes in racks.

Once the RPO exceeds 24 hours, the strategy may shift to restoring from back-up tapes, since daily pick-ups for offsite storage are usually reliably controlled. From there, higher RPOs are a matter of how data center operations chooses to send tapes offsite.
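The RPO-driven choice in the last two paragraphs can likewise be sketched. The 24-hour threshold comes from the text; the option names and the `tape_mandated` flag are illustrative assumptions.

```python
def data_protection_strategy(rpo_hours, tape_mandated=False):
    """Sketch of the RPO-driven choice described above."""
    if rpo_hours < 24:
        # Multiple daily tape pick-ups would at least double labor and
        # vendor costs, so remote replication wins below 24 hours.
        strategy = "remote replication"
        if tape_mandated:
            # Regulatory mandate: bolt on an automated tape silo.
            strategy += " + automated tape silo for archives"
        return strategy
    # At 24 hours or more, one reliably controlled daily pick-up of
    # back-up tapes for offsite storage is enough.
    return "daily back-up tapes to offsite storage"
```

Paired with the availability mapping above, this gives the two-column menu the article argues for: each application gets an RTO-driven platform choice and an independent RPO-driven storage choice.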

And the Man?
The Man is the business: he finds himself ill-served by the competing interests of Wives with presumably different objectives. But need they behave thusly?

Not really. Computing platforms can be designed to meet availability requirements and storage architecture can likewise be crafted to protect the data as well as is required.

Does he really need to be plucked to baldness? Was he even asked?

It seems not. But then, is this any different from an IT department offering service “menus,” with the (business) customers choosing one of three or more options (“No substitutions, please”)? The ideal, rather, is a menu that offers two-column choices: “Let’s have the four-hour RTO and the 48-hour RPO, please.”

So, our Man decides to tell his Wives that he prefers the salt-and-pepper look, and they’ll just have to get used to it. The BIA establishes the down-side risk of being all black-haired or all white-haired, but he winds up with a chrome dome. And no one was happy, were they?

Gregg Jacobsen has an MBA in organization development and is a Certified Business Continuity Professional (CBCP) with more than 13 years’ experience in business operations and IT service continuity practice, both as a consultant and in-house practitioner. He is a BC/DR coordinator with Siemens IT Solutions and Services, Inc., serving their IT outsourcing clients. He is very active in the profession, including prior service as chapter president of the Los Angeles chapter of the Association of Contingency Planners and chair of the ACP Presidents’ Council, and is currently chair of the ACP Hall of Fame Judging Committee. He lives and works in Westlake Village, Calif.
