The CIO’s role varies widely from organization to organization and can range from "trusted advisor" to "resented overhead." Whatever the perception, the professional CIO knows that today’s business requirement mandates an executable DR plan, and that the DR plan be supportive of real business priorities. Certainly the CIO knows that on D-Day (disaster day), he or she will be under the brightest spotlight of their career as the organization scrambles to get back on its feet.
Fortunately, most CIOs and their staffs already know which business units are key to the organization’s success and profit. The prudent CIO and staff can capitalize on the knowledge they develop oiling the squeaky wheel, responding to daily crises, and serving as the guardian of the organization’s key data. In many companies the CIO and IT staff know, often with 95 percent accuracy, just which applications and which databases are key to the organization’s ability to do business. What they may not typically know is the actual value of the business unit’s operation to the corporate bottom line. This information can be found in two ways: 1) a formal business impact analysis (BIA) performed as a component of a business continuance planning process; or 2) an estimate from the CFO based on the business unit’s contribution to the actual and projected bottom line.
Business Impact Analysis
While a CFO’s projection of bottom line contribution is often easy to find, the hard dollar and soft dollar impact on an organization in the period following a disaster is far more abstruse and requires a formal discipline such as the BIA to fully identify such costs.
The CFO’s number is relatively easy to find and, in the absence of a BIA, provides at least an empirical foundation for the DR decisions to be made. The BIA is nevertheless the superior metric: it considers what happens should the organization be incapable of conducting business operations, and it includes subjective and qualitative impacts as well as a calculation of the value over time and the impact over time. This gives a much more insightful understanding of the potential impact of a disaster.
The BIA can be the CIO’s best friend when it comes time to make the case for investment in DR infrastructure. It can also be the CEO’s best friend should it become necessary, after a disaster, to demonstrate due diligence in guarding the organization’s assets.
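The idea of impact over time can be illustrated with a small calculation. The following is only a sketch, not a BIA method: the hourly loss figures, the `daily_escalation` parameter, and the function name are assumptions made for illustration, and a real BIA would also capture qualitative impacts that do not reduce to a number.

```python
def outage_impact(hard_per_hour, soft_per_hour, hours, daily_escalation=0.0):
    """Cumulative hard plus soft dollar impact of an outage lasting `hours`.

    Assumption: the hourly loss rate grows by `daily_escalation`
    (0.5 means +50 percent) for each full day already elapsed.
    """
    total = 0.0
    for hour in range(hours):
        factor = 1.0 + daily_escalation * (hour // 24)
        total += (hard_per_hour + soft_per_hour) * factor
    return total


# Hypothetical figures: $1,000/hour hard cost, $500/hour soft cost.
one_day = outage_impact(1000, 500, 24)
two_days = outage_impact(1000, 500, 48, daily_escalation=0.5)
print(one_day, two_days)
```

With these assumed figures the second day costs more than the first, which is the kind of time-dependent picture a CFO’s bottom line estimate alone does not give.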
Assessing Business Impact and Risk
There are two key components in assessing appropriate levels of investment in DR. We have already discussed the first, the BIA, which assesses the impact on the business over time and enables the CIO to make an appropriate investment that is cognizant of the impact of both hard and soft costs of a disaster.
The second component, risk assessment, is much more difficult. Just how likely is a disaster? Often the subject of abstruse calculations deep within an insurance company’s actuarial department, risk assessment is indeed a complex and challenging task. Suffice it to say that in today’s world, being unprepared carries sufficient penalties to justify visible, if not always prudent, investment in disaster recovery. For the CIO, the investment levels required for DR are directly related to the recovery time objective (RTO) and recovery point objective (RPO) targets set by business units.
Because the IT infrastructure is so defined, visible, and consolidated, it is an easy target for auditors; there is a common perception (albeit demonstrably false) that the business unit needs only somewhere to go, a phone, and a PC connected to the data center and business can resume. Planning for those components is, of course, assumed to be the sole responsibility of the CIO.
Developing a DR Framework
So how does the prudent CIO develop a disaster recovery plan that is both pragmatic in its approach to risk assessment, preparation, and execution, and has some realistic chance of execution, at least in a reasonable test scenario?
Once the cost of downtime to the business unit is understood, the next step in developing a DR plan is to gain an understanding of just what it is that needs recovering for the business unit. Thus the IT team needs to develop a cross-referenced inventory of application, server, and storage assets by business unit. Once this inventory has been established, it can then be ranked into three to five categories (tiers or classes of service) based on the business impact and business need.
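A cross-referenced inventory of this kind can be sketched as a simple data structure. The field names, the `cross_reference` helper, and the sample rows below are hypothetical; a real inventory would normally come from an asset database rather than a hand-built list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """One inventory row: an application and where it runs."""
    business_unit: str
    application: str
    server: str
    storage: str


def cross_reference(assets):
    """Group applications, servers, and storage by business unit."""
    index = defaultdict(
        lambda: {"applications": set(), "servers": set(), "storage": set()}
    )
    for asset in assets:
        entry = index[asset.business_unit]
        entry["applications"].add(asset.application)
        entry["servers"].add(asset.server)
        entry["storage"].add(asset.storage)
    return dict(index)


# Hypothetical sample rows, for illustration only.
inventory = [
    Asset("Order Entry", "order-db", "srv-001", "san-a"),
    Asset("Order Entry", "order-web", "srv-002", "san-a"),
    Asset("Payroll", "payroll-app", "srv-003", "nas-b"),
]
by_unit = cross_reference(inventory)
print(sorted(by_unit["Order Entry"]["servers"]))
```

Indexing by business unit first makes the later ranking step straightforward, because each unit’s applications, servers, and storage move into a tier together.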
In an organization with multiple business units, an initial review will no doubt identify a unique DR requirement for each business unit. This level of granularity cannot and need not be supported by a cost effective DR plan. Rather the various business needs are grouped into three to five tiers or classes. Each one of these tiers or classes will be based on the hard/soft dollar impact on the organization and on two key metrics that govern the recovery objectives of the business units: RPO and RTO.
Thus, those business units whose contribution to the bottom line is both significant and immediate will be in a higher class of service than a business unit whose contribution to the bottom line may be less immediately impacted by disaster. Once these groupings and categorizations are determined, it is key to ensure buy-in by the various CXOs of the organization. If this approval process is bypassed, the review will surely occur anyway, with 20/20 hindsight, after the disaster plan has been initiated. The review process can also filter out the more outrageous requirements that can arise.
In developing the attributes for each class of service, the prudent CIO usually assigns attributes to the top tier that include zero data loss and either no downtime or perhaps just a few minutes of downtime. Subsequent tiers usually blow out the RTO to 24 hours, then 48 to 72 hours, then perhaps 72 to 96 hours.
RPO can often be less tolerant than RTO. Perhaps a day’s downtime can be tolerated without too great an impact, but the loss of any data that is unrecoverable or cannot be reconstructed may be truly unacceptable. This situation is often seen where Tier 2 has the same RPO as Tier 1 but with more tolerance in time to recover. Tier 3 is often tape-based recovery, with a de facto RTO/RPO of 24 to 48 hours.
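The tier attributes described above can be written down as a small table and used to place each business unit in the least demanding tier that still meets its stated needs. This is a sketch under assumptions: the exact hour values are illustrative picks from the ranges mentioned in the text, and the `Tier` class and `assign_tier` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_hours: float  # maximum time to recover
    rpo_hours: float  # maximum age of recovered data


# Illustrative values only, ordered from most to least demanding.
TIERS = [
    Tier("Tier 1", rto_hours=0.25, rpo_hours=0),
    Tier("Tier 2", rto_hours=24, rpo_hours=0),
    Tier("Tier 3", rto_hours=72, rpo_hours=48),
    Tier("Tier 4", rto_hours=96, rpo_hours=48),
]


def assign_tier(required_rto_hours, required_rpo_hours, tiers=TIERS):
    """Return the least demanding tier that meets both requirements."""
    for tier in reversed(tiers):
        if (tier.rto_hours <= required_rto_hours
                and tier.rpo_hours <= required_rpo_hours):
            return tier
    raise ValueError("no tier meets the requested RTO/RPO")


# A unit that can be down for a day but cannot lose data lands in Tier 2.
print(assign_tier(required_rto_hours=24, required_rpo_hours=0).name)
```

Searching from the least demanding tier upward reflects the cost argument in the text: a unit should only occupy an expensive tier when its RTO or RPO genuinely requires it.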
The tiering process often reveals that over time, as various interfaces are built, applications can become inextricably interdependent, tangled like the wire coat hangers in your clothes cupboard that resist all efforts to disentangle. Rather than spend the time it may take to map out all these interdependencies, it is often more prudent to simply group these applications into the same class of service based on the business unit they serve.
Once the initial tiering has been drafted, a refining process may be required as the IT team finds that the technology infrastructure necessary to provide recovery in 30 minutes, 1 hour, 4 hours or 6 hours is substantially the same. That is, for any tier of recovery where the RPO is less than 8 to 24 hours, disk-based recovery is probably mandated. For RPOs of zero, 30 minutes, 2 hours, or 4 hours, the IT team may also find that the technology is a similar form of synchronized replication. Once a set of realistic tiers has been decided and the RTO/RPO ranges agreed for each tier, planning to provide the DR capabilities for each tier can commence.
Pre-Requisites to DR
There are two pre-requisites to disaster recovery execution: data consolidation and server consolidation to a manageable number. These prerequisites are driven by the human limitations of managing data and servers beyond a certain physical limit in the target timeframes for recovery. It is extremely difficult, if not impossible, to recover say, 2,321 servers, each with its own direct attached storage, on day one of the recovery process. How can these numbers, which are not unusual, be chunked down to a level where we could realistically expect day one recovery from the IT team (or at least the surviving members)?
The first pre-requisite is data consolidation: consolidating data to a point that optimizes its protection. Invariably, this means a need for some form of networked storage. In smaller sites, something as primitive as a consolidated file server may be sufficient; in larger organizations, consolidation to a NAS or SAN device is considered a prerequisite.
The second pre-requisite is server consolidation. Servers must be consolidated to a number that can reasonably be recovered in the target times. This can be a complex process. It is necessary to determine the characteristics of a server platform that is required to support each tier of applications. Server consolidation requires an understanding of the impact of scaling outwards or scaling upwards.
Scaling outwards means using multiple machines to support an application. This is most commonly supported by blade servers, perhaps booting from the SAN. Scaling upwards means consolidating applications of the same class of service onto larger and larger multiprocessor machines.
OK, so now we have identified classes of recovery in tiers that include RTO and RPO metrics. We have developed a cross-referenced inventory of application, server, and storage assets by business unit, and we have classified our inventory into three to five tiers based on our understanding of the tangible and intangible impact, to the company, of a disaster. We have consolidated storage to a manageable and more cost effective consolidated frame, and we have reduced the number of servers or at least the administrative overhead involved by our scaling and consolidation efforts. We don’t yet have DR, but we seem to have a much more efficient and less costly operation. It looks like DR may actually have a payback after all, one that comes from the more disciplined and cost effective infrastructure we have now built.
Before we can develop our DR plan further, we need to consider three issues that can lay waste to even the best DR plan.
1. Should backup be seen as the primary source of DR data? We typically find that backup is a function that is scheduled over a 24-hour period to touch every relevant server according to policy. This strategy is ideal for the recovery of a file or the recovery of a single server, but when we need to recover every server in the data center (in Tier 1) then all these servers may well be recovered to a different point in time. While physical integrity may have been preserved, logical integrity of interdependent applications may well be hopelessly corrupted. Snapshot technologies can provide a way to take a backup of all servers in a particular tier at a consistent point in time.
2. Where replication is used as the primary method of DR, data replication may be maintained either synchronously or asynchronously. However, replication is maintained at the physical write layer and does not necessarily guarantee that, at the point of a disaster, the data will have sufficient logical integrity for application restart.
3. Applications and their databases may well need to be specifically recovery aware so that interruptions to a logically related bracket of transactions can be recovered and restored to production operations within target RTO/RPO.
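The first issue, servers in one tier being recovered to different points in time, can be checked with very little code. The sketch below assumes that a backup completion time is available for each server; the function names, server names, and timestamps are hypothetical, and matching timestamps indicate only a common recovery point, not application-level integrity.

```python
from datetime import datetime, timedelta


def recovery_point_spread(backup_times):
    """Gap between the oldest and newest backup among a tier's servers."""
    times = list(backup_times.values())
    return max(times) - min(times)


def shares_recovery_point(backup_times, tolerance=timedelta(0)):
    """True if every server can be restored to (nearly) the same moment."""
    return recovery_point_spread(backup_times) <= tolerance


# Hypothetical completion times from a backup scheduled across 24 hours.
rolling = {
    "srv-001": datetime(2007, 1, 15, 1, 0),
    "srv-002": datetime(2007, 1, 15, 9, 30),
    "srv-003": datetime(2007, 1, 15, 22, 45),
}
# Hypothetical snapshot of the same servers taken at one moment.
snapshot = {name: datetime(2007, 1, 15, 2, 0) for name in rolling}

print(recovery_point_spread(rolling))
print(shares_recovery_point(rolling), shares_recovery_point(snapshot))
```

A large spread for a Tier 1 group is a warning that interdependent applications restored from those backups may not agree with one another.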
DR vs. Backup vs. Archiving
It is apparent that while current backup (BU) policies may be adequate for file recovery, and possibly even single server recovery, there may well be significant problems in multi-server recovery. This may well mean that we need a recovery solution for disaster that is in addition to the recovery solution for the typical "oops" situation of data loss or corruption, or server failure.
Additionally, legislative and regulatory directions are now causing organizations to take additional archive copies of their data for long-term retention and retrieval under an RTO that may require the data to be rendered outside of the application-created formats.
This means we may need to treat our DR protection capabilities as a process separate from our file level protection (BU) and separate again from our archiving. We can no longer look to week four of the tape back-up cycle to provide all the functions of file recovery, DR recovery, and compliance driven archiving.
Realizing the Benefits
At this point in our DR preparation journey we should be looking good: consolidated storage, consolidated servers, revised policies for backup, archiving, and DR protection. Some applications are now more disaster aware. We have also gained some significant new insights into operational policy and procedures, more efficient systems administration, and cost savings from our new storage model. Who said DR had no ROI?
DR planning gave us the incentive to develop the disciplines and efficiencies we should have had in the first place. It looks like DR planning could be just good business practice. But we are not out of the woods yet. Our data is protected, but how it will be accessed, and by whom and by what, should a disaster be declared has yet to be decided.
Few organizations can afford to duplicate their servers, even for Tier 1 applications, and leave them in a dark room on the DR site awaiting a disaster declaration. Many organizations attempt to utilize their development/test/QA environment, adopting a one-for-one policy (production to test) and financing the upgrade of the environment to support a production work load. Others opt for a third party to provide hardware on demand in the event of a disaster.
There are infinite variations on these themes, but one thing is certain. If the designated DR hardware has not been brought into actual production usage, can you be sure it will really work when you really need it? The larger the numbers, the less likely it is that all Tier 1 machines can be successfully restored to operations. We should also consider the host of exacerbating issues in a real disaster. Just two: Will your key staff want to travel? Will they be able to travel?
DR tests that do not result in the infrastructure being brought into production do not really prove anything at all. The only test that is realistic is one that results in production being transferred to the target DR configuration and operated as the production environment for at least 30 days. In an environment with, say, 2,400 servers in Tier 1 (the most important group) this would mean that an annual test would require that every month, the production workload of 200 servers would be transferred between production and DR configurations.
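The rotation arithmetic above (2,400 servers spread over 12 months gives 200 per month) generalizes to server counts that do not divide evenly. The following is a minimal sketch; `monthly_test_batches` is a hypothetical helper, and it only sizes the batches without deciding which servers belong in each.

```python
def monthly_test_batches(server_count, months=12):
    """Split servers into near-equal monthly batches covering every server."""
    base, extra = divmod(server_count, months)
    # The first `extra` months take one additional server each.
    return [base + 1 if month < extra else base for month in range(months)]


even = monthly_test_batches(2400)
uneven = monthly_test_batches(2321)
print(even[0], sum(even))
print(uneven, sum(uneven))
```

In practice the batches would more likely follow application interdependencies than raw counts, since applications that cannot be separated need to move to the DR configuration together.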
We have only scratched the surface here. There are many other dimensions to consider in a professional, prudent, and pragmatic DR plan that time and space do not allow us to address in this article. But you can make a start now, and yes, you can make a difference.
Dick Benton brings the experience of an international IT executive, consulting practices manager, and technical pre-sales support manager to the role of storage and disaster recovery consulting. He is a senior consultant with GlassHouse Technology’s consulting practice. He holds an M.Ed. from Cambridge College.
"Appeared in DRJ's Winter 2007 Issue"