This article describes a rational, systematic approach to properly sequencing the restoration of IT infrastructure components to aid in resource allocation, scenario evaluation, and actual recovery during an event.
IT Technology Trap
IT services can be both a blessing and a curse. Because so much business function is dependent upon potentially fickle IT systems, risk is not limited to classic corporate functions like billing, e-mail, and file storage. IT often supports seemingly mundane operations like door locking and telephone use. We choose to complicate our lives with IT for the greater features and efficiency it delivers, but IT can quickly become a technology trap, leaving us helpless when it fails. It’s why IT systems require such explicit attention to BC/DR discipline. It’s no surprise most organizations’ BC/DR planning begins in the IT organization.
The dependencies behind our IT-supported systems are often non-intuitive. Business functions in one city may be critically dependent upon IT systems in another city, often supported by staff who don’t fully understand how those systems are used by the company. When multiple systems fail, how do these support personnel decide what to restore first, second, third, and so on? That’s the question we’ll answer in this article, and as we do, we’ll cover new ways of systematically tackling the problem for large IT environments.
We begin by researching and documenting our IT and business environment dependencies and developing diagrams which model information that is useful for recovery planning. We include physical objects like servers, switches, and UPSs as well as abstract objects like vendors, business processes, and business functions. These objects are nodes in our model and appear visually as different shapes or icons. Dependency relationships between objects are visualized as a line with an arrow between the shapes, also called a directed edge.
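A model of this kind can be captured in very little code. The following is only a sketch in Python: the object names (server_a, payroll_db, and so on) and the `needs` mapping are assumptions invented for illustration, not a transcription of any figure in this article.

```python
# Hypothetical dependency model; every name here is illustrative.
# Each key "needs" every object in its list (a directed edge: key -> dependency).
needs = {
    "payroll": ["payroll_db", "ldap"],
    "sales": ["sales_db", "apache"],
    "payroll_db": ["oracle"],
    "sales_db": ["oracle"],
    "oracle": ["server_a"],
    "apache": ["server_b"],
    "ldap": ["server_b"],
    "server_a": [],
    "server_b": [],
}


def supports(needs):
    """Invert the edges: map each object to the objects that need it directly."""
    result = {node: [] for node in needs}
    for node, deps in needs.items():
        for dep in deps:
            result[dep].append(node)
    return result


dependents = supports(needs)
print(sorted(dependents["oracle"]))  # the objects that need oracle directly
```

Storing only the "needs" direction keeps the model easy to maintain; the reverse "supports" view can always be derived when it is wanted.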
Diagrams of nodes and edges in general are called graphs. Perhaps you’ve heard mention of these terms in the context of BC/DR, or maybe they’re new to you. The science itself is called graph theory, and although it originated in the 1700s, it garnered a lot of research attention as mankind began studying the problems of building telephone networks. Today we often hear of its use in Google’s PageRank algorithm or in the context of social networks like Facebook.
In Google’s case, they model an immensely large graph of Web pages linked together and analyze it to determine and convey relevance and reputation amongst those pages. When you search the Web using Google, the results you see come from analyzing that graph.
In our case we’ll be analyzing a graph with the goal of choosing which IT resources should be restored first. That ordering is what graph theorists call a “topological order.” The order is based on certain characteristics of how one node relates to other nodes in the graph. In our case it’s how the IT resource is positioned in the dependency hierarchy. First, we’ll look at an example of a dependency model and demonstrate how to determine the proper topological ordering for service restoration. Then we’ll cover a second criterion for ordering recovery, one you’re probably more familiar with, based on value or priority metrics, and see how the two can be used in conjunction to choose the best sequence for restoring components.
The topological order is revealed by looking at the graph of dependencies.
In the Fall 2009 DRJ issue I introduced the technique of dependency modeling. I showed how the BC/DR practitioner can facilitate the discovery and documentation of everything from high level business functions down to lower level resources such as IT services, applications, servers, and power supplies to foster understanding of the environment. Here we employ that dependency model to systematically solve the specific problem of restoration ordering.
In Figure 1 we see a typical dependency model for two business processes: payroll and sales. To the right of those objects are their dependencies, that is, the IT resources they need to function. Let’s suppose a fire has wiped out the datacenter housing all the hardware assets that support these business processes, and it’s time to rebuild. Because of the interdependency between components, there’s a preferred order for restoring them. If a database depends upon a fileserver, that fileserver should be brought up first so it’s available when the database is brought back online.
It doesn’t matter if the architecture of the system is simple client-server, three-tiered, or a hodgepodge of services which grew together organically over time. Whatever it is, that architecture constrains the order in which components should be restored. More components obviously means more work to recover, but the quantity and obscurity of dependencies in complex architectures add further risk. They increase the odds that recovery workers will have to abort, or defer, complex restoration tasks to focus on overlooked subsystems which must be restored first. Unplanned context switching causes stress among personnel, as the shifts of focus cause confusion, create resource bottlenecks, and generally extend the duration of outages. Proper restoration sequencing seeks to reduce or eliminate these scheduling mistakes.
The correct rebuild order can be determined systematically from the dependency graph. It is determined by starting with all nodes (IT resources) that have no unmet dependencies. Looking at this graph, that means starting with resources toward the right which have no “needs” relationships to other resources. Those resources should be restored first because, according to the model, they can run independently: they depend on nothing else. Once those initial resources are up and running, additional resources become appropriate for restoration, and recovery staff can move on to those tasks.
In general the restoration effort progresses from right to left of this graph. We can even go as far as reducing this graph to an ordinal restoration list as seen in Table 1. First we bring up the servers since they have no dependencies. Once the servers are up, we can recover the oracle, apache, and ldap services because they have no unmet dependencies. Once those services are up, we proceed with the databases, and so on. This list could be prepared ahead of time, or staff may refer directly to the dependency graph during restoration.
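The procedure just described, repeatedly taking every resource with no unmet dependencies, can be sketched in a few lines of Python. This is a minimal illustration, not the tooling behind Table 1. The `needs` mapping and its object names are hypothetical, and the function assumes every dependency also appears as a key in the mapping.

```python
# Hypothetical model: each key needs every object in its list.
needs = {
    "payroll": ["payroll_db", "ldap"],
    "sales": ["sales_db", "apache"],
    "payroll_db": ["oracle"],
    "sales_db": ["oracle"],
    "oracle": ["server_a"],
    "apache": ["server_b"],
    "ldap": ["server_b"],
    "server_a": [],
    "server_b": [],
}


def restoration_waves(needs):
    """Group objects into waves; each wave depends only on earlier waves."""
    unmet = {node: set(deps) for node, deps in needs.items()}
    waves = []
    while unmet:
        # Everything whose dependencies are all restored can be worked on now.
        ready = sorted(node for node, deps in unmet.items() if not deps)
        if not ready:
            raise ValueError("circular dependency among: " + ", ".join(sorted(unmet)))
        waves.append(ready)
        for node in ready:
            del unmet[node]
        for deps in unmet.values():
            deps.difference_update(ready)
    return waves


waves = restoration_waves(needs)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

For this illustrative model the waves come out as servers, then services, then databases, then business processes, mirroring the right-to-left progression described above. A circular dependency is reported as an error rather than silently ignored, since it means the model itself needs attention.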
During a crisis, technical staff will be hard at work restoring systems or consulting other technicians. BC/DR managers, newer workers, and temporary assistants can come up to speed quickly with this kind of documentation which is not overly detailed and clearly shows the basis for the scheduling choices. A printout of the graph can be used to check off work items as they’re completed, tracking overall progress and aiding time-to-recovery projections.
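The same check-off idea can be tracked in software. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (version 3.9 and later). The small `needs` mapping is a hypothetical fragment of a model, and marking server_a as done stands in for a technician reporting that the work is complete.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical fragment: each key needs every object in its list.
needs = {
    "oracle": ["server_a"],
    "apache": ["server_b"],
    "ldap": ["server_b"],
    "payroll_db": ["oracle"],
}

tracker = TopologicalSorter(needs)
tracker.prepare()  # raises graphlib.CycleError if the dependencies are circular

# Tasks that can be started immediately.
startable = sorted(tracker.get_ready())

# A technician reports server_a restored; check it off.
tracker.done("server_a")

# Tasks newly unblocked by that completion.
unlocked = sorted(tracker.get_ready())

print("start now:", startable)
print("unlocked by server_a:", unlocked)
```

Because `get_ready()` only returns tasks whose dependencies have all been marked done, a coordinator can hand out work as it becomes available without consulting the full diagram each time.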
In summary, our list is the topological ordering of IT resources to be restored. The order becomes evident once the objects and dependencies are organized into a graph. It’s a clearly documented, repeatable approach that scales well for complex IT supported business functions.
But our discussion isn’t over yet. If you recall, I said there were two criteria to consider when determining the restoration order. If you’ve followed along closely, you may have realized that it makes no difference at this point whether we restore apache before oracle or oracle before apache, since neither depends upon the other either directly or indirectly.
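Whether two resources are independent in this sense can be checked mechanically by following the dependency edges. The sketch below is an illustration only; the `needs` mapping and the helper names `depends_on` and `independent` are assumptions made for this example.

```python
# Hypothetical model: each key needs every object in its list.
needs = {
    "payroll": ["payroll_db", "ldap"],
    "sales": ["sales_db", "apache"],
    "payroll_db": ["oracle"],
    "sales_db": ["oracle"],
    "oracle": ["server_a"],
    "apache": ["server_b"],
    "ldap": ["server_b"],
    "server_a": [],
    "server_b": [],
}


def depends_on(needs, node, target):
    """True if node needs target directly or through a chain of dependencies."""
    stack = list(needs.get(node, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(needs.get(current, []))
    return False


def independent(needs, a, b):
    """True if neither object depends on the other, so either may go first."""
    return not depends_on(needs, a, b) and not depends_on(needs, b, a)


print(independent(needs, "apache", "oracle"))
print(independent(needs, "oracle", "server_a"))
```

When `independent` returns True, the topological constraints alone cannot say which task should come first, and some other criterion has to break the tie.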
The topological order satisfies only the logical constraints within the system’s architecture, which is what enables a smooth workflow. There’s another criterion important to our ordering problem: how do we prioritize limited resources when multiple restoration tasks are available at once? For instance, if our system administrators can recover only one service at a time, should they do apache first or oracle? Any preference at this microscopic level for one component over another originates from the business valuing certain business functions more than others. This preference is recorded by traditional MTD, RTO, or tier classifications.
Not every company describes importance or priority the same way. Some use MTD (maximum tolerable downtime), while others use RTO (recovery time objective) or tiers or some other measure of criticality. What’s important is that it’s consistent, and the metrics enable comparison of one resource’s priority to another. For the sake of simplicity, let’s say we classify our resources into tiers, where tier 1 is the most critical and tier 2 is less critical.
The simple rule we follow is this: when multiple restoration opportunities exist, work on the highest-priority (or most valuable) resources first. It’s so obvious it almost goes without saying, but in practice it’s difficult to apply, since it’s usually only the higher-level objects of our model that have been assigned priority metrics from consultation with managers or users. Fortunately, by using the dependency model, we can convey those priorities from the lofty heights of business functions and processes to supporting infrastructure components such as software and servers, where actual human work efforts are expended.
Dependencies Convey Priority
If an object supports two business functions with different priorities, the object inherits the higher of the two functions’ priorities. In our example, since oracle supports both tier 1 and tier 2 business processes, it should be considered a tier 1 service.
We’re using the dependency model again, not to find topological order this time, but to expand the breadth of our priority metrics to all objects in the model.
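This inheritance rule is simple to automate. The sketch below is one possible way to do it in Python, under stated assumptions: the `needs` mapping is hypothetical, and the assignment of payroll to tier 1 and sales to tier 2 is invented for illustration, since which process belongs to which tier depends on the business. A lower tier number means a more critical object.

```python
# Hypothetical model: each key needs every object in its list.
needs = {
    "payroll": ["payroll_db", "ldap"],
    "sales": ["sales_db", "apache"],
    "payroll_db": ["oracle"],
    "sales_db": ["oracle"],
    "oracle": ["server_a"],
    "apache": ["server_b"],
    "ldap": ["server_b"],
    "server_a": [],
    "server_b": [],
}

# Assumed tiers, as might be gathered from managers (1 = most critical).
assigned = {"payroll": 1, "sales": 2}


def inherit_tiers(needs, assigned):
    """Push each tier down the dependency edges, keeping the most critical one."""
    tiers = dict(assigned)
    stack = list(assigned)
    while stack:
        node = stack.pop()
        for dep in needs.get(node, []):
            # A dependency takes on the tier of its most critical dependent.
            if dep not in tiers or tiers[node] < tiers[dep]:
                tiers[dep] = tiers[node]
                stack.append(dep)
    return tiers


tiers = inherit_tiers(needs, assigned)
for name in sorted(tiers):
    print(name, tiers[name])
```

In this illustrative model oracle ends up in tier 1 because the tier 1 payroll process relies on it, even though the tier 2 sales process relies on it too. An organization using RTO or MTD values instead of tiers could apply the same idea by keeping the smallest time value.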
Notice how the BC/DR specialist is not making any fuzzy judgment calls here, but simply assembling information from managers, users, and technical staff and using it effectively to produce results that none of those subgroups could produce on their own because each lacks pieces of the puzzle!
The BC/DR specialist now has the two main criteria for choosing how to order the tasks of recovery: topological constraints and value metrics. Both of those attributes can be determined for every object represented in the dependency model. Now you know why certain IT components are more important than others and, logically, what must be restored ahead of what in order to plan supply logistics and ensure a smooth workflow. The two criteria are combined like this: work in topological order, and among the tasks available, prefer higher value (or priority) components, as this will get the high-value business functions operational soonest. Applying both criteria produces an ordering as seen in Table 2.
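One way to combine the two criteria in code is to keep the currently available tasks in a priority queue keyed by tier. The sketch below is an assumption-laden illustration, not the method used to produce Table 2: the `needs` mapping and the `tiers` values are hypothetical, it assumes a single team working on one task at a time, and ties within a tier are broken alphabetically.

```python
import heapq

# Hypothetical model: each key needs every object in its list.
needs = {
    "payroll": ["payroll_db", "ldap"],
    "sales": ["sales_db", "apache"],
    "payroll_db": ["oracle"],
    "sales_db": ["oracle"],
    "oracle": ["server_a"],
    "apache": ["server_b"],
    "ldap": ["server_b"],
    "server_a": [],
    "server_b": [],
}

# Assumed tiers after inheritance (1 = most critical).
tiers = {
    "payroll": 1, "payroll_db": 1, "ldap": 1, "oracle": 1,
    "server_a": 1, "server_b": 1,
    "sales": 2, "sales_db": 2, "apache": 2,
}


def restoration_order(needs, tiers):
    """Topological order that always picks the most critical available task."""
    unmet = {node: set(deps) for node, deps in needs.items()}
    dependents = {node: [] for node in needs}
    for node, deps in needs.items():
        for dep in deps:
            dependents[dep].append(node)

    # Start with every object that has no dependencies at all.
    ready = [(tiers[node], node) for node, deps in unmet.items() if not deps]
    heapq.heapify(ready)

    order = []
    while ready:
        _, node = heapq.heappop(ready)
        order.append(node)
        for parent in dependents[node]:
            unmet[parent].discard(node)
            if not unmet[parent]:
                heapq.heappush(ready, (tiers[parent], parent))

    if len(order) != len(needs):
        raise ValueError("circular dependency detected")
    return order


order = restoration_order(needs, tiers)
print(order)
```

With these assumed inputs, everything the tier 1 payroll process needs is restored, and payroll itself is back, before any tier 2 work begins. With several teams working in parallel the same queue still applies; each free team simply takes the next most critical available task.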
The analysis described here can be conducted long before a crisis to help in planning or simulating recovery and in estimating realistic time frames and logistical constraints associated with various failure scenarios. If an adverse event does occur, these preparations will reduce confusion and mistakes and will promote enlightened participation by empowering everyone with a schedule for coordinating restoration activities and by educating everyone involved.
Hopefully, I’ve succeeded in demonstrating that dependency modeling using graph theory provides a foundation for systematically analyzing complex IT environments. When intuition is overwhelmed, explicit management of dependency knowledge helps solve the scheduling problem to improve planning and ultimate recovery.
Daniel Evenson is the CTO of Pathway Systems. Before that he was a Unix and security consultant and systems architect for several large corporations and government agencies. He is an expert in the modeling and simulation of complex information.