Computer-Based Modeling and Simulation for BC/DR
- Published on Tuesday, 16 February 2010 16:33
- Written by Daniel Evenson
So far, most simulations we've encountered in business continuity and disaster recovery planning have been limited to the tabletop variety, conducted at a conference or perhaps coordinated internally within your organization. Such events typically involve groups of people role-playing to act out the events of a pretend disaster situation.
It's a useful way to raise awareness by helping participants understand how a particular disaster might play out and where some problem areas are. However, that's not the type of simulation I wish to discuss here. Let's talk about simulations conducted on a computer and their ability to help the BC/DR planning effort.
Scientists model and simulate global climate and engineers study vehicle design choices and their effect on crash safety. This type of simulation permits study of a complex system to better understand its behavior in various situations. For example, car designers might simulate vehicle crashes to determine the best method for attaching engines to cars, or where to build in crumple zones.
In the context of BC/DR, we wish to better understand how a company or organization will behave during a natural disaster or more mundane failures like loss of power or human mistakes. If we model how our company is put together, simulations can be run using that model to answer questions pertinent to BC/DR.
Climate researchers model how global heat flows are affected by atmospheric composition, solar radiation and ocean currents, to name just a few components of their models. They focus only on limited aspects of planet Earth. Likewise, to simulate business impacts we need not model everything about a company, but instead model only the sources of potential business disruption and the mechanism by which initial problems propagate into larger ones.
Business failures propagate via dependency relationships. A company's order taker needs the telephone and the ordering application. Shipping needs the bar code reader which accesses the product database to assemble orders. Sales managers need the sales report. Each company has a unique and intricate web of dependencies connecting everything from high level business functions to low level resources like computers and vendor relationships. Disrupt part of that web, and the effects may propagate far and fast.
This is the system we model, what we seek to better understand.
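As a rough illustration of how failures travel along dependency relationships, here is a minimal Python sketch. The dependency map is hypothetical, built from the examples above, and it assumes every dependency is a hard requirement: if something you need is down, you are down too. Real tools are far richer than this.

```python
from collections import deque

# Hypothetical dependency map: each item lists what it directly needs.
DEPENDS_ON = {
    "order taking": ["telephone", "ordering application"],
    "shipping": ["bar code reader"],
    "bar code reader": ["product database"],
    "sales management": ["sales report"],
}

def affected_by(failed, depends_on):
    """Return every item that directly or indirectly needs `failed`."""
    # Invert the map so we can walk from a resource to its dependents.
    dependents = {}
    for item, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(item)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(sorted(affected_by("product database", DEPENDS_ON)))
# prints ['bar code reader', 'shipping']
```

Losing the product database never touches shipping directly, yet shipping still stops because the bar code reader sits in between.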
Over time we build a model of this in our heads and can intuitively identify causes of failures or potential threats. Some of us may enter the BC/DR role blessed with a good mental model of the company's inner workings and its unique dependencies. For others the job focus is new, or the company too complex, and it requires intensive research and unraveling to comprehend.
This model building, done intuitively or more actively through interviews and documentation, can now be conducted with simulation capability in mind. Explicit modeling of dependencies helps catch non-intuitive dependency relationships which might otherwise be overlooked. Conducting simulations with that model will reveal the dangers of such configurations. You don't truly understand a thing until you've tried to build one, or in our case a model of one.
Tools are now available to model dependencies down to the IT component level, including not only hardware and software resources but people, locations and abstract objects like business processes or departments. If we model those dependencies, we produce excellent documentation of our environment even long before any simulation is attempted. Once the model encompasses enough information, computers can perform simulations of how certain failure scenarios might propagate disruption throughout the company.
What happens if a certain power system fails? How about the loss of key personnel, or the total loss of a given location? Answering these questions lets us peek into the future and estimate the fraction of service that remains and volume of work needed to recover.
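A what-if question of this kind might be sketched as follows. Everything in the model is hypothetical (names, kinds and links are invented for illustration), dependencies are treated as hard requirements, and the count of downed items is only a crude stand-in for recovery effort.

```python
# Hypothetical model: name -> (kind, list of direct dependencies).
MODEL = {
    "order taking": ("function", ["order clerk", "ordering application"]),
    "shipping": ("function", ["warehouse"]),
    "payroll": ("function", ["payroll server"]),
    "ordering application": ("software", ["app server"]),
    "app server": ("hardware", ["datacenter A"]),
    "payroll server": ("hardware", ["datacenter B"]),
    "datacenter A": ("location", ["power feed A"]),
    "datacenter B": ("location", []),
    "power feed A": ("utility", []),
    "warehouse": ("location", []),
    "order clerk": ("person", []),
}

def simulate(initial_failures, model):
    """Return the set of all items that end up unavailable."""
    down = set(initial_failures)
    changed = True
    while changed:  # repeat until no new failures appear
        changed = False
        for name, (_kind, needs) in model.items():
            if name not in down and any(need in down for need in needs):
                down.add(name)
                changed = True
    return down

def service_remaining(initial_failures, model):
    """Fraction of business functions still running after the failures."""
    functions = [n for n, (kind, _needs) in model.items() if kind == "function"]
    down = simulate(initial_failures, model)
    return sum(1 for f in functions if f not in down) / len(functions)

scenario = {"power feed A"}
print(f"{service_remaining(scenario, MODEL):.0%} of functions remain")
print(f"{len(simulate(scenario, MODEL))} items need recovery")
```

In this toy model the power failure takes out five items but only one of three business functions. The same two calls answer the key-personnel question by passing `{"order clerk"}` instead.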
Analyzing particular scenarios is the most obvious use for simulation, but we can move beyond the one-off, tabletop concept of simulating scenarios. By automating scenario simulation and conducting it wholesale, new doors open. Instead of picking a few interesting scenarios to simulate, we can perform simulations of failure against every single resource upon which we depend; it doesn't matter if we have a thousand or a million.
Imagine that: analyzing thousands of potential failures in a few minutes. With that capability, a new class of questions can be asked and answered. Instead of simply asking "what happens if," we can ask: what would be the worst-case scenario? What are the top 10 worst? How would you like to rate every potential equipment failure in order of importance to the company? Or analyze the relative value of each vendor relationship to your company's mission? Modeling and simulation can tell us this.
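To make the wholesale idea concrete, here is a small sketch that fails every resource in turn and ranks the results. The model and the scoring rule (number of business functions lost) are assumptions chosen for illustration; a real ranking would likely weigh revenue, recovery time and more.

```python
# Hypothetical model: each item maps to the items it directly needs.
DEPENDS_ON = {
    "order taking": ["telephone", "ordering application"],
    "shipping": ["bar code reader"],
    "bar code reader": ["product database"],
    "ordering application": ["product database"],
    "sales reporting": ["sales report"],
    "sales report": ["product database"],
}
BUSINESS_FUNCTIONS = {"order taking", "shipping", "sales reporting"}

def simulate(failed, depends_on):
    """Return everything that goes down when `failed` goes down."""
    down = {failed}
    changed = True
    while changed:  # repeat until no new failures appear
        changed = False
        for item, needs in depends_on.items():
            if item not in down and any(need in down for need in needs):
                down.add(item)
                changed = True
    return down

def rank_failures(depends_on, functions):
    """Fail each resource once; sort by business functions lost."""
    everything = set(depends_on)
    for needs in depends_on.values():
        everything.update(needs)
    scores = []
    for resource in everything - functions:
        lost = simulate(resource, depends_on) & functions
        scores.append((resource, len(lost)))
    # Worst first; ties broken alphabetically for a stable report.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))

for resource, lost in rank_failures(DEPENDS_ON, BUSINESS_FUNCTIONS)[:3]:
    print(f"{resource}: {lost} function(s) lost")
```

The product database tops this list because all three functions reach it through different intermediaries, which is exactly the kind of ranking that is hard to produce by intuition once the model holds thousands of items.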
This information goes a long way in helping you allocate limited resources for preparation and recovery. The reasons behind such allocation decisions are now rational and backed up by reproducible evidence. If you think this is pie in the sky, it's not. The technology is becoming accessible to do exactly this, and just in time.
Cloud computing appears to reduce our need to manage messy resources like physical computers requiring space, power, cooling and hardware technicians. But this only shifts these responsibilities to someone else, perhaps with more specialized skill and greater economies of scale. The threat of failure, however, still exists and in many ways is more obscure, more likely to creep up on us.
If you've buried all your treasures on islands in the South Pacific, you'd better make a good treasure map to keep track of them. So as we deploy resources to the cloud, and outsource more services, we must be more explicit in documenting and analyzing their existence, use, and behaviors.
If you have physical assets, your dependency model includes servers, switches, your datacenter, etc. If it's outsourced, that model includes network connections, access accounts, bill payment processes, and the vendor itself. The types of objects in the model change, but the problem remains. Instead of simulating a disk failure, you'll simulate the impact of not paying your bill, or of your vendor going out of business. Even missing a trivial $35 payment could take down your business. It happened to Microsoft not once, but twice!
(Microsoft's domain registrations for passport.com in 1999 and hotmail.co.uk in 2003 both lapsed because the renewal fees were not paid. Although only the passport.com event caused actual downtime, neither event helped the company's reputation.)
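The same kind of model works when the objects are accounts, payments and vendors rather than disks. The sketch below is a hypothetical outsourced setup with invented names, and it assumes every listed dependency is a hard requirement.

```python
# Hypothetical outsourced environment: item -> what it directly needs.
DEPENDS_ON = {
    "customer web site": ["cloud hosting account", "domain registration"],
    "cloud hosting account": ["hosting vendor", "hosting bill payment",
                              "network connection"],
    "domain registration": ["registrar", "renewal fee payment"],
    "hosting bill payment": ["accounts payable process"],
    "renewal fee payment": ["accounts payable process"],
}

def is_available(item, failed, depends_on, _stack=()):
    """An item is available if it has not failed and all it needs is available."""
    if item in failed:
        return False
    if item in _stack:  # guard against circular dependencies
        return True
    return all(
        is_available(need, failed, depends_on, _stack + (item,))
        for need in depends_on.get(item, [])
    )

# What does one missed renewal fee do to the web site?
missed = {"renewal fee payment"}
print(is_available("customer web site", missed, DEPENDS_ON))      # prints False
print(is_available("cloud hosting account", missed, DEPENDS_ON))  # prints True
```

The hosting account is perfectly healthy in this scenario, yet the site is unreachable, which is why the payment process belongs in the model alongside the technical pieces.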
Modeling and simulation gives us a framework for collecting and using the knowledge we assemble about our companies. As we model the deployments and dependencies, we create excellent documentation assets in the form of diagrams detailing what needs what, and where it's located. We also produce the model which lets us answer questions like why is this server considered Tier 1, or why do we need a four-hour response time on that particular support contract? It's just a matter of connecting the dots, or more specifically the dependency relationships.
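Connecting the dots can be as simple as tracing a chain of dependencies from a business function down to the resource in question. The following sketch uses a hypothetical model with invented server names; it returns the first chain it finds, not every chain.

```python
# Hypothetical model: each item maps to the items it directly needs.
DEPENDS_ON = {
    "order taking": ["ordering application"],
    "ordering application": ["app server 3", "product database"],
    "product database": ["db server 1"],
    "monthly newsletter": ["mail server"],
}

def explain(start, resource, depends_on, path=None):
    """Return one dependency chain from `start` to `resource`, or None."""
    path = (path or []) + [start]
    if start == resource:
        return path
    for need in depends_on.get(start, []):
        if need in path:  # skip circular dependencies
            continue
        chain = explain(need, resource, depends_on, path)
        if chain:
            return chain
    return None

chain = explain("order taking", "db server 1", DEPENDS_ON)
print(" -> ".join(chain))
# prints order taking -> ordering application -> product database -> db server 1
```

If order taking is a Tier 1 function, the printed chain is the justification for treating db server 1 the same way. The newsletter, by contrast, has no chain to that server at all.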
Such a sophisticated approach to understanding a business might seem like overkill for smaller companies. But you'd be amazed at how quickly the challenge of comprehending a company's dependencies becomes overwhelming without a good strategy for tackling it. In the not-so-distant past, before virtualization and Web technologies, the rate at which complex dependencies crept into use was limited by physical constraints like the time it takes to procure new hardware or the lag in approving and engaging a new vendor. Now change happens quickly. New operating systems are deployed in seconds, or migrated to other datacenters on the fly. Business is moving faster, so the exposure to these dependencies is growing and constantly in flux. By modeling and simulating dependencies we give the BC/DR planning and response efforts new capabilities to protect business operations from the risks of failure.
Daniel Evenson is the CTO of Pathway Systems. Before that he was a Unix and security consultant and systems architect for several large corporations and government agencies. He is an expert in the modeling and simulation of complex information.