A "good start" is not that hard. An independent or internal auditor's comments are usually enough to compel a senior manager to move forward with a disaster recovery development effort. A near miss or a direct hit can also stimulate sufficient interest in developing Business Resumption Plans (BRPs) for your company. Unfortunately, after a few senior management presentations and perhaps a hot site test for your data center, the program will often enter a long and painful development stage - a development stage that can only be compared with the work of Sisyphus who was forever having to push a rock to the top of a hill only to have it roll back down. Similar to the old myth, many BRP projects never quite make it over the top.
Of course, BRP is never finished in the sense that it is always changing to reflect changes in the organization. It is, however, reasonable to expect all parts of your BRP will be developed and tested. It is also reasonable to expect that a process is developed and resources allocated to maintain the plans. This is what we would call a "done" project.
Before you read any further, ask yourself this question: "Is my BRP project done?" If you can answer "yes" - congratulations! Perhaps you could share your methods in these pages in a future issue. If you said "no," then you are welcome to the ideas presented here. Of course, there are many ways to successfully develop BRPs; this article describes the approach that worked for us at UNUM.
Conventional wisdom says that in order for any BRP effort to be successful, it needs senior management support. However, we hear from many of our colleagues that the support they received at the beginning of the development effort was not enough to sustain their project through to completion. So what can you do to maintain senior management support? The answer for us was to help senior managers relate to the effort and keep them interested in the project on an ongoing basis. That may sound easy and obvious, but if you are sensing a loss of senior management support, you have to ask yourself if you have really created a vision that is meaningful to them.
It has taken us about two years of sustained and rigorous focus to plan and develop a comprehensive, enterprise-wide business resumption capability at UNUM. For most organizations, as it was with UNUM, the Business Impact Analysis (BIA) was a helpful tool to get started, but we quickly became aware of its limitations. The BIA helped us to identify the major threats to our organizations and the impact of having one of those disasters wipe out one or more of our operations. However, the probability of any of the "BIG" threats occurring is fairly small. Most senior managers intuitively know that those threats exist and what the general impact would be to the organization. We found that you won't get much ongoing support from senior managers by running "The Big One" up the flag pole over and over again.
Most senior managers are optimists by nature. They look beyond obstacles and focus on opportunities. There is little appeal for them in looking at the infinitesimally small likelihood of huge disasters. On the other hand, an effort that will address the safety of the organization's people, mitigate the impact on customer service, and maintain the financial well-being of the company offers much more powerful motivators for senior management. These goals speak to the need to create a positive work environment and to ensure strong competitive positioning of your company in the marketplace.
Most BRPs are developed on a "worst case scenario" and, for most companies, it makes little difference if a building is destroyed by an earthquake or a flood, a fire or a bomb. Your building is gone and you need to recover. Consequently, we think that it is wise to invest most of your limited time, money and energy in the development of recovery plans at the functional, or work group, level as opposed to focusing on a detailed Business Impact Analysis.
By moving quickly from the analysis stage to the development of the recovery plans for the functional areas, we were able to identify ways to increase operating efficiency; build a stronger, more stable organization; and improve our position in the marketplace. As we identified our key suppliers, we asked them to show us their BRPs. We now require suppliers that support our critical business functions to have a BRP. Even more importantly, many of our prospective customers are beginning to ask if we have BRPs before buying insurance from us. Keeping management informed of and involved in these less obvious but critical benefits will help you maintain ongoing support from the highest levels throughout the project.
In order to muster support and resources, it is necessary to create a high-level strategy for the effort. Senior managers want to know the status of the project and how resources are being used. This is not just about getting people assigned to the project. An even more important decision is where those resources are positioned in the organization for maximum effectiveness. When positioning your resources there are a couple of extreme methods to choose from: 1) have a central staff exclusively dedicated to developing plans for all areas, or 2) have each functional team pick a person to develop their own plan. Each of the options has some merit but we don't think either of them is very attractive.
Having staff areas responsible for developing BRPs can provide some efficiencies, but it puts the emphasis on recovery in the wrong place in the organization. The functional business areas need to take full responsibility for developing and maintaining their BRPs. After all, if an incident wipes out a business function, it won't be the people in the staff area who will recover the operation. It will be the recovery team for the functional business area that will be responsible for recovering the operation. Ownership of the plan clearly belongs in the functional business area. Also, it can be surprising how quickly staff areas can lose touch with the "front-line reality". Then, there is the not-so-small matter of expenses: after all, how much money will your organization provide to a staff area to visit and work with every field location to develop and maintain BRPs?
The latter option presents a different set of issues. While making each functional business area responsible for developing its own plan places the responsibility for recovery in the right place, this option lacks focus and consistency. The most significant drawback is a lack of coordination and prioritization among functional areas. With no central area coordinating the effort, the functional teams are left to their own resources to develop plans. Each functional area often sees itself as the most important area for the organization to recover. The quality and consistency of plans across functional teams can also be hard to establish and even harder to maintain. Also, functional teams will often compete for limited resources, such as equipment and space, following a disaster. You may find that two or more teams list the same resource as critical to their recovery efforts.
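The overlap described above is easy to check for once each team's critical resources are recorded in one place. The following is a minimal Python sketch of such a check, not a description of any tooling used at UNUM; the team names, resource names, and the function `find_contested_resources` are hypothetical.

```python
from collections import defaultdict


def find_contested_resources(critical_resources):
    """Return resources that more than one team lists as critical.

    critical_resources maps a team name to the set of resources
    that team says it needs in order to recover.
    """
    claimants = defaultdict(list)
    for team, resources in critical_resources.items():
        for resource in resources:
            claimants[resource].append(team)
    # Keep only the resources claimed by two or more teams.
    return {r: sorted(teams) for r, teams in claimants.items() if len(teams) > 1}


# Hypothetical teams and resources, for illustration only.
plans = {
    "Claims": {"cafeteria A workstations", "fax machine"},
    "Billing": {"cafeteria A workstations", "check printer"},
    "Underwriting": {"fax machine"},
}
contested = find_contested_resources(plans)
```

A central coordinator could run a check like this across all plans and then settle the priorities, rather than leaving competing teams to discover the conflict during a recovery.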
So if these options generally don't work, what does? We decided to draw on the strengths of both options by developing a central control area responsible for coordinating the overall project and enlisting a small group of BRP planners who are located within the line operations.
During the initial concentrated effort to develop and implement the BRPs, the central control area was staffed by two full-time resources. This area reported directly to the president of the company and was responsible for coordinating the development effort, training the line planners, maintaining the central database and managing scarce resources. These individuals were also responsible for supporting the Incident Management Team and the Emergency Operations Center, if these needed to be activated.
As for the BRP planners, each major area of the organization made resources available, usually part-time resources (at least 1/2 time) familiar with the areas for which they would develop plans. Having planners from the areas for which they would develop plans was important in two ways. Much of the success in getting resumption plans developed in a timely fashion comes from the relationship that the planners have with the areas. Not only are the planners familiar with the areas, which gives them knowledge of the key people and resources, but the areas are familiar with them. With this relationship already established, the development of the plans goes much more smoothly and quickly.
Once we had senior management support and involvement, our organization in place, and the planners trained, the scope of our project was still quite large. We initially determined that there were approximately 95 functional area teams covering 4,500 people throughout North America that needed recovery plans developed, and we determined that it would take two years to complete the development of all BRPs.
Before we actually began developing the recovery plans for the functional areas, we decided to create an infrastructure to support their recovery. We focused our attention on three critical areas. The first was the development of our Incident Management Plan and the Incident Management Team. Next we created an infrastructure within our organization to support any recovery effort, i.e. replacement workstations, key supplies, etc. Finally, we focused on data center recovery procedures.
By focusing on the development of the Incident Management Plan, we were able to quickly get the members of the Incident Management Team, mostly senior level managers, involved with the development of a comprehensive recovery effort. Through a series of gradually more complex rehearsals of the Incident Management Plan, we were able to challenge the members of the Incident Management Team and demonstrate the importance of being prepared to handle any incident. When senior managers can relate to a disaster at their level and work through the problems the organization would face following an incident, they become more involved with and supportive of BRP. Nothing goes further at the functional level than a resounding endorsement of the importance of the effort by the senior manager of that area.
As the Incident Management Team was becoming proficient in handling mock disasters, we were making changes to the infrastructure of the organization. Some of the changes we made included the development of two Emergency Operations Centers for the Incident Management Team to use during a recovery effort. Having real Emergency Operations Centers for the Incident Management Team provided something tangible for the senior managers to grasp. They were able to work through the mock disaster under as real conditions as possible. Their familiarity with the Emergency Operations Center and how they would operate in it helped to create an experience that was tangible and tested their skills.
Within Portland, Maine, UNUM is "geographically diverse." By that we mean that we are located in several buildings spread over a wide geographical area in greater Portland. This provides us with several options from a recovery perspective that we would not have if we were located in a single building. Using our geographic diversity to our advantage, we developed floor plans and conditioned two of our cafeterias in different buildings with all the cabling (phones and data lines) necessary to support up to 300 workstations. This gave us the ability to put up to 300 employees back to work in a relatively short period of time after an incident. It also provided another tangible deliverable for senior management to work with when recovering from a mock disaster.
In addition to the Incident Management Team and the infrastructure changes that were being developed, we quickly started working on in-depth procedures for the recovery of our data center, communication networks, LANs, and other elements of our "information backbone." This involved, among other items, a hot site test at IBM's Sterling Forest Facility. Our business depends on our ability to access our computer systems and applications. We have a very complex computing environment and switching to a manual process is not an option when or if our computer system is down. Having a Data Center Recovery Plan in place was a cornerstone of our BRP foundation.
With the Incident Management Team, the Data Center Plan and the infrastructure components in place, we had a solid foundation from which to launch into developing the functional area recovery plans. As stated earlier, when we first analyzed the number of functional areas needing plans, we counted approximately 95. However, the more we got into the development cycle, the more areas we found required plans. This was due in part to identifying dependencies between areas, but also, as more and more areas completed their plans, the areas that we had either overlooked or omitted wanted to know when their plans were scheduled for development. You might say it became fashionable to have a BRP.
At current count we have 177 plans, almost double our original estimate. The scope of our project had doubled, yet the amount of time we had identified to develop the plans did not increase. Fortunately, because of efficiencies that we discovered along the way, we not only developed all 177 plans on time, we also wrapped up the development of the plans for all functional areas six months earlier than we originally anticipated.
In the planning stage, before we actually developed any plans, we put together a comprehensive development "handbook". That handbook had many tools to help the functional areas define their recovery needs and develop comprehensive plans. The tools included data flow techniques, worksheets, scoring grids, etc. All these tools were useful in their own way, but practically speaking, they significantly bogged down the development cycle. Using them all would have resulted in collecting an overwhelming amount of detail that would unnecessarily slow the process and be next to impossible to maintain.
As we developed more plans, the key components of an effective recovery plan became apparent. One of the key assumptions we made at the beginning of the project was that the expertise required to recover an area would survive the incident. This relieved us from having to develop personnel-specific BRPs. With that in mind, the key ingredients required for an effective recovery strategy were threefold.
We needed to identify:
- the key members, and appropriate alternates, within a functional area who have the expertise to recover an operation following an incident,
- a way to activate those people following an incident, and
- a place for them to meet to begin their recovery effort.
With these three things, even though you might not have all the recovery procedures documented, you will have the right people in the right place at the right time. Identifying these items was the first thing we did when we initially met with each functional area.
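The three ingredients listed above can be captured in a very small record per functional area. The sketch below is one possible shape for such a record in Python, with a check that reports which ingredients are still missing. The class names, field names, and sample values are assumptions made for illustration and do not reflect the plan format actually used at UNUM.

```python
from dataclasses import dataclass, field


@dataclass
class TeamMember:
    name: str
    phone: str
    alternate: str = ""  # name of the alternate; empty if none is assigned


@dataclass
class RecoveryPlan:
    functional_area: str
    key_members: list = field(default_factory=list)
    activation_procedure: str = ""  # how the team is called out
    meeting_place: str = ""         # where the team gathers to begin recovery

    def missing_ingredients(self):
        """List which of the three ingredients are not yet filled in."""
        missing = []
        if not self.key_members or any(not m.alternate for m in self.key_members):
            missing.append("key members with alternates")
        if not self.activation_procedure.strip():
            missing.append("activation procedure")
        if not self.meeting_place.strip():
            missing.append("meeting place")
        return missing


# Hypothetical plan with no meeting place recorded yet.
plan = RecoveryPlan(
    functional_area="Claims",
    key_members=[TeamMember("A. Example", "555-0100", alternate="B. Example")],
    activation_procedure="Team leader starts the call tree",
)
gaps = plan.missing_ingredients()
```

Keeping the record this small fits the point made above: a plan with these three items filled in is already useful, even before the detailed procedures are written.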
Once the functional area recovery team was in place and the team activation procedures were developed, we worked on developing the recovery strategy and the recovery procedures. This was followed by assigning responsibility for the recovery of each procedure to a team member. In the event that either the primary or alternate team member was not available during a recovery effort, the documented procedures provided the second or third alternate with the information required to recover the business function.
After the procedures were developed, we identified the material resources that the team would need to recover its operation. We did not attempt to develop our plans as a way to recover functionality at a normal operating level. We focused on what it would take to operate at a survival level. We catalogued the minimum number of workstations, unique software, access to fax machines and copiers, forms, supplies, customer and vendor contacts, etc. required to recover a business function.
During the development of the plans for the functional areas, and often as a by-product of developing the recovery procedures, we identified key exposures the functional areas had should an incident wipe out their capability to operate. The exposures included things like a lack of proper LAN back up procedures and a dependence on single source paper documents. These exposures were documented in the recovery plans and senior management was made aware of what they were. In fact, through the ongoing auditing of our plans we are able to monitor progress toward resolving the exposures, and management is kept abreast of where the company faces its greatest risks.
After our foundation was built and the majority of the functional area plans were developed, it was time to shift attention to addressing the exposures that we identified during the development of BRPs. Some of the simple exposures, such as having duplicate documentation stored off-site, can be addressed within the individual business areas. Addressing exposures that transcend the business areas may require more global solutions and coordination across business units. Others might require the functional business area to start working with technical support organizations and other areas to eliminate the exposure.
We have found that having the functional business areas drive the effort to eliminate an exposure is much more efficient than having a support area try to convince a business unit that they need to work on the technical aspects of their recovery plan. The reason for this, again, is that the functional business area owns the recovery effort. It also has the responsibility to eliminate the exposures that it faces. If the functional business area does not see the value in eliminating an exposure, or is willing to assume the risk associated with not eliminating it, then it is unlikely that a support unit will convince them to free up resources to work on addressing it.
In some instances exposures will impact a large number of functional business areas. These exposures require special attention. Since it would not be efficient for each functional business area to develop its own strategy to address these global exposures, a centralized effort works much better. At UNUM we have identified several areas that fall into this category. Three of the key exposures that we identified are outlined below.
At UNUM, our data center is responsible for restoring the primary operating systems of our mainframe following an incident. However, it is the responsibility of the systems support organizations within the functional business areas to restore the applications that support the business functions. This makes sense in a large organization with several complex systems supporting different areas. The systems support organizations are the most familiar with the business functions and supporting applications. Following a disaster they will be the people who are best qualified to recover and test those applications.
During the process of developing BRPs for our functional business areas we discovered that there was a common misunderstanding that the applications and data files that support the functional business areas were being properly vaulted. It was more than a little startling for the managers in the functional business areas who assumed that they were protected to learn that they might not be able to recover their systems following an incident.
Developing a separate strategy to recover the applications within each functional business area would have created more problems than it would have solved. We put together a team of representatives from each of the system support units. This team developed a solution that addressed all applications across the company. Our initial solution was to perform weekly full-volume backups of our production data and applications. Longer term we are working on developing the capability to create a mirror image of our systems so that we will be able to almost instantaneously restore capability following an incident at our data center.
As Local Area Networks take more of a front seat in the data processing world, the amount of data and applications on the LANs that support critical business functions increases accordingly. We found that there was no consistency across the functional areas with regard to backing up their LANs. Most functional areas that we worked with did indeed back up their LANs on a regular basis. However, when asked where the backup tapes were stored, many of them indicated that they were kept in the LAN room on top of, or next to the server. As we all know, this does not constitute a backed-up LAN.
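The question we asked each area, "where are the backup tapes stored?", can be turned into a simple audit once the answers are recorded. The Python sketch below flags any LAN that is not backed up or whose tapes sit in the same place as the server. The record layout, the location strings, and the function `backups_at_risk` are hypothetical, and comparing location names is only a rough stand-in for a real off-site storage check.

```python
def backups_at_risk(lans):
    """Return the names of LANs whose backups would be lost with the server."""
    at_risk = []
    for lan in lans:
        no_backup = not lan["backed_up"]
        # Tapes kept beside the server are destroyed along with it.
        same_place = lan["tape_location"] == lan["server_location"]
        if no_backup or same_place:
            at_risk.append(lan["name"])
    return at_risk


# Hypothetical survey answers, for illustration only.
lans = [
    {"name": "Claims LAN", "backed_up": True,
     "server_location": "Building 1 LAN room",
     "tape_location": "Building 1 LAN room"},
    {"name": "Billing LAN", "backed_up": True,
     "server_location": "Building 2 LAN room",
     "tape_location": "off-site vault"},
    {"name": "Marketing LAN", "backed_up": False,
     "server_location": "Building 3 LAN room",
     "tape_location": None},
]
flagged = backups_at_risk(lans)
```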
As for our desktop workstations, like most companies that have several hundred autonomous workstations distributed throughout the organization, the backup procedures for personal computers, if they exist at all, are rarely followed or lack consistent enforcement. This is an area that can prove to be a tremendous exposure to an organization.
At UNUM, our policy is to make sure that all critical files, whether on a distributed LAN or on a workstation hard drive, are backed up to our consolidated LAN environment. This provides us with recovery capability even if only one workstation is destroyed by something as simple as a hard drive crash.
Probably our single biggest exposure is our dependence on the information that is contained in the paper files throughout the organization. These files often contain single-source, confidential documents that would be difficult if not impossible to recover following an incident. At UNUM, we are researching and will launch a document retention and retrieval program using image processing to protect our documents and to be able to recover them following a disaster. Image processing will allow us to scan documents in one location, store them in a separate geographical area of the country, and retrieve them in a third location.
The shelf life of a Business Resumption Plan is quite short. In as little as 6 to 12 months BRPs can lose most if not all of their value. They quickly become outdated because of changes in the organization, new procedures, new systems and changes in personnel. Over time, if the BRPs are not reviewed, updated, and rehearsed, people will forget what is expected of them. Consequently, we developed a structured maintenance process to ensure that our plans are kept as current as possible.
When the development of the majority of our plans was completed, we began working on creating our maintenance program. It is true that getting through the first level development cycle was critical to the success of our initiative. However, having an effective maintenance program in place will ensure that the investment that we have made in developing 177 functional area plans is protected.
Our maintenance cycle runs twice a year during off-peak processing periods. We have scheduled May and October as the times of the year when we will review all plans. Should there be a major change in the organization between cycles, the impact of those changes will be taken care of separately.
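A twice-yearly schedule like this is simple to automate as a reminder. The Python sketch below returns the start of the next maintenance cycle for a given date. It assumes, purely for illustration, that each cycle begins on the first day of May or October; the article does not state exact start dates, and the function name `next_cycle_start` is hypothetical.

```python
from datetime import date

CYCLE_MONTHS = (5, 10)  # May and October


def next_cycle_start(today):
    """Return the first upcoming cycle start date on or after today.

    Assumes a cycle starts on the first day of its month.
    """
    for month in CYCLE_MONTHS:
        start = date(today.year, month, 1)
        if start >= today:
            return start
    # Both of this year's cycles have started; roll over to next May.
    return date(today.year + 1, CYCLE_MONTHS[0], 1)


# Example: a plan reviewed in June is next due when the October cycle starts.
due = next_cycle_start(date(1995, 6, 2))
```

Major organizational changes that fall between cycles would still need to be handled separately, as noted above.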
During the maintenance cycles we will also be able to address any quality issues with the functional area plans. These issues may include the development of more detailed procedures, the identification of additional resources required to recover an operation, and normal maintenance items such as updating team member telephone numbers. The first two or three maintenance cycles will be the most demanding in making sure the plans are comprehensive and of consistent quality. As each cycle goes by, the plans will become more and more refined and the changes to the plans will become less extensive.
Eventually you will also need to address your ongoing planning resource requirements. After the initial plan development, the effort needed to maintain the plans should be about 25 percent of the original development effort. In developing our plans at UNUM, we expected that the planners who worked on developing plans for 12 to 24 months would transition off the project. Completing the development of 177 functional area plans and three incident management plans, and building an infrastructure to support a recovery operation, in less than two years were major accomplishments, and we wanted to make sure that we celebrated our success. However, we needed to identify resources for the ongoing maintenance of our plans.
As part of the transition from the development phase to the maintenance phase, the development planners will be required to train their replacements. We will use the first maintenance cycle as the opportunity to train the new planners. This will familiarize the new planners with the area that they will be supporting, the plans for those areas and the overall BRP process. Having the new planners perform a maintenance cycle together with the seasoned planners seemed to be the best form of "on-the-job training."
We do not pretend to have all the answers for your organization detailed in this article. Every organization has different dynamics that will dictate what it needs to do to be successful in developing its own BRP program. We do hope, however, that this article will help BRP practitioners make their efforts more effective. Using the approach outlined above, we were able to develop and implement the entire plan for a large and complex organization in less than two years and provide a solid foundation for the future.
Chris Glancy is a BRP Consultant for UNUM Corp., and Piotrek Stamieszkin is the Director of BRP for UNUM Corp. UNUM Corp. is based in Portland, ME with operations in the U.S., Canada, the U.K., the Pacific Rim, Europe, and Bermuda.