In the business world, computer disaster recovery planning is evolving toward business continuity planning. In recognition of this trend, in 1995, DRI International, an organization founded in 1988 to provide a base of common knowledge in continuity planning, replaced the Certified Disaster Recovery Planner (CDRP) designation with Certified Business Continuity Planner (CBCP). What is the difference between disaster recovery and continuity planning? In theory, a disaster recovery plan is reactive and usually focuses on the computing environment.

Although work is done to harden the computing infrastructure to prevent a disaster, the plan’s main purpose is to recover from damage to the infrastructure. In contrast, a business continuity or contingency plan is not only proactive, but it is also targeted at keeping the business running, and not just recovering the computers.

Many companies today do not have a working continuity plan. Of those companies that do develop a continuity plan, many proceed without sufficient knowledge or input from end users.

For example, auditors or managers often direct someone within the IT department to write a plan to back up the company’s data centers. Frequently, the IT operations staff backs up everything running on a particular system, or even the entire data center, so that all information, critical or not, is recovered at the same time – even if the business function the data supports either is not critical or can be replaced by manual procedures.

Also, because end users are not involved in developing the continuity plan, their manual procedures, physical facilities, hard-copy records, and other special needs are often overlooked. Thus hardware and applications are being recovered, but not the business processes that use them.

In a business continuity planning program, individual business functions identify their critical business processes and develop separate (but coordinated) continuity plans for each of them.

The benefits of this distributed approach are many:

-Business processes that are not critical do not hinder those that are, so that limited resources can be used effectively.
-Infrastructure that supports noncritical business processes does not get recovered.
-Multiple critical business processes/applications can be recovered in parallel.
-Applications that normally run on different systems can be recovered on the same system, if necessary.

The survival of your business after a disaster depends on having a continuity plan in place. This article details the procedures for launching a continuity planning program.

Beginning the Continuity Planning Process

Where do I start?

Suppose you are beginning a corporate continuity planning process for the Absolutely Best Company (ABC), a manufacturer of top-quality widgets (an imaginary company and product).

There are several steps involved, as detailed in the following paragraphs.

Gain management commitment

For the continuity plan to be successful, management must be committed at the highest level.

The plan must be part of the strategic business plan, and the company must budget appropriately and separately for the continuity planning program.

Identify critical business functions

Assuming that you have management support, the next step is to identify how the company obtains its revenue in terms of business functions. For example, ABC first takes orders for widgets and then builds the widgets to meet those orders. Next, the widgets are installed, and the customers are billed. Finally, the employees are paid so that the process will continue.

Other revenue comes from service and support, but because those are distributed functions, an event that has an impact on the corporate site should not affect them adversely.

Once you have defined the gross critical business functions (not the infrastructure, such as computer applications, that supports them), a risk assessment and business impact analysis should be performed for each of the business functions and then, if appropriate, for the infrastructure supporting them. Remember also to analyze dependencies because a business function that appears noncritical could be supporting one that is critical.
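
The dependency check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the ABC dependency data, and the helper `effective_critical` are hypothetical and are not part of any standard tool. The idea is to start from the functions ranked critical and follow their dependencies until no new supporting functions turn up.

```python
from collections import deque

def effective_critical(critical, depends_on):
    """Return every function that is critical or supports a critical one.

    critical:   set of functions ranked critical in their own right.
    depends_on: dict mapping a function to the functions it relies on.
    """
    result = set(critical)
    queue = deque(critical)
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, ()):
            if dependency not in result:
                result.add(dependency)
                queue.append(dependency)
    return result

# Hypothetical ABC data, for illustration only.
critical = {"order taking", "manufacturing"}
depends_on = {
    "order taking": ["credit checking"],
    "manufacturing": ["purchasing"],
    "purchasing": ["mail room"],
    "cafeteria": [],
}

effective = effective_critical(critical, depends_on)
print(sorted(effective))
```

In this made-up data the mail room looks noncritical on its own, yet it ends up on the list because purchasing, and therefore manufacturing, depends on it.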

Build business process core teams

You start by building business process core teams consisting of information technology (IT) operations management, end-user management, applications support staff for each critical business function, and the records management department. This team technique is called the Delphi method, and hence the team is called the Delphi team. Through the Delphi teams, you develop a clearer view of the infrastructure (for example, processes, records, IT applications) the teams believe are critical to performing their business functions.

Build a corporate team

You should also build a corporate team, consisting of members from the accounting, auditing, information technology, facilities, human resources, legal, public relations, investor relations, purchasing, postal services, records management, risk management, safety, security, shipping/receiving, and telecommunications departments. In a disaster, not only will these departments be required to continue their support roles, but also they may have to implement major infrastructure changes to support the affected areas.

Risk assessment and business impact analysis

The purpose of a risk assessment and business impact analysis is to answer the following questions:

-What am I trying to protect?
-What am I trying to protect it from?
-What controls are currently in place or needed to prevent or minimize the effects of potential loss?
-How much am I willing to spend on those controls?
-Is the money I am spending effective?

Thus the risk assessment involves identifying threats, vulnerabilities, risks, and the business impact of a disruption for each entity. Before you begin the ranking process, determine what criteria to use. Generally, they are split between quantitative and qualitative. Quantitative losses can be expressed as a number, such as an annualized loss exposure (ALE).

To start the risk assessment, rank all of the entities whose loss could negatively affect business, gain consensus from each Delphi team, and then merge the results for presentation to and concurrence by upper management.
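
As a rough sketch of the quantitative side of the ranking, the example below assumes the commonly used formulation in which an annualized loss figure is the estimated loss per event multiplied by the expected number of events per year. The article does not prescribe a formula, so treat that as an assumption. The entities and dollar figures are invented for illustration.

```python
def annualized_loss_exposure(loss_per_event, events_per_year):
    """Assumed formulation: expected yearly loss = loss per event x yearly frequency."""
    return loss_per_event * events_per_year

# Hypothetical entities: (estimated loss per event in dollars, expected events per year).
entities = {
    "order entry system": (200_000, 0.5),
    "widget assembly line": (1_000_000, 0.05),
    "payroll records": (40_000, 0.25),
}

# Rank the entities from highest to lowest annualized loss exposure.
ranking = sorted(
    entities,
    key=lambda name: annualized_loss_exposure(*entities[name]),
    reverse=True,
)

for name in ranking:
    print(f"{name}: {annualized_loss_exposure(*entities[name]):,.0f}")
```

Each Delphi team could produce a list like this for its own entities. Qualitative criteria still have to be weighed separately before the merged ranking goes to upper management.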

Recovery time and recovery point objectives

As part of the risk assessment, the Delphi teams estimate how long an entity can be unavailable, how old the information supplied by the entity can be, and how much of it can reasonably be lost when it is made available again. That is, they determine the recovery time objective (RTO) and recovery point objective (RPO).

Recovery time objective refers to the time from when the event occurs until the business process (for example, the accounting department, the accounting application, or manual procedures used by accounting) must become active again (recovered).

Recovery point objective describes the point in time to which the data must be recovered – stale (old or obsolete) information no longer reflects the state of the company. This can also be thought of as the freshness window. The teams consult with management to confirm their decisions.

The longer the RTO for a particular entity, the lower the cost of recovering it is likely to be (figure 1). Unfortunately, the losses caused by the entity being unavailable escalate over the same period (figure 2).

RTO and RPO are not necessarily linked to each other; that is, a short RTO does not imply a short RPO. For example, a database might need to be recreated from backup tapes in less than a day (RTO), but while the online data must be current, it is acceptable for the batch data to be one week old (RPO).
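
The independence of the two objectives can be shown with a small Python sketch. It makes a simplifying assumption that is not in the article: the worst-case age of recovered data equals the interval between backups. The restore times, backup intervals, and the five-minute RPO for online data are hypothetical values chosen to mirror the database example.

```python
from datetime import timedelta

def meets_objectives(restore_time, backup_interval, rto, rpo):
    """Check a recovery approach against its RTO and RPO independently.

    Assumes the worst-case data age after recovery equals the backup interval.
    """
    return {
        "rto_met": restore_time <= rto,
        "rpo_met": backup_interval <= rpo,
    }

# Batch data: restored from weekly tapes in 18 hours; one-week-old data is acceptable.
batch = meets_objectives(
    restore_time=timedelta(hours=18),
    backup_interval=timedelta(days=7),
    rto=timedelta(days=1),
    rpo=timedelta(days=7),
)

# Online data: same restore time, nightly backups, but the data must be nearly current.
online = meets_objectives(
    restore_time=timedelta(hours=18),
    backup_interval=timedelta(days=1),
    rto=timedelta(days=1),
    rpo=timedelta(minutes=5),
)

print(batch)
print(online)
```

Both data sets meet the same one-day RTO, but only the batch data meets its RPO. The online data would need a different technique to close the gap, whatever its restore time.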


Figure 1. Cost of recovery.

Figure 2. Loss due to the time an entity is unavailable.

Disaster Tolerance: Closing the Freshness Window


Real-time technologies, such as online data replication or vaulting, can practically close the freshness window. They are more expensive than routine backups but could be deemed necessary in some situations, such as Web transactions, wire transfers, and supply chain applications. Replication or vaulting duplicates data onto an off-site location as it is manipulated on the primary system. The semantic difference between these two terms usually is that vaulted data is batched and/or stored offline and needs to be moved to the backup hardware, while replicated data is sent in near real time, possibly directly to the backup hardware, and is ready to run.

If your company is already geographically dispersed and application uptime is imperative, you can create application domains at more than one site and distribute the load. When the load is being shared in this manner, you do not actually have primary and secondary systems or sites. What you do have are the beginnings of indestructible, scalable computing.

Alternative plans and controls

Once risks are assessed and recovery windows are determined, the planner can ask the Delphi teams to begin outlining possible continuity plans for their business functions, starting with the most critical. The baseline is the before risk or ALE (no plan or controls in place). For each alternative plan or control, the Delphi teams need to calculate the after risk or ALE (plan and controls in place) along with the cost of the plan and controls.

At some point, the potential loss reduction (savings) will be less than the expenditures required to develop and implement the continuity plan (figure 3). Here is where the executive staff comes into play again. They need to determine which controls and recovery window each plan should address based on cost versus savings and time to deploy.

Figure 3. Plans I and III would break even (at different levels). Plan II would generate savings in excess of its implementation cost, and Plan IV would cost more than it would save.
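
The comparison behind figure 3 comes down to simple arithmetic, sketched below in Python. All of the dollar figures are hypothetical and were chosen only so that the four plans reproduce the outcomes named in the caption. The helper `net_savings` is an illustrative name, not an established term.

```python
def net_savings(before_ale, after_ale, plan_cost):
    """Loss reduction achieved by a plan, minus what the plan costs."""
    return (before_ale - after_ale) - plan_cost

# Hypothetical baseline: ALE with no plan or controls in place.
before_ale = 500_000

# Hypothetical alternatives: plan name -> (ALE with the plan in place, cost of the plan).
plans = {
    "I": (400_000, 100_000),
    "II": (150_000, 200_000),
    "III": (100_000, 400_000),
    "IV": (80_000, 600_000),
}

results = {
    name: net_savings(before_ale, after_ale, cost)
    for name, (after_ale, cost) in plans.items()
}

for name, value in results.items():
    if value > 0:
        verdict = "saves more than it costs"
    elif value == 0:
        verdict = "breaks even"
    else:
        verdict = "costs more than it saves"
    print(f"Plan {name}: {value:+,} ({verdict})")
```

With these numbers, Plans I and III break even at different spending levels, Plan II comes out ahead, and Plan IV does not pay for itself. The executive staff still has to weigh time to deploy, which this arithmetic ignores.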

 

Documentation and Standards

Required documentation

Before developing a continuity plan that identifies the activities to be performed during a disaster scenario, you need to understand how those same activities work every day. This means that you need access to some basic documentation for each business process. You also need to gather more specific information about your company’s business functions. For example, find out whether sufficient application downtime is scheduled to back up the databases used by the business functions or, if the software allows it, whether the databases are being backed up online. (If your company runs 24 x 7, you don’t have time to take your system down for backups.)

Determine if there is an archival process to remove inactive records from hard-copy files and databases so that they are kept at a manageable size. Also, identify where critical records are being stored: on site, off site, out of the regional area.

If any of this information is missing, a continuity plan can be started, but it may not be effective.

Without knowing how the business functions run and what is required for normal processing, no one can guess the requirements for exception processing. Actually, the bulk of the work towards developing a continuity plan lies in ensuring that standard business practices are documented and followed.

Developing standards: A “cookbook”

The continuity planner either develops or purchases a standard set of forms and procedures to be used by each function: a continuity plan “cookbook.” Without standards, each business function’s plan will look different, making coordination difficult.

Different practitioners divide and name the phases of the continuity planning program differently.

The Disaster Recovery Institute defines seven phases of a continuity planning program as follows:

-Phase 1: Project initiation
-Phase 2: Functional requirements
-Phase 3: Design and development
-Phase 4: Implementation
-Phase 5: Testing and exercise
-Phase 6: Maintenance and update
-Phase 7: Execution

Develop your cookbook, or select software that serves the same purpose, during phase 3 so that the business function teams can use it during phase 4. At a minimum, it should contain the following:

-Step-by-step approach for each group to follow in writing the continuity plan
-Corporate team description, stating which corporate resources will be available to assist each business function in developing its plan
-Notification process: Although the planner maintains the first-level response part of the plan, including lists of important phone numbers at the corporate level, the security department probably should start the actual notification process in the event of a disaster.
-Plan considerations: Identifies the issues that must be addressed by business functions as they write their plans
-Responsibility list: A script or checklist, by job function, of what each person will be required to do during the seven phases of plan execution: evaluation and declaration, notification, emergency response, interim processing, salvage, relocation/reentry, and resumption of normal processing
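
One way to keep a responsibility list honest is to check that each job function has tasks for all seven execution phases. The Python sketch below assumes a simple layout, a dictionary of phase name to task list for each job function. The layout, the helper `missing_phases`, and the sample tasks are all hypothetical.

```python
# The seven phases of plan execution, as listed in the responsibility list item.
EXECUTION_PHASES = [
    "evaluation and declaration",
    "notification",
    "emergency response",
    "interim processing",
    "salvage",
    "relocation/reentry",
    "resumption of normal processing",
]

def missing_phases(checklist):
    """Return the phases for which a job function has no tasks listed."""
    return [phase for phase in EXECUTION_PHASES if not checklist.get(phase)]

# Hypothetical, partly completed checklist for one job function.
security_officer = {
    "evaluation and declaration": ["Assess damage with facilities staff"],
    "notification": ["Start the notification process"],
    "emergency response": ["Secure the site perimeter"],
}

gaps = missing_phases(security_officer)
print(gaps)
```

A check like this could run whenever a business function submits its plan, so that gaps are found before an exercise rather than during one.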

Writing the continuity plan

If employees within each business function write the plan for that function, you achieve multiple goals:

-Employees know what the day-to-day activities are.
-Numerous functions can be generating plans at the same time.
-You gain buy-in of the business function employees.

A designated continuity planner should be available to answer questions as each group writes its continuity plan. Not only does each business function need to generate a plan, but also each department represented on the corporate team needs to have a plan, in case those departments are also affected by the disaster.

Exercising the plan

Continuity plans should not be tested; they should be exercised. Tests are passed or failed, whereas exercises are conducted for practice. Exercise the plans thoroughly to ensure that they work. During the exercise, note any problems that occur and encourage feedback from participants. The purpose of the exercises is to reveal any defective or missing components in the plans. It is counterproductive to reprimand someone for pointing out errors or omissions.

Some companies spend hundreds of hours testing and refining specific pieces of their plans until they are satisfied with the time or accuracy of execution.

It is best to exercise and update the plans at least annually, or when major changes occur. Update call lists quarterly. It is better to have no plan at all than to have an out-of-date plan, because such a plan lends a false sense of security and wastes time during an actual disaster.
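
The review schedule above is easy to track mechanically. The Python sketch below flags items that are overdue, assuming an annual interval of 365 days for plans and a quarterly interval of 91 days for call lists. The item names, dates, and the helper `overdue_items` are hypothetical.

```python
from datetime import date, timedelta

# Assumed intervals: plans at least annually, call lists quarterly.
REVIEW_INTERVALS = {
    "plan": timedelta(days=365),
    "call list": timedelta(days=91),
}

def overdue_items(items, today):
    """Return the names of items whose last review is older than their interval.

    items: list of (name, kind, last_reviewed) tuples, where kind is a key
    of REVIEW_INTERVALS and last_reviewed is a date.
    """
    return [
        name
        for name, kind, last_reviewed in items
        if today - last_reviewed > REVIEW_INTERVALS[kind]
    ]

# Hypothetical review records.
items = [
    ("order taking plan", "plan", date(2017, 6, 1)),
    ("billing plan", "plan", date(2016, 12, 15)),
    ("corporate call list", "call list", date(2017, 10, 1)),
]

overdue = overdue_items(items, today=date(2018, 3, 1))
print(overdue)
```

A date check does not cover the other trigger the article names, a major change in the business, so someone still has to flag plans for review when that happens.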

Conclusion

A continuity plan should not and cannot be written by the IT department alone, nor should it be written solely for a given computer or data center. Developing such a plan is a long-term process that requires substantial human and monetary resources throughout the company. Without a long-term commitment to continuity planning from the highest executive levels, efforts to develop such a plan are bound to fail.

The planning process cannot even begin without documentation for everyday processing already in place, including change control procedures, standard operating procedures, run books, data flow diagrams, problem isolation procedures, and a tape backup or rotation schedule. As a by-product, continuity planning forces more formalized standard documentation across the entire company, resulting in faster isolation of application bugs, fewer operational mistakes, reduced support requirements, faster training of new personnel, and easier maintenance and enhancement of current applications.

Not only is a continuity plan required by many regulatory agencies, it could also mean the survival of your company.


Ron LaPedis, CBCP, has worked for Compaq for 20 years in various capacities, most recently as a product manager for security and business continuity products.