When (note: “when,” not “if”) a disaster strikes your company, critical, time-sensitive business processes must keep operating so that your company continues to function as a reliable business. Located in earthquake country, California companies must assume a worst-case scenario where people are injured, freeways are impassable, and phone service is hindered. Other states experience natural disasters, such as hurricanes, tornadoes and floods, which can also affect the availability of employees and transportation options. You may have operations outside the state that will continue to produce products or provide services, but which locally conducted business processes do those other sites depend on?
Business recovery time objectives (RTO – the time to recover the system after a disaster) are usually determined by a business impact analysis. System recovery time objectives are usually defined by the method used and time required to recover the system. When was the last time you analyzed and compared these two for any gaps?
Does your company consider RTO and recovery point objective (RPO – time passed since last backup prior to a disaster) when a new business systems application is designed and/or implemented?
Many companies have system development and/or project management life cycles. Where in that process/methodology do you find your recovery strategy being applied?
Best practice has the business impact analysis and application criticality defined in the initial assessment, prior to approval. The recovery strategy will affect the hardware design and dictate the data backup solution. A high availability solution will cost more than one designed for a recovery time objective of several weeks.
What risks are you willing to incur for what cost? This is all dependent on the criticality of the application and the business process it supports. The disaster recovery plan also needs to be tested prior to the system being deployed into production.
A disaster is an interruption of a system for an unacceptable period of time. A business process may use multiple computer systems and platforms, and each business process may have a different unacceptable period of downtime. The focus of any recovery plan must be on keeping the business running, not keeping the computer running. Are recovery objectives in sync between the business process and the multiple systems and platforms that support it? Business continuity requirements should be defined before the computer requirements are defined. The business owner needs to determine what is an acceptable risk for the business, set the RTO, and agree to the resulting cost of the systems implementation to meet that objective. If a project is implementing new computer technology not currently related to a new business application, the business processes that will eventually depend on the project’s deliverables will determine the recovery requirements.
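One way to check whether recovery objectives are in sync is to compare the business RTO of a process with the recovery time of every system that supports it. The following is a minimal sketch in Python, not a prescribed method; the class names, the `recovery_gaps` function, and all figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SystemRecovery:
    name: str
    recovery_hours: float  # time the current recovery method needs (system RTO)


@dataclass
class BusinessProcess:
    name: str
    rto_hours: float  # business RTO, e.g. from the business impact analysis
    systems: list[SystemRecovery]


def recovery_gaps(process: BusinessProcess) -> dict[str, float]:
    """Return, per system, the hours by which it overshoots the business RTO."""
    return {
        s.name: s.recovery_hours - process.rto_hours
        for s in process.systems
        if s.recovery_hours > process.rto_hours
    }


# Hypothetical figures for illustration only.
order_entry = BusinessProcess(
    name="Order entry",
    rto_hours=24,
    systems=[
        SystemRecovery("ERP database", 12),
        SystemRecovery("Warehouse interface", 72),
        SystemRecovery("E-mail", 24),
    ],
)

gaps = recovery_gaps(order_entry)
print(gaps)  # {'Warehouse interface': 48}
```

An empty result means every supporting system can be recovered within the business RTO; any entry is a gap to take back to the business owner.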
It’s helpful to have a decision matrix that describes who is responsible for approval, who reviews, and who provides information. For an example of a decision matrix, please look at the graphic on this page.
RTOs are usually tiered. You’ll need to look at your company’s unique requirements to decide how many tiers are appropriate for your organization; more than five usually becomes unmanageable.
An example of four tiers for RTO is:
• Tier 0 – Fault tolerant – virtually no impact to end-user if system goes down. Replication is part of the design of the system/application and usually requires Tier A RPO.
• Tier 1 – RTO of less than 24 hours. Requires hot standby equipment and usually a Tier B RPO.
• Tier 2 – RTO of less than 48 hours. Production takes over test and development equipment in the event of a disaster. This usually only applies when a company has a second data center, where production runs at one site, with test and development at the other site.
• Tier 3 – RTO of greater than seven days. Requires acquisition of hardware and restoration of systems.
RPOs are determined by the amount of data/transactions that can afford to be lost.
Possible RPO tiers are:
• Tier A – No data loss
• Tier B – RPO of less than 24 hours
• Tier C – RPO of last backup (in most cases, will be 24-36 hours)
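The example RTO and RPO tiers above can be expressed as simple lookups. This is a rough sketch under stated assumptions: Tier 0 and Tier A are treated as zero hours, anything of 24 hours or more is treated as Tier C, and the function names `rto_tier` and `rpo_tier` are hypothetical. The example RTO tiers do not cover the range from 48 hours to seven days, so the sketch returns `None` there rather than guessing.

```python
from typing import Optional


def rto_tier(rto_hours: float) -> Optional[str]:
    """Map a target recovery time in hours to the example RTO tiers."""
    if rto_hours <= 0:
        return "Tier 0"  # assumption: fault tolerant means no planned downtime
    if rto_hours < 24:
        return "Tier 1"
    if rto_hours < 48:
        return "Tier 2"
    if rto_hours > 7 * 24:
        return "Tier 3"
    return None  # 48 hours to seven days is not assigned in the example tiers


def rpo_tier(tolerable_loss_hours: float) -> str:
    """Map the tolerable data loss in hours to the example RPO tiers."""
    if tolerable_loss_hours <= 0:
        return "Tier A"  # no data loss
    if tolerable_loss_hours < 24:
        return "Tier B"
    return "Tier C"  # assumption: 24 hours or more means "last backup"


print(rto_tier(4), rpo_tier(4))   # Tier 1 Tier B
print(rto_tier(96))               # None
```

A `None` result is a prompt to revisit the tier definitions for your own organization, not an error in the requirement.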
RTO and RPO need to be defined whether you are recovering at your own alternate data center or you are recovering at a cold or hot site operated by a third party. Third-party providers (such as IBM and SunGard) now have advanced recovery services that can meet high availability requirements for RTO and RPO.
Answers to the following questions addressed to the business owner will assist in determining RTO and RPO:
1. What does this business process use to do its work?
2. What resources (people, skill sets, other tools) are needed for this process to continue to function in a disaster mode?
3. What vital information flows through this business process, either from another process and/or to another process? What other business processes are dependent on the activities of this process?
4. What activities of the process can be done manually, if needed? What manual workaround procedures could be put in place to minimize either the financial or non-financial impacts?
5. What would be the direct financial loss to your company if this business process were not available for 24 hours? One week? Three weeks? How is this loss calculated? What components contribute to this loss?
6. Does this business process have business cycles? Would a significant loss to your company be different at different times of the year? What months are critical? Are there times of the month that are more critical than others?
7. What is the business recovery plan? Are there subject-matter experts outside of the affected area who could process the work if critical employees are not available?
8. What are the negative impacts of the following non-financial concerns if this process does not function for 24 hours? One week? Three weeks?
a. Cash flow (generation of revenue)
b. Public image
c. Shareholder confidence
d. Financial reporting
e. Managerial control (for example, approval levels)
g. Competitive advantages
h. Industry image
i. Customer service
j. Vendor relations
k. Legal/contractual violations
l. Regulatory requirements
m. Employee morale
n. Consumer confidence
9. For each day of outage, how long will it take to handle the critical backlogged work, in addition to other daily work, when this process is back in operation?
10. What expenses would be incurred if this process were disrupted?
a. Temporary employees
b. Emergency purchases (supplies, office machines, etc.)
c. Rental/lease of equipment
d. Wages paid to idle staff
f. Temporary relocation of employees to alternate business recovery location (assume not working from home)
11. What other vulnerabilities and exposure exist with this business process?
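Answers to questions 5, 9 and 10 can be turned into rough numbers. The sketch below is one simple way to do that, not a standard formula: it assumes the direct loss accrues at a constant hourly rate (question 6 suggests it often will not), that extra expenses are charged per started day, and that backlog is cleared using a fixed share of spare daily capacity. The function names and every figure are hypothetical.

```python
import math


def outage_cost(outage_hours: float, revenue_loss_per_hour: float,
                extra_expenses_per_day: float) -> float:
    """Direct loss plus extra expenses for an outage of the given length."""
    outage_days = math.ceil(outage_hours / 24)  # expenses charged per started day
    return outage_hours * revenue_loss_per_hour + outage_days * extra_expenses_per_day


def backlog_recovery_days(outage_days: float, spare_capacity: float) -> float:
    """Days needed to clear the backlog while also doing normal daily work.

    spare_capacity is the fraction of a normal day's work that staff can
    absorb on top of the daily workload (0.25 means 25% extra per day).
    """
    if spare_capacity <= 0:
        raise ValueError("backlog never clears without spare capacity")
    return outage_days / spare_capacity


# Hypothetical figures: 24 hours, one week, three weeks.
for hours in (24, 7 * 24, 21 * 24):
    cost = outage_cost(hours, revenue_loss_per_hour=2_000,
                       extra_expenses_per_day=5_000)
    print(f"{hours:>3} hours down: {cost:,}")

print(backlog_recovery_days(outage_days=7, spare_capacity=0.25))  # 28.0
```

With these made-up inputs, a one-week outage would take four weeks of catch-up work, which is the kind of result that can change a business owner's view of an acceptable RTO.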
In most circumstances, the definition of RTO and RPO will be an iterative process. There is no absolute formula. There is also a negotiation process with the business owner to balance the risk with the cost. That is, there may initially be a requirement for a short RTO and RPO. But after weighing the costs of the solution, the business owner may accept a longer RTO and RPO that would be less costly. How much risk are they willing to take for what cost?
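The negotiation can be supported by putting each candidate solution's cost next to the loss it leaves exposed. The following sketch uses a simple expected-cost comparison, which is an assumption of this example rather than a method the article prescribes; the option names, costs, loss rate and disaster frequency are all hypothetical, and non-financial impacts are not modeled.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    annual_cost: float  # yearly cost of the recovery solution
    rto_hours: float    # recovery time the solution can achieve


def expected_annual_total(option: RecoveryOption, loss_per_hour: float,
                          disasters_per_year: float) -> float:
    """Solution cost plus the expected yearly loss from downtime."""
    expected_loss = disasters_per_year * option.rto_hours * loss_per_hour
    return option.annual_cost + expected_loss


# Hypothetical options and figures for illustration only.
options = [
    RecoveryOption("Fault-tolerant replication", 500_000, 0),
    RecoveryOption("Hot standby", 150_000, 24),
    RecoveryOption("Acquire hardware and restore", 20_000, 168),
]

for opt in options:
    total = expected_annual_total(opt, loss_per_hour=2_000, disasters_per_year=0.1)
    print(f"{opt.name}: {total:,.0f}")

cheapest = min(
    options,
    key=lambda o: expected_annual_total(o, loss_per_hour=2_000, disasters_per_year=0.1),
)
print("Lowest expected cost:", cheapest.name)
```

The lowest expected cost is only a starting point for the conversation. A business owner may still choose a shorter RTO because of public image, regulatory requirements, or other impacts that a cost figure does not capture.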
As with other business continuity plan components, an annual review of the RTO and RPO requirements should be done to capture changes to both the business environment and the systems environment.
So if you haven’t already done so, get to know your counterpart on either the business side or the technology side. We’re all in this together.
Karen I. Dye, CBCP, is the Business Continuity and Disaster Recovery Manager for Clorox. She is director of the Business Recovery Managers’ Association in Northern California. She managed disaster recovery at a major bank and a national retailer before joining Clorox.