Last fall the DRJ Glossary of Terms committee revised the definition of recovery time objective (RTO). This column introduces a closely related term, recovery time capability (RTC), and describes some of the challenges of calculating and communicating it.
Recovery time capability is:
“The demonstrated amount of time in which systems, applications and/or functions have been recovered, during an exercise or actual event, at the designated recovery/alternate location (physical or virtual). As with RTO, RTC includes assessment, execution and verification activities. RTC and RTO are compared during gap analysis.”
For comparison, RTO is:
“The period of time within which systems, applications, or functions must be recovered after an outage. RTO includes the time required for: assessment, execution and verification. RTO may be enumerated in business time (e.g. one business day) or elapsed time (e.g. 24 elapsed hours). Assessment includes the activities which occur before or after an initiating event, and lead to confirmation of the execution priorities, time line and responsibilities, and a decision regarding when to execute. Execution includes the activities related to accomplishing the pre-planned steps required within the phase to deliver a function, system or application in a new location to its owner. Verification includes steps taken by a function, system or application owner to ensure everything is in readiness to proceed to live operations.”
Organizations should quantify RTC in elapsed time or in business time, matching the approach used for its associated RTO. Recovery of people/functions is typically stated in business time. The recovery of applications may be quantified in either manner, depending on whether the application is required on a 24/7 basis. If it is, the RTC should be stated in elapsed time.
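To illustrate the difference between the two ways of stating a duration, the following minimal Python sketch measures the same outage in elapsed time and in business time. The business day (9:00 to 17:00, Monday through Friday), the function name `business_time_between`, and the timestamps are assumptions made for illustration only; holidays are ignored.

```python
from datetime import datetime, timedelta

def business_time_between(start, end, open_hour=9, close_hour=17):
    """Sum the part of [start, end) that falls inside assumed business hours."""
    total = timedelta()
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        if day.weekday() < 5:  # Monday=0 ... Friday=4; weekends are skipped
            window_open = day + timedelta(hours=open_hour)
            window_close = day + timedelta(hours=close_hour)
            overlap_start = max(start, window_open)
            overlap_end = min(end, window_close)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total

outage = datetime(2024, 3, 8, 15, 0)      # hypothetical: Friday, 3:00 p.m.
recovered = datetime(2024, 3, 11, 11, 0)  # hypothetical: Monday, 11:00 a.m.

elapsed = recovered - outage
business = business_time_between(outage, recovered)
print("Elapsed time: ", elapsed.total_seconds() / 3600, "hours")
print("Business time:", business.total_seconds() / 3600, "hours")
```

Under these assumptions the same recovery measures 68 elapsed hours but only 4 business hours, which is why an RTC is meaningful only when it is stated in the same terms as its RTO.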
The clock “starts ticking” for RTC at the point of outage (the start of assessment), not at the point recovery activities begin. Customers are “doing without” the application or service while the recovery team is assessing the situation, and assessment may require much more time than the team expects.
Organizations should track the duration of all three phases (assessment, execution and verification) of RTC. Different factors influence the length of each phase.
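One way to record the three phases is to capture a timestamp at each phase boundary during an exercise or event. The sketch below assumes four such timestamps; the class name `RecoveryRecord`, its field names, and the sample times are hypothetical and are not part of the DRJ definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RecoveryRecord:
    outage_start: datetime        # the RTC clock starts here, at the point of outage
    execution_start: datetime     # decision made to invoke the plan
    verification_start: datetime  # system delivered to its owner for checking
    verified: datetime            # owner confirms readiness for live operations

    def phase_durations(self):
        """Return the duration of each of the three RTC phases."""
        return {
            "assessment": self.execution_start - self.outage_start,
            "execution": self.verification_start - self.execution_start,
            "verification": self.verified - self.verification_start,
        }

    def rtc(self):
        """Total demonstrated recovery time, measured from the outage."""
        return self.verified - self.outage_start

# Hypothetical timestamps from a single exercise
record = RecoveryRecord(
    outage_start=datetime(2024, 3, 5, 9, 0),
    execution_start=datetime(2024, 3, 5, 11, 30),
    verification_start=datetime(2024, 3, 5, 15, 30),
    verified=datetime(2024, 3, 5, 16, 15),
)

for phase, duration in record.phase_durations().items():
    print(f"{phase}: {duration}")
print(f"RTC: {record.rtc()}")
```

Because the total is measured from `outage_start` rather than from `execution_start`, the assessment phase is always counted in the RTC.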
Of RTC’s three phases, the duration of the first is the most difficult to estimate because of many variables, such as: the number of people and/or applications impacted by the disaster; whether the outage occurred during business hours; the number of people who must participate in the assessment effort; whether part of the assessment team was also impacted by the disaster; the time required to gain physical or remote access to the affected site; whether the disaster provided advance warning (e.g., a hurricane); the time required to access recovery plans; whether recovery execution (for applications) is automated; and the availability of communication means. For these reasons, organizations should not create overly optimistic assessment estimates. Assessments for most non-automated recoveries should be expected to consume hours, not minutes.
Calculating execution phase duration also presents a challenge. Since the execution phase begins when a decision is made to invoke a plan, it also includes the time required to transport the recovery team to the recovery site (or gain remote access for application recovery).
The scope of an outage can dramatically increase execution and verification phase durations. A data center outage, for example, will significantly lengthen these phases for every application whose timeframes were established based on a single-application outage. Similarly, a verification team evaluating multiple applications may have to review them in sequence, delaying recovery for some of them.
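The effect of sequential verification can be shown with a small worked example. The sketch below assumes a single verification team that reviews applications one at a time in a fixed order; the function name, application names, and hour values are hypothetical.

```python
def sequential_verification(apps):
    """Return the hour (since outage) at which each application is verified.

    apps: list of (name, ready_at_hours, verify_hours) in review order,
    where ready_at_hours is when execution finished for that application.
    """
    finish = {}
    team_free_at = 0.0
    for name, ready_at, verify_hours in apps:
        # Verification cannot start until the application is ready
        # and the single team has finished the previous review.
        start = max(ready_at, team_free_at)
        team_free_at = start + verify_hours
        finish[name] = team_free_at
    return finish

# Hypothetical data center outage affecting three applications
apps = [
    ("payments", 4.0, 1.0),
    ("email", 4.0, 1.0),
    ("reporting", 4.5, 1.0),
]

finish = sequential_verification(apps)
for name, ready_at, verify_hours in apps:
    alone = ready_at + verify_hours
    print(f"{name}: verified at hour {finish[name]} (hour {alone} if recovered alone)")
```

In this example the third application is verified at hour 7 rather than hour 5.5, even though its own execution and verification steps took no longer than planned.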
Context is an essential consideration for RTC because a published RTC establishes an informal expectation with any customers who know about it. A business continuity team that publishes RTCs without context may increase its reputational risk by establishing unrealistic expectations.
Because of these challenges, BC/DR teams have several options for sharing RTC information with their clientele: share RTCs on a limited basis (e.g., use them as an internal metric); share them in the context of a specific scenario (e.g., a single-application outage); or identify multiple RTCs based on different planning scenarios or on whether a disaster event occurs during or outside business hours. Organizations must consider all of these aspects when deciding whether and how to share RTC information.
The difficulty of calculating the RTC increases the importance of tracking it at every opportunity, whether in exercises or actual events. Exercise designers should consider adding complexity to their exercises so that the resulting RTCs are more realistic.
Recovery time capability is a concept whose value is maximized only when all three components are tracked. By recording RTCs, organizations can gain confidence in their ability to achieve RTOs, or identify the need to change them. Recovery teams should share RTC results but must explain context to all stakeholders.
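As a rough illustration of the gap analysis mentioned in the RTC definition, the sketch below compares one RTO against RTCs recorded under different scenarios. The function name `gap_analysis`, the scenario labels, and all durations are hypothetical values chosen for illustration.

```python
from datetime import timedelta

def gap_analysis(rto, rtc_by_scenario):
    """Return scenario -> (RTC - RTO); a positive gap means the RTO was missed."""
    return {scenario: rtc - rto for scenario, rtc in rtc_by_scenario.items()}

rto = timedelta(hours=8)  # hypothetical objective, in elapsed time
rtc_by_scenario = {
    "single application outage": timedelta(hours=7, minutes=15),
    "data center outage": timedelta(hours=13),
}

gaps = gap_analysis(rto, rtc_by_scenario)
for scenario, gap in gaps.items():
    status = "meets RTO" if gap <= timedelta(0) else "exceeds RTO"
    hours = gap.total_seconds() / 3600
    print(f"{scenario}: gap {hours:+.2f} hours ({status})")
```

Keeping the scenario label attached to each RTC is one way to preserve the context that stakeholders need when the results are shared.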
Frank Lady, CBCP, CISSP, CRISC, PMP is a vice president of business continuity at Bank of America. He has been a member of the DRJ Editorial Advisory Board since 2008, and chairs its Glossary of Terms Committee. Lady welcomes article feedback and glossary of terms suggestions at email@example.com.