Many times organizations conduct traditional disaster recovery exercises where testing is done in silos, and the scope is limited and restricted only to host level recovery of individual systems. With growing technology changes and globalization trends, the intricacy and interdependencies of applications have become more complex in recent years, and major applications are spread across multiple locations and multiple servers. In this scenario, a traditional recovery exercise focusing on server (host) level recovery is not going to adequately ensure the complete recovery of the application without any inconsistencies among various interdependent subcomponents. In a widespread disaster scenario involving major outages at the data center level, it is fairly certain that this kind of limited exercise is not going to be sufficient to assure the realistic readiness status and overall recovery time objective (RTO) for multiple applications. Therefore, organizations should increase the scope and complexity of disaster recovery exercises over time and ensure that each exercise is process-oriented and focused on “end-to-end” recovery. This article addresses some of the technical challenges faced in end-to-end disaster recovery exercises which attempt a full life cycle of transactions across disaster recovery applications and their dependencies and simulate business activities during the exercises.
Growing reliance on information technology, along with compliance and regulatory requirements, has led many organizations to focus on business continuity and disaster recovery (DR) solutions. Availability has become a major concern for business survival. It is therefore essential to take a detailed look at disaster recovery testing and the specific steps that ensure a disaster recovery plan performs as expected. An end-to-end disaster recovery exercise provides a realistic readiness status and brings out the complexities involved in recovering multiple applications after a widespread disaster, including a data center level outage.
An “end-to-end” disaster recovery exercise poses many more challenges than a traditional disaster recovery exercise, since one needs to consider all the dependencies and take an end-to-end view to understand the full functionality of the applications.
This article illustrates some of the challenges faced in actual “end-to-end” disaster recovery exercises conducted for applications which interfaced with external third parties and relied heavily on middleware components and batch jobs.
Why a Disaster Recovery Exercise is Required
Disaster recovery plans represent a considerable amount of complex and interrelated technologies and tasks that are required to support an organization’s disaster recovery capability. Constant changes in personnel, technologies, and application systems demand periodic plan validation to assure that the recovery plans are functional and remain so in the future. Without this validation, an organization would not be able to demonstrate that the documented set of recovery plans support current recovery operations that will be needed to sustain critical business functions in time of disaster.
The periodic disaster recovery exercise is required to validate the documented recovery procedures, assumptions, and associated technology used in the restoration of the production environment.
Issues in Traditional Disaster Recovery Exercises
How many organizations attempt a full life cycle of transactions across disaster recovery applications and their dependencies and simulate business activity as part of disaster recovery exercises? Many times organizations conduct traditional disaster recovery exercises where testing is done in silos, and the scope is limited and restricted only to host level recovery of individual systems. In most of these disaster recovery exercises, the participating team is comprised of only the information technology team without involving any business users. Generally the primary objectives in such exercises will be restricted to recovery of standalone systems without involving any integration with upstream or downstream dependencies.
Typical application validation carried out in this exercise includes login validation, form navigation, and search validation without testing any connections to other dependent applications or any business activity. Most of the time, traditional disaster recovery test activities are limited to travel and the restoration of hosts at the recovery site and go no further. The major drawback of this type of testing is that one will not know, until the actual disaster, how the integration is going to work, what the main dependencies are, and what the impact of any network latency related issue may be.
With growing technology changes and globalization trends, the intricacy and interdependencies of applications have become more complex in recent years, and major applications are spread across multiple locations and multiple servers. In this scenario, a traditional recovery exercise focusing on server (host) level recovery is not going to be adequate to fully recover the application without any inconsistencies among various interdependent subcomponents.
This kind of limited exercise not involving end-to-end disaster recovery activities and without attempting to simulate business activity is not going to be sufficient to reflect the preparedness to handle a real time disaster and to assure the required overall Recovery Time Objective (RTO) for multiple applications.
Why We Need an End-to-End Disaster Recovery Exercise
A limited scope disaster recovery exercise that does not involve end-to-end disaster recovery activities or attempt to simulate business activity is typically based on asset level outage scenarios (for example, a specific server or application) and not on widespread site level outages (data center or city level).
Therefore, in order to ensure effective disaster recovery preparedness, organizations should plan for an end-to-end disaster recovery exercise including all interdependent applications in scope. This will bring out the practical issues involved in performing the business transactions in the disaster recovery environment and verify the real effectiveness of disaster recovery procedures.
Challenges in an “End-to-End” Disaster Recovery Exercise
An end-to-end disaster recovery exercise focuses on complete recovery of applications and their dependencies across various layers, including presentation, business logic, integration, and data layer. It takes into account the required data consistency among various interdependent subcomponents and sees the recovery from the business process perspective.
Since an end-to-end disaster recovery exercise attempts a full life cycle of transactions across disaster recovery applications and their dependencies, and simulates a business activity during the exercise, there are many challenges in conducting an “end-to-end” recovery exercise. Typical challenges faced are:
- isolating the DR environment
- replacing hard coded IP addresses and host names
- connecting to dependent systems not having a corresponding disaster recovery environment
- proper sequencing of applications
- thorough preparation and coordination
- ensuring a back-out plan and data replication during the exercise.
This article assumes a parallel exercise scenario and highlights the common technical challenges faced in conducting an end-to-end disaster recovery parallel exercise in a warm site. In a parallel exercise, the DR environment is brought up without interrupting or shutting down the production environment.
Isolating the DR Environment
As everyone will agree, we need to perform the disaster recovery exercise without any interruption to production. This is very easily said, but it is the toughest challenge for the disaster recovery coordinator, especially when it is required to do a parallel test at a warm site. Isolating the DR environment and at the same time conducting the full life cycle testing requires a lot of planning and coordination.
The key issue with a full life cycle test is the potential interruption to production systems by unintended access either by other applications or batch jobs. This may result in updating some transactions in the production environment during the test since these restored systems might have the same host names or IP addresses as production systems. Any production interruption such as any duplicate financial transaction for paying a vendor or any missing critical transaction due to a disaster recovery exercise could put your disaster recovery effort in jeopardy.
One should ensure that disaster recovery instances are not connected to the production environment at all layers, including the database and network layers. For example, at the database layer, the tnsnames.ora file or database (DB) links should be updated to ensure that only DR instances are speaking to each other. At the network layer, appropriate firewall rules should be implemented to block any traffic from the disaster recovery environment to the production environment.
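A check of this kind can be scripted before the exercise starts. The following is a minimal sketch in Python, assuming a list of production host names and IP addresses is available; the endpoint names, the sample tnsnames.ora content, and the function name are all hypothetical and used only for illustration.

```python
import re

# Hypothetical production host names and IP addresses; in practice these
# would come from an inventory export.
PRODUCTION_ENDPOINTS = {"proddb01.example.com", "10.10.1.15"}


def find_production_references(config_text, production_endpoints):
    """Return (line_number, endpoint) pairs for every production endpoint found."""
    findings = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        for endpoint in sorted(production_endpoints):
            # Match whole tokens only, so 10.10.1.15 does not match 10.10.1.150.
            pattern = r"(?<![\w.-])" + re.escape(endpoint) + r"(?![\w.-])"
            if re.search(pattern, line, flags=re.IGNORECASE):
                findings.append((line_number, endpoint))
    return findings


# Invented tnsnames.ora content: one entry still points at production.
SAMPLE_TNSNAMES = """\
ORDERS =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = proddb01.example.com)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = orders))
  )
BILLING =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = drdb02.example.com)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = billing))
  )
"""

if __name__ == "__main__":
    for line_number, endpoint in find_production_references(
        SAMPLE_TNSNAMES, PRODUCTION_ENDPOINTS
    ):
        print(f"line {line_number}: still points at production endpoint {endpoint}")
```

A scan like this only complements the firewall rules described above; it does not replace them, because it can only find references in the files it is given.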
In an isolated DR environment, it is a challenge for desktop clients and end users to connect to the DR environment and to verify whether they are accessing the production or the DR environment. These challenges can be overcome by allowing access to the disaster recovery environment via DR-Citrix, DR host names, or direct DR-URLs as applicable. The local hosts file and configuration files on end-user client machines need to be configured to point to DR host names instead of production host names during the DR exercise.
Replace the Hard-Coded IP Address and Host Name
In many organizations, a major issue in disaster recovery exercises is hard-coded IP addresses and host names in applications, particularly in batch jobs. If any IP addresses or host names are hard-coded, interfaces and batch jobs might fail or interrupt production systems during the exercise. Hence one needs to thoroughly analyze all the involved systems and identify any hard-coded IP address or host name. As a best practice, one should always reference alias names and avoid hard-coding host names or IP addresses. One of the important tasks for effective disaster recovery implementation is to convert every application to reference alias names rather than IP addresses or the primary host names listed in DNS.
However, replacing the hard-coded host name or IP address might be a tough job for applications which were developed several years ago. In such cases, it is suggested to use automated scripts as much as possible to replace the production host name or production IP address with the respective DR host name or DR IP address. These DR scripts should be documented, and one should ensure that they are not overwritten during storage replication to DR environments.
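Such a replacement script could look like the following minimal Python sketch. The host names, IP addresses, mapping, and configuration content are hypothetical; a real script would read the mapping from the documented DR procedures and work on copies of the actual configuration files.

```python
import re

# Hypothetical production-to-DR mapping, invented for illustration.
HOST_MAP = {
    "prodapp01.example.com": "drapp01.example.com",
    "prodapp01": "drapp01",
    "10.10.1.15": "10.20.1.15",
}


def repoint_to_dr(text, host_map):
    """Replace production host names and IP addresses with their DR counterparts.

    Returns the rewritten text and the list of production names that were replaced.
    """
    # Longest keys first, so a fully qualified name wins over its short form.
    keys = sorted(host_map, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.-])(" + "|".join(re.escape(k) for k in keys) + r")(?![\w-]|\.\w)"
    )
    replaced = []

    def substitute(match):
        replaced.append(match.group(1))
        return host_map[match.group(1)]

    return pattern.sub(substitute, text), replaced


# Invented batch job configuration with hard-coded production endpoints.
BATCH_JOB_CONFIG = """\
ftp_host=prodapp01.example.com
db_host=10.10.1.15
log_host=prodapp01
other=10.10.1.150
"""

if __name__ == "__main__":
    new_text, replaced = repoint_to_dr(BATCH_JOB_CONFIG, HOST_MAP)
    print(new_text)
    print("replaced:", replaced)
```

The boundary checks in the pattern are a deliberate choice: they leave `10.10.1.150` and unrelated names that merely begin with a production name untouched. Returning the list of replaced names gives the tester a record of what was changed, which helps when the changes are reverted after the exercise.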
Connecting to Dependent Systems Not Having a Corresponding Disaster Recovery Environment
One of the key challenges in an end-to-end disaster recovery exercise is how to test the interfaces with other applications which do not have a corresponding disaster recovery environment. For instance, as represented in figure 1, let us assume an application X, which is hosted at the disaster recovery site and needs to interface with application Y, which is hosted at a third-party site. If application Y has a corresponding disaster recovery system, then we can connect both disaster recovery systems during the exercise. Otherwise, one needs to look into using the other available environments of application Y, such as development, test, or pre-production systems, for testing. Flowcharts, data feeds, and architecture comparisons for production and disaster recovery help in identifying all the components required for the successful functioning of applications in a disaster scenario. An interface architecture comparison between the production and DR environments is shown in figure 1. In this DR interface architecture drawing, since Y, which is a vendor application, did not have a DR environment, the DR exercise was conducted by connecting the DR environment of application X to the test environment of application Y.
Figure 1: DR Interface Architecture Drawing
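The choice of partner environment described above can be written down as a simple rule. The following Python sketch is illustrative only: the preference order (DR first, then the most production-like environment) is an assumption not stated in this article, and the environment names and host names are hypothetical.

```python
# Assumed preference order when choosing the environment of a dependent
# application to connect to during the exercise.
PREFERENCE = ["dr", "pre-production", "test", "development"]


def choose_partner_environment(available):
    """Return the preferred environment name among those the dependent system offers."""
    for environment in PREFERENCE:
        if environment in available:
            return environment
    raise LookupError("no non-production environment available for the interface test")


# Hypothetical environments offered by vendor application Y, which has no DR system.
Y_ENVIRONMENTS = {
    "test": "y-test.example.com",
    "development": "y-dev.example.com",
}

if __name__ == "__main__":
    chosen = choose_partner_environment(Y_ENVIRONMENTS)
    print(f"connect DR instance of X to {chosen} environment of Y: {Y_ENVIRONMENTS[chosen]}")
```

Production is deliberately absent from the preference list, so the rule can never direct a DR instance to a production interface.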
Proper Sequencing of Applications and Overall RTO
A crucial challenge in most disaster recovery exercises is the proper identification and sequencing of upstream and downstream dependencies. When performing a disaster recovery exercise with a full life cycle of transactions for 20 or 30 applications, sequencing of applications becomes critical. The sequence should be planned based on the dependencies and the agreed overall Recovery Time Objective (RTO) requirements for multiple applications. Documenting all the critical interfaces for a disaster recovery scenario helps in ensuring proper sequencing of applications. While considering the dependencies of an application, the interfaces need to be analyzed for the business requirement of the data and the frequency at which they run.
Figure 2 illustrates the resulting application dependency analysis diagram. As illustrated in this diagram, D1 is the application which needs to be brought up first in the DR environment before bringing up the DR application X, because D1 provides critical input data to X, without which X cannot function appropriately. In most cases, inbound interfaces which feed data to applications need to be brought up first at the DR site. In this example, the applications marked as D1, D2, D3, and D4 are brought up first, and application X is then brought up as D5. Under this scenario, the RTO for application X (D5) depends on the RTO for the other four dependent applications (D1, D2, D3, and D4), and this overall RTO should meet the business requirements as well. Applications for outbound interfaces are brought up subsequently. Applications can also be brought up in parallel instead of in sequence as per the business requirements.
Figure 2: Application Dependency Analysis
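The sequencing logic in this section can be sketched with a topological sort. The Python example below uses the standard library `graphlib` module (Python 3.9 or later). The dependency map mirrors the D1 to D4 and X example, but the recovery durations are invented for illustration, and the calculation assumes that independent applications are recovered in parallel.

```python
from graphlib import TopologicalSorter

# Each application lists the applications that must be running before it starts.
# D1-D4 feed application X (D5), following the example in this section.
DEPENDS_ON = {
    "D1": set(),
    "D2": set(),
    "D3": set(),
    "D4": set(),
    "X": {"D1", "D2", "D3", "D4"},
}

# Hypothetical recovery durations in minutes, invented for illustration.
RECOVERY_MINUTES = {"D1": 60, "D2": 45, "D3": 30, "D4": 90, "X": 120}


def recovery_waves(depends_on):
    """Group applications into waves; all applications in a wave can start in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def earliest_completion(depends_on, minutes):
    """Minutes from exercise start until each application is up, assuming parallel recovery."""
    finish = {}
    for app in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on.get(app, ())), default=0)
        finish[app] = start + minutes[app]
    return finish


if __name__ == "__main__":
    print(recovery_waves(DEPENDS_ON))
    print(earliest_completion(DEPENDS_ON, RECOVERY_MINUTES))
```

With these invented durations, X is available after 210 minutes when D1 to D4 are recovered in parallel, because X has to wait for the slowest of them (D4, 90 minutes). Recovering all five strictly in sequence would instead take the sum of the durations, 345 minutes. Comparing such figures with the agreed overall RTO shows whether parallel recovery is needed.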
Thorough Preparation and Coordination
In disaster recovery exercises, one can tend to skip the proper sequence of DR exercises or overlook its importance. But on the road to an end-to-end disaster recovery exercise, it is crucial to follow the proper sequence of testing, namely walk-through, simulation, parallel, and then full interruption exercises.
Walk-through and simulation tests are required first among the various participating teams (network/firewall, server, database, middleware, and various applications) to ensure that everyone knows what the scenario is, who needs to do what, and what the sequence is. These tests bring out the potential risks to the production environment during the DR exercise and any coordination or sequence related issues in the recovery procedures. Thorough preparation and coordination, involving a great deal of planning, involvement from all the participating teams, and "mini" tests of all the subcomponents, identifies most of the potential issues before they occur and eliminates most human errors.
Figure 3: Performing Disaster Recovery Exercise Using Point-in-Time Copy of Data
Ensuring Back-Out Plan and Data Replication During Exercise
As always, one needs to ensure that an appropriate process is in place for a solid back-out plan (a restore point prior to test start) and for aborting the exercise in the event of an anomaly or critical business need while the exercise is under way.
One also needs to ensure that data replication is not stopped during testing if a continuous data replication process to the disaster recovery site is in place. As shown in figure 3, if storage array based Storage Area Network (SAN) replication is used, point-in-time copies of data can be presented to the hosts for the disaster recovery exercise instead of attaching the replicated SAN volumes directly to the hosts. In this way, data replication need not be stopped during the exercise. In figure 3, data is replicated continuously from the local data center, which is the primary site, to the remote site even during the DR exercise, while testers work on point-in-time copies of the data. The figure also shows, as a best practice, a point-in-time copy and backup taken at the primary site, which might help in resolving major issues caused by data corruption at the primary site. In storage array based replication, there is a risk that when the data on the primary site SAN is corrupted, the secondary site SAN will also hold the corrupted data, and both will become unusable. Hence one should consider this risk and design the DR replication solution accordingly.
Simplified and Automated Recovery Procedure to Resolve Issues in Involving Testing Team
A traditional recovery testing team consists of several groups, such as the operating system, database, middleware, networking, storage, and data center operations teams. Since multiple teams (about 8-9) are involved, scheduling the test becomes more complicated. Also, if a high priority production issue arises during the test, testers may need to leave in the middle of testing, since in most cases they support the production environment as well. In order to reduce the complexity in scheduling and to avoid interruptions, it is recommended to reduce these dependencies to a minimum and designate a DR tester who can run all the recovery steps alone and contact the respective (database, network, storage, OS) system administrators only when there is an issue. The important aspect of this arrangement is that the recovery steps should be documented in such a way that they can be understood and executed by any normal (L4) helpdesk level person who does not have high-level specific administrator (L2/L1) skills.
Besides testing, in a time of actual disaster, there is a tremendous amount of pressure and stress to get everything back up and running and available to users. In manual processes, mistakes will be made for a variety of reasons. Thus, it is suggested to automate the recovery process as much as possible. Having a simplified and automated disaster recovery process would eliminate the unnecessary time delay and manual errors during the recovery.
In most cases, simple scripts can help in reducing the recovery time considerably and in avoiding human error and dependency on skilled administrators during disasters. For example, scripts can be used for activities like mounting or unmounting the disk groups or changing the hard-coded host name in configuration parameters to point to the DR host name, etc.
Growing changes in technology and business models demand business processes which are heavily reliant on complex and interdependent applications, and therefore an organization should attempt process-oriented and “end-to-end” disaster recovery exercises for testing these applications instead of traditional server-centric exercises. Even though there are so many challenges in performing an end-to-end disaster recovery exercise, these challenges can be overcome by thoroughly analyzing the interdependencies and by using the appropriate sequence to bring up the dependent applications. An end-to-end disaster recovery exercise is the only way to effectively build the confidence among stakeholders on the recoverability of the disaster recovery environment and to understand the realistic RTO in a site level outage where multiple applications are impacted.
Shankar Subramaniyan, CISSP, CISM, ABCP, has more than 13 years of experience as a technology consulting and project management executive in the areas of IT Governance, Risk and Compliance (GRC) and business continuity planning. He is a certified professional and has hands-on experience in implementing disaster recovery solutions. He has implemented and managed Information Security Management System (ISMS) based on industry standards such as ISO27001 and ITIL. He has worked extensively on various compliance requirements including SoX, PCI, etc.