Growing reliance on information technology, along with compliance and regulatory requirements, has led many organizations to focus on business continuity (BC) and disaster recovery (DR) solutions. Availability has become a major concern for business survival. This document lays out some of the best practices and lessons learned in implementing disaster recovery solutions. The information shared here is based on experience gained during the execution of disaster recovery implementation projects.
Integrated Business Continuity Management (BCM)
What is the Scope Difference Between Business Resumption, Disaster Recovery and Emergency Preparedness?
DR, business resumption and emergency preparedness are three integrated components of a business continuity management solution. Some organizations focus on only one component, or on part of the solution. It is possible to have a perfectly functioning redundant data center and still be unable to continue the business. It is therefore important that the three components of business continuity management listed below are tightly coupled and integrated to ensure successful business continuity.
- Emergency preparedness
- Business resumption
- Disaster recovery
Disaster recovery addresses recovery of critical IT infrastructure such as hardware, software, telecom, network and data for bringing up the mission critical applications to support the business.
Business resumption involves the recovery of critical business functions and processes that relate to or support the delivery of core products/services to customers. It focuses on products/services, non-IT employees, vital records and other stakeholders involved in supporting critical business functions.
Emergency preparedness is designed to enable an effective response to an event. It focuses on stabilizing the situation. Generally this is coordinated by a team responsible for health & safety of the employees. The scope of emergency preparedness includes utility failures, non-availability of employees/pandemic preparedness and non-availability of facilities due to bomb threats, etc. It also handles transportation and coordination with external agencies.
Lesson Learned: A complete BCM solution requires the tight integration of emergency preparedness, disaster recovery and business resumption.
Defining Project Scope
How to Stop Scope Creep
One of the most important steps in DR project management is defining the disaster scope. It helps the project team maintain control of the project. A clear understanding of what is included in a project is essential to its success.
The disaster scope should include the nature of the impact and the timeline. For example, it may be something like, “If the primary data center goes down because of a major incident, how will the critical business processes and applications survive for 3-4 weeks until the primary data center is brought back up?”
A clear disaster scope definition helps to set expectations and iron out scope issues early.
Another challenge is the ongoing upgrade and development changes in applications, which can delay completion of the DR project. The DR project team should therefore agree on a cutoff date with the application team and communicate that the DR environment will match production as of that date. Any required changes or upgrades should be implemented in production before the cutoff date; the DR environment will then be built to match production as of that date and will not include any new or additional features. If production changes after the cutoff date, the application team has to communicate those changes and implement them in the DR environment as well. This keeps any technical refresh or upgrade project from merging with the DR project.
Also, we need to ensure disaster recovery is not confused with high availability. High availability (HA) addresses local system or architecture failures and is generally a solution under the same roof. Disaster recovery, by contrast, allows an application to be recovered at an alternate site away from the production site when a higher-level failure or disaster strikes.
Lesson Learned: One of the important steps in the planning phase is developing the proper disaster scope definition which outlines the likely scenario and its impact.
Effective Change Management
How to Ensure that DR and Production Environments are Kept Equivalent
As new features are added or the configurations of particular applications change, some of the changes are often not properly implemented in the DR environment. When this happens, the application at the alternate location is incomplete and will not function as required in the event of a disaster.
With a SAN-based data replication strategy, it is possible to replicate all binaries, configuration files, data files, etc., from the production environment to the DR environment so that all changes are replicated automatically. Scheduler-based or script-based solutions are implemented for replication wherever it is not feasible or economical to replicate through the SAN. Also, the DR environment should be integrated with the organization’s change management process to avoid manual errors.
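Where replication is script-based, a periodic comparison of the two environments can catch changes that never reached DR. The following is a minimal sketch, assuming both directory trees can be read from one host (or that digest maps are collected on each site and shipped across); the function names and report keys are illustrative and not part of any replication product.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def find_drift(prod: dict[str, str], dr: dict[str, str]) -> dict[str, list[str]]:
    """Compare two digest maps and list the files that differ between sites."""
    shared = set(prod) & set(dr)
    return {
        "missing_in_dr": sorted(set(prod) - set(dr)),
        "extra_in_dr": sorted(set(dr) - set(prod)),
        "changed": sorted(name for name in shared if prod[name] != dr[name]),
    }
```

A scheduled job could call `find_drift(file_digests(prod_root), file_digests(dr_root))` and raise a change ticket whenever any of the three lists is non-empty.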
Lesson Learned: Integration of DR with change management is very critical.
Hardware Resource Planning
How to Handle Hardware Planning
A major challenge in resource planning is end-of-life (EOL) hardware. This is common when the application is older and has been in use for more than 10-15 years. If the production environment runs on EOL hardware, new equivalent hardware will have to be bought for DR. The issue with new hardware is that it may not support older operating system (OS) versions. For example, a new M5000 server procured for DR may support only Solaris 10 while the production environment is still running Solaris 9. It may then be necessary either to upgrade the production environment to Solaris 10 or to procure hardware that is compatible with the old OS.
Lesson Learned: Proper hardware identification and sizing is a major challenge in setting up a new DR environment.
Software License Management
How to Handle Licensing Issues
Another major issue faced is in arranging software licenses, specifically for databases and other middleware applications in the DR environment. The vendor may insist on procuring additional licenses. Licensing for the DR environment is based on the software agreement one has signed while procuring the software. Procurement, legal and the senior management team should be involved to resolve this issue.
Lesson Learned: Provision for a DR solution should be mentioned clearly in the software agreement and coordination with procurement and legal is vital in handling software licenses for the DR environment.
How to Perform an Effective BIA
Sometimes a business impact analysis (BIA) has to be conducted in a very short time, and it may not be possible to conduct a detailed BIA for every application in the organization. In those cases, it is suggested to first have a discussion with the senior management of each business division to identify the candidates for a detailed BIA. The audit team and the IT support team should also be involved in the discussion and in shortlisting the applications that the respective business units consider very critical. A BIA questionnaire can then be sent to the respective application owners to determine the impact, the recovery time objective (RTO), the recovery point objective (RPO) and the dependencies. Common challenges in conducting a BIA are explaining RTO and RPO to the business users and ensuring that impacts are not duplicated or overlapping among upstream and downstream applications. It may be necessary to conduct several awareness sessions and meetings to ensure all doubts are clarified. Each BIA should also be signed off by the respective business finance team to avoid errors in estimating the impact.
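One check that can be run over returned questionnaires is whether the stated RTOs are consistent with the stated dependencies: an application generally cannot be back in service before the upstream applications it relies on. The sketch below assumes each questionnaire has been reduced to a name, an RTO, an RPO and a list of upstream dependencies; the `BiaEntry` structure and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BiaEntry:
    name: str
    rto_hours: float  # recovery time objective: maximum tolerable downtime
    rpo_hours: float  # recovery point objective: maximum tolerable data loss
    depends_on: list[str] = field(default_factory=list)  # upstream applications


def rto_conflicts(entries: list[BiaEntry]) -> list[str]:
    """Report dependencies that are missing or slower to recover than their consumers."""
    by_name = {entry.name: entry for entry in entries}
    problems = []
    for entry in entries:
        for upstream in entry.depends_on:
            dependency = by_name.get(upstream)
            if dependency is None:
                problems.append(f"{entry.name}: no BIA returned for dependency {upstream}")
            elif dependency.rto_hours > entry.rto_hours:
                problems.append(
                    f"{entry.name} (RTO {entry.rto_hours}h) depends on "
                    f"{upstream} (RTO {dependency.rto_hours}h)"
                )
    return problems
```

Each reported conflict is a prompt for a follow-up conversation with the application owners rather than an automatic correction, since either figure may be the one that was misunderstood.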
Lesson Learned: Ensure that the BIA is clearly understood by the person filling it out, and validate the financial impact with respective finance teams to avoid any errors in impact estimation.
Maximizing the Return on Investment in Hardware
Can the DR Hardware Be Utilized Economically?
In general, DR solutions cost too much, requiring enormous investment in additional server and networking hardware to replicate existing data centers – increasing infrastructure needs accordingly. These expenditures inflate the cost of IT, while reducing average system utilization. These cost and complexity challenges have effectively restricted or degraded many IT disaster recovery plans.
It may be very difficult to convince management to purchase new hardware when everyone knows it will sit idle until a disaster strikes the primary site. In addition, when the new hardware is known to perform better than the existing production hardware, there will be pressure to use it for production.
One solution considered in those cases is to use the old hardware for DR and the new hardware for production. Another solution may be to use repurposing software, which allows servers to be used as a staging or QA environment under normal circumstances and to be brought up as the DR environment quickly when disaster strikes. This way the hardware procured is not kept idle and is used effectively in normal times.
Also, consolidation and virtualization options should always be analyzed while planning hardware requirements for DR, since they can reduce those requirements considerably.
Lesson Learned: Consolidation, virtualization and repurposing software can be very useful in optimizing the cost of hardware.
Improving the Recovery Procedure
How to Improve the Response and Reduce the Recovery Time
In a time of disaster, there is a tremendous amount of pressure and stress to get everything back up and running and available to users. In manual processes, mistakes will be made for a variety of reasons. It is therefore suggested to automate the recovery process as much as possible. A simplified and automated disaster recovery process eliminates unnecessary delays and manual errors during the recovery.
Also, traditional recovery procedures involve several groups, such as operating system and database administrators. It is recommended to reduce these dependencies to a minimum and designate a first responder who can run all of these procedures alone as an immediate step, contacting the respective administrators only when there is an issue in the procedure. Simple UNIX scripts can be used to automate most of the steps in the recovery procedure, which simplifies the steps, avoids manual syntax errors and reduces the recovery time. Together, these measures produce a recovery procedure that can respond quickly to any disaster incident.
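The first-responder idea can be sketched as a small runner that executes the recovery steps in order and stops at the first failure, naming the administrator group to contact. The text recommends UNIX scripts; the sketch below uses Python to wrap such scripts, and the `Step` structure, the commands and the group names are all hypothetical placeholders for a site's real procedure.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    command: list[str]  # hypothetical command; replace with the site's real recovery script
    escalate_to: str    # administrator group to contact if this step fails


def run_runbook(steps: list[Step]) -> tuple[bool, str]:
    """Run steps in order; stop at the first failure and say whom to contact."""
    for number, step in enumerate(steps, start=1):
        try:
            result = subprocess.run(step.command, capture_output=True, text=True)
            failed = result.returncode != 0
        except OSError:
            # The command could not be started at all (for example, script not found).
            failed = True
        if failed:
            return False, f"step {number} ({step.description}) failed; contact {step.escalate_to}"
    return True, f"all {len(steps)} steps completed"
```

Stopping at the first failure is a deliberate choice here: later steps such as starting the database usually assume that earlier ones such as mounting the replicated volumes have succeeded.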
Lesson Learned: Simple UNIX scripts can make the recovery steps less complex, avoid manual errors and reduce the recovery time substantially.
Some Common Pitfalls in Storage Area Network (SAN) Replication
One of the major issues in SAN replication is handling data corruption at the primary site. In storage-based replication, when the data on the primary site SAN is corrupted, the corruption is replicated to the secondary site SAN as well. In that case both copies become unusable.
Other issues to consider when deploying SAN-based replication are ensuring data integrity across an application group and handling temporary replication link failures. Some SAN replication solutions have a built-in feature that time-stamps the replication copies and enables recovery from previous copies.
Another available solution is to combine snapshots with SAN replication, which makes it possible to recover from the last snapshot. These solutions help if the primary data is corrupted. Some SAN products on the market have features such as consistency groups and journaling. A consistency group is a collection of storage volumes across multiple storage units that are managed together when creating consistent copies of data. Consistency groups ensure that all dependent applications or databases are restored to the same point in time. This avoids mismatches between the data restored for different applications or databases and preserves data integrity across the components of the application group.
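The point-in-time property can be illustrated with a small sketch that picks a recovery point for a group of volumes: the newest snapshot time that exists on every volume and predates the corruption. In practice the storage array handles this internally; the function, the volume names and the snapshot lists below are hypothetical and only illustrate the selection logic.

```python
from datetime import datetime
from typing import Optional


def latest_consistent_point(
    snapshots: dict[str, list[datetime]], corrupted_at: datetime
) -> Optional[datetime]:
    """Return the newest snapshot time shared by every volume and taken before the corruption."""
    if not snapshots:
        return None
    # Only timestamps present on all volumes give a consistent group-wide restore.
    common = set.intersection(*(set(times) for times in snapshots.values()))
    usable = [taken for taken in common if taken < corrupted_at]
    return max(usable, default=None)
```

If one volume in the group lacks a snapshot that the others have, that timestamp is skipped, because restoring the remaining volumes to it would leave the group inconsistent.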
The journal is written to disk rather than to cache, which allows the remote copy to ride out longer link outages without being suspended. If the storage replication link fails, the journal holds the pending writes until the link is restored. The size of the journal disks determines the maximum outage that can be tolerated: the larger the journal, the longer the outage it can absorb.
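The relationship between journal size and tolerable outage can be estimated with simple arithmetic. The sketch below assumes a steady average write rate, 1 GB = 1024 MB, and no allowance for journal metadata or compression; the function name and the figures in the example are illustrative only, and vendor sizing guidance should take precedence.

```python
def tolerable_outage_hours(journal_gb: float, write_rate_mb_per_s: float) -> float:
    """Hours of link outage a journal can absorb before it fills, at a steady write rate."""
    if write_rate_mb_per_s <= 0:
        raise ValueError("write rate must be positive")
    seconds_until_full = (journal_gb * 1024) / write_rate_mb_per_s
    return seconds_until_full / 3600


# Example: a 500 GB journal with writes arriving at an average of 20 MB/s.
print(f"{tolerable_outage_hours(500, 20):.1f} hours")
```

Under these assumptions the example journal fills in roughly seven hours, and doubling the journal size doubles that window.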
Lesson Learned: Snapshot/shadow image, consistency group and journaling should be considered while deploying a SAN-based replication solution.
Automated Tools for Crisis Communication
A disaster recovery plan is only as effective as the ability to communicate with and activate the recovery team. When an emergency strikes, the traditional method of phoning each person individually may be neither effective nor reliable. Therefore, to build a fast, multi-channel and reliable means of communication during an emergency, emergency communication tools can be included in the crisis communication plan.
Global changes in business models, heavier reliance on information technology, and recent developments in disaster recovery technologies are forcing organizations to reformulate their DR solutions. Some of the lessons and best practices mentioned in this document can be utilized to create viable and successful DR and business continuity solutions for clients.
Shankar Subramaniyan, CISSP, CISM, PMP, ABCP, has more than 13 years of experience as a technology consulting and project management executive in the areas of IT Governance, Risk and Compliance (GRC), and Business Continuity Planning. He is a certified professional and has hands-on experience in implementing disaster recovery solutions. He has implemented and managed Information Security Management Systems (ISMS) based on industry standards such as ISO 27001 and ITIL. He has worked extensively on various compliance requirements such as SOX and PCI.