Your System Is Down, Do You Know Where Your Data Is?
- Published on October 26, 2007
A Disaster Scenario
Assume a disaster occurs that destroys a critical information system. As a result, the user learns that:
- A critical information system is destroyed and will be unavailable for an extended period of time (assume five days).
- The data that the user will have when the system is restored will be as old as the last backup that was taken and stored off-site.
- All storage devices at the computer location were destroyed, making it impossible to retrieve data from these devices.
- Considerable time had elapsed between the last off-site backup and the disaster.
If a procedure has not been developed that instructs the user on the status of their information and how to manage their information through the recovery process, serious issues begin to surface, for example:
- What has happened to all of the information that was added, changed or deleted between the time the last backup was taken and the disaster occurred?
- How will the user get needed information while the system is down?
- How will information be captured between the time of the disaster and the time the system is available again?
- How will the data be re-entered and synchronized with other applications once the system is available?
A Timeline
The timeline (see figure 1) shows the impact on the user as events unfold during the recovery process.
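The two intervals on the timeline that matter most to the user can be worked out with simple date arithmetic. The sketch below is illustrative only: the function name `recovery_windows` and all dates are hypothetical, and the five-day outage is taken from the scenario above.

```python
from datetime import datetime, timedelta


def recovery_windows(last_offsite_backup, disaster, restored):
    """Return (data_loss_window, outage_window) as timedeltas.

    data_loss_window: changes made after the last off-site backup, which
                      are missing from the restored system.
    outage_window:    time the system is unavailable, during which work
                      must be captured by other means.
    """
    if not (last_offsite_backup <= disaster <= restored):
        raise ValueError("expected last backup <= disaster <= restore")
    data_loss_window = disaster - last_offsite_backup
    outage_window = restored - disaster
    return data_loss_window, outage_window


# Hypothetical dates: the last off-site backup was taken two and a half
# days before the disaster, and the system is restored five days after it.
backup = datetime(2007, 10, 20, 2, 0)
disaster = datetime(2007, 10, 22, 14, 0)
restored = disaster + timedelta(days=5)

lost, outage = recovery_windows(backup, disaster, restored)
print("Data to reconstruct covers:", lost)
print("Manual capture needed for:", outage)
```

Everything entered during the first interval has to be reconstructed, and everything handled during the second has to be captured manually and re-entered later.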
If it's necessary for an organization to minimize or eliminate data loss, there are a number of options. Some common ones include:
- Shortening the time between backups - because backup schedules often interfere with production activities, evaluate the current schedule and find an optimum balance between backups and production. As backups and off-site storage runs become more frequent, the amount of data lost after a disaster decreases; however, manual procedures would still be required to carry out critical job functions.
- Electronic Vaulting - Store an electronic copy of critical transactions at a site that is remote from the primary location where the application processes. In the event of a disaster, these transactions can be recovered quickly at an alternate recovery site.
- Desk procedures - Create procedures that users follow at their workstations in order to keep a copy of critical transactions (this can be automated or accomplished manually). User desk procedures would also include:
  - how the user validates that transactions entered between the time of the last backup and the disaster event are available
  - how the user captures records while systems are down and then re-enters them when the system is restored
  - alternative ways of getting information (manual or automated) while systems are down
  - how the user would perform their job functions (alternative methods) while their systems are down.
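Electronic vaulting and the tracking of critical transactions both come down to keeping a copy of each transaction somewhere that survives the disaster, then working out which of those copies are newer than the restored backup. The following is a minimal sketch of that idea, not a description of any particular vaulting product. It assumes each transaction carries an increasing sequence number (`seq`), and it uses a temporary directory as a stand-in for the remote site; the function names and the journal file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def vault_transaction(journal_path, txn):
    """Append one transaction to the journal as a single JSON line."""
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(txn) + "\n")


def replay_since(journal_path, last_backed_up_seq):
    """Return journaled transactions that are newer than the restored backup."""
    with open(journal_path, encoding="utf-8") as journal:
        txns = [json.loads(line) for line in journal if line.strip()]
    return [t for t in txns if t["seq"] > last_backed_up_seq]


# Stand-in for the remote vault site (assumption: a real vault would be
# a location that is not destroyed along with the primary site).
with tempfile.TemporaryDirectory() as vault_dir:
    journal = Path(vault_dir) / "orders.journal"
    for seq, amount in enumerate([120.00, 75.50, 310.25], start=1):
        vault_transaction(journal, {"seq": seq, "amount": amount})

    # Suppose the last off-site backup contained transaction 1 only.
    to_reenter = replay_since(journal, last_backed_up_seq=1)

print([t["seq"] for t in to_reenter])
```

The same comparison works for a manual desk procedure: a paper or spreadsheet log of transactions plays the role of the journal, and the user re-enters every entry made after the last backup.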
There are a number of reasons why an organization strives to minimize data loss and to have procedures ready after a disaster. For example, there may be:
- Federal, state or local regulations
- A high financial value of transactions
- A large volume of transactions
- A large number of users
- A high cost of doing business manually while systems are down.
An organization can address data loss and alternative work methods with one of the options listed above or with a combination of these and other options.
A key component of any option selected, however, is the synchronization of backups and data reconstruction procedures.
Data must be backed up and restored in a fashion that allows recovered systems to be in harmony with one another. If one system is restored out of synch with another (assuming a logical relationship between the two exists), data integrity may be questionable.
In an urgent recovery situation, data corruption caused by out-of-synch recovery may not be immediately evident and may only become apparent over time. Left unchecked, it could be catastrophic to the organization.
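One way to catch an out-of-synch recovery early is to compare the restore points of logically related applications before users are allowed back on. The sketch below is illustrative only: the application names, the dates, the `out_of_sync` function and the zero default tolerance are all assumptions.

```python
from datetime import datetime, timedelta


def out_of_sync(restore_points, related_pairs, tolerance=timedelta(0)):
    """Return (app_a, app_b, gap) for related applications whose restore
    points differ by more than the allowed tolerance."""
    problems = []
    for app_a, app_b in related_pairs:
        gap = abs(restore_points[app_a] - restore_points[app_b])
        if gap > tolerance:
            problems.append((app_a, app_b, gap))
    return problems


# Hypothetical restore points: the time of the backup each system was
# restored from.
restore_points = {
    "orders": datetime(2007, 10, 21, 2, 0),
    "billing": datetime(2007, 10, 21, 2, 0),
    "inventory": datetime(2007, 10, 19, 2, 0),
}
# Hypothetical logical relationships between applications.
related_pairs = [("orders", "billing"), ("orders", "inventory")]

problems = out_of_sync(restore_points, related_pairs)
for app_a, app_b, gap in problems:
    print(f"{app_a} and {app_b} were restored {gap} apart")
```

A mismatch does not prove that data is corrupt, but it flags the pairs whose data should be reconciled before the recovered systems are trusted.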
An Action Plan
To manage the issue, an organization should first establish the requirements and scope of minimizing data loss and desk procedures - some systems are more critical than others and systems can process across multiple platforms.
The following action plan is a general approach:
1. Assemble a team which can prioritize applications and produce a list of which applications are critical to the organization (consider the criteria listed above)
2. Identify if any logical relationships between critical applications and other applications exist
3. Produce a list of each application that requires minimal data loss and/or end-user desk procedures - the list should include the name of the application, a description of the application, what it is logically related to, where the application runs, what machine it runs on, who owns the application, who the users are and where they are located, and who manages the hardware the application runs on
4. Identify the backup requirements (type, time, frequency) for each application as well as those applications that have a logical relationship to the critical application
5. Evaluate current backup practices for each application and determine if backups are timed appropriately - this should also include the backups for logically related applications
6. Document the backup schedule for each application
7. Evaluate and select which options are most appropriate for reducing data loss (see list above for thought starters)
8. Develop an implementation plan for the selected option
9. Based on which options are chosen for those applications that require end-user desk procedures, consider these steps:
  - Organize sub-teams of information management and application user representatives and have the sub-teams develop generic desk procedures (i.e., how to capture critical records between backups, what to do while systems are down, and how to re-enter records after the system is restored)
  - Pilot the generic desk procedure and then finalize any revisions
  - Distribute the desk procedure to user locations
10. Evaluate data restore procedures for each critical application and logically related applications
11. Develop a schedule to revise any data restore procedures as needed
12. Schedule periodic recovery drills which involve all levels of recovery (disaster notification, manual capture of transactions, system recovery, data restoration and synchronization)
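Steps 2 through 4 amount to building an inventory and then working out which applications must be backed up and restored together. The following is a minimal sketch under stated assumptions: the `AppRecord` fields are a subset of the list in step 3, the application names are hypothetical, and logical relationships are treated as symmetric (if A is related to B, then B is related to A).

```python
from dataclasses import dataclass, field


@dataclass
class AppRecord:
    """One row of the application inventory (a subset of the step 3 fields)."""
    name: str
    description: str
    location: str
    machine: str
    owner: str
    critical: bool = False
    related_to: set[str] = field(default_factory=set)


def backup_group(inventory, app_name):
    """Return app_name plus every application reachable from it through
    logical relationships, treating each relationship as two-way."""
    neighbours = {name: set(rec.related_to) for name, rec in inventory.items()}
    for name, rec in inventory.items():
        for other in rec.related_to:
            neighbours.setdefault(other, set()).add(name)

    seen, stack = set(), [app_name]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(neighbours.get(current, set()) - seen)
    return seen


# Hypothetical inventory entries.
inventory = {
    "orders": AppRecord("orders", "Order entry", "HQ", "host-a", "Sales",
                        critical=True, related_to={"billing"}),
    "billing": AppRecord("billing", "Invoicing", "HQ", "host-b", "Finance",
                         related_to={"ledger"}),
    "ledger": AppRecord("ledger", "General ledger", "HQ", "host-b", "Finance"),
    "payroll": AppRecord("payroll", "Payroll", "HQ", "host-c", "HR"),
}

group = backup_group(inventory, "orders")
print(sorted(group))
```

Each group produced this way is a candidate for a shared backup schedule in steps 4 through 6, since restoring one member without the others is what leads to out-of-synch data.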
The development of a desk procedure in this case is not in lieu of a comprehensive business resumption plan for the functional area.
A business resumption plan addresses the loss of any critical resource needed to perform critical business functions - the desk procedure would be an addition to the business resumption plan.
It's also important to remember that not all applications will require this level of planning, only those that fit the type of criteria outlined above.
The real opportunity to manage this issue is to address it during the application development process.
This is the best time to understand all of the relationships between a new application and any existing applications.
In addition, there is typically close communication with the end-user at this stage to understand the volume and value of the transactions.
The application development team can also work with end-users to develop the desk procedures as part of new application development and deployment.
Gerard J. Minnich, CDRP, is a business continuity program manager.