Past examples of recovery plans have included such things as reciprocal agreements, cold sites and hot-site recovery services. The explosion of computing power and workgroup processes outside the glass house has made simple solutions an ever-shrinking part of the overall recovery picture. The Meta Group says both the advent of CASE and the proliferation of distributed applications that reside on PCs, LANs and mid-range systems are complicating the disaster recovery planning process. For companies that do not centrally manage these distributed environments (many are managed by end users), the task becomes more difficult by orders of magnitude. (Disastrous Disaster Recovery, META Group, 01/27/91)
Disaster planning for LANs and distributed systems is often overlooked, yet it should be recognized as a priority by the highest levels of management. The information in use on distributed systems is no less critical to the organization than that in the glass house, and recovery of these client server systems should receive the same attention that recovery of core mainframe business applications has traditionally received.
The first steps in disaster recovery planning are the risk and business impact analyses. Understanding the risks and business impacts presented by client server and distributed information systems requires knowing the full magnitude of your exposure, so the first step in recovery planning for this area is a complete audit of the amount, type and value of information in use by the enterprise.
Don’t underestimate the difficulty presented by this task. It is probable that no one person in your organization can even identify the number and type of network servers and workstations in use in your company, let alone the applications and data present on these platforms.
The focus of my attention will be on the backup and protection of data.
Information on the amount and type of information in use should be gathered, either internally through the use of specialized forms or externally by consultants who specialize in this type of data assessment.
Once all of the critical applications are identified and the sources of all data are profiled, the information should be classified and then protected accordingly. Classifications might cover both business value (business critical, less critical, expendable) and volatility (constantly changing, occasionally changing, unchanging).
Specific protection strategies can be developed to meet the requirements of these different types of information. Factors to be considered are the frequency of backup and the need for instant accessibility of restores.
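As a sketch of how such a classification might drive protection policy, consider the small lookup table below. The class names, schedules and restore priorities are illustrative assumptions only, not prescriptions:

```python
# Sketch: mapping hypothetical data classifications to protection policies.
# Every class name and schedule here is an illustrative assumption.

POLICIES = {
    # (business value, volatility): (backup frequency, restore priority)
    ("critical", "constantly-changing"): ("hourly incremental, nightly full", "instant"),
    ("critical", "static"):              ("weekly full",                      "instant"),
    ("less-critical", "dynamic"):        ("nightly incremental",              "next-day"),
    ("less-critical", "static"):         ("monthly full",                     "best-effort"),
}

def protection_policy(value, volatility):
    """Return (backup frequency, restore priority) for a classified data set."""
    # Unclassified data falls back to a conservative default.
    return POLICIES.get((value, volatility), ("nightly incremental", "best-effort"))
```

The point of the table is that frequency of backup and speed of restore are decided once, per class, rather than renegotiated for every application.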
There are three basic types of solutions that are evolving for backup protection in client server environments: Workgroup, departmental and enterprise. Each of these has particular strengths and weaknesses.
Workgroup solutions that rely on the end user or are administered at a local level are usually the first type of backup policy that is initiated. These backup products and policies tend to focus on individual applications. They use workstation or server attached tape backup devices to provide local protection of data; however, these solutions still should be included in the corporate wide strategy for data protection. Besides being limited in scope, these solutions tend to rely heavily on administrative intervention. They are, however, better than doing no backup.
The next step up the ladder is the departmental solution. Through the use of additional automation and migration features, these solutions focus on a larger pool of applications and data. More sophisticated management tools and distributed tools allow sharing of resources and reduced operator intervention. Unfortunately they still result in islands of data that often remain isolated from the corporate strategy. Additionally, these solutions tend to contribute to network capacity problems because of the random movement of large volumes of data through the backbone network.
Solutions that focus on the entire enterprise, allowing a corporate-wide data protection strategy to be managed from a central location, are increasing in popularity. These solutions span multiple computing platforms. They are easier to automate and allow greater economies of scale in peripheral usage. And, since the solution is managed by a central group, network and system capacity can be more effectively managed.
Companies are implementing distributed computing solutions at a pace that is far outpacing the technology and resources available to effectively manage and protect these applications. This is like walking a tightrope without a net.
With LAN and workgroup management resources strained, administrators are looking towards automated solutions to data management. When analyzing these tools a few issues should be kept in mind. First, software packages that advertise unattended backup may be too good to be true. While they may in fact schedule and start backup processes on servers or workstations, they may also require some kind of human action, such as changing and labeling tapes, downstream.
For example, operators may still have to mount and change tapes, label and track tapes and manage the packing and transportation of offsite storage materials.
Also, look closely at an application’s ability to handle unexpected errors. Does the backup continue after the problem is solved? Does it start over? In the event of single errors, is the rest of the backup data recoverable? Human intervention in the event of errors should be a last resort.
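The resume-rather-than-restart behavior described above can be sketched as follows. This is a simplified illustration, not any vendor's implementation: failed items are recorded and retried, so a single bad file does not invalidate the completed portion of the run:

```python
# Sketch of error handling that resumes rather than restarts: failures are
# recorded and retried; completed copies are never redone.
import shutil

def backup(files, dest_dir, max_retries=2):
    """Copy files to dest_dir; retry failures without redoing completed copies."""
    pending = list(files)
    failed = []
    for attempt in range(max_retries + 1):
        failed = []
        for path in pending:
            try:
                shutil.copy2(path, dest_dir)
            except OSError:
                failed.append(path)   # keep going; retry this file later
        if not failed:
            break
        pending = failed              # retry only what failed
    return failed                     # anything left needs human attention
```

Human intervention is reserved for whatever remains in the returned list after all retries.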
Among the most important features of any backup package are its management and diagnostic reporting tools. When human intervention is required, it’s important to be able to locate and repair problems quickly. Management tools should also help in estimating backup times and planning for optimum resource usage on the network.
Client server computing addresses a critical business need: the ability to develop and implement new systems quickly. These solutions are focused on the requirements of individual business units. Because the needs of business units vary greatly, companies can end up with a large number of different computing platforms and operating systems. In order to address the effective recovery of these systems, management can either impose restrictions on the number and type of operating environments, or select recovery systems that connect and manage a wide variety of systems.
One of the fastest growing segments of client server development is the database application. These applications, because of their availability requirements and constant state of change, pose unique problems for the backup manager. Managers should look for tools that perform native backup of open database applications, or tools that back up a shut-down database through a raw partition copy. Obviously, if using the latter, the speed of the solution is of paramount importance in reducing downtime.
Another critical factor in selecting a backup strategy is management reporting and performance modeling capability. Especially important when managing enterprise solutions, reports should allow managers to identify problem areas in the backup process, bottlenecks, missed files, capacity issues and also help plan for growth. In general, the more robust the management reporting features, the more effective the package.
The great thing about standards is that there are so many to choose from. This tongue-in-cheek statement illustrates one of the biggest conundrums facing today’s business recovery manager. By adopting standards for computing platforms and application programs, the manager reduces the complexity of recovery planning, implementation and testing; however, these same standards restrict the ability of business units to develop their solutions. A careful balance of standards and flexibility will allow peaceful coexistence.
Within computing platforms, though, the processes of recovery are vastly simplified by standardizing on a single set of peripherals, add-in cards and software revision levels whenever possible. Deviation from this policy should be the exception rather than the rule. Draconian policies that force end users into skunkworks projects will only isolate the workgroups further from MIS and the business resumption team. Allow unique solutions to be evaluated independently, with their business merits weighed against recovery cost whenever possible.
Selecting a common tape format has been complicated by an increasing number of entries. Quarter Inch Cartridge (QIC), DAT and the new Digital Linear Tape (DLT) are all vying for acceptance. A manager should analyze the speed, reliability and longevity of all tape types before selecting one. One of the most durable, and most often overlooked, entries is the 3480 cartridge tape. Improved connectivity tools continue to make this high speed and highly reliable device available to the enterprise.
A great effort has been made by Novell in the area of tape standardization. Storage Management Services (SMS) defines an architecture for interfacing to Novell and client systems through documented Application Program Interfaces (APIs). Back-end compatibility is addressed through the System Independent Data Format (SIDF), a tape format that addresses compatibility between tape types. You can back up your applications and associated data 10 times a day, but if your tape breaks or your drive fails when it’s time to recover, you’re still out of business. It’s because of this that we say backup is never the real problem; recovery is.
Because of the critical nature of today’s distributed computing environment, reliability should be of paramount concern to the recovery planner. While today’s small personal tape devices continue to improve in performance, price and reliability, they still have a long way to go to catch up to the availability standards set by mainframe peripherals.
Careful consideration should be given to the useful life of the type of storage media being used for data backup. Some data must be stored for long periods of time, sometimes dozens of years, either for business reasons or regulatory requirements. For this data, optical media beats magnetic as the medium of choice, since it does not degrade over time; however, optical standards are still evolving, so choose carefully!
One of the hottest growth markets in the client server environment is in the area of storage management. Lately there has been a proliferation of tape backup and hierarchical storage tools for LANs; however, these LAN tools are considered stone-age products when compared to the more mature tools in use in the mainframe computing environment. Partly because of the lack of data management tools, the annual per-user cost of systems management in distributed systems is much higher than that of corresponding mainframe systems ($1,540 per user vs. $230 per user; Enterprise Systems Journal, October 1994).
One distinction that should be made is the difference between backup and archive. The term backup refers to a duplicate copy of a data set that is held in storage in case the original data are lost or damaged. Archiving refers to the process of moving infrequently accessed data to less accessible and lower cost storage media, leaving a pointer behind so that applications needing the data are transparently provided with access. Archiving is a process that is intended to reduce the amount and cost of on-line disk storage.
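The pointer mechanism behind archiving can be illustrated with a toy sketch. The stub-file format here is an assumption for illustration only; real hierarchical storage tools do this transparently inside the file system:

```python
# Toy sketch of archive-with-pointer: the file body moves to cheaper storage
# and a small stub records where it went, so access can be redirected.
# The ".stub" convention is an illustrative assumption.
import json, os, shutil

def archive(path, archive_dir):
    """Move a file to archive_dir, leaving a .stub pointer in its place."""
    dest = os.path.join(archive_dir, os.path.basename(path))
    shutil.move(path, dest)
    with open(path + ".stub", "w") as stub:
        json.dump({"archived_to": dest}, stub)   # the pointer left behind

def retrieve(path):
    """Follow the stub to bring an archived file back on line."""
    with open(path + ".stub") as stub:
        dest = json.load(stub)["archived_to"]
    shutil.move(dest, path)
    os.remove(path + ".stub")
```

Backup, by contrast, would copy the file and leave the original in place.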
Be careful when implementing these solutions together. Some archiving tools don’t cooperate well with backup tools: when an archiving package sees the backup process accessing an archived file, it may transfer the file back to the source server. This causes thrashing as files are moved back and forth out of archive just to be backed up.
Some day there will probably be tools that provide enterprise hierarchical storage management covering everything from mainframe to PC. Today, however, the business recovery manager must be aware of the patchwork of tools in use in the business, and plan accordingly. Few parts of a business resumption plan are as important as providing for the safe storage of critical information, but having a secure site at which to vault a recent copy of operational data is only part of the solution. Deciding which information to move, and how and when to move it, is also crucial.
When workgroup and department backup strategies are used, coordinating the off-site storage of information is made more difficult than if all information is gathered to a central site using an enterprise backup product. LAN and workgroup administrators must use a manual process to gather, label and package tapes for the trip to the vault. These processes are subject to error, or worse, to not being done.
Additionally, information that has been boxed and is awaiting transport presents a window of exposure to theft or loss. And, by definition, if you are using scheduled pickup, your offsite data backup is out of date by the time it’s moved, meaning you’ll need to do some amount of data entry to perform a complete recovery.
Some large organizations have begun to experiment with real-time offsite data vaulting and journaling. Storage peripherals are moved to a remote site, and connected to the processors via high speed telecommunications lines. In the case of vaulting, entire images are transmitted to storage several times a day. Journaling applications post a copy of each transaction to both the local and remote storage sites. While the communications costs for these solutions are still rather high, there is some trade off in terms of personnel and peripheral costs and faster recovery times.
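The journaling approach can be sketched in a few lines. This toy ignores the communications link entirely; the class and field names are illustrative assumptions:

```python
# Toy sketch of transaction journaling: every update is posted to both a local
# and a remote journal before it is acknowledged, so the remote copy is never
# more than one in-flight transaction behind.
class JournaledStore:
    def __init__(self):
        self.local, self.remote = [], []   # the remote list stands in for a vault site

    def post(self, txn):
        self.local.append(txn)
        self.remote.append(txn)            # in practice, sent over a high-speed link
        return len(self.remote) == len(self.local)   # "ack" once both copies exist

store = JournaledStore()
store.post({"account": "A-1", "amount": 100})
```

Vaulting, by contrast, ships whole images a few times a day, so its recovery point lags by hours rather than by a single transaction.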
While complete loss of a workgroup is something to fear and plan for, the majority of data loss disasters are the result of human error, with lost or deleted files being the most common problem. With this in mind, whatever backup solution you select should address both individual file recovery and full system recovery.
Backup products should not rely on an administrator to restore an end user’s files in the event of a loss. Administrative intervention causes delays and expense that are often intolerable in today’s fast paced world of business. Look for products that offer ad-hoc fast restore that is directly accessible to the end user.
Manual tape processes have the tendency to cause problems at restore time. Multiple copies of files exist on different tapes. Full backups, incremental and special backups all mix together in the cardboard box under the server table. Finding the right tape for the restore becomes what we call the needle-in-the-haystack syndrome. Look for products that automate the indexing and labeling of tapes.
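A minimal sketch of what automated tape indexing buys you: a catalog that maps each file to every tape holding a copy, so a restore goes straight to the newest copy instead of scanning the cardboard box. The tape labels and catalog structure here are hypothetical:

```python
# Sketch of a tape catalog: file path -> [(tape label, backup time), ...].
# Labels and timestamps are illustrative assumptions.
from collections import defaultdict

catalog = defaultdict(list)

def record(tape, when, files):
    """Index every file written to a tape during a backup run."""
    for f in files:
        catalog[f].append((tape, when))

def tape_for_restore(path):
    """Return the label of the tape holding the most recent copy, or None."""
    copies = catalog.get(path)
    if not copies:
        return None
    return max(copies, key=lambda c: c[1])[0]

record("FULL-001", 1, ["/data/report.doc"])
record("INCR-002", 2, ["/data/report.doc"])
# tape_for_restore("/data/report.doc") -> "INCR-002"
```

With the catalog in hand, the operator mounts exactly one tape instead of trying them in sequence.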
Anyone who has ever restored a single file from an old QIC drive knows the annoying problems presented by unindexed, unsegmented tape products. Endless whirring and repeated ‘Insert Next Tape’ messages seem to take days to get through. Several vendors now offer fast-restore capabilities that allow a single file to be found and restored quickly, without scanning the entire tape pool. Automated tape robots can speed the recovery process even more.
Studies show that the greatest cost in data management comes from restoring, not backing up.
Look for solutions that ease this part of the burden.

Differences in perception between LAN and mainframe staff can also lead to language barriers.
For example, LAN administrators tend to measure performance in megabits per second, while mainframe types talk in terms of gigabytes per hour. A simple rule of thumb for conversion is: 2.2 Mb/s ≈ 1 GB/h.
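A quick arithmetic check of that rule of thumb, assuming decimal gigabytes:

```python
# Verify the rule of thumb: 1 GB/h expressed in Mb/s.
def gb_per_hour_to_mb_per_s(gb_h):
    bits_per_hour = gb_h * 1e9 * 8        # decimal gigabytes to bits
    return bits_per_hour / 3600 / 1e6     # per second, in megabits

rate = gb_per_hour_to_mb_per_s(1)          # ~2.22 Mb/s, matching the 2.2 figure
```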
Factors that affect backup performance include processor power, network congestion, router hops and the number of parallel backup streams.
The largest obstacle to the widespread acceptance of enterprise backup products is the relatively slow communications links between branch offices and corporate headquarters.
Finding a solution that can surmount this problem can be tricky. One possibility is to design a system that performs full backups locally and frequent incremental backups over the link.
Copies of the full backups can be sent off site for security, and the incremental backups provide the ability to restore closer to the point of failure. Often seen as a panacea for bandwidth limitations, data compression can provide the capability to drive more data over a link with a fixed amount of bandwidth; however, there is a cost associated with this.
Compression utilities use processor cycles to execute. If you are doing an incremental backup on a platform with limited processor power, the compression is likely to take as long as the actual data transmission. With this in mind, carefully weigh the importance of compression when selecting a backup tool.
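A back-of-the-envelope model of this trade-off: compression wins only when the processor can compress comfortably faster than the link can transmit. All throughput figures below are illustrative assumptions:

```python
# Sketch of the compression trade-off. Treats compression and transmission as
# a pipeline limited by its slower stage. All numbers are illustrative.

def transfer_time(size_mb, link_mb_s, compress_mb_s=None, ratio=2.0):
    """Seconds to move size_mb over a link, optionally compressing first."""
    if compress_mb_s is None:
        return size_mb / link_mb_s
    # the pipeline runs at the pace of the slower of the two stages
    return max(size_mb / compress_mb_s, (size_mb / ratio) / link_mb_s)

plain    = transfer_time(100, link_mb_s=0.5)               # 200 s uncompressed
slow_cpu = transfer_time(100, 0.5, compress_mb_s=0.4)      # ~250 s: CPU is the bottleneck
fast_cpu = transfer_time(100, 0.5, compress_mb_s=5.0)      # 100 s: compression pays off
```

On a weak platform the "compressed" run is actually slower than sending the data raw, which is exactly the caution in the paragraph above.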
Recent developments in the capabilities of port switched LAN hubs have made dynamic LAN reconfiguration a reality. What this means is that your LAN can be configured for user performance during peak hours and then reconfigured during off hours to facilitate LAN backup.
Security is a primary consideration of corporate data protection. Unfortunately, like most aspects of business recovery, security is difficult to factor into the data protection equation. Backup policy must be careful not to usurp the host and application security already in place in the enterprise.
End user confidentiality must be protected in order to maintain confidence in the system. Backup utilities that require supervisor or super user access to the client systems must provide some level of protection for the associated logon information. Security packages like RACF and Top Secret should be honored, as should local bindery or file permission information.
While it’s important for end users to have the ability to recover their own files, they should never have access to another user’s files. Administrators should only be able to backup and restore that information which belongs to users in their workgroups.
With this in mind, managers should attempt to select a solution that provides multiple levels of security appropriate to the requirements of the business. Additional data protection tools could include network security features like end-station authentication and data encryption.
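In outline, the multiple levels of restore security described above might look like the check below. The role names and workgroup fields are assumptions for illustration:

```python
# Sketch of restore-time scope checks: end users may restore only their own
# files, and administrators only files belonging to their workgroups.
# Roles and field names are illustrative assumptions.

def may_restore(requester, file_owner, file_workgroup):
    """Decide whether a restore request is within the requester's scope."""
    if requester["role"] == "user":
        return requester["name"] == file_owner
    if requester["role"] == "admin":
        return file_workgroup in requester["workgroups"]
    return False   # unknown roles get nothing

alice = {"role": "user", "name": "alice"}
ops   = {"role": "admin", "name": "pat", "workgroups": ["finance"]}
```

In a real product this check would sit behind the existing host security (RACF, Top Secret, bindery rights) rather than replace it.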
Physical security of the storage media is also important. Complete and usable backup tapes are useful only if they aren’t lost or stolen during transportation or storage.

The most important thing to keep in mind when designing a business recovery strategy for distributed computing environments is the individuality inherent in the system.
Workgroup computing solutions grow from unique requirements and their managers often must rely on their own devices to be successful. Because of this, they are often reluctant to participate in corporate MIS strategies. It is therefore important to get their buy in early when putting together the corporate plan. Also, recognize that these same unique needs may result in different requirements for recovery.
Define your objectives clearly, and let the individual line managers help define the strategy for their workgroups. Because the scope of this project will span many departments, realistic testing of the business recovery plan is critical. Anyone who has participated in a recovery exercise for a mainframe environment knows that the recovery never goes easily.
There is always a lost or late tape, tapes are mislabeled, communications links fail, and so on. Now realize that recovery in a decentralized, distributed computing environment will be many times more difficult.
Recovering the workspace, computing platforms, communications links and manual processes of several workgroups simultaneously will never be simple, but regular, realistic tests will go a long way toward refining the plan into something effective.
This testing will also help to maintain the plan. The same fast-paced development that makes distributed computing attractive to business also causes change to occur at a frantic pace.
It’s important to recognize this and keep your plan from getting too far out of date. While a well-thought-out, comprehensive enterprise backup strategy will factor in change and growth, regular tests, audits and refinements should be a habit.
Roger Farnsworth is a product manager with Cisco Systems, Inc. This article was submitted by Barbara Dicken of the Editorial Advisory Board.