Business continuity plans are designed to ensure organizational survival. Among other things, they provide a roadmap to restore an IT infrastructure – the business backbone of today’s corporation – as quickly as possible after a disaster.
Many IT disaster recovery plans include some level of configuration information that is collected at a given “snapshot” point in time. Typically, this is a hardware and software asset catalog: vendor name, model number, serial number, location, etc. for hardware; and vendor name, version number, service pack information, etc. for software.
Most enterprises feel that these plans and inventories cover all the bases. However, the speed of business restoration efforts is impeded by inadequate or absent documentation of the IT infrastructure that must be recreated. Even when a safe data center and backup tapes are available, they do not by themselves help IT staff (assuming they, too, are available) rebuild a network quickly in an emergency. Detailed knowledge of server, database, and router configurations is essential to re-establish a working IT framework in which to restore corporate data.
For most organizations, information and the technology that supports it are the most valuable assets. More than 75 percent of the Global 2000 corporations have installed enterprise resource planning (ERP) systems; supply chain management systems; collaborative front office applications; and a host of other Internet and intranet applications. These applications are deployed over multiple systems and databases – and across multiple locations. Many – but not all – mission-critical applications have their data backed up to tape that is usually stored at a safe site off the corporate premises.
Time Is Money
Restoring the IT infrastructure is the most crucial phase in keeping the business running in the event of a disaster.
The high cost of downtime goes beyond lost sales. Failure to perform can lead to contractual penalties. Customers who abandon you may never come back – and even if they do, the cost of sales increases due to a new competitive mix. If records such as invoices are lost, you may lose thousands or millions of dollars.
While you wait to restore your IT infrastructure, you must still pay salaries or suffer a public relations disaster. After a tragedy such as Sept. 11, your company’s reputation may not suffer, but your stock price, credit rating, and cash flow can. Spurred by the events of Sept. 11, enterprise IT departments will be devoting more time and money to disaster recovery plans, equipment, and services.
Disaster Recovery Plans Are Often Static
Until recently, the IT disaster recovery plan has been viewed as a static document: a three-ring binder on every mid-level IT manager’s shelf that does little more than provide comfort that the IT department is ready to do its part to ensure business continuity.
Creating and updating the plan is usually an annual exercise, an initiative that pulls in resources from across the staff and disrupts the “normal” IT workload. Collecting configuration data from diverse platforms and “massaging” it into meaningful information takes a tremendous number of hours, and most IT departments do not devote resources to keep the information current.
Why don’t they? There are three main reasons:
First, almost no company has enough IT staff. According to the Information Technology Association of America (ITAA), the US currently requires an IT workforce of 10 million, and more than 800,000 of those positions cannot be filled for lack of trained talent. The workload increases but hiring never keeps up.
Second, the technical competence of individual IT staff varies with training and experience. Configuration documentation may seem an “entry level” task that most professionals seek to move beyond quickly. Different IT staff members often collect different types of information, and the quality of their reports varies greatly. The more senior IT people are assigned to more critical tasks, deployed by management where they can provide the most value for their salaries, which average $85,000 per year ($75 per hour). The hours needed to assemble, verify, and report configuration settings can amount to tens of thousands of dollars in a larger IT shop.
Third, IT staff turnover ranges from 8 percent to 17 percent, depending on industry and geographic marketplace. The cost of hiring and training new staff to replace lost employees is nearly triple the IT overhead cost (about $225 per hour). And when IT staff members leave or are lost, their knowledge of the corporate IT infrastructure leaves with them.
Two negative consequences result. First, any configuration data collected in these documents – even assuming it is accurate and consistently documented across critical application systems – rapidly becomes out of date due to the one constant in the IT world: change. Second, most disaster recovery plans assume the existing IT staff will be available to carry out the restoration. Until recently, the failure of that assumption was unthinkable.
Even if the IT staff survives intact and is available to assist recovery, the multitude of IT platforms and the large number of changes that occur daily limit the staff’s ability to support restoration efforts at a backup data center. Thus, the IT disaster recovery plan needs to be continuously updated with the latest configuration settings, reported in a clear, consistent manner. Every change should be easy to identify, so that the record preserves the IT decisions behind it and backup staff can learn from them.
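One way to make changes "easily identifiable" is to compare each new configuration report with the previous one. The following is a minimal sketch in Python, assuming each report can be reduced to a flat set of name/value pairs; the setting names and values are hypothetical and do not come from any particular product.

```python
def diff_settings(old, new):
    """Compare two flat {setting_name: value} snapshots of the same system."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of one server, taken a week apart.
last_week = {"swap_mb": 2048, "telnet_enabled": True, "dns_server": "10.0.0.2"}
today = {
    "swap_mb": 4096,
    "telnet_enabled": False,
    "dns_server": "10.0.0.2",
    "ntp_server": "10.0.0.5",
}

report = diff_settings(last_week, today)
for name, (before, after) in sorted(report["changed"].items()):
    print(f"CHANGED {name}: {before} -> {after}")
for name, value in sorted(report["added"].items()):
    print(f"ADDED   {name}: {value}")
for name, value in sorted(report["removed"].items()):
    print(f"REMOVED {name}: {value}")
```

A restoration team that has never seen the system can read such a list and see what was altered since the last report without having to compare two full reports by eye.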
Are Backup Tapes Enough?
One of the most common reasons detailed configuration information is not recorded is the belief that backup tapes contain everything needed to restore systems into production.
The effectiveness of backup tapes depends upon the nature of the disaster. A system that experiences a simple power outage or hardware failure can easily be restored with backup tapes. If you have a hot backup site, you don’t even need to use tapes.
But undocumented tapes, while preserving business data, contain no configuration data, and cause delays in restoring critical applications.
Restoring such applications occurs during the functional restoration phase of disaster recovery. This phase can only be done once the infrastructure is properly reconfigured. A critical element is the most recent security settings. You need to ensure that the restored applications do not have any security holes when they are returned to production.
In general, an IT department that has the detailed configuration settings and the original operating system and application CD-ROMs can reach functional restoration up to 30 percent faster than by running backup tapes.
Throughout the multi-phase recovery process, detailed configuration documentation that contains change information allows the original IT staff and the restoration team to easily see, discuss, and alter any configuration settings that have changed since the last safe settings.
It also enables other personnel unfamiliar with that infrastructure to get the network and business applications running again.
TYPICALLY MISSING FROM BACKUP TAPES
NT / 2000 Servers
• Share permission configuration information
• Services (e.g. startup information, accounts …) configuration information
• Application and system files in use generally do not make it onto the backup tape
UNIX Servers Such As SOLARIS
• Host and network dependencies
• EEPROM settings such as specific boot instructions, SCSI ID manipulation, etc.
• Other KEY settings: initial system installation cluster, virtual memory swap space sizes, disk partition slices, space allocation considerations, etc.
• Kernel parameters and configuration settings that affect storage devices
Databases Such As ORACLE
• Storage parameters
• Schema objects: such as table dependencies and indexes.
• Security: what privileges are assigned to users and roles.
Routers And Switches Such As CISCO
• Everything: system backup tapes have no network device configuration information. Cisco routers and switches store configuration information in a file called “running-config.” This is usually (but not always) backed up by the network administrator on a TFTP server. Because that server is part of the internal IT infrastructure, it may be unavailable in a disaster situation.
Collecting And Maintaining Enterprise Configuration Data
Normally, collecting and maintaining detailed configuration documentation is accomplished in one of three ways: manually checking all the settings on each network device, using specially designed tools that provide partial data on certain products, or automating the process with software that discovers, collects, and documents all key settings.
Where the process is done manually or with tools that provide only part of the required information, automated tools make the task easier by eliminating much of the work involved and increasing the speed at which the information is collected. They also improve accuracy by removing the human error that is inevitable when sorting through hundreds of thousands of key settings. The information is presented and preserved in a consistent report format.
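To illustrate what automated collection into a consistent report format might look like, here is a minimal Python sketch that uses only the standard library. The handful of fields it gathers are stand-ins chosen because they work on any platform. A real tool would also have to gather the share permissions, kernel parameters, database settings, and so on described earlier, which requires platform-specific methods not shown here.

```python
import json
import os
import platform
import shutil
import socket


def collect_host_snapshot():
    """Gather a few host-level settings into a flat dictionary."""
    root = os.path.abspath(os.sep)  # "/" on UNIX, the current drive root on Windows
    disk = shutil.disk_usage(root)
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        "cpu_count": os.cpu_count(),
        "root_path": root,
        "root_disk_total_bytes": disk.total,
    }


snapshot = collect_host_snapshot()

# Sorted keys give every report the same layout, which keeps reports comparable.
report_text = json.dumps(snapshot, indent=2, sort_keys=True)
print(report_text)
```

Because every host produces the same fields in the same order, reports from different machines, or from the same machine on different days, can be compared line by line.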
Where the process is not being done, automated tools make it possible to accomplish the task for the first time. Some applications enable configuration setting data to be updated on a regular basis through automatic scheduling to provide the most current information available for disaster recovery.
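The regular updating described here can be sketched as a small routine that a scheduler (cron or Windows Task Scheduler, for example) runs each night. The version below is an assumption about how such a routine could work, not a description of any product: it stores a timestamped report only when the settings differ from the last stored report. The directory layout, file naming, and the `max_connections` setting are all hypothetical.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def fingerprint(settings):
    """Hash a canonical JSON form so identical settings give identical digests."""
    canonical = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def record_snapshot(archive_dir, settings, taken_at):
    """Store a timestamped report unless nothing changed since the last one."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    digest = fingerprint(settings)
    existing = sorted(archive_dir.glob("*.json"))  # names sort chronologically
    if existing:
        latest = json.loads(existing[-1].read_text())
        if latest["sha256"] == digest:
            return None  # unchanged, so no new report is written
    path = archive_dir / f"{taken_at:%Y%m%dT%H%M%SZ}.json"
    record = {"taken_at": taken_at.isoformat(), "sha256": digest, "settings": settings}
    path.write_text(json.dumps(record, indent=2))
    return path


# Three hypothetical nightly runs; only the first and third see new settings.
with tempfile.TemporaryDirectory() as tmp:
    first = record_snapshot(tmp, {"max_connections": 100},
                            datetime(2002, 1, 7, 2, 0, tzinfo=timezone.utc))
    repeat = record_snapshot(tmp, {"max_connections": 100},
                             datetime(2002, 1, 8, 2, 0, tzinfo=timezone.utc))
    changed = record_snapshot(tmp, {"max_connections": 250},
                              datetime(2002, 1, 9, 2, 0, tzinfo=timezone.utc))
    stored = sorted(p.name for p in Path(tmp).glob("*.json"))

print(stored)
```

Writing a report only on change keeps the archive small, and each stored file marks a date on which the configuration actually changed.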
Disasters fall into two general types: those that do not physically damage the IT infrastructure and those that do. Automation delivers value in either scenario.
The first (and more common) type, where the infrastructure is not physically damaged, may result from a power failure or an act of cyber-terrorism.
Companies need to restart critical applications quickly. Every minute of downtime on an ERP application (e.g. SAP R/3) can cost a corporation upwards of $7,500. Automated products can provide the configuration settings from the last report prior to the disaster. This enables system and database administrators to restore the settings to the last “safe” settings prior to importing the latest data backup.
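A short worked calculation shows what these figures imply. The $7,500-per-minute cost and the "up to 30 percent faster" recovery are the figures quoted in this article. The eight-hour baseline outage is purely an assumption chosen for illustration.

```python
COST_PER_MINUTE = 7_500   # ERP downtime cost per minute, as quoted in the article
SPEEDUP_PERCENT = 30      # best-case recovery improvement claimed in the article

baseline_minutes = 8 * 60  # hypothetical outage length: eight hours

baseline_cost = baseline_minutes * COST_PER_MINUTE
minutes_saved = baseline_minutes * SPEEDUP_PERCENT // 100
savings = minutes_saved * COST_PER_MINUTE

print(f"Baseline outage cost: ${baseline_cost:,}")
print(f"Minutes saved:        {minutes_saved}")
print(f"Best-case savings:    ${savings:,}")
```

Under these assumptions the outage costs $3,600,000, and recovering 30 percent faster saves 144 minutes, or $1,080,000. Because "up to 30 percent" is an upper bound, actual savings could be lower.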
In the latter (and less frequent) case, where the infrastructure is physically damaged or destroyed, automation provides the configuration settings from the last report generated prior to the disaster. This information can be sent rapidly to third-party backup facility providers after the event, or sent to them routinely each time a report is generated. Either way, the backup site’s system and database administrators can restore the last “safe” settings before importing the latest data backup.
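Picking "the last report generated prior to the disaster" out of a history of reports can be sketched as follows. This is an illustrative Python example, assuming the history is kept as a list of (timestamp, settings) pairs sorted by time; the dates and the `listener_port` setting are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def last_safe_snapshot(snapshots, disaster_time):
    """Return the latest (taken_at, settings) pair taken strictly before
    disaster_time, or None if no such report exists.

    `snapshots` must be sorted by taken_at in ascending order.
    """
    times = [taken_at for taken_at, _ in snapshots]
    index = bisect_left(times, disaster_time)
    if index == 0:
        return None
    return snapshots[index - 1]


# Hypothetical weekly reports for one database server.
history = [
    (datetime(2002, 3, 1, 2, 0), {"listener_port": 1521}),
    (datetime(2002, 3, 8, 2, 0), {"listener_port": 1522}),
    (datetime(2002, 3, 15, 2, 0), {"listener_port": 1523}),
]
outage = datetime(2002, 3, 12, 14, 30)

taken_at, settings = last_safe_snapshot(history, outage)
print(f"Restore settings from report dated {taken_at:%Y-%m-%d}: {settings}")
```

The report dated after the outage is ignored, so a report generated while systems were already failing never becomes the restoration baseline.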
Unlike an insurance policy where you need to have a “disaster” to realize the benefits, detailed configuration information and documentation can be used on a daily basis to improve the operations of the IT infrastructure.
Compliance reporting is a subset of a larger IT management requirement that is driven by individual industry requirements for security – both of the data being managed and of the IT Infrastructure itself. A critical component for being in compliance with these industry-specific mandates is possessing current and historical documentation that provides detailed configuration settings of the IT Infrastructure.
For example, the healthcare industry is working toward compliance with the Health Insurance Portability and Accountability Act of 1996 – known as HIPAA. Within HIPAA is the requirement for the security and confidentiality protection of electronic health information. Automated products contribute toward compliance with the security requirements of HIPAA by providing current (and historical) detailed configuration reports to support auditing, security, and disaster recovery.
The financial industry has similar reporting requirements, including those under the Gramm-Leach-Bliley Act as mandated by the Federal Reserve System, which call for recording detailed configuration settings for security and backup. Firms that are ISO 9000 compliant, or working toward that certification, also require extensive documentation of IT processes and policies.
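Detailed configuration reports can support this kind of audit by being checked against a written policy. The Python sketch below is a hypothetical illustration only: the policy entries are invented for the example and are not taken from HIPAA, Gramm-Leach-Bliley, or ISO 9000.

```python
def audit(settings, policy):
    """Return (name, required, actual) for every setting that violates policy."""
    findings = []
    for name, required in sorted(policy.items()):
        actual = settings.get(name, "<not recorded>")
        if actual != required:
            findings.append((name, required, actual))
    return findings


# Invented policy and report, for illustration only.
policy = {"telnet_enabled": False, "password_min_length": 8, "audit_logging": True}
settings = {"telnet_enabled": True, "password_min_length": 8}

findings = audit(settings, policy)
for name, required, actual in findings:
    print(f"{name}: policy requires {required!r}, report shows {actual!r}")
```

A setting that is missing from the report is flagged as well, since an auditor cannot confirm compliance for something that was never documented.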
In conclusion, managing configuration settings can reduce IT recovery time by as much as 30 percent. Collecting such information is an arduous task that few companies ever accomplish due to insufficient resources. There are products that can automate this process to provide and store constant updates. Collecting this information not only reduces downtime following a disaster, but also gives IT staff the data necessary for internal security, compliance, and optimum network efficiency.
Alex Bakman (email@example.com) is founder and CEO of Ecora Software, maker of IT infrastructure management tools for auditing, security, and disaster recovery.