WHAT ARE THE RISKS?
Unexpected electronic failures that cause communication blockages are the most frequent cause of IT disruption in the workplace. The most common declared IT disasters in the U.S. are caused by power surges and outages, followed by storms and floods. The recent electricity shortage in California and the power outages and blackouts caused by sweltering temperatures are just a few examples of typical situations that cause unanticipated network disruptions.
In addition to communication blockages, data loss or corruption is also quite common and is typically caused by viruses, software or hardware glitches, or poorly designed backup procedures. If business requirements mandate hourly backups but backups are performed only nightly or weekly, accurate data recovery is difficult at best. For e-businesses that process millions of transactions per hour, data that is out of date by just 30 minutes can be catastrophic!
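To make the gap between required and actual backup frequency concrete, the short sketch below estimates how many transactions could fall outside the most recent backup. The function name and the rate of 2 million transactions per hour are illustrative assumptions, not figures from any particular business.

```python
def worst_case_transactions_at_risk(transactions_per_hour, backup_interval_hours):
    """Transactions recorded since the last backup, assuming the failure
    happens just before the next backup would have run."""
    return transactions_per_hour * backup_interval_hours


# Hypothetical transaction rate, used only for illustration.
RATE_PER_HOUR = 2_000_000

for label, interval_hours in [("30 minutes", 0.5), ("hourly", 1), ("nightly", 24), ("weekly", 168)]:
    at_risk = worst_case_transactions_at_risk(RATE_PER_HOUR, interval_hours)
    print(f"{label}: up to {at_risk:,.0f} transactions at risk")
```

The comparison suggests why a backup schedule that lags the stated business requirement matters: the exposure grows in direct proportion to the interval between backups.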
Hardware and component failures can be just as hazardous. Unavailable replacement parts or slow response time from service providers can keep systems down, causing unnecessary grief for customers, needless frustration for employees, and sanctions against the business. For example, if electronic stock trading or financial-based web sites are inaccessible due to simple component failures, firms face the possibility of fines and potential legal action.
EXPOSURES AND LIABILITIES
With e-businesses feverishly working to establish a strong reputation and brand recognition, blocked web site access can be extremely damaging. Customers and investors alike view problematic web-based transactions negatively, thus reducing the credibility of the affected firm and causing a possible loss in revenue, stock value, or necessary funding.
Financial losses due to the lack of adequate system and data backup strategies can be staggering. For example, the average cost of downtime in the retail brokerage industry is approximately $6.45 million per hour (Source: Contingency Research Planning, a Division of Eagle Rock Alliance Corporation, West Orange, NJ). Costs incurred by e-businesses are just as troubling. Table 'A' provides a snapshot of the lost revenue per hour based solely on daily e-commerce revenues. These figures do not take into account the hidden costs of reduced customer satisfaction or the damage to the corporation's brand name and reputation. When adding up the hard and soft costs, it is easy to see the value of implementing business recovery processes.
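As a rough illustration of how lost revenue per hour can be derived from daily e-commerce revenue alone, consider the sketch below. It assumes revenue is spread evenly across a 24-hour day, which is a simplification. The daily revenue values are hypothetical and are not taken from Table 'A', and the function names are invented for this example.

```python
HOURS_PER_DAY = 24


def lost_revenue_per_hour(daily_revenue):
    """Hard revenue lost for each hour of downtime, assuming sales are
    spread evenly over the day (a simplifying assumption)."""
    return daily_revenue / HOURS_PER_DAY


def outage_cost(daily_revenue, outage_hours):
    """Hard revenue lost over an outage; soft costs are not included."""
    return lost_revenue_per_hour(daily_revenue) * outage_hours


# Hypothetical daily e-commerce revenues, for illustration only.
for daily in (10_000, 100_000, 1_000_000):
    hourly = lost_revenue_per_hour(daily)
    print(f"daily ${daily:,} -> about ${hourly:,.0f} lost per hour of downtime")
```

A calculation like this captures only the hard cost. The soft costs mentioned above, such as reduced customer satisfaction and brand damage, would have to be estimated separately.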
Not only should e-businesses be concerned with immediate data accessibility, but they must also adhere to legal and regulatory requirements related to the types of information that must be kept. For example, the IRS requires that certain financial documents be retained for a specific time period. If an organization is audited and the legally mandated information cannot be produced, substantial penalties can be imposed. Other business documents such as human resources data required by regulatory agencies, contracts and purchase orders, patents, trademarks, and technical specifications can have critical legal and business implications if lost.
The financial services and pharmaceutical industries also have stringent record retention requirements and are required by law to incorporate business continuity plans.
Restricted web access, legal liabilities, or regulatory non-compliance can cripple even the most profitable e-firm. Business recovery and continuity processes help ensure that critical data, and access to it, remain available during an IT disaster.
LAYERS OF IT BUSINESS CONTINUITY
IT business continuity and recovery processes affect virtually all layers of the enterprise-wide information infrastructure. Data, applications, systems and networks, and the data center physical site each require different backup and recovery strategies to ensure that all functions are appropriately protected. Consider the typical IT infrastructure and some of the recovery processes that might be used for protecting each respective area:
DATA: The innermost layer of any network is the actual data that is captured. Because data is the backbone of today's organizations, immediate recovery of data during IT outages is key to survival. The recovery methods of choice for protecting data include automated backups, off-site media storage, data mirroring, and, increasingly, remote data replication. Copies of critical data should be backed up to tape on a routine basis and stored at temperature-controlled off-site storage facilities for safekeeping and quick retrieval, if needed.
Data mirroring allows organizations to create a duplicate logical copy of mission-critical data as an emergency on-line backup; if the primary data is lost or corrupted, the 'mirrored' data is instantly used in its place. RAID level 1 is commonly used for this purpose.
By duplicating data simultaneously to secondary systems, continuous access is assured should a primary system failure occur. This data replication provides another way to safeguard and recover the data by creating multiple copies on either local or remote systems. This eliminates the wait time required for loading and restoring backup tapes after a disaster, as the replicated data can be substituted quickly.
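The mirroring and replication idea described above can be sketched in a few lines. This is a minimal in-memory illustration, not a real storage product: the class `MirroredStore` and its methods are hypothetical names, and plain dictionaries stand in for the primary and secondary systems.

```python
class MirroredStore:
    """Toy model of synchronous mirroring: every write goes to both
    copies, and reads fall back to the mirror if the primary is lost."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.primary_available = True

    def write(self, key, value):
        # Duplicate the data simultaneously to the secondary copy.
        if self.primary_available:
            self.primary[key] = value
        self.mirror[key] = value

    def read(self, key):
        # Use the primary while it is healthy; otherwise substitute the mirror.
        if self.primary_available and key in self.primary:
            return self.primary[key]
        return self.mirror[key]

    def fail_primary(self):
        # Simulate loss or corruption of the primary copy.
        self.primary_available = False
        self.primary.clear()


store = MirroredStore()
store.write("order-1001", "paid")
store.fail_primary()
print(store.read("order-1001"))  # still readable, served from the mirror
```

Because the mirrored copy is already current, nothing has to be loaded from tape before the data can be read again, which is the advantage the text attributes to replication.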
APPLICATIONS: In order to retrieve data after a system fails, databases and other applications must be backed up and archive logs need to be maintained. This will ensure that data can be accessed if the software on the main network becomes corrupted. Tape or optical libraries coupled with backup software that supports on-line and standby database and application backups are viable solutions for the protection of critical processing functions.
SYSTEMS AND NETWORKS: A cost-effective option for ensuring system availability is clustering. A 'cluster' is a group of interrelated servers working together to perform various jobs. Within a clustered environment, servers are designated primary tasks to handle (such as running e-mail or web-based transactions), with secondary servers set up within the cluster to assume the processing of these tasks if the primary server fails. This automatic failover capability provides uptime of about 99.9%.
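The primary and secondary arrangement in a cluster might be modeled along the following lines. This is a simplified sketch under stated assumptions: the `Server` and `Cluster` classes and the server names are hypothetical, and real cluster software detects failures through health checks or heartbeats rather than a flag set by hand.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{task} handled by {self.name}"


class Cluster:
    """Maps each task to an ordered list: the primary first, then the
    secondary servers that take over if the primary fails."""

    def __init__(self):
        self.assignments = {}

    def assign(self, task, primary, *secondaries):
        self.assignments[task] = [primary, *secondaries]

    def run(self, task):
        # Automatic failover: use the first healthy server in the list.
        for server in self.assignments[task]:
            if server.healthy:
                return server.handle(task)
        raise RuntimeError(f"no healthy server available for {task}")


mail_primary = Server("mail-1")
mail_standby = Server("standby-1")
cluster = Cluster()
cluster.assign("e-mail", mail_primary, mail_standby)

print(cluster.run("e-mail"))   # served by the primary
mail_primary.healthy = False   # simulate a primary failure
print(cluster.run("e-mail"))   # the secondary assumes the task
```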
A more expensive system availability concept involves the utilization of fully redundant components and access paths. With this implementation, duplicate devices work concurrently, ensuring uninterrupted operations because if one fails, the duplicate component takes over immediately. These fault-tolerant systems offer 100% availability that can help ensure immediate accessibility to critical data and guarantee 24x7 operation.
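Availability percentages are easier to compare when converted to expected downtime. The sketch below does that conversion for the roughly 99.9% figure quoted for clustering and for the 100% figure quoted for fault-tolerant systems. It assumes a 365-day year, and the function name is invented for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def annual_downtime_hours(availability_percent):
    """Expected hours of downtime per year at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for availability in (99.9, 100.0):
    hours = annual_downtime_hours(availability)
    print(f"{availability}% availability -> about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.9% uptime still allows close to nine hours of downtime a year, which helps explain why some organizations pay the premium for fully redundant components.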
A popular storage subsystem disaster recovery option is known as hot swapping. With this methodology, if a component fails within the storage device, it can be replaced on-the-fly (i.e., without having to bring the device down). For example, some storage devices contain multiple power supplies, cooling fans, and drives, allowing these components to be changed while the device is in operation.
SITE OR DATA CENTER: Not only do the individual layers within the infrastructure need to be protected, but the physical data center itself must be safeguarded. Hot sites might be used in situations where extended periods of downtime cannot be tolerated, such as with on-line transaction processing web sites, hospitals, or air traffic control centers. Hot sites are pre-configured data centers that are either maintained at a separate location or contracted through a supplier. These sites may contain an exact replica of the hardware, software, and communication devices employed at the main site, or they may provide a similar but smaller configuration capable of processing only selected crucial business information.
Hot sites permit mission-critical operations to continue uninterrupted by switching to the backup center if an IT disaster strikes.
Alternatively, mobile data centers and cold sites (pre-configured locations in standby status ready for backup equipment to be installed) are viable strategies for organizations that can sustain a few days of downtime before recovery.
Identifying the appropriate business contingency strategies as well as the necessary hardware and software solutions is critical to implementing a successful IT recovery process.
There is a vast array of data storage technologies available to support access to, protection of, and recovery of data. Two of the latest techniques are briefly described below:
SAN (Storage Area Network): A SAN is a storage network architecture that provides an interface between multiple servers and mixed storage devices. The SAN acts as a type of storage repository, moving the storage devices onto a separate network connected to the main network, usually by way of a switched Fibre Channel fabric. Fibre Channel connectivity makes it possible to increase the distance between the main network and remote storage devices beyond that possible via SCSI connections. This is extremely beneficial for organizations requiring enterprise-wide backup over multi-building sites and campuses. For businesses implementing disaster recovery programs, Fibre Channel connectivity means that the backup storage units can be located far from the data center and, therefore, would not be impacted by disasters such as fires, floods, or explosions affecting the main location.
For desktop or laptop users, new automated data backup solutions are also available. This can be ideal for small organizations that do not have the IT resources to either assist employees with backups or enforce storing data on the network. What better way to protect critical data maintained by mobile employees than by initiating backups automatically and transparently?
The lack of a business continuity plan can greatly increase the risk of lost data and system and network downtime. With all e-businesses fully dependent on their IT operations, business continuity is no longer just an IT department concern; it is an executive management issue that must be addressed with the highest priority. Information is perhaps the most valuable asset of a modern company. With e-enterprises processing mounds of critical information on a minute-by-minute basis, it is essential that data and the access to it be 100% protected.
Belinda Wilson, CBCP, is the North America Program Manager and Global Service Manager for Business Continuity consulting at Hewlett-Packard. Ms. Wilson has over ten years of expertise in the area of business continuity, recovery, and high-availability, having assisted a number of HP's clients with successful programs. Ms. Wilson is a Certified Business Continuity Professional, has served on the Certification Board of the Disaster Recovery Institute, and is also an instructor for the Disaster Recovery Institute.