Ensuring System Accessibility with High Availability Technology
A UK-based study prepared by the University of Warwick in 2007 found that hourly downtime costs were as high as £350,000 (almost $682,000 at time of writing) in the retail industry and £100,000 (almost $195,000) in the finance sector. These are likely very conservative estimates, as they are considerably lower than often-cited statistics from a Meta Group report prepared in 2000.
These losses comprise both one-time and long-term components. The one-time costs include transaction losses, idle labor and other resources, restoration costs, penalties, and litigation. In extreme cases, such as healthcare treatment systems, downtime may delay service and even result in loss of life. Long-term costs are associated with lost opportunities, lost customers, damage to reputation, and, for public companies, share price declines.
Small businesses are no less susceptible to these costs. In spite of their lower absolute downtime costs, small businesses usually have less capacity to sustain losses; therefore their downtime costs can be at least equally unaffordable in a global marketplace that expects 24x7 availability and a supply chain that is hardened end-to-end.
In light of these costs, today’s businesses need a way to avoid downtime. High availability (HA) technology offers a solution. It protects not only data but also the applications that use it. Downtime cost avoidance isn’t the only reason to employ HA technologies. The following three factors further heighten the need for HA:
1. Consolidation: To reduce administration, security, maintenance, software licensing, and energy costs, many companies are consolidating formerly distributed servers onto one or a few larger systems. Yet, with many more eggs in far fewer baskets, the importance of each of those baskets grows significantly. The unavailability of one consolidated server may shut down all business operations, whereas the loss of a previously distributed server might have affected only a single business function or department. Thus, availability becomes much more important in a consolidated environment.
2. Regulation: A number of laws and regulations, such as Sarbanes-Oxley, HIPAA, and Gramm-Leach-Bliley, demand that organizations protect the availability of data, making HA and disaster recovery capabilities not just “nice to haves,” but also legal requirements for many companies. Furthermore, within certain industry segments, financial auditors have included HA in their checklists.
3. The shrinking global village: Thanks to the Internet and efficient global transportation, even small firms now source materials from suppliers and sell to customers around the world. Consequently, systems must be available to interact with people and other systems in all time zones. As a result, there is no longer a suitable backup or maintenance downtime window. Data and systems must be available around the clock, without exception.
The need for higher availability will continue to expand. The demand for 24x7 system access is forecast to grow to the point where, within the next five years, the negative impacts of downtime will be unacceptable for most companies. Yet downtime of individual hardware, software, and databases is inevitable. Attention-grabbing events such as natural disasters and terrorism are always a threat but are extremely rare. In contrast, other causes of downtime, such as data backups, database maintenance, hardware/software upgrades, and simple operator error, occur exceptionally frequently, typically daily. In fact, Gartner, Inc., estimates that more than 80 percent of downtime is planned.
Despite the greater incidence of planned downtime, most companies focus instead on the unplanned variety. Nearly 80 percent of companies use a disaster recovery strategy that consists solely of performing regular saves to tapes and then storing those tapes offsite. A serious drawback of this strategy is that if a disaster requires the reloading of all data and applications, tape turnaround times and operator availability issues lead to typical recovery times of 48 hours or longer.
Options to reduce recovery time include using disk protection (such as RAID and disk mirroring) to prevent some types of data loss, storing backup data on disk rather than tape at the recovery site, and installing HA technologies. Of these, HA provides the most complete way to mitigate planned and unplanned downtime impacts.
What Is High Availability?
HA technology minimizes outages by maintaining ready-to-run replicas of data and applications on a secondary computer. The HA software can then switch users to the backup should the primary computer become unavailable.
Every HA solution has four primary components: system-to-system communications, data replication processes, system monitoring functions, and role swapping capabilities. (After a role swap, the former backup system assumes the production role.)
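To make the interplay of these four components concrete, the following is a minimal sketch in Python, not a description of any vendor's product. The names Node and HAPair, the in-memory "link," and the three-missed-heartbeats threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One computer in a hypothetical two-node HA pair."""
    name: str
    role: str                      # "production" or "backup"
    healthy: bool = True
    data: dict = field(default_factory=dict)


class HAPair:
    """Toy model of communications, replication, monitoring, and role swapping."""

    def __init__(self, primary: Node, backup: Node, max_missed: int = 3):
        self.primary = primary
        self.backup = backup
        self.max_missed = max_missed   # assumed threshold before a role swap
        self.missed = 0

    def write(self, key, value):
        """Data replication: apply a change on production, then copy it over."""
        self.primary.data[key] = value
        self._send(key, value)

    def _send(self, key, value):
        """Stand-in for the system-to-system communications link."""
        if self.backup.healthy:        # a failed node would need a resync later
            self.backup.data[key] = value

    def check_heartbeat(self):
        """System monitoring: count consecutive missed heartbeats."""
        if self.primary.healthy:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed:
            self.role_swap()

    def role_swap(self):
        """Role swapping: the former backup assumes the production role."""
        self.primary.role, self.backup.role = "backup", "production"
        self.primary, self.backup = self.backup, self.primary
        self.missed = 0


pair = HAPair(Node("sysA", "production"), Node("sysB", "backup"))
pair.write("order-1", "shipped")
pair.primary.healthy = False           # simulate an outage on sysA
for _ in range(3):
    pair.check_heartbeat()
print(pair.primary.name, pair.primary.role, pair.primary.data)
```

In this sketch, users would be directed to whichever node currently holds the production role. A real product also has to handle resynchronizing the failed system before it can serve as a backup again, which the sketch deliberately leaves out.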
Data Protection vs. Application Availability
Every company needs data protection. However, the need for always-available applications depends partly on the nature of the business. Consider the following examples: A large retailer that sells exclusively on the Web to a global market may lose millions of dollars of revenue and engender untold customer ill-will if its systems are down for even a brief period. And companies that sell software as a service often sign service level agreements that subject them to sizeable penalties if their systems are unavailable for more than a specified, small percentage of the time.
Consider also a manufacturer with high-value work in progress that might be ruined by a system crash. A pharmaceutical company’s automated production line offers a good example of this. In the event of a manufacturing system data failure, everything from the first to the last touch point in the manufacturing process must be discarded to comply with medical safety regulations. In such a case, the high price of chemicals, along with the disposal and replacement requirements, represents a monumental cost. All of the above types of organizations, among many others, demand very high application availability.
In contrast, consider a manufacturer that produces high value products such as large, specialized machines in small numbers using little automation. If the company’s systems become unavailable, production can still continue. In addition, because of the huge sticker price for each machine, orders arrive infrequently and can be taken manually if necessary. For this sort of company, application availability is generally not a critical issue.
In addition to varying between different types of businesses, the need for application availability differs among applications within a single organization. For example, an Internet retailer may not be willing to accept more than a few minutes of Web store downtime, but even during normal office hours it may be willing to accept an occasional hour or two of unplanned downtime for its human resources application. What’s more, planned downtime may be tolerable throughout the night or on weekends.
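Tolerances like these, and the availability percentages written into service level agreements, are easier to compare once they are converted into a downtime budget. The short Python sketch below shows the arithmetic. The function name and the sample percentages are illustrative assumptions, not figures from any particular agreement.

```python
def allowed_downtime_minutes(availability_pct, period_hours=24 * 365):
    """Minutes of downtime permitted in a period at a given availability level."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * 60 * (1 - availability_pct / 100)


# Hypothetical targets for two applications in the same company.
for app, target in [("Web store", 99.99), ("Human resources", 99.0)]:
    minutes = allowed_downtime_minutes(target)
    print(f"{app}: {target}% allows about {minutes:.0f} minutes per year")
```

Over a 365-day year, a 99.9 percent target leaves a budget of about 526 minutes, or just under nine hours. Passing a shorter period, such as 24 * 30 hours, gives the monthly budget instead.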
Because each business’s environment and requirements are unique, every organization must decide for itself whether any application downtime is acceptable and, if so, how much. To do so, analyze all possible risks and assess what outages the business can tolerate. Then evaluate the consequences of each downtime scenario to determine whether the need is data protection, application availability, or both.
Remember to consider both peak and slow times. For instance, a small trucking company may not suffer greatly during an hour-long system outage unless it occurs during the final sort and dispatch to prepare the trucks for departure.
Be rigorous. Do the risk assessment and determine both business requirements and any possible liability issues for customers. Design a disaster recovery solution around those needs. Maintain a high-level business viewpoint to ensure that all departmental needs are covered. When designing and purchasing HA solutions, however, it’s wise to limit the number of vendors that participate, because compatibility issues between products can introduce new risks.
Flawlessly Flawed Replication
A point that is often overlooked when considering availability is that HA technology is very good at maintaining perfect, hot standby replica systems, even when one would prefer it did not. If data on the primary system is corrupted by human error, malfeasance, or a technology failure in such a way that the data is still perfectly usable but wrong, an HA replicator likely won’t recognize the problem. Instead, it will blindly replicate the corrupted data to the back-up server.
When journaling is turned on, which is of course mandatory for journal-based HA replication, the problem is not quite as serious because database transactions are logged and could be ultimately audited and backed out. However, a system operator might find this to be a time-consuming task that is far from foolproof. It’s a task that will have to be performed on both the primary and back-up servers.
Thankfully, there is a better approach. Some newer HA technologies include continuous data protection (CDP) functionality that, in effect, stores all delta changes to business and system data and then allows operators to recover data from any point in time with just a few keystrokes.
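The idea behind point-in-time recovery can be sketched in a few lines of Python. This is a toy model that assumes every change is logged with a timestamp and replayed on demand. The ChangeLog class and its method names are hypothetical, and real CDP products store and index changes far more efficiently.

```python
from bisect import bisect_right


class ChangeLog:
    """Toy continuous-data-protection store: keeps every change with a timestamp."""

    def __init__(self):
        self._times = []     # timestamps, in ascending order
        self._changes = []   # (key, value) pairs; a value of None means "deleted"

    def record(self, timestamp, key, value):
        """Append one delta change to the log."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("changes must be recorded in time order")
        self._times.append(timestamp)
        self._changes.append((key, value))

    def state_at(self, timestamp):
        """Rebuild the data as it stood at `timestamp` by replaying the deltas."""
        upto = bisect_right(self._times, timestamp)
        state = {}
        for key, value in self._changes[:upto]:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        return state


log = ChangeLog()
log.record(100, "price", 19.99)
log.record(200, "price", 0.0)      # an operator error, faithfully replicated
print(log.state_at(150))           # the data as it stood before the error
```

Because the log keeps every delta rather than only the latest state, the operator can choose a recovery point just before the corruption occurred instead of restoring the most recent, and equally corrupted, replica.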
Assessing Return on Investment
An HA investment decision should be based on expected ROI. In addition, the decision process should involve top management because an IT-centric view can lead to a larger than necessary investment that may still leave some important areas unprotected. For example, one company purchased an expensive HA system, but during a power outage it could not access the data center. Why? Because the low-cost physical security system (owned by the facilities department) was not backed up and the key cards did not work after the power came back on.
HA ROI is primarily derived from three areas: risk avoidance, maintenance accommodation, and improved hardware utilization.
1. Risk avoidance. An investment in technology that minimizes exposure to risk can deliver a substantial return. For example, one bank that was recovering from an outage sent a tape containing massive amounts of private customer information through the mail. The tape fell into the wrong hands, the incident was publicized, and the bank lost customers and share value. An investment in HA would have allowed this bank to recover from its outage instantly, with no need to send customer data through the mail. Just that single incident would have delivered a tremendous return on an HA investment.
When evaluating HA investments, consider also the frequency of various downtime events. For example, a major disaster such as an earthquake, fire, flood, or other natural calamity may occur only once every year or two — and rarely even that often. An equipment failure may occur only twice a year. However, someone may accidentally erase or corrupt a critical file as frequently as once a week. The prevention of losses from all of these incidents contributes to the ROI for HA technology.
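One way to weigh these frequencies is to multiply each by an estimated cost per incident and add up the results. The Python sketch below uses the frequencies mentioned above, taking the major disaster at once every two years. Every cost figure, the 80 percent avoidance rate, and the HA cost are invented for illustration and should be replaced with an organization's own estimates.

```python
# name: (occurrences per year, assumed cost per occurrence in dollars)
EVENTS = {
    "major disaster": (0.5, 500_000),           # once every two years
    "equipment failure": (2, 40_000),           # twice a year
    "erased or corrupted file": (52, 2_000),    # once a week
}


def expected_annual_loss(events):
    """Expected yearly loss per event type: frequency times cost."""
    return {name: freq * cost for name, (freq, cost) in events.items()}


def simple_roi(avoided_loss, annual_cost):
    """Return on investment as a ratio: net benefit divided by cost."""
    return (avoided_loss - annual_cost) / annual_cost


losses = expected_annual_loss(EVENTS)
total = sum(losses.values())
for name, loss in losses.items():
    print(f"{name}: ${loss:,.0f} per year")
print(f"total expected loss: ${total:,.0f} per year")

# Assume HA prevents 80 percent of these losses and costs $150,000 a year.
print(f"ROI: {simple_roi(0.8 * total, 150_000):.0%}")
```

With these made-up numbers, the weekly file mishap costs more per year than equipment failures, even though each incident is cheap. That is the reason frequent, minor events deserve a place in the calculation alongside rare disasters.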
2. Maintenance accommodation. HA is often thought of as a way to avoid unplanned business downtime, but that is insurance against an event that may or may not occur. System and data maintenance, on the other hand, is guaranteed to happen. By providing a replica system that users can employ when the primary system is undergoing maintenance, HA delivers an assured return. Furthermore, when it is time to migrate to new hardware, some HA products can shepherd the migration while the customer’s business remains active.
3. Hardware utilization. Some hardware-based replication technologies need to lock the data on the backup system, meaning users can’t access data on that system while it is serving as the backup. Other HA technologies, such as journal-based replication or continuous data protection (CDP), do allow the use of the secondary computer to run reports, tape backups, or other jobs. As a result, the replica server can take over some of the processing load, possibly deferring the need for costly server upgrades and thereby improving the ROI of both the HA software and the secondary hardware investment.
The ability to maximize ROI will be affected by one other factor: the support that the HA technology vendor provides. When a major hurricane bears down on a company’s facilities, there probably won’t be time to log into the vendor’s Web site and learn what one needs to do to switch operations to the back-up location. Instead, one needs an HA vendor that, above and beyond supplying the software and hardware, also provides world-class service to ensure that the role swap is executed properly and that the business survives the outage event.
Availability for All
Businesses contemplating the systems to protect data and provide continuous system accessibility should understand that the products in the HA market are mature technologies. They are affordable for a broad spectrum of needs and budgets, and their implementation does not depend on having a large, sophisticated IT operation. Provided their acquisition of HA is supported by a solid business case, these technologies are not just expenses. Rather, they represent investments that reduce the high risks and the exorbitant one-time and long-term costs of downtime, thereby providing a considerable ROI.
Henry Martinez is senior vice president of engineering with Vision Solutions, Inc., a vendor offering high availability and disaster recovery solutions for AIX, Windows and IBM i environments. Martinez has a broad spectrum of experience in the application of Information Technologies including traditional IT, consumer electronics, the utility industry, aerospace, and industrial automation. He has managed global, virtual R&D labs, launched high-technology start-ups, established domestic and international manufacturing operations, and negotiated and implemented technology alliances/transfers.
"Appeared in DRJ's Winter 2009 Issue"