What Are the Threats to Availability?
If high availability is your goal, what downtime threats do you need to protect against? There are many, but we can consolidate them under six categories:
- Data Corruption: The most prevalent threat to service and data availability. A file gets deleted or corrupted, and services need to be taken offline to correct the problem. The most common defense is backup or frequent snapshots.
- Component Failure: Hardware malfunctions are common – a network card, an array, a disk drive. Failures are becoming less frequent as vendors build more resiliency into their products, but hardware failure remains a significant threat to availability.
- Application Failure: Downtime caused by unavailable applications is common, and it inevitably leads to lost productivity and revenue.
- Human Error: When processes remain dependent on human intervention, this interaction can introduce a multitude of errors that quickly lead to lost availability.
These four threats are all unplanned, and they force IT departments into reactive mode. The fifth threat to availability is a proactive measure that is viewed as a necessary evil – planned downtime.
- Maintenance: This is by far the greatest contributor to downtime in any environment. According to a leading analyst firm, 80 percent of downtime is planned – server upgrades, application upgrades, OS upgrades, and other site maintenance processes.
And even if you’ve done a good job protecting against these five threats, there’s still a sixth threat to consider.
- Site Outage: Any high availability solution must include protection against the total loss of your site in the event of a disaster such as a fire or a flood.
Software Technologies vs. Availability Threats
Let’s review the tools that are available for defending our environments against these threats. The most basic availability level is backup to disk or tape – the industry’s fundamental safety net for IT infrastructures. Backup reduces the amount of data loss from data corruption to about 24 hours, depending on how frequently backups are taken.
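The arithmetic behind that 24-hour figure is simple: everything written since the last successful backup is lost when corruption strikes. A minimal sketch, with hypothetical timestamps:

```python
from datetime import datetime, timedelta

def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Worst-case data loss: everything written after the last backup is gone."""
    return failure - last_backup

# Nightly backup at 02:00; corruption strikes at 23:30 the same day.
loss = data_loss_window(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 23, 30))
print(loss)  # 21:30:00
```

Shortening the backup interval shrinks this window proportionally, which is why shops that can't tolerate a day of loss move to more frequent snapshots or mirroring.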
If your company can’t afford to lose data for that long, the next level of protection against data corruption is local mirroring – creating a constantly updated copy of data on disk to provide real-time availability within the data center.
When the availability of your data has been established, the next concern is server availability and the protection of business applications. Local clustering technology lets you group several servers into a single resource. Failure on any server results in a failover to another server in the cluster, and availability is protected, reducing downtime to minutes and, in some cases, seconds.
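The failover mechanics described above can be sketched in a few lines: a monitor watches heartbeats from the active server and promotes a standby when one is missed. This is an illustrative toy, not any vendor's product; the node names and timeout are made up.

```python
import time

class Cluster:
    """Toy cluster monitor: the first node in the list is active; if its
    heartbeat goes silent past the timeout, the next standby is promoted."""

    def __init__(self, nodes, heartbeat_timeout=5.0):
        self.nodes = list(nodes)
        self.timeout = heartbeat_timeout
        self.last_beat = {n: time.monotonic() for n in self.nodes}

    def heartbeat(self, node):
        """Record that a node checked in."""
        self.last_beat[node] = time.monotonic()

    def active(self):
        return self.nodes[0]

    def check_failover(self, now=None):
        """Promote a standby if the active node has missed its window."""
        now = time.monotonic() if now is None else now
        if now - self.last_beat[self.active()] > self.timeout:
            failed = self.nodes.pop(0)
            self.nodes.append(failed)  # demote the failed node to standby
            return f"failover: {failed} -> {self.active()}"
        return None
```

A real cluster would also fence the failed node and restart the application on the new active server; this sketch shows only the promotion logic that makes seconds-to-minutes recovery possible.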
Backups, mirroring, and local clustering can protect you against local threats to availability – the first five threats that we described in the previous section. But what if the unthinkable happens – a disaster that knocks out your entire site? You can protect your total environment by establishing availability of data and resources at a remote site. You have two tools at your disposal to accomplish this: replication and clustering.
Replication enables you to create a copy of your data online, in real time, to disk storage at another location. Clustering goes a step further; it combines replication of the application with the data. This means that if you have a complete outage at your primary site, a single button-click will restore service at your back-up environment. That is the highest level of availability you can achieve.
The combination of these technologies not only provides 24x7 availability but offers significant cost savings and impressive return on investment (ROI) for IT departments. With that in mind, let’s examine our four myths more closely:
Myth No. 1:
High Availability Costs Too Much
The popular view is that to achieve high availability you must double your complement of hardware – duplicate server capacity for each application, duplicate systems, and duplicate sites, all running on the same server type and the same level of the operating system. Most shops that use clustering software run an active/passive environment, with servers running idle against the possibility of a failure.
Fortunately, these assumptions – that you have to live with poor hardware utilization and OS constraints to get high availability – are not true. You can add clustering and high availability to your environment without buying more hardware. Software solutions are available that enable you to use existing resources to create a high-availability environment across vendor platforms without OS restrictions. You can manage multiple different servers using the same clustering solution across Sun Solaris, HP-UX, IBM AIX, Windows, or Linux at a variety of OS version levels to create a single high availability solution with high server utilization rates.
Let’s look at a typical high availability environment with paired servers and a variety of operating systems. Some server pairs are active/passive; this offers the highest levels of availability but is the most expensive approach because one server is always totally idle, tapping its fingers while it waits for a failure. Some server pairs are active/active, which is much more cost-effective. However, it may provide much lower availability than the active/passive approach, because in the event of a failure the server that is still functioning has to do the work of two servers, and its performance will drop off as a result, thereby reducing the overall availability of the solution.
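The trade-off is easy to quantify. With hypothetical numbers for a two-node pair, each server sized for one workload:

```python
def steady_state_utilization(active_nodes: int, total_nodes: int) -> float:
    """Fraction of purchased server capacity doing useful work normally."""
    return active_nodes / total_nodes

def capacity_per_app_after_failure(workloads: int, surviving_nodes: int) -> float:
    """Share of one server's capacity each workload gets after a failure."""
    return surviving_nodes / workloads

# Active/passive: one of two servers idles in steady state.
print(steady_state_utilization(1, 2))        # 0.5
# Active/active: both servers work in steady state...
print(steady_state_utilization(2, 2))        # 1.0
# ...but after a failure one node carries both workloads,
# so each application runs at half its normal capacity.
print(capacity_per_app_after_failure(2, 1))  # 0.5
```

Active/passive buys full post-failure performance at 50 percent utilization; active/active buys full utilization at the price of degraded performance during a failure. Pooling many servers into one cluster softens both penalties, since a single standby can back several active nodes.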
A single solution can be used across platforms and bring all servers into a single clustered environment, providing high availability and high server utilization rates, maximizing your hardware investment, and eliminating the high-cost myth.
Myth No. 2:
High Availability is Too Complex
High availability is usually seen as complicated because the traditional hardware-based approach requires that vendors install clusters – an expensive proposition – and charge again for professional services every time a new application comes on line.
Then there is the problem of labor-intensive management. It is time-consuming and demanding to manage high availability across a variety of servers, operating systems, and applications. If you operate five different server platforms across multiple applications, each of them will demand a different clustering solution and your already-high administrative costs will rise.
You can add high availability, avoid all this complexity, and reduce costs with a software solution that lets you use the same clustering solution across different server platforms and operating systems. Once the first cluster is in place – a matter of minutes – any administrator can extend it or build new clusters quickly and easily.
When you change the configuration of any node with the clustering tool, you can extend the changes to all other nodes. All nodes are managed from a single graphical user interface (GUI), with easy failover across nodes within a cluster or across a distance.
Myth No. 3:
It’s Too Hard to Measure
One problem with traditional approaches to high availability is that there is no satisfactory way to measure results. The IT department may, for example, have a service level agreement (SLA) with a business unit that states that there will be no more than two hours of downtime over five nights. The department may achieve these goals but have no way of knowing it. There may well be an availability problem, but with unmeasured SLAs and no historical reports, there’s really no way to verify performance or identify problems. Is it an application failure? Component failure? Human error? Unfortunately, in many cases, nobody knows.
But integrated reporting tools are now available, and they can enable you to track availability, report results, analyze trends, and identify problems. They enable you to say with authority that you are meeting SLA requirements and can support your statements with historical reports.
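Once outages are logged per incident, verifying an SLA like the one above is a simple aggregation. A hypothetical sketch (the window and incident durations are invented for illustration):

```python
FIVE_DAY_WINDOW = 5 * 24 * 60  # SLA measurement window, in minutes

def check_sla(outage_minutes, allowed_minutes, window=FIVE_DAY_WINDOW):
    """Sum logged outage durations, compare against the SLA downtime budget,
    and report the achieved availability for the window."""
    downtime = sum(outage_minutes)
    availability = (window - downtime) / window
    return downtime <= allowed_minutes, downtime, availability

# Three logged incidents of 25, 40, and 30 minutes against a 2-hour budget.
ok, downtime, avail = check_sla([25, 40, 30], allowed_minutes=120)
print(ok, downtime, round(avail, 4))  # True 95 0.9868
```

The hard part is not the arithmetic but the incident log itself – which is exactly what integrated reporting tools provide, along with the historical trend data needed to attribute downtime to application failure, component failure, or human error.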
Myth No. 4:
It’s Too Hard to Test
IT managers who implement disaster recovery solutions face the problem of uncertainty. It is impossible to be 100 percent sure that your configuration will work until it’s actually in production and capable of causing serious downtime – which is what you’re trying to eliminate in the first place. So the system must be tested.
But testing availability creates a paradox – you have to risk losing availability to see if your availability systems work. Consequently, companies spend millions to implement a disaster recovery plan but are never certain the plan will work because they don’t want to risk downtime by testing it. They seem to be operating on blind faith.
If they do decide to test a disaster recovery plan, it will be inconvenient and time-consuming. It will involve many steps, and it will almost certainly take place over the weekend or in the middle of the night.
On the other hand, you can simply use a fire drill process to test your disaster recovery solution on a spare system before you put it into production. The fire drill creates a clone of your environment, including clustering and replication processes, and tests it at any time without impacting production at all. You then know positively how your disaster recovery plan will work.
We find that many companies don’t implement disaster recovery because they believe it’s simply too costly. They think mostly in terms of protecting data at a secondary site, but they don’t think about getting the application running again so the data can be accessed. They consider automated restoration of applications to be unachievable today, so they focus on data protection.
Cost-effective software technology available today can integrate the restoration of both your business-critical application and your data. It automates the entire disaster recovery process to eliminate potential downtime from human error. It is literally a one-click operation. From a single solution in a single cluster, you can implement the integrated restoration at any distance: locally, across the street, or even across the globe.
Most companies continue to use the traditional hardware approach to replication, which means that the computing and storage hardware at the secondary site must duplicate the hardware at the primary site. While this is a popular method, it is also extremely expensive because it is proprietary and leaves IT departments with no choice where vendors and operating systems are concerned. Furthermore, it has distance limitations: short runs require dual dedicated FibreChannel connectivity, and longer runs become more costly still, requiring FibreChannel-over-IP hardware converter devices.
On the other hand, why not replicate the volume instead of the hardware? Volume management software tools can give you bulletproof replication over any hardware, any network, and any distance, delivering recoverable data at a much lower cost.
Knowledge about the use of innovative software technology dispels the myths surrounding high availability. It is much less costly than traditional hardware solutions. It is not complex to manage. In fact, it vastly simplifies management. It can be implemented in minutes, and it can easily be measured, analyzed and reported. It puts high availability within easy reach of most companies today.
Matt Fairbanks, technical director at Veritas Software, is in charge of product strategy for high availability and storage management solutions for UNIX and Windows NT environments. Fairbanks joined Veritas in 1996 and has held various product management and international marketing positions in the areas of data protection, high availability, and network systems management. Fairbanks received his MBA from Southern Methodist University in Dallas.