Developing an Uptime Plan
Developing an uptime management plan provides organizations with a structured way to assess critical processes and threats, and to build a program of detection, notification, restoration, and recovery to implement when a disaster or major disruption occurs.
The National Institute of Standards and Technology (NIST) has produced the Contingency Planning Guide for Information Technology Systems (Special Publication 800-34), an invaluable resource for any organization with this goal. It outlines a seven-step approach:
- Develop the contingency planning policy statement. A formal department or agency policy provides the authority and guidance necessary to develop an effective contingency plan.
- Conduct the business impact analysis (BIA). The BIA helps to identify and prioritize critical IT systems and components.
- Identify preventive controls. Measures taken to reduce the effects of system disruptions can increase system availability and reduce contingency life-cycle costs.
- Develop recovery strategies. Thorough recovery strategies ensure that systems can be recovered quickly and effectively following a disruption.
- Develop an IT contingency plan. The contingency plan should contain detailed guidance and procedures for restoring a damaged system.
- Plan testing and training exercises. Testing the plan identifies planning gaps whereas training prepares recovery personnel for plan activation; both activities improve plan effectiveness and overall agency preparedness.
- Plan maintenance. The plan should be a living document that is updated regularly to remain current with system enhancements.
Availability – Measurement of Success
Availability is the measurement of success in achieving uptime goals: the percentage of time that a system is ready to perform its assigned function. The most common metric of availability is expressed in “nines,” as in “five nines of reliability.” This refers to 99.999 percent availability, which allows only about five minutes of unplanned downtime per year.
This metric, however, does not take into account that not all outages are alike. Is one five-minute outage a year the same as five one-minute outages spread throughout the year? Is downtime at 3 a.m. the same as downtime at 3 p.m.? The answers vary between organizations. Ultimately, the true impact of downtime is determined by the specific user experience, which factors in the revenue lost, the opportunities missed, and the resources spent on firefighting instead of planned activities.
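The "nines" arithmetic can be checked directly. The following is a minimal sketch, not part of the original article; the function name and the choice of a 365.25-day average year are assumptions made for illustration.

```python
# Assumption: an average year of 365.25 days, to account for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for two through five nines.
for label, pct in [
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]:
    print(f"{label:>11} ({pct}%): {allowed_downtime_minutes(pct):10.2f} minutes/year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes per year, which is the "five minutes" figure quoted above. The calculation says nothing about when the minutes fall or how they are distributed, which is exactly the limitation of the metric.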
Several other measurements should also be considered when assessing the impact of downtime:
- Mean time to repair: You can make the system reliable, but failures will eventually happen. This metric measures the time from failure to recovery, once the problem is diagnosed.
- Affected users: The number of users who will experience a loss of service. Consider an outage that lasts only one minute but affects 1,000 users versus an outage that affects one user for 1,000 minutes. Which is worse for your organization?
- Potential affected users: The number of users who could be affected, since not all users access the system at all times. If a 10,000-subscriber cable TV system goes out but only 10 percent of the homes have TVs on, then the number of potential affected users is 10,000 but the number actually affected is only 1,000.
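The measurements in this list can be expressed in a few lines of code. This is an illustrative sketch only: the `Outage` class, its field names, the 30-minute outage duration, and the sample repair times are all assumptions, not figures from any study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outage:
    duration_minutes: float
    potential_users: int    # everyone who could have been using the service
    active_fraction: float  # share of those users actually using it at the time

    @property
    def affected_users(self) -> int:
        """Users who actually experienced the loss of service."""
        return round(self.potential_users * self.active_fraction)

    @property
    def user_minutes_lost(self) -> float:
        """Affected users multiplied by outage duration."""
        return self.affected_users * self.duration_minutes


def mean_time_to_repair(repair_minutes: Sequence[float]) -> float:
    """Average repair time across a set of past failures."""
    if not repair_minutes:
        raise ValueError("at least one repair duration is required")
    return sum(repair_minutes) / len(repair_minutes)


# Cable TV example from the text; the 30-minute duration is an assumption.
cable = Outage(duration_minutes=30, potential_users=10_000, active_fraction=0.10)
print("affected users:", cable.affected_users)

# One minute for 1,000 users versus 1,000 minutes for one user.
wide = Outage(duration_minutes=1, potential_users=1_000, active_fraction=1.0)
narrow = Outage(duration_minutes=1_000, potential_users=1, active_fraction=1.0)
print("user-minutes lost:", wide.user_minutes_lost, narrow.user_minutes_lost)

# Hypothetical repair times, in minutes, for three past failures.
print("MTTR:", mean_time_to_repair([12, 45, 18]))
```

The two contrasting outages produce the same number of user-minutes lost, so a single figure cannot settle which is worse. That judgment still depends on the organization.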
Calculating the Cost of Downtime
Downtime is expensive. A 2005 study by Infonetics Research of 80 large organizations found that overall downtime costs averaged 3.6 percent of annual revenue. A recent Forrester Research survey found that almost two-thirds of respondents could not even provide an estimate of downtime costs. Of those who could, 43 percent estimated downtime costs at $10,000 to $100,000 per hour, and 7 percent put them at more than $1 million per hour.
Downtime costs can come in many forms:
- Recovery Costs – the cost of replacing the damaged items over time
- Revenue Loss – the revenue that is not generated during an outage
- Productivity Loss – loss of productivity (machinery, employees, etc.) pre- and post-outage
- Loss of Future Revenue – loss of long-term revenue, loss of market share, and loss of opportunities
- Loss of Confidence – loss of confidence from customers, partners, vendors, the investment community, stockholders, and other key stakeholders
- Loss of Employment – loss of staff due to downtime (probably the most costly!)
Although it may be difficult to assess the financial implications of downtime caused by various types of events, in worst-case scenarios, when costs are compounded, the potential exists for serious economic impact to any organization.
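The more directly measurable items in the list above (revenue loss, productivity loss, and recovery costs) can be combined into a rough first estimate. The sketch below is a simplified model under stated assumptions: the function name, its parameters, and every dollar figure in the example are hypothetical, and the longer-term losses (future revenue, confidence, employment) are deliberately left out because they do not reduce to a simple hourly rate.

```python
def estimate_downtime_cost(
    hours: float,
    revenue_per_hour: float,
    idle_employees: int,
    loaded_hourly_wage: float,
    recovery_cost: float = 0.0,
) -> dict:
    """Rough downtime cost estimate covering only the directly measurable items."""
    revenue_loss = hours * revenue_per_hour
    productivity_loss = hours * idle_employees * loaded_hourly_wage
    return {
        "revenue_loss": revenue_loss,
        "productivity_loss": productivity_loss,
        "recovery_cost": recovery_cost,
        "total": revenue_loss + productivity_loss + recovery_cost,
    }


# Hypothetical four-hour outage; all inputs are illustrative assumptions.
estimate = estimate_downtime_cost(
    hours=4,
    revenue_per_hour=25_000,
    idle_employees=120,
    loaded_hourly_wage=45,
    recovery_cost=15_000,
)
for item, amount in estimate.items():
    print(f"{item:>18}: ${amount:,.0f}")
```

Even an estimate this coarse gives an organization a number to weigh against the cost of preventive controls and recovery tooling.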
Managing Remote Sites
Networks with remote sites present unique challenges for contingency planning, and those challenges need to be carefully addressed. Remote sites are more difficult to manage than traditional sites for a number of reasons. First, geography creates obstacles. Many remote sites are in unpopulated areas, far from the staff responsible for them. Remote sites may also be hard to enter, either because they are in difficult-to-reach locations or because they have extreme security requirements, as airports do. These constraints make it difficult to reach remote sites quickly for planned maintenance, let alone to respond to, manage, and recover from unplanned downtime.
In addition, remote sites may have a limited business function and consequently a smaller investment in resources, so they tend to be overlooked in the planning process. At remote sites, standby resources may not be available to take over. Contingency resources might be centrally located and therefore far away, or expensive to keep in readiness, since one spare is needed at every location. At central sites, by contrast, a small number of contingency resources can serve many production systems.
Finally, the functions of a remote site may be specific to its geography. The pump has to be where the well is. The cell tower needs to be located relative to the rest of the network. You just can’t move the functions to a hot standby site. The fix has to be at the site itself.
Downtime is inevitable. Even at five nines of availability, it is going to happen. There are, however, two ways to limit its impact: reduce the likelihood of downtime, and recover faster when it does occur. Typically, most, if not all, of the emphasis is placed on reducing the likelihood of downtime by building high-availability systems. This is a reasonable but expensive course. What is often overlooked is the recovery side of the downtime equation. A number of targeted initiatives can dramatically reduce the duration of downtime occurrences.
In examining key operational failure points – equipment, connectivity, processes, and staffing – organizations can mitigate risk with built-in redundancies and automated procedures that keep downtime to a minimum. Sometimes very simple solutions prove to be cost-effective measures to reduce the likelihood of failures and to shorten their duration when things do go awry.
For starters, automating recovery operations can help an organization dramatically shorten the duration of a failure. Failover switching is designed to detect trouble in active systems and automatically move operations to hot standby systems. Protection switching reroutes communications equipment to diversely routed services from alternate providers. These automatic processes can be integrated into existing management schemes like SNMP, or they can run as standalone, dedicated managers built specifically to provide rapid recovery in the event of downtime.
Standby power sources like UPS systems, solar power, and self-starting generators can provide non-stop operations in the event of a utility mains power failure. Adequate surge protection is also important. Replacing a surge protector after a lightning strike is generally much quicker and easier than replacing the critical equipment it is meant to protect.
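The failover logic described above can be sketched as a small state machine. This is a minimal illustration, not a description of any particular product: the `FailoverSwitch` class, the injected `check_primary` health check, and the three-failure threshold are all assumptions. A real deployment would replace the health check with an actual probe, such as an SNMP query or a ping.

```python
from typing import Callable


class FailoverSwitch:
    """Move operations to a hot standby after repeated primary health-check failures."""

    def __init__(self, check_primary: Callable[[], bool], failure_threshold: int = 3):
        self.check_primary = check_primary          # hypothetical probe, supplied by the caller
        self.failure_threshold = failure_threshold  # consecutive failures before switching
        self.consecutive_failures = 0
        self.active = "primary"

    def poll(self) -> str:
        """Run one health check and return the system that should be active."""
        if self.check_primary():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "standby"
        return self.active


# Simulated probe results: one success, three failures, then a success.
results = iter([True, False, False, False, True])
switch = FailoverSwitch(check_primary=lambda: next(results), failure_threshold=3)
states = [switch.poll() for _ in range(5)]
print(states)
```

Two design choices are worth noting. Requiring several consecutive failures avoids switching on a single dropped probe, and the sketch never fails back automatically, leaving the return to the primary as a deliberate operator decision.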
In addition, automatic or remote reboot capabilities allow for a quick restart of failed equipment. In some instances, a reboot is all that is needed to restore critical systems, and it can certainly be the first line of defense in determining the severity of the problem. Reboot systems can be integrated into both AC and DC power distribution schemes.
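A reboot-first recovery policy might look like the following sketch. The function name, the injected `is_healthy` and `power_cycle` callables, and the retry limit are hypothetical; in practice `power_cycle` would drive whatever remote power switch or management interface the site actually has.

```python
import time
from typing import Callable


def recover_with_reboot(
    is_healthy: Callable[[], bool],
    power_cycle: Callable[[], None],
    max_reboots: int = 2,
    settle_seconds: float = 0.0,
) -> str:
    """Try rebooting failed equipment before escalating to a site visit.

    Returns "healthy", "recovered", or "escalate".
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_reboots):
        power_cycle()
        time.sleep(settle_seconds)  # give the equipment time to come back up
        if is_healthy():
            return "recovered"
    # Reboots did not help, so the problem is more severe than a hung device.
    return "escalate"


# Simulation: the device is down, stays down after one reboot, recovers after the second.
health = iter([False, False, True])
reboots = []
outcome = recover_with_reboot(
    is_healthy=lambda: next(health),
    power_cycle=lambda: reboots.append("cycle"),
    max_reboots=2,
)
print(outcome, len(reboots))
```

The "escalate" result is the useful signal here: if a bounded number of reboots does not restore service, the fault is unlikely to be transient and a technician or another recovery path is needed.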
Console port access and remote KVM systems also allow basic troubleshooting and reconfiguration at remote sites without having to send a technician to the location.
Remembering the Human Element
For any organization, the ability to meet ever-shrinking recovery time objectives depends on the human element, one of the most important factors in contingency planning. When disaster strikes, no amount of hardware can replace or adequately compensate for ill-trained, undermanaged staff. With regular training and testing, an organization is better positioned to have the right people in place to follow the right procedures and make the right decisions to recover IT operations and ultimately return the affected system(s) to normal operating conditions. Proper investment in the human element can drastically reduce downtime and possibly prevent costly mistakes that are all too often the product of crisis thinking.
As businesses increase their reliance on technology, including remote networks, they must guard against downtime-related losses in revenue and employee productivity when disaster inevitably strikes. When downtime does occur, it can have a profoundly negative impact across the organization. To ensure your human capital, processes, systems, and data are recoverable and can support normal business operations in the event of a disaster, your network must be optimized for maximum availability. An optimal mix of planning and preparedness, cross training, redundancy in staff, and advanced equipment will ultimately help your organization drive up reliability, drive down costs, and ensure its longevity and long-term success.
"Appeared in DRJ's Winter 2008 Issue"