Once IT was just another important business resource. Today IT is the business for many companies. Without it, most organizations would be incapable of serving customers, collaborating with partners, developing new products, or performing other basic business functions. As a result, data center availability has become an essential precondition to competitiveness and profitability. Yet despite their best efforts to achieve “five nines” (99.999 percent) availability, businesses remain vulnerable to a variety of threats. Chief among them are issues affecting electrical power systems. Data centers rely on a continuous supply of clean electricity. However, anything from a subtle power system design flaw to a failure in the electrical grid can easily bring down even the most modern and sophisticated data center.
Business Process Management Practices
1. Break down organizational barriers
At most companies, two separate organizations contribute to data center management: IT and facilities. This divided organizational structure, long the norm among large businesses, often results in poor communication between the people responsible for maintaining workloads and the people responsible for delivering power to them. Today’s massive server infrastructures are growing larger, hotter, and more power-hungry all the time. Moreover, widespread adoption of blade servers and virtualization has only accelerated these trends. In today’s data centers, moving workloads or hardware around without consulting a facilities engineer could result in overloaded electrical feeds or overwhelmed HVAC systems, which could bring down critical systems.
Recommendation: To decrease the incidence of power-related downtime, businesses should establish clearly defined and documented procedures for how and when IT managers and facilities managers consult with one another before implementing data center modifications.
To further facilitate communications, companies should also consider changing their organizational chart so IT and facilities report up to the same C-level executive. This can make enforcing interaction between IT and facilities personnel easier by subjecting both organizations to a common set of expectations and a common reporting structure.
2. Focus on long-term value rather than short-term costs
At many companies, short-term and long-term priorities are in conflict during the construction or renovation of a data center. Senior executives generally urge the people responsible for building data centers to hold down costs and shorten completion times. As a result, supply chain participants, engineers, contractors, and project managers on data center construction projects tend to make equipment selections based on who submitted the lowest bid and promised the quickest delivery.
The people responsible for operating data centers, however, have a different set of priorities that are often better aligned with the company’s long-term interests.
Recommendation: Executives with review and decision-making authority over a data center construction or renovation project should clearly communicate the importance of adhering scrupulously to original operating specifications, even if it means spending a little more during the construction process. Rewarding construction teams for taking a long-term approach to procurement can lessen their incentive to cut corners in ways that adversely impact availability over a data center’s lifespan.
3. Adopt standardized facilities work processes
IT departments are increasingly utilizing standardized best practice frameworks, such as the Information Technology Infrastructure Library (ITIL), to help them systematize and enhance their work processes. Organizations that follow ITIL guidelines usually enjoy better control over IT assets, enabling them to more easily diagnose and address IT outages.
Recommendation: Facility departments should take steps to develop standardized, documented processes. Performing essential activities in consistent, repeatable ways can significantly lower the likelihood of power and cooling breakdowns while simultaneously increasing the productivity of facilities technicians.
4. Maintain a facilities change management database
Many IT departments carefully track all changes to IT resources in a configuration management database (CMDB). Information in the CMDB can help IT employees resolve service interruptions more effectively, and it can be especially valuable in emergency situations when accessing important data in a timely manner is critical.
Recommendation: Facilities departments should establish and rigorously maintain a CMDB of their own. ITIL guidelines offer a useful starting point for such an initiative, and companies can also draw on a variety of specialized CMDB software applications.
5. Consider ease of repair along with reliability when evaluating power system components
People often use “availability” and “reliability” interchangeably. However, the two words have related but distinct meanings. Reliability (as measured by the mean time between system failures or MTBF) is one of two key components of availability. The other is the mean time required to repair a given system when it fails, or MTTR. The formula for availability is as follows: Availability = MTBF / (MTBF + MTTR)
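To make the formula concrete, the short Python sketch below computes availability from MTBF and MTTR and converts the result into expected downtime per year. The function names and the sample MTBF and MTTR figures are illustrative assumptions, not measurements from any particular product.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years are ignored for simplicity


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    # Hypothetical component: identical MTBF, two different repair times.
    for mttr in (8.0, 1.0):
        a = availability(100_000.0, mttr)
        print(f"MTTR {mttr} h: availability {a:.6f}, "
              f"about {annual_downtime_minutes(a):.1f} minutes of downtime per year")
```

With MTBF held constant, shortening the repair time raises availability, which is why repairability deserves as much attention as reliability. By the same arithmetic, "five nines" corresponds to a little over five minutes of downtime per year.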
Recommendation: When evaluating power system components, managers should look for products that are both highly reliable and quickly repairable. In particular, they should carefully investigate how swiftly and effectively a given power system manufacturer can service its products. How many service engineers does the manufacturer employ, where are they stationed, and how rapidly can they be onsite at your data center after an outage? Is 24/7 support available? How thoroughly do service engineers know the manufacturer’s products? Do they have access to escalation resources if they can’t solve a problem themselves?
Companies should also seek out products with redundant, modular designs. Should a module fail in such a system, other modules compensate automatically, increasing the parent unit’s MTBF. In addition, replacement modules tend to be more readily obtainable than conventional components and are usually easy enough for as few as one or two technicians to install quickly, often without manufacturer assistance. The result is lower MTTR, and hence better availability.
6. Implement enterprise-wide monitoring and proactive diagnostics
Contrary to popular belief, few systems fail without warning, except in disasters. It’s just that their warnings too often go unheeded since the monitoring systems in place are reactive in nature. For example, imagine that a UPS fails late one night, bringing the data center down with it. Odds are good that in the days or hours leading up to the failure, the UPS was emitting signals suggestive of future trouble.
Recommendation: The latest enterprise management products can help businesses monitor and proactively administer mission-critical equipment, including power, environmental, and life/safety systems. While deploying power system monitoring and diagnostic software is an important start, facility departments must also ensure that they have disciplined work processes in place for consulting that software and responding swiftly to signs of danger.
Figure 2: The latest enterprise management applications give IT and facilities a single, Web-based view of power consumption and thermal signatures. They can also proactively alert operators and facility managers if power system components are in danger of exceeding energy and temperature thresholds.
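The difference between reactive and proactive monitoring can be pictured with a small Python sketch. It is a simplified illustration, not the logic of any particular management product: the `Reading` class, the `classify` and `alerts` functions, and the 80 percent warning level are all hypothetical, and the sketch assumes positive readings where higher values are worse.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    name: str     # e.g. "UPS-1 battery temperature" (hypothetical label)
    value: float  # latest measured value
    limit: float  # level at which the component is out of tolerance


def classify(reading: Reading, warn_fraction: float = 0.8) -> str:
    """Return 'ok', 'warning', or 'critical' for one reading.

    A 'warning' is raised once the value reaches warn_fraction of the
    limit, that is, before the limit itself is breached.
    """
    if reading.value >= reading.limit:
        return "critical"
    if reading.value >= warn_fraction * reading.limit:
        return "warning"
    return "ok"


def alerts(readings: list[Reading]) -> list[tuple[str, str]]:
    """List (name, status) for every reading that is not 'ok'."""
    result = []
    for reading in readings:
        status = classify(reading)
        if status != "ok":
            result.append((reading.name, status))
    return result
```

A purely reactive system reports only the "critical" state. The "warning" band is what gives operators time to act, provided a work process exists for someone to review and respond to it.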
Electrical Power System Practices
7. Create holistic contingency plans
Every data center has critical dependencies on external providers of electricity, fuel, and water. Every external provider is virtually guaranteed to experience a service interruption at some point in time.
The only question is whether or not one is prepared for the crisis when it occurs. In the case of a power outage, contingency plans typically involve running a diesel-powered generator until electrical service is restored. But what if the 24- to 48-hour supply of diesel fuel runs out before the electricity comes back?
Recommendation: Even the most well-designed facility is vulnerable to problems beyond an organization’s control. Businesses, therefore, must think comprehensively about external issues that could impact their data centers and carefully weigh the costs and benefits of preparing for them.
For example, stockpiling enough diesel fuel and water for chillers for five days instead of two may be expensive, but it’s significantly less costly than three days of downtime. This example may not be applicable for every business, but the chances of losing power for more than 48 hours may be greater than you think. When a massive ice storm struck New England and upstate New York in December 2008, more than 100,000 customers were still without power nearly a week later.
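The trade-off can be roughed out with a few lines of Python. This is a back-of-the-envelope sketch under stated assumptions: the function names, the burn rate, and the fuel volumes are hypothetical, and the model ignores refueling deliveries during the outage.

```python
def runtime_hours(fuel_liters: float, burn_liters_per_hour: float) -> float:
    """Hours a generator can run on the stored fuel at a steady burn rate."""
    if burn_liters_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_liters / burn_liters_per_hour


def uncovered_hours(outage_hours: float, fuel_liters: float,
                    burn_liters_per_hour: float) -> float:
    """Hours of the outage that the stored fuel does not cover."""
    covered = runtime_hours(fuel_liters, burn_liters_per_hour)
    return max(0.0, outage_hours - covered)


if __name__ == "__main__":
    burn = 200.0     # hypothetical liters per hour
    outage = 120.0   # a five-day utility outage
    for label, fuel in (("two-day supply", 48 * burn), ("five-day supply", 120 * burn)):
        gap = uncovered_hours(outage, fuel, burn)
        print(f"{label}: {gap:.0f} hours without power")
```

Multiplying the uncovered hours by an organization's own estimate of hourly downtime cost, and comparing the result with the price of the extra fuel and storage, gives a first approximation of whether the larger stockpile pays for itself.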
8. Adopt a power system topology appropriate to your requirements
Power system topology has a major impact on procurement costs, operational expenses, reliability, and average repair times. The more redundancy you build in, the more it will cost you to build and run, but the faster it will recover from an outage.
The Uptime Institute, an independent research organization that serves owners and operators of enterprise data centers, has defined four power system topologies for mission-critical facilities that illustrate this principle in the table below.
Recommendation: There is no single correct answer when it comes to selecting a power system topology. Organizations should match their power system topology to their particular circumstances and needs. For example, a Tier II topology might be fine for a data center that hosts a Web application, assuming multiple back-up sites are available, because users are unlikely to complain if they occasionally encounter a few seconds of latency. On Wall Street, however, a few seconds of latency can result in millions lost, so a data center that hosts a financial trading application would be wise to utilize a Tier IV topology.
9. Replace outdated equipment
Typically, data centers utilize UPS equipment to protect against power anomalies. Such systems cleanse “dirty” electrical systems and provide emergency power during outages. Until recently, however, the most highly available double-conversion UPS systems tended to be the least efficient with respect to power consumption, and vice versa. As a result, organizations looking to hold down operating costs may have implemented energy-efficient UPS products that delivered below-average availability, while organizations more concerned about uptime deployed high-availability UPS systems that wasted electricity.
Recommendation: Proven UPS technology available today enables organizations to enjoy both high availability and high efficiency in a single unit. Companies using older UPS technology should consider upgrading to this newer generation of devices to increase application availability and reduce total cost of ownership simultaneously.
10. Audit your power systems
Most data center managers think they know what their power systems are capable of delivering. Far fewer, however, actually know. That’s because most businesses fail to audit their power infrastructure on a regular basis. Only by auditing power systems and operational processes can one establish the data center’s maximum load parameters concretely. Relying instead on product specifications and contractor assurances leaves you at risk of exposing capacity shortfalls the hard way, when you need to put important new IT workloads into production but can’t due to insufficient power.
Recommendation: Audit your power systems thoroughly and regularly.
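Once an audit has established real figures, checking whether a new workload fits becomes simple arithmetic. The Python sketch below shows one possible form of that check. The function names, the 20 percent reserve, and the kilowatt values in the usage note are hypothetical assumptions, not recommended settings.

```python
def remaining_capacity_kw(audited_capacity_kw: float,
                          measured_load_kw: float,
                          reserve_fraction: float = 0.2) -> float:
    """Power still available after holding back a safety reserve.

    audited_capacity_kw is the capacity confirmed by an audit, not the
    nameplate figure from a product specification.
    """
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    usable_kw = audited_capacity_kw * (1.0 - reserve_fraction)
    return usable_kw - measured_load_kw


def can_add_workload(planned_kw: float,
                     audited_capacity_kw: float,
                     measured_load_kw: float,
                     reserve_fraction: float = 0.2) -> bool:
    """True if the planned load fits within the remaining capacity."""
    remaining = remaining_capacity_kw(audited_capacity_kw, measured_load_kw,
                                      reserve_fraction)
    return planned_kw <= remaining
```

For instance, with an audited capacity of 500 kW, a measured load of 350 kW, and a 20 percent reserve, about 50 kW remains, so a planned 40 kW workload fits while an 80 kW workload does not.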
Maintaining availability in today’s large, hot, and complex data centers is more difficult – and more strategically vital – than ever. Organizations can mitigate their exposure to downtime by adopting the proven best practices discussed here. Some such practices admittedly require incremental investments in new hardware or software. However, many are as simple as getting IT and facilities personnel talking to one another.
Dr. Kenneth Uhlman, PE, is the director of data center business development for Eaton Corporation where he is responsible for Eaton’s global data center strategy. He focuses on improving efficiency, availability, and business service management for data centers, including the convergence of IT and facilities. For more information visit www.eaton.com/powerquality.