While playing with my 3-year-old son, he had a minor meltdown when it was bedtime. He wanted to stay up past his normal bedtime to watch Team Umizoomi on TV. With the show recorded on DVR, missing the time slot was not the problem, but he wasn’t having any of that. We compromised and watched five minutes with an agreement we’d watch the rest another time. Preparing information for this article reminded me that like 3-year-olds, data centers can also have meltdowns – though the impact is a bit more off-putting.
Modern data centers owned by giants such as Google and Facebook are wonderful engineering marvels. They combine state-of-the-art electronics with time-proven thermodynamics to ensure the server racks keep their cool. And when they don’t work according to plan, things can go bad very quickly, as Amazon recently found out when a failure of its Web hosting services took down sites such as Quora, Foursquare, HootSuite and Reddit.
The Amazon disruption was reportedly triggered by a network change that overloaded nodes in Amazon’s Elastic Compute Cloud (EC2), compounded by human error. Size increases complexity and the potential for failure, and it can also attract unwanted attention.
Data centers as well as small- to medium-sized business (SMB) IT departments face daily challenges that could upset their equipment and business operations. Recent surveys have shown reliability, or uptime, to be the single biggest concern IT professionals worry about, and because almost every business operation depends on its servers and telecommunication equipment regardless of its size, they have a right to be concerned.
To highlight this issue, a recent edition of an IT industry journal focused on critical power and cooling. One article, “Cooling Failures and Power Outages” (Erik Schipper, Data Center Magazine, Issue 3/2011), cited a U.S. survey showing IT systems failures reduce businesses’ ability to generate revenue by 32 percent and create a lag in business even after servers recover. This can translate into anywhere from thousands to millions of dollars in lost revenue.
One of the biggest concerns that can lead to data center, server, and telecom room disasters is temperature control. Modern data centers are massive and can consume almost 5 percent of the output of the average U.S. coal-fired power plant. And with power comes heat – the enemy of modern ICs. Add to that the outside environment, especially in the hot summer months, when the potential to lose power and shut down the servers increases significantly. SMBs report that their biggest challenge is that AC systems designed to keep human occupants in office buildings comfortable are being pressed into electronic rack temperature management – two significantly different applications with different operating conditions. AC systems are notorious for kicking off when the going gets tough, and the IT equipment rooms, former storage closets, and small offices that are not designed to keep up with such challenges are the first to suffer.
The article mentioned above noted that three of the four leading causes of data center outages are uninterruptible power supply (UPS) failures, all associated with keeping things going when power is lost. The sixth leading cause was heat-related/CRAC failure, reported by a third of respondents. Data center designers and managers are well aware of these challenges; they have been widely studied.
In one example Dr. Kishor Khankari, an expert in the industry, presented a paper at the summer 2010 ASHRAE conference in Albuquerque titled “Thermal Mass Availability for Cooling Data Centers during Power Shutdown.” He noted that “during power outage situations servers continue to operate by the power provided by UPS units while the supply of cooling air is completely halted until alternate means of powering the cooling system are activated. During this time servers continue to generate heat, and the server fans continue to circulate room air several times through the servers. This can result in a sharp increase in the room air temperature, bringing it to undesirable levels, which in turn can lead to automatic shutdown of servers and in some cases can even cause thermal damage to servers.”
Large, modern, state-of-the-art data centers generally have ancillary equipment to make sure they are both always online and protected from damage. In one U.S. installation, the data center installed the capacity to generate more than 30 kW of back-up electrical power. However, it takes some time to start these systems and resume normal cooling operation. Dr. Khankari noted, “It is crucial to understand the rate of temperature rise of the room air during this off-cooling period and how long servers can sustain such a situation without automatic thermal shutdown.”
Dr. Khankari simulated the effects of room height, rack weight, and number of rack rows, among other variables, and found room air reached target temperatures of 95°F (35°C) in less than 100 seconds and 125°F (51.7°C) in less than 300 seconds (five minutes). Only where heat load density was ≤100 W/sq.ft. (1,076 W/m²) did room temperature stay below the target temperatures for more than five minutes. In most cases, then, data center cooling must be restored in less than five minutes – not a lot of time for facility managers to correct the situation. IT professionals also report that failures need not occur only during the incident itself; weakened electronics can fail with apparent randomness weeks or months after elevated-temperature excursions.
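These timescales can be sanity-checked with a back-of-the-envelope lumped-capacitance estimate. The sketch below is illustrative only – the room size, starting temperature, and heat load are hypothetical assumptions, not figures from Dr. Khankari’s paper, and it ignores the thermal mass of walls and racks (which slows the rise), so it is a worst case:

```python
# Lumped-capacitance estimate of room air temperature rise when cooling stops.
# All inputs are illustrative assumptions, not values from the cited paper.

AIR_DENSITY = 1.2         # kg/m^3, air at roughly room temperature
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K)

def seconds_to_reach(temp_limit_c, temp_start_c, heat_load_w, room_volume_m3):
    """Seconds until room air hits temp_limit_c, heating the air alone
    (ignoring walls and equipment thermal mass, hence a worst case)."""
    thermal_capacity = AIR_DENSITY * room_volume_m3 * AIR_SPECIFIC_HEAT  # J/K
    rise_rate = heat_load_w / thermal_capacity  # K per second
    return (temp_limit_c - temp_start_c) / rise_rate

# Hypothetical 10 m x 10 m x 3 m room with a 30 kW IT load, starting at 22 C
t = seconds_to_reach(temp_limit_c=35, temp_start_c=22,
                     heat_load_w=30_000, room_volume_m3=300)
print(f"~{t:.0f} s to reach 35 C")
```

Even with the equipment’s thermal mass helping, the answer lands in the same few-minutes regime the simulations report, which is why rapid cooling restoration matters so much.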
So what can be done? Large, modern data centers are built with a vast network of sensors and back-up systems. Everything from the basics of temperature and humidity to exotic video and intrusion monitoring systems are installed in the most secure sites. Teams of security personnel and 24/7 monitoring are deployed to try to ensure everything that can prevent system outages is done. Redundant locations and fault tolerant operating systems are employed to help manage the inevitable problem. In short, millions of dollars are spent each year in disaster prevention to ensure business continuity.
For small and mid-sized businesses that rely on smaller data centers and server or telecom rooms, million-dollar budgets with which to address reliability problems are generally not in the cards. A recent IDP white paper titled “Business Risk and the Mid-size Firm: What Can Be Done to Minimize Disruptions?” described how mid-sized businesses face IT disasters and offered some ideas to help prevent or minimize disruptions. It noted that while executives at these companies worry about natural disasters when they read about floods, tornadoes, or earthquakes, many business-disrupting outages result from causes far less dramatic. Everyday events such as a construction crew cutting through a power line, an air conditioning failure, a network provider interruption, or a security issue can take systems offline. These interruptions occur more often than most mid-size business managers expect, and with increasingly critical impact as customers become more accustomed to accessing information and placing orders online.
Disaster Prevention Checklist for SMB Server and Telecommunication Rooms

What to Look For | What is Required | Relative Cost | Ease of Implementation
Room and server rack temperature and humidity logs | Add USB, WiFi or cellular temperature and humidity monitoring devices | Low to Medium | Easy to Medium
Very warm or cool areas | Temperature baseline map | – | –
Uneven HVAC balancing | Varies: adjust registers, add additional ducting or cooling zones (AC equipment, installers) | Medium to High | Medium to Difficult
Excessively hot or high power usage servers, drives, etc. | Temperature monitoring devices, replacement electronic equipment where needed | Medium to High | Medium to Difficult
Inadequate electrical power to IT equipment | Electrical system reconfiguration or upgrade | Medium to High | Medium to Difficult
Frequent power outages | Back-up (UPS) power systems | Medium to High | Medium to High
Despite their concerns about being ready to deal with disaster recovery, many respondents in the IDP paper said they feel they cannot afford to prepare for disasters without exceeding their IT budget limits. This is despite research showing such expenditures can reduce costs by more than 35 percent compared with unprepared centers using older technology.
Mindful of the realities of IT budgets, SMBs can take the first steps to disaster prevention. Careful attention to the cooling requirements of electronic racks, especially in older facilities with common AC ducting, can help ensure personnel comfort is not needlessly taxing server and telecommunication rooms. Low-cost, high-performance environmental monitoring systems have been on the market for several years, and they continue to improve. These start with basic temperature and humidity monitoring and allow additional sensors to be added as needed. Implementing a modest sensor network can help provide a baseline of operation over time and point to hot spots or poorly ventilated spaces. Correction by adjusting airflow, or possibly adding a dedicated cool air zone for IT rooms, can go a long way toward preventing or quickly responding to problems for short money.
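The core of such a monitoring setup is simple: poll each sensor, average a few readings, log the result for the baseline, and alert when a threshold is crossed. A minimal sketch follows; `read_temperature` and `send_alert` are hypothetical stand-ins for whatever the chosen monitoring hardware and notification channel actually provide:

```python
# Minimal threshold-alerting sketch for a small sensor network.
# read_temperature() and send_alert() are hypothetical placeholders.
from statistics import mean

WARN_C = 30.0  # warn well before equipment limits are reached
SAMPLES = 5    # average a few readings to avoid alerting on noise

def read_temperature(sensor_id):
    """Placeholder: return the current reading in Celsius for one sensor."""
    raise NotImplementedError("wire this to your sensor hardware's API")

def send_alert(message):
    """Placeholder: email, SMS, or pager integration goes here."""
    print("ALERT:", message)

def check_sensor(sensor_id, read=read_temperature, alert=send_alert):
    """Average SAMPLES readings; alert if the average exceeds WARN_C."""
    avg = mean(read(sensor_id) for _ in range(SAMPLES))
    if avg >= WARN_C:
        alert(f"sensor {sensor_id}: {avg:.1f} C exceeds {WARN_C} C")
    return avg
```

Appending each returned average to a timestamped log builds exactly the kind of baseline map that reveals hot spots and HVAC imbalances over time.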
If more is needed, replacing older servers and electronic equipment can help head off unplanned failures and may provide the benefit of lower power usage and heat generation. While such changes require up-front expenditures, two things argue in favor of considering them. First, servers have become better and cheaper over time; the functionality of modern servers is often an order of magnitude greater than that of older devices. Second, replacement can be implemented over time, and in most cases provides a payback period within most companies’ guidelines. For example, changing out the oldest servers first will give the biggest ROI since they generally use the most power and contribute the most heat. The added benefit is that new servers will generally handle data more efficiently, so it may be possible to have one server take the place of two, more than doubling power savings. And the added reliability will often result in lower maintenance costs, with the added, often difficult-to-quantify benefit of improved business operations.
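The payback arithmetic is straightforward: energy saved per month times the electricity rate, remembering that every watt removed from the IT load also reduces the cooling load. A rough sketch, in which every price, wattage, and rate is a hypothetical illustration rather than vendor data:

```python
# Rough server-replacement payback estimate.
# All figures below are hypothetical illustrations, not vendor data.

def payback_months(purchase_cost, old_watts, new_watts,
                   electricity_per_kwh=0.12, cooling_overhead=0.5):
    """Months until energy savings repay the purchase price.
    cooling_overhead: extra cooling energy per unit of IT energy saved."""
    saved_kw = (old_watts - new_watts) / 1000
    # Savings accrue on both the IT load and the cooling needed to remove it
    monthly_savings = (saved_kw * (1 + cooling_overhead)
                       * 24 * 30 * electricity_per_kwh)
    return purchase_cost / monthly_savings

# Hypothetical consolidation: a $3,000 server replacing two 400 W machines
months = payback_months(3000, old_watts=800, new_watts=250)
print(f"payback in about {months:.0f} months")
```

The consolidation case in the example is the interesting one: replacing two servers with one roughly doubles the wattage saved, which is why retiring the oldest, hottest machines first tends to yield the shortest payback.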
Some SMBs may find that adding modern, reliable UPSs can help automate the first response to power loss, giving the response team time to act before there is a complete shutdown. Judicious application to the most mission-critical operations can be undertaken first, with additional devices added over time. Before undertaking such a purchase, a root-cause analysis of power outages is very helpful for determining whether power-loss events are caused by some internal system. For example, does power consumption by the air conditioning system during hot, humid summer months tax the facility’s electrical systems, causing voltage fluctuations in the electronic equipment? Understanding such linkages can help avoid both the problems themselves and the added expense of systems that mask them rather than fix them, and can point to other, more cost-effective solutions.
In the final analysis it’s impossible to prevent data centers from ever being shut down; recent events have shown that even the most sophisticated systems are at risk. In the past two years, Schipper noted, there have been outages at Bluehost in Provo, Utah, reportedly due to maintenance at a city-owned substation; cooling system failures at Wikipedia, Level 3 Communications, and Nokia’s Ovi; and a power failure at a Primus data center that took down Westnet, iiNet, Netspace, Internode, and TPG. For the big guys, virtualization and cloud services can help, along with new server and data center designs and smaller, cheaper back-up power systems – including green energy sources – that will go a long way toward making interruptions less frequent and less costly.
While SMBs may be able to get some relief from back-up power systems to help prevent or minimize problems, those systems cannot compensate for AC outages. Again, SMBs can benefit from basic environmental monitoring and alerting systems, providing themselves with early warning of potential problems. As summer months approach, electrical demands increase, bringing power interruptions and the associated network outages. Companies can protect themselves from further damage to equipment with reliable monitoring systems and well-thought-out response plans. The table above offers a checklist of considerations to help understand and address heat-related issues.
Things will always happen, both bad and good. When bad things get out of hand we call them disasters. Preparation is one key. Plan ahead for potential disaster scenarios and implement strategies to respond quickly and appropriately. Heading off disasters before they happen is often considered and frequently under-resourced. From an examination of various disaster models, ideas can emerge that give IT professionals advance warning to help prevent, or at least mitigate, some of the worst effects that could occur.
A range of options will emerge, and the cost-effectiveness of each can be evaluated. Like life insurance, it may be impossible to put a hard-and-fast ROI on the benefit, and this is where industry data can help. In the end, balance is essential. Evaluate the risk tolerance of the organization, the ability to respond to each option, and the relative effectiveness of each option in preventing or minimizing potential disasters. Patterns will emerge. Some will find they are doing enough today, some will find small holes to fill, and yet others will discover a full range of options to pursue. Nothing is foolproof, and in the end experience will help determine which options the IT and facilities organizations can reasonably implement and maintain.
Harry Schechter is CEO at Temperature@lert in Boston, Mass., where his ideal temperature is PV/nR. Temperature@lert, his third tech startup, offers Schechter the opportunity to combine his greatest talents – technology innovation and problem solving. Schechter received a Bachelor of Science in Applied Science from Miami University and an MBA from MIT’s Sloan Fellows Program where his research focused on commercializing cloud-based sensor systems. For more information, visit www.temperaturealert.com.