Administrators are constantly assessing data center and network capacity needs in preparation for a potential disaster, but are often at a loss to respond in an optimal way to balance capacity and performance during an actual disaster. Disasters may be caused by a spike in demand or a failure of power, cooling, or IT equipment. Mitigation plans must integrate data and applications with the available backup resources. Capacity improvements can also change or enhance disaster response and recovery. In high-availability environments, improving overall power efficiency and providing dedicated precision cooling across the infrastructure are also necessary to maximize IT capacity while minimizing the energy consumption of the cooling and building infrastructure. In addition, the power and cooling considerations for each data center must be factored into a comprehensive business continuity strategy to ensure satisfactory results.
This article will highlight a new tool and provide best practices to plan power and cooling support in any data center, and implement optimal capacity management as part of an effective disaster recovery strategy. The content will address questions including: can the power system provide adequate backup power during utility outages and if not, what are its actual limits relative to critical application service levels; is the power protection system capable of achieving desired levels of availability and handling future capacity levels; and will the cooling outage response plan ensure the operation of critical equipment while gracefully shutting down secondary applications?
The Tool: Data Center Infrastructure Management
Data center operators and IT managers now have a new tool in their arsenal: Data Center Infrastructure Management (DCIM) software. While the primary applications for DCIM solutions involve improving Power Usage Effectiveness (PUE) and Corporate Average Datacenter Efficiency (CADE) as part of overall capacity and performance planning, this tool also has a role to play in responding effectively to the loss of power and/or cooling as an important part of a disaster recovery strategy. Before exploring that role, however, it is necessary to have a basic understanding of DCIM capabilities.
DCIM solutions measure actual power consumption in data centers down to the granularity of the individual power outlet or server. This information can be used to improve power efficiency, optimize virtualized and load balanced infrastructures and make prudent choices during server refresh cycles. To simplify the implementation, most DCIM solutions eliminate the need for installing special agents or running additional wiring by supporting both the industry standard and popular proprietary protocols now used to measure power consumption. The better DCIM solutions also support advanced capabilities like auto-discovery, what-if analyses for capacity planning, building energy management system integration, temperature and cooling capacity tracking, sophisticated yet intuitive dashboards, comprehensive reporting, and more.
The best DCIM solutions offer another capability that can become quite useful during the loss of power or cooling: dynamic power optimization (DPO). DPO is normally used to achieve peak energy efficiency by migrating from today’s “always on” practice of operating servers to an “on demand” approach by working in cooperation with load-balancing or virtualization systems to continuously match server capacity with demand. DPO results in far better energy efficiency and higher IT utilization with no adverse impact on performance. The result is normally a reduction of 50% or more in total power consumed, which can help extend the life of any data center. But when performance is inevitably impacted during a full or partial outage, DPO helps IT managers make fully-informed decisions about which applications are affected and to what extent.
Here is how it works during normal operation: DPO employs a real-time calculation engine that continuously assesses server demand, taking into account both current demand and trends (the increase or decrease and at what rate), along with historical or anticipated patterns. When the engine detects an impending mismatch between anticipated demand and current capacity (whether too little or too much, or a constraint such as a temperature change due to cooling issues or a failover to generators due to a utility outage), it automatically informs the virtualization system or other system management environments to make the appropriate adjustments by capping, throttling, or powering up or down some number of servers. This process is usually automated via runbooks (standard operating procedures) that outline the specific steps involved during the ramp-up or ramp-down, from migrating applications to or from available virtual machines to adjusting cooling capacity, so as to cover the complexity of multi-server and multi-site application environments.
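The matching of capacity to anticipated demand described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any particular DPO product: the linear trend extrapolation, the 20% headroom, and the function names are all assumptions made for the example.

```python
import math


def forecast_demand(samples, horizon_steps=3):
    """Extrapolate demand linearly from the two most recent samples.

    samples: non-empty list of demand readings (e.g. requests/sec), oldest first.
    The linear model is an assumption; a real engine would also use
    historical and anticipated patterns.
    """
    current = samples[-1]
    if len(samples) < 2:
        return current
    trend = current - samples[-2]  # change per sampling interval
    return max(0.0, current + trend * horizon_steps)


def servers_needed(forecast, per_server_capacity, headroom=0.2):
    """Servers required to carry the forecast plus a safety margin."""
    return math.ceil(forecast * (1 + headroom) / per_server_capacity)


def plan_adjustment(samples, active_servers, per_server_capacity):
    """Positive result: power servers up. Negative: power servers down."""
    needed = servers_needed(forecast_demand(samples), per_server_capacity)
    return needed - active_servers


# Demand is rising by 100 requests/sec per interval; 12 servers are active.
print(plan_adjustment([1000, 1100], active_servers=12, per_server_capacity=200))
```

The sign of the result tells the virtualization or load-balancing layer which direction to move. A power or cooling constraint would enter such a sketch as an upper bound on the number of servers allowed to stay active.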
The same process can also be employed to reallocate virtualized machines or shift applications to other facilities to best accommodate all critical needs. And that involves making some difficult choices—much like the triage physicians need to perform when providing care during a large-scale emergency.
Such a triage determines how much server capacity is actually needed to run 1) all mission-critical and 2) any highly-desirable applications (perhaps with diminished performance), along with how much power can be saved by shedding 3) all non-essential applications and lower storage tiers. It is also important to take into account both the power needed to run the IT equipment and the increase in temperature anticipated while operating with no or only partial cooling during the power outage.
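As a rough illustration of such a triage, the following Python sketch allocates a limited power budget across the three tiers. The `App` fields, the two-pass allocation order, and the sample figures are hypothetical, and the sketch deliberately ignores the heat generated, which a real plan would also have to take into account.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    tier: int        # 1 = mission-critical, 2 = highly desirable, 3 = non-essential
    full_kw: float   # power at the normal service level
    min_kw: float    # power at the lowest acceptable (diminished) service level


def triage(apps, budget_kw):
    """Return {app name: allocated kW}. Apps missing from the result are shed."""
    # Tier 3 is always shed in this sketch.
    candidates = sorted((a for a in apps if a.tier <= 2), key=lambda a: a.tier)
    plan = {}
    remaining = budget_kw

    # Pass 1: admit apps at their minimum level, mission-critical first.
    for app in candidates:
        if app.min_kw <= remaining:
            plan[app.name] = app.min_kw
            remaining -= app.min_kw

    # Pass 2: spend leftover power restoring service levels, in the same order.
    for app in candidates:
        if app.name in plan:
            top_up = min(app.full_kw - plan[app.name], remaining)
            plan[app.name] += top_up
            remaining -= top_up

    return plan


apps = [
    App("billing", tier=1, full_kw=10, min_kw=6),
    App("search", tier=2, full_kw=8, min_kw=4),
    App("batch-reports", tier=3, full_kw=12, min_kw=12),
]
print(triage(apps, budget_kw=15))
```

With a 15 kW budget, the mission-critical application is restored to full power first, the highly desirable one runs at a diminished level, and the non-essential one is shed.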
One way to allocate available power across all servers is by power-capping every server in the data center. Servers enabled with Intel Node Manager, for example, employ processor P-states and T-states to limit CPU performance and, therefore, limit power consumption to predetermined amounts. But such an across-the-board approach may not be granular enough to deliver the desired results for mission-critical applications that must maintain the necessary application service levels.
A DCIM system with DPO capabilities is the ideal tool to both plan and implement partial power triage. In the planning stage, a capable DCIM system is able to determine the power required for and the heat generated by all individual applications. What-if analyses can then be performed to assess the possible trade-offs, such as keeping more applications available, but at lower service levels. Multiple scenarios can be created to address situations ranging from a brief brownout to an extended blackout.
A particularly powerful capability found in some DCIM systems is the measurement of transactions per kilowatt-hour on a per-application basis. This permits a level of granularity previously unavailable for disaster recovery planning in most data centers, especially where the applications exist in silos, which prevents a more holistic approach. Knowing the transactions per kilowatt-hour of every application makes it possible for IT managers to make the appropriate performance trade-offs among the various critical and non-critical applications. (This information is also useful, by the way, for helping to improve overall efficiency in the data center during normal operation. Indeed, assessing the transactions per kilowatt-hour should now be incorporated into server selection during every hardware refresh cycle.)
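The metric itself is simple arithmetic: transactions completed divided by energy consumed over the same period. A minimal Python sketch follows; the function names and sample figures are illustrative assumptions, not measurements from any real system.

```python
def transactions_per_kwh(transactions, avg_power_watts, hours):
    """Transactions completed per kilowatt-hour consumed over the same period."""
    energy_kwh = avg_power_watts / 1000 * hours
    if energy_kwh <= 0:
        raise ValueError("energy consumed must be positive")
    return transactions / energy_kwh


def rank_by_efficiency(measurements):
    """measurements: {app: (transactions, avg_power_watts, hours)}.

    Returns (app, transactions per kWh) pairs, most efficient first.
    """
    scores = {app: transactions_per_kwh(*m) for app, m in measurements.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical one-day measurements for two applications.
measurements = {
    "orders": (1_200_000, 500, 24),   # 12 kWh consumed
    "reports": (60_000, 250, 24),     # 6 kWh consumed
}
for app, score in rank_by_efficiency(measurements):
    print(app, score)
```

A ranking like this is only one input to a trade-off: a low-efficiency application may still be mission-critical and must be kept running regardless.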
As mentioned above, the DCIM’s DPO capability can then be used to implement the triage during a power outage when the ability to match capacity with critical demand becomes even more important to the business than reducing the electric utility bill. The best practice here is to employ runbooks that fully automate the many steps involved in shedding and/or shifting loads to make the transition as graceful as possible.
For example, one runbook might migrate all critical applications to a core set of virtual machines, then shut down the offloaded servers. Another runbook might simply shut down the servers being used for all (or some) non-essential applications. When the power is fully restored, a different set of runbooks can then be used to return to normal operation.
A major advantage of runbooks is that the full automation eliminates the inevitable errors (and resulting problems) associated with manual procedures. Each runbook can be tested and fine-tuned to deliver precisely the results desired, which enables them to be activated during a power outage with total confidence.
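To make the idea concrete, a runbook can be modeled as an ordered list of steps that stops at the first failure and supports a dry run for testing. The Python sketch below is only an illustration: the `Step` and `execute` names, and the in-memory `state` dictionary standing in for the virtualization and power systems, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True on success


def execute(runbook: List[Step], dry_run: bool = False) -> List[str]:
    """Run steps in order, stopping at the first failure. Returns a log."""
    log = []
    for step in runbook:
        if dry_run:
            log.append("DRY-RUN " + step.description)
        elif step.action():
            log.append("OK " + step.description)
        else:
            log.append("FAILED " + step.description)
            break
    return log


# Hypothetical in-memory stand-in for the virtualization and power systems.
state = {"critical_apps_on": "rack-7", "rack-7_powered": True}


def migrate_critical_apps():
    state["critical_apps_on"] = "core-vms"
    return True


def shut_down_offloaded_servers():
    if state["critical_apps_on"] != "core-vms":
        return False  # refuse to power off servers still hosting critical apps
    state["rack-7_powered"] = False
    return True


power_outage_runbook = [
    Step("migrate critical applications to core VMs", migrate_critical_apps),
    Step("shut down offloaded servers", shut_down_offloaded_servers),
]

print(execute(power_outage_runbook, dry_run=True))
```

The dry run lists the steps without touching anything, which is one way to rehearse a runbook before it is needed. Calling `execute(power_outage_runbook)` would then perform the steps in order.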
The Cooling Outage: Shedding/Shifting Load to Avert Disaster
Cooling is generally unavailable during a power outage when 100% of the UPS and/or generator capacity is being devoted to the IT equipment. But air conditioners and chillers can go down on their own, and like power, they seem to fail when they are needed (and stressed) the most. Therefore, a similar (albeit less drastic) form of triage is also required for this scenario.
The all-important consideration when performing an application triage for a cooling outage is the anticipated time to repair the system and restore full or partial cooling. Will it take at least a day? Better start shedding and/or shifting load now. Will it take less than an hour? If the current temperature is below the target maximum, which is revealed on most DCIM dashboards, it may be possible to keep all applications running, at least for a while, by consolidating them onto fewer servers. Or the servers could be power-capped at some diminished performance level to limit the amount of heat generated. Or some applications could be moved to backup sites automatically, reducing the power needed and allowing other applications to remain unchanged.
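The reasoning above amounts to comparing the expected repair time with the time left before the temperature limit is reached. The Python sketch below shows one way to express that comparison. The constant rate of temperature rise, the factor of two used to choose between consolidation and shedding, and the action names are all assumptions made for illustration, not thresholds from any DCIM product.

```python
def cooling_outage_action(current_c, max_c, rise_c_per_hour, repair_hours):
    """Suggest a response to a cooling outage.

    current_c:        current server inlet temperature (deg C)
    max_c:            target maximum inlet temperature (deg C)
    rise_c_per_hour:  assumed constant rate of temperature rise at current load
    repair_hours:     anticipated time to restore full or partial cooling
    """
    if current_c >= max_c:
        return "shed_or_shift_now"
    if rise_c_per_hour <= 0:
        return "keep_running"

    hours_of_margin = (max_c - current_c) / rise_c_per_hour
    if repair_hours <= hours_of_margin:
        return "keep_running"
    # Assumption: consolidating or power-capping roughly halves heat output,
    # which would about double the available margin.
    if repair_hours <= 2 * hours_of_margin:
        return "consolidate_or_cap"
    return "shed_or_shift_now"


# Inlet at 22 C, limit 27 C, rising 10 C per hour: half an hour of margin.
for repair_hours in (0.25, 0.75, 24):
    print(repair_hours, cooling_outage_action(22, 27, 10, repair_hours))
```

In practice the rate of rise depends on the room, the load, and whatever partial cooling remains, so it would come from measurement or modeling rather than a fixed constant.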
While such a triage is not easy to do, DCIM modeling tools can be used to help with the necessary planning, for example, to optimize the placement of systems within individual racks to balance heat generation and factor in outage scenarios. One example is spreading an application across the space so that heat generation is reduced evenly when that application is moved to a backup site. Such optimization also serves to minimize stranded power (by maximizing and equally balancing circuit and phase utilization), which helps extend the life of any data center by utilizing power and space to the maximum. What-if analyses allow the various permutations and combinations of power, space and cooling considerations to be evaluated easily and accurately. Then, during the actual outage, the appropriate runbooks can be activated to automatically achieve the optimal result, minimizing service level impacts, depending on the required reduction and expected duration of the outage.
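Balancing phase utilization is one part of this planning that lends itself to a small example. The Python sketch below uses a simple greedy heuristic (largest load first, always onto the least-loaded phase). It is a stand-in chosen for brevity, not the placement method of any DCIM modeling tool, and the load names and figures are hypothetical.

```python
def balance_phases(loads_kw, phases=("A", "B", "C")):
    """Assign each load to a phase so that phase totals stay close together.

    loads_kw: {load name: power draw in kW}
    Returns (assignment, totals), where assignment maps load -> phase
    and totals maps phase -> summed kW.
    """
    totals = {phase: 0.0 for phase in phases}
    assignment = {}
    # Place the largest loads first, each onto the currently lightest phase.
    for name, kw in sorted(loads_kw.items(), key=lambda item: item[1], reverse=True):
        lightest = min(totals, key=totals.get)
        assignment[name] = lightest
        totals[lightest] += kw
    return assignment, totals


loads = {"srv1": 5, "srv2": 4, "srv3": 3, "srv4": 3, "srv5": 2, "srv6": 1}
assignment, totals = balance_phases(loads)
print(assignment)
print(totals)
```

The closer the phase totals are to one another, the less capacity is stranded on a lightly loaded phase while another phase is near its limit.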
Because most data centers today operate far below the 80°F (27°C) cold aisle temperature that the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) recommends, there is normally some margin available during full or partial cooling outages as long as there are no hot spots. The better DCIM solutions minimize risk during normal operation by taking constant and accurate measurements of the server inlet temperature, and adjusting the cooling accordingly. Then, during an actual cooling outage, the companion DPO system is used to adjust capacity as needed to prevent any hot spots from forming.
About the Author
Clemens Pfeiffer is the CTO of Power Assure and is a 22-year veteran of the software industry, where he has held leadership roles in process modeling and automation, software architecture and database design, and data center management and optimization technologies.