However, network recovery really begins with prevention. While there is an abundance of software-level technologies to help network managers maximize server uptime, there is only one true standards-based, hardware-level approach to monitoring server hardware health and maximizing availability: the Intelligent Platform Management Interface (IPMI).
Set-up, monitor, manage, maintain, and recover are the stages that make up the typical IT lifecycle. Each stage plays a critical role in reducing downtime.
Set up the network correctly, and you can hope to avoid serious problems later. Monitor the network appropriately, and you can catch issues as they arise or, better still, anticipate problems based on specific patterns of events. Manage the network correctly, and you can avoid outages caused by “bad” administration practices. Maintain the network effectively, and you reduce the occurrence of similar problems in the future.
All these precautions add up to reducing downtime, thus minimizing the need for network recovery in the first place.
Along the way, the IPMI standard has emerged to aid IT in preventing or minimizing downtime at each of these stages. With significant advantages over other technologies, IPMI has already established a major presence in the IT community.
An Overview of IPMI
IPMI was first made available as a standardized specification in 1998 by Dell, HP, Intel and NEC. They continue to promote IPMI today via the IPMI Forum. The IPMI specification has been adopted by more than 170 vendors worldwide. Vendor products range from servers and blades to telecommunications equipment, network equipment and storage devices. Card-level component vendors, silicon vendors and software vendors are also among the IPMI adopters. IPMI has matured into its own market, offering autonomous hardware-level remote monitoring, management, and recovery.
At its lowest level, IPMI is implemented on chips, or controllers, on a motherboard (sometimes referred to as a baseboard). This Baseboard Management Controller (BMC), together with IPMI firmware, forms the basis of an embedded management subsystem. The subsystem operates independently of the CPU, BIOS and operating system (OS). This autonomy means that it works regardless of the state or type of OS, whether Linux, Windows or another. IPMI also works even if the main CPU is not functioning, so long as AC power is supplied to the system. These characteristics remove the limitations encountered with OS-dependent management agents.
IPMI is also a messaging protocol that defines how communication takes place to monitor system hardware, control system components, retrieve hardware event logs and more. IPMI also describes how multiple embedded management controllers collaborate. Beyond the remote monitoring and management capabilities that aid in preventing downtime, the retrieved event logs and component information can feed additional troubleshooting tools and processes. For example, when a component fails, field replaceable unit (FRU) information identifies the exact failing part, enhancing the “diagnose-before-dispatch” process. This improves “serviceability” and reduces field maintenance time and costs, lowering mean time to repair (MTTR).
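To make the idea of a messaging protocol concrete, here is a minimal sketch of how a single request might be framed for the management bus (IPMB). The article does not describe the byte layout. The field order, the two's-complement checksums and the example values (BMC address 0x20, software requester address 0x81, the "Get Device ID" command) follow our reading of the IPMI specification and should be checked against it. The function and parameter names are our own.

```python
def checksum(data: bytes) -> int:
    """Two's-complement checksum: the covered bytes plus the checksum sum to 0 mod 256."""
    return (-sum(data)) & 0xFF


def build_ipmb_request(rs_addr: int, net_fn: int, cmd: int, payload: bytes = b"",
                       rq_addr: int = 0x81, rq_seq: int = 0,
                       rs_lun: int = 0, rq_lun: int = 0) -> bytes:
    """Frame one request: responder header, checksum, requester body, checksum."""
    # Header: responder address, then network function (upper 6 bits) and LUN (lower 2 bits).
    header = bytes([rs_addr, (net_fn << 2) | rs_lun])
    # Body: requester address, sequence number and LUN, command byte, optional data.
    body = bytes([rq_addr, (rq_seq << 2) | rq_lun, cmd]) + payload
    return header + bytes([checksum(header)]) + body + bytes([checksum(body)])


# "Get Device ID": network function 0x06 (application), command 0x01, sent to the BMC.
request = build_ipmb_request(rs_addr=0x20, net_fn=0x06, cmd=0x01)
print(" ".join(f"{b:02x}" for b in request))  # 20 18 c8 81 00 01 7e
```

Each checksum lets the receiver confirm that the bytes it covers arrived intact: adding them together with the checksum gives zero modulo 256.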
When IPMI is integrated with system management software running on top of an OS, users can take advantage of the software’s management features while still leveraging IPMI’s last-mile management capabilities embedded in the hardware. This offers the best of both worlds at no additional cost.
While reliability, availability and serviceability (RAS) were issues addressed early on in the IPMI specification, the latest IPMI version, 2.0, also addresses operational costs. For instance, during a system restart, IPMI’s serial-over-LAN feature enables a remote system manager to watch various components (like a RAID controller) go through their power-on self test (POST). In many cases, remote managers can interact with these processes to make configuration adjustments, run additional diagnostics, etc.
In February 2004, IPMI v2.0 was unveiled. While retaining all the same features and benefits that IPMI v1.5 delivered, IPMI v2.0 added timely features:
- Enhanced security: adds support for new authentication (SHA-1 and HMAC-based) and encryption (AES and RC4) mechanisms and options
- Additional standardized interfaces: a standardized way of remotely viewing the boot or Emergency Management consoles to diagnose and repair server-related issues
- Enhanced support for modular systems such as blades: reports the status of blades during hot-swap and supports built-in redundancy, which is useful for Advanced Telecom Computing Architecture (AdvancedTCA) products. Blade partitioning restricts management to known interfaces. Together, these features increase the reliability and security of blade designs
IPMI Aids in Network Set-Up
During initial staging and deployment, IPMI aids in the provisioning of “bare metal servers” by providing further insight into hardware components. This pre-OS stage cannot be covered by management agents that run on top of the OS. Here, IPMI can aid the provisioning process by ensuring server OS and application images are correctly designed and assembled for the exact hardware configuration of the target server. At this time, IT can also set a server’s baseline thresholds for critical components. IPMI allows IT to set thresholds for items such as fan speed, temperature and voltage, and to pre-configure the actions and alerts that are required.
These thresholds all form the foundation for monitoring component activity, performance and lack thereof. With baselines set, IT can use IPMI to set alerts for components that exceed thresholds, with the alerts sent to a remote console, pager or via email.
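The relationship between baselines and alerts described above can be sketched in a few lines. This is an illustration only, not how a BMC is actually programmed. The sensor names, limits and readings are hypothetical, and a real system would deliver the alert to a console, pager or mailbox instead of collecting strings in a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Baseline limits for one sensor; None means that side is not monitored."""
    sensor: str
    lower_critical: Optional[float] = None
    upper_critical: Optional[float] = None


def check(threshold: Threshold, reading: float) -> Optional[str]:
    """Return an alert message if the reading is outside the baseline, else None."""
    if threshold.upper_critical is not None and reading > threshold.upper_critical:
        return f"{threshold.sensor}: {reading} above upper limit {threshold.upper_critical}"
    if threshold.lower_critical is not None and reading < threshold.lower_critical:
        return f"{threshold.sensor}: {reading} below lower limit {threshold.lower_critical}"
    return None


# Hypothetical baselines set at deployment time.
baselines = {
    "CPU Temp": Threshold("CPU Temp", upper_critical=85.0),               # degrees C
    "Fan1": Threshold("Fan1", lower_critical=1500.0),                     # RPM
    "12V": Threshold("12V", lower_critical=11.4, upper_critical=12.6),    # volts
}

# Hypothetical current readings.
readings = {"CPU Temp": 91.0, "Fan1": 4200.0, "12V": 12.1}

alerts = []
for name, value in readings.items():
    message = check(baselines[name], value)
    if message is not None:
        alerts.append(message)

print(alerts)
```

With these sample values, only the CPU temperature reading falls outside its baseline, so a single alert is produced.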
Using IPMI to Monitor and Avoid Downtime
Once a server is deployed, IT can use IPMI to proactively monitor the health of components and ensure pre-set thresholds are not being exceeded. This helps IT maintain uptime by avoiding outages altogether. The autonomous implementation of IPMI ensures that even catastrophic system or OS failures do not affect the ability of IPMI to communicate and control recovery features. Messages can still be sent to dispatch technicians while IPMI monitors and controls other system components to minimize overall system impact. IPMI’s predictive failure capabilities add further value to IT lifecycle management. By examining the system event log, IT can more easily predict which components are likely to fail.
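One simple way to picture prediction from the system event log is to look for components that keep logging the same recoverable event. The sketch below assumes log entries have already been retrieved and reduced to (timestamp, sensor, event) tuples. The entries, the event wording and the cutoff of three occurrences are all hypothetical, and this is not a description of any built-in IPMI algorithm.

```python
from collections import Counter

# Hypothetical, simplified event log entries: (timestamp, sensor, event).
event_log = [
    ("2004-03-01 02:14", "DIMM 2", "Correctable ECC"),
    ("2004-03-01 09:40", "Fan1", "Lower Critical going low"),
    ("2004-03-02 11:05", "DIMM 2", "Correctable ECC"),
    ("2004-03-03 17:52", "DIMM 2", "Correctable ECC"),
    ("2004-03-04 08:30", "DIMM 4", "Correctable ECC"),
]


def suspects(entries, event_type, min_count=3):
    """Return sensors that logged event_type at least min_count times."""
    counts = Counter(sensor for _, sensor, event in entries if event == event_type)
    return sorted(sensor for sensor, n in counts.items() if n >= min_count)


print(suspects(event_log, "Correctable ECC"))  # ['DIMM 2']
```

A component flagged this way can be scheduled for replacement during planned maintenance, before the recoverable errors turn into an outage.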
This real-time view of the health of a server complements other monitoring applications that are dependent on the OS. An administrator can view a server’s operational status via a centrally located console. Because the IPMI monitoring application communicates directly with the BMC, event logs and sensor information are always available to view.
IPMI’s monitoring features also provide security measures. IPMI can be configured to detect chassis intrusions. In addition, multi-layer privileges and passwords let IT control which personnel can access specific IPMI features, according to their management responsibilities.
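The layered privilege model can be pictured as an ordered set of levels, where each action requires a minimum level. The level names below (Callback, User, Operator, Administrator) are those we understand the IPMI specification to define. The action names and their assignment to levels are hypothetical and only illustrate the principle.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Privilege levels in increasing order of authority."""
    CALLBACK = 1
    USER = 2
    OPERATOR = 3
    ADMINISTRATOR = 4


# Hypothetical actions and the minimum level assumed for each.
REQUIRED = {
    "read_sensors": Privilege.USER,
    "power_cycle": Privilege.OPERATOR,
    "configure_users": Privilege.ADMINISTRATOR,
}


def is_allowed(level: Privilege, action: str) -> bool:
    """A session may perform an action if its level meets the required minimum."""
    return level >= REQUIRED[action]


print(is_allowed(Privilege.USER, "read_sensors"))      # True
print(is_allowed(Privilege.USER, "power_cycle"))       # False
print(is_allowed(Privilege.ADMINISTRATOR, "power_cycle"))  # True
```

In this arrangement a help-desk account could read sensor data without being able to power-cycle a server or change user settings.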
Managing, Maintaining and Recovering a Server with IPMI
As mentioned previously, traditional and/or proprietary management applications run on top of the OS and are therefore dependent on the OS being up and running. But what happens when the OS is not responding? In such scenarios, IT is left scrambling to get someone in front of the failed server. This problem is easily compounded when the failed server is at a remote site. Not only does IT have to get someone in front of the server to diagnose the problem, it first has to find that person at the remote site. All these obstacles increase downtime.
But again, prevention goes a long way toward avoiding scenarios such as this. Let’s assume an “environmental event” occurs, such as the temperature rising in a server. The IPMI-enabled system would immediately send an alert (threshold exceeded) to the pre-identified remote console or to a designated email address or pager. Again, this happens regardless of whether the OS is running. Because IPMI sends the alert prior to the crash, the server administrator can direct staff to replace a part, for example, or schedule a shutdown.
Beyond these management and recovery features, additional features such as console redirection let administrators view the progress of the server’s boot phase and, if necessary, run boot diagnostics. Eliminating the need to physically visit the server minimizes downtime and reduces the time IT has to spend recovering the network.
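As an illustration of what remote recovery can look like in practice, the sketch below builds command lines for ipmitool, an open-source command-line utility that the article itself does not mention. It assumes ipmitool is installed on the administrator’s workstation and that the BMC accepts IPMI v2.0 LAN sessions. The host name and user name are hypothetical. The commands are only constructed here, not executed.

```python
def ipmitool_argv(host: str, user: str, *subcommand: str) -> list:
    """Build an ipmitool command line for a remote BMC.

    -I lanplus selects the IPMI v2.0 LAN interface, and -E tells ipmitool
    to read the password from the IPMI_PASSWORD environment variable.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *subcommand]


HOST = "bmc-rack12.example.com"  # hypothetical BMC address
USER = "operator"                # hypothetical account

# Typical steps when a server stops responding, in order.
steps = [
    ipmitool_argv(HOST, USER, "chassis", "power", "status"),  # is the server powered on?
    ipmitool_argv(HOST, USER, "sel", "list"),                 # what did the event log record?
    ipmitool_argv(HOST, USER, "chassis", "power", "cycle"),   # restart the server
    ipmitool_argv(HOST, USER, "sol", "activate"),             # watch the boot over serial-over-LAN
]

for argv in steps:
    print(" ".join(argv))
```

To actually run a step, the argument list could be passed to Python’s subprocess.run(argv, check=True) with IPMI_PASSWORD set in the environment.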
IT administrators who need to support multiple IPMI servers, or who want to share management responsibility across many servers, should consider using a dedicated IPMI appliance. Forcing administrator access to IPMI servers through a dedicated device in the rack offers additional security and control, and aggregates alerts and events across multiple servers.
IPMI Is a Vital Tool for the Entire IT Lifecycle
With IPMI supporting the entire IT lifecycle (set-up, monitor, manage, maintain and recover), IT can reduce ongoing operational costs by minimizing downtime and recovering quickly from outages. In 2003, some 30 percent of servers that shipped had IPMI pre-integrated. Nearly 70 percent of servers shipping by the end of 2004 will have IPMI pre-integrated. IPMI is fast fulfilling its promise of offering an effective, economical way for IT to fill the manageability gaps left by traditional and/or proprietary software and to completely cover its IT lifecycle management requirements.