DISASTER RECOVERY JOURNAL
Ounce of Prevention Equals Pound of Cure
By STEVE ROKOV
It’s no secret that network managers today must ensure the highest
availability of their networks to meet overarching business challenges.
This means they must have the tools necessary not only to manage the
network but also to recover it when it goes down.
However, network recovery really begins with prevention. While there
is an abundance of technologies available at the software level to
help network managers maximize uptime of their servers, there exists
only one true standards-based hardware-level approach to monitoring
server hardware health and maximizing availability – the Intelligent
Platform Management Interface (IPMI).
Set-up, monitor, manage, maintain, and recover are the stages that
make up the typical IT lifecycle. Each stage plays a critical role in
reducing downtime.
Set up the network correctly, and you hope to avoid any serious problems
to come. Monitor the network appropriately, and you can catch
issues as they arise or, better still, anticipate problems based on
specific patterns of events. Manage the network correctly, and you can
avoid outages from “bad” administration practices. Maintain
the network effectively, and you reduce the occurrence of similar problems
in the future.
All these precautions add up to reducing downtime, thus minimizing the
need for network recovery in the first place.
Along the way, the IPMI standard has emerged to aid IT in preventing
or minimizing downtime at each of these stages. With significant advantages
over other technologies, IPMI has already established a major presence
in the IT community.

An Overview of IPMI
IPMI was first made available as a standardized specification in 1998
by Dell, HP, Intel and NEC. They continue to promote IPMI today via
the IPMI Forum. The IPMI specification has been adopted by more than
170 vendors worldwide. Vendor products range from servers and blades
to telecommunications equipment to network equipment and storage devices.
Card-level component, silicon chip and software vendors are also
among the IPMI adopters. IPMI has matured into its own market,
offering autonomous hardware-level remote monitoring, management, and
recovery.
At its lowest level, IPMI is implemented on chips, or controllers,
on a motherboard (sometimes referred to as a baseboard). This Baseboard
Management Controller (BMC) is used with IPMI firmware to create the
basis for an embedded management subsystem. This subsystem operates
independently of the CPU, BIOS and the operating system (OS). This
autonomous approach means that it works regardless of the state or type
of OS, whether Linux, Windows or another. IPMI also works even if the
main CPU is not functioning, so long as AC power is supplied to
the system. These characteristics remove limitations encountered with
OS-dependent management agents.
IPMI is also a messaging protocol that defines how communication
takes place to monitor system hardware, control system components,
retrieve hardware event logs and more. IPMI also describes how multiple
embedded management controllers collaborate. Beyond the remote
monitoring and management capabilities that aid in preventing
downtime, the event logs and component information that IPMI retrieves
support additional troubleshooting tools and processes. For example,
when a component fails, field replaceable unit (FRU) information
identifies the correct failing part, enhancing the “diagnose-before-dispatch”
process. This improves “serviceability” and reduces field
maintenance time and costs, reducing mean time to repair (MTTR).
When IPMI is integrated with system management software running on
top of an OS, users can take advantage of the software’s
management features while still leveraging IPMI’s last-mile management
capabilities embedded in the hardware. This offers the best of both
worlds at no additional cost.
While reliability, availability and serviceability (RAS) were issues
addressed early on in the IPMI specification, the latest IPMI version,
2.0, also addresses operational costs. For instance, during a system
restart, IPMI’s serial-over-LAN feature enables a remote system
manager to watch various components (like a RAID controller) go through
their power-on self test (POST). In many cases, remote managers can
interact with these processes to make configuration adjustments, run
additional diagnostics, etc.
In February 2004, IPMI v2.0 was unveiled. While retaining all the
features and benefits that IPMI v1.5 delivered, IPMI v2.0 added timely
features:
- Enhanced security: support for new authentication (SHA-1
and HMAC-based) and encryption (AES and RC4) mechanisms and options
- Additional standardized interfaces: a standardized way of remotely
viewing the BOOT or Emergency Management consoles to diagnose and
repair server-related issues
- Enhanced support for modular systems like blades: status reporting
for blades during hot-swap and built-in redundancy, which is useful for
Advanced Telecom Computing Architecture (AdvancedTCA) products, plus
blade partitioning that restricts management to known interfaces. All
of these increase the reliability and security of blade designs
IPMI Aids in Network Set-Up
During initial staging and deployment, IPMI aids in the provisioning
of “bare metal servers” by providing further insight to
hardware components. This pre-OS stage cannot be covered by management
agents that run on top of the OS. Here, IPMI can aid the provisioning
process by ensuring server OS and application images are correctly designed
and assembled for the exact hardware configuration of the target server.
At this time, IT can also set a server’s baseline thresholds for
critical components. IPMI allows IT to set thresholds for items such
as fan speed, temperature and voltage, and to pre-configure
the actions and alerts that are required.
These thresholds all form the foundation for monitoring component activity,
performance and lack thereof. With baselines set, IT can use IPMI to
set alerts for components that exceed thresholds, with the alerts sent
to a remote console, pager or via email.
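The threshold-and-alert idea above can be sketched in a few lines of
Python. This is an illustrative model only, not IPMI firmware or any
vendor's API: the Thresholds class, the classify function and the sample
temperature values are all assumptions made for the example. The
severity names are a simplified subset of the non-critical and critical
levels that IPMI defines for threshold-based sensors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-sensor limits; None means the limit is not set."""
    lower_critical: Optional[float] = None
    lower_non_critical: Optional[float] = None
    upper_non_critical: Optional[float] = None
    upper_critical: Optional[float] = None


def classify(reading: float, t: Thresholds) -> str:
    """Return 'ok', 'non-critical' or 'critical' for one sensor reading."""
    # Check the most severe limits first so they take precedence.
    if t.lower_critical is not None and reading <= t.lower_critical:
        return "critical"
    if t.upper_critical is not None and reading >= t.upper_critical:
        return "critical"
    if t.lower_non_critical is not None and reading <= t.lower_non_critical:
        return "non-critical"
    if t.upper_non_critical is not None and reading >= t.upper_non_critical:
        return "non-critical"
    return "ok"


# Example baseline for a CPU temperature sensor (values are made up).
cpu_temp = Thresholds(upper_non_critical=70.0, upper_critical=85.0)

for value in (55.0, 72.5, 90.0):
    print(value, classify(value, cpu_temp))
```

In a real deployment the classification would be done by the BMC
itself; the sketch only shows how a baseline turns raw readings into
alert levels.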
Using IPMI to Monitor and Avoid Downtime
Once a server is deployed, IT can use IPMI to proactively monitor the
health of components and ensure pre-set thresholds are not being exceeded.
This aids IT in maintaining uptime by avoiding outages altogether. The
autonomous implementation of IPMI ensures that even catastrophic system/OS
failures do not affect the ability of IPMI to communicate and/or control
recovery features. Messages can still be sent to dispatch technicians
while IPMI monitors and controls other system components to
minimize overall system impact. IPMI’s predictive failure capabilities
aid IT lifecycle management as well. By examining the system event log,
IT can more easily identify components that are likely to fail.
This real-time view of the health of a server complements other monitoring
applications that are dependent on the OS. An administrator can view
a server’s operational status via a centrally located console.
The IPMI monitoring application communicates directly with the BMC, so
it is always available to view event logs and sensor information.
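One simple way to picture this kind of prediction is to count how often
each sensor appears in the event log over a recent window and flag the
sensors that keep recurring. The Python sketch below is a hypothetical
illustration: the Event record, the likely_to_fail function, the
one-hour window and the sample log entries are assumptions for the
example, not the IPMI system event log format or any vendor's algorithm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """Simplified, hypothetical event record (not the real log layout)."""
    timestamp: int  # seconds since an arbitrary starting point
    sensor: str     # e.g. "Fan 2"
    severity: str   # "non-critical" or "critical"


def likely_to_fail(events: List[Event], now: int,
                   window: int = 3600, min_count: int = 3) -> List[str]:
    """Return sensors with at least min_count events in the last window seconds."""
    recent = Counter(
        e.sensor for e in events if now - window <= e.timestamp <= now
    )
    return sorted(name for name, count in recent.items() if count >= min_count)


log = [
    Event(100, "Fan 2", "non-critical"),
    Event(900, "Fan 2", "non-critical"),
    Event(1500, "CPU Temp", "non-critical"),
    Event(2400, "Fan 2", "critical"),
]

print(likely_to_fail(log, now=3000))  # ['Fan 2']
```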
IPMI’s monitoring features also provide security measures. IPMI can
be configured to detect chassis intrusions.
And the use of multi-layer privileges and passwords lets IT control
which personnel can access specific IPMI features for their
management responsibilities.
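The layered privileges can be pictured as an ordered set of roles, where
each action requires a minimum level. IPMI defines Callback, User,
Operator and Administrator privilege levels; the mapping of actions to
levels in the Python sketch below is a hypothetical policy invented for
illustration, not taken from the specification or from any product.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """IPMI-style privilege levels, lowest to highest."""
    CALLBACK = 1
    USER = 2
    OPERATOR = 3
    ADMINISTRATOR = 4


# Hypothetical policy: the minimum level each action requires.
REQUIRED = {
    "read_sensors": Privilege.USER,
    "power_cycle": Privilege.OPERATOR,
    "change_passwords": Privilege.ADMINISTRATOR,
}


def is_allowed(level: Privilege, action: str) -> bool:
    """True if level meets the minimum for action; unknown actions are denied."""
    required = REQUIRED.get(action)
    return required is not None and level >= required


print(is_allowed(Privilege.USER, "read_sensors"))  # True
print(is_allowed(Privilege.USER, "power_cycle"))   # False
```

Denying unknown actions by default is a design choice for the sketch: a
typo in an action name then fails closed instead of granting access.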
Managing, Maintaining and Recovering a Server with IPMI
As mentioned previously, traditional and/or proprietary management
applications run on top of the OS and are therefore dependent on the
OS being up and running. But what happens when the OS is not responding?
During such scenarios, IT is left scrambling to get someone in front
of the failed server. This problem can easily be compounded when the
failed server is at a remote site. Not only does IT have to get someone
in front of the server to diagnose the problem, they have to find the
person at the remote site first. All these obstacles increase downtime.
But again, prevention goes a long way toward avoiding scenarios such
as this. Let’s assume an “environmental event” occurs,
such as the temperature rising in a server. The IPMI-enabled system
would immediately send an alert (threshold exceeded) to the pre-identified
remote console or to a designated email address or pager. Again, this
happens regardless of whether the OS is running. Because IPMI sends the
alert prior to the crash, the server administrator can direct staff
to replace a part or schedule a shutdown.
Beyond these management and recovery features, additional
features like console redirection let administrators view the progress
of the server’s BOOT phase and, if necessary, run BOOT diagnostics.
Eliminating the need to physically visit the server minimizes downtime
and reduces the time IT has to spend recovering the network.
IT administrators who are looking to support multiple IPMI
servers, or who might want to share the management responsibility across
many servers, should consider using a dedicated IPMI appliance. By routing
administrator access to IPMI servers through a dedicated device in the rack,
you can offer additional security and control as well as aggregate alerts/events
across multiple servers.
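As a rough illustration of the aggregation idea, the Python sketch below
collects alerts from several servers into one summary, as a rack-level
device might. Everything here is hypothetical: the Alert record, the
summarize function and the host names are invented for the example and
do not describe any particular appliance.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alert:
    host: str
    sensor: str
    severity: str  # "non-critical" or "critical"


def summarize(alerts: Iterable[Alert]) -> Dict[str, List[str]]:
    """Group alerts by severity, listing 'host: sensor' entries in arrival order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for a in alerts:
        grouped[a.severity].append(f"{a.host}: {a.sensor}")
    return dict(grouped)


incoming = [
    Alert("web-01", "CPU Temp", "non-critical"),
    Alert("db-02", "Fan 1", "critical"),
    Alert("web-03", "12V Rail", "non-critical"),
]

for severity, entries in summarize(incoming).items():
    print(severity, entries)
```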
IPMI Is a Vital Tool for the Entire IT Lifecycle
With IPMI supporting the entire IT lifecycle –
set-up, monitor, manage, maintain and recover – IT can
reduce ongoing operational costs by minimizing downtime
and recovering quickly from outages. In 2003, some 30 percent of servers
that shipped had IPMI pre-integrated. Nearly 70 percent of servers shipping
by the end of 2004 will have IPMI pre-integrated. IPMI is fast fulfilling
its promise of offering a cost-effective method for IT
to fill the manageability gaps left by existing traditional and/or proprietary
software and completely cover its IT lifecycle management requirements.
Steve Rokov is director of technical marketing for Avocent’s
manageability solutions group. He can be reached at (408) 436-6333 or
via e-mail at steve.rokov@avocent.com.
©Copyright 2005 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of System Support Inc. is prohibited.