Double Jeopardy In A Disaster - Computing Data Center Challenges In A Pandemic
Written by Kevin Epstein
Friday, 19 October 2007 13:16
Planners are also well past second-generation problems, like terrorist attacks, sudden system outages, and man-made disasters. Again, back-up copies of critical data and systems are made more frequently, and there are processes rivaling those of the White House as to which administrators will take over in the event that primary administrators are incapacitated. (Most even designate an “Al Haig.”)
But many planners are just now beginning to come to grips with the third generation of disaster: a sudden loss of staff combined with a change in how the business operates. At the risk of focusing too closely on one potential cause, let’s call this the pandemic world.
In the pandemic world, the first part of the disaster happens obviously, if slowly. Staffing starts to drop. Whether people are actually out sick, commute-challenged due to increased health regulations, or simply afraid, there will be significantly fewer IT staff available on site to make changes to the computing data center. This adds risk to an already bad situation, and may make the plans, infrastructure, and processes in place to deal with a first-generation or second-generation disaster untenable.
Yet the staff issue is only the first part of the problem. The second part of the problem occurs when the business needs to alter its infrastructure to address the shifting worker demographic. With so many staff out or working remotely, the load on e-mail, remote access systems, and security/validation systems increases dramatically.
Ideally, the company would rebuild or reconfigure its computing data center to serve this mostly off-site workforce rather than the previously in-house 9-5 workforce. But that goal runs straight into the challenges of the staff issue, making the situation exponentially harder.
Planners see the paradox. The business will need more onsite staff to reconfigure systems to make up for the lack of onsite staff … and we have a third-generation “slow” disaster, where a company unravels in days or weeks instead of hours, but just as inexorably and fatally.
The solution, of course, is to plan; to start early, building an “adaptive” or rapidly reconfigurable data center, so that machines can be rapidly repurposed to meet business needs, in a remote and semi-automated way, as desired. This has the added benefit of increasing efficiency during normal operations and of improving the ability to respond to first- and second-generation disasters.
There are several ways to get to a rapidly reconfigurable data center. The traditional approaches include some combination of virtualization, automated provisioning, and remote management. Although solutions for these approaches have significantly matured over the past few years and are available today from multiple vendors, there are still many issues to work out.
First, each type of data center resource, such as servers, network, storage and software (applications), requires a different virtualization and provisioning solution, usually supplied by different vendors. To say the least, this adds complexity and requires a greater degree of coordination to design and automate data center reconfiguration. Also, adapting and maintaining such a setup becomes harder as inevitable changes are made to the physical infrastructure over time. The same applies to managing the changes to applications.
Second, data centers are made up of heterogeneous components: different makes and models of servers, storage equipment, networking gear, operating systems, and so on. Not all components are suitable for all purposes, and even if they were from a hardware point of view, they may not be from a connectivity point of view, namely, LAN and SAN. In other words, the ability to run any application on any server is necessary in a rapidly reconfigurable data center, but not sufficient. The operators also need the ability to logically “re-cable” the server to establish the right connectivity on the LAN and SAN so that the applications running on that server can communicate with other systems and access their data.
While server virtualization software helps neutralize the differences in server hardware, it typically dictates a “shared everything” model for the network and storage to solve the connectivity problem. This model compromises security, violates traffic isolation requirements, and, in many cases, is physically impossible to achieve. Automated provisioning does not offer much to mitigate this either.
What’s needed is true server repurposing. That is, the ability to move all aspects of a server’s operational “personality” from one physical context to another. This includes the software, the network configuration, the SAN configuration, as well as the associated port configurations on the switches to which the server is physically connected. The new server may be in a different physical location, connected to a different set of LAN and SAN switches. It may even be a different make and model. It may not even be a physical server. Regardless, true server repurposing must transcend all these challenges by providing an abstraction that normalizes all the variables. It should be the logical equivalent of ripping the disks, the NICs, the HBAs, and the switch ports out from the original server/switches and reinstalling all of it at a new location. In effect, server repurposing makes use of the server, storage, and network virtualization already in place and ties them together into a simple operational framework that is focused on IT response to events and not infrastructure or process design.
With true server repurposing, you can rack once, cable once, and repurpose your servers repeatedly and effortlessly.
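To make the idea of a portable server “personality” concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's product or API: the class names, field names, step strings, and the repurpose_plan function are all hypothetical. It treats the personality as a bundle of boot image, NIC settings, and HBA settings, and produces the ordered list of logical “re-cabling” steps needed to bring it up on a different host.

```python
from dataclasses import dataclass, field
from typing import List

# All names below are hypothetical, chosen only to illustrate the idea.

@dataclass(frozen=True)
class NicConfig:
    mac: str    # MAC address that travels with the personality
    vlan: int   # LAN segment the application expects to sit on

@dataclass(frozen=True)
class HbaConfig:
    wwpn: str   # World Wide Port Name that travels with the personality
    zone: str   # SAN zone holding the application's storage

@dataclass(frozen=True)
class Personality:
    name: str
    boot_image: str
    nics: List[NicConfig] = field(default_factory=list)
    hbas: List[HbaConfig] = field(default_factory=list)

@dataclass(frozen=True)
class Host:
    name: str
    lan_ports: List[str]  # switch ports this host's NICs are cabled to
    san_ports: List[str]  # switch ports this host's HBAs are cabled to

def repurpose_plan(p: Personality, target: Host) -> List[str]:
    """Return the ordered steps that move personality p onto target."""
    if len(target.lan_ports) < len(p.nics):
        raise ValueError(f"{target.name} has too few LAN ports for {p.name}")
    if len(target.san_ports) < len(p.hbas):
        raise ValueError(f"{target.name} has too few SAN ports for {p.name}")
    steps = []
    # Reconfigure the switch ports the target is already cabled to,
    # rather than touching any physical cable.
    for nic, port in zip(p.nics, target.lan_ports):
        steps.append(f"lan {port}: assign vlan {nic.vlan}, present mac {nic.mac}")
    for hba, port in zip(p.hbas, target.san_ports):
        steps.append(f"san {port}: join zone {hba.zone}, present wwpn {hba.wwpn}")
    # Boot last, once connectivity matches what the software expects.
    steps.append(f"host {target.name}: boot image {p.boot_image}")
    return steps

mail = Personality(
    name="mail-gateway",
    boot_image="images/mail-gateway",
    nics=[NicConfig(mac="02:00:00:00:00:01", vlan=120)],
    hbas=[HbaConfig(wwpn="50:00:00:00:00:00:00:01", zone="mail-data")],
)
spare = Host(name="rack7-blade3", lan_ports=["lan-sw2/14"], san_ports=["san-sw1/6"])

for step in repurpose_plan(mail, spare):
    print(step)
```

The sketch only builds a plan; it does not talk to any switch or server. The point of the design is that the target host contributes nothing but its name and the ports it is already cabled to, while everything the application depends on lives in the personality and moves with it.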
Here’s our thesis: planners today have the tools to meet the pandemic challenge. It may require some unconventional, out-of-the-box thinking, but that’s par for the course for IT planners anyway.
"Appeared in DRJ's Spring 2007 Issue"