Lessons Learned from Hurricane Sandy
- Published on February 22, 2013
- Written by Mike McClain, Senior Web Designer & Site Manager
Hurricane Sandy presented incredible IT recovery challenges due to unprecedented flooding, prolonged power outages, extensive property damage and logistical problems caused by wide-scale road closures. As a result, many mid-Atlantic and northeast companies found that the extreme conditions created situations where recovery was greatly slowed or not possible at all.
Vaults designed to protect backup tapes from theft were flooded and the tapes destroyed, backup batteries and generators sized for hours-long outages drained and ran out of gas, communications lines into data centers were severed (for weeks in some cases), and widespread road closures and gas shortages prevented IT teams from getting to work. These conditions exceeded the worst-case scenarios most companies used when planning their disaster recovery and business continuity efforts.
To put the impact of the storm into perspective, consider that Hurricane Sandy caused an estimated $65.6 billion in losses due to damage and business interruption. SunGard received 342 alerts and 117 disaster declarations due to Hurricane Sandy alone.
Three Layers are the Key
With the occurrence of severe storms and other natural disasters on the rise, what can be done to ensure a business can keep operating? These high-impact disasters change all the rules and introduce many new challenges.
SunGard views IT disaster recovery challenges as having three layers, all of which must be addressed for a successful recovery. In its work with clients during Sandy, SunGard found that taking some key steps went a long way in protecting systems and rapidly restoring them. Let’s look at the layers to successful recovery and the lessons learned from Sandy.
Layer 1: Data Protection
Disasters like Sandy bring extensive property damage and flooding. Organizations that do not have data off-site in secure facilities are in trouble. Normally, the choice of how to get data off-site—via tape, disk backup, storage replication or server replication—depends on the mission-critical nature of particular applications and the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each. In the aftermath of Sandy, many companies are revisiting their data protection strategies to make sure they have a solid plan for how to get their data off-site.
Data Protection Lessons Learned
Working with customers during and immediately after Sandy, we gained several insights that could prove beneficial in future storms, especially to companies that rely on tapes and trucks as their data protection strategy.
Companies that transported tapes to the SunGard facility for recovery purposes experienced many problems due to the flooding and road closures. To overcome these problems in the future, companies may want to consider moving to disk-based backup or even real-time data mirroring and replication.
Unsurprisingly, tape backup has the longest recovery times. For organizations whose business requirements dictate something faster than a 24- to 48-hour recovery time, alternative methods are needed. In particular, organizations should consider moving to managed backup or vaulting services that leverage replication technologies. This will help companies meet more stringent RTOs after future disasters.
If tape continues to be used for recovery purposes, employing best practices such as using parallel processing can significantly shorten recovery times. With such an approach, some recovery steps are done simultaneously. For example, with one tape-based customer who uses SunGard’s Standby Operating System environment as a service, we were able to configure hardware and restore operating systems while waiting for their backup tapes to arrive. This can significantly speed tape-based recovery.
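The benefit of overlapping recovery steps can be sketched with simple arithmetic. The following is a minimal illustration, not a SunGard tool: the function names and all durations are hypothetical, and it assumes that hardware configuration and operating system restoration can proceed while tapes are in transit, with data restoration starting only once both are finished.

```python
def sequential_hours(tape_transit, hw_config, os_restore, data_restore):
    """Total time when every step waits for the previous one to finish."""
    return tape_transit + hw_config + os_restore + data_restore


def parallel_hours(tape_transit, hw_config, os_restore, data_restore):
    """Total time when hardware and OS work overlaps with tape transit.

    Data restore can begin only when the tapes have arrived AND the
    systems are ready, so the longer of the two paths sets the start time.
    """
    return max(tape_transit, hw_config + os_restore) + data_restore


if __name__ == "__main__":
    # Hypothetical durations in hours, chosen only for illustration.
    steps = dict(tape_transit=12, hw_config=4, os_restore=6, data_restore=10)
    print("sequential:", sequential_hours(**steps))
    print("parallel:  ", parallel_hours(**steps))
```

With these illustrative numbers, the overlapped approach saves the full time spent preparing hardware and operating systems, because that work is hidden behind the tape transit time.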
Layer 2: System Recovery
Getting critical applications back up and running and the associated data restored requires a recovery environment (e.g., properly configured servers, backup hardware and software, networks and storage) that is compatible with the lost production environment.
During Sandy, many companies had problems getting back to business because changes had been made in the production environment but not the recovery environment. Some lost network connectivity and could not return to normal operations. Others had problems due to interdependencies between application elements that were not all restored at the same time.
System Recovery Lessons Learned
SunGard found that companies with the least trouble recovering had taken several system recovery precautions before Sandy struck.
Change management proved critical to ensuring that recovery and production environments matched. Before Sandy, some companies allowed production environments to grow (adding processing power and storage capacity over time), yet never updated the specifications of their recovery systems. During Sandy, almost one-third of SunGard’s customers needed significant changes to their recovery system specifications, including higher-end servers, additional disk capacity, and different tape technology. To avoid such problems in the future, recovery environments must change as production environments change if restoration is to proceed as planned.
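One way to catch this kind of drift is to compare the recorded recovery specification against the current production specification on a regular basis. The sketch below assumes both are available as simple dictionaries. The field names, values, and the `find_shortfalls` helper are hypothetical and are not part of any SunGard product.

```python
def find_shortfalls(production, recovery):
    """Return {field: (production_value, recovery_value)} for every field
    where the recovery environment no longer matches production.

    Numeric fields are flagged when recovery is smaller than production;
    all other fields are flagged when they differ at all.
    """
    shortfalls = {}
    for field, needed in production.items():
        available = recovery.get(field)
        if isinstance(needed, (int, float)):
            if available is None or available < needed:
                shortfalls[field] = (needed, available)
        elif available != needed:
            shortfalls[field] = (needed, available)
    return shortfalls


if __name__ == "__main__":
    # Hypothetical specifications, for illustration only.
    production = {"cpu_cores": 32, "disk_tb": 20, "tape_format": "LTO-5"}
    recovery = {"cpu_cores": 16, "disk_tb": 20, "tape_format": "LTO-4"}
    for field, (needed, available) in find_shortfalls(production, recovery).items():
        print(f"{field}: production has {needed}, recovery has {available}")
```

Running a check like this as part of every production change request would surface a mismatch when it is introduced rather than during a disaster declaration.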
Another common recovery pitfall was failing to take the interdependencies of various elements (database, middleware and web) of a critical application into account. Some customers classified the middleware and web layers of their application as Tier 1, but left the database layer as a Tier 3, and thus were not able to restore their application within their desired RTO. Going forward, companies need a full understanding of these interdependencies before a disaster strikes, and recovery plans must allow for restoration of the elements in a suitably timely manner.
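The tier mismatch described above can be detected before a disaster by walking each component's dependencies. The following is a minimal sketch under stated assumptions: a lower tier number means faster recovery, the dependency graph has no cycles, and the component names, tier values, and helper functions are hypothetical.

```python
def effective_tier(component, tiers, depends_on):
    """Return the slowest (highest-numbered) tier among a component and
    everything it depends on, directly or indirectly.

    Assumes the dependency graph contains no cycles.
    """
    worst = tiers[component]
    for dependency in depends_on.get(component, []):
        worst = max(worst, effective_tier(dependency, tiers, depends_on))
    return worst


def find_tier_conflicts(tiers, depends_on):
    """Return {component: effective_tier} for components that cannot meet
    their assigned tier because a dependency recovers more slowly."""
    conflicts = {}
    for component, assigned in tiers.items():
        actual = effective_tier(component, tiers, depends_on)
        if actual > assigned:
            conflicts[component] = actual
    return conflicts


if __name__ == "__main__":
    # Mirrors the example in the text: web and middleware are Tier 1,
    # but the database they rely on was left at Tier 3.
    tiers = {"web": 1, "middleware": 1, "database": 3}
    depends_on = {"web": ["middleware"], "middleware": ["database"]}
    print(find_tier_conflicts(tiers, depends_on))
```

In this example both the web and middleware layers are flagged, because neither can be fully restored until the Tier 3 database is back.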
Finally, there were companies that experienced network connectivity issues. SunGard advises customers to review their core network design and establish failover routes to route traffic around storm-affected areas.
Layer 3: People, Processes, and Programs
It should be obvious that the IT staff members who perform recoveries need a place to work with the right equipment, space and communications to do their jobs. They must follow set procedures in recovery runbooks, and these procedures must be kept up to date to cover any changes that might impact restoration. Finally, the disaster recovery program must include frequent testing and analysis of plans and incorporate change management and best practices.
People, Processes, and Programs Lessons Learned
As one might expect, companies experienced problems in all three of these areas when trying to deal with Sandy’s aftermath.
With power outages and road closures across large parts of the northeast, many people attempted to work from home, only to find that they could not connect to their company VPNs or access the applications they needed in order to be productive. To avoid these problems in the future, companies should revisit their telework strategies and consider partnering with a qualified disaster recovery provider for proper workgroup space.
In the high-stress environment that follows a disaster, your recovery is only as robust as your “last-known-good” procedure. Therefore, recovery runbooks must be kept up to date and recovery procedures must be based on current production environments. Without proper change controls, many companies struggled with out-of-date recovery procedures during Hurricane Sandy.
When it came to program aspects, SunGard noted that many customers did not test their plans frequently enough, did not conduct post-test analyses, and did not integrate improvements. Change management between production and recovery environments was also an issue, as previously noted. On the other hand, customers of SunGard’s Managed Recovery Program, such as one prominent East Coast university, experienced a successful recovery that exceeded their expectations.
Major storms like Hurricane Sandy put new emphasis on recovery and offer a chance for organizations to take the lessons learned from the disaster and revamp their operations to ensure they will weather the next storm and come out in better shape. The key message is that businesses need to take a more proactive stance to ensure availability and uptime. In short, every business should have a solid DR plan – one that addresses all three layers of DR challenges – in place in the event of a disruption.
To learn more about the lessons SunGard learned in working with its customers before, during and after Sandy, download our Hurricane Sandy white paper here:
To learn more about SunGard’s overall Managed Recovery Program for disaster recovery, please visit: