Although the Commerzbank building remained standing, the blast shattered hundreds of windows and debris covered the offices and equipment within. Commerzbank’s recently implemented replication software saved its data and immediately made it available at the recovery site. Customer transactions, financial databases, e-mail and other crucial applications were secured.
“Before we deployed our new storage infrastructure, we were relying on tape for backup and restoration. With just a tape backup solution, our data was exposed and vulnerable,” said Gene Batan, Commerzbank’s vice president of information technology systems for North America. “In the event of a disaster, the integrity of our information was at the mercy of the last time we backed up to tape, and restoration time would take at least a few days. Our storage vendor provided us with real-time replication and shortened the window of recovery from a few days to a few minutes.”
Speed and reliability are equally important for a company that handles $30 billion in transactions daily. IT managers use the term “fail-over” to describe when one system fails or falters and the network shifts data or resources to a remote site. Commerzbank had immediate fail-over capability in place for its storage area network (SAN), which supported all of its Windows 2000 and Unix environments. Commerzbank mirrored its SAN to an enterprise storage system at the remote recovery site outside the city, creating a standard, immediately functional environment for its critical decision-support and transactional data.
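The fail-over decision described above can be sketched in a few lines of code. This is a minimal illustration, not a description of how Commerzbank's SAN or its replication software works: the `Site` record, its `healthy` and `has_current_copy` flags, and the `choose_active_site` function are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool           # outcome of a hypothetical health check
    has_current_copy: bool  # True if replication keeps this site's data current


def choose_active_site(primary: Site, recovery: Site) -> Site:
    """Return the site that should serve requests right now."""
    if primary.healthy:
        return primary
    # Fail over only if the remote site is up and holds a current mirror.
    if recovery.healthy and recovery.has_current_copy:
        return recovery
    raise RuntimeError("no healthy site with a current copy of the data")


# Hypothetical scenario: the primary data center is lost, the mirror is intact.
primary = Site("primary data center", healthy=False, has_current_copy=True)
recovery = Site("remote recovery site", healthy=True, has_current_copy=True)
active = choose_active_site(primary, recovery)
print(active.name)
```

The point of the sketch is the second condition: a remote site is only a useful fail-over target if its copy of the data is current, which is what real-time mirroring is meant to provide and what tape alone cannot.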
“If our Unix and Windows 2000 platforms were still relying on a tape-based recovery solution, it would have taken weeks to fully restore the data,” says Alban Bramble, Commerzbank’s assistant vice president of Unix and Windows 2000 platforms. “Instead, because of the SRDF [Symmetrix Remote Data Facility] software we deployed, we instantly had a full mirrored copy of our data at the remote site. Our Unix and Windows 2000 environments were intact and secure, enabling us to focus on other problems that arose during the 9/11 crisis.”
Making A Quick Recovery
The process of resuming operation may be called “recovery,” but when billions of dollars in transactions are at stake, every minute of delay is costly. Instead of recovery, companies want “business continuity” from the point of failure. You have to pick up quickly after an interruption. If you don’t, orders, e-mails, inquiries, bills and faxes pile up, creating a backlog that may never be cleared.
Following Sept. 11, Commerzbank’s disaster recovery site became, by default, its primary data center. Field support personnel were on-site immediately to assist in the process of restoring Commerzbank’s unprotected data and ensure that its information infrastructure was secure. During this process, as many as eight field service staff worked together in 24-hour shifts at Commerzbank’s disaster recovery site.
The highest priority for Commerzbank was to restore from tape the portion of its mission-critical information that was not protected by data replication. Some of the data backed up to tape from the data center at Two World Financial Center was not originally stored on a single vendor’s system, but when the crisis hit, EMC’s field service team sprang into action. Commerzbank required multiple terabytes of added capacity at its disaster recovery site in order to restore its tape-based, mission-critical information.
The storage provider was able to ship new products, configure the environment and fully restore Commerzbank’s information in less than 36 hours. Storage capacity at the back-up site was quickly doubled while the data itself was safeguarded with a second copy.
“Within 24 hours, we had all of the necessary loaned emergency equipment on site,” Batan recalls. “That was what it took to get things done. When there were no mirrored disks at the recovery site, the services team had some trucked down from Boston when no planes were flying. They had technicians and support staff finding ways to help us, and their personnel helped relieve our staff who were putting in exhaustive hours.”
“We owe the success of our business continuity solution to the performance of the technology and to outstanding field service support. People were making sure the environment stayed up and it was a complicated project,” said Bramble. “People are the most important asset. They make the technology work.”
IT organizations at companies like Commerzbank have learned important lessons from the events of Sept. 11. Nine of them follow.
Lesson 1: Distance is key.
Who could have guessed that bridges and tunnels could become single points of failure in the IT infrastructure? Sept. 11 changed the landscape by demonstrating that access to a second site can be restricted. The physical scope of a disaster can go far beyond the local facility, cutting off support people from the site and breaking site-to-site communications. Many people were unable to travel from their disaster recovery vaults to the recovery site because of closures no one had planned for: many streets, bridges and tunnels, and all airports, were shut.
Lesson 2: Tape recovery is not effective.
It became clear that relying on tape as a means of backup and recovery leaves organizations vulnerable. IT people who once believed tape was “good enough” found that access to tape can be restricted or eliminated. Recovery time can be too slow for effective resumption of business processes. Even when files could be accessed and restored from tape, many were found to have degraded or to be unreliable. Restore time often stretched to five days – and the process typically had to be done more than once. Also, because tape is subject to human error, in many cases information had not been backed up or had been backed up inconsistently.
Lesson 3: All applications are critical.
E-mail has become one of the more critical communication vehicles for corporate knowledge. When the communication lines are severed, so is the stream of business. On Sept. 11, many businesses found that proposals in process, agreements for trades, and the ability to document transactions and agreements were all contained only in their e-mail systems. But more than e-mail is at stake. Today, most operations and applications are interdependent. If the content of other information assets is lost in underlying or tertiary applications, that loss often affects higher-order applications such as CRM or ERP.
Lesson 4: Inconsistent backup is no backup at all.
Before Sept. 11, backing up data was a necessary task not always executed with precision or regularity. Today, it has become an imperative. Different backup schedules and strategies for different applications mean that information necessary for broad-based business processes cannot be matched up or reassembled. Also, inconsistent backups of applications significantly increase recovery time.
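One way to see the problem is to measure how far apart the most recent backups of interdependent applications are. The sketch below is illustrative only: the application names are borrowed from the examples in this article, while the timestamps and the `backup_skew` and `stalest_application` helpers are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Mapping


def backup_skew(last_backups: Mapping[str, datetime]) -> timedelta:
    """Gap between the newest and the oldest 'latest backup' across applications."""
    times = list(last_backups.values())
    return max(times) - min(times)


def stalest_application(last_backups: Mapping[str, datetime]) -> str:
    """Name of the application whose latest backup is the oldest."""
    return min(last_backups, key=last_backups.get)


# Hypothetical schedule: nightly e-mail backup, early-morning CRM, ERP every few days.
last_backups = {
    "e-mail": datetime(2024, 1, 5, 23, 0),
    "CRM": datetime(2024, 1, 5, 2, 0),
    "ERP": datetime(2024, 1, 3, 23, 0),
}

print(backup_skew(last_backups))
print(stalest_application(last_backups))
```

In this made-up schedule the restored copies would describe the business at moments up to two days apart, so records in one application may refer to data that the restored copy of another application does not contain.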
Lesson 5: People-dependent processes do not suffice.
In a crisis of the magnitude of Sept. 11, people think of their families first, and rightly so. And even when people turned to the work at hand, many of them couldn’t get to the second site to perform their duties, due to closed roads and safety considerations. IT systems that performed best were those that could automate the task of recovery and limit the need for human intervention and manual activities such as tape transport and loading. Furthermore, fatigued, worried employees become prone to errors that extend the recovery process.
Lesson 6: Two sites are not enough.
Even with a second site, many companies were left completely exposed following the disaster – with business processes now dependent upon a single facility. With service providers overwhelmed, these companies faced the prospect of functioning below their set policy levels for protection and business continuance for an extended period of time. Clearly, information and people need to be dispersed in new ways.
Lesson 7: Shared recovery resources can be overwhelmed.
Companies that relied on tape or a third-party provider found in many cases that they had difficulty meeting their recovery time objectives. The reason? Disaster recovery providers plan for only a percentage of their customers to require services simultaneously. This large-scale, geographically focused event therefore created a sudden, massive demand on their capabilities as clients tried to gain access to the same finite resources at the same time.
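The arithmetic behind that shortfall is simple. The sketch below uses entirely hypothetical figures (200 customers, capacity planned for 10 percent of them, 60 declaring a disaster at once); `capacity_shortfall` is an illustrative helper, not any provider's actual planning model.

```python
def capacity_shortfall(customers: int, planned_percent: int, declaring: int) -> int:
    """Number of customers a provider cannot serve when `declaring` of them
    need recovery services at the same time.

    planned_percent is the share of the customer base the provider has
    provisioned for, expressed as a whole-number percentage.
    """
    capacity = customers * planned_percent // 100
    return max(0, declaring - capacity)


# Hypothetical provider: 200 customers, provisioned for 10 percent of them.
print(capacity_shortfall(200, 10, 15))  # an ordinary, isolated incident
print(capacity_shortfall(200, 10, 60))  # a regional event hitting many clients at once
```

With these assumed numbers, the provider can absorb the isolated incident but leaves 40 customers waiting in the regional one, which is the situation described above.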
Lesson 8: People are irreplaceable; so is information.
Facilities can be rented. Cell phones can step in for land lines. But for every organization, the ability to conduct business depends on the availability of key personnel and the critical information and systems they need to function. Once people were protected, information was the one asset businesses found they could not replace fast enough – and without it, the most diligent employees were hindered in their ability to re-establish business operations.
Lesson 9: All disasters are possible.
The reality of Sept. 11 and ensuing events has heightened the urgency to have disaster recovery plans in place to ensure business continuity. IT executives are now faced with an increased burden of responsibility for balancing the powerful need for protection with corporate fiscal and resource realities.
A New Era Of Business Continuance
Companies like Commerzbank are expanding their definition of the term “mission critical.” Having been confronted with the task of getting their businesses back up and running, companies learned first-hand which applications were crucial to basic business operations. Previously, “mission critical” was defined in terms of which applications and information could not, under any circumstances, be lost. Now, “mission critical” has been broadened to include the applications and information a company needs to be back up and running immediately. Applications such as e-mail have become “mission critical” because a business cannot run without them.
Personnel issues are also being reviewed. There needs to be a back-up team to relieve the first-responders – in data recovery as in any long project. Managers need to recognize the long-term need for fresh talent after hours of non-stop work. Sometimes the hardest work will be done days or weeks after the initial disaster and no one wants to overwork their already overtaxed IT professionals.
According to Batan, redundant systems for communications may be as vital as backup computing power and storage. Having a virtual private network or dial-in capacity to let workers dispersed by an emergency work online from anywhere is invaluable. That means companies are looking at backup equipment and services such as satellite telephones, just-in-time electricity generators and on-call technical support.
Business continuity needs to be treated as part of the applications development and deployment process. IT managers now take the need for a comprehensive business continuity solution as a given; what they want to know is how to deploy one that is an active asset for the company. Previously, many remote sites sat empty waiting for a disaster, acting like an insurance policy – not useful until a catastrophe. Now, companies are turning to “productive protection” environments that enable remote sites to be used for application testing, database integrity checks, backups and other daily purposes that provide a return on the investment.
“With those assets in use daily, the importance of a storage network that flexibly supports nearly every computing platform is key,” said Batan. “No one can predict when a disaster will strike, nor how long the company will have to operate from an interim facility.
“We’re committed to adding more storage and creating a more robust storage network. Because of what we experienced, we really must have active sites in both locations rather than a passive site where it’s used for data recovery. We recognize our implementation was very timely.”
Today’s non-integrated management tools can pose undue security risks in networked storage architecture.
“Having a coordinated set of products from a reliable source with unparalleled service is why companies like Commerzbank can withstand the physical impact of an infrastructure disaster such as the one we experienced in the Sept. 11 attacks,” he says. “You can’t leave anything to chance.”
Since the events of Sept. 11, Commerzbank has decided to move more of its information and applications to a disk-based replication model and is greatly increasing the value and utilization of its remote site. Commerzbank is utilizing its remote replication capabilities and storage management software to run its operations across both of its sites simultaneously, from a single point of management and without interruption.
Joseph Walton is the senior vice president of global services for EMC Corporation, the world leader in storage systems, software and services.