When it rains, it pours.
At least that was the case in June 2006, when torrential rains flooded the Washington headquarters of the Internal Revenue Service (IRS) with an estimated 5.5 million gallons of water – a deluge that overwhelmed the building’s maintenance and electrical systems. The downpour submerged equipment in the sub-basement of the Office of the Chief Counsel in more than 20 feet of water, creating far more than an inconvenience for the agency and chief counsel employees.
Left with disrupted routines, more than 2,400 displaced personnel, and $54 million of damage including submerged data servers and infrastructure destruction, the IRS was tasked with keeping continuity-of-operations (COOP) afloat during the restoration process – a charge that is attracting growing attention across the board among federal agencies and private sector businesses alike.
In addition to infrastructure destruction and employee safety concerns, disasters often leave companies grasping for access to critical information and maintenance systems, without which business operations would collapse. As the business landscape continues to shift to accommodate a mounting reliance on information technology, organizations seeking to develop comprehensive yet adaptable COOP plans must take into account several key elements with regard to IT systems.
While the IRS was highly successful with disaster recovery and COOP efforts during and following the flood, the agency met with several unexpected challenges along the way. This scenario is not only a glimpse of disaster recovery best practices, but also offers lessons learned for organizations struggling to create comprehensive, reliable, and efficient continuity of operations plans.
From Sink to Swim
As with any business disaster scenario, primary concerns for the IRS chief counsel were ensuring the safety of employees and the survivability of critical business functions, which in this case included protecting taxpayer data stored in IT systems and minimizing the disruption of computer operations. At the time of the flood, the chief counsel had a comprehensive COOP plan in place, which consisted of several elements that came into play immediately as the IT team set out to ensure access to sensitive data and re-establish business operations for chief counsel employees.
The chief counsel practiced COOP procedures regularly each year, engaging scenarios including headquarters lockdowns and participating in the government’s "Forward Challenge ’06," an exercise mobilizing the COOP plans of government agencies. Included in the agency’s procedures were a comprehensive emergency contact process, failover sites and tape storage for system redundancy, and provisions for setting up temporary work sites. But a disaster of this type and scope was not anticipated.
The first priority for the chief counsel IT team following the initial flooding was to get critical equipment out of the headquarters building as quickly as possible and reconstitute essential communications and data structures for employees. To jumpstart the process, they relied on the agency’s clearly outlined emergency contact process, which had been developed and deployed prior to the flood. The team was able to contact the appropriate executives and decision makers immediately to determine next steps for restoring operations for chief counsel employees.
Fortunately, the IRS chief counsel’s existing contingency plan also included a failover site for Windows Server 2003 services. This failover site, a back-up location for critical system services in Atlanta, Ga., had been tested in April 2006.
The failover site allowed the IT team to immediately move all of the network’s Microsoft critical services to the Atlanta location, thus allowing IRS chief counsel field offices to proceed with business as usual, free from any severe impact following the flood. However, the chief counsel needed to make arrangements to get its core leadership team in Washington back online, as well as protect the portion of the agency’s data that was stored on a large Storage Area Network (SAN) supporting eight chief counsel e-mail servers located in the headquarters building.
With all of Federal Triangle under water, it became clear that the IT team would need ingenuity and improvisation to restore the chief counsel network infrastructure quickly and effectively. Following continuity-of-operations procedures, the IT team quickly created temporary work centers at alternative locations in L’Enfant Plaza and eventually in Crystal City, Va., and soon realized that it would have to manually move more than 100 servers, each weighing 80 pounds, out of the headquarters building to these temporary locations. Before doing so, the team needed to copy a massive amount of data from the SAN to local disks on the servers. Using power from a portable generator, the team moved data from the larger storage units to local drives on the mail servers – a task that required re-wiring a 30-amp circuit to supply enough power for the SAN. The servers were then relocated to the temporary location in L’Enfant Plaza, where every computer and server was accounted for. The result was a massive feat: the chief counsel was up and running on critical servers less than 24 hours after the downtown building closed.
Among the relocated servers was one that provided IRS attorneys with access to data for a highly publicized litigation case in New York City. The team worked through the night to copy the essential data stored on the SAN onto portable hard drives, which were then transported to New York by criminal investigators to ensure that they arrived safely and could be assembled without delay. The IRS was thus able to move forward with litigation without disruption, resulting in a $3.4 billion settlement.
Moreover, the IRS established an Enterprise Remote Access Project Virtual Private Network (VPN) in the wake of the flood. This VPN served as a system for use from remote locations owned by the IRS as well as from employees’ homes, thus minimizing the disruption of work processes.
The IRS chief counsel was successful in its response to this emergency, as evidenced by an audit by the treasury inspector general for tax administration released earlier this year, which stated that "due to preparatory and responsive actions, the IRS chief counsel adequately protected sensitive data" following the flood. The elements of the agency’s existing contingency plan, such as emergency contact procedures, fail-safe data locations, data redundancy, and temporary work sites, laid the foundation for success, but the agency also realized the need to improvise along the way – a lesson that may be useful to other organizations facing COOP planning or disaster recovery.
COOP: Realistic Preparedness
As momentum surrounding COOP and disaster recovery planning picks up, organizations are beginning to think more comprehensively about business continuity preparation. The range of events that could prompt COOP procedures is extensive and includes everything from natural disasters such as hurricanes and pandemics to terrorism and other man-made incidents.
Any event that could harm the infrastructure where employees carry out essential business functions, or keep employees from integral work sites, must be considered when making business continuity plans, at the core of which are technology considerations. Successful COOP planning takes all possibilities into account and addresses the key aspects of IT preparedness necessary for business continuity, some of which may not always be top of mind.
One of the bedrocks of a successful COOP plan is a clearly outlined, accurate, and accessible emergency contact process. In the midst of an emergency, communication is vital, as the IT team must be able to access key players immediately in order to move forward with next steps. In the case of the IRS, the IT team was able to jumpstart the recovery process due to its capability to contact appropriate decision makers.
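To make the idea of a clearly outlined contact process concrete, here is a minimal sketch in Python of an ordered call list with escalation. Everything in it is a hypothetical illustration rather than a description of the IRS process: the `Contact` fields, the placeholder names and phone numbers, and the `reach` callback, which stands in for whatever phone, pager, or messaging step an organization actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Contact:
    name: str   # hypothetical placeholder, not a real person
    role: str
    phone: str  # fictional 555 number for illustration only


def escalate(contacts: Iterable[Contact],
             reach: Callable[[Contact], bool]) -> Optional[Contact]:
    """Try each contact in priority order; return the first who acknowledges."""
    for contact in contacts:
        if reach(contact):
            return contact
    return None  # nobody answered: the call list needs more depth


# Hypothetical call list, ordered by decision-making authority.
CALL_LIST = [
    Contact("Primary decision maker", "executive", "555-0100"),
    Contact("Deputy decision maker", "executive", "555-0101"),
    Contact("IT operations lead", "it", "555-0102"),
]
```

Keeping the list ordered and the escalation rule explicit means that whoever is on duty knows whom to try next when the first decision maker cannot be reached.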
Another foundation of successful COOP planning is the ability of employees to access important business data remotely, securely, and reliably. A strong telework system allows employees to reach needed data from anywhere at any time, thus mitigating interruption to work functions during or following a disaster.
Of course, it is essential to provide for the security requirements of such a network to protect against new vulnerabilities that often arise during times of crisis. Many technologies on the market today provide the gate-keeping service needed for a remote access network, and planners should take advantage of them when preparing for COOP. The IRS was faced with the task of establishing a secure remote access network in the aftermath of the flood, and the agency is further advocating the use of telework in its COOP plans.
In addition, it is important to provide for both temporary locations for employees to access such a remote network and failover sites for data. Temporary locations ensure that while the physical infrastructure of the office may be compromised, business functions are not. A failover site, such as the IRS chief counsel site in Atlanta, provides a fallback for important data should a destructive event occur.
The IRS chief counsel also has waterproof and fireproof storage units for tapes holding redundant data in many field offices, and it backs up resources on local servers frequently, offering layers of redundancy to mitigate further the risk of lost data.
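Layers of redundancy only help if the copies actually match the originals. As a hedged sketch, assuming both the primary data and a redundant copy are reachable as ordinary directories (which is not how tape storage works, and is not a description of the IRS setup), the Python below compares SHA-256 checksums and reports files that are missing from the copy or differ from it. The function names and directory layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(primary_dir: Path, backup_dir: Path) -> List[str]:
    """Return relative paths that are missing from, or differ in, the backup."""
    problems = []
    for source in sorted(primary_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(primary_dir)
        copy = backup_dir / relative
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(relative.as_posix())
    return problems
```

An empty result from `find_mismatches` means every file in the primary directory has an identical counterpart in the backup. The check runs in one direction only, so it does not flag extra files that exist only in the backup.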
There are also several "softer" aspects of continuity planning that can play a large role in ensuring that things run smoothly during recovery and continuity efforts. One of these is the practice of testing procedures and systems often. This practice can help avoid surprises in the event of a real disaster and bring visibility to any discrepancies or shortfalls. It is also crucial for the IT team to understand what is important for the business of the organization and to collaborate with other staff responsible for its mission, so that the team can concentrate first on the activities that are highest priority for keeping the business running.
Finally, improvisation, such as the IRS chief counsel team’s use of power generators and manual movement of servers, is often an unwelcome but necessary piece of disaster recovery. Recognizing this in the planning process may help teams stay open-minded and be prepared to adapt for best results.
COOP needs differ based on the organization, but the elements of a realistic readiness plan are standard. COOP is an ongoing process rather than a plan that can be drawn up and placed in a book, not to be touched until an event arises. It is a living plan that must be tested, re-tested, and updated often and can determine the fate of a business in the wake of a disaster.
Brian Downs is operations section chief, Washington, D.C., for the Internal Revenue Service.
"Appeared in DRJ's Summer 2007 Issue"