As an organization, CNA Surety has taken the idea of business continuity seriously for a long time. A business continuity plan has been part of the organization’s culture and procedures for many years. We even staged a mock tornado to test our plan and our mettle. This test involved the entire corporate headquarters staff and was conducted with the utmost seriousness and care.
We have also conducted several successful mainframe tests at our hot site vendor and have demonstrated that our legacy systems could be restored comprehensively and quickly, but that was the limit of our disaster recovery testing. These systems remain vital to the ongoing success of our business, but they are no longer the core of our application and database infrastructure. And it wasn’t until the advent of our data warehouse and presence on the Web that the organization stopped seeing its legacy systems as its computing heart.
As at many small- to medium-size organizations, it was mostly a matter of finding the resources (the employee time and the funds) to create an IT DR plan while so many competing projects clamored for their share of a limited pie.
Surety’s Computing Environment
The computing environment has burgeoned well beyond the bounds of the mainframe system during the last few years. Surety now has employees in 39 locations across the United States and Canada.
There is an IBM mainframe; an HP Superdome; many Intel servers running three versions of the Windows operating system; a storage area network (SAN), composed of two different technologies, that serves all the operating systems (including HP-UX, z/OS, and Windows); substantial Oracle databases that support a data warehouse; SQL databases, some of which support stand-alone applications and others of which provide data feeds to Web-based applications; an imaging system that is at the heart of the business; and high-volume in-house printing that is integral to the business. In addition, the AIX operating system will soon be a part of the production environment, and there is an AS/400 that is slowly being phased out. And as in most other businesses’ computing configurations, many of our systems depend on one another for data.
It’s a fairly complex enterprise to have to plan to recreate in the event that all or part of it is destroyed by an F5 tornado or made unavailable for an extended period by a chemical leak or a 100-year blizzard.
A New Era
Even with the resource limitations we continue to face, we’ve turned the page in our disaster recovery planning history. We’re no longer allowing fiscal limitations to stifle our progress. With a recent hot site test, Surety has begun a new era in disaster recovery. The organization has embarked on a two-year effort that will culminate in an enterprise-wide test at a hot site provider.
While gearing up for this latest hot site test, we departed from our DR planning history in a few fundamental ways:
1. We are using our hot site tests to drive the DR effort. During the next two years, we are planning to conduct four hot site tests (one every six months). Each test is scheduled to be more complex and comprehensive than the one before it.
During the months between these tests, on-site testing is being conducted to verify the procedural documentation and to ensure that we have the most successful and useful hot site tests possible. In-house testing is done as much as possible to avoid the cost of doing it at the hot site vendor.
While testing at a hot site vendor can be expensive, it’s a relatively small commitment when compared to replicating a hardware infrastructure or having your applications and systems unavailable for extended periods, especially the high-availability applications.
2. When we create DR strategies, we don’t focus only on technological solutions that might cost hundreds of thousands of dollars. Instead, we consider how we can use existing equipment and even outside resources to accomplish our goals.
For example, CNA Surety’s corporate headquarters is located in Chicago, but the majority of CNA Surety’s IT resources operate out of a major operations location in Sioux Falls, S.D. In the event of a disaster, the Sioux Falls staff would find it onerous to travel to Chicago, so we have had to create a strategy for housing our operations elsewhere in the Sioux Falls area, but at minimal cost. To this end, we have begun talking with other corporations in the Sioux Falls area and local universities about sharing facilities in the event of a disaster. It means a substantial savings when compared to engaging a hot site provider, for example, to provide an alternate work site hundreds of miles away and relocating hundreds of people for several weeks or months.
3. We engaged a technical writer who has a substantial IT background. While this was a financial commitment, it is a limited one. He has helped us to establish a foundation of procedures and standards that will live well beyond his finite tenure. While he is not dedicated exclusively to DR, it has been his primary focus, and he helps the disaster recovery coordinator (a systems programmer who has this title along with many others) to focus the IT staff on DR tasks.
4. We’ve set reasonable expectations. We’ve given ourselves approximately two years to grow our strategy. That is long enough to develop the procedures and documentation, yet short enough to maintain a real sense of urgency.
The old saying is “necessity is the mother of invention.” This has been our mantra for the last year.
Servers: The backups have been aligned so that restores can be performed according to our business’s recovery time objective. Although many servers are aging rapidly, we have created procedures for restoring the operating systems and services from model images, which are created first and then distributed to the remaining servers. We are also planning to restore the servers at an alternate work site rather than at the hot site vendor or our parent company’s headquarters. This will mean significant travel and vendor savings.
Telephone and Printing Equipment: We are counting on a crate-and-ship strategy to replace our telephone and printing equipment. Using this strategy means we avoid the cost of purchasing redundant hardware and engaging a vendor to perform the printing.
Workstations: We are in the process of replacing more than 700 workstations, which means that restoring these workstations from a few images and locating like hardware will be much easier. The replacement is not being driven by DR; it is simply a wonderful confluence of events. Even so, we had already prepared a plan to restore workstations from many images in order to avoid the cost of aligning all our workstation hardware and operating systems.
Alternate Worksite: To reiterate, we are working with other corporations and universities in the Sioux Falls area to establish a sort of consortium. We will all agree to house equipment and personnel in our facilities for a limited amount of time in the event of a disaster. Most corporations and universities have large meeting rooms and other facilities that are not used 100 percent of the time and could be reallocated for a limited amount of time.
There’s still much work left to be done. Again, we have a two-year plan in place that will culminate in an enterprise-wide test at our hot site provider. All application, data, and network components will be replicated, and we hope to conduct substantial user acceptance testing as well.
It’s going to continue to be a big challenge. During the next two years, we will have to fully integrate our IT disaster recovery plan with our business recovery plan. Currently, neither plan references the other. Also, there are many more procedures to create, revise, and validate (with or without a technical writer), and we need to solidify our alternate worksite plan.
We must also ensure that our change management process accounts for disaster recovery. If a new application or server is implemented, it must be accounted for in the disaster recovery procedures. Depending on the impact to the existing computing infrastructure, the DR strategy may have to be amended.
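One way to picture this kind of check is as a comparison between the list of systems in production and the list of systems covered by DR procedures. The following is only a minimal sketch under stated assumptions: the function name `dr_coverage_gaps` and the system names are hypothetical and do not describe Surety’s actual inventory or tooling.

```python
def dr_coverage_gaps(production_systems, dr_documented_systems):
    """Compare production systems against those covered by DR procedures.

    Both arguments are iterables of system names (hypothetical identifiers).
    Returns a dict with two sorted lists:
      - missing_procedures: in production but not covered by any DR procedure
      - stale_procedures: covered by a DR procedure but no longer in production
    """
    production = set(production_systems)
    documented = set(dr_documented_systems)
    return {
        "missing_procedures": sorted(production - documented),
        "stale_procedures": sorted(documented - production),
    }


if __name__ == "__main__":
    # Illustrative names only; not taken from a real inventory.
    gaps = dr_coverage_gaps(
        production_systems=["mainframe", "data-warehouse", "imaging", "new-web-app"],
        dr_documented_systems=["mainframe", "data-warehouse", "imaging", "retired-server"],
    )
    print(gaps)
```

A check like this could run as one step of a change request, so that a new server or application with no matching DR procedure is flagged before it goes live.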
There’s also the challenge of keeping the documentation current as vendors change and support personnel move in and out of roles. We’re implementing a quarterly review process we hope will help to keep the documentation alive. We’re always mindful of how much detail we should include and the user level we should be targeting. After all, as people leave and join the company, we cannot assume an intimate knowledge of the computing infrastructure. The procedures can assume expertise in a field (like networking or operating systems) but cannot assume knowledge of every router, server, and connection between applications and databases.
We’ve created a set of standards for ensuring that operational procedures and tasks do not violate or circumvent the disaster recovery procedures. For example, backups cannot be reconfigured simply to minimize the amount of media used. Rather, backups must be configured to ensure that servers can be restored according to the priority established by the business.
We’re up to all these challenges, though. We have an IT team, including management, that is invested in the process. We’re going to continue to use the hot site tests as mile markers to measure and prompt our progress, and we’re going to continue to be as inventive as necessary.
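To illustrate what restoring "according to the priority established by the business" might look like in practice, here is a minimal sketch. The `Server` fields, the example names, and the tie-breaking rule (tighter recovery time objective first) are assumptions made for illustration, not a description of our actual backup configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str        # hypothetical server name
    priority: int    # business priority; 1 is restored first
    rto_hours: int   # recovery time objective, in hours


def restore_order(servers):
    """Sort by business priority, then by tighter RTO, then by name."""
    return sorted(servers, key=lambda s: (s.priority, s.rto_hours, s.name))


if __name__ == "__main__":
    # Illustrative values only.
    inventory = [
        Server("web-feeds", priority=2, rto_hours=24),
        Server("imaging", priority=1, rto_hours=8),
        Server("data-warehouse", priority=1, rto_hours=24),
    ]
    for position, server in enumerate(restore_order(inventory), start=1):
        print(position, server.name)
```

The point of the sketch is that the restore sequence is derived from business priority, so a backup layout chosen only to save media cannot silently change the order in which servers come back.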
Duane Abbott is an IT consultant and a technical writer with Aquent, LLC. He has been in IT for more than 20 years.
Alan Carlson is a systems programmer and a disaster recovery coordinator working for CNA Surety Corporation. He has more than 30 years of IT experience.