Business continuity planners have long relied on technology to assist in responding to disruptive events. (For purposes of this article, a business disruption is anything that prevents day-to-day work from being done, including power disruption, downed phone lines, and so forth. A data disaster occurs when data is corrupted. Hence, a data disaster is a subset of business disruption.)
The advent of virtualization technology has improved business continuity planning and execution for many organizations. However, virtualization technology is complex and requires proficiencies from both IT staff and management. In fact, if deployed or managed carelessly, virtualization can itself create business disruptions or data disasters.
According to Forrester Research’s 2010 report on the business state of disaster recovery preparedness, a joint effort with the Disaster Recovery Journal, many organizations have improved their disaster recovery capabilities over the past few years. Despite a slow economy, survey respondents reported an increased confidence in being prepared for a data center disaster or site failure.
Seventy-six percent of survey respondents reported no disaster or major disruption in the past five years, yet Forrester researcher Rachel Dines cautions that companies should not take comfort in this statistic. Instead, it should serve as a wake-up call: roughly one in four companies is still likely to declare a disaster. Furthermore, business disruptions are much more common than “declared disasters.” Here’s why.
Getting an organization to declare a disaster can be a matter of perspective, according to Don Stewart, director of professional services at Ongoing Operations, a non-profit business continuity service provider for U.S. credit unions. “In some events, IT is so focused on fixing the problem that they don’t inform senior management of the disaster event,” said Stewart. Some organizations have never defined what constitutes a business disruption, so senior management hesitates to declare a disaster when an event is perceived to be minor, such as a phone system failure or delayed e-mail delivery.
Staying prepared requires more than a documented business continuity plan; it requires teamwork from all stakeholders. Giving every stakeholder a stake in planning helps ensure that business operations can be maintained in the event of a disruption. Stewart recommends that a good plan start with a risk impact analysis. Most companies, according to Stewart, will purchase an in-depth risk assessment and then do nothing with it; “the report just sits there with no further actions being taken.” This is as effective as making a list of essentials to pack in a kit in case of a house fire but never assembling the kit.
The Strengths of Business Continuity
Recently, the IT department of the U.S. state of Ohio virtualized the data centers that provide governmental social services to residents with developmental disabilities. The goal of the project was to give employees and external users access to service applications without downtime, with the ability to scale for future growth. The project supports 80,000 Ohio residents.
TechTarget reported on the project, relating that it took nine months of architecture planning and that disaster recovery requirements were a top priority before the infrastructure build began. By leveraging the experience and expertise of internal staff and by working with a qualified third-party IT service company from the beginning, the department completed the project on time; it currently supports 200 virtual machines, and more than 90 percent of the department’s servers have been virtualized, TechTarget reports. This project is an excellent example of how IT virtualization projects can work in harmony with business continuity objectives to deliver quality services.
Mercy Medical Center in Cedar Rapids, Iowa, provides a success story of having a business continuity plan in place for the entire organization. The hospital successfully put its plan into action during the Midwest floods of 2008, and according to its website, returned to full operations after three weeks. The Wall Street Journal’s Health Blog has a compelling interview about the plan’s evacuation and recovery process.
In summary, companies that invest the time, resources, and technology into business continuity plans are better prepared to handle business disruptions. The preceding accounts of successful recoveries affirm the value of disaster readiness.
The Weaknesses of Technology and Business Continuity
On the other hand, overconfidence in the technology that powers a plan’s recovery point and recovery time objectives (RPO and RTO) can be dangerous. As a former data recovery engineer, I can attest to the many questions IT administrators, business directors, and executives ask after a technology disaster, chief among them how it could have happened to them.
Nobody wants a data loss or business disruption on the systems they are responsible for, yet there is usually a cascade of technological failures when an IT disaster occurs. Discovering during the course of an IT disaster that backups have been failing, or that backup software has not been reporting media failures, is gut-wrenching. Too often, a serious data loss or business disruption results in unemployment for those responsible, or thought to be responsible, and the equipment is no longer viewed as reliable.
The Cascade of Failures
Disaster recovery efforts went from bad to worse for a European company recently. During routine maintenance on the SAN storage that housed the company’s virtual machines, the SAN was accidentally presented to a different physical server and was reformatted by IT staff. The company’s disaster recovery infrastructure included an identical SAN storage unit located off-site, which employed site replication technology, so the IT staff assumed the event would be a minor business disruption.
The IT team was horrified when they discovered that the remote SAN was an identical copy of the primary site; the SAN’s automated site replication technology had not been disabled prior to the maintenance. Thus, when the reformat occurred at the primary site, the secondary SAN was reformatted as well.
This organization did not have any backups because it was assumed that dual storage architecture and site replication mechanisms provided complete data and system redundancy. This case is especially compelling because storage equipment features provided a false sense of security. In reality it is industry best practices combined with IT management procedures that ensure data protection.
When the Threat is on the Inside
A United States business merger suffered a disaster while the two companies’ IT departments were merging their data.
The first company’s virtual host server held over 400 virtual machines across 20 storage LUNs. During the data merge, someone with administrative access to the virtual host server deleted the 400 virtual machines and their virtual disk files. Evidence suggests employee sabotage, and the incident is still under investigation by computer forensic investigators.
The merging company quickly engaged emergency data recovery services and prioritized the core servers that provided essential services. In three days, those systems were up and running. For the next two weeks, emergency recovery work continued on the rest of the storage system. This required extensive recovery engineering to search the unallocated areas of the storage LUNs for potential virtual disk files, identifiable only by their file system attributes.
Through a combined effort of backup restoration and original volume recovery, data was recovered. Most of the virtual disk files were complete, while other virtual disks required the file contents to be extracted due to file system damage. Could your business, or your client’s business, survive without critical data or business systems for three days or more? This account highlights the importance of performing regular storage system evaluations in conjunction with annual business continuity exercises.
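For illustration only, here is a drastically simplified sketch of the signature-scanning step such an effort involves: walking a raw image sector by sector looking for the magic number that opens a VMware sparse-extent header. Real recovery work must also validate the header fields and reassemble fragmented or partially overwritten extents; this only locates candidate offsets.

```python
# Scan a raw LUN image for candidate VMDK sparse-extent headers.
SECTOR = 512
MAGIC = b"KDMV"  # on-disk little-endian form of the VMDK magic 0x564D444B

def find_vmdk_candidates(image_bytes):
    hits = []
    # Headers begin on sector boundaries, so step by one sector.
    for offset in range(0, len(image_bytes) - len(MAGIC) + 1, SECTOR):
        if image_bytes[offset:offset + len(MAGIC)] == MAGIC:
            hits.append(offset)
    return hits

# Toy image: a magic number planted at the start of sector 4.
image = bytearray(16 * SECTOR)
image[4 * SECTOR:4 * SECTOR + 4] = MAGIC
print(find_vmdk_candidates(bytes(image)))  # [2048]
```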
Too Big to Fail
Today’s IT projects have reached a scope where failure is not an option; there is simply too much at stake for the business to tolerate a disruption. Consider these comments and observations from IT professionals about the importance of their projects:
“We’ve been in the planning stages for three months now. I can’t tell you how many scoping and business impact analyses I have done. I don’t trust any storage, SSD, Cloud, or tape, which is why my data is stored in multiple locations. I plan for failure and have a solution to protect the data.”
— IT architect who is ready to start an e-mail migration affecting 40,000 users
“It’s a six year business intelligence project with data aggregates in the 100TB range. There’s a lot of time being spent on creating metrics and mapping the data. The raw data is going to have 30 to 40 billion rows in a single table. There’s no room for error for the team I’m working with.”
— Retail sector business analyst
“No one tests their backups. We had a job from a web hosting company that serviced twenty to thirty thousand users. All they were doing was physically moving their server cabinet to another area within the data center, and when they powered their system back on, nothing came back. After being down for more than four days they called us. Can you imagine that? Four days? The web hosting business is so competitive now; would you risk being down that long before calling a data recovery company? They told us their backups were over a year old, and we recovered their original data in 14 hours. If they had regularly tested their backups, they would have known they had a risk.”
— Data recovery business owner
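The recovery owner’s point reduces to a simple discipline: periodically restore a backup and verify it against the original. A minimal checksum-comparison sketch follows; the paths, and the restore step that would produce the restored copy, are placeholders for whatever a given backup system provides.

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 so large files don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def backup_is_good(original, restored):
    """True if the restored copy is byte-identical to the original."""
    return sha256_of(original) == sha256_of(restored)
```

Run against a scratch restore on a schedule, a check like this would have surfaced the year-old backups long before the cabinet move.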
“RAID controller failures are the biggest support calls we deal with. These types of failures are slow to identify and big on disaster. Most IT admins do not have a plan to handle these types of events until the entire system crashes. When we support these types of calls, we do not go to the backup right away. We analyze the I/O event logs to see when the problems started. Then, through a combined effort of our replication solution and portions of other backups, we selectively restore the missing data. It’s a planned recovery execution so that recovery time objectives are met. This also helps us to meet recovery point objectives that business owners have established.”
— Business continuity service provider
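The provider’s practice of reading I/O event logs before touching backups amounts to locating the first error, so a recovery point can be chosen from before the corruption began. A toy sketch, using an invented log format purely for illustration:

```python
from datetime import datetime

# Invented sample log; real controller logs vary by vendor.
LOG = """\
2011-04-02 01:15:09 INFO  scrub completed
2011-04-03 22:41:33 ERROR disk2 read timeout
2011-04-03 22:41:35 ERROR disk2 medium error
2011-04-04 06:02:10 ERROR array degraded
"""

def first_error_time(log_text):
    """Return the timestamp of the earliest ERROR entry, or None."""
    for line in log_text.splitlines():
        date, time, level, _msg = line.split(maxsplit=3)
        if level == "ERROR":
            return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return None

print(first_error_time(LOG))  # 2011-04-03 22:41:33
```

Any restore point after that timestamp risks carrying the corruption forward, which is why the analysis precedes the restore.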
Data Loss Consequences
Determining the financial impact of a business disruption is difficult because there are tangible factors, such as productivity losses, missed sales opportunities, and staff time, as well as less tangible costs of downtime, such as potential non-compliance penalties, damage to corporate image, and weakened customer confidence. The Forrester-DRJ survey cited earlier noted that 15 percent of respondents knew the cost of their business’ downtime, which averaged nearly $145,000 per hour. That is an estimate that would make any director or CIO take notice of the readiness of their business continuity plan.
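The survey’s $145,000-per-hour average makes the arithmetic sobering. A back-of-envelope sketch, where the hourly figure is the survey average quoted above and everything else is hypothetical:

```python
# Tangible downtime cost only; reputation damage and compliance
# penalties are not captured by this arithmetic.
HOURLY_COST = 145_000  # survey average, dollars per hour of downtime

def downtime_cost(hours, hourly_cost=HOURLY_COST):
    return hours * hourly_cost

# The four-day web hosting outage described earlier, at the survey average:
print(f"${downtime_cost(4 * 24):,}")  # $13,920,000
```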
A disaster recovery plan that includes the use of virtualization technology is a great start in automating the recovery process. However, the risk is that without professional guidance and testing of those plans, another disaster may be waiting below the surface of the first disaster.
In early 2011, some of the biggest names in cloud services experienced disasters. The events were unrelated to each other, yet each resulted in downed or disabled websites, and affected Internet users were met with the message, “This service is temporarily unavailable.” Because cloud services are relatively new, a customer may not realize the limitations of an SLA contract until a business disruption occurs, and may realize too late that more resiliency should have been written into the cloud contract.
During an outage, all IT hands are on deck, working feverishly to restore services, replace equipment, restore backups, and perform root cause analyses and other investigative tasks associated with management’s need to understand what triggered the event. No one really believes that a storage failure or data loss will happen to them, whether in the cloud or within their own company’s walls. As more storage is consumed by expanding company needs or virtualization technology, additional attention must be given to the management and protection of virtual assets.
According to IDC’s worldwide tracking of external disk storage systems, total disk storage capacity shipped in 2010 exceeded 5,100 petabytes, a 55.7 percent increase over the previous year. This continued growth requires IT management to maintain disaster recovery documentation and to exercise recovery plans regularly. That means more than a tabletop exercise, which too often becomes little more than an update of the plan’s emergency telephone contact list.
Maintaining business continuity through well-planned and tested disaster recovery plans is essential. Investing the resources it takes to protect a business’ operations requires more than policies and the latest backup equipment. Successful organizations protect their entire infrastructure and create internal awareness of the importance of business continuity. These organizations realize that any disruption within the infrastructure, regardless of how small, will have an impact on the business as a whole.
Sean R. Barry (firstname.lastname@example.org) has more than 14 years of experience in the data recovery industry and has helped companies and users recover from data disasters. As a consultant, Barry provides risk analysis, data protection planning, and staff training to prevent or cope with data disasters. He also speaks at technical trade shows and other events to educate computer users and IT staff about the impact of data loss and how to cope with data disasters.