This article isn’t your typical “Disaster Recovery (DR) 101” dissertation, but rather a summary of conversations with organizations that made me realize that a majority of companies are not prepared for DR and, in many cases, have not even planned for it. DR (and, by extension, portions of business continuity) often comes up during projects and engagements focused on other aspects of IT, including application migrations, virtualization, and security. Even when it’s not the primary discussion item, I have found that many companies have a distorted view of current DR approaches, including their expectations of the DR architecture. Even those who think they take it seriously aren’t having much success. According to an annual vendor survey, one in four DR tests fails. One can assume the definition of “fail” is pretty extreme in this case.
So, without naming clients or specific engagements, here are a few real examples from DR discussions over the past few years that made me realize how critical a “Disaster Recovery 101” course could be for many organizations:
‘This is the most critical application we have...’
What I found most interesting about this statement, after diving into the project, was that the environment wasn’t clustered or load-balanced. In fact, the company hadn’t run a DR test in the past two years due to budget cuts. On top of that, the last data restore was a distant memory. When the conversation turned to required technology refreshes, change control, and acceptable downtime for applications, the answer was a consistent “it can’t be down.” A few weeks later, both a CPU board and a NIC failed. As a result, the downtime was measured in days and hours, not minutes. Further, the DR fail-over didn’t work and actually added to the total amount of downtime. The key take-away here is that the organization needs not only to keep an open mind regarding technology refreshes, but also to test its environment to ensure that its DR capabilities will hold up in the event of a disaster.
‘What’s the application RTO and RPO? Blank Stares…’
Recovery time objectives (RTOs) and recovery point objectives (RPOs) are perhaps the most important metrics when architecting a disaster recovery solution. An RTO is the maximum amount of time allowed to recover from a disaster, and an RPO is the maximum amount of data, measured in time, that you can afford to lose from that same disaster. So, an RTO of four hours means that you will be up and running again within four hours after the disaster has been declared, and an RPO of 12 hours means that you can restore your data and applications to a point no more than 12 hours before the disaster declaration. These two business-driven metrics will set the stage for whether you recover from disk or tape, where you recover, and even the size of your recovery infrastructure and staff. If you don’t have the answers to these business questions, then you don’t have the answer to the IT solution.
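To make the two metrics concrete, here is a minimal sketch that checks a backup schedule against the RPO and a recovery plan against the RTO, using the four-hour and 12-hour targets from the example above. The function names, the backup intervals, and every step duration are hypothetical values chosen for illustration; they do not come from any real engagement.

```python
from datetime import timedelta
from typing import Mapping

# Targets from the example in the text: RTO of four hours, RPO of 12 hours.
RTO = timedelta(hours=4)
RPO = timedelta(hours=12)


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A failure just before the next copy loses up to one full interval.

    Simplification: ignores how long each copy takes to complete.
    """
    return backup_interval <= rpo


def meets_rto(recovery_steps: Mapping[str, timedelta], rto: timedelta) -> bool:
    """Sequential recovery steps must add up to no more than the RTO."""
    return sum(recovery_steps.values(), timedelta()) <= rto


# Hypothetical step durations, for illustration only.
tape_recovery = {
    "recall tapes from offsite storage": timedelta(hours=6),
    "restore data": timedelta(hours=5),
    "validate applications": timedelta(hours=1),
}
disk_recovery = {
    "mount replicated volumes": timedelta(minutes=30),
    "start applications": timedelta(hours=1),
    "validate applications": timedelta(hours=1),
}

if __name__ == "__main__":
    print("Nightly backup meets RPO:", meets_rpo(timedelta(hours=24), RPO))
    print("Six-hourly copy meets RPO:", meets_rpo(timedelta(hours=6), RPO))
    print("Tape recovery meets RTO:", meets_rto(tape_recovery, RTO))
    print("Disk recovery meets RTO:", meets_rto(disk_recovery, RTO))
```

Under these assumed numbers, the business targets rule out both the nightly backup and the tape-based recovery before any product is chosen, which is the sense in which the metrics drive the IT solution.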
‘We cannot lose data in a disaster for any reason…’
A zero RPO demands one of the most complex DR architectures there is. From my point of view, a zero RPO is virtually impossible given the numerous types of disasters that can occur. Whether it’s some type of storage-based replication or an application specifically designed for DR (e.g. two- or three-phase database/middleware commits), some data and/or transactions will be lost. I prefer to call this requirement a “near-zero RPO” so that proper expectations are set with upper management. I understand that there are those special applications (e.g. financial / stock-market) that really can’t lose anything at any time. However, that’s the 0.01 percent exception, and an example where an extreme amount of time, resources, and money has been sunk into DR, both initially and on an ongoing basis. While losing data during a disaster is mostly unavoidable, the amount of data lost can be controlled by properly implementing up-to-date technologies and testing your disaster recovery capabilities.
‘Our DR site is across the street.’
Really, what this says to me is, “My business continuity site is across the street.” A proper DR site must be situated far away from your primary data center. While some pundits may differ on the specifics of “where,” a good rule of thumb is more than 1,000 miles away, which isolates the DR site from a majority of the potential disasters that could affect the primary site (e.g. power, weather, etc.) and makes it unlikely that the same type of disaster will affect both sites simultaneously or within a short period of time. With only a handful of power grids in the U.S., tornados and hurricanes recently affecting numerous states, and even biological and other threats that could lock down enormous geographical areas (and eliminate the possibility of travel entirely), this is a recommendation I make strongly.
During data center consolidation projects (i.e. reducing the overall number of data centers that a company operates), the location of the DR site becomes even more critical. A typical response to the concern that the DR site is too close to the primary data center is “We like our DR site close by so we can get staff quickly to the facility.” That’s an approach that is prone to failure. A remote site, at or around 1,000 miles away, keeps a single regional event from taking out both facilities and helps prevent significant downtime. Do you really want to go tell your boss that the $20 million investment in DR failed because both sites were hit at the same time?
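As a quick sanity check on site selection, the sketch below estimates the great-circle distance between two sites with the haversine formula and compares it to the 1,000-mile rule of thumb. The city coordinates are approximate, the site pairings are hypothetical, and a straight-line distance says nothing about shared power grids or weather patterns, so treat it as a first filter only.

```python
import math

EARTH_RADIUS_MILES = 3958.8    # mean Earth radius
MIN_DR_DISTANCE_MILES = 1000   # rule of thumb from the text


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def far_enough_for_dr(primary, dr_site, minimum=MIN_DR_DISTANCE_MILES):
    """True when the two sites are at least `minimum` miles apart."""
    return great_circle_miles(*primary, *dr_site) >= minimum


# Approximate city-center coordinates; the pairings are hypothetical.
chicago = (41.88, -87.63)
dallas = (32.78, -96.80)
phoenix = (33.45, -112.07)

if __name__ == "__main__":
    for name, site in (("Dallas", dallas), ("Phoenix", phoenix)):
        miles = great_circle_miles(*chicago, *site)
        verdict = "passes" if far_enough_for_dr(chicago, site) else "fails"
        print(f"Chicago to {name}: about {miles:.0f} miles, {verdict} the 1,000-mile rule")
```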
‘We must replicate to our DR site synchronously…’
I have rarely seen a true need for synchronous replication in any environment. In some cases, I have seen proprietary applications with non-standard databases that, due to the inability to roll back transactions, could only utilize synchronous replication for confirmed writes to the DR site. However, given the distance limitations of synchronous replication (anything over 65 miles or so can cause serious latency and I/O performance issues), the target is really a business continuity site, not a DR site. The bandwidth issues associated with synchronous replication are easy: just throw more money into the project. However, the latency issue is tough, and the industry is still trying to figure out the best way to tackle this problem.
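The reason money cannot buy down latency is physics: every synchronous write must wait for a round trip to the remote site before it is acknowledged. The sketch below puts a lower bound on that wait, assuming light travels through fiber at roughly 200,000 km per second (about two-thirds of its speed in a vacuum) and that the fiber runs in a straight line. Real links add routing detours and equipment delay, so actual figures will be worse. The function names are illustrative.

```python
KM_PER_MILE = 1.609344
# Assumption: signal speed in optical fiber is about 200,000 km/s.
FIBER_KM_PER_SECOND = 200_000.0


def min_round_trip_ms(distance_miles: float) -> float:
    """Best-case round-trip propagation delay over straight-line fiber, in ms."""
    distance_km = distance_miles * KM_PER_MILE
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def max_serial_writes_per_second(distance_miles: float) -> float:
    """Upper bound on strictly sequential writes that each wait for a remote acknowledgment."""
    return 1000 / min_round_trip_ms(distance_miles)


if __name__ == "__main__":
    for miles in (65, 1000):
        rtt = min_round_trip_ms(miles)
        rate = max_serial_writes_per_second(miles)
        print(f"{miles} miles: at least {rtt:.2f} ms per write, "
              f"at most {rate:.0f} dependent writes per second")
```

Under these assumptions, 65 miles costs about one millisecond per acknowledged write, while 1,000 miles costs about 16 milliseconds, and no amount of added bandwidth changes either number.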
‘I can declare a disaster anytime I need to…’
The larger the company/corporation, the more nebulous the concept of who can declare a disaster and when. Remember that the clock starts ticking on your RTO once the disaster is declared as opposed to when the disaster occurs. When a tornado hits your data center at 2 a.m. on a Sunday, declaration is probably after the event. When the hurricane is bearing down and has now registered a category 4, declaration is probably before (hopefully, well before) the event.
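Because the RTO clock starts at declaration, any delay in declaring adds directly to the outage the business experiences. The sketch below illustrates this with the tornado scenario and a four-hour RTO; the dates, the 5:30 a.m. declaration time, and the function names are all hypothetical.

```python
from datetime import datetime, timedelta


def rto_deadline(declared_at: datetime, rto: timedelta) -> datetime:
    """The RTO clock starts at declaration, not at the event itself."""
    return declared_at + rto


def outage_seen_by_users(occurred_at: datetime, declared_at: datetime,
                         rto: timedelta) -> timedelta:
    """Worst-case outage measured from the event, assuming the RTO is met exactly.

    If the disaster is declared early enough before the event, recovery can
    finish before impact, so the result is floored at zero.
    """
    return max(timedelta(0), rto_deadline(declared_at, rto) - occurred_at)


RTO = timedelta(hours=4)

# Tornado: hits at 2 a.m. on a Sunday, declared at 5:30 a.m. (hypothetical).
tornado_hit = datetime(2024, 3, 10, 2, 0)
tornado_declared = datetime(2024, 3, 10, 5, 30)

# Hurricane: declared 12 hours before landfall (hypothetical).
hurricane_landfall = datetime(2024, 9, 1, 18, 0)
hurricane_declared = hurricane_landfall - timedelta(hours=12)

if __name__ == "__main__":
    print("Tornado RTO deadline:", rto_deadline(tornado_declared, RTO))
    print("Tornado outage from event:",
          outage_seen_by_users(tornado_hit, tornado_declared, RTO))
    print("Hurricane outage from event:",
          outage_seen_by_users(hurricane_landfall, hurricane_declared, RTO))
```

In the tornado case, a four-hour RTO that is met on time still leaves users down for seven and a half hours, which is why the question of who can declare, and how quickly, belongs in the plan.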
Complicating matters further is the distinction between a disaster affecting an entire data center and “application level” disaster declarations. In the latter case, the site might be fine, but a single application can declare its own disaster and fail over to another site for any number of reasons. It’s fair to note that the former is much less complex than the latter. When taking into account things like internal and external IP addressing and DNS, standard operating procedures after “day one” such as backups and monitoring, and a myriad of other ongoing issues, you need to ask yourself these questions: “Do I know where my application is today? And do I know how it is performing?”
While every organization has a slightly different take on DR, I have seen too many companies, whether or not they invest heavily and take it seriously, waste time, effort, and resources on a DR approach that is not sound or well constructed. Remember that, according to a recent Global Disaster Recovery Preparedness survey of 250 Disaster Recovery Journal readers conducted with Forrester Research, more than a quarter of the respondents declared some type of disaster in the past five years. Think of that: at least one in four of your peers declared a disaster at some point in the last five years. So, if you find yourself questioning your DR approach, it may be time to slow down and launch a concerted effort to get back to DR basics.
Bill Peldzus is vice president at GlassHouse Technologies, specializing in strategy and development of data center, business continuity, and disaster recovery services. Peldzus brings more than 25 years of experience working in technical and leadership positions at Imation Corporation’s Storage Professional Services and StorageTek’s SAN Operations business group, as well as running multiple IT groups at CNA Insurance and Northern Trust Company. Peldzus often serves as a content expert, keynote speaker, and author in numerous IT areas of specialty.