One of the keys to any data recovery planning and implementation is selecting the right tools to recover the data. Once the key people, data, and plan have been identified, the right combination of hardware equipment is required. In virtually every situation, it makes sense to include tape as part of your equipment selection.

This may not be consistent with conventional wisdom for data recovery, but it’s true nonetheless. The disk industry took the lead in data protection following poor tape performance in the 1990s. The old perceptions about tape have lingered, but the fact is that tape reliability has improved more than 700 percent over the last 10 years, and tape is now quantifiably more reliable and faster than the SATA disks with which it is typically compared. This speaks to a significant point: just as disk has a key role in data protection and disaster recovery, so does tape.

A Little History

Tape had some rough years, and this is an undeniable fact. During that time, disk started assuming an important role in data protection. In most data centers, disk has been used as a step in the back-up and recovery process so that data moves from primary disk (such as Fibre Channel disk) to secondary disk (typically SATA) then to tape – disk-to-disk-to-tape (d2d2t). This was made possible in part because SATA disk is more affordable than Fibre Channel disk and much more so than solid state disk. Because SATA precedes tape in typical data protection tiering, SATA disk is considered roughly equivalent to the tape tier and used when comparing the technologies.

Best Use of Disk in Data Recovery

SATA disks bring two strengths to data backup and recovery: random access, so that specific files can be rapidly restored, and a familiar interface that allows users to access files without proprietary software to read the data. Disk is hands-down the best choice in tiering data, especially using SATA for the second tier of storage. The short-term use of disk in the data protection schema is a significant advantage.

Analyst studies indicate that 95 percent of data recovery occurs in the first two weeks following data creation, making disk the ideal short-term storage medium. However, disk is much more costly than tape, so tape is typically a better choice for longer-term storage. Just as shorter-term storage plays to disk’s strengths, longer-term storage plays to tape’s strengths.

For sites that manage data carefully, deduplication and remote replication to disk have some advantages, primarily through the ability of the site to restore the most recent data. Note, however, that for large quantities of data or sites affected by power blackouts, this advantage disappears. Copying terabytes of data over a WAN is time-consuming and expensive. With tape, sites that have a great deal of data to restore can move that data more quickly, even if it requires overnight transport to a failed data center.
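
To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch in Python; the restore-set size, link speeds, and efficiency factor are illustrative assumptions, not figures from the article.

```python
# Back-of-the-envelope: WAN restore time vs. overnight tape transport.
# Data size, link speeds, and the efficiency factor are assumptions.

def wan_transfer_hours(data_tb, link_mbps, efficiency=0.7):
    """Hours to move data_tb terabytes over a link_mbps WAN link,
    derated by an assumed protocol/contention efficiency factor."""
    bits = data_tb * 1e12 * 8
    return bits / (link_mbps * 1e6 * efficiency) / 3600

data_tb = 20  # assumed size of the restore set
for mbps in (100, 1000):
    print(f"{data_tb} TB over {mbps} Mb/s: ~{wan_transfer_hours(data_tb, mbps):,.0f} hours")
# ~635 hours at 100 Mb/s and ~63 hours even at 1 Gb/s, versus a flat
# 12-24 hours for overnight transport of tapes, regardless of volume.
```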

Meet the New Tape

In spite of a common perception of disk as extremely reliable, SATA disk is, in fact, a significantly less reliable technology than tape. Given the bit error rate of each technology, it turns out that “…you are 100 times more likely to have bad data on disk than you are on an LTO-5 tape drive and 10,000 times more likely than if the data is stored on a T10000C or TS1130 drive.” New tape library technology has advanced tape’s data integrity capabilities, offering features that verify data integrity once data has been written to tape. With these technologies, automated tape libraries offer the most reliable storage tier — more reliable than SATA disk by at least two orders of magnitude.
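
The quoted ratios can be reproduced from published bit error rates. The sketch below uses commonly cited order-of-magnitude figures (roughly one unrecoverable error per 10^15 bits read for SATA disk, 10^17 for LTO-5, and 10^19 for enterprise tape drives); treat the exact values as assumptions, since vendors state them differently.

```python
# Expected unrecoverable errors when reading back a given volume of data,
# using assumed order-of-magnitude bit error rates (BER).

ASSUMED_BER = {
    "SATA disk":       1e-15,  # ~1 error per 10^15 bits read (assumed)
    "LTO-5 tape":      1e-17,
    "enterprise tape": 1e-19,  # T10000C / TS1130 class drives (assumed)
}

def expected_errors(data_tb, ber):
    """Expected number of unrecoverable bit errors reading data_tb terabytes."""
    return data_tb * 1e12 * 8 * ber

for name, ber in ASSUMED_BER.items():
    print(f"{name:16s}: {expected_errors(1000, ber):.4f} expected errors per PB read")
# The ratios (100x and 10,000x) match the quoted comparison, whatever
# the absolute figures turn out to be for a particular drive.
```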

Tape is also faster than SATA at moving a lot of data — and obviously this is important in getting an organization running again after a disaster. When comparing throughput drive to drive, tape drives have proven faster than SATA drives. One of the fastest SATA hard drives in a recent benchmark study posted throughput of 157 MB/s. This is significantly slower than tape, with the newly released T10000C drive posting uncompressed throughput of 240 MB/s and the TS1140 posting uncompressed throughput of 250 MB/s. Tape’s burst transfer rates are even faster, and compressed data posts throughput of up to 500 MB/s. This compares well even with Fibre Channel drives, which support 400 to 800 MB/s.
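
To see what those throughput figures mean for recovery time, here is a minimal calculation; the data volume and drive count are assumptions, and as the article notes below, real restores are also gated by network and software.

```python
# Hours to stream a full restore at a given sustained drive throughput.
# The 50 TB volume and single-drive setup are illustrative assumptions.

def restore_hours(data_tb, mb_per_s, drives=1):
    return data_tb * 1e6 / (mb_per_s * drives) / 3600

for label, speed in [("SATA drive, 157 MB/s", 157),
                     ("T10000C, 240 MB/s", 240),
                     ("TS1140, 250 MB/s", 250)]:
    print(f"50 TB via one {label}: {restore_hours(50, speed):.0f} h")
# Multiple drives in a library restore in parallel, dividing these times.
```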

Tape is the most cost-effective storage option for data that can tolerate access times of a few minutes. This is particularly significant when administrators consider how much of the data kept on disk is appropriately stored there.

Studies show that about 30 percent of the data stored on disk is appropriate, while “about 40 percent of the data is inert ... 15 percent is allocated but unused; and 10 percent is orphaned, meaning the owners of the data are no longer with the organization; and 5 percent is inappropriate.” Given that, on average, only 30 percent of data on disk is appropriately stored for wider use, the relative cost of disks is further inflated as compared with tape.

Tape’s Drawbacks

Perception is tape’s single biggest problem, although that is not to say that tape is trouble-free. In locations where legacy tape systems are in use, problems are common. In centers using current technology, tape problems are few, but tape still requires at least minimal best practices, just as all IT equipment and processes require proper care. At well-managed sites, such as the National Energy Research Scientific Computing Center (NERSC), where tape is used heavily, the success rate of data restores is 99.945 percent — including restores of tapes up to 12 years after the data was written. If nothing else, this is proof that tape can perform very well.

The time required to restore individual files is significant if the tapes are not in the library – a classic case where using disk simply makes sense. Tape that is in the library has a latency of only a few minutes. Latency issues are largely removed with the market’s increasing adoption of active archive, where a copy of all data is kept in the library.

Networks and back-up applications are currently optimized for disk, not tape, so it is difficult to get the full value of tape’s speed advantage. This is another case where using disk as an intermediary makes sense – d2d2t, with disk between primary storage and tape.

The proprietary interface to data stored on tape is another hindrance to data restore. It is important to note that this is beginning to change, with the release of Linear Tape File System (LTFS). With this file system interface, tape can be mounted and appear as a disk drive. Users can drag and drop files without having to go through proprietary software that translates data formats to human readable versions.

Some would say that tape is untenable due to the overhead of having to migrate data to newer generation tapes or having to have copies of old operating systems and software. However, this assertion overlooks the fact that data must be frequently migrated from disk to newer disk generations. The average life of disk is about four years, with some suggesting three years as the optimal time between disk replacements. This compares unfavorably with tape, which has a typical lifespan of up to 12 years. Tape is particularly flexible in that most formats, including LTO and TS1140, can read data from tapes written with technology two generations earlier.

Analyzing Disaster Threats

In terms of the type of disasters that may occur, only 3 percent of all disasters are significant natural disasters. Although these may get the most press, they do not pose the most pressing threats to data. The most common threat to data is hardware malfunction, followed by human error, software corruption, and computer viruses. Disk is vulnerable to some degree, as long-term studies of disk drive failures show.

Protecting data from logical corruption is one of the primary uses of tape in disaster recovery, and something that disk has never really addressed. If malicious software or a virus hits a disk, it can spread to the initial disk, the back-up disk, and the RAID copy that serves as the backup for the backup. A disk-only back-up strategy can thus actually worsen the very disaster it is supposed to protect against.

The combination of disk and tape for disaster recovery is the strongest defense against risk. The University of Chicago’s tape investment paid for itself with a single incidence of disk failure—without tape, the data would have been permanently lost. Online providers clearly trust tape, as illustrated by Google’s recent Gmail issue. On Feb. 28, 2011, Google posted on its Gmail platform an apology to the 0.02 percent of users (an estimated 150,000) who could not access their e-mail. The culprit? A software bug that attacked e-mail in the disk arrays and disk backups across data centers. All copies of the e-mail were unavailable, save those written to tape. The announcement states, “To protect your information from these unusual bugs, we also back it up to tape. Since the tapes are offline, they’re protected from such software bugs.” This succinctly illustrates the importance of using tape for disaster recovery.

Best in Data Recovery Practices: Both Disk and Tape

Although much of this data counters prevailing opinion, the facts are clear: tape, when used properly with disk, offers advantages that disk alone cannot match. Given the improvements in tape and tape automation, it is simply prudent to include tape as part of a disaster plan. Tape helps ensure data integrity, is affordable, and is a smart companion to disk in long-term data protection.

Molly Rector is chief marketing officer for Spectra Logic. Rector brings more than a decade of data storage industry experience to this role, where she helps define and execute the company’s future product roadmap and overall corporate direction, and serves as its primary corporate spokesperson.

Beth Walker is a chief writer for Spectra Logic. Walker has worked at Spectra Logic for more than 15 years, writing about products and the industry, and authoring multiple white papers for the company.

There’s nothing like a hardware or database failure to ruin your day – and your reputation as an IT professional. Losing access to your data for an hour, a day, or more, can set your business back, infuriate users, and cause significant risk to your company’s reputation and bottom line. Data backups are a critical IT function that no business can afford to be without. In our knowledge-based economy, your corporate data must always be available and accessible. As your data and IT infrastructure grow, so does the vulnerability of your business. Protecting your business from that vulnerability is one of the key functions of IT.

With the growing dependence on virtualization as a way to manage IT costs and consolidate infrastructure, the amount of data at most companies is continuing to grow at exponential rates. Traditional tape-based backups simply cannot keep pace. Disaster recovery scenarios risk potentially dangerous shortcomings if the data back-up strategy is based solely on traditional tape backup.

Tape Data Backups Far From Perfect

For years, tape backups have been the least expensive way to protect your data. Although tape is a viable method for archiving and meets regulatory compliance requirements, tape backups have some disadvantages.

First, tape is mechanical by design. Its reliability is questionable, and tape backups are labor intensive and subject to human error. When information is written to tape, a thin piece of plastic ribbon is stretched, pulled, and magnetically imprinted. Repeat this process a thousand times, and it is apparent why a recent Storage Magazine survey noted that more than 60 percent of IT professionals experience tape failure at least once per week and 25 percent at least twice a week.

Second, because tape backup is a slow, labor-intensive process, it often results in longer-than-expected backup timeframes, which in turn costs your firm productivity. At the same time, data volume is growing, and back-up windows are shrinking because businesses have no tolerance for downtime. When data backup takes too long, recovery point objectives (RPOs) are at risk.

Third, data on tape can only be accessed sequentially, adding considerable time to any exploration of a data recovery point. This kind of inefficient, labor-intensive process adds to your business overhead. Data restores also take a long time: you need to factor in the time to find the correct tape, recall the tape from your bonded offsite storage provider, catalog the tape, and then face the ultimate question ... is the tape still good or not? Back-up software can have issues when interacting with media. Running multiple jobs simultaneously can cause locks, human failure to change tapes can erase historical backups, and resource contention can occur due to unforeseen events or a lack of planning on the part of data center operators. And let’s not forget that tape drives themselves can cause problems with tape media.

Finally, tape backups may be incomplete. Businesses make the common mistake of assuming their back-up policies capture 100 percent of their data. They often don’t. Business data is frequently left unprotected because the tape back-up process omits infrastructure-level saves, skips some application tiers or, worse, includes no nightly full saves. Tape backups also limit your RPO: periodic tape backup guarantees hours of lost data in the event of a disaster. If a critical system fails anytime today, the best you can do is recover data from your last backup, which could be from yesterday, and any data not backed up is lost forever. This can mean a significant delay in meeting your RPO and RTO (recovery time objective). With failed tape backups, your deliverable RPO of 24 hours can quickly become 48 or – worse – 72 hours.

Advantages of Disk-Based Backups

With business operations so reliant on a complete data backup and recovery strategy, it’s vitally important to identify alternatives to tape-based backups that can eliminate vulnerabilities and provide effective data protection to quickly restore business operations in the event of a failure. Disk is very well suited for backup, especially now with technologies such as de-duplication, which offer simplification and cost savings. Disk-based backups with block-level compression and de-duplication make backups more efficient and reduce storage cost compared with traditional back-up solutions.

Data de-duplication (a specialized data compression technique for eliminating coarse-grained redundant data) is now a common feature across all disk back-up products, with practically every major enterprise data storage vendor offering at least one data de-duplication product. De-dupe takes the redundancy out of backups and back-up data. The problem de-dupe addresses is that backups consume disk space like a vacuum consumes dust.

For example, let’s say you have a terabyte of data and you back it up daily (seven times/week). That’s 7 TB of disk space, despite the fact that less than 10 percent of the data has actually changed. All backups have duplicate data from one backup to the next. With de-duplication, you can shrink that volume significantly by storing only data that has changed since the last backup. The savings are staggering. In our example, let’s assume a change rate of about 7 percent of the data in each back-up set. The first full back-up set is reduced by a modest percentage, to 690 GB. Subsequent full backups are reduced dramatically, given their overlap with data in the initial full, to about 40 GB each. The total disk requirement for storing 7 TB of raw back-up data is a mere 0.97 TB. Longer retention cycles yield even better results.
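
Here is a quick sketch of the arithmetic in that example, with the article’s figures plugged in; the per-backup reduction factors are the assumed inputs.

```python
# Reproduce the weekly dedup example: 1 TB backed up daily for a week.
raw_per_backup_gb = 1000      # 1 TB raw per full backup
backups_per_week = 7

first_full_gb = 690           # first full after dedup (from the example)
subsequent_full_gb = 40       # later fulls overlap the first (from the example)

raw_total = raw_per_backup_gb * backups_per_week
stored = first_full_gb + subsequent_full_gb * (backups_per_week - 1)

print(f"raw: {raw_total} GB, stored: {stored} GB, "
      f"ratio: {raw_total / stored:.1f}:1")
# raw: 7000 GB, stored: 930 GB, ratio: 7.5:1 -- the same ballpark as the
# article's ~0.97 TB total; longer retention improves the ratio further.
```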

If you back up less data, you’ll achieve an automatic reduction in back-up windows. The software takes care of de-duplication at the source. Imagine telling your network administrator you could reduce traffic by 95 percent through de-dupe! Disk-based back-up solutions have also proven more reliable than tape, especially when implemented with a RAID system on your SAN, which further increases storage integrity and availability through disk redundancy. With disk and de-duplication software, organizations can shorten the nightly back-up process by several hours and reduce the volume of data that has to be transported across the WAN. Combine this with cost and ease-of-use advantages and you’ll see why organizations are increasingly switching to disk back-up solutions.

Disk Backups Provide Optimized Recovery

Disk-based backups should be a critical component of any business disaster recovery plan. Disk backup ensures the rapid and safe recovery of data in the event of a disaster on any scale. Data security has always been a high-priority issue for businesses, but storing data on back-up tapes presents significant risks. Tapes get lost, stolen, and even fall off the backs of trucks. Disk backups are more secure. By storing backup and legacy data in an off-site data vault, you significantly reduce the risks associated with same-site data storage while maintaining fast, reliable access and recovery when required. Disk backup also enables compliance with corporate governance regulations and situations where liability and accountability are vital.

Summary

Tape has, in the past, been an IT firewall, the backstop for data loss in the event of a catastrophic event or disaster. There is still a need for recovery using tape in various tiers of your IT infrastructure. It shows little sign of giving up this position to spinning disk anytime soon. But the need to achieve measurable daily results and maintain greater data protection options has led to a current trend toward disk-based backups being adopted among small, medium, and large businesses alike. The capabilities, operational benefits, scalability, and cost benefits of disk back-up systems make them something every IT shop should consider. After all, you cannot buy your data on the street after a failure.

Richard Dolewski is chief technology officer and vice president of business continuity services for WTS. Dolewski is a certified systems integration specialist and disaster recovery planner and is globally recognized as a subject matter expert for business continuity for IBM iSeries and i5 environments. An author and frequent technical contributor, Dolewski is a winner of numerous speaking awards including COMMON’s Impact Award and is a member of COMMON’s Speaker Hall of Fame. Dolewski’s book titled “System i Disaster Recovery Planning” is available from MC Press and Amazon.com.

While I was playing with my 3-year-old son, he had a minor meltdown when it was bedtime. He wanted to stay up past his normal bedtime to watch Team Umizoomi on TV. With the show recorded on DVR, missing the time slot was not the problem, but he wasn’t having any of that. We compromised and watched five minutes with an agreement we’d watch the rest another time. Preparing information for this article reminded me that, like 3-year-olds, data centers can also have meltdowns – though the impact is a bit more off-putting.

Modern data centers owned by giants such as Google and Facebook are engineering marvels. They combine state-of-the-art electronics with time-proven thermodynamics to ensure the server racks keep their cool. And when they don’t work according to plan, things can go bad very quickly, as Amazon recently found out with the failure of its Web hosting services that took down sites such as Quora, Foursquare, HootSuite, and Reddit.

The Amazon disruption was reportedly caused by a network change that overloaded nodes in Amazon’s Elastic Compute Cloud (EC2), compounded by human error. Scale increases complexity and the potential for failure, and can also attract unwanted attention.

Data centers as well as small- to medium-sized business (SMB) IT departments face daily challenges that could upset their equipment and business operations. Recent surveys show reliability, or uptime, to be the single biggest concern of IT professionals, and because almost every business operation, regardless of size, depends on its servers and telecommunication equipment, the concern is well founded.

To highlight this issue, a recent edition of an IT industry journal focused on critical power and cooling. One article, “Cooling Failures and Power Outages” (Erik Schipper, Data Center Magazine, Issue 3/2011), noted a U.S. survey showing IT systems failures reduce businesses’ ability to generate revenue by 32 percent and create a lag in business even after servers recover. This can translate into anywhere from thousands to millions of dollars in lost revenue.

One of the biggest concerns that can lead to data center, server, and telecom room disasters is temperature control. Modern data centers are massive and can consume almost 5 percent of the output of an average U.S. coal-fired power plant. And with power comes heat – the enemy of modern ICs. Add to that the outside environment, especially in the hot summer months, when the potential to lose power and shut down servers increases significantly. SMBs report that their biggest challenge is that AC systems designed to keep human occupants of office buildings cool are being used for electronic rack temperature management – two significantly different applications and operating conditions. AC systems are notorious for kicking off when the going gets tough, and the IT equipment rooms, former storage closets, and small offices that are not designed to keep up with such challenges are the first to suffer.

The article mentioned above noted that three of the four leading causes of data center outages are uninterruptible power supply (UPS) related failures, all associated with keeping things going when power is lost. The sixth leading cause was heat-related/CRAC failure, with a third of respondents reporting this issue. Data center designers and managers are well aware of these challenges; they have been widely studied.

In one example, Dr. Kishor Khankari, an expert in the industry, presented a paper at the summer 2010 ASHRAE conference in Albuquerque titled “Thermal Mass Availability for Cooling Data Centers during Power Shutdown.” He noted that “during power outage situations servers continue to operate by the power provided by UPS units while the supply of cooling air is completely halted until alternate means of powering the cooling system are activated. During this time servers continue to generate heat, and the server fans continue to circulate room air several times through the servers. This can result in a sharp increase in the room air temperature, bringing it to undesirable levels, which in turn can lead to automatic shutdown of servers and in some cases can even cause thermal damage to servers.”

Large, modern, state-of-the-art data centers generally have ancillary equipment to make sure they are both always online and protected from damage. In one U.S. installation, the data center installed the capacity to generate more than 30 kWh of electrical power. However, it takes some time to start these systems and resume normal cooling operation. Dr. Khankari noted, “It is crucial to understand the rate of temperature rise of the room air during this off-cooling period and how long servers can sustain such a situation without automatic thermal shutdown.”

Dr. Khankari simulated the effects of room height, rack weight, and number of rack rows, among other variables, and found room temperature reached target temperatures of 95°F (35°C) in less than 100 seconds (under 2 minutes) and 125°F (51.7°C) in less than 300 seconds (under 5 minutes). Only where heat load density was ≤100 W/sq.ft. (1076 W/m2) did room temperature stay below the targets for more than five minutes. In most cases, then, data center cooling should be restored in less than five minutes — not a lot of time for facility managers to correct the situation. Conversations with IT professionals suggest failures need not occur only during the incident: weakened electronics can fail with apparent randomness weeks and months after elevated-temperature excursions.
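
Those numbers can be approximated with a crude sensible-heat model: with cooling stopped, the room air (plus whatever thermal mass the racks contribute) absorbs the full IT heat load, so temperature rises at roughly the heat load divided by the total heat capacity. The room size, heat density, and rack thermal mass below are illustrative assumptions, not values from Dr. Khankari’s paper.

```python
# Crude sensible-heat model of room air temperature rise with cooling lost:
# dT/dt ≈ heat_load / total_heat_capacity. All inputs are assumptions.

AIR_DENSITY = 1.2   # kg/m^3
AIR_CP = 1005       # J/(kg*K)

def seconds_to_reach(target_c, start_c, heat_load_w, room_m3, extra_j_per_k=0.0):
    """Seconds for room air to climb from start_c to target_c."""
    capacity = room_m3 * AIR_DENSITY * AIR_CP + extra_j_per_k  # J/K
    return (target_c - start_c) / (heat_load_w / capacity)

room_m3 = 100 * 3.0          # assumed 100 m^2 room, 3 m ceiling
heat_w = 2150 * 100          # ~200 W/sq.ft over 100 m^2 (assumed density)
for target in (35.0, 51.7):  # the paper's 95°F and 125°F thresholds
    t = seconds_to_reach(target, 22.0, heat_w, room_m3, extra_j_per_k=1e6)
    print(f"{target}°C reached in ~{t:.0f} s")
# Prints roughly 80 s and 190 s -- the same order as the paper's findings.
```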

So what can be done? Large, modern data centers are built with a vast network of sensors and back-up systems. Everything from basic temperature and humidity sensing to exotic video and intrusion monitoring systems is installed at the most secure sites. Teams of security personnel and 24/7 monitoring are deployed to try to ensure everything that can prevent system outages is done. Redundant locations and fault-tolerant operating systems are employed to help manage the inevitable problem. In short, millions of dollars are spent each year on disaster prevention to ensure business continuity.

For small and mid-sized businesses that rely on smaller data centers and server or telecom rooms, million-dollar budgets with which to address reliability problems are generally not in the cards. A recent IDP white paper titled “Business Risk and the Mid-size Firm: What Can Be Done to Minimize Disruptions?” described how mid-sized businesses face IT disasters and offered some ideas to help prevent or minimize disruptions. It noted that while executives at these companies worry about natural disasters when they read about floods, tornadoes, or earthquakes, many business-disrupting outages result from causes far less dramatic. Everyday events such as construction crews cutting through a power line, an air conditioning failure, a network provider interruption, or a security issue can take systems offline. These interruptions occur more often than most mid-size business managers expect, and with increasingly critical impact as customers become more accustomed to accessing information and placing orders online.

Disaster Prevention Checklist for SMB Server and Telecommunication Rooms

| What to Look For | What is Required | Relative Cost | Relative Ease of Implementation |
| --- | --- | --- | --- |
| Room and server rack temperature and humidity logs | Add USB, WiFi, or cellular temperature and humidity monitoring devices | Low | Easy |
| Very warm or cool areas | Temperature baseline map | Low | Easy |
| Uneven HVAC balancing | Varies: adjust registers, add additional ducting or cooling zones | Low to Medium | Easy to Medium |
| Inadequate cooling | AC equipment, installers | Medium to High | Medium to Difficult |
| Excessively hot or high-power-usage servers, drives, etc. | Temperature monitoring devices; replacement electronic equipment where needed | Medium to High | Medium to Difficult |
| Inadequate electrical power to IT equipment | Electrical system reconfiguration or upgrade | Medium to High | Medium to Difficult |
| Frequent power outages | UPS(s), installers | Medium to High | Medium to High |

Despite their concerns about being ready to deal with disaster recovery, many respondents within the IDP paper said they feel they cannot prepare for disasters without exceeding their IT budget limits. This is despite research showing such expenditures can reduce costs by more than 35 percent compared with unprepared centers using older technology.

Mindful of the realities of IT budgets, SMBs can take the first steps to disaster prevention. Careful attention to the cooling requirements of electronic racks, especially in older facilities with common AC ducting, can help ensure personnel comfort is not needlessly taxing server and telecommunication rooms. Low-cost, high-performance environmental monitoring systems have been on the market for several years, and they continue to improve. These start with basic temperature and humidity monitoring and allow sensors to be added as they are needed. Implementing a modest sensor network can provide a baseline of operation over time and point to hot spots or poorly ventilated spaces. Corrections such as adjusting airflow, or possibly adding a dedicated cool-air zone for IT rooms, can go a long way toward preventing or quickly responding to problems for short money.
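
As an illustration of what a baseline over time can buy you, here is a hedged sketch of hot-spot detection over logged sensor readings; the sensor names, sample values, and the 3°C threshold are hypothetical.

```python
# Flag sensors reading consistently above the room-wide baseline.
# Sensor names, readings, and the 3°C threshold are hypothetical.
from statistics import mean

readings = {                     # sensor -> recent temperature samples (°C)
    "rack-1-top":   [24.1, 24.5, 24.3],
    "rack-2-top":   [29.8, 30.4, 30.1],   # a potential hot spot
    "room-ambient": [22.0, 22.3, 22.1],
}

baseline = mean(v for samples in readings.values() for v in samples)
for sensor, samples in sorted(readings.items()):
    avg = mean(samples)
    flag = "  <-- investigate airflow" if avg - baseline > 3.0 else ""
    print(f"{sensor:14s} avg {avg:.1f}°C (baseline {baseline:.1f}°C){flag}")
```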

If more is needed, replacing older servers and electronic equipment can help head off unplanned failures and may provide the benefit of lower power usage and heat generation. While such changes require up-front expenditures, two things argue for considering them. First, servers have become better and cheaper over time; the functionality of modern servers is often an order of magnitude greater than older devices. Second, replacement can be implemented over time and in most cases provides a payback period within most companies’ guidelines. For example, changing out the oldest servers first will give the biggest ROI, since they generally use the most power and contribute the most heat. The added benefit is that new servers will generally handle data more efficiently, so it may be possible to have one server take the place of two, more than doubling power savings. And the added reliability will often result in lower maintenance costs, with the additional, often difficult-to-quantify benefit of improved business operations.

Some SMBs may find that adding modern, reliable UPSs can help automate the first response to power loss and give the response team time to act before there is a complete shutdown. Judicious application to the most mission-critical operations can come first, with additional devices added over time. Before undertaking such a purchase, a root-cause analysis of power outages is very helpful for understanding whether power-loss events are caused by some internal system. For example, does power consumption by the air conditioning system during hot, humid summer months tax the facility’s electrical systems, causing voltage fluctuations in the electronic equipment? Better understanding of such linkages can help avoid both the problems themselves and the added expense of systems that mask rather than address them, and may point to other, more cost-effective solutions.

In the final analysis, it’s impossible to prevent data centers from ever being shut down; recent events have shown that even the most sophisticated systems are at risk. In the past two years, Schipper noted, there were outages at Bluehost in Provo, Utah, reportedly due to maintenance at a city-owned substation; cooling system failures at Wikipedia, Level 3 Communications, and Nokia’s Ovi; and a power failure at a Primus data center that took down Westnet, iiNet, Netspace, Internode, and TPG. For the big players, virtualization and cloud services, new server and data center designs, and smaller, cheaper back-up power systems, including green energy sources, will go a long way toward making interruptions less frequent and less costly.

While SMBs may get some relief from back-up power systems that help prevent or minimize problems, those systems cannot compensate for AC outages. Again, SMBs can benefit from basic environmental monitoring and alerting systems, providing themselves with an early warning of potential problems. As summer months approach, electrical demands increase, bringing power interruptions and the associated network outages. Companies can protect themselves from further equipment damage with reliable monitoring systems and well-thought-out response plans. The table above is a checklist of considerations that may be useful in understanding and addressing heat-related issues.

Things will always happen, both bad and good. When bad things get out of hand, we call them disasters. Preparation is one key: plan ahead for the potential disaster scenarios and implement strategies to respond quickly and appropriately. Heading off disasters before they happen is often considered but frequently under-resourced. From an examination of various disaster models, ideas can emerge that give IT professionals advance warning to help prevent, or at least mitigate, some of the worst effects that could occur.

A range of options will emerge, and the cost-effectiveness of each can be evaluated. Like life insurance, it may be impossible to put a hard-and-fast ROI on the benefit, and this is where industry data can help. In the end, balance is essential. Evaluate the risk tolerance of the organization, the ability to implement each option, and the relative effectiveness of each option in preventing or minimizing potential disasters. Patterns will emerge. Some will find they are doing enough today, some will find small holes to fill, and yet others will discover a full range of options to pursue. Nothing is foolproof, and in the end experience will help determine which options the IT and facilities organizations can reasonably implement and maintain.

Harry Schechter is CEO at Temperature@lert in Boston, Mass., where his ideal temperature is PV/nR. Temperature@lert, his third tech startup, offers Schechter the opportunity to combine his greatest talents – technology innovation and problem solving. Schechter received a Bachelor of Science in Applied Science from Miami University and an MBA from MIT’s Sloan Fellows Program where his research focused on commercializing cloud-based sensor systems. For more information, visit www.temperaturealert.com.

This article isn’t your typical “Disaster Recovery (DR) 101” dissertation, but rather a summary of conversations with organizations that made me realize how many companies have not prepared or even planned for DR. Many times DR – and by default, portions of business continuity – is discussed during projects and engagements focused on many aspects of IT, including application migrations, virtualization, and security. Even when it’s not the primary discussion item, I have found that many companies have a distorted view of current DR approaches, including the expectations around the DR architecture. Even those who think they take it seriously aren’t having much success: according to an annual vendor survey, one in four DR tests fail. One can assume the definition of “fail” is pretty extreme in this case.

So without naming clients or specific engagements, here are a few real examples of actual DR discussions over the past few years that made me realize how critical a “Disaster Recovery 101” course could be for many organizations:

‘This is the most critical application we have...’

What I found most interesting about this statement, after diving into the project, was that the environment wasn’t clustered or load-balanced. In fact, the company hadn’t even run a DR test in the past two years due to budget cuts. On top of that, the last data restore was a distant memory. When the conversation turned to required technology refreshes, change control, and acceptable downtime for applications, the answer was a consistent “it can’t be down.” A few weeks later, both a CPU board and a NIC failed. As a result, the downtime was measured in days and hours, not minutes. Further, the DR fail-over didn’t work and actually added to the total downtime. The key take-away: organizations need not only to keep an open mind regarding technology refreshes, but also to test their environments to ensure their DR capabilities will hold up in the event of a disaster.

‘What’s the application RTO and RPO? Blank Stares…’

Recovery time objectives (RTOs) and recovery point objectives (RPOs) are perhaps the most important metrics when architecting a disaster recovery solution. An RTO is the amount of time it takes to recover from a disaster, and an RPO is the amount of data, measured in time, that you can afford to lose in that same disaster. So an RTO of four hours means you will be up and running again within four hours after the disaster has been declared, and an RPO of 12 hours means you will lose no more than the last 12 hours of data preceding the disaster declaration. These two business-driven metrics set the stage for whether you recover from disk or tape, where you recover, and even the size of your recovery infrastructure and staff. If you don’t have the answers to these business questions, then you don’t have the answer to the IT solution.
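
A minimal way to make those two metrics concrete: given a backup schedule, the worst-case RPO is simply the longest gap between recovery points. The schedule, restore estimate, and targets in the sketch below are hypothetical.

```python
# Sanity-check a backup schedule against business RPO/RTO targets.
# The schedule, restore estimate, and targets are hypothetical.

rpo_target_h = 12   # business says: lose at most 12 hours of data
rto_target_h = 4    # business says: be running within 4 hours

backup_interval_h = 24   # nightly backups
restore_estimate_h = 6   # assumed time to rebuild and restore

worst_case_rpo = backup_interval_h  # fail just before the next backup runs
print(f"worst-case RPO {worst_case_rpo} h vs target {rpo_target_h} h:",
      "OK" if worst_case_rpo <= rpo_target_h else "MISSED")
print(f"estimated RTO {restore_estimate_h} h vs target {rto_target_h} h:",
      "OK" if restore_estimate_h <= rto_target_h else "MISSED")
# Both checks fail here: nightly backups cannot deliver a 12-hour RPO,
# and a 6-hour restore cannot deliver a 4-hour RTO. The business metrics
# drive the technology choice, not the other way around.
```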

‘We cannot lose data in a disaster for any reason…’

An RPO of zero requires one of the most complex DR architectures there is. From my point of view, a zero RPO is virtually impossible given the numerous types of disasters that can occur. Whether it’s some type of storage-based replication or an application specifically designed for DR (e.g., two- or three-phase database/middleware commits), some data and/or transactions will be lost. I prefer to call this requirement a “near-zero RPO” so that proper expectations are set with upper management. I understand that there are those special applications (e.g., financial/stock-market) that really can’t lose anything at any time. However, that’s the 0.01 percent exception and an example where an extreme amount of time, resources, and money has been sunk into DR, both initially and ongoing. While losing some data during a disaster is mostly unavoidable, the amount lost can be controlled by properly implementing up-to-date technologies and testing your disaster recovery capabilities.

‘Our DR site is across the street.’

Really what this says to me is, “My business continuity site is across the street.” It is critical that a proper DR facility be situated far away from your primary data center. While some pundits may differ on the specifics of “where,” a good rule of thumb is more than 1,000 miles away, which isolates the DR site from a majority of the potential disasters that could affect the primary site (e.g., power, weather, etc.) and pretty much ensures that the same type of disaster won’t affect both sites simultaneously or within a short period of time. With only a handful of power grids in the U.S., tornados and hurricanes recently affecting numerous states, and even biological and other threats that could lock down enormous geographical areas (ruling out travel entirely), this is a recommendation I make strongly.

During some data center consolidation projects (e.g. reducing the overall number of data centers that a company operates), the location of the DR site becomes even more critical. A typical response when bringing up the issues of keeping your DR site too close to your data center is “We like our DR site close by so we can get staff quickly to the facility.” That’s an approach that is prone to failure. A remote site, at or around 1,000 miles away, eliminates this issue and helps prevent significant downtime. Do you really want to go tell your boss that his $20 million investment in DR failed because both sites were hit at the same time?

‘We must replicate to our DR site synchronously…’

I have rarely seen a true need for synchronous replication in any environment. In some cases, I have seen proprietary applications with non-standard databases that, due to the inability to roll back transactions, could only use synchronous replication for confirmed writes to the DR site. However, given the distance limitations of synchronous replication – beyond 65 miles or so it can cause serious latency and I/O performance issues – the target is really a business continuity site, not a DR site. The bandwidth issues associated with synchronous replication are easy – just throw more money into the project. The latency issue, however, is tough, and the industry is still trying to figure out the best way to tackle it.
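
The distance limitation follows directly from the speed of light in fiber: every synchronous write must wait for a round trip to the DR site before it is acknowledged. A rough sketch follows; the fiber route lengths and per-write equipment overhead are assumptions.

```python
# Added latency per synchronous write, from round-trip light travel time.
# Fiber path lengths and the equipment overhead figure are assumptions.

LIGHT_IN_FIBER_KM_S = 200_000     # ~2/3 the speed of light in a vacuum

def sync_write_penalty_ms(route_km, overhead_ms=0.2):
    round_trip_s = 2 * route_km / LIGHT_IN_FIBER_KM_S
    return round_trip_s * 1000 + overhead_ms

for miles in (65, 300, 1000):
    km = miles * 1.609
    print(f"{miles:5d} mi: +{sync_write_penalty_ms(km):.2f} ms per write")
# At ~65 miles the penalty is ~1 ms per write -- tolerable for many apps.
# At 1,000 miles it is >16 ms on *every* write, which is why synchronous
# replication effectively caps the distance to a nearby site.
```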

‘I can declare a disaster anytime I need to…’

The larger the company/corporation, the more nebulous the concept of who can declare a disaster and when. Remember that the clock starts ticking on your RTO once the disaster is declared as opposed to when the disaster occurs. When a tornado hits your data center at 2 a.m. on a Sunday, declaration is probably after the event. When the hurricane is bearing down and has now registered a category 4, declaration is probably before (hopefully, well before) the event.

Even more complicated is the discussion about a disaster affecting an entire data center versus “application level” disaster declarations. In this situation, the site might be fine, but a single application can declare its own disaster and fail over to another site for any number of reasons. It’s fair to note that the former is much less complex than the latter. When taking into account things like internal and external IP addressing and DNS, standard operating procedures after “day one” such as backups and monitoring, and a myriad of other ongoing issues, you need to ask yourself: “Do I know where my application is today? And do I know how it is performing?”

In Conclusion

While every organization has a slightly different take on DR, after seeing the investment and how seriously it’s taken, or not, I have seen too many companies waste time, effort, and resources on a DR approach that is not sound or well constructed. Remember that, according to a recent Global Disaster Recovery Preparedness survey of 250 Disaster Recovery Journal readers conducted with Forrester Research, more than a quarter of respondents declared some type of disaster in the past five years. Think of that – one in four of your peers has declared a disaster within the past five years. So, if you find yourself questioning your DR approach, it may be time to slow down and launch a concerted effort to get back to DR basics.

Bill Peldzus is vice president at GlassHouse Technologies, specializing in strategy and development of data center, business continuity, and disaster recovery services. Peldzus brings more than 25 years of experience working in technical and leadership positions at Imation Corporation’s Storage Professional Services and StorageTek’s SAN Operations business group, as well as running multiple IT groups at CNA Insurance and Northern Trust Company. Peldzus often serves as a content expert, keynote speaker, and author in numerous IT areas of specialty.

Last year's devastating events in Japan were yet another reminder of the vulnerability of our business operations in general and most specifically our supply chains. While the full impact on organizations worldwide may never be known, there are lessons to be learned.

It is a fortunate truth that few organizations will ever experience the combined direct impact of a black swan event like Japan's estimated 9.0 earthquake of March 11, 2011, the fourth largest in the world since 1900. Rarer still was the resulting domino effect, a sequence of multiple events occurring in a relatively short period of time … an earthquake, a tsunami, and nuclear power plant damage and shutdowns.

Widely diverse industries including automotive, computers, electronic components, industrial equipment, steel, textiles, and processed foods felt the effects. A partial list of Japanese and global businesses experiencing direct damage and/or significant supply chain interruptions includes some of the world's biggest and best known companies: Apple, Canon Inc., Cosmo Oil Company, Fujitsu Ltd., Honda, Mazda, Mitsubishi, Nissan Motor Co., Oriental Land Co. (Disney), Panasonic Corp., Sapporo Breweries Ltd., Sony Corp., Sumitomo Metal Industries, Subaru, Suzuki, TDK, Texas Instruments, Tohoku Electric Power Co., Tokyo Electric Power Co., Tokyo Gas Company, Toshiba Corp., Toyota Motor Corp., and Vopak.

Supply chain disruptions were far-reaching with transportation companies, suppliers, tier suppliers, supply chain strands, outsourcing companies, contractors, and customers all feeling the impact.

It is not only the international behemoths that suffered resulting losses, though they are the expected focus of most media reports. Coeli Carr's article "A Sake Story," (Portfolio.com March 17, 2011 http://www.portfolio.com/business-news/2011/03/17/importers-worried-about-sake-business-following-japan-earthquake/#ixzz1VrZwLETJ) reported the disruption and uncertainty for a U.S. importer of Japanese sake and the related impacts on U.S. importers, distributors, retailers, and restaurants, as well as the Japanese distribution companies, sake breweries, and rice growers.

As worldwide attention to the continuing consequences wanes, heightened concerns about nuclear power in Japan are raising questions about whether some or all of Japan's 54 reactors could go offline. According to Reuters, with nuclear power plants providing one-third of its power, "Japan could face a power shortage of more than 9 percent next summer if all its nuclear reactors are shut down, media reported Wednesday citing government estimates."

No one invests in earthquake preparedness and mitigation like the Japanese. A system of bullet train trackside seismometers detects a coming earthquake and activates train operation controls, issuing a shutdown command that slows trains or brings them to a rapid halt based on earthquake acceleration levels. Reports are that 27 bullet trains running in the affected areas on March 11 avoided derailment by applying emergency braking 9 seconds before the shaking began and 70 seconds prior to the most violent tremors.

Infrastructure equipment has been strengthened to withstand earthquake motion. Japan's undersea cables remained mostly intact, allowing much-needed Internet communication and response in many areas.

Building codes in Japan are among the world's most stringent. Structural mitigation includes buildings with fluid-filled shock absorbers, Teflon foundation pads, reinforced walls and foundations, and ongoing increased engineering and earthquake-proof designs. Seawalls, considered a blot on the landscape by many, protect the coast from tsunami waves.

An earthquake warning system notifies officials and the public via phone messaging as well as traditional media. There is even an earthquake warning app that sends out an alert to let subscribers know when a quake is coming, where the epicenter is located, and how bad the shaking is expected to be. A billion-dollar system, a network of more than a thousand GPS-based sensors, provides tsunami warnings that allow several minutes to evacuate before waves start hitting.

But March 11 happened, yet another reminder that we cannot prevent or fully mitigate all disasters. So what are we to learn? What are the lessons we can take from this latest disaster?

Definitions of "lesson" include: An experience, example, or observation that imparts beneficial new knowledge or wisdom; the knowledge or wisdom so acquired. Ergo, a lesson is knowledge or understanding gained by experience, whether that is a negative or a positive experience, our own experience or the experience of another. Our lessons may address information that is relevant and applicable or perhaps simply of interest. Not acting on the knowledge gained is not learning the lesson, only observing the lesson.

As an example, let's consider a person who, while studying the physics of electricity and metal objects, observes that a steel rod raised in the air during a storm may be struck by lightning. If that same individual is struck by lightning while playing a round of golf the following weekend, did he actually learn the lesson and take preventative action, or simply observe the lesson and nevertheless put himself at risk? Observing the lesson: knowing that the structure of metals makes them good conductors of electricity and therefore at risk for lightning strikes. Learning the lesson: never playing golf during a lightning storm.

To move beyond simply observing lessons from the events in Japan to learning the lessons, here are seven business continuity practices to act upon.

  1. Be proactive
  2. Build in backups and redundancy
  3. Fully include the supply chain
  4. Assess and measure continuity program effectiveness
  5. Address people's considerations
  6. Develop continuity partnerships
  7. Strive for progress, improvement, maturity

1. Be Proactive. It's as true today as it has ever been. The best return on investment is funds spent before the disaster. Success of recovery is directly related to the quality of the planning, the training provided, and the testing that occurs before the disaster.

2. Build in Backups and Redundancy. That includes people, facilities, equipment, data, suppliers, service providers, transport companies, and outsourcing companies.

3. Include the Supply Chain in Business Continuity Planning. Every link, in every phase of the planning lifecycle. Today's global supply chains are more susceptible than ever before to risks that encompass everything from power outages to pirate attacks. Conduct a post-March 11 supply chain review and reassess the risks for all supply chain links and touch points … internal, upstream, and downstream:

  • Manufacturing
  • Suppliers: raw materials, components, parts
  • Outsourcing companies
  • Service providers
  • Transportation services
  • Distribution services
  • Warehouses
  • Retailers
  • Wholesalers
  • Supporting technology
  • Utility service providers
  • Contractors

Based on the review, identify and implement needed mitigation actions and update continuity strategies and plans.

Beyond the traditional criteria of quality and price, consider business continuity capability in the supplier selection process. Select low-risk suppliers and identify those that may be high-risk by asking relevant questions when considering new suppliers and before renewing contracts with current suppliers (a simple scoring sketch follows the list below).

  • Do they have a business continuity program?
  • Are we aware of their vulnerabilities?
  • How transparent are their operations?
  • How critical is the product or service?
  • Is this a single point of failure?
  • Do we have workable alternatives?
  • Will their security measures protect our data and intellectual property?
  • Will they jeopardize any regulatory or legal requirements?
  • Can they create liability issues for us?
  • How financially healthy is the company?
  • For current suppliers: Have we experienced deteriorating service levels?

4. Measure and Manage. Continue to assess Business Continuity Programs to measure program effectiveness, yours and that of suppliers and other business partners. Include continuity in selection metrics and scorecards, and continue to monitor after the selection is made.

5. Address People Considerations. With every major disaster we are reminded of the importance of people and how they truly are an organization's most important asset.

  • Cross-train for critical functions and skill sets
  • Develop communication plans and capability
  • Document detailed operating procedures to allow back-up personnel to carry out critical business functions when necessary
  • Encourage employee home and family disaster preparedness

6. Develop Continuity Partnerships. Move beyond the traditional customer-supplier us vs. them relationship mentality. Develop collaborative, mutually beneficial partnerships with suppliers, outsourcing companies, service providers, and the rest of your business partners.

  • Communicate; let them know what your expectations are by sharing business continuity standards and policies
  • Provide sources for needed business continuity training or coaching
  • Present supplier business continuity workshops to provide guidance in developing their program
  • Include suppliers, contractors, and more in your continuity training sessions and exercises
  • Work together to develop collaborative continuity strategies and solutions
  • Be open to their suggestions and ideas for developing continuity capability
  • Identify and offer corrective actions to reduce risks

7. Progress, Improvement, Maturity. Things change, and old solutions don't always work for new problems.

  • Keep up-to-date through performance monitoring, tests, and benchmarking
  • Ensure that operational changes and new products, processes, and acquisitions are added to plans
  • Continually reassess internal and external risks; adjust as necessary
  • Use advanced supply-chain tools for a complete risk assessment

Are there lessons to be learned from Japan? Were there lessons to be learned from the Eyjafjallajokull ash cloud when air travel disruptions were the highest since World War II and resulted in losses in the billions with impacts to companies in Africa, Asia, Europe, and North America?

Were there lessons to be learned from Katrina in 2005, when continuity team members and critical staff were not available or able to reach their assigned locations, cargo shipments were halted or limited, and recovery time estimates were found to be far too optimistic?

Were there lessons to be learned from the 2001 terrorist attacks, such as the need to conduct regular reviews, updates, and tests of plans; provide cross-training to ensure backups for continuity team members; plan for the potential of wide-area disasters, loss of multiple sites, and long-term recovery; build in communication redundancies; and plan for significant transportation interruptions?

The answer to each of these questions is a resounding, "YES." Every disaster presents us with lessons and two options.

Lesson Observed + No Change = Lesson Observed
Lesson Observed + Positive Change = Lesson Learned

The first is to observe the lessons, take no positive action, move on, and hope ours is not the company struck by lightning. The second is to observe the lessons, make positive changes, and be better prepared for the next storm.

Make a commitment to learn the lessons of the tragedy in Japan. Start today and implement positive changes before the lessons that were observed begin to fade.

Betty A. Kildow, CBCP, FBCI, has been a business continuity consultant for two decades, working with a broad range of companies and organizations in the development and implementation of tailored programs to manage risk. Kildow has been a strong proponent of supply chain business continuity. She is also the author of "A Supply Chain Management Guide to Business Continuity" (AMACOM 2011).