While regular backups are part of a viable disaster recovery strategy, backups alone do not provide a complete solution. Backing up critical server data into a tape library is an excellent practice, but if the entire building burns, the backups are gone too. Further, a complete disaster recovery solution needs to address the real possibility that the primary computing infrastructure may be unavailable, and another infrastructure may be required on short notice to provide a new computing footprint into which critical applications and data can be recovered. This temporary footprint may be located at another company facility, or provided by a third party. In the latter case, the equipment in question has to undergo a “bare metal restore,” in which all appropriate operating systems and applications are reloaded to a baseline state before any company data can be restored. Together with a good data protection practice, these provisions will help provide a more complete and robust disaster recovery plan.
To determine exactly how much a viable disaster recovery plan is worth to your business, you need a thorough understanding of the value of the company’s critical applications and data stores, and the infrastructure required to support them. Next, a protection strategy must be developed which prioritizes these assets relative to their business importance. Remember, these strategies must accommodate different levels of severity: a “disaster” may be anything from a lost file or corrupted database, to a rampant computer virus, to a man-made or natural catastrophe that destroys all virtual and physical assets.
Understanding Your Data
A company’s real data growth can vary greatly from industry averages: the problem is, most companies don’t know by how much. Software tools for monitoring data storage capacity and utilization across an enterprise are expensive and hard to find, and most companies don’t have the time or technical expertise to regularly perform such an analysis. Consequently, most companies have widely distributed data stores that are difficult to classify as “mission critical” or “non-mission critical,” and which often lack backup and recovery plans commensurate with their importance to the business.
Before developing a disaster recovery plan, it’s important to understand the recovery requirements for various applications. Resources can then be prioritized appropriately to minimize impact to the business, should a disaster occur. There are two main criteria for prioritizing your critical applications and data:
Speed to Recovery: How long can your organization live without this data or application? What are the effects on the business for each hour of downtime you experience?
Recoverability: What would be the impact if you lost the last hour’s data? The last four hours? The last 24 hours?
Speed to Recovery
Imagine you are an auto manufacturer that relies on a line schedule system to support your manufacturing facilities 24 hours a day. In this case, the impact on the business can be measured in terms of lost production. That is, if you produce $250,000 an hour worth of automobiles, a four-hour outage would cost you a million dollars.
If you are a utilities company whose system outages leave the public without phone service, the business impact of an outage may be measured in terms of loss of customer confidence or potential legal liability, or quantified as service level violations that require monetary compensation to customers.
Or, perhaps you are a clothing retailer with an experimental Web site – which does not yet support any sales or transactions – and it goes down. In this case, you may not experience any significant business impact.
Practically speaking, a company may have a mix of potential consequences. Outages of manufacturing or sales systems run the risk of costing millions of dollars, while outages of static Web content or archived data files may be relatively insignificant. For each of your key business applications, you should (a) prioritize the applications and data stores in your organization relative to each other, and (b) understand the true financial impact over time of the unavailability of that application or data. The following questions will help you assess the consequences of an unplanned outage:
• When does the unavailability of this application/data store significantly impact the business?
• Does this application/data store generate revenue? If so, how much revenue does it generate in a minute, an hour or a day?
• What are the potential dollar losses that would occur if this application/data store were unavailable for an hour?
• What are the intangible losses (e.g., loss of customer confidence) that would occur in the event of unavailability for an hour? A day?
• Are there applications/data stores that you have identified as non-mission critical that could have a greater impact if they were unavailable for a prolonged period of time? (That is, does their criticality escalate?) How long an outage could be tolerated on these systems before it significantly impacts the business?
• How quickly can you recover this application/data store in the event of an outage? Data corruption? A fire? Man-made or natural catastrophe?
• How long did it take you to recover this application/data store in the last actual disaster or disaster recovery test?
• Has a cost and risk analysis already been performed for this application/data store?
• Do you understand how one hour of unavailability impacts the profitability of your company?
• How many customers will choose to deal with another company if your application/data store is down for an hour? Twenty-four hours?
• How will an hour of application downtime impact your production schedule?
• Will you need to send employees home because they cannot continue to work without this application/data store? After what length of asset unavailability will you send them home?
• What replacement infrastructure may be required to restore accessibility to this asset?
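One way to turn the answers to these questions into a working priority list is sketched below in Python. It is only an illustration: the `Application` record, its two fields, and the tolerable-outage figures are hypothetical, and only the $250,000-an-hour figure for the line schedule system comes from the example above. Applications are ordered by how short an outage the business can tolerate, with hourly loss as the tie-breaker.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """Hypothetical record summarizing the answers for one application/data store."""
    name: str
    loss_per_hour: float           # estimated dollar loss per hour of outage
    tolerable_outage_hours: float  # how long the business can live without it


def prioritize(apps):
    """Shortest tolerable outage first; larger hourly loss breaks ties."""
    return sorted(apps, key=lambda a: (a.tolerable_outage_hours, -a.loss_per_hour))


# Illustrative inventory; the tolerable-outage values are assumptions.
apps = [
    Application("experimental web site", 0, 72),
    Application("line schedule system", 250_000, 1),
    Application("archived data files", 0, 168),
]

for app in prioritize(apps):
    print(f"{app.name}: ${app.loss_per_hour:,.0f}/hour, "
          f"tolerates {app.tolerable_outage_hours} hours down")
```

A real assessment would also need to capture intangible losses and escalating criticality, which a single dollar figure does not express.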
Recoverability of Data
There may be some situations where an unplanned outage deteriorates into a completely unrecoverable situation. Therefore, it’s important to prioritize applications not only in terms of how quickly you need to recover them, but also how closely you need to guard against actual data loss.
Let’s look at a real example: after the first World Trade Center bombing in 1993, 43 percent of the businesses that experienced substantial data loss never re-opened. Another 29 percent went out of business within two years. The long-term consequences of the Sept. 11 attacks have yet to fully register, but can already be measured in the billions of dollars. The threat of data loss to your business is real; guard your most critical applications against that threat.
Consider the following questions as you prioritize your applications and seek to understand the financial and business impact of permanent data loss:
• If application/data loss were to occur, would there be a way to recreate that data, for example by re-entering manual work orders or forms? What is the cost of that manual recreation of data?
• What would be the impact of permanent loss of the last hour’s worth of data? The last 24 hours? The last week’s?
• For applications where permanent loss of data appears to have little or no impact on the business, will this information be required at some point in the future? What might this information have been used for and what are the anticipated losses from the inability to access it?
• Are you required by a regulatory agency or stakeholder to make this data available for audit? What are the potential liabilities for not having this data?
The Cost of Downtime
Here are some useful equations to help you calculate the cost of downtime.
Total Business Lost = (Gross Revenue) x (% of Lost Customers due to Outage)
• This equation is targeted at transaction-based organizations that might see customer attrition as an impact of unreliable availability.
Application-Based Business Lost = ((Annual Gross Revenue generated by Application / 365 days/year) / (24 hours/day))
• This formula calculates the per hour impact a specific application’s unavailability has on overall corporate revenue.
Lost Production Capacity = (Number of Units Not Produced) x (Unit Price)
• Used specifically for manufacturing organizations, this calculation determines the lost production due to application unavailability.
Lost Net Revenue = (Number of Units Not Produced) x ((Unit Price) – (Unit Production Cost))
• Again, this formula is primarily for manufacturing organizations and helps quantify the cost of lost production time due to application unavailability.
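The four equations above translate directly into code. The following Python sketch is one possible rendering; the function and parameter names are our own, and percentages are assumed to be passed as fractions (0.05 for 5 percent).

```python
def total_business_lost(gross_revenue, lost_customer_fraction):
    """Gross revenue multiplied by the fraction of customers lost to the outage."""
    return gross_revenue * lost_customer_fraction


def application_revenue_per_hour(annual_app_revenue):
    """Annual revenue generated by the application, spread over 365 days of 24 hours."""
    return annual_app_revenue / 365 / 24


def lost_production_capacity(units_not_produced, unit_price):
    """Sales value of the units that were not produced."""
    return units_not_produced * unit_price


def lost_net_revenue(units_not_produced, unit_price, unit_production_cost):
    """Margin forgone on the units that were not produced."""
    return units_not_produced * (unit_price - unit_production_cost)


# Illustrative figures only: an application generating $8,760,000 a year.
print(application_revenue_per_hour(8_760_000))  # 1000.0 dollars per hour
```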
The following formulas will help you calculate the total cost of a specific application outage:
Total Cost of Recovery = Cost of People Time Lost + Cost of Lost Data + Replacement Infrastructure + Cost of Recovery Services
Cost of People Time Lost = ((Average Time to Recover) x (Average Wage of Users) x (Number of Users))
Cost of Lost Data = ((Gross Revenues / Business Days per Year) x (Percentage of Data Unrecoverable))
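As a rough sketch, the recovery-cost formulas can be written as the Python functions below. The names are ours, the percentage of unrecoverable data is assumed to be passed as a fraction, and the replacement infrastructure and recovery services costs are treated as figures you supply from your own quotes or invoices.

```python
def cost_of_people_time_lost(hours_to_recover, average_hourly_wage, number_of_users):
    """Wages paid to users who cannot work while the application is recovered."""
    return hours_to_recover * average_hourly_wage * number_of_users


def cost_of_lost_data(gross_revenue, business_days_per_year, fraction_unrecoverable):
    """One business day's revenue scaled by the fraction of data that cannot be recovered."""
    return gross_revenue / business_days_per_year * fraction_unrecoverable


def total_cost_of_recovery(people_time, lost_data,
                           replacement_infrastructure, recovery_services):
    """Sum of the four cost components of a specific outage."""
    return people_time + lost_data + replacement_infrastructure + recovery_services


# Illustrative inputs only: 150 users at $10 an hour idle for four hours,
# half a day's data lost, and no infrastructure or services charges.
people = cost_of_people_time_lost(4, 10, 150)
data = cost_of_lost_data(10_000_000, 250, 0.5)
print(total_cost_of_recovery(people, data, 0, 0))
```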
Let’s look at the following example. Company ABC generates gross revenues of $10,000,000 per year from an application:
• 150 employees use the application to generate revenue and their average wage is $10 an hour;
• The average time to recover the data is four hours;
• The business runs 250 days a year.
Downtime Cost Impact
Days in operation: 250
Revenue per day: $40,000
Hours in day: 24
Revenue per hour: $1,667
Hours down: 4
Downtime cost: $6,667
Hours down: 4
Employee downtime hours: 600
Hourly wage per employee: $10
Employee productivity impact: $6,000
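The figures in this example can be reproduced with a few lines of Python. The variable names are ours; the inputs are the Company ABC numbers given above.

```python
# Inputs from the Company ABC example.
annual_revenue = 10_000_000   # dollars per year generated by the application
business_days = 250           # days in operation per year
hours_per_day = 24
hours_down = 4                # average time to recover
employees = 150
hourly_wage = 10              # dollars per hour

# Revenue impact of the outage.
revenue_per_day = annual_revenue / business_days       # 40,000
revenue_per_hour = revenue_per_day / hours_per_day     # about 1,667
downtime_cost = revenue_per_hour * hours_down          # about 6,667

# Employee productivity impact of the outage.
employee_downtime_hours = employees * hours_down            # 600
productivity_impact = employee_downtime_hours * hourly_wage  # 6,000

print(f"Revenue per hour: ${revenue_per_hour:,.0f}")
print(f"Downtime cost: ${downtime_cost:,.0f}")
print(f"Employee productivity impact: ${productivity_impact:,.0f}")
```

Changing `hours_down` shows how quickly both figures grow as recovery time lengthens.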
Downtime Costs By Industry
Downtime costs vary by industry and depend largely on how heavily a company relies on technology and data. The following chart shows the average hourly cost of downtime for a range of industries. Remember, though, that vulnerability to data unavailability and loss is not limited to monetary impact; it also includes loss of customer confidence, liability, and lost current and future business.
Industry Hourly Downtime Costs
Brokerage Operations $6,450,000
Credit Card Sales Authorizations $2,600,000
Financial Institutions $1,495,134
Information Technology $1,344,461
Food/Beverage Processing $804,192
Consumer Products $785,719
Metals/Natural Resources $580,588
Professional Services $532,510
Construction and Engineering $389,601
Hospitality and Travel $330,654
Pay-Per-View TV $150,000
Home Shopping TV $113,000
Catalog Sales $90,000
Airline Reservations $90,000
Tele-Ticket Sales $69,000
Package Shipping $28,000
ATM Fees $14,500
Sources: IT Performance Engineering and Measurement Strategies: Quantifying Performance and Loss, Meta Group, Oct. 2000; Fibre Channel Industry Association.
Data Protection Options
Once you understand the value of each of your applications, both in terms of downtime and data loss, you can begin to assemble the appropriate disaster recovery strategy for your business. This strategy should include provisions for recovering data, applications, and if necessary, the requisite hardware infrastructure. Some typical strategies for data protection follow. Note that any of these strategies may be deployed against one or more of your critical applications, and a mix of strategies may in fact be the best solution for your particular situation.
Regular Backup Regimen
A full discussion of backup methodologies is beyond the scope of this paper, but it’s clear a regular backup regimen is the first line of defense against data loss from unplanned outages. The backup regimen need not be complicated, but it must be followed consistently in order to be effective. Effective backup strategies usually include local and remote copies of data (see below) and some mix of full, incremental and differential data capture. Typically a high density, low cost media such as magnetic tape will be used to retain data for a period of weeks to months, and then the media will be recycled.
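To illustrate how full, incremental and differential captures fit together at restore time, here is a small Python sketch. It rests on the usual definitions, which this paper does not spell out: a differential holds all changes since the last full backup, and an incremental holds changes since the most recent earlier backup. The function name and the day labels are hypothetical.

```python
def restore_chain(backups):
    """Return the backups needed to restore the latest state, in restore order.

    backups is a chronological list of (label, kind) pairs, where kind is
    "full", "incremental" or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available")
    last_full = full_positions[-1]

    chain = [backups[last_full]]
    for backup in backups[last_full + 1:]:
        _, kind = backup
        if kind == "differential":
            # A differential replaces everything captured since the full backup.
            chain = [backups[last_full], backup]
        elif kind == "incremental":
            chain.append(backup)
    return [label for label, _ in chain]


week = [
    ("Sun", "full"),
    ("Mon", "incremental"),
    ("Tue", "incremental"),
    ("Wed", "differential"),
    ("Thu", "incremental"),
]
print(restore_chain(week))  # ['Sun', 'Wed', 'Thu']
```

The trade-off the sketch makes visible is that incrementals are quick to take but lengthen the restore chain, while a differential shortens the chain at the cost of a larger capture.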
Remote Data Mirroring
Remote data mirroring offers the highest levels of availability and business continuance by synchronously (no delay), near-synchronously (minimal delay), or asynchronously (definable delay) replicating data from your on site disk array over a secure network to a hot site facility. In the event of an outage or a disaster, systems may then point to the mirrored copy and continue operations or the primary data store may be restored from the mirror with little or no data loss.
Business Continuance Volumes (BCVs)
BCVs are snapshots of all or part of a disk filesystem that are taken periodically and stored in a separate disk allocation. For example, an online e-tailer may choose to generate a BCV once every hour and maintain at least four BCVs at any point in time. If widespread database corruption occurs, rather than going back to the previous night’s tape backup and losing all of the current day’s transactions, the e-tailer can restore from the most recent BCV in which the corruption does not exist. Essentially, the BCVs provide periodic restore points, much as tape backups do, but at far shorter intervals.
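The e-tailer scenario can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of any vendor's BCV tooling: snapshots are represented only by their timestamps, the dates are invented, and the restore point chosen is the most recent snapshot taken before the corruption, since that one loses the least data.

```python
from collections import deque
from datetime import datetime

RETAINED_BCVS = 4  # the example retains four hourly snapshots


def latest_clean_snapshot(snapshot_times, corruption_time):
    """Return the most recent snapshot taken before the corruption, or None."""
    clean = [t for t in snapshot_times if t < corruption_time]
    return max(clean) if clean else None


# Hourly snapshots from 08:00 to 13:00; the deque discards the oldest
# once four are held, leaving 10:00, 11:00, 12:00 and 13:00.
retained = deque(maxlen=RETAINED_BCVS)
for hour in range(8, 14):
    retained.append(datetime(2002, 1, 15, hour))

# Corruption detected as having started at 11:40: restore from the 11:00 snapshot.
print(latest_clean_snapshot(retained, datetime(2002, 1, 15, 11, 40)))

# Corruption older than every retained snapshot: fall back to tape.
print(latest_clean_snapshot(retained, datetime(2002, 1, 15, 9, 30)))  # None
```

The second call shows why the number of retained snapshots matters: corruption that goes unnoticed for longer than the retention window forces a return to the nightly tape backup.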
Remote Tape Backup
Remote tape backup is simply tape backup done over a point-to-point or VPN connection from your site to a secure, off site facility. This can be the primary backup regimen, or can be performed as an adjunct to on-premises backups. This differs from off site tape archiving in that the assets remain readily accessible at the remote site, rather than being parked on a shelf. Various storage service providers can automate and facilitate this process.
Off Site Tape Archiving
Off site tape archiving provides the least accessible data storage option, but offers a low-cost option for long-term data archiving. Tapes are taken off site by an archiving company and stored in a secure, hardened facility for as many months or years as you specify. Tapes will be delivered back to you if you should need to access the data stored on the tapes. The off site archive is like a bank vault: it keeps your data safe from fire, theft, natural disaster and damage.
Rationale For Outsourcing Data Protection And Disaster Recovery Services
After outlining the disaster recovery strategy that works best for your company, you may want to consider how exactly you will go about implementing the strategy. The decision to develop a disaster recovery plan entails a variety of subsequent tasks that you may or may not have the time and qualified resources to perform, such as:
• Hardware and software evaluation;
• Technology and service provider evaluation;
• Network design and management;
• Infrastructure integration and installation;
• Disaster recovery plan maintenance and monitoring procedures;
• Business continuance procedures;
• Data restore procedures;
• Disaster recovery plan test procedures and auditing schedule;
• Test/audit results documentation;
• Periodic disaster recovery plan validation to ensure the plan remains in line with company requirements.
Given the breadth of these tasks, and the expertise they require, the outsourcing of the design, implementation, management, maintenance and monitoring of a data protection and disaster recovery practice is a reasonable solution to the DR question. Storage service providers with core competencies in these areas rely on their specialized technical acumen, including years of storage and networking expertise, along with specialized software tools for device, network and storage management. Together these core competencies and specialized tools enable storage service providers to achieve the highest levels of availability, performance and data security more efficiently and cost-effectively than customers could otherwise achieve on their own. Further, the outsourcing of these laborious and time-consuming tasks enables customers to re-focus key IT personnel on strategic tasks and business objectives rather than being consumed with maintenance-related chores. Storage service providers will typically provide these services under the terms of a contract and service level agreements that guarantee the availability, security and reliability of data.
Industry research has shown that data backup and disaster recovery practices have typically been perennial sources of difficulty for IT professionals in that they take up a great deal of time each day, they are often boring and cumbersome chores, and they are only visible when something goes wrong. Therefore, outsourcing these perennial headaches is often a very attractive proposition for IT professionals with better things to do. There are several key technical and financial reasons outsourcing of data protection and DR services makes sense:
Technical arguments in favor of outsourcing include: higher backup success rates, better restore SLAs to other departments, standardized reporting of results, centralized command and control of all data protection across the (wide-area) enterprise, single escalation path for support issues, etc.
Financial arguments for outsourcing include: the ability to re-focus or re-deploy some percentage of IT resources, higher resource utilization can defer new purchases, more accurate reporting of resource allocation, better growth prediction and resource planning, centralized billing and accounting information, better accountability of resources at branch offices, etc.
Walt Hinton is the chief technical officer at ManagedStorage International (MSI) and Rob Clements is an integration specialist. MSI is a global provider of complete storage solutions, helping enterprise companies and leading service providers strategically manage storage as a critical corporate resource. For more information, please visit http://www.managedstorage.com.