Obviously, people are the most important element of any organization, but without the supporting tools and technology, even the most talented of people will not be able to meet the demands of today’s competitive, complex and fast-paced business climate.
The “mean time to recover” is a term well known to any business continuity professional. Time is one of the key drivers that define the recovery strategies needed to enable required functionality. The acceptable “mean time to recover” has been shrinking steadily for IT-enabled resources, to the point where some applications simply cannot go down at all. This high-level requirement exists mostly in military, critical infrastructure (power, telephone, etc.) and financial environments today, where outages of even one hour could be catastrophic. More and more, though, availability requirements for businesses outside these key areas are measured in hours, not days or weeks.
For far too many organizations, disaster recovery and business continuity strategies have not kept up with changes in business requirements and technology. Recovering critical computer-based functionality from tape is often unreliable, complex, and time consuming. This is especially true when recovery strategies must support site-based outages, where an entire infrastructure needs to be recreated. Rebuilding and restoring from tape a dynamic IT infrastructure that has taken years to evolve – all in a 24- to 48-hour time frame, and under adverse conditions – is simply unrealistic.
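A rough back-of-the-envelope estimate shows how tight a 24- to 48-hour window can be. The sketch below is illustrative only: the function name `restore_hours`, the 20 TB data volume, the 80 MB/s drive speed, the four drives, and the 60% efficiency factor are all assumptions chosen for the example, not figures from any particular environment.

```python
def restore_hours(data_tb: float, drive_mb_per_s: float, drives: int,
                  efficiency: float = 0.6) -> float:
    """Estimate the hours needed just to stream data back from tape.

    All inputs are hypothetical planning figures. `efficiency` is an assumed
    fudge factor for tape mounts, positioning, and multiplexed backup streams.
    """
    if data_tb < 0 or drive_mb_per_s <= 0 or drives <= 0:
        raise ValueError("data volume must be >= 0; drive speed and count must be > 0")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")

    total_mb = data_tb * 1_000_000                 # decimal terabytes to megabytes
    effective_mb_per_s = drive_mb_per_s * drives * efficiency
    return total_mb / effective_mb_per_s / 3600    # seconds to hours


if __name__ == "__main__":
    hours = restore_hours(data_tb=20, drive_mb_per_s=80, drives=4, efficiency=0.6)
    print(f"Estimated streaming time: {hours:.1f} hours")
```

With these assumed figures, streaming the data alone takes roughly 29 hours. That is before any time is spent acquiring hardware, rebuilding operating systems, recovering the backup infrastructure, or testing applications.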
How Did We Get Here?
The above paragraphs contain information that is common knowledge for some in the business world, and should be understood intimately by anyone in the disaster recovery or business continuity field. Yet the true recovery capabilities and strategies of most organizations fall short of the business objectives that originally drove them.
The sheer growth in the volume of data we are managing has made backup and recovery from tape extremely difficult.
Tape backup is still a critical part of any data management process. Tape backup and restore will continue to play a key role in data recovery, data management, and archiving. But the role of tape backup and restores in business continuity is changing, and in many cases, will be all but eliminated.
The proliferation of “open systems” architectures led the charge away from the proprietary mini- and mainframe-oriented data processing floors, to UNIX- and Windows-based environments. The business world very quickly embraced the nimble and “less expensive” alternatives to the proprietary mainframe-based data centers, moving to a client/server approach to developing new applications, and porting old applications to this new infrastructure.
The Internet, fueled by incredible advancements in networking, has changed the way we do business, and even the way we live. The ease of doing business on the Web coupled with the relatively low cost of deployment has further fueled this growth. Many jobs of old have been replaced by the functionality and speed offered by today’s computer-based solutions.
The technological advancements we have witnessed in the computer industry in the past 10 years are truly staggering. These advancements have touched so many aspects of the industry that nothing is as it was a decade ago. Performance, capacity, bandwidth, scalability, redundancy, manageability, interoperability, and reliability are just some of the key areas where this remarkable industry has made exponential gains. We have gone from measuring our disk storage space in megabytes to gigabytes, terabytes and now petabytes. A single desktop computer can offer CPU performance greater than that contained in an entire glass-house datacentre only 15 years ago. Network bandwidth capabilities have seen similar growth and performance gains, not to mention the impact business has experienced via the proliferation of the Internet.
The sheer volume of data we are dealing with has grown exponentially. The types of data that exist are diverse, often requiring different backup and restore tools and methods. A full system restore from tape, including the operating system, network, applications, and databases, is a very complex and time-consuming task. In fact, to accomplish a full system restore, your strategy for recovering from tape must be considered when designing and building the system in the first place. The way in which backups are done (full backup, differential, incremental, database, etc.) will dictate and limit the way in which they can be used in a recovery. The emphasis in many environments is on speeding up the backup process to fit within specific maintenance windows. Usually, methods that speed up backup (spreading data over multiple tape drives, multiplexing numerous backups on one tape, incremental backups, differential backups, etc.) will slow down the restore process later on. Also, many centralized backup products store data on tape in proprietary formats. This requires recovering the tape backup/restore infrastructure itself before it can be used to recover your actual systems and data – an often overlooked issue.
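The trade-off between backup speed and restore effort can be made concrete with a small sketch. It assumes the common definitions that a differential backup holds every change since the last full backup, and an incremental backup holds only the changes since the most recent backup of any kind. Individual backup products may define these differently. The `Backup` class and `restore_chain` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int    # day on which the backup was taken
    kind: str   # "full", "differential", or "incremental"


def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups that must be restored, in order, to reach the latest state.

    Assumed semantics (not tied to any product): a differential contains all
    changes since the last full; an incremental contains changes since the
    most recent backup of any kind.
    """
    ordered = sorted(backups, key=lambda b: b.day)

    # Start from the most recent full backup.
    full_idx = max((i for i, b in enumerate(ordered) if b.kind == "full"), default=None)
    if full_idx is None:
        raise ValueError("no full backup available; nothing to restore from")
    chain = [ordered[full_idx]]
    later = ordered[full_idx + 1:]

    # The latest differential supersedes everything between it and the full.
    diff_idx = max((i for i, b in enumerate(later) if b.kind == "differential"), default=None)
    if diff_idx is not None:
        chain.append(later[diff_idx])
        later = later[diff_idx + 1:]

    # Every remaining incremental must be applied, one after another.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain


if __name__ == "__main__":
    incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 7)]
    differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 7)]
    print(len(restore_chain(incremental_week)), "backups to restore with incrementals")
    print(len(restore_chain(differential_week)), "backups to restore with differentials")
```

Under these assumptions, a week of nightly incrementals leaves seven backups to locate, mount, and apply in sequence, while a week of nightly differentials leaves two. The incremental schedule is faster to run each night but slower and more fragile to restore.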
Enterprise IT environments evolve over time; they are not purchased as a turnkey solution. In many large computer shops, brand new state-of-the-art systems are sitting beside and inter-operating with 10-year-old legacy systems. These systems, applications and services can evolve over a long period of time, requiring the talents of many people with diverse and unique skills. These environments seem to be in a constant state of flux, growing, changing, and moving. Often, critical components within an infrastructure were implemented years ago by someone no longer with the organization and without proper documentation. This does not pose a problem until something goes wrong. In most isolated cases, existing staff can resolve these problems. But in a disaster scenario, where many elements of a critical application must be recovered simultaneously, there may be a number of these unknown entities that need to be rebuilt. In a trouble-shooting scenario, the number of variables that exist can exponentially affect the time required to restore things to an operational state.
The tools and applications that an end user or customer sees today are typically provided via a combination of servers, operating systems, applications, databases, networks, protocols, and services. The complex sequence of events that culminates in a client receiving an e-mail is rarely seen or understood by most people, not to mention what happens in more complex transaction-based environments containing multiple databases, application servers, networks, and clients. The number of single points of failure that exist in this sequence of events is staggering. The individual parts must work together with great precision to present a single application, and that coordination is both complex and critical.
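The effect of chaining many single points of failure can be illustrated with simple arithmetic. The sketch below assumes that every component in the chain is required and that failures are independent, so the availability of the whole is the product of the parts. The six components and the 99.9% figure are hypothetical, and `chain_availability` and `expected_downtime_hours` are names invented for this example.

```python
import math

HOURS_PER_YEAR = 8760


def chain_availability(component_availabilities) -> float:
    """Availability of a service that needs every component working.

    Assumes independent failures and no redundancy (a pure series chain).
    """
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("each availability must be between 0 and 1")
    return math.prod(values)


def expected_downtime_hours(availability: float) -> float:
    """Convert an availability fraction into expected downtime hours per year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical chain: client, network, application server, database,
    # storage, and a supporting service, each assumed to be 99.9% available.
    chain = [0.999] * 6
    overall = chain_availability(chain)
    print(f"Overall availability: {overall:.4%}")
    print(f"Expected downtime: {expected_downtime_hours(overall):.1f} hours per year")
```

With these assumed numbers, six components that are each down less than nine hours a year combine into a service that is down roughly 52 hours a year.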
The move to “open systems” from “proprietary systems” has further fueled this storm of change. New products and vendors are appearing on – and disappearing from – the landscape on a regular basis. Interoperability (the ability of one product to function with another) was a declaration made to ease fears about integrating a variety of disparate products. The question usually asked of vendors was, “Can your product do X?” and the answer was usually a resounding, “Absolutely!” Perhaps the question should have been, “What does it take for your product to do X?” Implementing and managing a diverse and highly interdependent “open” IT infrastructure can often cost more – in terms of quality, dollars, staff, manageability and time – than a similar solution on a proprietary infrastructure. That said, it is clear that this technology has fueled the creative fires of the industry, injecting new ideas, capabilities and vision – it is here to stay.
The philosophies around emergency preparedness in business have also undergone changes, albeit more subtle ones. The term disaster recovery is used less frequently, in favor of business continuity. The message here is clear – recovery implies that something has been disrupted or stopped, and has been brought back to a functional state. Continuity implies no disruption – a much-preferred condition.
The primary model for vendor-based hot-site services is that of a subscription to a pool of shared resources. That is, the vendors allow a number of subscribers to contract against a shared pool of resources at a defined rate. For example, there may be 30 subscribers to a single resource (be it a mainframe, UNIX server, network, desktop, etc.). Should a disaster occur affecting numerous customers, access to the subscribed-to resources is provided on a first-come, first-served basis. Most large recovery vendors have resources to accommodate more than one customer declaring a disaster at one time, but there is a limit. There are typically no guarantees that you will have access to the equipment, space, and resources you have subscribed to in the event of a disaster.
For subscription-based hot sites and recovery centers, the ability to test your recovery plans is also limited. Typically, a 24-hour annual test window is provided, but this number is contract specific and can be negotiated. More test time usually means more money. The time allotted to testing is often insufficient to fully rebuild and properly test your recovery strategies. Travel often comes into play for recovery personnel, in both disaster and test scenarios. The equipment you will be using is usually shared among many customers, making customization difficult.
Revision levels, change, and currency of the systems can also add to the challenge. Your recovery procedures can be further complicated by the need to allow for variance in the target systems you will be recovering. A key point here is a lack of control of the target recovery environment. These are difficult problems to manage in open systems architectures, where supported configuration issues are complex and somewhat dynamic.
The following are just some of the issues and facts that contribute to the challenges of creating and maintaining true disaster recovery and business continuity capabilities:
• Increased reliance on computer systems, data, and infrastructure
• 7 x 24 – a daunting requirement!
• Greatly decreased downtime and maintenance windows
• Requirement for increased application performance in direct conflict with increase in volume of data
• Impacts of outages can affect a number of areas including health and safety, security, financial, regulatory and legal requirements, customer satisfaction, employee satisfaction
• Manual fallback procedures often no longer viable in disaster scenario
• Huge increase in amount and types of data, directly impacting:
    • IT budgets
    • Disk, tape, and physical space requirements
    • Network traffic
    • System requirements (CPU, memory, etc.)
    • Differing storage strategies (SAN, NAS, direct attached, etc.)
    • Management of data
    • Backup and restore times
• Diverse data types, adding to complexity (platforms, versions, databases, applications, networks, etc.)
• Increase in complexity
    • Critical interoperability and compatibility issues
    • Large number of vendors and products
    • FUD (Fear, Uncertainty, Doubt) affects managers forced to select and justify decisions based on vendor marketing
    • Need for thorough, comprehensive, and current documentation greatly increased, yet documentation is typically low on priority list – significant risk
    • Recovery and functionality of single applications rely heavily on large number of disparate systems (servers, databases, clients, network, storage, supporting applications, etc.)
• Increase in number and types of threats to IT
    • Human error (the biggest threat to IT)
    • Increase in malicious acts (terrorism, hackers, viruses, disgruntled employees, etc.)
    • More “eggs in one basket” due to consolidation of data centers
• Rapid degree of change
    • Corporate growth (downsizing, acquisitions and mergers)
    • Technology changes (e.g. SAN, Internet, Voice over IP)
    • Potentially short product lifecycles (quick obsolescence)
    • Market changes require dynamic environments
    • Philosophical changes (e.g. outsourcing, e-business, centralization, de-centralization)
• Some other key elements affecting IT environment
    • Huge growth in number of people employed in IT over a relatively short period of time (dilution of talent)
    • Eroding budget vs. increased expectations
    • Increase in technology vs. decrease in staff
    • Employee turnover
    • Training requirements (rapid obsolescence)
The combination of a number of factors in the industry has given us a false sense of security about our IT infrastructure and its continued well-being. Although it is true that the frequency of system failures and of data and application outages has decreased, the difficulty of recovering via conventional means has increased considerably, and the impact of an outage, should one occur, can be catastrophic.
The biggest threat to the ability to recover from a disaster is a lack of practice and testing of the procedures required to accomplish this most critical and difficult task. One of the key contributors to this misunderstood problem is, oddly enough, the increase in reliability of many of the products we deploy within our infrastructure today; for example, resilient disk storage systems (RAID 5, mirroring, etc.), clustered server systems, database journaling, etc. Although the likelihood of a technical failure can be greatly decreased via high availability (HA) strategies, the ability to recover from an incident that HA cannot guard against has greatly diminished.
In all likelihood, the high availability environment within your organization took a large amount of resources to implement, from people, to dollars, to time. Expecting this type of environment to be recovered in hours without significant resources being directed to this capability prior to an incident is unrealistic.
So, What Now?
In a disaster scenario, you may not just be dealing with the need to fully recover a server from scratch; you will likely be doing so on hardware that differs from the production server you are trying to recover. Further, there may be numerous peripheral and supporting servers and services required to make critical applications functional, not the least of which will be network related. Compatibility and configuration issues can be overwhelming. The difficulty level goes through the roof when doing this under extreme pressure, likely without access to your existing environment, or perhaps even without some of the key personnel who built and managed your production environment.
Leveraging the high availability products, services and techniques available today is key to a viable business continuity strategy. A combination of resilient storage subsystems, clustered system environments, redundant networks, and a host of other HA products and techniques will pave the road to 7x24-hour operation, even in the event of a disaster. The key element in most of these solutions is that two identical and current copies of production data exist in at least two separate locations. This can be accomplished via a combination of redundant (mirrored) disk storage hardware, software and processes. Some organizations have leveraged their existing hot-site service providers to house and maintain infrastructure dedicated to them. This is a viable option for those organizations that do not have a second site where they can stage their own hot-site. These types of strategies will, in most cases, eliminate the need to restore production data from tape in a disaster scenario.
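The idea of keeping two identical and current copies of production data can be sketched in a few lines. This is a toy in-memory model, not how any storage product implements mirroring: the `MirroredVolume` class, its method names, and the use of dictionaries as "sites" are all assumptions made for illustration. It shows synchronous mirroring, where a write is acknowledged only after both sites hold it, along with a simple consistency check and a read that can fall back to the second site.

```python
import hashlib


class MirroredVolume:
    """Toy model of a synchronously mirrored volume across two sites (hypothetical)."""

    def __init__(self) -> None:
        self.primary: dict[int, bytes] = {}     # stands in for storage at site A
        self.secondary: dict[int, bytes] = {}   # stands in for storage at site B

    def write(self, block: int, data: bytes) -> None:
        # Synchronous mirroring: the write returns (is acknowledged) only
        # after both copies have been updated.
        self.primary[block] = data
        self.secondary[block] = data

    def read(self, block: int, primary_available: bool = True) -> bytes:
        # If the primary site is lost, the same data is served from the mirror.
        site = self.primary if primary_available else self.secondary
        return site[block]

    def in_sync(self) -> bool:
        # Compare a digest of each site to confirm the copies are identical.
        return self._digest(self.primary) == self._digest(self.secondary)

    @staticmethod
    def _digest(site: dict[int, bytes]) -> str:
        h = hashlib.sha256()
        for block in sorted(site):
            h.update(block.to_bytes(8, "big"))
            h.update(len(site[block]).to_bytes(8, "big"))
            h.update(site[block])
        return h.hexdigest()


if __name__ == "__main__":
    volume = MirroredVolume()
    volume.write(0, b"customer orders")
    volume.write(1, b"account balances")
    print("Copies identical:", volume.in_sync())
    print("Read after losing primary site:", volume.read(1, primary_available=False))
```

Because the second copy is already current, losing the primary site in this model requires no restore step at all, which is the property that lets mirrored-site strategies avoid tape-based recovery of production data.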
Keeping configurations standard and as simple as possible is an absolute requirement for long-term recovery and availability capabilities. In the complex and interdependent environments of today’s IT, the less customization the better. Further, with the frequent staff changes and typically inadequate documentation, intuitive and standard configurations are more easily supported and quickly recovered.
Change management processes must be implemented, strictly adhered to, and enforced. This must be mandated by the most senior level of management possible. ITIL (Information Technology Infrastructure Library) provides a framework for better managing IT, and includes some best practices with regard to change management.
New technologies such as SANs (storage area networks), fibre channel, networking, high availability disk arrays, and various disk virtualization and management products provide the means to bring customized hot-site solutions to organizations that would not have been able to afford these solutions in the past. In fact, many can’t afford not to implement these types of solutions.
The cost of implementing in-house hot-site solutions has decreased dramatically. The cost per megabyte of storage, along with the relative cost of bandwidth between two sites, is at a level where this type of solution is becoming more palatable financially every day. In fact, the cost of most computer hardware products has been dropping steadily over the past few years. “Throwing hardware” at a problem can often be more cost effective, less time consuming, and less complex than having people develop, implement, maintain, and manage complex recovery solutions.
Leveraging internal test and development environments for disaster recovery purposes is also a viable alternative. The equipment typically used for testing can also be used in a disaster scenario. But be very careful when implementing a multi-function environment such as this. Very strict rules of engagement and change management must be deployed to ensure that one objective does not negatively affect the other. Again, strong management is a key to successfully implementing this type of strategy.
The Bottom Line
The benefits of implementing mirrored site solutions go far beyond mere recovery capabilities. Other benefits can include cost savings, reliability, centralized management, scalability, decreased maintenance windows, performance and security.
In these ever-changing times, we as professionals must be vigilant in protecting the organizations for which we work. Perhaps now more than ever is a good time to review the processes, procedures and tools we employ to guarantee the continued viability of our businesses. It is difficult to keep up with technological changes and advancements, but not investigating the possibilities, and making false assumptions about the viability and affordability of new and better recovery and continuity capabilities, is the biggest risk of all. “We have always done it this way” is not a convincing argument for why we are doing things in a specific way today.
Andre Noenchen currently works for Infostream Technologies Inc. as a senior consultant, specializing in the design and implementation of high availability SAN-based infrastructures.