Where are your IT users?
In the past, to achieve the required level of performance at an acceptable cost, the users of corporate information were usually co-located with most of the data they needed. Chances were that a major disaster that affected the data also affected the users of that data, or the facilities and infrastructure they used to access it. The focus of disaster planning, therefore, was on a complete recovery solution for facilities, infrastructure and personnel. For recovery of data, the focus was the extent of data loss, called the recovery point objective (RPO). Restoring a server from a stale nightly tape backup, made many hours before the disaster, could take hours or days. Transactions that occurred after the backup might also have to be recovered manually.
Can local solutions solve remote problems?
Minimizing data loss requires continuous, online backup of data to a secondary system at a distant location, and a variety of products emerged to address this need.
Many of these systems simply tried to extend local redundancy techniques, such as disk mirroring, and relied on proprietary hardware. Because they adapted techniques designed for local protection to the problem of distributed data redundancy, many of these solutions required significant amounts of dedicated bandwidth and imposed significant performance penalties on the production application in order to allow the remote system to keep up. Other technologies took a fresh look at the problem and developed innovative solutions specifically designed for distributed data replication. (See Figure 1)
Internet connectivity changes the problem
Today, inexpensive wide area bandwidth and global Internet connectivity allow users to access the data they need from virtually anywhere, at any time. As a result, network downtime for any reason and at any time can affect users located around the globe.
In terms of disaster planning, this shift changed both the problem and the potential solutions completely. A site disaster such as a fire or flood, or a regional disaster such as an earthquake or hurricane, that destroys or prevents access to a critical server may now leave hundreds or thousands of distributed users, unaffected locally by the disaster, without access to the application. Also, with the rise in Internet and extranet applications, it is not only internal users who rely on this data but partners and customers as well. The focus has shifted from RPO alone to total recovery time, with the ideal scenario often being nearly immediate resumption of the application, with up-to-the-minute data, at an alternate location. With today's networking technology and intelligent applications, requests can be easily rerouted from one location to another.
Internet connectivity changes the solution
The bright side is that these same changes in technology have also made continuous off-site data vaulting, even with fast failover capabilities, available to almost any organization at cost-effective levels. With the cost of wide area bandwidth and storage both plummeting, what company cannot cost-justify a standby server hosted at another company location, or even co-located at an ISP, VAR or other service provider, to protect its most critical data? For some organizations, it could be as simple as a spare server at an employee's home in the suburbs, with a DSL connection to continuously receive updates as they occur. For others, a DSL connection to a spare server at another company location or co-located at an ISP provides a very inexpensive way of protecting corporate data that was never before possible.
But these solutions also scale up easily. A major east coast bank, for example, uses such technology to replicate the data from 24 branch locations to a secure, disaster-hardened data center using standard commercial servers, software and fractional T1 lines.
How much can you afford to lose?
Finding the right solution starts by carefully defining your specific requirements, which may vary from one application to another. How many users access the application and from what locations? What is the cost of downtime per minute or per hour? How re-creatable is the data? If your last backup finished at 6am and you had a disaster at 5pm, could you re-create the data created or modified during that period? What would it cost and how long would it take? What is the cost of that lost data?
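As a rough illustration of these questions, the short Python sketch below turns the 6am backup and 5pm disaster example into an estimate. The function name and all of the cost figures are hypothetical placeholders, not numbers from any real organization; substitute your own.

```python
def loss_exposure(last_backup_hour, disaster_hour, recreate_cost_per_hour,
                  restore_hours, downtime_cost_per_hour):
    """Return (hours of data at risk, estimated total cost).

    Hours use a 24-hour clock within a single day. All cost inputs are
    assumptions supplied by the caller.
    """
    # Work done since the last backup is what would have to be re-created.
    hours_at_risk = disaster_hour - last_backup_hour
    recreate_cost = hours_at_risk * recreate_cost_per_hour
    # Users are idle while the server is restored from the backup.
    downtime_cost = restore_hours * downtime_cost_per_hour
    return hours_at_risk, recreate_cost + downtime_cost


if __name__ == "__main__":
    # Backup finished at 6am, disaster at 5pm; the dollar figures are made up.
    hours, cost = loss_exposure(6, 17, recreate_cost_per_hour=500.0,
                                restore_hours=8.0,
                                downtime_cost_per_hour=2000.0)
    print(hours, "hours of changes at risk, estimated cost", cost)
```

With these placeholder inputs, 11 hours of changes are at risk. Running the same calculation for each application makes it easier to compare the cost of a loss against the cost of protecting against it.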
What is the price of protection?
In an ideal world, you would have no data loss under any circumstance, not even a single in-process transaction, with zero impact on the user. Achieving that is difficult. It requires expensive, specialized solutions and imposes additional overhead on the application, reducing overall performance. Maintaining 'zero loss' requires the use of a two-phase commit approach, where every transaction must be written and acknowledged in multiple locations before the next task can proceed. The performance penalty for this approach is significant and grows with the distance between the systems, because communication is limited by the speed of light in the very best case, and is significantly slower in the real world once network protocol and routing latency are factored in. A similar problem occurs with synchronous disk mirroring technologies across a distance, where each write to a disk block needs to be written to the target drive and acknowledged, then written to the source drive, and committed to both before any subsequent read or write I/O can be processed by the disk subsystem. (See Figure 2)
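A back-of-the-envelope calculation shows why distance matters so much. The Python sketch below assumes that a signal travels through optical fiber at roughly 200,000 km per second (about two-thirds of the speed of light in a vacuum) and that each write must wait for one full round trip before the next can start. The function names and distances are illustrative only, and real networks add protocol and routing latency on top of this best case.

```python
# Assumption: signal speed in optical fiber, roughly two-thirds of c.
FIBER_KM_PER_SECOND = 200_000.0


def min_round_trip_ms(distance_km):
    """Best-case round-trip time in milliseconds over a fiber of this length."""
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def max_serialized_writes_per_second(distance_km):
    """Upper bound on writes per second if every write waits for an acknowledgment."""
    return 1000 / min_round_trip_ms(distance_km)


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(km, "km:",
              round(min_round_trip_ms(km), 3), "ms round trip,",
              round(max_serialized_writes_per_second(km)), "writes/s at most")
```

Under these assumptions, a target 1,000 km away costs at least 10 ms per acknowledged write, which caps a strictly serialized stream at about 100 writes per second no matter how fast the servers or disks are.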
The alternative to this synchronous approach is to buffer the changes and transmit them as fast as the available bandwidth will allow. As long as the available bandwidth is equal to or greater than the rate of data change, data will be transmitted and applied almost instantaneously, providing 'near zero' data loss without the round-trip delay overhead associated with synchronous mirroring or two-phase commit. During peaks of activity, if the rate of data change temporarily exceeds the available bandwidth, seconds or even minutes of changes could be queued, waiting to be transmitted to the target system. Because these changes have not yet been sent offsite, they could be lost in the event of a catastrophic failure, which is not possible with a synchronous system as described. On the other hand, those same transactions might never have occurred at all in the synchronous model, since everything would have been slowed down to the rate of data transmission to allow the mirroring to keep up.
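The relationship between change rate, bandwidth and queued data can be sketched with a few lines of Python. This is a simplified model, not a description of any particular product: it assumes one sample per second, a fixed link speed, and made-up rates in kilobytes per second.

```python
def backlog_over_time(change_rates_kb, bandwidth_kb):
    """Return the queued (unsent) kilobytes at the end of each second.

    change_rates_kb: kilobytes of changes produced in each one-second interval.
    bandwidth_kb:    kilobytes the link can transmit per second (assumed constant).
    """
    backlog = 0
    history = []
    for produced in change_rates_kb:
        # New changes join the queue; the link drains up to its capacity.
        backlog = max(0, backlog + produced - bandwidth_kb)
        history.append(backlog)
    return history


def seconds_to_drain(backlog_kb, bandwidth_kb):
    """Time needed to send the queued data if no new changes arrive."""
    return backlog_kb / bandwidth_kb


if __name__ == "__main__":
    # Hypothetical workload: quiet, a two-second burst, then quiet again.
    rates = [50, 50, 200, 200, 50, 50]
    print(backlog_over_time(rates, bandwidth_kb=100))
```

In this hypothetical run the queue stays empty while the change rate is below the link speed, grows to 200 KB during the burst, and then drains once activity drops. The data sitting in that queue at any moment is what would be lost if the source failed at that instant.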
Other technology differences to consider include where the changes are captured, what kinds of changes are visible and what level of object filtering is available. For example, a mirroring system at the disk subsystem level has no knowledge of the application or the operating system (OS); it simply sees disk drive commands and disk blocks. Consequently, the minimum amount of data that can be sent is a full disk block, perhaps anywhere from 4K to 64K, for every change. Filtering and selection are usually at the partition or disk level. Application level replication, for example within a database using 'triggers', can capture individual transactions or SQL level commands but uses application resources (CPU, memory, etc.) to execute. Filtering can be done down to individual database tables. Other products operate between the application and the disk, which gives them the ability to filter down to the file level while still seeing exactly what the application is writing to the file system before it is converted into I/O data blocks on the drive. If a database transaction writes 2 bytes to the transaction logs and then 4 bytes to the database, that same sequence of steps can be captured and transmitted exactly to the offsite recovery server.
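The difference in data volume is easy to quantify. The Python sketch below compares the two capture points for the 2-byte and 4-byte writes in the example above, assuming a 4K block size and ignoring any protocol or header overhead. The file names and offsets are invented for the illustration.

```python
def blocks_spanned(offset, length, block_size):
    """Number of disk blocks touched by a write of `length` bytes at `offset`."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


def block_level_bytes(writes, block_size=4096):
    """Payload if every write sends each full block it touches."""
    return sum(blocks_spanned(offset, length, block_size) * block_size
               for _, offset, length in writes)


def file_level_bytes(writes):
    """Payload if only the bytes the application actually wrote are sent."""
    return sum(length for _, _, length in writes)


if __name__ == "__main__":
    # (file name, byte offset, length): hypothetical names and offsets.
    writes = [("txlog", 0, 2), ("database", 8192, 4)]
    print("block level:", block_level_bytes(writes), "bytes")
    print("file level: ", file_level_bytes(writes), "bytes")
```

With a 4K block, the two small writes become 8,192 bytes of payload at the block level versus 6 bytes at the file system level, and the gap widens as the block size grows toward 64K.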
Data integrity vs. currency
This raises the next critical point to consider: data integrity, or consistency. With either continuous replication or periodic backups, it is important to make sure that the backup image is valid at any time, representing a snapshot of the original system at a given point in time. For either approach, this means having access to files even while they are in use. For continuous replication solutions, it also means ensuring that all of the data on the target system always represents a specific point in time from the source. Skipping any transactions or applying any transactions out of sequence can jeopardize this point-in-time consistency. One way this can occur is by using an adaptive replication process instead of a time-sequenced queue. With an adaptive process, if a change cannot be transmitted immediately, the changed data itself is not stored in a sequenced first-in, first-out queue; only a pointer to the changed block is stored, and the data is re-read from disk when it is ready to be transmitted. The problem occurs when the data has since been changed again on disk, perhaps several times, or even deleted so that it cannot be read at all when needed for transmission. As soon as this occurs, you have a combination of data on the target that never existed on the source, which may cause data corruption, particularly with databases.
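A toy simulation makes the failure mode concrete. The Python sketch below is a deliberately simplified model, not the behavior of any specific product: it treats the disk as a dictionary of two blocks, A and B, and assumes that all three writes land on the source while the link is stalled, so the pointer-based sender re-reads the latest contents when transmission resumes.

```python
def source_history(initial, writes):
    """Every state the source passed through, in order."""
    state = dict(initial)
    history = [dict(state)]
    for block, value in writes:
        state[block] = value
        history.append(dict(state))
    return history


def replay_fifo(initial, writes):
    """Sequenced queue: each entry carries the data as it was written."""
    target = dict(initial)
    states = []
    for block, value in writes:
        target[block] = value
        states.append(dict(target))
    return states


def replay_pointer(initial, writes):
    """Adaptive queue: entries hold only the block name; data is re-read at send time."""
    source = dict(initial)
    for block, value in writes:      # assumption: all writes finish before sending resumes
        source[block] = value
    target = dict(initial)
    states = []
    for block, _ in writes:
        target[block] = source[block]  # sends the current contents, not the original write
        states.append(dict(target))
    return states


if __name__ == "__main__":
    initial = {"A": 0, "B": 0}
    writes = [("A", 1), ("B", 1), ("A", 2)]
    history = source_history(initial, writes)
    for name, replay in (("fifo", replay_fifo), ("pointer", replay_pointer)):
        bad = [s for s in replay(initial, writes) if s not in history]
        print(name, "states that never existed on the source:", bad)
```

In this model the sequenced queue only ever leaves the target in states the source actually passed through. The pointer-based replay briefly produces A=2 alongside B=0, a combination the source never held; a failure at that moment would leave the recovery server with inconsistent data.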
Scalability and flexibility
Finally, it is important to consider future changes and how the solution will adapt to increasing loads, future network technologies, or OS and hardware technology changes. Does the solution require the same server or storage hardware on both the source and target, or can you mix and match? Does the connection for replication use open network standards such as TCP/IP natively, or does it require protocol conversions or dedicated links? The bottom line is this: will the solution allow you to take advantage of the latest hardware and network technologies, or are you locked in to what is available today? In general, software-based solutions offer greater flexibility and are more easily upgraded and expanded than proprietary hardware-based solutions. Some software solutions offer consistent product functionality on multiple server OS platforms, reducing training and support costs and allowing for centralized management and monitoring of multiple systems by a single administrator, even across different platforms.
In summary, recent trends toward distributed, Internet-based computing have both increased the need for continuous offsite data protection and made it more practical and cost efficient. More and more industries are seeing legal mandates requiring such protection. For others, a lack of attention to data availability and protection could be seen as negligence, since the tools and techniques are mature and readily available at very reasonable cost. The same technologies that allow distributed access to data can easily be extended to provide distributed data redundancy and protection.
Don Beeler is President and CEO of NSI Software, the developer of Double-Take data replication software for UNIX, Windows 2000/NT and NetWare and GeoCluster distributed clustering software. NSI holds patents on over 20 technologies related to real-time data backup and replication.