No matter what, transactional integrity must be maintained. That is to say, if you buy a security from me, it must be removed from my account and added to your account. If either part of this transaction fails (for example, because of a locked record), the entire transaction must fail. Figure 1 shows an example of this process using a transfer from a checking account to a savings account. Because trading floors are rapidly disappearing from the securities markets, and most large firms make their money through the use of “programmed trading” run by computers, there is no manual backup. If the exchange’s computers are down, the exchange itself is down with them.
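As a rough illustration of the all-or-nothing behavior described above, the following minimal sketch uses Python's built-in sqlite3 module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and exist only for this example; a production system would use its own database manager's transaction facilities.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move `amount` from `source` to `target` as one atomic unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE name = ? AND balance >= ?",
                (amount, source, amount),
            )
            if cur.rowcount != 1:
                raise ValueError("debit failed")
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            if cur.rowcount != 1:
                raise ValueError("credit failed")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 500), ("savings", 100)])
conn.commit()

ok = transfer(conn, "checking", "savings", 200)      # both halves succeed
failed = transfer(conn, "checking", "missing", 50)   # credit fails, debit is undone
```

In the second call the debit is applied first, but because the credit cannot be completed, the whole transaction is rolled back and the checking balance is left untouched.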
While the RTO for an automated teller machine (ATM) application might also need to be as close to zero as possible, the RPO is not nearly as critical. Transactions made at ATMs tend to be well under $1,000, and they are usually confined to one account holder.
Again, you want transactional consistency so that if money is transferred between accounts or is withdrawn, there is an accurate record. But the RPO does not need to be zero because the database will not wildly diverge if one or more transactions are missing, and the ATM keeps a log that can be used for reconciliation later on.
We have seen two examples in which the RTO must be close to zero, but what about systems where the RTO can be longer? One example might be an electronic funds transfer (EFT) system. If the system is down, there are manual backup procedures, such as using a telephone or fax machine. However, the RPO is zero because the loss of even a single multibillion yen transaction can ruin your day (not to mention your career!).
While the RTO of a Web-based sales system must be close to zero, the RTO of the back-end systems (shipping and billing) can be much longer. If the back-end system is down, orders can be printed out and mailed or faxed to the warehouse and credit cards can be cleared manually. But if a Web-based system is down, customers will go elsewhere to do their business. Similarly, if the order entry system for a mail order business is down, orders can be processed manually.
If you have completed your impact analysis and understand the acceptable RTO and RPO, it is time to determine the best way to achieve them. Because I work for a computer manufacturer, I will focus the remainder of this article on the ways to meet your required objectives from a computing perspective.
Navigating The Continuum Using Tapes
RTO and RPO exist along a continuum, and we’ll look at various ways to meet your objectives starting with the longest and ending with continuous application availability across a site failure.
The most traditional recovery medium is magnetic tape. At some interval, you copy your computing environment from the disk to a tape and keep the tapes somewhere safe. The shorter the RPO, the shorter the backup interval. If at some point the backups are running continuously, a tape system may no longer be the right answer.
If the RTO is long, you simply order a new computer, load the tapes onto it, and start your application. If the RTO is short, you may pay to have a computer waiting for you to use when necessary (through a hot site or computer rental company). If you are evaluating the use of tapes, there are many considerations, such as:
• Where is your tape backup hardware?
• Where are tapes stored until they go off site?
• How quickly do your tapes go off site?
• Are multiple tape copies sent via different routes?
• Do you do tape retrieval and restore tests?
• Do you perform backups logically and ship recovery tapes in “waves”?
• Is your application in a quiesced state when you take backups?
The objective of taking tape backups is to have a safe copy of the information stored on your computer in case of a disaster. A disaster can be anything from someone accidentally purging or changing a file to total site destruction. The tape hardware should be located physically away from the computer that it is protecting so that an incident affecting the computer doesn’t take the tape drive out with it. If you have a fireproof silo, the tapes may be safe, but you might not be allowed to retrieve them if the area where the silo is located is off limits, either because it is unsafe to enter or because it has been designated a crime scene.
Do you make multiple backup copies, and when you do send tapes off site, do you separate the copies? It may seem far-fetched, but it is possible that the truck ferrying the tapes from your site to the tape storage facility could be involved in an accident and the tapes damaged.
When you back up your system, do you do it in logical units needed for recovery?
By backing up your system using the same file sets that will be needed for the recovery process, you can save recovery time. While one set of tapes is being read, the next set needed for recovery can be in the process of being located and pulled from off-site storage.
For example, you could pull and ship the operating system tapes, followed by critical application tapes, followed by critical database tapes. You could then ship the next critical sets of tapes.
Calculate how many tape drives are needed for parallel file restoration and whether or not your database software can load multiple logs at the same time for speed. If your RTO is less than the time needed to read your tapes back in, you have already lost the battle.
The key points are to ensure that the RPO can be met by the frequency of backups and the RTO can be met by how fast you can get the information on the tapes back into the system and make the database consistent.
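The arithmetic behind these two checks can be sketched as follows. This is a simplified model, not a sizing tool: it assumes the restore load spreads evenly across drives, and every figure in the example (data volume, drive throughput, fixed overhead for tape retrieval and database recovery) is invented for illustration.

```python
import math


def worst_case_data_loss_hours(backup_interval_hours, offsite_delay_hours=0.0):
    """Work lost if the site fails just before the next backup is safely off site."""
    return backup_interval_hours + offsite_delay_hours


def restore_hours(data_gb, drive_gb_per_hour, drives, fixed_overhead_hours=0.0):
    """Time to read the tapes back, assuming the load spreads evenly over drives."""
    return fixed_overhead_hours + data_gb / (drive_gb_per_hour * drives)


def drives_needed(data_gb, drive_gb_per_hour, rto_hours, fixed_overhead_hours=0.0):
    """Smallest number of parallel drives that fits the restore inside the RTO."""
    window = rto_hours - fixed_overhead_hours
    if window <= 0:
        return None  # the fixed overhead alone already exceeds the RTO
    return math.ceil(data_gb / (drive_gb_per_hour * window))


# Hypothetical figures: 2,000 GB to restore, 100 GB per hour per drive,
# 3 hours of fixed overhead (tape retrieval plus database recovery).
loss = worst_case_data_loss_hours(backup_interval_hours=24, offsite_delay_hours=4)
drives = drives_needed(2000, 100, rto_hours=8, fixed_overhead_hours=3)
elapsed = restore_hours(2000, 100, drives, fixed_overhead_hours=3)
impossible = drives_needed(2000, 100, rto_hours=2, fixed_overhead_hours=3)
```

With these numbers, a nightly backup that takes four hours to get off site exposes up to 28 hours of work, and an eight-hour RTO needs four drives running in parallel. A two-hour RTO cannot be met with tape at all, no matter how many drives are added.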
An open database is “fuzzy,” meaning it doesn’t provide a true picture of the state of your business. This inconsistency is masked in most databases because the records are locked until the database is made consistent. If you run a tape backup while transactions are in progress, the information on the tape won’t be consistent. Record locks are not written to the tape, and even if they were, the backup can still capture only half of a transaction: one account record can be changed on disk after the backup has already passed over it, while the other account record lies ahead of the backup, so its changed contents are written to the tape.
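A toy simulation makes the problem concrete. In this sketch the “database” is a Python dictionary and the “tape” is a copy made one record at a time; the account names and amounts are invented, and the transfer is forced to happen at a chosen point in the copy.

```python
def fuzzy_backup(database, transfer_after_record):
    """Copy records in order while a transfer commits partway through the copy."""
    names = list(database)
    backup = {}
    for position, name in enumerate(names):
        backup[name] = database[name]  # the tape "passes over" this record
        if position == transfer_after_record:
            # A transaction commits while the backup is still running.
            database["checking"] -= 200
            database["savings"] += 200
    return backup


live = {"checking": 500, "savings": 100}
tape = fuzzy_backup(live, transfer_after_record=0)

live_total = sum(live.values())  # 600: the transfer moved money, it created none
tape_total = sum(tape.values())  # 800: old checking balance plus new savings balance
```

The tape holds the checking balance from before the transfer and the savings balance from after it, so it records 200 that never existed in the live database.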
The only way to take a consistent backup of a running database is by using specialized software that is either packaged with the database, or from a third party that is able to make a fuzzy copy consistent when it is restored. If your backup software does not have the ability to restore a consistent copy of the database from an online backup, you need to ensure that the application is in a quiesced state before taking the backup.
Several products were created with these problems in mind. They use the serialized database transaction log to make a fuzzy backup transactionally consistent. You should check with your database vendor to see if they offer similar functionality.
Navigating The Continuum With Transaction Logs
Most modern database managers keep a serialized log of the changes made by applications. This log will often store a record of before and after images of altered records, and some logs capture business transaction boundaries as well so that a database can be made transactionally consistent when it is restored.
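The following sketch shows, in highly simplified form, how such a log can turn a fuzzy copy into a consistent one: every after image is reapplied in log order, and then the before images of any transaction without a commit record are restored in reverse order. The record layout (`txn`, `kind`, `key`, `before`, `after`) is a hypothetical format chosen for this example, not that of any particular database manager, and the sketch assumes record locking prevents two open transactions from updating the same record.

```python
def recover(fuzzy_copy, log):
    """Return a transactionally consistent copy of `fuzzy_copy`."""
    db = dict(fuzzy_copy)
    committed = {rec["txn"] for rec in log if rec["kind"] == "commit"}
    updates = [rec for rec in log if rec["kind"] == "update"]
    # Redo: reapply every after image in log order.
    for rec in updates:
        db[rec["key"]] = rec["after"]
    # Undo: walk backward, restoring before images of unfinished transactions.
    for rec in reversed(updates):
        if rec["txn"] not in committed:
            db[rec["key"]] = rec["before"]
    return db


log = [
    {"txn": "T1", "kind": "update", "key": "checking", "before": 500, "after": 300},
    {"txn": "T1", "kind": "update", "key": "savings", "before": 100, "after": 300},
    {"txn": "T1", "kind": "commit"},
    # T2 was still in progress at the time of failure: no commit record.
    {"txn": "T2", "kind": "update", "key": "checking", "before": 300, "after": 250},
]

fuzzy = {"checking": 500, "savings": 300}  # half of T1 captured, half missed
consistent = recover(fuzzy, log)
```

Because the log stores whole record images rather than deltas, running the recovery twice gives the same result as running it once, which matters if recovery itself is interrupted.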
If your RTO or RPO requirements cannot be met by restoring tape backups and applying the logs, shipping the logs to a target system and applying them to a copy of the database is a faster option. But even this takes time, and if you still can’t wait, you should investigate the use of online data replication.
Your backup database must be on disk and, to obtain the lowest RTO, should be connected to a system that is ready to take over from the primary. As data is changed in the primary database, the changed record images are not only written to the local transaction log, but also streamed to the backup where they are applied in real time (see figure 2).
RTO can be very close to zero, subject to the time needed after a failure to complete inserting streamed transactions into the database, back out indeterminate transactions, make the database a production instance, and bring up the application. RPO is subject to how fast the data can be sent over the communications link so that it is safely on the backup system.
There are differing methods of applying the information to the backup database. You also need to determine how long it takes for the backup database to be made available and how long it takes to bring up the application after a failure occurs.
Replicated Enterprise Storage Systems
With most, if not all, replicated enterprise storage system (ESS) solutions, data is replicated by sending track-by-track changes from the primary ESS at the primary site to a remote ESS at a backup site in either lockstep (in real time) or nonlockstep mode. When this replication occurs in lockstep mode, ESS vendors guarantee that the bits on the disks of the remote ESS are identical to those on the disks of the local ESS at any point in time (see figure 3).
One good feature of hardware replication is that even in lockstep mode, the attached computer systems are not affected by replication overhead or latency. This is because once data is written to the primary ESS by the primary computer system, the data is copied from the primary ESS to the remote ESS, often directly from ESS cache. There are times, however, such as during a planned cutover, when you want the ESS to freeze all disk I/O at a consistent point in time, in effect freezing the attached systems as well.
What ESS vendors do not tell you, but what is extremely important to note, is that the local ESS can send to the remote ESS only the data that the local computer system has written to it, that is, data flushed from the system’s internal disk cache buffers. Modern computer systems keep (buffer) substantial amounts of disk information in an internal memory cache because frequent writes to any kind of external storage substantially slow down an application. If the data has not yet been flushed (written) from the computer system’s cache to the local ESS, it is lost during a local site failure. The bigger the buffers (for higher performance), the more data can be lost and the further the RPO is from the time of actual failure.
When the attached computer systems do flush their internal disk caches, it is not done synchronously, so files on different disks will suffer different losses of data. Parts of uncompleted logical transactions may be flushed to disk and replicated, while parts of completed transactions may not be flushed until much later. Without question, every record that was locked on the primary system’s ESS continues to be locked on the backup system’s ESS. In a nutshell, then, the contents of the ESS will be just as fuzzy as a tape backup of an open database and will not reflect the logical, consistent state of the database as the application sees it (see figure 4).
While ESS vendors tell you that disk snapshots are a way of taking backups without bringing your application down, what they may not tell you is that such snapshots are also fuzzy if there are any outstanding transactions against the database when they are taken. Software is available that will freeze new transactions while allowing transactions in progress to complete and then take a snapshot, but what happens if you have a long-running transaction? Can you afford to have your application down until it completes?
More Than Meets The Eye
We have seen that RPO can be further from the failure point when using hardware replication than when using software replication. But what about RTO? Because ESS creates a fuzzy copy of the database on the backup system, database recovery must be done using the database manager’s inherent power-on “restart” functionality. This is possible because remote recovery is similar to a local power failure in that all of the in-memory information is gone.
Before considering replicated ESS over other recovery or continuity methods, you need to understand that:
• Replication visibility does not exist at the application or transaction level.
• Active database recovery must be done on the backup system before it can begin using data on the remote ESS.
• All applications on the remote system (replicated or not) may need to be brought down for a time for the cutover to take place.
Software on the backup computer system must be capable of backing out transactions in progress to make the database consistent. And, during a failover situation, this software must be run on the backup computer before processing can begin (see figure 5).
But what if the transaction logs needed for recovery were off-loaded to tape and are no longer on the system?
Does your RTO requirement allow the time needed to bring them back online to complete the database recovery?
I am not suggesting that replicated ESS doesn’t have a place in the continuum; I am only saying that its capabilities must be carefully balanced against your RTO and RPO requirements. On the plus side, replicated ESS does not require additional processing power on the primary system like software replication does.
Recovery or continuity solutions can be made up of multiple pieces. For example, replicated ESS can be combined with software replication. While data disks can be flushed infrequently, the transaction log must be flushed to disk after a very small set of transactions because it must be accurate to recover the database disks.
Because the transaction log is written serially, the continuous flushing overhead is low. If the replication software misses transactions because the communications line is saturated, the replicated log file can be “harvested” from the backup system’s ESS.
Because the replication software is keeping track of the last log record it read, this process is extremely fast, minimizing RPO and only slightly increasing RTO from a software-only solution.
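The harvesting step can be pictured with the sketch below. It assumes each log record carries a monotonically increasing sequence number (called `lsn` here) and that the replication software remembers the last one it applied; both the field name and the record layout are hypothetical.

```python
def harvest(replicated_log, last_applied_lsn):
    """Return the log records that the replication stream never delivered."""
    return [rec for rec in replicated_log if rec["lsn"] > last_applied_lsn]


def catch_up(backup_db, replicated_log, last_applied_lsn):
    """Apply the harvested records and return the new high-water mark."""
    for rec in harvest(replicated_log, last_applied_lsn):
        backup_db[rec["key"]] = rec["after"]
        last_applied_lsn = rec["lsn"]
    return last_applied_lsn


# Log file as found on the backup system's ESS after the primary site fails.
ess_log = [
    {"lsn": 1, "key": "checking", "after": 300},
    {"lsn": 2, "key": "savings", "after": 300},
    {"lsn": 3, "key": "checking", "after": 250},
]

# Software replication had applied only record 1 before the line saturated.
backup_db = {"checking": 300, "savings": 100}
high_water_mark = catch_up(backup_db, ess_log, last_applied_lsn=1)
```

Only the records past the high-water mark are read and applied, which is why the catch-up adds little to the RTO.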
Continuous Application Availability And The “Split Brain” Problem
Backup to tape and even data replication to a standby system cannot ensure business continuity. Systems are already down, and a company is placed in reactive mode, trying to mitigate the loss. One of the most common failure modes when attempting system recovery is missing important application or control files on the backup system.
The real path to continuity is to create a disaster-tolerant environment that distributes the processing across multiple sites, removing the need for recovery entirely. When a disaster strikes, surviving portions of the environment can immediately take over processing for the failed parts, maintaining database consistency, and keeping business-critical services online without being hampered by a lengthy recovery process. When the missing computing power is restored, parts of the application are migrated back to it.
The “split brain” problem refers to the same database being written to on more than one computer system, with each system replicating its changes to its partner. If the same record is updated in both databases, which one is correct? And if the communications link goes down but the systems keep running, the databases can drift further and further out of synchronization. It’s like wearing two watches: You never really know what time it is.
If you choose to split your application across multiple sites, you should also split the database. Some logical means should be used for the splits, such as account number, geography, language, and so on. Each source system controls only its portion of the database and replicates it to another system that does not access it until a takeover is declared.
Two systems can replicate to each other, multiple systems can replicate in a ring, or you can use whatever arrangement the continuity plan requires (see figure 6).
Because the database is split, you never have two databases of record for the same data at the same time.
Front-end application routers can send transactions to the system handling that portion of the database; or, if a transaction comes into one node that needs to access the data on another node, it can be forwarded. Because all of the systems are running active transactions, there should be no surprises when one fails, except that the summed growth in the transaction load may have outgrown the system.
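A front-end router of this kind might look like the following sketch. It splits the database by account number across two sites that back each other up; the `Router` class, the site names, and the modulo-based split are assumptions made for illustration, and a real deployment would choose its own partitioning key and takeover procedure.

```python
class Router:
    """Send each transaction to the node that owns its slice of the database."""

    def __init__(self, owners, backups):
        self.owners = dict(owners)    # partition -> node currently processing it
        self.backups = dict(backups)  # partition -> node holding the replica

    def partition_of(self, account_number):
        """Split by account number; any stable business key would do."""
        return account_number % len(self.owners)

    def route(self, account_number):
        """Return the one node allowed to update this account right now."""
        return self.owners[self.partition_of(account_number)]

    def declare_takeover(self, failed_node):
        """Hand every partition owned by the failed node to its backup."""
        for partition, node in list(self.owners.items()):
            if node == failed_node:
                self.owners[partition] = self.backups[partition]


router = Router(
    owners={0: "site_a", 1: "site_b"},
    backups={0: "site_b", 1: "site_a"},
)

before = (router.route(1001), router.route(1002))
router.declare_takeover("site_a")
after = (router.route(1001), router.route(1002))
```

Each partition has exactly one owner at any moment, so the backup copy is never written to until a takeover is declared. After the takeover, the surviving site carries both partitions, which is where the summed transaction load mentioned above becomes a sizing concern.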
Disaster recovery through continuous application availability is a continuum. Not all business processes have the same requirements for availability and for tolerable loss of work in progress, and the same process at two different companies could have different requirements. While the choice of technologies used to meet your RTO and RPO requirements may cause them to impact each other, they first should be evaluated independently. You should not be choosing technology before you have done a risk analysis and business impact analysis.
And one last point: Recovery of your computing infrastructure is only one very small part of your overall disaster recovery or business continuity planning program.
Ron LaPedis, CBCP, CISSP, has been with Hewlett Packard’s NonStop Enterprise division for 22 years and is the product manager for platform security and business continuity products. He has been a Certified Business Continuity Professional since 1990 and is a Certified Information Systems Security Professional. He has taught and consulted in these fields around the world.