A major UK bank has experienced no major system outages in its ATM network for 15 years. How has it achieved this remarkable availability? Through the use of an active/active network architecture.
Many enterprises in the financial, telecommunications, health, transportation, gaming, and other industries have reported similar experiences. Let’s look at how these continuously available active/active networks work along with their advantages and pitfalls.
What Do We Mean by Continuous Availability?
First of all, we must explain what we mean by “continuous availability.” A continuously available system is one that provides expected services to its users at all times, no matter what. No single failure will take it down. It never has to be taken out of service for planned maintenance such as upgrades. It continues to provide service even if a data center blows up.
Clearly, there will be hardware faults, software bugs, operator errors, power failures, hurricanes, floods, and terrorist attacks. No system is immune to outages. But if the recovery from an outage is so fast that no one notices, continuous availability has been achieved. In other words, “let it fail but fix it fast.” That is the philosophy behind active/active systems.
The Active/Active Architecture
Active/active networks provide continuous availability by providing multiple processing nodes that are geographically separated from each other, as shown in Figure 1 (above). A common application runs on all nodes. The processing nodes each have local access to a current copy of the application database. The database copies are kept in synchronism via data replication. Whenever a change is made to one database copy, that change is replicated instantly to the other database copies in the application network.
A transaction can be sent to any node and will be processed in the same way as it would be at any other node. Should a fault occur that causes a node to fail, all that needs to be done is to reroute transactions to a surviving node. Recovery can be accomplished in seconds to subseconds.
The data replication engine holds the key to active/active network performance and data security. There are two fundamental types of replication – asynchronous and synchronous.
An asynchronous replication engine works off of a persistent queue of changes made to the source database, as shown in Figure 2 (below). Its extractor reads changes from the change queue and transmits them to its applier at the target system, which applies them to the target database. The change queue typically already exists in most transaction-processing systems, since it is created by the local transaction monitor for use in system recovery. Thus, asynchronous replication is accomplished “under the covers.” No changes need be made to the application. The application does not know that replication is occurring and is not impacted by it at all.
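The extractor/applier flow described above can be sketched in a few lines of Python. Everything here is hypothetical scaffolding – a dict stands in for the target database, a direct method call stands in for the communication link, and the `AsyncReplicator` class name is invented for illustration:

```python
import queue
import threading

class AsyncReplicator:
    """Minimal sketch of an asynchronous replication engine.

    The extractor reads committed changes from the source node's change
    queue and ships them to the applier, which applies them to the
    target database. The application never sees any of this.
    """

    def __init__(self, change_queue, target_db):
        self.change_queue = change_queue   # persisted by the transaction monitor
        self.target_db = target_db         # dict standing in for the target database

    def extractor(self):
        # Runs on the source node: drain the change queue and "transmit"
        # each change (a direct call stands in for the network link here).
        while True:
            change = self.change_queue.get()
            if change is None:             # sentinel: shut down
                break
            self.applier(change)

    def applier(self, change):
        # Runs on the target node: apply the replicated change.
        key, value = change
        self.target_db[key] = value
```

Because the extractor runs in its own thread, the application simply writes to the change queue and moves on, which is what makes this approach transparent to it.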
However, there is a delay between when a change is made to the source database and when it is applied to the target database. This delay is known as replication latency. Replication latency in today’s replication engines is typically measured in the tens or hundreds of milliseconds. Asynchronous replication latency presents two problems in active/active networks which must be understood – potential data loss and data collisions.
Potential Data Loss: Should the source system fail, any data in the replication pipeline that represents source database updates may not make it to the target database. Thus, a subsecond’s worth of data may be lost. Should the source system be recovered, these changes can be resurrected from the change queue. However, if the source system is destroyed or its database is damaged, these changes will be gone forever.
A consideration in moving to an active/active network is the amount of data loss that is tolerable to the company following a major source system failure (this is known as the RPO, or the recovery point objective). Note, however, that one second of data loss is a lot better than hours of data loss if magnetic tape or virtual tape is being used for backup.
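A quick back-of-the-envelope calculation shows how replication latency translates into RPO exposure. The function below is a hypothetical sketch: it simply multiplies the latency by the transaction rate to bound the number of in-flight transactions that could be lost:

```python
def transactions_at_risk(latency_seconds, tps):
    """Rough upper bound on transactions lost if the source node is
    destroyed: everything still in the replication pipeline at the
    instant of failure."""
    return latency_seconds * tps
```

For example, at 500 transactions per second with 200 ms of replication latency, roughly 100 transactions are in flight at any instant – the worst-case loss from a destroyed source node.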
Data Collisions: Because the application is running in multiple nodes and because there is a delay in updating the target databases, there is a possibility that the application instance in two different nodes will update the same data object at about the same time (within the replication interval). Neither node knows that the other node is also updating the same data object. Therefore, each will replicate its change to the other node and will overwrite that node’s change. Now, the databases are different and both are wrong.
In some applications, data collisions are not possible (for instance, an insert-only application) or can be tolerated (a collision will be corrected by a later update). Data collisions can be avoided if the database can be partitioned so that only one node has permission to update a given partition. If data collisions can occur and are not acceptable, they must be detected and resolved. Many of today’s replication engines provide data collision detection and can resolve data collisions via rules bound into them (for instance, accept the latest change). If a collision is detected and cannot be automatically resolved, it may have to be resolved manually.
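An “accept the latest change” rule can be sketched as follows. This is an illustrative fragment, not any particular replication product’s algorithm; it assumes each row carries the timestamp of its last update:

```python
def apply_replicated_update(local_db, key, value, ts):
    """Sketch of 'accept the latest change' collision resolution.

    Each row is stored as (value, timestamp-of-last-update). A replicated
    update older than the local row is a collision loser and is discarded;
    otherwise it wins. Returns True if the update was applied.
    """
    current = local_db.get(key)
    if current is not None and current[1] > ts:
        return False          # collision: the local change is newer, keep it
    local_db[key] = (value, ts)
    return True
```

A real engine would also log discarded updates so that an operator can review collisions that the rule may have resolved incorrectly.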
The problems of data loss and data collisions are avoided if synchronous replication is used. With synchronous replication, either changes to all data object copies within the scope of a transaction are made or none are. The synchronous replication engine accomplishes this by first acquiring locks on all data object copies across the application network before applying any changes. It then knows that it can reliably change all of the copies of the data object simultaneously.
Because no change is made to a data object unless all copies of that data object can be changed, there is no data loss. If the source system fails, no changes are made either to the source system or to the remote target systems. Furthermore, since a data object cannot be changed unless the replication engine is holding a lock on it, it is not possible for two applications remote from each other to change the same data object at the same time. Therefore, data collisions are avoided.
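The lock-everything-then-change-everything discipline can be illustrated with a small Python sketch. The `Copy` and `synchronous_update` names are invented for the example, and in-process locks stand in for distributed lock requests across the network:

```python
import threading

class Copy:
    """One copy of a data object on one node."""
    def __init__(self, value=None):
        self.lock = threading.Lock()
        self.value = value

def synchronous_update(copies, new_value):
    """Sketch of synchronous replication: lock every copy across the
    network, then change all of them, then release. If any lock cannot
    be acquired, release what we hold and abort, so no copy changes."""
    held = []
    try:
        for c in copies:
            if not c.lock.acquire(timeout=1.0):
                return False          # abort: could not lock all copies
            held.append(c)
        for c in copies:
            c.value = new_value       # all-or-nothing change
        return True
    finally:
        for c in held:
            c.lock.release()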
However, synchronous replication comes with its own problem. The application must wait while locks on all of the data objects within the scope of a transaction are acquired across the application network, and then it must wait again until the transaction is committed. As a result, the response time of the application may be degraded. To avoid this, synchronous replication engines generally impose a distance limitation on how far apart the processing nodes may be separated – typically, tens of kilometers. This may conflict with the disaster-tolerance requirements for the system, since the processing nodes must be far enough apart that any expected regional disaster will affect at most one node.
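The distance limitation follows directly from the speed of light in fiber, roughly 200,000 km/s. The sketch below assumes two network round trips per transaction – one to acquire the remote locks and one to commit – which is a simplification for illustration:

```python
def replication_delay_ms(distance_km, round_trips=2):
    """Back-of-the-envelope added response time from synchronous
    replication. Light in fiber travels roughly 200,000 km/s, so one
    round trip costs about distance_km / 100 milliseconds. Assumes
    two round trips per transaction (lock acquisition, then commit)."""
    one_round_trip_ms = distance_km / 100.0
    return round_trips * one_round_trip_ms
```

At 50 km the penalty is about a millisecond; at 1,000 km it balloons to roughly 20 ms per transaction. That is why synchronous replication engines typically keep nodes within tens of kilometers of each other.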
So far, we have described the active/active infrastructure. But an active/active network comprises active/active applications running on an active/active infrastructure. Can we take our current applications and simply move them to a distributed active/active infrastructure?
Maybe not. Some changes may have to be made. For instance, if the application automatically assigns sequential identification numbers such as invoice numbers, these may be duplicated by the different nodes. Some mechanism to ensure uniqueness may have to be implemented. For instance, invoice numbers could be preceded by a node number. Automatically assigned identification numbers are just one example of memory-resident context. Any application state that is kept in memory and not on disk may not be replicated, and application modifications may be necessary to handle these situations.
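The node-number-prefix scheme for invoice numbers can be sketched as follows; the class name and ID format are hypothetical:

```python
class InvoiceNumberGenerator:
    """Sketch of collision-free sequential IDs in an active/active
    network: each node prefixes its own sequence with its node number,
    so no two nodes can ever issue the same identifier."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.seq = 0

    def next_id(self):
        self.seq += 1
        return f"{self.node_id}-{self.seq:08d}"
```

Node 3 issues 3-00000001, 3-00000002, and so on, while node 7 independently issues 7-00000001 – unique across the network with no coordination required.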
Some replication engines do not replicate read-only locks. This may make intelligent locking protocols ineffective in a distributed environment. Global mutexes may have to be implemented in which key locks are managed by a master node.
It must be ensured that batch runs initiated by an application are not duplicated on the nodes. Often, an application will kick off mini-batch runs as the day progresses. It is important that these batch runs occur on only one node.
A strategy for moving users from a failed node to surviving nodes or for rerouting transactions to surviving nodes must be established. There are three basic types of redirection strategies:
- Client redirection, in which the client system detects that its transactions are not being processed. The client system can then connect to another node for further processing. This requires that the client has the intelligence to detect failures and to switch IP addresses in the event of a failure.
- Network redirection, in which routers reroute virtual IP addresses being used by the client systems to surviving nodes following a node failure. This is a common alternate routing feature found in intelligent routers.
- Server redirection, in which the nodes monitor each other via heartbeats and command the movement of client systems from a failed node to surviving nodes.
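The heartbeat monitoring behind server redirection can be sketched as follows; the class name and timeout handling are invented for illustration:

```python
import time

class HeartbeatMonitor:
    """Sketch of server redirection: each node records when it last
    heard from its peers; a peer silent for longer than the timeout is
    declared failed, and its clients are moved to surviving nodes."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        # Record a heartbeat received from a peer node.
        self.last_seen[node] = now if now is not None else time.time()

    def failed_nodes(self, now=None):
        # Return peers that have been silent longer than the timeout.
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]
```

The timeout must be chosen carefully: too short, and a transient network hiccup triggers a needless failover; too long, and recovery is no longer measured in seconds.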
The reliability of the replication network is paramount. Should the replication link between two nodes fail and no further action be taken, each node will continue to process those transactions routed to it but will be unable to replicate those changes to the other nodes. This is called split-brain mode. The database contents will diverge, and transactions will be executed against stale data. When the replication link is restored, the databases can be synchronized by draining the changes that have built up in the change queues. However, data collisions are bound to happen during the restoration process.
In some applications, split-brain operation may be acceptable. If it is not, one of the nodes must be taken out of service until replication can be restored. For this reason, the replication links should always be redundant and independent so that no single failure can prevent replication.
Other Advantages of Active/Active Networks
In addition to providing continuous availability via fast and reliable failover, active/active networks provide many other advantages:
- Planned downtime is eliminated since upgrades can be rolled through the application network one node at a time.
- It is easy to add capacity by simply adding nodes. Bigger systems do not have to be purchased.
- Load can be easily redistributed to balance the network by reassigning users to nodes.
- Performance can be improved by stationing processing nodes close to communities of users.
- Failover can be easily tested since it is fast and reliable (a major reason for failover faults is that failover testing with active/backup systems is expensive and risky and therefore is often not done).
- Lights-out operations can be easily supported since the pressure to recover a failed node is significantly reduced.
- All purchased capacity is usable. There is no idle backup system sitting around waiting to take over. However, there must be enough capacity to handle the load if a node fails. In a two-node network, this means that both nodes must be able to handle the full capacity, similar to an active/backup system (though the load will be split between the two nodes, thus improving performance and providing additional peak processing capacity if needed). However, if there are four nodes in the network, each must be able to carry only one-third of the traffic in order to survive a single node failure. Thus, only 4/3 of the required capacity must be purchased rather than twice the capacity.
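The capacity arithmetic in the last point can be captured in two hypothetical helper functions:

```python
def capacity_per_node(n_nodes):
    """Fraction of the total load each node must be sized for so the
    network survives a single node failure: the surviving n-1 nodes
    must together carry the full load."""
    return 1.0 / (n_nodes - 1)

def total_purchased_capacity(n_nodes):
    """Total capacity purchased, as a multiple of the required load."""
    return n_nodes * capacity_per_node(n_nodes)
```

With two nodes, twice the required capacity must be purchased (as with active/backup); with four nodes, only 4/3 of it; and the overhead keeps shrinking as nodes are added.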
Other Costs of Active/Active Networks
The hardware, staffing, and site requirements necessary to build an active/active network are substantially the same as those required to build an active/backup system. However, there are additional costs that must be considered.
- Redundant and independent replication links must be provided to connect the nodes in the application network.
- A replication engine must be licensed.
- Additional licensing costs may be incurred since duplicate hardware and software are being used simultaneously rather than one set being used only for standby purposes.
- Applications may have to be modified to make them active/active ready.
- Distributed system management tools may have to be licensed and staff trained in their use.
- The new active/active environment must be thoroughly tested and operation procedures thoroughly documented and practiced before putting the new system into production.
However, these additional costs must be evaluated in light of the downtime costs that will be saved by going to a continuously available environment. For instance, if downtime costs a large company $100,000 per hour, and if the active/active network can save eight hours of downtime per year, savings in downtime costs add up to $800,000 per year. This can cover a major move to active/active.
Relationship to BC/DR
Once we get our active/active network up and running and believe that we now have truly continuous availability, can we forget about business continuity and disaster recovery planning? Absolutely not!
For one thing, IT is only one element in the BC/DR plan. The plan must cover all business processes. For another, continuous availability is relative. It really means that the probability of system failure is extremely small – perhaps not even measurable. But it is not zero. There is some very small chance that an event beyond our comprehension can take down the entire system. Though this might not happen for hundreds of years, it could happen tomorrow.
Case in point – the botched antivirus update released by McAfee on April 21, 2010, took down thousands of systems all over the world. Had we not been smart enough to roll upgrades such as this through our system one node at a time, we could have lost all of the nodes in our active/active network.
BC/DR planning is as important as it ever was.
The continuous availability of active/active networks can bring many benefits to an enterprise, from savings in downtime costs to regulatory compliance – not to mention avoiding the dreaded CNN moment, when the company makes international headlines following a major system failure.
There are many concerns that must be evaluated such as the problems of data collisions, data loss, and split-brain mode; the effort and risk associated with modifying old legacy applications; and the additional costs incurred. These concerns must be balanced against the many advantages of active/active networks such as the elimination of the cost of downtime, improved user experience, and regulatory compliance. The bottom line is that many companies have found that their investment in active/active technology has paid off many-fold.
Dr. Bill Highleyman is the managing editor of the Availability Digest (www.availabilitydigest.com), which focuses on the technologies of continuous availability. He is also co-author of the three-volume series on active/active networks titled “Breaking the Availability Barrier.”