Since the replication services are being performed at the controller level, this can be considered a storage-centric solution. By being storage-centric, replication is being performed at the LUN level and has no comprehension of what a filesystem or database is.
The advantage is that this type of solution can be used in a heterogeneous server environment. Today no storage vendor supports a controller-based replication solution that works with other storage vendors.
This means that the choice of storage vendor for a controller-based solution must be made very carefully, because only that vendor will be able to provide controller-based replication for that hardware.
By replicating data at the LUN level, the storage controller has the ability to perform the replication process transparently to the host and theoretically will have little or no impact on server performance.
There are two primary methodologies used for controller-based replication: Peer-to-Peer Remote Copy (PPRC) and the auxiliary journal device.
Peer-to-Peer Remote Copy
Peer-to-Peer Remote Copy (PPRC) is a very reliable replication technique that has been in use for several decades in the mainframe environment. Several companies, including EMC, IBM, and HDS, have brought this process to the open systems environment. The process tracks changes per logical track, using an ESCON-type methodology (a track is usually 32 or 64 KB, depending on who is supplying the solution). When data blocks within a track have been changed but not yet sent to the secondary image, that track is marked, indicating that the two images are not in sync. When a resynchronization is performed, each marked track is sent to the secondary and its mark is then cleared.
The advantage of this method is that if multiple writes are sent to the same location on a LUN, only the last write is sent (which means it may be faster than the journal device method). The disadvantage is that if the data being written is smaller than the track size, or is offset so that it spans multiple tracks, more data may be sent to the mirror than absolutely necessary. For example, if the database uses 8 KB blocks for all of its I/Os and the track resolution is 32 KB, the PPRC solution may have to write four times as much data as necessary to resynchronize the mirror LUN. Having to read and write this extra data means that resynchronization takes longer, and the link between the two mirrored devices may need more bandwidth to carry the additional data flow.
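The track-marking behavior described above can be illustrated with a minimal sketch. It assumes a 32 KB track and uses a hypothetical TrackBitmap class invented for this illustration; it is not any vendor's actual implementation.

```python
TRACK_SIZE = 32 * 1024  # assumed track size in bytes (the text cites 32 or 64 KB)


class TrackBitmap:
    """Hypothetical dirty-track tracker for a single LUN (illustration only)."""

    def __init__(self, track_size=TRACK_SIZE):
        self.track_size = track_size
        self.dirty = set()  # track numbers changed but not yet sent to the secondary

    def record_write(self, offset, length):
        """Mark every track touched by a write of `length` bytes at byte `offset`."""
        first = offset // self.track_size
        last = (offset + length - 1) // self.track_size
        self.dirty.update(range(first, last + 1))

    def bytes_to_send(self):
        """Whole tracks are sent, regardless of how little of each was changed."""
        return len(self.dirty) * self.track_size

    def resync(self):
        """Return the marked tracks in order and clear the marks."""
        tracks = sorted(self.dirty)
        self.dirty.clear()
        return tracks


bitmap = TrackBitmap()
bitmap.record_write(offset=0, length=8 * 1024)          # one 8 KB database block
bitmap.record_write(offset=0, length=8 * 1024)          # rewrite: track 0 is already marked
bitmap.record_write(offset=60 * 1024, length=8 * 1024)  # offset write spanning tracks 1 and 2

host_bytes = 3 * 8 * 1024
print("bytes written by host:", host_bytes)
print("bytes sent on resync: ", bitmap.bytes_to_send())
print("tracks sent:", bitmap.resync())
```

In this sketch the repeated write to the same block costs nothing extra, while the small and offset writes each cause whole tracks to be sent, which is the trade-off the paragraph describes.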
Auxiliary Journal Device
This kind of solution uses a separate device to store the changes between the primary and secondary of a mirror pair. When the mirror pairs are suspended, each write is also sent to the journal device. There are several ways to re-synchronize between the devices. This method can be combined with the flash mirror or the sparse copy methodology to make a more effective solution. In this case, the fact that there are multiple writes and a read may negatively affect performance.
The most popular method of re-synchronization retransmits writes in the same order they were sent while the mirror pair was suspended. A major disadvantage of this method is that if multiple writes are sent to the same disk area, every one of those writes is sent again, which translates to an increased bandwidth requirement or a longer re-synchronization time.
The other method sifts through the writes and only sends the last write for any one area on the disk. The advantage here is that the re-establish time can be significantly reduced and only the minimum bandwidth between the two devices is required.
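The two re-synchronization methods can be compared with a small sketch. The Journal class, its method names, and the block-number addressing are all assumptions made for illustration; real products journal at whatever granularity the vendor chooses.

```python
class Journal:
    """Hypothetical write journal kept while a mirror pair is suspended."""

    def __init__(self):
        self.entries = []  # (block_number, data) tuples in arrival order

    def record_write(self, block, data):
        self.entries.append((block, data))

    def replay_in_order(self):
        """First method: resend every journaled write in its original order."""
        return list(self.entries)

    def replay_coalesced(self):
        """Second method: keep only the last write for each block."""
        latest = {}
        for block, data in self.entries:
            latest[block] = data  # a later write to the same block replaces the earlier one
        return sorted(latest.items())


def apply_writes(writes):
    """Apply a list of (block, data) writes to an empty, dictionary-backed LUN."""
    lun = {}
    for block, data in writes:
        lun[block] = data
    return lun


journal = Journal()
for block, data in [(7, b"v1"), (7, b"v2"), (3, b"x"), (7, b"v3")]:
    journal.record_write(block, data)

in_order = journal.replay_in_order()
coalesced = journal.replay_coalesced()
print("writes sent, in-order replay: ", len(in_order))
print("writes sent, coalesced replay:", len(coalesced))
print("same final image:", apply_writes(in_order) == apply_writes(coalesced))
```

Both replays leave the secondary with the same final contents; the coalesced replay simply sends fewer writes when the same area has been written more than once.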
Controller-Based Replication - Pros
- One of the biggest advantages of using a controller-based solution is that it does not use any additional bandwidth between the controller and the server as the replication is being done effectively at the back-end of the controller. In addition to this, there is less interconnection complexity involved and a lower potential for additional latency of the writes.
- By offloading the replication services from the server, the server is better able to do the things it should be doing - running the applications. No server cycles, memory, or I/O bandwidth is consumed on the server side when a controller-based solution is used for replication.
- A non-server-based solution allows replication of a heterogeneous server environment. No longer is an IT department limited to just one operating system, and heterogeneous server support has become a basic requirement for storage and the replication methodology being used. A server-based solution would potentially require several different solutions to be selected, and the additional complexity in the environment translates to additional IT staff supporting it.
- Since Fibre Channel has extended distance capabilities over SCSI, this solution is great for a campus environment where the distances are not excessive. Of course, each supplier of this type of solution has a different opinion on what a reasonable distance may be. Certainly a 10 km distance can be easily implemented, and additional hardware can be used to potentially extend the Fibre distance up to 120 km.
Controller-Based Replication - Cons
- Because this is a controller-based solution, there is no flexibility in the storage platform being used for replication. At this point in time, no storage company has been willing to support another storage platform for their controller-based replication solution. For example, if using Compaq for this solution, all storage involved in that replication project will need to be Compaq. If using EMC Symmetrix, this solution only works with their Symmetrix storage.
- If the storage requirements exceed the capacity of a single storage controller (for any number of reasons - capacity, number of LUNs or mirrors required, performance requirements, and number of connections for example), multiple storage controllers may need to become involved (with the additional management requirements).
- When extended distances are used, forcing a Wide Area Network (WAN) solution, external adapters are required to convert from ESCON or Fibre to the WAN interface being used (TCP/IP, T-3, or ATM). These adapters are costly, add complexity to the solution, and may not make full use of the available bandwidth. This means that not only must additional money be spent on the adapters, but additional bandwidth may also be required to achieve the required throughput.
- Consider the long-term cost of the solution. Some vendors require support through their own organization and do not allow the customer to manage their own site. This means that when a change is to be made it can cost a significant amount of effort and money to get this implemented. Also look at licensing costs and the ability to transfer those licenses from one controller to another.
- If an IT department has already purchased storage from multiple storage suppliers, and each of them has its own controller-based replication solution, management of each of these islands of replication can become a significant challenge. While this may seem obvious, it helps to have it pointed out to ensure that additional resources have been allocated to properly manage this.
Appliance-Based Replication - Overview
Using an appliance to provide replication is a new methodology recently introduced by several startup companies.
The logic here is that an appliance can be used to provide high-end functionality without regard to the storage subsystem underneath, while taking advantage of the heterogeneous server support only a hardware solution can provide.
The first area appliances have focused on is storage virtualization, and the second is replication. There are two major implementations of appliances: symmetric and asymmetric.
Symmetric Sits in the Data Path
As the diagram shows, the appliance sits between the server and the storage. This is the first implementation of an appliance, and it adds flexibility to the solution because different attachment methodologies (Fibre, SCSI, or one of the emerging new standards) can be used - either for the front-end or the back-end. On the other hand, it can add additional latency and become a potential bottleneck to the overall IT solution. Because every I/O goes through the appliance, it can direct where each I/O needs to go. If the I/O is a write, the appliance can direct that write to all of the LUNs that need to be updated. Since there is no data path around the appliance, data integrity is fully controlled and should be considered as safe as a controller-based solution.
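The write fan-out performed by a symmetric appliance can be sketched as follows. The Lun and SymmetricAppliance classes, the "vol0" name, and the dictionary-backed storage are hypothetical stand-ins chosen for illustration, not a description of any shipping product.

```python
class Lun:
    """Hypothetical backing LUN, modeled as a dictionary of blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data

    def read(self, block):
        return self.blocks.get(block)


class SymmetricAppliance:
    """Hypothetical in-band appliance: every host I/O passes through it."""

    def __init__(self):
        self.mirrors = {}  # virtual LUN name -> list of backing LUNs

    def map_lun(self, virtual_name, backing_luns):
        self.mirrors[virtual_name] = list(backing_luns)

    def write(self, virtual_name, block, data):
        # Direct the write to every LUN that needs to be updated.
        for lun in self.mirrors[virtual_name]:
            lun.write(block, data)

    def read(self, virtual_name, block):
        # Reads are served from the first (primary) copy in this sketch.
        return self.mirrors[virtual_name][0].read(block)


primary, secondary = Lun("primary"), Lun("secondary")
appliance = SymmetricAppliance()
appliance.map_lun("vol0", [primary, secondary])
appliance.write("vol0", block=42, data=b"payload")
print("copies match:", primary.read(42) == secondary.read(42))
```

Because the host can reach the LUNs only through the appliance in this model, no write can bypass the mirroring step, which is the data-integrity point made above. The same loop is also where the added latency comes from, since each host write becomes one write per mirror.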
Asymmetric Does Not Sit in the Data Path
As this diagram shows, the asymmetric appliance does not sit in the middle of the data path. Instead it works with the host bus adapter (HBA) or software on the server to redirect the data to the appropriate storage location. The advantage of the HBA approach is that it can be implemented without adding latency to the solution. The disadvantage is that the HBA vendors need to add a significant amount of functionality to their controllers to make use of this capability. QLogic and Emulex have announced they will support at least one implementation, but it looks to be in the relatively distant future.
Replication by using an asymmetric appliance is much more difficult to implement, as it requires a significant amount of cooperation between the HBA and the appliance (with a real concern towards error recovery and data integrity).
All that has happened is another hardware platform has been introduced to act as the traffic cop instead of the systems talking between themselves. This may be acceptable for virtualization, but does not help in replication.
As this is a hardware solution, it again replicates at the LUN level - with the same advantages and disadvantages as the controller-based solution. At present, the most popular methodology being used to track mirrors is the auxiliary journal device.
A more robust, modern replication method similar to PPRC should be available in the near future.
Appliance-Based Replication - Pros
- Just as the controller-based solution supports a heterogeneous server environment, so does the appliance solution.
- By replicating one level higher than the controller, a truly heterogeneous solution - at both the server and the storage level - can be implemented.
- Multiple controllers can be spanned - regardless of the storage vendor. This allows for one appliance to replicate across several storage arrays - simplifying the overall solution.
- Many, if not most, IT departments have implemented a SAN for a wide variety of reasons. Appliances support the SAN environment natively, which translates to easier integration into an existing SAN environment.
- Some appliances advertise support of a WAN solution natively, which means that additional adapters are no longer necessary (as with the controller-based solutions). Not all suppliers of appliances support this, but if looking for extended distance replication it should be a consideration.
- Management of the overall solution needs to be considered. Because an appliance can span multiple storage arrays and replicate regardless of the storage vendor, the replication effort should be easier to manage. This also ties to total cost of ownership, as it is not necessary to repurchase or re-license the appliance whenever a server or storage array is replaced or moved.
Appliance-Based Replication - Cons
- By adding another layer to the overall storage solution, another layer of complexity has been added that the IT department needs to support. Added complexity is not something IT departments are looking for. The reason this is acceptable is the added complexity is offset by the increased productivity and usefulness of the appliance.
- Because an asymmetric appliance does not sit in the data path, its ability to act as a reliable replication solution is extremely suspect. Asymmetric appliances may be great for storage virtualization, but have several inherent problems when used for replication. The first problem is recognizing that there is an error, and properly recovering from it, when the HBA must communicate with the traffic cop to figure out where the error is in the first place. Another problem arises when a WAN solution is used - how do the WAN devices appear and what interface is used to access them?
- Since a symmetric appliance sits in the middle of the data path, it is a reasonable concern that performance may be impacted and steps should be taken to ensure the solution only minimally impacts performance.
The two major concerns that need to be addressed are latency and saturation of the appliance.
Because there is another layer of complexity, symmetric appliances have the overhead of having to receive a SCSI command, process it, and then retransmit that command to the appropriate device(s). This translates to latency and needs to be addressed. This is where good development and programming of the appliance software will really make a difference.
As with any hardware-based solution, there is a point where the performance will level off as it can only handle a certain amount of bandwidth. Clustering may help in this area, providing it is implemented properly. This concern is similar in many respects to the bandwidth requirements for a switch, and should be treated in the same manner.
- In addition to adding complexity, hardware has been added to the equation. Hardware breaks. When looking for a solution, make sure there is no single point of failure!
There are many possible solutions for replication, and each is right for different environments.
Server-based solutions are more suited where application performance is not an issue (as it takes away server resources), storage bandwidth is not an issue (as it requires up to double the bandwidth) and the server environment is not expected to change in the future (to reduce the ongoing support costs).
Controller-based solutions are more suited where one storage chassis is being used for a solution (as it cannot span controllers), one storage vendor is being selected (as each controller-based solution does not support any other), and a SAN solution is being implemented (as there is a substantial additional cost to connect to the WAN).
Appliance-based solutions are best where complete flexibility in servers and storage platforms is necessary (as they support a heterogeneous server and storage platform), when WAN connectivity is necessary (as they can do this with less cost), and when the replication is performed spanning multiple storage controllers (as it can support multiple storage controllers in a mirrored environment).
The challenge is to determine the best solution for the environment both now and in the future. Remember to consider all of the options including server platforms, storage platform flexibility, and the total cost of ownership before selecting what is right for your environment.
Robert A. Collar is Senior Product Manager for SAN Director Products at LSI Logic Storage Systems, Inc., Milpitas, CA. He has been involved in high availability solutions off and on since 1988, working at Tolerant System, Pyramid, and as an IBM/HP/SGI reseller. He has been involved in Unix-based solutions since 1979. He recently addressed the SNW/Tokyo show in January 2001 on replication and business continuance. He can be reached at (408) 433-4076 or email@example.com.