Benefits of Leveraging Distributed Resources
Driven by economics, many CIOs are refocusing their efforts on extracting additional efficiencies and maximizing the use of existing infrastructure. In addition, prompted by the increasing occurrence of man-made and natural disasters, recommendations from industry experts, and new regulations, organizations are re-evaluating the true capability of their disaster recovery, business continuity and high availability solutions. Naturally, organizations would like to use existing facilities in remote locations where possible, or to locate new facilities as far away as practical. Historically, however, the implications of such distant sites for applications, replication performance and bandwidth requirements have forced organizations to make significant compromises in their ability to leverage these distributed resources.
Having the flexibility to use IT assets dynamically, whenever and however the business demands, would lead to significant efficiency gains. This goal has been the driving force behind revolutionary ideas such as LANs, SANs, clustering and distributed computing technologies from major vendors. With this flexibility, applications and functions can be executed wherever resources are available on a moment-by-moment basis. At the same time, the distributed geography of such a design provides a decidedly more robust DR/BC/HA environment.
After analysis, many organizations conclude that cost-effective, robust use of distributed resources in this manner simply cannot be achieved because of bandwidth constraints. In this article, we will explore improvements in connectivity and storage solutions that promise to make such scenarios cost-justifiable to the business.
Bandwidth
The amount of bandwidth required between centers is often hard to quantify, and not having enough is always a risk. Adding bandwidth after the fact has traditionally been slow and expensive. Initial project budgets often severely restrict the amount of bandwidth that can be procured, forcing many solutions to be scaled back to levels that call the original design concepts into question. The cost of bandwidth between facilities has led to a process of “bandwidth rationing,” in which bandwidth is treated as a scarce resource to be parceled out in tiny increments so that each application gets its “share.” Such a process naturally leads to “bandwidth exhaustion,” where data does not flow between centers at the rates required for flexible use of distributed data centers.
In contrast, bandwidth is generally not rationed within data centers, because it is cost-effective to simply add SAN/LAN capacity through switch, server and disk upgrades. Quantitatively, a large data center may have cumulative switching capacity on the order of hundreds of gigabytes per second (GB/s) across hundreds of SAN/LAN ports. Relative to WANs, there is “bandwidth abundance” within data centers.
To achieve similar connectivity over greater distances, the wide area networks (WANs) between cities must grow to a similar scale. This requirement stands in stark contrast to the WAN capacity most companies lease today. Even between the largest distant data centers, the largest circuit in use is typically an OC-48 (2.5Gb/s, or roughly 250MB/s) or, in a few rare cases, an OC-192 (10Gb/s, or roughly 1GB/s). In most other environments, much smaller circuits are in use (OC-3s, DS-3s, and even T1s). Even the largest of these does not bring the WAN to the same scale as the SAN/LAN. The cost of WAN bandwidth has been the largest factor inhibiting deployments of the necessary scale. The mechanisms used to address this problem are discussed in the sections below.
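As a back-of-the-envelope illustration of the gap, the following Python sketch compares how long a nominal 10TB replication job would take over common WAN circuits versus an aggregate data center fabric. The dataset size, the fabric aggregate and the use of raw line rates are illustrative assumptions, not measurements from any particular environment.

    # Time to move a nominal 10 TB dataset over various links. Rates are raw
    # line rates; real throughput is lower after protocol overhead. The 10 TB
    # size and the SAN aggregate figure are illustrative assumptions only.
    DATASET_BITS = 10 * 8 * 10**12        # 10 TB expressed in bits

    links_gbps = {
        "T1":      0.0015,
        "DS-3":    0.045,
        "OC-3":    0.155,
        "OC-48":   2.5,
        "OC-192":  10,
        "SAN/LAN aggregate (~250 GB/s)": 2000,
    }

    for name, gbps in links_gbps.items():
        hours = DATASET_BITS / (gbps * 10**9) / 3600
        print(f"{name:32s} ~{hours:10.1f} hours")

Under these assumptions, even an OC-48 needs roughly nine hours to move the dataset, a T1 would need well over a year, and the data center fabric finishes in under a minute.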
SAN Replication Over Distance
A prominent tool in DR/BC planning is the movement of data from one location to another via server-based, network-based and storage-based replication. The basic idea is that having multiple copies of the data, locally and remotely, protects the overall environment if something should happen to the primary copy. Many major installations replicate data between two or more SAN disk systems located “regionally” (within the same building or metropolitan area). But corporations are realizing that the proximity of this replicated data does not provide sufficient protection for the business.
Synchronous Replication
Much of the replication activity between SAN disk systems uses a synchronous write mechanism to make copies of data. An example of a synchronous replication mechanism is illustrated in Figure 1. While the synchronous approach ensures data consistency in both locations, it performs well only when the interconnect latency (the delay in the path between the primary and secondary storage) is approximately 1ms or less. This is because, in the synchronous approach, the host is blocked until the entire synchronous write completes before it can proceed with the next write I/O. Typically, a local write I/O (between the host and the primary storage) completes in 1-3ms without replication. With replication, the interconnect latency and the I/O completion time on the secondary storage must be added to the local I/O completion time. As the interconnect latency increases, the host must wait longer for completion status, which reduces application throughput. For applications that rely on sequential write ordering to maintain consistency (typical of a transaction environment), this latency limitation makes synchronous replication impractical over large distances. This is particularly true where interconnect latencies significantly exceed those induced by metropolitan-scale distances of about 60 miles (about 1ms of latency); at longer distances, the interconnect latency dominates the overall I/O completion time.
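To see why the interconnect latency quickly dominates, consider the simple model below: each synchronous write must absorb the local write time, the round trip to the secondary site and the remote write time before the next ordered write can be issued. The 2ms service times and the single outstanding write are simplifying assumptions for illustration.

    # Toy model of synchronous replication throughput for strictly ordered
    # writes: each write waits for the local write, the interconnect round
    # trip and the remote write. Service times are illustrative assumptions.
    def sync_write_ms(local_ms, remote_ms, rtt_ms):
        return local_ms + rtt_ms + remote_ms

    for rtt in (0, 1, 5, 20, 40):                 # interconnect RTT in ms
        per_write = sync_write_ms(local_ms=2, remote_ms=2, rtt_ms=rtt)
        print(f"RTT {rtt:2d}ms -> {per_write:4.0f}ms per write, "
              f"~{1000 / per_write:5.0f} ordered writes/s")

Even a single millisecond of interconnect latency reduces the ordered write rate noticeably in this model; at 40ms it collapses to a few tens of writes per second.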
When the secondary storage is in a geographically remote location thousands of miles away, the performance of synchronous replication degrades in proportion to the distance traversed. As an example, consider a primary site in New York City and a back-up site in Dallas, Texas. Common fiber routes between these locations traverse approximately 2,500 miles and inherently induce about 40ms of round-trip-time (RTT) latency. This latency is unavoidable; it is set by the propagation speed of light in the fiber. It is important to note that while most applications would continue to function with write I/O completion times on the order of 40ms, their performance would suffer to the point of being unusable.
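The 40ms figure follows directly from the physics. Assuming a route of roughly 2,500 miles and light traveling in glass at about c/1.47, a quick calculation reproduces it; the refractive index and route length are approximations.

    # Approximate propagation delay for a ~2,500-mile fiber route. Light in
    # fiber travels at roughly c/1.47, about 204 km per millisecond, so a handy
    # rule of thumb is ~1 ms one-way per 100 km (62 miles) of fiber.
    route_km = 2500 * 1.609
    km_per_ms = 300000 / 1.47 / 1000
    one_way_ms = route_km / km_per_ms
    print(f"one way ~{one_way_ms:.0f}ms, round trip ~{2 * one_way_ms:.0f}ms")

This works out to roughly 20ms in each direction, matching the approximately 40ms round trip cited above.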

Asynchronous Replication
To perform replication effectively over large distances, numerous asynchronous replication mechanisms have been introduced over the last few years. An example of an asynchronous mechanism is illustrated in Figure 2. In a departure from synchronous replication, the illustrated asynchronous approach allows a write I/O to complete to the host (step 3) without waiting for acknowledgement from the secondary storage. This allows the host to proceed with the next sequential write I/O while the asynchronous replication completes in the background. To ensure data consistency, all major storage vendors offer asynchronous solutions with write-integrity features.
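The pattern can be sketched generically in a few lines of Python: the host's write returns as soon as the local I/O completes, while a background worker drains an ordered queue to the secondary site. This is a simplified illustration of the general approach, not a representation of any vendor's implementation.

    import queue, threading, time

    # Generic sketch of asynchronous replication: writes complete locally and
    # are queued in order for a background worker that ships them to the
    # secondary site. Timings are illustrative; this is not a vendor design.
    pending = queue.Queue()

    def host_write(block):
        time.sleep(0.002)             # ~2 ms local write to primary storage
        pending.put(block)            # queue for replication, preserving order
        return "complete"             # host proceeds without waiting on remote

    def replicator(rtt_s=0.040):
        while True:
            block = pending.get()
            time.sleep(rtt_s)         # send to remote site over a ~40 ms link
            pending.task_done()

    threading.Thread(target=replicator, daemon=True).start()
    for i in range(5):
        host_write(f"block-{i}")      # each write returns in ~2 ms
    pending.join()                    # remote copy catches up in the background

The essential trade-off is visible even in this toy version: host writes are no longer gated by the interconnect latency, but the remote copy lags the primary by whatever is still in the queue.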

Private DWDM Networks
When primary and secondary storage systems are local (within the same building), interconnects are formed via fiber patches to ports on local SAN/LAN equipment at minimal cost. Interconnect costs increase significantly when the second site is within a metropolitan area (<100 miles), but those costs can be addressed very effectively by metro DWDM systems, which have gained significant popularity over the last few years with more than 500 deployments nationwide. Once the metro DWDM system is installed on the fiber, the user has a dedicated MAN infrastructure approaching or equaling the capability of the local SAN/LAN. A large metro DWDM system can readily transport 30GB/s (320Gb/s) over distances of more than about 60 miles via a variety of network interfaces, including Gigabit Ethernet, 1 and 2Gb/s Fibre Channel and FICON, OC-48 SONET, 10 Gigabit Ethernet, and OC-192 SONET.
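The aggregate capacity arithmetic is straightforward. The wavelength count, per-channel rate and overhead allowance below are illustrative assumptions used only to show how a headline figure of this size is reached, not the specification of any particular product.

    # Illustrative metro DWDM capacity: wavelength count, channel rate and the
    # overhead allowance are assumptions for the arithmetic, not product specs.
    wavelengths = 32
    per_channel_gbps = 10                        # e.g. OC-192 / 10 GigE channels
    aggregate_gbps = wavelengths * per_channel_gbps
    usable_gbytes_per_s = aggregate_gbps / 8 * 0.8   # ~20% framing/encoding allowance
    print(f"{aggregate_gbps}Gb/s aggregate, roughly {usable_gbytes_per_s:.0f}GB/s usable")

Under these assumptions the system lands in the range cited above; the exact usable figure depends on the interface mix and framing overhead.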
Private DWDM WANs
Traditionally, when connecting facilities over even greater distances, organizations were forced to lease expensive WAN circuits from carriers, which provision the circuit service over their shared SONET, metro and intercity DWDM infrastructure. Carrier networks use technologies developed for extremely large shared infrastructures with ROIs measured in decades, and they require large-scale, highly specialized operational field forces. As such, intercity DWDM equipment developed for carriers is simply not suitable for dedicated deployment in an enterprise environment. However, recent improvements in DWDM hardware and software have created a new class of carrier-grade DWDM systems that support distances of thousands of miles and are operationally suited to enterprise deployments. Like the metro systems, these new intercity DWDM systems can be equipped to transport dozens of GB/s of data and offer a wide variety of interfaces. Some equipment includes the buffer-to-buffer credit extension necessary to support SAN protocols natively (without external conversion to IP or SONET) over extreme distances. Additionally, some DWDM equipment requires only a single fiber instead of a fiber pair, reducing the fiber expense.
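Buffer-to-buffer credit extension matters because a Fibre Channel sender may only have as many frames in flight as it has credits; to keep a long link full, the credits must cover the round-trip time. The rough calculation below, using an assumed full-size frame, a 2Gb/s line rate and the fiber propagation speed used earlier, shows why distances of hundreds or thousands of miles need credit counts well beyond those typical of local ports.

    # Rough buffer-to-buffer (BB) credit requirement for native Fibre Channel
    # over distance: enough frames must be in flight to cover the round trip.
    # Frame size, line rate and fiber speed are assumed values for illustration.
    FRAME_BYTES = 2148                # approx. full-size FC frame
    LINE_RATE_BPS = 2.125e9           # 2 Gb/s FC (8b/10b encoded line rate)
    KM_PER_MS = 204                   # light in fiber, ~c/1.47

    def bb_credits_needed(distance_km):
        serialization_s = FRAME_BYTES * 10 / LINE_RATE_BPS   # 10 bits/byte on the wire
        rtt_s = 2 * distance_km / KM_PER_MS / 1000
        return rtt_s / serialization_s

    for km in (100, 1000, 4000):      # metro, regional, coast-to-coast scale
        print(f"{km:5d}km -> ~{bb_credits_needed(km):.0f} BB credits")

Under these assumptions, a link of a few thousand miles requires thousands of outstanding frames, which is why native long-haul SAN transport depends on extended credit support in the transport equipment.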
With this new breed of intercity DWDM system, WAN bandwidth can be cost-effectively scaled to the same degree as SAN/LAN/MAN infrastructures. Because this dedicated infrastructure is controlled by the enterprise, new capacity can be provisioned very rapidly (in minutes), and the reliability and security of the enterprise network are greatly enhanced.
Intercity Dark Fiber
To make intercity DWDM systems possible, a dark fiber path must exist that connects the various facilities. Dark fibers are installed optical strands that have no equipment attached to them at any location. While many organizations may not have experience dealing with dark fiber, enterprise-focused DWDM solution providers can assist in fiber planning and acquisition.
With the large carrier fiber deployments of the late 1990s, intercity fiber has become cost-effective and readily available from numerous competing providers. Though dark fiber does not exist everywhere, every major metropolitan area is connected through multiple providers, and the majority of smaller metropolitan areas are also covered. Because intercity fiber routes typically follow railways, interstates and pipelines, many small cities and towns are covered as well.
Manageability of Intercity DWDM Equipment
Careful consideration must be given to the ongoing support and management of a deployed intercity DWDM system. Management is a critical area in which carrier-oriented intercity DWDM systems differ significantly from those designed for enterprises. Selecting an enterprise-oriented intercity DWDM system ensures that the operational staff has a familiar user interface, can use existing management protocols and can integrate the system with existing management tools.
Since intercity DWDM systems inherently include amplification equipment at intermediate sites, the break/fix process is likely to differ slightly from traditional “in the data center” procedures. While at first glance this may appear to be a daunting challenge, numerous network service organizations with national and global coverage provide the network monitoring, diagnosis, remote-site support and break/fix depot sites necessary to maintain strict SLAs. These functions may be provided as part of a turnkey service through the hardware vendor or may be contracted independently.
Managed Network Services
Alternatively, the entire system can be outsourced as a package to a service provider. This naturally costs more than simply contracting a network service organization, but some organizations are more comfortable with the service provider model, and many metro DWDM deployments have used it.
Summary: Benefits of Bandwidth Abundance
By combining synchronous/asynchronous replication techniques with metro/intercity DWDM systems, enterprises can realize the benefits of truly integrated, geographically diverse data centers. This powerful combination enables the simultaneous use of data centers for production, development, testing and DR/BC purposes. Applications can be shifted rapidly from one facility to another, and data can be replicated asynchronously to achieve near-real-time RPOs and RTOs for DR. Other functions enabled include remote tape, tape consolidation, data center consolidation, and the transport of LAN, voice and video services. In addition to significantly increasing operational efficiency, such an infrastructure sets the stage to accommodate new advances in distributed computing architectures.
Jeffrey L. Cox has more than 20 years of experience designing and operating some of the largest national and global computing and network infrastructures for enterprises and carriers. Cox is currently the chief systems architect at Celion Networks, Inc., where he is responsible for the overall system design of DWDM transport systems.