Leveraging Geographically Diverse Data Centers for Production and Disaster Recovery
By JEFFREY L. COX
Disaster recovery (DR) plans and other business drivers
often lead to multiple data centers outfitted with additional
equipment. This approach requires significant and ongoing
investments in real estate, computing, storage, and networking infrastructures that
may not contribute as effectively as possible to the DR, production,
development, and test capabilities within the business. Why don’t
companies leverage their entire distributed infrastructure
for all functions? There have historically been at least two roadblocks
to such plans.
First, it has been prohibitively expensive to secure enough bandwidth
between the sites to maintain acceptable performance and enable the
other uses.
Second, software to allow near real-time asynchronous replication
of large amounts of data across multiple platforms has not been widely deployed.
Recent improvements in both connectivity and storage are poised to maximize
the use of these distributed assets. On the first issue, improvements
in telecommunications equipment have made it possible to remove
the bandwidth constraints between locations. By constructing cost-effective
dedicated DWDM networks, companies can realize dramatically increased
quantities of bandwidth and significant improvements in reliability.
On the second issue, storage and software vendors continue
to refine their technology and now provide replication with
recovery point objectives (RPOs) and recovery time objectives
(RTOs) approaching zero. By combining
these capabilities, data center resources can be leveraged
to their fullest extent.
Benefits of Leveraging Distributed Resources
Driven by economics, many CIOs are refocusing efforts on extracting
additional efficiencies and maximizing the use of existing infrastructures.
In addition, the increasing occurrence of man-made and natural
disasters, recommendations by industry experts, and new regulations
are leading organizations to re-evaluate the true capability of their disaster
recovery, business continuity, and high availability solutions. Naturally,
organizations would like to use existing facilities in remote locations
where possible or locate new facilities as far away as practical. However,
the implications of such distant sites on applications, replication
performance and bandwidth requirements historically have forced organizations
to make significant compromises in their ability to leverage these distributed
resources.
Having the flexibility to dynamically use IT assets whenever and however
the business demands would lead to significant efficiency gains. This
fact has been the driving force behind revolutionary ideas such as LANs,
SANs, clustering, and distributed computing technologies from major
vendors. With this flexibility, applications and functions can be executed
wherever the resources are available on a moment-by-moment basis. Simultaneously,
the distributed geography of such a design can provide for a decidedly
more robust DR/BC/HA environment.
After analysis, many organizations conclude that cost-effective and
robust use of distributed resources in this manner simply cannot be
achieved because of bandwidth constraints. In this article,
we will explore improvements in connectivity and storage solutions that
promise to make such scenarios cost justifiable to the business.
Bandwidth
The amount of bandwidth required between the centers is often hard to
quantify, and not having enough bandwidth is always a risk. Adding bandwidth
after the fact has traditionally been very slow and expensive. Initial
project budgets often severely restrict the amount of bandwidth that
can be procured, forcing many solutions to be scaled back to levels
that call the original design concepts into question. The cost of bandwidth
between facilities has led to a process of “bandwidth rationing”
where bandwidth is treated as a scarce resource that must be parceled
out in tiny increments so as to allow each application to have its “share.”
Naturally, such a process leads to “bandwidth exhaustion”
where the free flow of data between centers does not occur at the rates
required for the flexible use of distributed data centers.
Conversely, bandwidth is not generally rationed within data centers,
as it is cost-effective to simply add SAN/LAN capacity through switch/server/disk
upgrades. Quantitatively, a large data center may have cumulative switching
capacities on the order of hundreds of GB/s (GigaBytes/second) via hundreds
of SAN/LAN ports. Relative to WANs, there is “bandwidth abundance”
within data centers.
To achieve similar connectivity over larger distances, the wide area
networks (WANs) between cities must grow to similar scales. This requirement
is in stark contrast to what most companies lease in WAN capacity today.
Even between the largest distant data centers, the largest circuit in
use is an OC-48 (2.5Gb/s or ~250MB/s) or in a few rare cases an OC-192
(10Gb/s or ~1GB/s). In most other environments, bandwidths of a much
smaller scale are in use (OC-3s, DS-3s, and even T1s). In any event,
even the largest of these does not achieve the goal of making the WAN
the same scale as the SAN/LAN. The cost of the WAN bandwidth has been
the largest factor inhibiting deployments of the necessary scale. The
mechanisms used to solve this problem are addressed in the sections
below.
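To make the gap between these circuit sizes concrete, the following minimal sketch estimates how long it would take to move one terabyte across each circuit type named above. The line rates are the standard nominal rates for each circuit. The `efficiency` parameter (the share of the line rate left for payload after protocol overhead) and the function name are assumptions made for illustration only.

```python
# Nominal line rates in megabits per second for the circuit types
# mentioned in the text.
LINE_RATES_MBPS = {
    "T1": 1.544,
    "DS-3": 44.736,
    "OC-3": 155.52,
    "OC-48": 2488.32,
    "OC-192": 9953.28,
}


def transfer_hours(dataset_gigabytes: float, rate_mbps: float,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move a dataset over a link.

    efficiency is an ASSUMED fraction of the line rate that is
    available for payload; real values depend on the protocols used.
    """
    bits = dataset_gigabytes * 8e9           # decimal gigabytes to bits
    usable_bps = rate_mbps * 1e6 * efficiency
    return bits / usable_bps / 3600


if __name__ == "__main__":
    for name, rate in LINE_RATES_MBPS.items():
        hours = transfer_hours(1000, rate)   # 1000 GB = 1 TB
        print(f"{name:>6}: {hours:10.1f} hours per terabyte")
```

Under these assumptions a terabyte takes on the order of an hour over an OC-48 but well over a thousand hours over a T1, which illustrates why rationed WAN circuits cannot stand in for SAN/LAN-scale connectivity.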
SAN Replication Over Distance
A prominent tool in DR/BC planning is the movement of data from one
location to another via server-based, network-based and storage-based
replication. The basic idea is that having multiple copies of data,
locally and remotely, will protect the overall environment if something
should happen to the primary copy of the data. Many major installations
replicate data between two or more SAN disk systems located “regionally”
(within the same building or metropolitan area). But corporations are
realizing that the proximity of this replicated data is not providing
sufficient protection for the business.
Synchronous Replication
Much of the replication activity between SAN disk systems utilizes a
synchronous write mechanism to make copies of data. An example of a
synchronous replication mechanism is illustrated in Figure 1. While
the synchronous approach ensures data consistency in both locations,
it only performs well when the interconnect latency (the delay in the
path between the primary and secondary storage) is approximately 1ms
or less. This is because, in the synchronous approach,
the host is blocked waiting for the entire synchronous write to complete
before proceeding with the next write I/O. Typically, a local write
I/O (between the host and the primary storage) can complete in 1-3ms
without replication. With replication, the interconnect latency and
the I/O completion time on the secondary storage must be added to the
local I/O completion time. As the interconnect latency increases, the
host must wait longer before receiving a completion status, which reduces
application throughput. For applications that rely on sequential write
ordering to maintain consistency (typical of a transaction environment),
the latency limitation makes synchronous replication impractical over
large distances. This is particularly true for the cases where interconnect
latencies are significantly greater than those induced by metropolitan
scale distances of about 60 miles (about 1ms latency). That is, for
longer distances, the interconnect latency dominates the overall I/O
completion time.
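The additive model described above can be sketched in a few lines. This is a simplified illustration, not a vendor's actual algorithm: it assumes strictly serialized writes (one outstanding write at a time) and simply sums the three components the text names. The function names and the sample values of 2ms for the local and secondary writes are assumptions.

```python
def sync_write_ms(local_ms: float, remote_ms: float, rtt_ms: float) -> float:
    """Completion time of one synchronously replicated write.

    Following the text, the interconnect round trip and the I/O time
    on the secondary storage are added to the local I/O time.
    """
    return local_ms + rtt_ms + remote_ms


def serialized_writes_per_second(write_ms: float) -> float:
    """Upper bound on write rate when each write must wait for the last."""
    return 1000.0 / write_ms


if __name__ == "__main__":
    # ASSUMED sample values: 2 ms local write, 2 ms secondary write.
    for rtt in (1.0, 10.0, 40.0):
        total = sync_write_ms(local_ms=2.0, remote_ms=2.0, rtt_ms=rtt)
        rate = serialized_writes_per_second(total)
        print(f"RTT {rtt:5.1f} ms -> {total:5.1f} ms per write, "
              f"{rate:6.1f} ordered writes/s")
```

With these sample values, a 1ms interconnect allows about 200 ordered writes per second, while a 40ms interconnect allows fewer than 23, which is the sense in which the interconnect latency dominates at long distances.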
When the secondary storage is in a geographically remote location thousands
of miles away, the performance of synchronous replication degrades
in proportion to the distance traversed. As an example, consider a primary
site located in New York City and a back-up site located in Dallas,
Texas. Common fiber routes would traverse approximately 2,500 miles
between these locations and would inherently induce about 40ms of round-trip-time
(RTT) latency. This latency is unavoidable due to the speed of light
propagation in the fiber. It is important to note that while most applications
would continue to function with write I/O completion times on the order
of 40ms, their performance would suffer to the point
of being unusable.
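The latency figures quoted for metropolitan and intercity distances can be checked with a short calculation. The sketch below assumes a group refractive index of about 1.468, a typical value for standard single-mode fiber, and counts only propagation delay (no equipment or queuing delay). The function name is hypothetical.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
KM_PER_MILE = 1.609344


def fiber_rtt_ms(route_miles: float, refractive_index: float = 1.468) -> float:
    """Round-trip propagation delay over a fiber route, in milliseconds.

    refractive_index is an ASSUMED typical value for single-mode fiber;
    light travels through the glass at c divided by this index.
    """
    route_km = route_miles * KM_PER_MILE
    one_way_s = route_km * refractive_index / SPEED_OF_LIGHT_KM_S
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    print(f"60 miles:    {fiber_rtt_ms(60):.2f} ms RTT")
    print(f"2,500 miles: {fiber_rtt_ms(2500):.2f} ms RTT")
```

Under this assumption the result is just under 1ms for a 60-mile route and just under 40ms for a 2,500-mile route, consistent with the figures given in the text.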

Asynchronous Replication
In order to effectively perform replication over large distances, numerous
asynchronous replication mechanisms have been introduced over the last
few years. An example of an asynchronous mechanism is illustrated in
Figure 2. In a departure from synchronous replication, the illustrated
asynchronous approach allows a write I/O to complete to the host (step
3) without waiting for acknowledgements from the secondary storage.
This allows the host to proceed with the next sequential write I/O while
the asynchronous replication mechanism completes in the background.
To ensure data consistency, all major storage vendors offer asynchronous
solutions with write integrity features.
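The idea of acknowledging the host immediately while preserving write order on the secondary can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not any vendor's implementation: the class `AsyncReplicator`, its methods, and the in-memory dictionaries standing in for storage are all hypothetical.

```python
from collections import deque


class AsyncReplicator:
    """Toy model: acknowledge writes locally, ship them later in order."""

    def __init__(self) -> None:
        self.primary = {}        # block number -> data at the primary site
        self.secondary = {}      # block number -> data at the remote site
        self.pending = deque()   # acknowledged writes not yet applied remotely
        self.next_seq = 0

    def write(self, block: int, data: bytes) -> int:
        """Host-facing write: completes without waiting on the remote site."""
        self.primary[block] = data
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append((seq, block, data))
        return seq               # returned to the host immediately

    def drain(self, max_writes: int) -> int:
        """Background step: apply up to max_writes remotely, oldest first."""
        applied = 0
        while self.pending and applied < max_writes:
            _seq, block, data = self.pending.popleft()
            self.secondary[block] = data
            applied += 1
        return applied

    def lag(self) -> int:
        """Number of acknowledged writes the secondary is still missing."""
        return len(self.pending)


if __name__ == "__main__":
    r = AsyncReplicator()
    r.write(1, b"a")
    r.write(2, b"b")
    r.write(1, b"c")
    r.drain(2)
    print(r.secondary, "lag:", r.lag())
```

Because the queue is drained strictly in sequence order, the secondary always reflects some earlier point-in-time state of the primary, never a reordered mix of writes. The `lag` value is a rough stand-in for the recovery point exposure at any given moment.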

Private DWDM Networks
When primary and secondary storage systems are local (within the same
building), interconnects are formed via fiber patches to ports on local
SAN/LAN equipment with minimal costs. Interconnect costs increase significantly
when the second site is within a metropolitan area (<100 miles);
but those costs can be addressed very effectively by metro DWDM systems.
Metro DWDM systems have gained significant popularity over the last
few years with more than 500 deployments nationwide. Once the metro
DWDM system is installed on the fiber, the user has a dedicated MAN
infrastructure approaching or equaling the capability of their local
SAN/LAN. A large metro DWDM system can readily transport about
30 GB/s (320 Gb/s) over distances of more than 60 miles via a variety of network
interfaces including Gigabit Ethernet, 1 and 2 Gb/s Fibre Channel (FC) and
FICON, OC-48 SONET, 10 Gigabit Ethernet, and OC-192 SONET.
Private DWDM WANs
Traditionally, when connecting facilities over even greater distances,
organizations were forced to lease expensive WAN circuits from carriers.
Carriers utilize their shared SONET, metro, and intercity DWDM infrastructure
to provide the circuit service. Carrier networks use technologies that
were developed for extremely large shared infrastructures with ROIs
measured in decades and require large-scale highly specialized operational
field forces. As such, intercity DWDM equipment developed for carriers
is simply not suitable for a dedicated deployment in an enterprise environment.
However, recent improvements in DWDM hardware and software have created
a new class of carrier-grade DWDM systems that support distances of
thousands of miles and are operationally suited for enterprise deployments.
Like the metro systems, these new intercity DWDM systems can be equipped
to transport dozens of GB/s of data and offer a wide variety of interfaces.
Some equipment includes the buffer-to-buffer credit extension necessary
to support the SAN protocols natively (without external conversion to
IP or SONET) over extreme distances. Additionally, some DWDM equipment
requires only a single fiber instead of a fiber pair, thus reducing the
fiber expense.
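To see why credit extension matters over long distances, the following rough sketch estimates how many buffer-to-buffer credits a Fibre Channel link would need to keep the fiber full. It rests on several assumptions that are not from the article: every frame is full size (2,148 bytes), the fiber adds about 5 microseconds of delay per kilometer, and a credit is returned only after a full round trip. The function name is hypothetical.

```python
import math


def bb_credits_needed(route_km: float, throughput_mb_s: float,
                      frame_bytes: int = 2148,
                      fiber_us_per_km: float = 5.0) -> int:
    """Estimate the credits needed to sustain full throughput.

    ASSUMPTIONS: full-size frames, about 5 microseconds of one-way
    fiber delay per km, and one credit outstanding per frame until the
    receiver's acknowledgement returns. Smaller frames need more credits.
    """
    rtt_s = 2 * route_km * fiber_us_per_km * 1e-6
    frame_time_s = frame_bytes / (throughput_mb_s * 1e6)
    return math.ceil(rtt_s / frame_time_s)


if __name__ == "__main__":
    # Nominal throughputs of ~100 MB/s and ~200 MB/s for 1 and 2 Gb/s FC.
    print("100 km at 100 MB/s:  ", bb_credits_needed(100, 100))
    print("4,000 km at 200 MB/s:", bb_credits_needed(4000, 200))
```

Under these assumptions a metro-scale link needs a few dozen credits while an intercity link needs thousands, which suggests why equipment without credit extension cannot carry SAN protocols natively at full rate over such distances.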
With this new breed of intercity DWDM system, the WAN bandwidth can be
cost-effectively scaled to the same degree as SAN/LAN/MAN infrastructures.
Since this dedicated infrastructure is controlled by the enterprise,
new capacity can be provisioned very rapidly (in minutes) and the reliability
and security of the enterprise network is greatly enhanced.
Intercity Dark Fiber
To make intercity DWDM systems possible, a dark fiber path must exist
that can connect the various facilities. Dark fibers are installed optical
strands that have no equipment attached to them at any location. While
many organizations may not have experience in dealing with dark fiber,
DWDM solution providers that are enterprise focused can assist in fiber
planning and acquisition.
With the large carrier fiber deployments in the late 1990s, intercity
fiber has become cost-effective and readily available from numerous
competing providers. Though dark fiber does not exist everywhere, every
major metropolitan area is connected through multiple providers and
the majority of smaller metropolitan areas are also covered. Due to
intercity fiber routes that typically follow railways, interstates,
and pipelines, many small cities and towns are also covered.
Manageability of Intercity DWDM Equipment
Careful consideration must be given to the ongoing support and management
of a deployed intercity DWDM system. Management is a critical area where
carrier-oriented intercity DWDM systems differ significantly from those
designed for enterprises. Selecting an enterprise-oriented intercity
DWDM system helps ensure that the operational staff has a familiar user
interface and can use existing management protocols, and that the system
integrates with existing management tools.
Since intercity DWDM systems inherently include amplification equipment
at intermediate sites, the break/fix process is likely to be slightly
different than the traditional “in the data center” procedures.
While at first glance this may appear to be a daunting challenge, there
are numerous network service organizations that have national and global
coverage providing network monitoring, diagnosis, remote site support
and depot sites for break/fix, all necessary to maintain strict SLAs.
These functions may be provided as part of a turnkey service through
the hardware vendor or may be independently contracted.
Managed Network Services
Alternatively, the entire system can be outsourced as a package to a
service provider. Naturally, this costs more than simply hiring a network
service organization. But some organizations feel more comfortable with
the service provider model. Many metro DWDM deployments have used this
model.
Summary: Benefits of Bandwidth Abundance
By using a combination of synchronous/asynchronous replication techniques
and metro/intercity DWDM systems, enterprises can realize the benefits
of having truly integrated geographically diverse data centers. This
powerful combination enables the simultaneous use of data centers for
production, development, testing, and DR/BC purposes. Applications can
be rapidly shifted from one facility to another, and data can be asynchronously
replicated to achieve near real-time RPOs and RTOs for DR. Other functions
that are enabled include remote tape, tape consolidation, datacenter
consolidation, and the transport of LAN, voice, and video services.
In addition to significantly increasing the efficiency of an operation,
such an infrastructure sets the stage to accommodate new advances in
distributed computing architectures.
Jeffrey L. Cox has more than 20 years of experience in designing and operating
some of the largest national and global scale computing and network
infrastructures for enterprises and carriers. Cox currently works as
the chief systems architect at Celion Networks, Inc. where he is responsible
for overall system design of DWDM transport systems.
To comment on this article, go to 1702-12 at www.drj.com/feedback.
©Copyright
2004 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of System Support Inc. is prohibited.