DISASTER RECOVERY JOURNAL
Winter 2002
DATABASE REPLICATION
Special Challenges Over Extended Distance
By TOM FLESHER
Why replicate your
database? The simple answer is to have a hot standby copy of
your organization's most critical data in case you need it. In
case you need it for:
Business Continuity: your primary copy of the data is knocked
out or is unavailable for an unacceptable length of time; in other
words, an unscheduled outage.
Continuous Availability: your primary copy will be unavailable
due to a planned and deliberate event, such as maintenance, migrations
or reorganizations.
Workload Balancing: you can run data mining applications
against the replica copy and avoid the overhead on the primary copy.
This article will focus on special challenges encountered when contemplating
data replication over extended distances (generally speaking,
distances greater than 100 miles).
Sometimes this extended distance requirement will be imposed for purely
technical reasons: the primary copy resides at one location but
the replica needs to reside far away to be readily available to a group
of users at the remote location.
Another reason is readily apparent to readers of this publication:
the extended distance requirement is frequently imposed as a way to
minimize risk. For some organizations the replica must be far enough
away to avoid being impacted by circumstances that might render the
primary copy unusable. In many cases, particularly in the United States,
the replica must be hundreds if not thousands of miles away from the
primary instance of the database. Natural disasters and other causes
of unscheduled outages may affect both a primary site and the backup
site if they are too close together.
Some organizations need multiple replicas. Perhaps a remote copy of
the database is needed for true disaster recovery applications, and
in addition a local copy is used for high availability or data mining
applications.
Homogeneous vs. Heterogeneous Replication
For contingency or high availability applications, homogeneous replication
is normally used. A homogeneous replica copy is logically equivalent
to the source copy of the database. An application which runs correctly
on the source copy can run successfully using a homogeneous replica.
The application can't tell the difference.
In contrast, a heterogeneous replica is different in some important attributes.
The data model may be different, the platform may be different, even
the database management system may be different. Heterogeneous replication
is typically required for data mining and data warehouse applications,
where summarizations and other data transformations are needed to present
the data in a meaningful way.
Replication And Change Propagation: Same Thing?
The broadest definition of database replication involves the establishment
of a consistent copy of a source database and keeping it reasonably
up-to-date. In the strictest sense, however, replication means making
a copy of the database at a point in time. Changes to the primary copy
after initial replication time are not reflected in the replica until
another full copy is made, unless change propagation is
also part of the configuration.
Some replication applications involve a sweep of the entire database
on a regular, perhaps daily, basis. The sweep collects all the database
information, including rows and records which have not changed since
the previous sweep. This complete image is then used to refresh the
replica copy.
If the replica is located far away from the source copy, a complete
refresh can be especially challenging. The sheer quantity of data that
must be copied will dictate how much communications bandwidth is required.
A lot of this bandwidth is wasted if unchanged data is copied when,
in fact, it doesn't need to be copied!
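The arithmetic behind a full refresh can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures (a 500 GB database and an eight-hour refresh window) are assumptions chosen for the example, and protocol overhead and compression are ignored.

```python
def required_bandwidth_mbps(database_bytes: int, window_seconds: float) -> float:
    """Sustained link rate, in megabits per second, needed to copy the
    whole database within the refresh window (protocol overhead ignored)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return database_bytes * 8 / window_seconds / 1_000_000


def wasted_fraction(database_bytes: int, changed_bytes: int) -> float:
    """Share of a full refresh spent re-sending data that did not change."""
    if database_bytes <= 0:
        raise ValueError("database_bytes must be positive")
    return 1.0 - changed_bytes / database_bytes


if __name__ == "__main__":
    # Hypothetical figures: 500 GB database, 8-hour refresh window,
    # 5 percent of the data changed since the previous sweep.
    size = 500 * 10**9
    print(f"{required_bandwidth_mbps(size, 8 * 3600):.1f} Mbps sustained")
    print(f"{wasted_fraction(size, size // 20):.0%} of the transfer is unchanged data")
```

Under these assumed figures the link must sustain roughly 139 megabits per second for the full eight hours, and most of that capacity carries data the replica already has.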
In common usage, however, replication is usually meant to include some
form of change propagation. For business continuity it is important
that the replica be reasonably current so that transaction loss is minimized
in case of an unscheduled outage. A daily refresh of the replica implies
that up to a day's worth of transaction activity may not be reflected
in the replica. For some applications this may be acceptable, but for
any organization that relies on real-time information, it is completely
unacceptable.
Without some sort of change propagation, the only way to ensure a remote
database replica is reasonably current is to replicate frequently. But
the cost and potential performance ramifications of very frequent complete
refreshes make this approach highly undesirable over extended distances.
With change propagation as part of an extended-distance replication
solution, an initial refresh (or full copy) is used to create a starting
image of the remote replica. Thereafter, changes are propagated based
on transaction activity.
A full refresh copy is typically done on an asynchronous basis. Change
propagation can be done either synchronously or asynchronously.
With fully synchronous technologies, the replica is updated totally
in sync with the primary copy. Whether this is done at the application
layer, the database layer or the physical layer, it doesn't matter.
A given business transaction causes both the primary copy and replica
to be updated simultaneously. Change propagation over extended distances
introduces special challenges that make synchronous propagation problematic.
Is Replication The Same Thing As Mirroring?
Technically, mirroring is a form of replication; usually
it means synchronous replication and change propagation performed at
the physical disk layer. The disk subsystem mirrors every write to a
remote copy. The disk write command is considered complete only when
both the primary copy and the mirror are updated.
This works reasonably well over shorter distances, but over longer
distances there is a problem with synchronous disk mirroring,
and that's called propagation delay or latency.
The time taken to send the data over a communications link and wait
for an acknowledgement of successful receipt becomes perceptible over
extended distances. If the application depends on very fast disk service
times, it will slow down since disk throughput is being reduced.
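A rough feel for the size of this delay can be had from a short Python sketch. It assumes a signal speed in optical fiber of about 200,000 km per second (roughly two-thirds of the speed of light) and treats the cable route as a straight line; real routes are longer and add equipment delay, so the figures are a lower bound. The function names and the one-millisecond local service time are illustrative assumptions.

```python
FIBER_KM_PER_SECOND = 200_000.0  # assumed signal speed in fiber, about 2/3 of c
KM_PER_MILE = 1.609344


def round_trip_ms(distance_miles: float, overhead_ms: float = 0.0) -> float:
    """Time to send a write and receive its acknowledgement, in milliseconds.
    overhead_ms is an optional allowance for equipment delay."""
    one_way_seconds = distance_miles * KM_PER_MILE / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * 1000 + overhead_ms


def max_serial_writes_per_second(distance_miles: float, local_service_ms: float) -> float:
    """Upper bound on strictly sequential synchronous writes, where each
    write must be acknowledged by the mirror before the next one starts."""
    return 1000.0 / (local_service_ms + round_trip_ms(distance_miles))


if __name__ == "__main__":
    for miles in (10, 100, 1000):
        print(miles, "miles:",
              f"{round_trip_ms(miles):.2f} ms round trip,",
              f"{max_serial_writes_per_second(miles, 1.0):.0f} serial writes/s")
```

With these assumptions, a 1,000-mile link adds about 16 milliseconds to every synchronous write, which is why a dataset written sequentially, such as a log, is hit hardest.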
In large-scale database systems, for example, certain disk datasets associated
with the logging and commit-manager components are particularly sensitive
to propagation delays introduced by disk mirroring. You have to mirror
the logs and related datasets as well as the databases themselves if
you want to have restartability in the event of an unscheduled outage,
so this poses a serious challenge!
How Do You Replicate Databases Over Extended Distances If Synchronous Mirroring Doesn't Work?
There are both hardware approaches and software approaches for extended
distance database replication.
Asynchronous (i.e., non-synchronous) disk mirroring technologies have
been available for some time and continue to be improved. In these hardware-oriented
approaches, disk updates are committed at the local site, and are then
independently applied at the remote location. Normally the time delay
between the local and remote update is very small: a few seconds
or less.
Because asynchronous disk mirroring is independent of the disk write
operation occurring at the source site, using it should not impact the
performance and throughput of the production application. However, if
the bandwidth provided is not sufficient to enable the technology to
keep up with the aggregate sustained write rate, the solution
may ultimately slow down disk response at the production location and
thus impact performance.
Asynchronous disk mirroring needs to have a buffer area large enough
to handle the in-flight work. If the distance between the primary site
and the backup site is very large, or if latency is high due to the
use of a satellite link, the buffer area will need to be extremely large.
An allowance also needs to be made for the occasional interruption in
the availability of the communications link. Such interruptions occur
from time to time and usually do not last long enough to warrant switching
the application from the primary site to the backup site.
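The sizing logic in the last two paragraphs can be sketched as follows. This is a simplified model, not a vendor formula: it assumes a steady write rate, and the function names and sample numbers are hypothetical.

```python
import math


def buffer_bytes_needed(write_rate_bytes_per_s: float,
                        round_trip_s: float,
                        outage_allowance_s: float) -> float:
    """Data that must be held locally: writes still awaiting remote
    acknowledgement plus writes accumulated during a link interruption."""
    return write_rate_bytes_per_s * (round_trip_s + outage_allowance_s)


def drain_seconds(backlog_bytes: float,
                  link_rate_bytes_per_s: float,
                  write_rate_bytes_per_s: float) -> float:
    """Time to work off a backlog once the link returns. If the link is no
    faster than the sustained write rate, the backlog never drains."""
    spare = link_rate_bytes_per_s - write_rate_bytes_per_s
    if spare <= 0:
        return math.inf
    return backlog_bytes / spare


if __name__ == "__main__":
    # Hypothetical: 5 MB/s of writes, 0.5 s satellite round trip,
    # and an allowance for a 60-second link interruption.
    print(buffer_bytes_needed(5_000_000, 0.5, 60))
    print(drain_seconds(300_000_000, 8_000_000, 5_000_000))
```

The second function illustrates the earlier warning about bandwidth: when the link cannot exceed the aggregate sustained write rate, the buffer only grows, and the production site eventually feels it.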
In addition, there must be very precise time-sequencing to ensure that
the remote replica is consistent at any given time, since the remote
replica may be needed for disaster recovery or some other unscheduled
outage scenario.
The bandwidth required for full-scale replication of a large and frequently-changed
database can be truly staggering, since all database writes and associated
logging writes must be mirrored to guarantee restartability. When a
transaction updates a database row or record, the entire physical disk
block or track must be rewritten. The rewritten track (typically
50,000 bytes or more) must be sent to the remote location even
if only a few bytes were changed by the updating transaction.
In effect, the WAN needs to have enough capacity to handle every disk
write over the extended distance. Thus, even with an asynchronous approach,
the bandwidth can be a real challenge, at least as far as cost of ownership
is concerned!
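To put numbers on this, the following Python sketch uses the 50,000-byte track size mentioned above. The write rate of 500 per second and the 100 changed bytes per update are hypothetical inputs, and the function names are illustrative.

```python
def mirror_bandwidth_mbps(writes_per_second: float,
                          bytes_per_write: int = 50_000) -> float:
    """Link capacity, in megabits per second, needed to ship every
    rewritten track to the remote site."""
    return writes_per_second * bytes_per_write * 8 / 1_000_000


def write_amplification(bytes_per_write: int, bytes_changed: int) -> float:
    """How many bytes cross the WAN for each byte the transaction changed."""
    if bytes_changed <= 0:
        raise ValueError("bytes_changed must be positive")
    return bytes_per_write / bytes_changed


if __name__ == "__main__":
    print(mirror_bandwidth_mbps(500), "Mbps for 500 track writes per second")
    print(write_amplification(50_000, 100), "bytes sent per byte changed")
```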
What Kinds Of Alternative Software-based Approaches Are Being Used?
Software-based approaches can perform the replication at a more targeted
and granular level, where only the changed data relevant to a business
transaction is conveyed to the remote location and is applied to the
replica copy. This assumes of course that an initial copy of the database
is made using an appropriate technology.
Some solutions are imbedded in the application. In one form of change
propagation, middleware message-queue software can be used to encapsulate
the essential change information generated by a business transaction.
This information is sent asynchronously to the remote location and application-specific
code applies the change to the remote replica.
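The message-queue pattern just described can be sketched in Python. An in-process `queue.Queue` and a worker thread stand in for the middleware and the remote site, and plain dictionaries stand in for the two database copies; all of these, along with the account names, are assumptions made for illustration.

```python
import queue
import threading


def run_demo():
    primary = {}             # stands in for the production database
    replica = {}             # stands in for the remote replica
    changes = queue.Queue()  # stands in for the message-queue middleware

    def apply_at_remote():
        # Application-specific code at the remote site: take each change
        # message in order and apply it to the replica.
        while True:
            message = changes.get()
            if message is None:  # sentinel: no more changes
                changes.task_done()
                break
            key, value = message
            replica[key] = value
            changes.task_done()

    worker = threading.Thread(target=apply_at_remote)
    worker.start()

    for key, value in [("acct-1", 100), ("acct-2", 250), ("acct-1", 75)]:
        primary[key] = value       # update the primary copy first
        changes.put((key, value))  # queue the change without waiting for the remote

    changes.put(None)
    changes.join()  # wait here only so the demo can compare both copies
    worker.join()
    return primary, replica


if __name__ == "__main__":
    primary, replica = run_demo()
    print(primary == replica)
```

The transaction never waits on the remote site; the replica simply trails the primary by however long the messages spend in the queue.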
Another application-specific approach involves literally queuing a message
simultaneously to both the primary site and the backup site. In effect
the transaction is processed at both sites. Typically this kind of approach
must be designed into an application when it is first developed;
the costs to retrofit for this approach are normally prohibitive.
Any application-specific replication approach involves challenges in
terms of development resources: it takes more effort (and money)
to develop the application in the first place, and maintenance costs
over time are increased.
Other solutions use the change log or journal created by the database
management system itself. The log is sent as close to real-time as possible,
and changes are applied using the log; this is sometimes called
a "log apply" process.
A database's change log is a concise and time-sequenced record
of change activity occurring against a database. It consists of both
"before" and "after" images of rows and records
changed by transaction activity. These logs have to exist for basic
database backup and recovery processes to work properly. For database
replication applications, the "after" images are of particular
interest.
In highly optimized solutions, extraneous information in the change
log is filtered out, reducing the bandwidth requirement. For example,
change information pertaining to temporary or non-critical databases
may not be needed and is thus eligible for filtering.
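A minimal sketch of this kind of filtering is shown below. The `LogRecord` layout and the database names are hypothetical; a real change log carries far more detail than a database name and a payload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    database: str   # which database the change belongs to
    payload: bytes  # the change data itself


def filter_log(records, replicated_databases):
    """Keep only records for databases that need a remote replica."""
    return [r for r in records if r.database in replicated_databases]


def total_bytes(records):
    """Payload volume that would have to cross the WAN."""
    return sum(len(r.payload) for r in records)


if __name__ == "__main__":
    log = [
        LogRecord("ORDERS", b"x" * 200),
        LogRecord("TEMP_WORK", b"x" * 5000),  # temporary, not worth replicating
        LogRecord("CUSTOMERS", b"x" * 300),
    ]
    kept = filter_log(log, {"ORDERS", "CUSTOMERS"})
    print(total_bytes(log), "bytes logged,", total_bytes(kept), "bytes sent")
```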
In one solution, staging tables are used to capture changes
deduced from the change log. These staging tables are managed at the
production site and change propagation is driven from them. The overhead
for the maintenance of staging tables (disk and CPU) is a factor affecting
cost of the solution.
Log apply can be accomplished either physically or logically. Physical
log apply simulates a database recovery: changes are applied using
physical keys imbedded in the "after" image log data. The
remote replica is physically identical to the production copy.
Logical log apply involves conversion of the change log from its original
format to logical record or row images. The changes are applied to the
replica using SQL or other appropriate database management API.
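Logical log apply can be illustrated with Python's standard `sqlite3` module. The `account` table, the three-field change tuples and the function name are assumptions for the example; they are not the format of any particular product's log.

```python
import sqlite3


def apply_logical_changes(conn, changes):
    """Apply row-level changes, already converted from the log into
    logical form, to the replica using ordinary SQL."""
    for operation, row_id, balance in changes:
        if operation == "insert":
            conn.execute("INSERT INTO account (id, balance) VALUES (?, ?)",
                         (row_id, balance))
        elif operation == "update":
            conn.execute("UPDATE account SET balance = ? WHERE id = ?",
                         (balance, row_id))
        elif operation == "delete":
            conn.execute("DELETE FROM account WHERE id = ?", (row_id,))
        else:
            raise ValueError(f"unknown operation: {operation}")
    conn.commit()  # make the whole batch visible together


if __name__ == "__main__":
    replica = sqlite3.connect(":memory:")
    replica.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    apply_logical_changes(replica, [
        ("insert", 1, 100),
        ("insert", 2, 250),
        ("update", 1, 75),
        ("delete", 2, None),
    ])
    print(replica.execute("SELECT id, balance FROM account").fetchall())
```

Because the changes arrive as logical rows rather than physical blocks, the replica does not have to share the physical layout of the production copy.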
Whether the log is applied logically or physically, it is important
that complete sets of log records be processed as a unit in order to
guarantee consistency of the replica database. The software should observe
special commit log records and apply changes only for committed transactions.
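The commit rule can be sketched as follows. The log here is a simplified, hypothetical list of tuples ("update", "commit" and "abort" records tagged with a transaction ID), and a dictionary stands in for the replica.

```python
from collections import defaultdict


def apply_committed(log, replica):
    """Hold each transaction's changes until its commit record arrives,
    then apply them as a unit. Returns changes still in flight."""
    pending = defaultdict(list)
    for record in log:
        kind, txid = record[0], record[1]
        if kind == "update":
            _, _, key, after_image = record
            pending[txid].append((key, after_image))
        elif kind == "commit":
            # Apply the complete set of changes for this transaction.
            for key, after_image in pending.pop(txid, []):
                replica[key] = after_image
        elif kind == "abort":
            pending.pop(txid, None)  # discard; never reaches the replica
    return dict(pending)


if __name__ == "__main__":
    log = [
        ("update", "T1", "a", 1),
        ("update", "T2", "b", 2),
        ("commit", "T1"),
        ("update", "T2", "c", 3),
        ("abort", "T2"),
        ("update", "T3", "d", 4),  # no commit record seen yet
    ]
    replica = {}
    in_flight = apply_committed(log, replica)
    print(replica, in_flight)
```

Only the committed transaction reaches the replica; the aborted one is dropped, and the one with no commit record is held back until its outcome is known.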
Any extended-distance replication solution should obey the principles
of atomicity, consistency, isolation, and durability, the so-called
ACID qualities.
These software-based approaches typically require much less bandwidth
than hardware approaches and thus are frequently selected for extended-distance
applications.
In any software-based approach, resources are needed to manage the replication
processes, including change propagation and apply processes. These resources
must be sufficient to enable the replica to keep up with
the updates emanating from the primary site. If it takes two hours to
apply one hour's worth of changes, the solution is not viable!
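The two-hours-for-one example can be turned into a simple check. The sketch assumes steady, continuous change activity; the function name is illustrative.

```python
def replica_lag_hours(elapsed_hours: float,
                      apply_hours_per_hour_of_changes: float) -> float:
    """How far behind the replica is after a period of steady activity.
    A ratio of 2.0 means it takes two hours to apply one hour of changes."""
    if apply_hours_per_hour_of_changes <= 1.0:
        return 0.0  # the apply process keeps up
    applied = elapsed_hours / apply_hours_per_hour_of_changes
    return elapsed_hours - applied


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 2.0):
        print(ratio, "->", replica_lag_hours(8, ratio), "hours behind after 8 hours")
```

At a ratio of 2.0 the replica is four hours behind after an eight-hour day, and the gap keeps widening for as long as the workload continues.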
Conclusion
Selecting the right replication solution is an important and complex
activity. If you have an extended-distance requirement you need to address
the special challenges mentioned in this article. Good luck!
Tom Flesher (tomf@enet.com) is executive vice-president and chief technology
officer of E-Net Corporation, a California-based software company specializing
in business continuity and database replication solutions for the IBM
mainframe community. As one of the company's founders, he is responsible
for E-Net's product development and serves as primary liaison between
E-Net and its strategic business partners. He has more than 25 years
of experience in database management systems, systems programming,
communications, and data center management. Flesher has spoken at GUIDE,
SHARE, and numerous DBMS-related user groups and conferences about database
recovery, data integrity, and remote site disaster recovery. He holds
US Patent 5,412,801 (with two co-inventors) for gap recovery technology
used in the RRDF remote journaling software product. Flesher earned
a bachelor's degree in mathematics from the College of William
and Mary in Virginia, where he serves on the Board of Directors for
the Fund for William and Mary.
To comment on this article, go
to 1501-09 at www.drj.com/feedback.
©Copyright 2002 Systems Support Inc. All rights reserved. Reproduction in whole
or in part in any form or medium without the express written permission
of System Support Inc. is prohibited.