Top-5 Causes of Replication Failure
- Published on November 19, 2007
Five Most Common Causes of Replication Failure
1) Secondary Environment Not Ready for Failover
Establishing a secondary environment that contains an operable copy of a company’s most critical applications and data sounds like the perfect solution. In reality, it is very difficult to maintain two nearly identical environments that may be located hundreds or thousands of miles apart. For a company to fail over from a primary application to a replicated standby server, all software, patches, and configurations need to be consistent, or failover will most likely not function properly. As an extreme case, we have seen companies whose self-managed replication process failed because the secondary server they depended on as a safety net had been quietly repurposed for another task and was no longer available for failover. In these cases, nobody at the company realized the servers were gone until they tried to fail over and it didn’t work! Far more often, failover is unsuccessful because the primary and secondary environments fall out of sync as changes are applied inconsistently to the two.
If using this form of replication, it is critical for a company to take the time necessary to develop processes to control the introduction and distribution of changes and updates to both environments. Additionally, it is important to monitor critical processes that impact the readiness of the secondary environment to ensure system readiness if a situation occurs that requires its use.
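One way to make "out of sync" visible is to compare an inventory of each environment on a schedule. The following is a minimal sketch, not a description of any particular product: the `EnvironmentSnapshot` type, the item names, and the version strings are all invented for illustration, and a real inventory would come from whatever patch and configuration tooling the company already uses.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentSnapshot:
    """Hypothetical inventory of one environment: item name -> version or value."""
    name: str
    items: dict[str, str] = field(default_factory=dict)


def find_drift(primary: EnvironmentSnapshot,
               secondary: EnvironmentSnapshot) -> dict[str, tuple]:
    """Return every item whose value differs between the two environments.

    Each result maps an item name to (primary value, secondary value).
    A value of None means the item is missing from that environment.
    """
    drift = {}
    for key in sorted(set(primary.items) | set(secondary.items)):
        p = primary.items.get(key)
        s = secondary.items.get(key)
        if p != s:
            drift[key] = (p, s)
    return drift


# Invented example data: the secondary missed one patch and one setting.
primary = EnvironmentSnapshot(
    "primary", {"os_patch": "KB-101", "mail_server": "6.5.2", "smtp_port": "25"})
secondary = EnvironmentSnapshot(
    "secondary", {"os_patch": "KB-100", "mail_server": "6.5.2"})

for item, (p, s) in find_drift(primary, secondary).items():
    print(f"DRIFT {item}: primary={p} secondary={s}")
```

An empty result does not prove the secondary will work, but a non-empty one is an early warning that a change reached one environment and not the other.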
2) Manual Error in Failover Process
The reality is that we are only human, and, not surprisingly, people in a crisis are often the weakest link in a failover process. It is not uncommon for manual errors introduced during a failover sequence to corrupt the entire process. After the problem caused by the error is remedied, the entire failover process must be restarted. When it takes 350 steps to fail over a typical 10-server Microsoft Exchange environment, the likelihood of at least one error is very high. According to Gartner Group, only 20 percent of system failures are due to hardware, operating system, or environmental issues. The remaining 80 percent are due to either application errors or human error.
It is important that companies recognize this potential for errors and look for solutions that minimize or eliminate the need for lengthy manual processes. Where processes can be automated, the possibility for errors can be greatly reduced – especially in a time of crisis.
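The effect of step count on reliability can be made concrete with a little arithmetic. The sketch below assumes that each manual step fails independently with the same probability. The 1 percent per-step error rate and the five-step automated alternative are illustrative assumptions, not measurements; only the 350-step figure comes from the discussion above.

```python
def chance_of_any_error(steps: int, per_step_error_rate: float) -> float:
    """Probability that at least one of `steps` independent steps goes wrong."""
    return 1 - (1 - per_step_error_rate) ** steps


# Assumed 1% chance of a mistake on any single manual step.
manual = chance_of_any_error(steps=350, per_step_error_rate=0.01)

# Hypothetical automated process that leaves only 5 manual steps.
automated = chance_of_any_error(steps=5, per_step_error_rate=0.01)

print(f"350 manual steps: {manual:.1%} chance of at least one error")
print(f"  5 manual steps: {automated:.1%} chance of at least one error")
```

Under these assumptions, a 350-step manual sequence has roughly a 97 percent chance of containing at least one mistake, while a five-step sequence has about a 5 percent chance. Even a very careful operator is unlikely to complete a long manual runbook cleanly.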
3) Experts Not Available During Crisis
It’s good to be needed, but too often failover processes are dependent on expert staff who may not be available during a crisis. Because the failover process touches the full range of technical disciplines, from hardware and operating systems to applications, databases, networking, and security, highly specialized personnel in different areas may each need to take responsibility for their portion of the failover process. If even one of these trained individuals is not available, the failover process can break down.
Companies need to develop a list of the skills required in the failover process and of the people available to cover each one. Better yet, look for new solutions that automate many or all of the failover steps or deliver these steps as a remotely managed service – without requiring an expert staff member to trigger the failover process. If your failover process depends on a few key people to work successfully, you may not be able to fail over during a severe crisis.
4) Failover Process Unable to Scale
The old “snowball” effect can hit large organizations in the midst of a failover effort, when a limited technical staff must manage large numbers of systems. It is not unusual for a critical application failure or facility problem to leave dozens of systems or more requiring failover – and multi-server failover is a serial, manual process. One administrator can fail over only one server at a time, and many mutually dependent systems will not work until the entire environment has failed over. The delays are unavoidable and the pressure to rapidly restore service is unrelenting – not a great situation for technical staff working in a crisis. Complex failovers can take many, many hours.
There are not many easy remedies for the scaling problem in the failover process, but having as many trained staff on hand as possible and minimizing system dependencies can help reduce the delays in large failover situations. Likewise, automation can dramatically shorten the failover process by parallelizing important steps and executing complex failover scenarios more quickly and accurately.
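To illustrate what parallelizing a dependent failover might look like, here is a minimal sketch using Python's standard library. It groups servers into "waves" so that each server starts only after the servers it depends on have finished, and runs every server within a wave concurrently. The server names, the dependency map, and the `fail_over` placeholder are hypothetical; a real routine would call whatever tooling actually performs the per-server failover.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def fail_over(server: str) -> str:
    """Placeholder for the real per-server failover routine (hypothetical)."""
    return f"{server} failed over"


def run_failover(dependencies: dict[str, set[str]],
                 max_workers: int = 8) -> list[list[str]]:
    """Fail over servers in dependency order, independent servers in parallel.

    `dependencies` maps each server to the servers that must be up before it.
    Returns the waves that were executed, in order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # Every server in this wave has all of its prerequisites done,
            # so the whole wave can run at the same time.
            list(pool.map(fail_over, ready))
            sorter.done(*ready)
            waves.append(ready)
    return waves


# Invented example environment.
dependencies = {
    "directory": set(),
    "database": set(),
    "mail-backend": {"directory", "database"},
    "mail-frontend": {"mail-backend"},
    "webmail": {"mail-backend"},
}

for number, wave in enumerate(run_failover(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this example five servers complete in three waves rather than five serial steps. The saving grows with the number of servers that do not depend on one another, which is also why reducing system dependencies shortens failover.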
5) Untested Failover Assumptions Don’t Work
Optimism and faith are wonderful character traits, but not when the health of the business depends on them. Unfortunately, many companies have invested in complex, multi-server failover solutions that they have found too sensitive to actually test. Without testing every combination of systems and failure causes, it is impossible to know exactly what will happen during a real crisis. Large, complex environments can have many failure types and scenarios, and multi-server failover involves constantly changing conditions.
It is highly desirable to have a failover solution that allows the organization to test the functionality and pinpoint different failures caused by different behaviors. Some options to consider include incorporation of “what-if” scenario planning sessions and “pre-mortems,” a form of role playing that allows a technical staff to identify untested failover scenarios and potential bottlenecks.
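A scenario-planning session can start from a simple coverage list of which system and failure-cause pairs have actually been exercised. The sketch below is only an illustration: the systems, the failure causes, and the set of tested pairs are invented, and it covers single-system scenarios only, not combinations of simultaneous failures.

```python
from itertools import product

# Invented example inventory.
systems = ["mail-backend", "database", "directory"]
failure_causes = ["hardware fault", "site outage", "data corruption"]

# Pairs that have been exercised in a real failover test (hypothetical).
tested = {
    ("mail-backend", "hardware fault"),
    ("database", "hardware fault"),
}


def untested_scenarios(systems, failure_causes, tested):
    """Return every (system, cause) pair that has never been tested."""
    return [pair for pair in product(systems, failure_causes)
            if pair not in tested]


gaps = untested_scenarios(systems, failure_causes, tested)
print(f"{len(gaps)} of {len(systems) * len(failure_causes)} scenarios untested")
for system, cause in gaps:
    print(f"  untested: {system} / {cause}")
```

Even this small grid leaves seven of nine scenarios untested, and the grid grows multiplicatively with each system or cause added, which is why untested assumptions accumulate so quickly in large environments.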
The Impact of Replication Failure
Replication is a widely used solution for application availability, and it is important for companies to recognize the potential for failure and the likely ramifications when self-managed replication systems inevitably do fail. We have discussed the five most common causes of replication failure, but the impact on the enterprise is also important to understand. First, replication adds complexity: an enterprise will need double the hardware and software of a stand-alone application and will also require additional bandwidth to handle the traffic to the secondary environment. The complex architectures and processes required to support replication add significant new monitoring, management, and crisis response demands.
Second, replication can lead to system problems that are very complex to resolve. If not managed properly, replication problems can have widespread impact and can lead to other technical issues such as database corruption. Finally, replication failures can be costly due to the time and effort required to recover the data and restore it, and the application downtime that clearly impacts the company’s bottom line and its business operations.
Unreliable Replication is More Risky Than No Replication at All
The key to successfully implementing a replication solution is to understand and eliminate as many risks as possible. Knowing your own organization, and its particular needs and technical resources, will help determine which of these risks is most likely to impact your company. The good news is that new solutions are now available that have been designed to eliminate the technical complexities and the risk of human error inherent in many replication solutions. This new breed of application availability solutions automates the failover process so that it is fast, easy, scalable, and not prone to manual error.
Most importantly, companies are now delivering replication solutions as “managed failover services” that provide not just automation, but remote control and management of the full failover process. These systems provide constant self-monitoring of the primary and secondary environments and the replication queues to provide a “green light” that your safety net is available, and that failover will succeed if initiated. Very importantly, this monitoring is done from a third-party location, away from your primary and secondary environment and off of your network so availability to the outside world can be accurately judged.
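The "green light" idea can be sketched as a single check that combines several readiness signals. This is an assumption-laden illustration rather than how any particular service works: the `ReadinessReport` fields and both thresholds are hypothetical, and in practice the values would be collected by a monitor running outside both environments.

```python
from dataclasses import dataclass


@dataclass
class ReadinessReport:
    """Hypothetical readings gathered by an external monitor."""
    secondary_reachable: bool
    replication_queue_depth: int  # items waiting to be applied to the secondary
    report_age_seconds: float     # how old these readings are


def green_light(report: ReadinessReport,
                max_queue_depth: int = 100,
                max_report_age_seconds: float = 300.0) -> tuple[bool, list[str]]:
    """Return (ready, problems). Ready only if no problems were found.

    Both thresholds are illustrative assumptions, not recommendations.
    """
    problems = []
    if not report.secondary_reachable:
        problems.append("secondary environment is unreachable")
    if report.replication_queue_depth > max_queue_depth:
        problems.append(
            f"replication queue backed up ({report.replication_queue_depth} items)")
    if report.report_age_seconds > max_report_age_seconds:
        problems.append("monitoring data is stale")
    return (not problems, problems)


ready, problems = green_light(
    ReadinessReport(secondary_reachable=True,
                    replication_queue_depth=2500,
                    report_age_seconds=30.0))
print("GREEN" if ready else "RED: " + "; ".join(problems))
```

Treating stale monitoring data as a failure matters: a green light computed from old readings is exactly the kind of untested assumption that hides a repurposed or out-of-date secondary.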
Finally, in evaluating managed failover services, it’s important to look for solutions that include 24x7 automated and live monitoring, and the ability to delegate the execution of the failover process at a moment’s notice if your staff is unable to reach the network to execute the failover. With a reliable service, a failover can be executed remotely with a simple phone call. With these new application availability solutions, a company can count on its secondary environment being ready to go in the event of an emergency – big or small.
A New Era in Replication Solutions
While replication solutions represent a powerful way to protect a company’s most critical data and applications, it is important to understand the potential for replication failure and the risks associated with these approaches. As new managed failover services come to market, we are seeing new ways to improve on the benefits of replication with automated processes that reduce risks, shorten the time to failover, and add a higher level of reliability. It’s good news for the enterprise: new application availability solutions that can help keep businesses driving smoothly – with less need to worry about potholes.
Paul D’Arcy is vice president of marketing for MessageOne, a leading provider of business continuity solutions. With more than 10 years of technology marketing experience, he holds an MBA from Harvard University and a bachelor’s degree from Wesleyan University. For more information visit www.messageone.com or call (512) 652-4500.