Step 1: Learning Normal Versus Abnormal System Behavior Over Time
One way to try to avoid costly downtime is to monitor your system’s behavior, but monitoring has its own pitfalls. If not set up properly, it can easily lead to misdiagnosis, frequent false alarms, and misguided corrective action, resulting in unplanned or extended periods of downtime.
For example, if the e-mail system receives an average of 10 messages per second over a working week, the administrator may set monitoring alerts on a fixed threshold of 12 messages per second. But every Monday at 9 a.m. the system may receive 15 messages per second because of “normal” enterprise-wide user behavior (i.e., many employees log in and check e-mail first thing). As a result, you would get an alert every Monday at 9 a.m. that actually reflects normal system behavior. Fairly quickly, the e-mail system administrator will come to regard this alert as “normal behavior,” ignore it, and as a consequence may miss legitimate alerts for real problems.
There is a clear need for predictive monitoring of e-mail-application-specific attributes that detects degradation early, helping operators proactively keep their primary system from going down in the first place.
One solution is a system that manages the disaster recovery (DR) process from a holistic viewpoint rather than as a point solution and collects information on system behavior. The system then performs time-series behavioral analysis and reporting. Anything required to deliver the application’s functionality to the end user is monitored, giving the IT manager a top-level, at-a-glance view of the overall health of the primary service to end users. Rather than using predetermined, fixed thresholds to generate alerts, a holistic system learns normal system behavior as a function of time of day and day of week, and alerts only when behavior departs from normal for that period of time.
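To make the contrast with a fixed threshold concrete, here is a minimal sketch of time-of-week baselining. It is an illustration only, not a description of any particular product: the class name, the tolerance of three standard deviations, the 5 percent floor on the spread, and the sample rates are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


class TimeOfWeekBaseline:
    """Learns a per-(weekday, hour) message-rate baseline (hypothetical sketch)."""

    def __init__(self, tolerance=3.0, min_samples=4):
        self.tolerance = tolerance      # assumed: allowed deviation in standard deviations
        self.min_samples = min_samples  # assumed: history needed before alerting
        self.history = defaultdict(list)

    def observe(self, weekday, hour, rate):
        """Record one observed rate (messages per second) for a time slot."""
        self.history[(weekday, hour)].append(rate)

    def is_abnormal(self, weekday, hour, rate):
        """Return True only if the rate departs from what is normal for this slot."""
        samples = self.history.get((weekday, hour), [])
        if len(samples) < self.min_samples:
            return False  # not enough history to judge; stay quiet
        mu = mean(samples)
        # Floor the spread at 5% of the mean so a very steady history
        # does not turn tiny fluctuations into alerts (assumed heuristic).
        spread = max(stdev(samples), 0.05 * mu)
        return abs(rate - mu) > self.tolerance * spread


baseline = TimeOfWeekBaseline()
# Four past Mondays (weekday 0) at 9 a.m.: the login rush is normal.
for past_rate in (14.5, 15.0, 15.5, 15.0):
    baseline.observe(0, 9, past_rate)
# Four past Tuesdays (weekday 1) at 2 p.m.: ordinary traffic.
for past_rate in (9.5, 10.0, 10.5, 10.0):
    baseline.observe(1, 14, past_rate)

FIXED_THRESHOLD = 12.0
print("Fixed threshold alerts on Monday 9 a.m.:", 15.0 > FIXED_THRESHOLD)
print("Baseline alerts on Monday 9 a.m.:", baseline.is_abnormal(0, 9, 15.0))
print("Baseline alerts on Tuesday 2 p.m.:", baseline.is_abnormal(1, 14, 15.0))
```

The same 15 messages per second is treated as routine on Monday morning and as a departure from normal on Tuesday afternoon, which is the behavior a fixed threshold cannot express.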
Step 2: Have a Secondary Computing Environment More Than 60 Miles Away
In the event of a facilities or site disaster, you still need to keep communicating to continue business operations. The continued operation of your e-mail system becomes even more critical. You need to be able to access your system and manage it from anywhere at any time. Key to this is the establishment of a secondary environment at a remote distance from your main IT center. A minimum of 60 miles is recommended; 100 miles or more is better, especially if you are located in a high-risk area such as a large city or an area of known severe weather (tornado alley, etc.).
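As a rough check on the separation between two candidate sites, the straight-line (great-circle) distance can be computed from their coordinates. The sketch below uses the standard haversine formula; the function names and the example coordinates are hypothetical, and straight-line distance is only an approximation of how independent two sites really are.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def site_separation_ok(primary, secondary, minimum_miles=60.0):
    """True if the two (lat, lon) sites are at least minimum_miles apart."""
    return great_circle_miles(*primary, *secondary) >= minimum_miles


# Hypothetical coordinates for illustration only.
primary_site = (40.0, -75.0)
secondary_site = (41.5, -75.0)
print(round(great_circle_miles(*primary_site, *secondary_site), 1), "miles apart")
print("Meets 60-mile minimum:", site_separation_ok(primary_site, secondary_site))
print("Meets 100-mile target:", site_separation_ok(primary_site, secondary_site, 100.0))
```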
The secondary site could be self-managed or provided by a third party, but in either case it should, at a minimum, provide customer-dedicated Web access, application servers, RAID storage, archive disk storage, and firewall protection. In addition, it should give you the ability to rapidly engage the secondary environment (ideally in under 15 minutes) at the push of a button from anywhere at any time.
A fully equipped and managed secondary site with rapid guaranteed failover will also allow the enterprise to leverage the secondary environment for planned systems maintenance and distributed high availability, obviating the need for late-night maintenance downtime, expensive and complex local clustering hardware and software, manually intensive recovery processes and procedures, hardware maintenance contracts, and custom control software development.
Step 3: Replicate Your Data Off-Site in Real-Time to Avoid Data Loss
The key to having an effective secondary environment is to make sure that the data stored on it is never out of date. It is therefore vital that there is synchronization between your primary system and the secondary.
Real-time replication is faster and more efficient than relying on traditional tape back-ups. In the case of a failure at the primary site, the recovery process is immediate because the data is available in a live “hot” environment and can be accessed at the push of a button. With a holistic system design as described earlier, synchronization can run in either direction between primary and secondary sites, ensuring that any changes made while operating from the secondary site are reflected back on the primary when it is restored.
Step 4: Archive Your Data to Recover From Corruption/Deletion
The replicated data on the secondary can also be used to create traditional back-ups that support any existing archiving procedures without disrupting work patterns at the primary production site.
For instance, you can take snapshot copies of your data from the secondary environment, transport the copies to online disk storage as well as offsite to another facility, and make those archives available to the operator to restore corrupted or lost data in the event of unintentional deletion or information store corruption. Snapshots of the information store taken 24, 48, and 72 hours ago should be kept on disk, eliminating tape back-up and lengthy restorations from the recovery equation.
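One way to read the 24-, 48-, and 72-hour rule is as a retention policy: for each target age, keep the newest snapshot that is at least that old. The sketch below shows that selection under this assumed interpretation; the function name and the 12-hour snapshot schedule are hypothetical.

```python
from datetime import datetime, timedelta

# Target ages taken from the text; the selection rule itself is an assumption.
RETENTION_AGES = (timedelta(hours=24), timedelta(hours=48), timedelta(hours=72))


def snapshots_to_keep(snapshot_times, now, ages=RETENTION_AGES):
    """For each target age, keep the newest snapshot that is at least that old."""
    keep = set()
    for age in ages:
        cutoff = now - age
        candidates = [t for t in snapshot_times if t <= cutoff]
        if candidates:
            keep.add(max(candidates))
    return sorted(keep)


# Hypothetical schedule: a snapshot every 12 hours over the last five days.
now = datetime(2024, 1, 10, 12, 0)
available = [now - timedelta(hours=12 * k) for k in range(11)]
for kept in snapshots_to_keep(available, now):
    print("keep on disk:", kept.isoformat())
```

Snapshots not selected remain candidates for offsite archiving or deletion, depending on the existing archiving procedures.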
Step 5: Don’t Let Change Affect Recoverability
The biggest impediment to successful recoveries is that changes are introduced on the primary production servers without the operational discipline or management tools needed to ensure that the same changes are reflected in the secondary recovery environment. The result is typically a failed or extended recovery effort when, in the middle of a real disaster, it is discovered that additional changes, patches, or service packs must be applied to the recovery environment.
This common problem can be solved by incorporating an automatic environmental change detection and notification process. When a patch or environment change is applied to the primary environment, the change is detected and the operator is alerted to the fact that a change has occurred, which may impact recoverability to the secondary environment.
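The detection step can be pictured as a comparison of two environment manifests. The following is a minimal sketch, assuming each environment can report its components and versions as a simple mapping; the function name, component names, and version strings are hypothetical.

```python
def detect_drift(primary, secondary):
    """Return {component: (primary_version, secondary_version)} for every mismatch.

    A version of None means the component is missing from that environment.
    """
    drift = {}
    for name in sorted(set(primary) | set(secondary)):
        p_version = primary.get(name)
        s_version = secondary.get(name)
        if p_version != s_version:
            drift[name] = (p_version, s_version)
    return drift


# Hypothetical manifests for illustration only.
primary_env = {"os_service_pack": "SP2", "mail_server_patch": "1047", "antivirus_engine": "5.1"}
secondary_env = {"os_service_pack": "SP2", "mail_server_patch": "1032"}

for component, (on_primary, on_secondary) in detect_drift(primary_env, secondary_env).items():
    print(f"ALERT: {component} differs (primary={on_primary}, secondary={on_secondary})")
```

Running a comparison like this whenever a patch is applied to the primary gives the operator a list of items to reconcile before a failover is ever needed.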
Step 6: Defend Against Viruses
Viruses in e-mail attachments can easily overload an e-mail system and bring it down. Anti-virus software should be deployed to your front-end servers at both the primary and secondary environments and include continuous virus signature file updates. The anti-virus software should also filter e-mail attachments inbound to your e-mail server, eliminating the threat before the virus hits an end user’s inbox.
Step 7: Defend Against SPAM Capacity Consumption
E-mail SPAM can be very costly to any organization. From a systems point of view it can consume undue CPU, memory and storage capacity on your e-mail system. When you add in the time that employees spend deleting SPAM mail and cleaning up mailboxes, the cost to the bottom line can quickly run to thousands of dollars, even for companies with as few as 100 e-mail accounts. For larger corporations the cost can be in the hundreds of thousands.
As part of any solution, anti-SPAM features that eliminate SPAM before it hits the e-mail information store and takes up resources should be deployed to both your primary and secondary servers.
Step 8: Don’t Share: Ensure You Have Dedicated Hardware and Storage for Your Recovery
Recent events have shown the flaws in traditional disaster recovery service provider offerings. There is no real guarantee that you will get access to the recovery servers when you need them most because they are shared, not dedicated to you, and they are sold many times over to other companies in your region. If a regional disaster hits, you will be competing with all those other companies in the region for limited resources.
Ideally, select a service that provides a set of dedicated servers, storage, firewall, and software specifically for your company’s data protection, high availability, and disaster recovery. No sharing – yours alone, guaranteed.
Step 9: Think of the Service from the End User’s Perspective
With any e-mail system, it’s end-user availability that ultimately counts. The end user (e.g., your CEO) doesn’t care whether the system was down because of planned maintenance, server failure, network failure, or a facilities problem. The CEO just knows that he can’t get to his e-mail system to communicate with his direct reports, investors, or customers. Develop a holistic approach toward e-mail availability, as seen by the end user, and you can then defend against any and all potential causes of system impairment or downtime.
Step 10: Manage It All as One Distributed, Integrated System
The best way to bring together all the ideas and concepts discussed in the previous nine steps is under the concept of a system that manages the whole disaster recovery process from a holistic point of view. This gives the system administrator the ability to monitor and manage both primary and secondary environments as one integrated system, all from a single Web-based console. This may sound like an ideal scenario not matched by reality. However, recent converging trends in high availability/disaster recovery technology and changes in methodology make it possible.
By focusing on mission-critical applications that require very fast recovery time objectives (RTO), a holistic approach can not only protect your e-mail but also give a positive return on your investment.
Alan J. Porter, MISTC, is the product manager for Evergreen Assurance Inc., providers of real-time disaster recovery for mission-critical systems. Feel free to contact him at email@example.com.