- Published on Tuesday, 30 October 2007 10:27
This article is intended to depict the various stages of 'Disaster Recovery' activity that I felt necessary for us, at AMD, to follow. I will also cover some of the misconceptions, false senses of security, vulnerabilities, etc., involved with this subject.
We began this project by evaluating four different companies and selecting SunGard as our professional 'Disaster Recovery' company. I believe these types of companies cannot do the work for you, but they are of major importance in organizing and conducting interviews, documenting, training employees, and helping to develop the test and monitoring portion of the plan.
In other words, if I don't specifically mention them throughout this paper, I want it clear that they were an integral part of the process throughout.
'Disaster Recovery Planning' can be such a major, intimidating, costly, and 'doomsdayish' type of activity that no one wants to be faced with the task. The attitude is that we probably won't have a disaster, and if we do, it's going to be a long time off.
Another paradigm is that, to qualify as a disaster, the whole area will be destroyed and there will be no manufacturing area, so who cares about the computer systems and information.
Disasters fall into two major categories, 'local' and 'regional.' Simply put, a local disaster is one that affects one building/location, whereas a regional disaster can affect blocks, miles or counties.
The plan must accommodate the 'Worst Case' scenario which in the interactive world of Semiconductor manufacturing implies that 1) a secondary computer site is available, and 2) real time communications, whether it be 'LAN' or 'Wide Area' must be intact to the selected alternate computer in the event of any disaster.
A common disaster could consist of:
1) a mistaken or erroneous situation causing the fire sprinklers to go off over the computers.
2) a small fire caused by an electrical short
3) Lightning striking
Disasters do not have to be 100-year floods or eight-point earthquakes. All it takes to be a disaster is something that could mess up approximately 1,500 square feet of very important real estate.
The Data Center
During the BIA (Business Impact Analysis) process early in any disaster plan, once all aspects of financial and intangible losses are considered, the justification becomes overwhelming no matter what eventual price tag is placed on the plan selected. Keeping this in mind, 'Disaster Recovery Planning' should not be something that we decide whether to do; it should be driven down from management, demanding that this activity begin immediately, even if it takes additional personnel and funding to accomplish. 'Disaster Recovery Plans' are not something that is done and placed on a shelf until the BIG ONE hits.
They are living and breathing contingency plans that represent approaches to recover from all levels of failure.
These range from a user requiring a file restored from backup because he inadvertently deleted it, to a major act of God like an earthquake or flood, to a stupid human error like a wiring short or a broken sprinkler. All of these need to be anticipated in a good 'Disaster Recovery Plan'. It also must be designed to react to the ever-changing application and hardware approaches.
The California CIM configuration addressed in this Disaster Plan is a centralized 24-hour, seven-day week VAX environment supporting approximately 1,200 concurrent users exercising a large 'Shop Floor Control' package. On a weekly schedule (Friday-Sunday), all of the 160 gigabytes of disks are backed up to tape, and on the following Wednesday the tapes are relocated to an offsite vault where they are retained per departmental retention schedule pending future requirements to recover lost data. Copies of these tapes are also retained on site in the computer center. Incremental backups are taken daily (Sunday-Thursday), but those tapes remain in-house and are therefore vulnerable to any situation that damages the local tape repository.
Prior to implementation of the subject Disaster Recovery Plan, offsite backup tape storage was the definition of a 'Disaster Recovery Plan'. In many cases, this may be adequate, but realistically it implies that, at AMD CIM, for example, we could have been as much as 10 days out of sync with the data recovered. The number of transactions necessary to recreate those 10 days of activity would be in the hundreds of thousands. This would constitute a horrendous manual effort, taking hours and/or days of precious time, resulting in a staggering impact on company revenues and ultimately customer satisfaction.
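The 10-day figure can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes the weekly full backup completes on Sunday, reaches the vault on the following Wednesday, and that the disaster destroys every tape still on site. The function and constant names are hypothetical.

```python
def worst_case_offsite_age_days(backup_interval_days: int, shipping_lag_days: int) -> int:
    """Age of the newest offsite backup at the moment just before the
    next set of tapes arrives at the vault."""
    return backup_interval_days + shipping_lag_days


# Assumptions taken from the schedule described above:
FULL_BACKUP_INTERVAL_DAYS = 7   # one full backup per week
SUNDAY_TO_WEDNESDAY_DAYS = 3    # backup ends Sunday, tapes leave Wednesday

exposure = worst_case_offsite_age_days(
    FULL_BACKUP_INTERVAL_DAYS, SUNDAY_TO_WEDNESDAY_DAYS
)
print(f"Worst-case age of offsite data: {exposure} days")
```

Shortening either term, for example by shipping tapes daily, directly shrinks the window of transactions that would have to be re-entered by hand.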
We, at AMD, subscribe to a Digital Equipment Corporation service referred to as 'Recoverall' which guarantees priority replacement of any damaged piece of hardware. Even with this service, replacement would take a minimum of one week. This equipment replacement solution does not solve the problem of where to put the replacements if the Data Center is extensively damaged or how to achieve adequate electrical power or network connectivity.
There are many approaches intended to respond to equipment/data center replacement. Some supply large vans with Air Conditioning and Power Distribution Units, others are passive duplicate computer centers on some other company's premises (preferably in another geographic area). In the case where vans are brought to your parking lot, power and network become the major stumbling blocks for reinstatement of the Data Center, either of which would probably take longer to replace than the Computer hardware itself.
The offsite Standby Computer Center is difficult to sustain. Maintaining enough network bandwidth to keep the crucial files at the offsite center current enough to be useful in case of failure would require an extremely large expense for a capability that we hope would never be used. The other network consideration is that many regional disasters damage long-line connectivity, which would render the Disaster Recovery approach useless.
The chosen approach, with its limitations and vulnerabilities, has to be totally understood by all organizations involved. The user management must be aware of the potential recovery delays and the amount of work required on their part for recovery. The CIM management must be aware of the overall best and worst case response time through all phases of recovery.
This obviously implies that full functionality recovery will not be immediate and priorities must be established in advance as to what capabilities must be restored and in what order, i.e. engineering analysis surely would fall behind reinstatement of Work in Process scheduling activity (get the Factory running).
We have to assume that the goal of any disaster recovery plan would be complete, immediate recoverability. Our previous (offsite tape storage) approach could take days or weeks to reestablish the databases alone. As stated before, a BIA (Business Impact Analysis) must be performed to determine the threshold of pain acceptable, versus the amount of resources (money and effort) to invest in 'Disaster Recovery.' This has to be driven from the 'User' organization because they're the only ones that can truly realize the total cost and value of losses incurred by unexpected downtime affecting the manufacturing process. Acceptable compromises will result from the above analysis, indicating the worst impact (4 hours maximum downtime at AMD) that the manufacturing environment could endure.
That answer will direct the 'Disaster Planning' activity to a limited number of alternatives that will satisfy the requirement.
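As a minimal sketch of how restoration priorities and the downtime threshold interact, the example below orders capabilities by priority and checks when each one would be back against the 4-hour limit quoted above. The ordering of work-in-process scheduling ahead of engineering analysis comes from the text; the restore times, the third entry, and all names are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass

MAX_DOWNTIME_HOURS = 4.0  # worst acceptable impact quoted in the text


@dataclass
class Capability:
    name: str
    priority: int         # 1 = restore first (illustrative ranking)
    restore_hours: float  # hypothetical estimate, not a measured value


def restore_schedule(capabilities):
    """Return (name, cumulative_hours, within_budget) in priority order."""
    schedule = []
    elapsed = 0.0
    for cap in sorted(capabilities, key=lambda c: c.priority):
        elapsed += cap.restore_hours
        schedule.append((cap.name, elapsed, elapsed <= MAX_DOWNTIME_HOURS))
    return schedule


capabilities = [
    Capability("engineering analysis", priority=2, restore_hours=6.0),
    Capability("work-in-process scheduling", priority=1, restore_hours=1.5),
    Capability("reporting", priority=3, restore_hours=2.0),
]

schedule = restore_schedule(capabilities)
for name, done_at, ok in schedule:
    print(f"{name}: restored after {done_at:.1f} h, within limit: {ok}")
```

A table like this, filled in with figures agreed by user management, makes it plain which capabilities fit inside the threshold and which will knowingly be restored later.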
Although there are some commonalities, the solutions within our company are different, for several reasons, between the California and Texas CIM organizations. This paper covers only the California Disaster Recovery solution.
A Given: The worst position a CIM support organization could be in is for the FAB to be physically capable of operating while the CIM computers are disabled, preventing manufacturing productivity.
If the FAB is physically damaged by a disaster, immediate CIM computer availability is of little consequence.
This philosophy mandated the decision to physically place the backup computer environment in a sub area immediately beneath the manufacturing floor (FAB). The theory is that if the FAB is unharmed, there is a good chance the Disaster Recovery computer is o.k.
Several variables come into play in deciding the proper Disaster Recovery solution for your specific site. As mentioned before, these include costs, network capabilities, the equipment involved, and others. In a VAX world, another consideration is whether your environment is a single cluster or multiple clusters.
In California we are running a single production cluster, which gives some latitude in product selection. Consequently we were able to select what we consider to be the Cadillac of Disaster Recovery Systems. Because of our environment, the Business Recovery System (B.R.S.) Package from Digital Equipment Corp. is a perfect fit. This package allows us to physically have the Cluster split between two locations (Data Center and Sub Fab) with FDDI replacing the Computer Interconnect logic connection.
The VAXCluster console is replaced with two Operation Management Servers (OMS), one in each location. Either of the OMS's can control and give visibility to either location or the entire cluster. The major benefit of this dual location configuration, as you may have guessed by now, is the ability to 'Shadow' (mirror for you UNIX people) all critical disks in both locations thus allowing the secondary location to be the primary user system whenever an incident is experienced with the main computer center or network. This takes place without the loss of one transaction.
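The principle behind shadowing can be shown with a toy model. This is not how Digital's product is implemented; it is a sketch, with hypothetical class and site names, of why a write that lands on both members before it is considered complete leaves the surviving site with every transaction.

```python
class ShadowSet:
    """Toy model of a disk mirrored across two sites."""

    def __init__(self, sites):
        # One independent copy of the data per site.
        self.members = {site: [] for site in sites}

    def write(self, record):
        # The write is complete only after every available member holds it.
        for log in self.members.values():
            log.append(record)

    def fail_site(self, site):
        # Simulate losing one location; the other member keeps serving.
        del self.members[site]

    def read_all(self):
        # Any surviving member holds the full history.
        if not self.members:
            raise RuntimeError("no surviving shadow set member")
        return list(next(iter(self.members.values())))


disks = ShadowSet(["data_center", "sub_fab"])
for txn in ["lot 1 move", "lot 2 move", "lot 3 hold"]:
    disks.write(txn)

disks.fail_site("data_center")
survivor_view = disks.read_all()
print(survivor_view)
```

The cost of this design is that every write must travel to both locations, which is why the interconnect between the two sites matters so much.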
Another major benefit, differing from the normal Disaster Recovery environment, is that the backup (sub fab) system can share as a partner in day-to-day production responsibilities.
We elected to assign reporting activity to it because that would be easily suspended in the event of a disaster. The only downside is a limitation to the number of concurrent users in a disaster mode (±100). This is strictly a capacity issue (3 computers replaced by 1) arrived at by user management and could be altered simply by adding computer horsepower.
There are additional opportunities that appear to be logical extensions while planning and implementing a 'Disaster Recovery Approach':
- Near lights out computer room with robotic tape handling seems to be a logical activity to pursue at this time because on-site vs. off-site storage of tape is a fundamental requirement and the more this can be streamlined, the more current offsite information will be, i.e. daily vs. weekly.
- Implementing a tape silo (jukebox) has several benefits which reduce operator interaction, thereby positioning us in line for a 'near' lights out environment. This itself has great merit, but in this paper I want to address the use of silos in a Disaster Recovery role. I have not yet implemented this at AMD, but my vision is that silos will be located in each location and all key information will be duplicated in both locations. As long as the two sites are in the same proximity, offsite backup retention will continue to be a requirement but would be rarely utilized because of the low probability of both sites being destroyed in a single incident.
- If your Disaster Recovery solution, however, included a geographical separation between your primary and secondary computer centers, offsite storage would not be a requirement; thus a full 'Lights Out' Data Center environment could be achieved. Keep in mind that each site is unique and your Disaster Recovery approach must be selected to complement that individuality. For example, the data storage and recovery approach of a centralized computer environment, such as ours, differs drastically from that of a distributed environment (client-server), but the bottom line requirement is the same (complete recoverability in an acceptable period of time).
As we evolve from a centralized, monolithic environment to a 'Client-Server' environment, all of the 'Disaster Recovery Planning' must be reevaluated. The goal of this paper is to convince its readers that Disaster Recovery Planning is one of the most important activities that we should be addressing at this time, and for the foreseeable future. I believe that we not only have to be concerned about earthquakes, fires, floods and storms; the corporate world is also becoming more and more a target of terrorists, disgruntled employees and other two-legged dangers, causing disabling disasters to become more and more prevalent.
To achieve 'WORLD CLASS' FAB Status, you must achieve 'WORLD CLASS' Disaster Recovery capability and I am confident we, at AMD CIM, have done that.
I must thank AMD's senior management for recognizing the need for a CIM Disaster Recovery capability, supporting me and my organization in the design and implementation of this plan, and opening the pocketbooks allowing us to implement a quality product.
Dan S. Perry has managed different Computer Systems support organizations for more than twenty years. Currently he is the Department Manager over the California Systems and Operations organization within Information Technology Management (ITM).