For many businesses, from airline reservation centers to stockbrokers, the telephone is their lifeline. But businesses are quickly discovering just how fragile that lifeline can be.
The dependence of the nation’s businesses on reliable telecommunications is tremendous. In New York City alone, with its myriad of financial markets and corporate headquarters, an estimated one trillion dollars in daily financial transactions depends on telecommunications networks. Sixty-three percent of New York’s non-agricultural jobs are in telecommunications-intensive industries.
But three times in the last two years, New York has experienced the loss of the greater part of its telephone service. And other metropolitan areas have suffered similar outages.
On September 17, 1991, a broken rectifier in a Manhattan switching center shut down New York City’s AT&T long distance service for six hours. The outage occurred late in the afternoon, meaning most businesses were unaffected. But it cut off contact between the region’s several air traffic control centers. The result was lengthy delays at New York’s three major airports. The resulting backups delayed air traffic throughout the country, disrupting travel for thousands of passengers.
Another recent phone outage was national in scope. On January 15, 1990, AT&T lost most of its long-distance service nationwide for eight hours. Businesses ranging from telemarketers to stock brokers to travel reservation centers lost hundreds of millions of dollars in business. Only 40 percent of long distance calls placed that day went through.
In 1991, telephone outages affected major metropolitan areas in the U.S. on 11 different occasions. (See figure 1). The outages lasted anywhere from fifteen minutes to eight hours. Customers were either unable to complete calls within their local dialing zone, or were unable to make long-distance calls. The effect on businesses in the affected areas was devastating.
Companies lost astonishing amounts of business in very short periods of time. A stock broker in Washington, D.C. estimates his office alone lost $25,000 in business when a computer switching problem knocked out phone service to the District of Columbia, Baltimore and parts of Maryland, Virginia and Delaware for six hours in June. In the nationwide AT&T disruption in January 1990, American Airlines’ reservation center in Tulsa, Oklahoma lost an estimated 200,000 calls.
You probably have heard the story told by the old poem, but let me retell it as accurately as I can recall it. It goes like this:
For the want of a nail, a shoe was lost.
For the want of a shoe, the horse was lost.
For the want of a horse, the rider was lost.
For the want of a rider, the battle was lost.
For the want of a battle, the kingdom was lost.
The story describes how a seemingly inconsequential detail can lead to a disaster.
Consolidating responsibility to save money, laying off technical and management staff during cutbacks, not taking enough time to train people in their new positions and omitting routine maintenance and testing can lead, and have led, to disastrous outcomes.
We don’t know what really happened with the horseshoe, but we have seen the result. Who was to blame? The generals? The blacksmith? Did the rider knowingly take a poorly equipped steed into battle? We just don’t know.
Are we doomed to repeat the mistakes of the past? If history doesn’t repeat itself, circumstances with a propensity towards disaster certainly do!
On Tuesday, September 17, 1991 another “shoe was lost for want of a nail,” and a set of circumstances was created with a risk exposure of frightening proportions. It’s another case of how missing “nails” could have contributed to the incredible loss in service.
Actually, there were at least three nails improperly installed in this particular “shoe.” The first was the apparent absence of proper maintenance and testing of the backup power system. Then there was a bulb, yes, a bulb, in the visual alarm system that had not been replaced when it burned out. And then there was the audio alarm, which reportedly “malfunctioned,” whatever that means. These “nails” contributed to the system failure.
In this case, circular blame will be generously spread, and the accepted truth of what happened will be whatever account is repeated the most times.
What we do know is there is a trend across the nation to cut costs, increase productivity, decrease personnel and in general do-more-with-less in telecommunications. Budgets are being cut, technical and managerial positions are being eliminated, reporting relationships are being changed and responsibilities are being reassigned.
All this is being done while we are increasing our reliance on telecommunications for virtually all our business, government and personal functions.
Our increasing dependence on telecommunications and our increasing reliance on telecommunications based services will result in additional and devastating disasters. Undoubtedly some of these events will be due to missing nails in the shoes. So what can we do to minimize system failures?
Management must effectively integrate telecommunications into the disaster recovery process. Telecommunications must have the same reporting level as facilities management, security and data processing. All telecommunications should report to one responsible person: telephones, data communications, local area networks, hard-wired data cables, intra- and inter-building cables and communication paths, remote location and long distance networks, and all computer systems that support telecommunications.
Management must insist on the development of a strategic plan for disaster recovery. This plan should contain input from all parts of the organization and should have the objective of mitigating damage in a disaster. The plan should appear on the agenda of the top management meeting at least quarterly.
Finally, management must review the “nails” regularly. Are the visual and audible alarms working properly? Do the rectifiers work? When were the backup power systems tested, and how many hours were they run? Five or six hours, or only ten minutes? How many minor failures were reported, by whom and when? Are there any patterns?
What is telecommunications staffing based on? Is it based on budget-cut objectives set by an inexperienced manager under pressure, or is there some rationale behind it? How about using standards such as the number of ports, miles of cable, number of locations, distance between sites, number of additions, modifications and deletions of terminal equipment, the degree of system management computerization, the relative difficulty of managing different systems, the number of shifts worked at a site and the experience base of the staff?
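The metric-driven approach above can be sketched as a simple workload model. The weights below are hypothetical assumptions chosen purely for illustration; any real staffing standard would calibrate them against the organization's own history.

```python
# Illustrative sketch: deriving telecom staffing from workload metrics
# rather than from budget targets alone. Every ratio below is an assumed
# placeholder, not an industry standard.

def staffing_estimate(ports, cable_miles, locations,
                      moves_adds_changes_per_month, shifts,
                      experience_factor=1.0):
    """Rough headcount estimate from the workload drivers named in the text.

    experience_factor < 1.0 models a seasoned staff needing fewer people;
    > 1.0 models an inexperienced staff needing more.
    """
    workload = (
        ports / 400                           # technicians per port base (assumed)
        + cable_miles / 50                    # cable plant upkeep (assumed)
        + locations / 10                      # site visits and coordination (assumed)
        + moves_adds_changes_per_month / 100  # adds/modifications/deletions (assumed)
    )
    return round(workload * shifts * experience_factor, 1)

# A mid-sized site: 2,000 ports, 25 cable miles, 8 locations,
# 150 terminal changes a month, staffed across two shifts.
print(staffing_estimate(2000, 25, 8, 150, shifts=2))
```

The point of the sketch is not the particular ratios but that the inputs are measurable, so a staffing decision can be defended with something other than a budget number.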
Let’s not wait for an occasion to place blame. Instead let’s plan effectively, using every glitch in a system or discovered loose nail as an opportunity to learn and plan better. Let’s use the recent event as the impetus for reexamining our policies, procedures, job descriptions, staffing guidelines, organization charts and systems. Let our objective be to ensure the “nails” are able to support the shoes, riders and battles.
Benjamin W. Tartaglia, MBA, CSP, is President of BWT Associates, Independent Consultants to Management. The firm specializes in loss prevention, mitigation and disaster recovery relative to telecommunications.
This article adapted from Vol. 4 #4.
Ever notice how celebrities frequently pass away in threes? If we agree on that statement, then we can say that AT&T’s network has indeed achieved star status. It recently died for the third time in just under two years. Amazing how such a vast, powerful, diversified network can be brought to its knees so easily. Perhaps that is what happens when one is so big and powerful—it becomes more difficult to manage the beast.
AT&T is certainly not alone in network crashes. Bell Atlantic and Pacific Bell had major failures a few months back. Millions of users lost service for several hours. Illinois Bell’s legendary Hinsdale central office fire blew the notion of unsinkable central offices sky-high. New York Telephone has had its share of central office fires and power outages in the past ten years.
What all these events now state—in absolute terms—is that today’s networks are failure-prone. Maybe not everyone's; some will always manage to stay one step ahead of the grim network reaper. But can we be assured these networks will provide truly uninterrupted service? (By the way, carrier tariffs don’t guarantee “uninterrupted” service, but usually “universal” service.)
This author casts a strong vote for tin cans and string. Telecommunications professionals who have been in the industry for at least 10-15 years will remember that “the network” was built on far less sophisticated equipment. Today’s digital networks, with their powerful switches and common channel signaling networks, handle massive amounts of voice and data traffic.
AT&T’s most recent outage could have been prevented. It was largely a failure to follow established company procedures. Human flaws. So it’s not just technology we have to fear. Rather, it’s just as Pogo so astutely observed. He said, “We have met the enemy, and he is us.”
But this is not intended to be an AT&T-bashing session, although New York air traffic controllers might wish otherwise. AT&T’s network is the biggest and most complex in the world. It has extensive recovery and self-healing properties. To get an idea just how vast AT&T's network is, visit the carrier’s National Network Operations Center (NNOC) located in Bedminster, New Jersey. The company really knows how to run big networks.
But how many more network outages will users—of this carrier and all others—be forced to endure? Advanced technology brings with it both benefits and curses. Unfortunately, we quickly forget all the benefits when a weakness appears.
Here are some very significant concerns. Public switched networks in the U.S. are generally managed by common channel signaling (CCS) networks. Both local and long distance networks use CCS technology. However, most local and long distance CCS networks do not currently interconnect. That’s going to change over the next few years, and service quality is definitely going to improve by interconnecting CCS networks. However, CCS network interconnection also implies that signaling network failures like those at AT&T, Bell Atlantic and Pacific Bell could spread from local to long distance networks, creating outages of massive proportions. Are we ready as a nation to deal with that?
Both telephone companies and long distance carriers maintain that their networks have safeguards in place to deal with these and other scenarios. But can we be totally certain they will work? If this is indeed the case, telecom managers will need network contingency plans more than ever. We must ask ourselves: is technology bringing progress, or do we now have loaded guns at our heads?
Carrier “service assurance” programs are evidence of the growing concern over network survivability. Bell operating companies, major interexchange carriers and a growing number of independent telcos now offer customer service assurance programs in addition to their existing network protection programs. These are major ongoing activities.
An example is Southwestern Bell’s MegaLink III service, a non-switched, dedicated point-to-point digital service for simultaneous two-way signal transmission at 1.544 Mbps. The tariff guarantees that if MegaLink III service fails due to telco-provided equipment or facilities, and the telco is unable to restore service within four hours after the outage is reported, the customer receives a full month’s free MegaLink service.
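The credit rule as summarized above reduces to a simple conditional, sketched here. The function and field names are illustrative only and do not come from any actual tariff document.

```python
# Sketch of the service-assurance credit rule described in the text:
# a full month's free service if a telco-caused failure is not restored
# within four hours of being reported. Names are hypothetical.

from datetime import datetime, timedelta

def credit_due(reported_at, restored_at, telco_at_fault):
    """Return True if the customer is owed a month's free service."""
    if not telco_at_fault:
        return False
    return restored_at - reported_at > timedelta(hours=4)

# An outage reported at 11:00 a.m. and restored at 4:50 p.m. runs
# about five hours fifty minutes, so the credit applies.
print(credit_due(datetime(1991, 9, 17, 11, 0),
                 datetime(1991, 9, 17, 16, 50),
                 telco_at_fault=True))
```

Note that both conditions must hold: a customer-caused failure, or a telco failure restored inside the four-hour window, earns no credit.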
The Federal Communications Commission (FCC) has called for meetings with Bell Atlantic and Pacific Bell to determine what it can do to prevent these events from recurring. It all confirms a growing fear among many in the U.S. telecom community: that America’s public switched networks are increasingly vulnerable.
This latest event will force us all to face the reality of the '90s. We must assume that carriers’ networks will fail. Not “if,” but “when.” That’s a threatening state of affairs. You no longer have a choice but to be prepared. Ask yourself, “Can I survive the next time?”
Paul F. Kirvan is a principal in Paul F. Kirvan & Associates, an international telecommunications consulting firm.
This article adapted from Vol. 4 #4.
On Tuesday, September 17, a broken rectifier in an AT&T switching station generator in Manhattan shut down New York City’s long-distance phone service for six hours.
Because the failure took place at 4:50 p.m., most businesses, including the stock markets, were relatively unaffected. Air traffic in and out of New York’s three airports, however, was severely delayed as airports lost contact with regional air traffic control centers. Many other national and international airports were affected.
The failure occurred after AT&T switched over to its own back-up diesel generators to power the switching station. For several years, AT&T has had an agreement with Con Edison, the New York Power utility, for AT&T to generate its own power when Con Ed is facing greater than normal demand for electricity.
The failed rectifier in the company’s backup generator caused the station to automatically begin drawing power from its emergency batteries, which can supply about six hours of power. The switchover occurred at about 11 a.m., but no one noticed that the generators had failed until after 4 p.m. The broken rectifier also made it impossible to switch back to city power when the failure was discovered.
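The timeline above is simple arithmetic: a six-hour battery reserve carrying the load from roughly 11 a.m. leaves almost no margin by the time anyone looked after 4 p.m. The times below are approximations taken from the account in the text.

```python
# Battery-reserve arithmetic for the September 17, 1991 outage as
# described above. Times are approximate, from the article's account.

from datetime import datetime, timedelta

switchover = datetime(1991, 9, 17, 11, 0)   # station begins running on batteries
reserve = timedelta(hours=6)                # rated battery reserve
exhausted = switchover + reserve            # when the batteries give out

discovered = datetime(1991, 9, 17, 16, 0)   # problem noticed "after 4 p.m."
margin = exhausted - discovered             # time left to act once discovered

print(exhausted.strftime("%I:%M %p"))       # batteries run out close to 5 p.m.
print(margin)                               # under an hour of battery remaining
```

With the broken rectifier also blocking a switch back to utility power, that sub-hour margin explains why the service loss could not be averted.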
The fact that no one noticed the station was running on batteries “resulted from a combination of highly unusual circumstances,” said Kenneth L. Garrett, AT&T senior vice president-network services. Audible and visual alarms should have told workers that the station was operating on its own power, but those alarms were disabled.
Garrett said a supervisor should have assigned responsibility to physically inspect the building’s power plants during the conversion to emergency diesel generator power. The supervisor and three technicians had left the building for training in connection with a new alarm system.
The shutdown was the third major failure of AT&T long-distance service in New York City in less than two years.
Stuart Johnson was an editor with Disaster Recovery Journal.
This article adapted from Vol. 4 #4.
Channel extension is a relatively new technology that can be used to solve some age-old data processing problems. Ideally, a computer environment should have centralized control and operation of the computers with decentralized access for the input and output operations. First I will describe the more common ways that this is accomplished; next I will explain some of the dynamics of channel extension; and finally, I will describe some network designs using channel extension. Because channel extenders have been developed mainly for the IBM mainframe environment, the explanations and examples will be from an IBM viewpoint.
How do most large, geographically diverse computer networks solve their data input and output problems? The most straightforward solution is to have all input and output devices in a local configuration. Due to the electrical limitations in supporting high speed devices such as disks, tapes, and high speed printers, all mainframe channels are limited to hundreds of feet of cable. Therefore, if a location needs both data entry and hardcopy output, then a computer is usually installed. Small mainframes and large minicomputers are used to solve this problem for sites that need a number of CRTs and printers without a lot of processing.
Another method that can be used to solve remote data entry and output is the use of data communications. This permits users to be located at geographically remote sites. The main limitation of standard data communications is the speed at which input and output data can be transmitted across communications lines. Remote Job Entry (RJE) terminals are designed to run impact printers at communications speeds of 9600-56000 bps. Two of the most popular communications protocols are Bisync and SNA, both of which require special communications software in the mainframe. The physical connection is made through a communications front end processor. Special software is needed in both the mainframe and the front end processor, increasing the expense and complexity of the environment.
Pacific Gas and Electric Company (PG&E) understands the impact that being so close to a major fault line can have.
With terrifying lessons learned from the historic earthquake of 1989, PG&E is implementing a sophisticated disaster recovery plan based on mirrored data centers. By combining sophisticated data communications networks with automated tape vaults, PG&E’s approach to mirroring allows its two primary data centers to back up each other’s critical applications.
In the event of a disaster, either data center can restore its critical applications and begin processing on remarkably short notice.
It seems like only yesterday. October 17, 1989. The infamous ’89 earthquake hit the San Francisco Bay area claiming 62 lives and causing damage estimated at as much as $10 billion.
At 77 Beale Street, the Pacific Gas and Electric Company (PG&E) data center was in the process of changing to the evening operations shift. Fortunately, many employees had left work early to catch the World Series game between two bay-area teams: the San Francisco Giants and the Oakland Athletics.
In the aftermath of the earthquake, PG&E would face a utility company’s nightmare of repairing both their natural gas and electrical distribution systems while in the midst of a disaster situation. PG&E president George Maneatis would comment, “The earthquake was the worst crisis this company has faced in my 36 years here.”
Have you ever considered how a disaster might affect your communications? Are your communications sufficiently recoverable, or are you like most companies? An alarming number of corporations lack functional communications backup, even though they have a plan and some even have a backup recovery site. The industry norm is to use dial backup to recover from leased-line failure. This sometimes works, but it is not very reliable. The following paragraphs detail the services available as alternatives to dial backup, and how to recover using them.
AT&T ACCUNET Digital Services
Today, the use of high-performance communications systems to link computers in intra- and interorganizational corporate configurations is delivering on the promise inherent in information systems technology.
Advanced communications capabilities integrated with information systems are producing new opportunities for business...to enhance the value of goods and services...to take advantage of expanded markets...to reduce overall operating costs...in all, to gain a competitive advantage.
The devastation of people’s lives and businesses by disasters such as the Kobe earthquake and the Internet break-in has become frequent front-page news. As recently as 10 years ago, corporations gave little time, thought or money to disaster recovery planning. Now most business people are at least aware of the need. The time, thought and money are following at a slower pace than many of us would like, but at least contingency planning is a possibility. Curiosity about the subject is spreading quickly.
Planning for anything requires a framework. Planning for a successful disaster recovery has the following elements, not necessarily in this order:
- Business Impact Analysis
- Risk Analysis
- Critical Application/Function Definition
- Recovery Plan Development
- Recovery Plan Test
- Recovery Plan Maintenance
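The framework above can be tracked as a simple checklist, sketched here. The element names come from the list in the text; the status values and structure are an illustrative convention, not part of any standard.

```python
# A minimal checklist tracker for the recovery-planning framework listed
# in the text. Status values are an assumed convention for illustration.

recovery_plan = {
    "Business Impact Analysis": "complete",
    "Risk Analysis": "complete",
    "Critical Application/Function Definition": "in progress",
    "Recovery Plan Development": "not started",
    "Recovery Plan Test": "not started",
    "Recovery Plan Maintenance": "not started",
}

# Any element not marked complete still needs attention before the plan
# can be considered ready.
outstanding = [step for step, status in recovery_plan.items()
               if status != "complete"]
print(len(outstanding))
```

Because the elements are "not necessarily in this order," a flat checklist like this is enough; what matters is that none of the six is silently skipped.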
Your local telephone company has a plan for your telephone service in the event of an area wide disaster such as an earthquake or hurricane. For most companies that plan is to shut off your dial tone so you can’t make outgoing calls.
Only customers with a public health and safety responsibility such as hospitals, fire departments, police, mayors, city council members, etc. will be able to use the system. They have been identified ahead of time as “essential service,” “priority service” or some other such designation in case of a serious problem. Of interest to the general public is the fact that public telephones receive this designation as well. If you can’t get dial tone at home or the office, a nearby public telephone may be your best chance to get access to the telephone network.
Your other alternative for gaining access is the cellular system. Presently most of the cellular companies make no attempt to allocate service. The network is first-come first-served. Any contingency plan which doesn’t include cellular as one of several backup methods has missed a major opportunity to establish vital communications.
The system may be jammed immediately after a disaster, but your delay in accessing the telephone network is likely to be measured in minutes rather than days. Most contingency plans make the assumption that the landline telephone system will be out for three days.
But who can you call if most of the telephones don’t work in the area affected?
Getting back to the local telephone company’s plan: usually they shut off only outgoing calls. Incoming calls can still be received in many cases. Sometimes the switching centers may be overloaded; this will limit incoming calls, but again the condition is likely to last minutes rather than days. In other words, your family and business associates may not be able to call you if they can’t get dial tone on a regular telephone, but you may be able to call them if you have a cellular phone. You will also likely be able to call out of state on cellular no matter what the local situation.
This conscious policy of the local telephone company is much more likely to affect your ability to get telephone service than physical damage to telephone facilities. Physical disruptions are difficult to predict and consequently more complicated to prepare for. Known policy considerations, fortunately, can be anticipated in your planning process.
The telephone network in the United States reaches 97% of homes and offices. In terms of the number of places you can contact, it is by far the best vehicle for two-way communication, if you can gain access to the system.
Knowing the importance of including cellular in your contingency plan, how do you pick the right telephone?
“Handheld” cellular phones generally operate at a maximum of 0.6 Watts. This is fine in mature systems where the cell sites are close together, but it may not suffice in a disaster if some of the cell sites are down. The cellular network is self-correcting in that you are always talking on the cell site giving you the strongest signal. If the closest cell site is damaged, you will automatically be transferred to the next best antenna. Now, though, you may have to transmit the signal 14 miles rather than the two miles that was customary when all of the cells were working. The extra power of a 3 Watt “transportable” phone can make the difference in getting a call through or not.
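A rough inverse-square sketch shows why the extra watts matter when the nearest cell site is down. Free-space received power falls off roughly as transmit power over distance squared; real propagation is worse than this, so treat the numbers below strictly as relative comparisons, not link-budget predictions.

```python
# Relative-signal sketch under a simplified free-space (inverse-square)
# assumption. The power and distance figures come from the text; the
# model itself is an idealization for illustration only.

def relative_signal(watts, miles):
    return watts / miles ** 2

near_handheld = relative_signal(0.6, 2)        # 0.6 W handheld, working cell 2 mi away
far_handheld = relative_signal(0.6, 14)        # same handheld reaching a 14 mi site
far_transportable = relative_signal(3.0, 14)   # 3 W transportable, 14 mi site

# Moving from a 2-mile to a 14-mile site costs the handheld a factor of
# 49 in signal; the transportable's 3 W recovers a factor of five of that.
print(far_transportable / far_handheld)
```

Five times the signal does not fully close a 49-fold gap, which is consistent with the text's hedge: the extra power "can make the difference," not that it always will.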
Another limiting factor of cellular phones is how long the battery will last between charges. This is usually quoted in terms of standby time (when the phone is turned on to receive incoming calls but not engaged in transmitting a call) and talk time (when the phone is transmitting a call). Get the phones with the longest ratings in the event you are unable to charge them during the disaster period. You can also obtain extra batteries, but someone will have to make sure they are kept charged all the time. The best phones for disaster recovery applications can be kept charging for a year or more. Most cellular phones are not intended to withstand this, and you should check on this feature before selecting your phones.
Batteries are also important in selecting your cellular phone, in that most use NiCad power supplies. These are notorious for a declining ability to hold a charge over repeated rechargings. This “memory problem” is caused by not completely discharging the battery each time before it is recharged.
Sealed lead acid gel cell batteries are much less temperamental to repeated charges. They will probably provide almost the same amount of talk time and standby time after a significant amount of use as they did when new.
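Back-of-the-envelope battery budgeting for the standby/talk ratings discussed above can be sketched as follows. The capacity and current-draw figures are hypothetical, chosen only to show the arithmetic; check the ratings of the actual phones you buy.

```python
# Mixed standby/talk battery-life estimate. All electrical figures here
# are assumed example values, not specifications of any real phone.

def battery_hours(capacity_mah, standby_ma, talk_ma, talk_fraction):
    """Estimated hours of use mixing standby and talk time.

    talk_fraction is the share of each hour spent transmitting.
    """
    avg_draw = talk_ma * talk_fraction + standby_ma * (1 - talk_fraction)
    return capacity_mah / avg_draw

# A 1200 mAh pack drawing 30 mA on standby and 300 mA while talking,
# with calls taking up 10% of the time:
print(round(battery_hours(1200, 30, 300, 0.10), 1))

# The same pack in heavy use after a disaster, talking half the time:
print(round(battery_hours(1200, 30, 300, 0.50), 1))
```

The second figure is why spare batteries, or a phone that tolerates being kept on charge for long periods, belong in the plan: heavy post-disaster calling drains a pack several times faster than the quoted standby rating suggests.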
Most of the cellular geographic service areas in the United States have two companies which provide cellular service. You may want to obtain cellular phones which allow “dual number registration” and sign up for service on both systems. That way if one system is down or jammed you can switch over to the other system. Unfortunately, this is the kind of redundancy only a disaster recovery planner could love, and someone is bound to point out correctly that this will result in two telephone bills a month for each phone.
Another feature which might be beneficial is alphanumeric memory. Names as well as numbers are stored electronically. This eliminates the need for a separate telephone list which can be misplaced in an emergency. Simply scroll through the memory to “VP Security” or “Jones” and the phone number appears automatically. This feature requires periodic updating in the event of personnel or telephone number changes.
No matter which phone you select, make sure the people who might need to use it are trained on how it works. During a crisis is no time to learn that cellular telephones don’t provide a dial tone before you place a call. The use of the “send” and “end” buttons is equally critical.
Be sure to test the phones every two or three months to verify that they are in proper working order.
Cellular phones should be purchased before the disaster, of course. Not only will you have them when you need them, but you can publish the telephone numbers for the Disaster Recovery Team to receive incoming calls from the people they need to hear from.
With regard to financial issues, quantity discounts are sometimes available when purchasing 10 or more phones at a time. Some cellular telephone companies may have special rates for disaster recovery usage. Pac Tel Cellular in Los Angeles, for example, has a standard rate of $45 per month, an economy plan for $25, and an earthquake preparedness plan for $16.50. Some restrictions apply to the special rate.
Cellular phones constitute an initial investment as well as an ongoing expense, but organizations that have been through a disaster recovery can testify that this expenditure is small in comparison to the enormous benefit that they provide.
James F. Kainz is president of Disaster Phone Communications, a Hermosa Beach, CA-based producer of disaster recovery cellular phones. He is a consultant on the use of cellular phones for disaster recovery applications.
This article adapted from Vol. 6 #2.
There has been a great deal written about telecommunications disaster recovery planning lately. The interest level is high regarding what to do if the Central Office fails, if the long distance carrier goes down or if a fire occurs and you have to vacate the building and telecommunicate from an alternate location. All of these recent situations are important. But what do they have in common? None of them was related to a natural disaster.
Those human-made disasters were caused by oversights, mismanagement, power failures, equipment malfunctions and just plain old non-preparedness. Does the fact that they were “unnatural” make them any less important or less dangerous? Of course not. But they are different, and the telecommunications aspects of natural disaster recovery planning are different as well.
Natural disasters, by their very character, are beyond the control of humans. The best we can hope to do is react in a meaningful time frame, in an orderly manner and to mitigate and repair the damage. Communications, whether written, spoken or telecommunicated become extremely important to the mission when the disaster has natural causes.
People react with great stress during a natural disaster. The first reaction is usually: “This can’t be happening!” or “It’s probably just a false alarm.” If an emergency room physician were talking to an accident victim, this response would be termed “denial.” Other responses to a natural disaster could be anxiety, anger, fear, confusion or panic.
There is a greater need for structure, clear lines of authority, easily readable procedures and definitive alternate courses of action during a natural disaster due to the psychological impact on people.