In the midst of a summer heat wave, Diane Wier tried to withdraw cash to pay for airline tickets to attend a friend's wedding. The cash never emerged; only an annoying message, "Temporarily Out of Service," appeared on the screen. From his cottage in Haliburton, Ross Baker couldn't reach his colleagues in downtown Toronto to get an update on a medical research project and arrange weekend meetings. All he got was a fast busy signal. And Damon Heart was frustrated in his efforts to reach his broker to act on a 'hot' tip.

Clearly something was amiss in the Canadian telecommunications system, one of the most advanced in the world. From Vancouver to Halifax, and as far south as Chicago, selected telephone and data communications lines were out and over a million people were affected. Canadians were in the midst of one of the worst telecommunications failures in recent history. And this was only the first of a triad of failures to plague the telephone system during a single week in July.

What went wrong? And why did the redundancies built into these critical systems not work? We know some of the answers, but not the whole story. Here's a brief synopsis.


Event #1 - Blaze in Bell Central Office Puts City on Hold

An electrical fire on the 4th floor of a Bell Canada Central Office (CO) in downtown Toronto caused telephone and data transmission lines to go dead. Much of the financial district was affected, as were select ATMs and data lines stretching across the country. Back-up plans were not sufficient to avoid system failure.


Event #2 - 911 Service Fails for 12 Hours

Two days later, in the Region of Peel just west of Toronto, 911 service was down from 1:20 am until 2:00 pm, affecting over 1,000,000 people. Hospitals in the region complained that no one notified them of the problem so they could activate alternative plans; they heard about it on the radio. An outage of this length in an emergency telephone system had not previously occurred in Ontario. An added glitch was that, after a call to 911, the system locks onto the caller's line and does not disengage until deactivated by a service representative. At the time people needed help most, their phones were useless. What caused the outage is still murky.

 
Event #3 - PrimeLine Crashes During Y2K Changeover

Four days after the 911 event, Bell's 'PrimeLine' service in southern Ontario failed, denying access to 9,000 customers (over half of those using the service). Bell was in the process of transferring customers to a new Y2K-ready system when the new system crashed. Details are still lacking as to why this happened and when the service will be ready for the Year 2000.

With all the conflicting survivalist hype and 'good news' readiness reports contributing to the media noise in the run-up to the Year 2000, three telecommunications failures in one week leave one feeling a little edgy, especially if continuity planning is your responsibility! So let's explore the Bell CO failure a little further. It is the event that got the most press and had the widest implications.


What happened?

Early on July 16, 1999, an electrical contractor was performing maintenance at the Simcoe Street Central Office when a tool fell on an electrical panel. The resulting short was explosive, starting a fire that spread quickly and caused the sprinklers to activate. As planned, the back-up bank of batteries immediately took over, supplying the required electricity for the main telephone switch. But with a fire raging in the room housing the back-up diesel generator and water all over the floor, activating the generator to keep the batteries charged was too risky. Only after more than 70 firefighters had controlled the fire and mopped up could Bell staff bring in a mobile generator and hook it up. By this time it was late Friday afternoon and most affected businesses had closed.


What was the impact of the Bell CO incident?

As infrastructure elements go, telecommunications is just not as "neatly" arranged as power or water. The lines that snaked into this CO came not just from adjacent buildings, but from many pockets in the city and from data services across the country. Even Bell Canada's own web site was affected, as were Internet Service Providers (ISPs), numerous ATMs, and debit and credit card authorization systems. For some, the loss of telephones was an inconvenience and they got a slightly longer summer weekend. Others attempting to close deals, send critical information or contact loved ones in hospitals were anxious, and in some cases panicky. So, exactly what didn't work:

Approximately 113,000 telephone lines were down in the city core and other areas.

The 911 service was operating at a reduced capacity.

Cell phone service was irregular and undependable: calls routed through any of the affected lines did not go through, and some operating nodes became overloaded.

Credit and debit card authorizations were brought down in various locations across the country from British Columbia to Nova Scotia.

The Toronto Stock Exchange continued operating, though volume and trading value were off significantly (down 39 percent and 38 percent from the previous day) as brokerages experienced problems and customers failed to close sales.

Coordination of the city's fire and ambulance services had to be run out of a fire hall radio room when the primary call centers failed.

Most of Toronto's largest hospitals lost telephone service. Of greatest concern was the Hospital for Sick Children's poison information line that normally handles over 400 calls a day.

About 10 percent of Canada's automated bank machine service was knocked out. Toronto-Dominion was the hardest hit, with a third of its 2,500 machines useless for part of the day, causing customers to seek out old-fashioned human tellers to get cash or pay bills.

Communications to 570 traffic lights in the city center failed, causing them to malfunction.

At the Art Gallery of Ontario, the security system crashed, leaving priceless works of art protected only by extra security staff.

Restaurants lost hundreds of dollars in revenue due to missed lunch orders and reservations. This type of loss will never be "made up".

All of these problems translated into unknown small and large losses, totaling millions, and all due to an errant spanner. One travel agent estimated he had lost $30,000 in sales. On the 'up-side', couriers did a booming business, with rates jumping 500 percent due to demand and people trying to nab bicycle couriers on the street.

Bell Canada felt the series of events leading to this failure was "one in a million". Today, however, with the complexity of the telecommunications systems we all depend on and the millions of wires and interdependencies required for everything to work, "one in a million" may actually be quite a high risk. Events #2 and #3 mentioned above reinforce this line of thinking.

As was to be expected, reporters were touting this as a "taste of Y2K". Yet this was not a special incident or a technological glitch in a computer system. It was a fire, albeit in an unfortunate location, that created a chain reaction of events that touched the lives of millions. Is this unusual? No. Should we, as continuity practitioners, be surprised? No. What we have here is evidence of the vulnerability of the infrastructure we take for granted: the systems we depend on daily for our livelihood still have single points of failure. Events like the Bell CO fire point this out. One insurance company had no idea all its communication lines went through one CO and is now actively looking at creating a split system.
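For practitioners who want to check their own exposure, the short sketch below shows one way such a single-CO dependency might be flagged. The circuit names, office names, and data layout are hypothetical illustrations, not anything from the incident; a real audit would draw circuit routings from carrier records or an asset inventory.

    from collections import defaultdict

    # Hypothetical circuit inventory; a real audit would pull these routings
    # from carrier records or an asset database.
    circuits = [
        {"circuit": "voice-trunk-1", "central_office": "Simcoe St"},
        {"circuit": "voice-trunk-2", "central_office": "Simcoe St"},
        {"circuit": "data-wan-1",    "central_office": "Simcoe St"},
        {"circuit": "data-wan-2",    "central_office": "Adelaide St"},
    ]

    # Collect the central offices each service category depends on.
    offices_by_service = defaultdict(set)
    for c in circuits:
        service = c["circuit"].split("-")[0]   # e.g. "voice" or "data"
        offices_by_service[service].add(c["central_office"])

    # Any service whose circuits all route through one CO is a single point of failure.
    for service, offices in sorted(offices_by_service.items()):
        if len(offices) == 1:
            print(f"WARNING: all '{service}' circuits route through {next(iter(offices))}")
        else:
            print(f"OK: '{service}' circuits are split across {len(offices)} central offices")

Run against a real inventory, a check like this is exactly what would have told the insurance company above that every one of its lines converged on a single building.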

There are two lessons from these events. First, we will survive such events and carry on despite the losses and inconvenience. Those with plans, however, will fare better, and do so with less anxiety. Second, we are reminded of the "1 percent rule of technology": only 1 percent of a system has to be affected to bring it all down. In the technological world we have created, everyone expects, indeed often demands, 100 percent availability. As a result, the question we all face is: "Can we afford the cost to deliver on this expectation?" Often we can't, and that will keep the continuity business alive well into the 21st century.
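A rough back-of-the-envelope calculation (the figures are illustrative, not from the article) shows why that expectation is so expensive: when a service depends on many components in series, any one of which can take it down, overall availability is the product of the individual availabilities, and even very reliable parts add up to real downtime.

    # Illustrative arithmetic only: availability of a chain of components in series,
    # assuming each component is independently up 99.99% of the time.
    component_availability = 0.9999

    for n_components in (10, 100, 1000):
        overall = component_availability ** n_components
        downtime_hours = (1 - overall) * 365 * 24
        print(f"{n_components:>4} components in series: "
              f"{overall:.2%} available, roughly {downtime_hours:.0f} hours of downtime a year")

With 100 such components the service is down for something like 87 hours a year; with 1,000, it is down for weeks. Pushing those numbers toward true 100 percent availability is where the cost question bites.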



John Newton, Ph.D., P.Eng. is the Principal of John Newton Associates, a business continuity consulting and research firm. John is a current member of the DRJ Editorial Advisory Board. He can be reached for comment at 416.929.3621.