A single point of failure
One of history's largest and longest network crashes took place despite several layers of redundancy. How? Kevin Koski is VP/Technology at Data Base, Inc., a national off-site data security and disaster recovery firm whose business is keeping companies from losing data to catastrophic events. He said redundant systems do not protect against on-line vulnerabilities because any on-line fault can travel anywhere the network goes. "A self-healing or fault-tolerant network does well protecting itself from events like line cuts or failed hardware," Koski said. "There are still completed circuits all around the broken point, so the network still operates. However, redundant networks are not designed to protect against on-line threats like viruses or internal hacks, since something on-line, anywhere in the network, is experienced by the entire network.
"In AT&T's case, the network turned software failure in two nodes into software failure in all of them. So the network was vulnerable to an outage as a result of a single event: the failed software."
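Koski's distinction between a physical fault and an on-line fault can be illustrated with a toy model. The sketch below is not AT&T's topology or failure mechanism: the eight-node ring, the function names, and the rule that a software fault reaches every node the network can reach are all assumptions made for illustration.

```python
from collections import deque


def ring(n):
    """Build a ring of n nodes; each node links to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def reachable(links, start):
    """Return the set of nodes reachable from start over the given links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def cut_link(links, a, b):
    """Return a copy of the topology with the a-b link removed."""
    copy = {node: set(peers) for node, peers in links.items()}
    copy[a].discard(b)
    copy[b].discard(a)
    return copy


network = ring(8)  # hypothetical 8-node self-healing ring

# Physical fault: one line is cut. Traffic routes the other way around
# the ring, so every node can still reach every other node.
after_cut = cut_link(network, 0, 1)
nodes_still_connected = len(reachable(after_cut, 0))

# On-line fault (assumed propagation rule): bad software starts in two
# nodes and spreads to every node those two can reach. The redundant
# paths that survive a line cut are the same paths the fault travels.
failed = reachable(network, 0) | reachable(network, 4)

print(f"connected after line cut: {nodes_still_connected} of {len(network)}")
print(f"failed after software fault: {len(failed)} of {len(network)}")
```

In this model the ring survives the line cut intact, while the software fault that begins in two nodes ends up in all eight. The connectivity that provides the redundancy is what carries the fault.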
Koski said it's likely that many AT&T business customers discounted or dismissed the possibility of a network outage. Therefore, they took no steps to protect themselves against such a loss. Other AT&T business customers had untested backup plans, used for the first time in the fires of crisis. Many companies, he said, simply did not envision a disaster scenario in their planning that included the loss of the entire AT&T network.
One adequately prepared company was MasterCard, which uses AT&T's frame relay network to process $1.64 billion in credit card transactions each day.
TechWeb News (www.techweb.com) quoted a MasterCard spokesman as saying, "This isn't exactly the way we wanted to find out that our backup systems are excellent. Investing lots of money in multiple backup systems was exactly the right thing to do."
Chris Luise, chief technical officer of Skandia AFS, a worldwide insurance and financial services company, was quoted in Internet Week as saying many companies suffered because they felt having a backup provider was unnecessary. He pointed to the proliferation of sophisticated protection systems, such as self-healing backbones and SONET (Synchronous Optical Network) rings.
"Ten years ago, everybody had a backup strategy and at least two carriers. It's just not a strategy that people subscribe to anymore," Luise said. "It's put all your eggs in one basket and get the lowest price you can."
Lessons for the DR industry
Koski concurs. He said AT&T itself did not consider that an often-overlooked single point of failure, on-line threats, could bring down its network.
"Recognizing disaster scenarios that can disable entire systems is essential in the disaster recovery planning process," he said. "I've seen companies place their entire set of off-site, disaster recovery backups on-line, in order to automate, save money and expedite recovery. They fail to recognize that doing so exposes themselves to on-line threats that could leave them no other means of recovery following a disaster."
Koski said there are key steps companies can take to prevent downtime from events like the AT&T outage. For example, companies vulnerable to such an outage could engage two network providers. Another approach is to have tested backup plans that assume the loss of the entire network.
"We find that a lot of companies don't plan for a true 'worst case disaster,'" Koski said. "Odds are, that worst-case event may never happen. But if it does, and you're not ready, you and your company could be history."
Steve Birge is a freelance journalist in the Seattle area.