Netcom Outage (Was: My InfoWorld Column About NANOG)

Fri Jun 21 21:36:56 UTC 1996

On Fri, 21 Jun 1996, Peter Kaminski wrote:

> Can other big parts of the backbone fall down and take 13 (or more) hours
> to get back up?  Or is the rest of the net engineered more redundantly than
> Netcom?  Should I build two backbones, each with separate technologies?

Ask NASA how they do it. Three redundant systems using two separate
technologies. But then look at NASA's downside and compare it to yours. If
Netcom's customers hardly noticed this, maybe the dialup market doesn't
care. However, the leased line market is a whole other story. Those
customers have the technical expertise to understand your backbone
engineering and may well pay a higher fee to have that redundancy. This
question really tangles marketing and engineering concerns together.

> Was this a foreshock of the coming Metcalfean Big One, or just lousy
> procedures at one of the bigger ISPs?

The bigger they are, the harder they fall. Seems to me that as ISPs and
NSPs get larger, failures will be more spectacular. However, the big one
depends on the ability of failures to propagate from one ISP/NSP to
another, and I don't think this is very likely, partly due to the different
engineering styles and partly due to the diversity of technology deployed.
You have frame relay backbones, ATM fabrics, DS3 meshes with Cisco nodes
and DS3 meshes with Bay nodes.

Up until Netcom, the most spectacular failures I recall seeing over the
past two years were caused by either NAP congestion or backhoes. NAP
congestion is partially a management failure to deploy bigger pipes and
routers and to increase the number of NAPs in time to meet the growth in
traffic flow. But it is also self-correcting, as some customers migrate to
NSPs with less congestion and management injects capital into their
infrastructure. It seems to be a well understood problem.

But to me, backhoes are the most interesting failure mode. For one, I
don't think that backhoe problems can be eliminated, and I think that as
the physical mesh of fibre becomes more finely divided over the geography,
these incidents will increase. I also don't know of anyone taking
action to protect against these events by building geographic redundancy
into their backbones. This may be partly because NSPs often don't have
any idea where the fibres lie and partly because they want to use a
specific infrastructure like Sprint and its railway rights of way. The
incident in the Northeast where a backhoe cut a Wiltel(?) fibre bundle
that was carrying critical DS3s leased by all the NSPs in the region
points out how catastrophic this can be.
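
To make the shared-fate point concrete, here is a minimal sketch in
Python. It assumes you can actually get an inventory of which physical
conduits each leased circuit rides in, which is often not the case. The
circuit and conduit names and the shared_conduits() helper are made up
for illustration and do not describe any real network.

```python
from collections import defaultdict


def shared_conduits(circuits):
    """Return the conduits that carry more than one circuit.

    `circuits` maps a circuit name to the set of conduits it passes
    through. A single cut in any conduit returned here takes down
    every circuit listed against it.
    """
    by_conduit = defaultdict(set)
    for circuit, conduits in circuits.items():
        for conduit in conduits:
            by_conduit[conduit].add(circuit)
    return {
        conduit: sorted(names)
        for conduit, names in by_conduit.items()
        if len(names) > 1
    }


# Hypothetical inventory: circuit name -> conduits it passes through.
circuits = {
    "nsp-a-ds3": {"conduit-1", "conduit-2"},
    "nsp-b-ds3": {"conduit-2", "conduit-3"},
    "nsp-c-ds3": {"conduit-4"},
}

risks = shared_conduits(circuits)
print(risks)
```

Two circuits bought from different carriers can still show up against
the same conduit, which is the kind of hidden dependency the backhoe
incident exposed.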

Michael Dillon                                   ISP & Internet Consulting
Memra Software Inc.                                 Fax: +1-604-546-3049
http://www.memra.com                             E-mail: michael at memra.com