Spanning tree meltdown?

Stephen Sprunk ssprunk at cisco.com
Fri Nov 29 19:21:06 UTC 2002


Thus spake "Eric Gauthier" <eric at roxanne.org>
>
> > Anyone have any idea what really happened :
> > http://www.boston.com/dailyglobe2/330/science/Got_paper_+.shtml

I can't speak to exactly what happened because of NDA, but I think I can
help NANOGers understand the environment and why this happens in general.

> I know someone who worked on it, but I've avoided asking what
> really happened so I don't freak out the day the ambulance drives
> me up to their emergency room :)  The other day, I did forward the article
> over to our medical school in the hopes that they might "check" their
> network for similar "issues" before something happens :)

I see a lot of Fortune 500 networks in my job, and I'd say at least 75% of
them are in the same state: a house of cards standing only because new cards
are added so slowly.  Any major event, whether a new bandwidth-hungry
application or a parity error in a router, can bring the whole thing down,
and there's no way to bring it back up again in its existing state.

No matter how many PowerPoint slides you send to the CIO, it's always a
complete shock when the company ends up in the proverbial handbasket and
you're looking at several days of downtime to do 4+ years of maintenance and
design changes.  And, what's worse, nobody learns the lesson and this
repeats every 2-5 years, with varying degrees of public visibility.

This is a bit of culture shock for most ISPs, because an ISP exists to serve
the network, and proper design is at least understood, if not always adhered
to.  In the corporate world, however, the network and support staff are an
expense to be minimized, and capital or headcount is almost never available
to fix things that are "working" today.

> I don't know which scares me more: that the hospital messed up
> spanning-tree so badly (which means they likely had it turned off) that
> it imploded their entire network.  Or that it took them 4 days to figure
> it out.

It didn't take 4 days to figure out what was wrong -- that's usually
apparent within an hour or so.  What takes 4 days is having to reconfigure
or replace every part of the network without any documentation or advance
planning.
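
To make the failure mode concrete, here's a toy Python sketch of flooding
in a made-up four-switch full mesh (my own illustration, not the hospital's
actual topology).  A transparent bridge floods broadcasts out every port
except the one a frame arrived on, so with the redundant links left
forwarding, every copy multiplies at each hop; with spanning tree blocking
those links, the same flood dies after one pass.

#!/usr/bin/env python3
"""Toy flood model: why a bridging loop melts an entire switched LAN."""
from collections import deque
from itertools import combinations

def flood(links, blocked, steps):
    """Return the number of in-flight broadcast copies per forwarding step."""
    forwarding = set(links) - set(blocked)
    neighbors = {}
    for a, b in forwarding:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # A host behind switch 0 sends a single broadcast frame.
    frames = deque([(0, None)])        # (current switch, switch it came from)
    counts = []
    for _ in range(steps):
        counts.append(len(frames))
        nxt = deque()
        for sw, came_from in frames:
            # Flood out every forwarding port except the arrival port.
            for peer in neighbors.get(sw, ()):
                if peer != came_from:
                    nxt.append((peer, sw))
        frames = nxt
    return counts

# Four access switches, fully meshed "for redundancy".
full_mesh = set(combinations(range(4), 2))

# A plausible spanning-tree outcome: switch 0 as root, only its three
# links forwarding, the other three links in blocking state.
tree = {(0, 1), (0, 2), (0, 3)}

print("with STP :", flood(full_mesh, full_mesh - tree, 8))
print("no STP   :", flood(full_mesh, set(), 8))

Run it and the STP case should show the flood dying after a couple of
steps, while the loop case roughly doubles the frame count every step.  On
real hardware that doubling happens at wire speed, which is why the whole
campus falls over in seconds once the loop forms.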

My nightmares aren't about having a customer crater like this -- that's an
expectation.  My nightmare is when it happens to the entire Fortune 100 on
the same weekend, because it's only pure luck that it doesn't.

S



