History: lengthy outages

Clayton Fiske clay at bloomcounty.org
Thu Jan 25 08:39:41 UTC 2001


On Wed, Jan 24, 2001 at 11:45:08PM -0800, Sean Donelan wrote:
> 
> That's a bit unfair.
> 
> There have been a number of lengthy outages.
> 
> AS7007 router configuration problem: April 25 1997 lasted 2 hours
> AOL (ANS router configuration problem): Aug 7 1996 lasted 19 hours
> AT&T frame-relay switch errors: April 13-14, 1998 lasted 26 hours
> BBN standard power failure: October 11, 1996 lasted about 12 hours (off and on)
> NETCOM router configuration error: June 20 1996 lasted 13 hours
> Sprint database problems: September 3 1996 lasted 5 hours
> NSI root server corruption (operational error): July 16 1997 lasted 4 hours
> PacBell configuration problems: January 30-31 1997 lasted 48 hours
> UUNET frame-relay problems: July 1 1997 lasted over 24 hours
> UUNET cisco/bay router problems: November 7 1997 lasted 5 hours
> Worldcom frame-relay switch errors: August 1999 lasted 9-10 days

Your point isn't lost on me, but I think there are a couple of
distinctions to make here. I do freely admit, however, that I'm
not familiar with all of the outages you listed.

1. I think it might be prudent to weed out the 2-5 hour outages here.
While that's still an excessively long time to recover from a change that
should have been monitored and tested properly in the first place
(and still probably cause for firing in some shops), I can at least
conceive of a recovery taking that long. Too long, yes, but not
quite in the jaw-dropping category.

2. Several of the remaining group that I'm familiar with (namely the
AT&T and Netcom outages) involved problems which cascaded out to the
entire network, and therefore could not be simply undone by backing
out the change on that router/switch/etc. I've been through an outage
or two which didn't gain such notoriety but still took several hours
after identifying the problem just to go out and reboot enough boxes
to settle the network down. In the MS DNS case, I don't feel the same
point applies. By most accounts, it appears the issue was a change to
a single router affecting one or two subnets, and resolving it
was simply a matter of backing out said change.

Again, don't get me wrong. Your point is taken, and I've certainly
caused and felt my share of pain with (sometimes unnecessarily)
lengthy outages. I agree with Randy about cutting the folks involved
some slack, since we've all been and will be there at some point. I
do see it the other way as well. This amount of time to resolve a
local issue caused by a procedurally implemented change (I'll give
them the benefit of the doubt) on a critical network device is
surely due some scrutiny.

-c
