MS explains

Steven M. Bellovin smb at research.att.com
Thu Jan 25 14:30:19 UTC 2001


In message <Pine.LNX.4.30.0101242038380.5951-100000 at anime.net>, Dan Hollis writes:
>
>On Wed, 24 Jan 2001, Eric A. Hall wrote:
>> At 6:30 p.m. Tuesday (PST), a Microsoft technician made a configuration
>> change to the routers on the edge of Microsoft's Domain Name Server
>> network.
>> At approximately 5 p.m. Wednesday (PST), Microsoft removed the changes to
>> the router configuration and immediately saw a massive improvement in the
>> DNS network.
>
>So basically, it took microsoft 23 hours to fix a router configuration.

There's a story (possibly apocryphal) about the time that Steinmetz was 
called in as a consultant to repair problems with some massive piece of 
electrical machinery.  After poking around for a while, some staring, 
and much thinking, he adjusted one screw and solved the problem.  He 
then proceeded to write out a bill for $1000.

The company was outraged.  "$1000 for adjusting one screw?  You're 
crazy!"

Steinmetz agreed, took back the bill, and tore it up.  He then wrote 
out a new bill:

	Adjust one screw			$1
	Knowing which screw to adjust		$999

Remember the other half of Jim Duncan's post from last night:

	There were clearly some mistakes made, but it is 
	also the case that there were a _lot_ of different
	things going on that contributed to the problem or
	complicated its resolution.

He *worked* this problem; this is a first-hand statement, not 
conjecture by those who weren't there.

Let me put forth a blatant generalization of my own:  *all* major 
failures are due to complex causes.  The proof is simple:  if you're 
small and hence presumably clueless (the "mom and pop" ISPs another 
poster sneeringly referred to), your problems don't cause major failures
for the rest of the net.  If you're big (and hence presumably clueful), 
you solve the simple problems quickly and they don't become major 
failures.  Finding and fixing *the* root cause is hard when you're in 
the midst of a swamp full of other alligators, and you don't know which 
one is (currently) biting you in the rear.  

I'd love to see a detailed description of what went wrong, and I hope 
that those in the know will be allowed to post it or present it in 
Atlanta.  But I'm willing to wager that it wasn't just (a) a single 
router configuration change, (b) brain-damage in Microsoft's DNS code, 
(c) malicious activity aimed at Microsoft, (d) RAMEN-induced 
misbehavior, or (e) any other single cause.


		--Steve Bellovin, http://www.research.att.com/~smb





