Journal of Internet Disasters

Sat Nov 14 01:58:02 UTC 1998

On Fri, 13 Nov 1998, Michael Dillon wrote:

> >     - f.root-servers.net and NSI's servers reacted differently.  What
> > 	are the differences between them (BIND versions, in-house source
> > 	code changes, operating systems/run-time libraries/compilers)
> 
> Whatever was causing the Internic link to be congested could have
> disrupted NSI's server. Wasn't vixie's server acting properly by answering
> lame for the zones it could not retrieve? It seems like all the problems
> revolve around NSI's server and network. Vixie's problems were merely a
> symptom. On the other hand, I would classify the inability of AXFR to
> transfer the zone as a weakness in BIND that could be addressed.
> Additionally, since it is known that zone transfers require a certain
> amount of bandwidth, Vixie could improve his operations by implementing a
> system that monitors the bandwidth with pathshow prior to initiating AXFR.
> Also, he could monitor the progress of the AXFR and also alarm if it was
> taking too long. This would have allowed a fallback to ftp sooner and
> operationally, such a fallback might even be something that could be
> automated. Of course, none of this means Vixie was at fault and I'd argue
> that NSI is at fault for not being able to detect the problem
> sooner and not being able to swap in a backup server sooner. Vixie knows
> that he is one of 13 root nameservers. But NSI knows that they are the one
> and only master root nameserver which puts more responsibility on them.
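
For reference, the monitoring idea in the quoted text (watch the AXFR,
alarm if it takes too long, fall back to ftp) can be sketched in a few
lines of Python.  This is only an illustration under stated assumptions:
fetch_with_fallback, primary, fallback and alarm are hypothetical names,
the two callables stand in for whatever actually performs the AXFR and
the ftp fetch, and none of this reflects how any root server is really
operated.

```python
import concurrent.futures


def fetch_with_fallback(primary, fallback, deadline_s, alarm):
    """Run primary(); if it fails or exceeds deadline_s, alarm and use fallback().

    primary and fallback are caller-supplied callables (hypothetical stand-ins
    for an AXFR and an ftp fetch).  Returns (source, data).
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        # Wait for the primary transfer, but only up to the deadline.
        return "primary", future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        alarm("transfer exceeded %.2fs deadline" % deadline_s)
    except Exception as exc:
        alarm("transfer failed: %r" % (exc,))
    finally:
        # Do not block on the stalled worker; it is abandoned, not killed.
        pool.shutdown(wait=False)
    return "fallback", fallback()
```

Note that the timed-out worker is only abandoned here; a real setup would
also have to stop the stalled transfer itself.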

There have been no even remotely logical claims that f.root-servers.net
caused any problems at all.  If Paul's server had been working correctly
and had transferred the zone properly, the impact of NSI's screwups would
have been almost exactly the same.

What you are discussing is a problem, but not "the" problem and not a
problem that causes a significant impact over the short term.

It is important to keep that clear in messages; NSI has already spread
enough lies, so any confusion about the issue isn't wise.

Indeed, the fact that at least three of NSI's servers were giving false
NXDOMAINs isn't really the issue either, from nanog's perspective.  It
needs to be figured out and is a major problem in BIND, but it isn't
necessarily something they could or should have been able to prevent
before it happened: that is very difficult to judge from the outside,
and I can certainly imagine situations where, despite the best operations
anywhere, they could not predict such things.

The big issue that needs to be addressed is why the heck it took NSI over
two hours after they were notified to fix it, especially in the middle of
the day, and why they didn't have any automated system that detected it and
notified them in minutes.  Whatever the exact problem was is important and
needs to be addressed, but addressing each instance is pointless without
knowing why NSI's operations procedures are so flawed.  In fact, they are
so flawed that the VP of engineering either had no idea what was going on
or chose to lie.
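
To make "an automated system that detected it" concrete, here is a minimal
sketch of the kind of check I mean, in Python.  It assumes a caller-supplied
query(server, name) function that returns the response code as a string
such as "NOERROR" or "NXDOMAIN"; that function, find_bad_servers, and the
canary names are all hypothetical, and I am not claiming this is what NSI
or anyone else runs.

```python
def find_bad_servers(servers, canary_names, query):
    """Return {server: [(name, problem), ...]} for servers that misbehave.

    canary_names are names known to exist, so any answer other than
    NOERROR (for example a false NXDOMAIN) counts as a failure.
    query(server, name) is a hypothetical caller-supplied lookup function.
    """
    bad = {}
    for server in servers:
        failures = []
        for name in canary_names:
            try:
                rcode = query(server, name)
            except Exception as exc:
                # An unreachable server is as alarming as a wrong answer.
                failures.append((name, "ERROR: %s" % exc))
                continue
            if rcode != "NOERROR":
                failures.append((name, rcode))
        if failures:
            bad[server] = failures
    return bad
```

Run every minute or so against each server, with a page sent whenever the
result is non-empty, something like this would flag false NXDOMAINs in
minutes rather than hours.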

The problem is that NSI currently has no accountability (not even to their
customers), and doesn't even make a token effort to follow up on their
screwups.

The organization that controls the root nameservers should have one of the
best operations departments, not one of the worst.