Journal of Internet Disasters
Paul Vixie
paul at vix.com
Fri Nov 13 17:03:34 UTC 1998
> This should go on the name-droppers list, but here goes....
these days it's not clear whether namedroppers is an operations list
or a protocol list or still both. i think nanog is a fine forum for this:
> What do we know about the events with the name servers
>
> - f.root-servers.net was not able to transfer a copy of some of
> the zone files from a.root-servers.net
> - f.root-servers.net became lame for some zones
just COM.
> - tcpdump showed odd AXFR from a.root-servers.net
just a lot of missed/retransmitted ACKs.
> - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
> some valid domains, NSI denies any problem
the nanog archives include some dig results that are hard for NSI to deny.
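
for anyone reading such dig output, the two symptoms above show up in the DNS header flags: a lame server answers without the AA (authoritative answer) bit, and NXDOMAIN is RCODE 3. a minimal sketch of decoding those bits from a raw response follows. the function names are made up for illustration, and treating "response without AA" as lame is a simplifying assumption (it only makes sense when the server is a listed NS for the zone you asked about).

```python
import struct

NXDOMAIN = 3  # RCODE 3, "Name Error" in RFC 1035


def summarize_dns_header(packet: bytes) -> dict:
    """Decode the flag bits of a raw DNS message header (RFC 1035, 4.1.1)."""
    if len(packet) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    # First two 16-bit big-endian words: message ID, then the flags word.
    _msg_id, flags = struct.unpack("!HH", packet[:4])
    return {
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "truncated": bool(flags & 0x0200),      # TC bit
        "rcode": flags & 0x000F,                # low 4 bits
    }


def looks_lame(packet: bytes) -> bool:
    """Assumption: a response lacking AA from a delegated server suggests lameness."""
    header = summarize_dns_header(packet)
    return header["is_response"] and not header["authoritative"]
```

an authoritative NXDOMAIN (AA set, RCODE 3) for a name that really exists is the second symptom; a non-authoritative answer from a listed server is the first.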
> Other events which may or may not have been related
> - BGP routing bug disrupted connectivity for some backbones in the
> preceding days
this turned up a performance problem in BIND's retry code, btw, but was
not otherwise related to the COM lossage of yesterday (as far as i know).
> - Last month the .GOV domain was missing on a.root-servers.net due
> to a 'known bug' affecting zone transfers from GOV-NIC
different bug. that one causes truncated zone transfers; the secondary
zone files on [fjk].gtld-servers.net yesterday were not truncated and it
just took a restart to make them stop behaving badly.
> - Someone has been probing DNS ports for an unknown reason
>
> Things I don't know
> - f.root-servers.net and NSI's servers reacted differently. What
> are the differences between them (BIND versions, in-house source
> code changes, operating systems/run-time libraries/compilers)
they are completely different systems (solaris vs. digital unix) running
the same (unmodified) bind 8.1.2 sources, which had completely different
failure modes for completely different reasons.
> - how long were servers unable to transfer the zone? The SOA says
> a zone is good for 7 days. Why did they expire/corrupt the old zone
> before getting a new copy?
damn good question. i'll look into that. shouldn't've happened.
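
for reference, the expected behaviour is simple: a secondary keeps answering from its last good copy until the SOA expire interval has elapsed since the last successful refresh. a minimal sketch of that rule, assuming the 7 day (604800 second) expire value quoted above; the function name and the timestamps in the usage note are hypothetical, and this is not BIND's actual code.

```python
from datetime import datetime, timedelta, timezone

SEVEN_DAYS = 7 * 24 * 60 * 60  # 604800 seconds, the expire value quoted above


def secondary_may_serve(last_good_refresh: datetime,
                        now: datetime,
                        expire_seconds: int = SEVEN_DAYS) -> bool:
    """True while the last good copy of the zone is still within SOA expire."""
    return now - last_good_refresh < timedelta(seconds=expire_seconds)


# Hypothetical timestamps: last good transfer two days before the check.
last_ok = datetime(1998, 11, 10, tzinfo=timezone.utc)
check_at = datetime(1998, 11, 12, tzinfo=timezone.utc)
still_good = secondary_may_serve(last_ok, check_at)
```

by this rule a couple of days of failed transfers should not make a secondary drop or corrupt the old copy, which is why the behaviour seen here looks like a bug.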
> - Routing between ISC and NSI for the period before the
> problem was discovered
there was asymmetry (they reached me via bbnplanet, i reached them via
alternet). they are now preferring alternet to reach me, so we have
better path symmetry now. but their first mile is still congested and
i am still retransmitting a lot of ACKs.
> Theories
> - Network connectivity was insufficient between NSI and ISC for long
> enough that the zones timed out (why were other servers affected?)
other servers are more conservative, and had switched to manual daily FTP
of the COM zone earlier than F did. (with manual daily FTP you
get the advantages of gzip, and of the pretense of "zone master" status
while you manually retry after timeouts. AXFR needs those properties.)
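
the two properties named above (compression, and retrying after a timeout while still serving the old copy) can be sketched as a small loop. this is only an illustration of the idea, not anyone's actual procedure: `fetch` is a hypothetical callable standing in for the FTP download, and the attempt count and delay are arbitrary.

```python
import gzip
import time


def fetch_zone_with_retries(fetch, attempts=3, delay_seconds=60.0, sleep=time.sleep):
    """Call fetch() until it yields a gzip'd zone file, then return it decompressed.

    fetch is a hypothetical callable returning the compressed zone as bytes.
    Timeouts, other I/O errors and corrupt gzip data all raise OSError
    subclasses, so each of them triggers another attempt.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return gzip.decompress(fetch())
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # the old zone copy keeps being served meanwhile
    raise RuntimeError(f"zone fetch failed after {attempts} attempts") from last_error
```

the caller would only replace the running zone file once this returns, so a failed transfer leaves the previous copy untouched.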
> - Bug in BIND (or an in-house modified version) (why did vixie's and
> NSI's servers return different responses?)
there's definitely a bug in BIND if [fjk].gtld-servers.net were able to
return different answers after restarts with no new zone transfers. (i'm
sitting here wishing i had core dumps.)
> - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
> - Operator error (erroneous reports of failure)
> - Other malicious activity?
i think there were a goodly number of procedural errors.
--
Paul Vixie <paul at vix.com>