DNS anycast considered harmful (was: .ORG problems this evening)

Todd Vierling tv at duh.org
Thu Sep 18 13:57:23 UTC 2003


On Thu, 18 Sep 2003, Stephen J. Wilcox wrote:

: 1. Only you were affected

I doubt this.  At least one person has noted seeing the same on this list,
and I bet many more would corroborate by looking for DNS temp failures for
MAIL FROM:<*@*.org> in mail logs from last night between about 10:00PM
(GMT-4) and 11:30PM (GMT-4).

: 2. Only you have both servers going to the same place

This is NOT MY FAULT.  This is a flaw in the basic design of UltraDNS's .ORG
delegation.

I do, in fact, understand the purpose behind anycasting.  It is not a
failsafe redundancy scheme; it is, rather, a (geographic, ideally) traffic
distribution scheme based on BGP best-path selection.

The problem with UltraDNS, the point which many on this list are missing,
is that at least some UltraDNS sites are advertising *all* anycast networks
simultaneously (see traceroutes below).  Yes, all == 2 at the moment, but
this argument holds for any value of "all".

It is therefore possible (and last night was the case) that the same route
was chosen at one site for all UltraDNS anycast networks.  This produces,
effectively, a single point of failure from the perspective of that site --
and it is NOT that site's fault that its path selection happened to choose
the same route for all .ORG servers.

So I try to look up domains in .ORG, and all of its "servers" fail because
they all route to a dead site.  This is acceptable how?  This is my site's
fault how?

The correct way to fix this is to have more than just two networks -- and to
guarantee that no single physical location advertises *all* networks
simultaneously.  With that scheme, every site is guaranteed that at least
one of the anycast networks goes to a geographically different location from
the rest.
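
The rule above can be checked mechanically.  What follows is only a minimal
sketch in Python, not a description of how UltraDNS actually deploys: the
site names (site_x, site_a, ...), the third network "tld3", and both layouts
are hypothetical, invented for illustration.  It enumerates every outcome
that best-path selection could produce for a client and asks whether any of
them sends all networks to one site.

```python
from itertools import product


def sites_advertising_all(advertised, networks):
    """Return the sites that announce every anycast network at once."""
    return sorted(s for s, nets in advertised.items() if networks <= nets)


def client_views(advertised, networks):
    """Yield every possible best-path outcome as {network: chosen site}.

    For each network, any site announcing it could be the one that a
    client's routers happen to select.
    """
    ordered = sorted(networks)
    choices = [
        sorted(s for s, nets in advertised.items() if n in nets)
        for n in ordered
    ]
    for combo in product(*choices):
        yield dict(zip(ordered, combo))


def shared_fate_site(view):
    """Return the site if every network in this view lands on it, else None."""
    sites = set(view.values())
    return next(iter(sites)) if len(sites) == 1 else None


def worst_case_exists(advertised, networks):
    """True if some client could see all networks routed to a single site."""
    return any(
        shared_fate_site(view) is not None
        for view in client_views(advertised, networks)
    )


# Hypothetical layout 1: two networks, each site announces both.
TWO_NETS = {"tld1", "tld2"}
today = {"site_x": {"tld1", "tld2"}, "site_y": {"tld1", "tld2"}}

# Hypothetical layout 2: three networks, no site announces all of them.
THREE_NETS = {"tld1", "tld2", "tld3"}
proposed = {
    "site_a": {"tld1", "tld2"},
    "site_b": {"tld2", "tld3"},
    "site_c": {"tld1", "tld3"},
}

if __name__ == "__main__":
    print(worst_case_exists(today, TWO_NETS))
    print(worst_case_exists(proposed, THREE_NETS))
```

In the second layout no site carries every network, so no combination of
path choices can put all of a client's NS addresses behind one location.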

: Theres a theme in this, perhaps indicating where the problem may have been :)

gTLD operators should attempt to provide a degree of failsafe redundancy
that guarantees no site will select the same server cluster for *all* NS
records serving the zone.  Last night, a site did select the same
destination for all NS addresses, and a failure happened at that site,
causing DNS lookups for at least part of the Internet to fail.

===

Sample traceroutes from today, showing that at least one of UltraDNS's
locations is advertising all of their tld*.ultradns.net anycast networks at
once.  If the site where the host "dellfweqch" is located goes dead to DNS, but
these networks continue to be available and selected by the host from which
I'm tracerouting, then DNS for .ORG at this site will be dead -- regardless
of how many other sites can see the zone.

traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
...
 5  so1-0-0-2488M.br2.CHI1.gblx.net (67.17.71.82)  1.85 ms (ttl=250!)
 6  p1-6-3-0.r01.chcgil01.us.bb.verio.net (129.250.9.117)  1.17 ms
 7  p16-2-0-0.r01.chcgil06.us.bb.verio.net (129.250.5.70)  1.43 ms (ttl=251!)
 8  ge-1-1.a00.chcgil07.us.ra.verio.net (129.250.25.167)  1.71 ms (ttl=253!)
 9  fa-2-1.a00.chcgil07.us.ce.verio.net (128.242.186.134)  1.34 ms (ttl=251!)
10  dellfweqch.ultradns.net (204.74.102.2)  2.01 ms (ttl=60!) !H

traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
...
 4  0.so-1-0-0.XL2.CHI13.ALTER.NET (152.63.69.182)  4.95 ms (ttl=251!)
 5  POS7-0.BR1.CHI13.ALTER.NET (152.63.73.22)  4.67 ms
 6  a11-0d114.IR1.Chicago2-IL.us.xo.net (206.111.2.73)  1.70 ms (ttl=251!)
 7  p5-0-0.RAR1.Chicago-IL.us.xo.net (65.106.6.133)  2.47 ms
 8  p4-0-0.MAR1.Chicago-IL.us.xo.net (65.106.6.142)  2.69 ms
 9  p0-0.CHR1.Chicago-IL.us.xo.net (207.88.84.10)  2.84 ms (ttl=248!)
10  *
11  dellfweqch.ultradns.net (204.74.102.2)  2.81 ms (ttl=60!) !H
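
The comparison made by eye above (both traces ending at the same host) can
be automated.  This is a rough sketch in Python that assumes traceroute
output in the format shown in this message; the regular expression and the
function name last_hop are illustrative, and the embedded samples are
abbreviated copies of the last lines of the two traces.

```python
import re

# A responding hop line looks like:
#   "10  dellfweqch.ultradns.net (204.74.102.2)  2.01 ms (ttl=60!) !H"
# Lines such as "10  *" and the "traceroute to ..." header do not match.
HOP = re.compile(
    r"^\s*\d+\s+(\S+)\s+\((\d{1,3}(?:\.\d{1,3}){3})\)",
    re.MULTILINE,
)


def last_hop(trace_text):
    """Return (hostname, address) of the last responding hop, or None."""
    hops = HOP.findall(trace_text)
    return hops[-1] if hops else None


TRACE_TLD1 = """\
traceroute to tld1.ultradns.net (204.74.112.1): 1-30 hops, 38 byte packets
 9  fa-2-1.a00.chcgil07.us.ce.verio.net (128.242.186.134)  1.34 ms (ttl=251!)
10  dellfweqch.ultradns.net (204.74.102.2)  2.01 ms (ttl=60!) !H
"""

TRACE_TLD2 = """\
traceroute to tld2.ultradns.net (204.74.113.1): 1-30 hops, 38 byte packets
 9  p0-0.CHR1.Chicago-IL.us.xo.net (207.88.84.10)  2.84 ms (ttl=248!)
10  *
11  dellfweqch.ultradns.net (204.74.102.2)  2.81 ms (ttl=60!) !H
"""

if __name__ == "__main__":
    a, b = last_hop(TRACE_TLD1), last_hop(TRACE_TLD2)
    print(a, b, "same terminus" if a == b else "different terminus")
```

A matching final hop for every NS address, as seen from one vantage point,
is the condition under which a single site failure takes out the whole zone
for that vantage point.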

-- 
-- Todd Vierling <tv at duh.org> <tv at pobox.com>


