QWest is having some pretty nice DNS issues right now

Steve Gibbard scg at gibbard.org
Sat Jan 7 02:54:52 UTC 2006


On Fri, 6 Jan 2006, william(at)elan.net wrote:

> On Fri, 6 Jan 2006, Wil Schultz wrote:
>
>> Apparently they have lost two authoritative servers. ETA is unknown.
>
> You forgot to mention that they only have two authoritative servers for
> most of their domains...

I didn't look at this while it was happening, and haven't talked to 
anybody else about it, so I don't know if this was a systems or routing 
issue.  But, in the spirit of trying to learn lessons from incomplete 
information...

Qwest.net and Qwest.com list two authoritative name servers,
dca-ans-01.inet.qwest.net and svl-ans-01.inet.qwest.net.  As the
names imply, traceroutes to these two servers appear to go to somewhere in
the DC area and somewhere near Sunnyvale, California.  They appear to be
just two servers, or two single-location load-balanced clusters, rather
than an anycast cloud with two addresses.  It may be that two simultaneous
server failures would take out the whole thing, or there may be load
balancing behind each address that isn't visible from outside.  Even if
it's two individual servers, that's the standard n+1 redundancy that's
generally considered sufficient for most things.

There is a fair amount of geographic diversity between the two sites, 
which is a good thing.

The two servers have the IP addresses 205.171.9.242 and 205.171.14.195. 
These both appear in global BGP tables as part of 205.168.0.0/14, so any 
outage affecting that single route (flapping, getting withdrawn, getting 
announced from somewhere without working connectivity to the two name 
servers, etc.) would take out both of them.
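
For anybody who wants to check the prefix arithmetic, here is a minimal
sketch using Python's standard ipaddress module.  The addresses and the /14
are the ones quoted above; the variable names are just mine, and this only
checks containment, not what is actually being announced today.

```python
import ipaddress

# Covering prefix seen in global BGP tables, as quoted above.
covering_route = ipaddress.ip_network("205.168.0.0/14")

# The two authoritative name server addresses, as quoted above.
name_servers = [
    ipaddress.ip_address("205.171.9.242"),
    ipaddress.ip_address("205.171.14.195"),
]

# Record, per address, whether it falls inside the single covering route.
covered = {str(addr): addr in covering_route for addr in name_servers}

for addr, inside in covered.items():
    print(f"{addr}: {'inside' if inside else 'outside'} {covering_route}")

# A /14 starting at 205.168.0.0 spans 205.168.0.0 through 205.171.255.255.
print("range:", covering_route.network_address, "-", covering_route.broadcast_address)
```

Both addresses land inside the same /14, which is the shared fate described
above.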

So from my uninformed vantage point, it looks like they started doing this 
more or less right -- two servers or clusters of servers in two different 
facilities, a few thousand miles apart on different power grids and not 
subject to the same natural disasters.  In other words, they did the hard 
part.  What they didn't do is put them behind separately announced BGP
routes, which for a network with as much IP space as Qwest has would seem
fairly easy.
While it's tempting to make fun of Qwest here, variations on this theme -- 
working hard on one area of design while ignoring another that's also 
critical -- are really common.  It's something we all need to be careful 
of.
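
One rough way to audit for this kind of shared fate is to group name server
addresses by the most specific route that covers them.  The sketch below
does a simple longest-prefix match in Python.  The function name
group_by_route is my own, and the second routing view is purely
hypothetical: 205.171.14.0/24 is invented for illustration and is not a
route I know Qwest to announce.

```python
import ipaddress
from collections import defaultdict

def group_by_route(addresses, routes):
    """Group addresses by their longest matching prefix in `routes`."""
    networks = [ipaddress.ip_network(r) for r in routes]
    groups = defaultdict(list)
    for text in addresses:
        addr = ipaddress.ip_address(text)
        matches = [n for n in networks if addr in n]
        # Longest-prefix match: the most specific covering route wins.
        best = max(matches, key=lambda n: n.prefixlen) if matches else None
        groups[str(best) if best else "unrouted"].append(text)
    return dict(groups)

servers = ["205.171.9.242", "205.171.14.195"]

# Routing view described above: one covering announcement.
observed = group_by_route(servers, ["205.168.0.0/14"])

# HYPOTHETICAL view: an extra, more specific announcement (invented prefix).
split = group_by_route(servers, ["205.168.0.0/14", "205.171.14.0/24"])

print(observed)
print(split)
```

If every name server ends up in one group, a problem with that one route
takes them all out.  More than one group means a problem with a single
announcement at least no longer hits every server the same way.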

Or, not having seen what happened here, the problem could have been 
something completely different, perhaps even having nothing to do with 
routing or network topology.  In that case, my general point would remain 
the same, but this would be a bad example to use.

-Steve


